\section{Generating unique IRIs for OBDA specification map fields}
\label{iris}
As explained in Section~\fullref{back_basic}, IRIs play a central role in
various topics related to ontology-based data access.
They provide the means to uniquely identify entities, which is
a necessity for data retrieval.
As also explained in Section~\ref{back_basic}, every URI is also an IRI.
Skjæveland et al.\ use the term ``URI'' in the introduction of
their approach of using OBDA specifications for ontology bootstrapping,
and that term is also used in Section~\ref{bootstrap_spec}, which describes
this approach. This section nevertheless uses the general term ``IRI'' to
indicate that the introduced concepts are valid for all types of IRIs.

When dealing with ontology bootstrapping using OBDA specifications, it is
important to differentiate between the three types of IRIs that occur
in this context. They are distinguished here by the following names:
\begin{itemize}
  \item \emph{Data IRIs} identify entities in the bootstrapped ontology
  \item \emph{OBDA IRIs} are used as values for the fields of
  OBDA specification entities
  \item \emph{OSL IRIs} identify components in serialized OBDA specifications,
  using the \oslboth{} introduced in Chapter~\ref{osl} for serialization
\end{itemize}

Skjæveland et al.\ do not define or assume a particular scheme for IRI generation
in their introduction of OBDA specifications \cite{eng}.
Instead, the IRI generation strategy is only hinted at by giving examples
of entities having IRIs.
The exemplified scheme was used for the implementation of the \myprog{} software,
which bootstraps OBDA specifications from relational database schemata and is
described in this thesis
(see Section~\fullref{program} and Section~\fullref{impl}).
The direct mapping approach for ontology bootstrapping described in
Section~\ref{dirm}, on the other hand, introduces a scheme for
IRI generation \cite{dirm}, but with this scheme, name clashes can occur,
as explained in Section~\ref{dirm_iris}.
The \oslboth{}, finally, defines a proper scheme for OSL IRIs,
as is explained in Section~TODO.

In the following, an enhanced scheme for the generation of OBDA IRIs
is proposed, which resembles the previously mentioned scheme used for
OSL IRIs and which may also serve as a blueprint for
other IRI generation strategies.

\subsection{Requirements for the IRI scheme}
\label{iris_req}
As explained in Section~\fullref{back_basic}, the main requirement on an IRI
generation scheme is uniqueness of the IRIs: no two entities may ever be
assigned the same IRI, regardless of their kind, of how low the probability
of a name clash (IRI collision) is, or of the conditions leading
to a name clash.
Additionally, IRI uniqueness shall be independent of the base IRIs:
a base IRI shall be arbitrarily selectable for each generation process
without introducing name clashes, even with IRIs having other base IRIs.

As to OBDA specification entities, the following kinds of IRIs
have to be available, including IRI patterns:
\begin{itemize}
  \item Entity map OWL class IRIs
  \item Identifier map IRI patterns
  \item Attribute map OWL property IRIs
  \item Attribute map IRI patterns
  \item Relation map OWL property IRIs
  \item Subtype map (IRI) prefixes
  \item Subtype map (IRI) suffixes
  \item Subtype map OWL superclass IRIs
\end{itemize}

As Subtype map OWL superclass IRIs are IRIs of data entities
already existing in the target ontology by some means
(see Section~\fullref{bootstrap_spec_using}), they do not have to be
generated and thus are ignored in the following.
Exactly the same holds for Attribute map IRI patterns.
Furthermore, this approach creates Subtype map IRI prefixes that already lead to
unique IRIs for Subtype map subclasses, so Subtype map IRI suffixes are
ignored in the following.
Since an IRI generation scheme cannot avoid collisions with existing IRIs
outside its control, and since these collisions can easily be prevented,
for example, by giving them another base IRI (see Section~\fullref{back_basic}),
this case is excluded from the requirement that no two IRIs must collide
under any circumstances.
However, the user shall be able to choose such externally generated IRIs
from an infinite set of IRIs, while being sure that no name clashes
will occur.

In summary, the requirements on the IRI generation scheme are that
Entity map OWL class IRIs, Identifier map IRI patterns,
Attribute map OWL property IRIs, Relation map OWL property IRIs and
Subtype map IRI prefixes can be generated that, regardless of the
chosen base IRIs, do not clash with one another, while leaving an
infinite set of predictable IRIs that do not clash with any of the
generated IRIs.

\subsection{Avoiding name clashes in the IRI scheme}
\label{iris_clashes}
Generating unique Entity map OWL class IRIs ignoring base IRIs is not much of
a problem, assuming database table names are distinct, which is guaranteed in
common SQL database systems \cite{sql}.
Including the table name in an Entity map OWL class IRI is sufficient to
prevent it from colliding with other IRIs with the same base IRI.
However, when two IRIs created according to this scheme use
two different base IRIs, things get more complicated.\\
Consider, for example, a database table named ``\code{Persons}'' and a
table named \\ ``\code{Persons\_\_TABLE\_\_Persons}''.
When an IRI is generated according to the scheme
``<base:>\code{TABLE\_\_} <table name>'' for each of these tables, using the
base IRI ``\code{TABLE\_\_Persons\_\_}'' for the first one and the
empty base IRI for the second one, both tables get the IRI\\
``\code{TABLE\_\_Persons\_\_TABLE\_\_Persons}'', although
the table name was included in the IRI in both cases.
The problem is that the ``\code{TABLE\_\_}'' string occurring in the
table name cannot be distinguished from the ``\code{TABLE\_\_}'' string
added in the course of IRI generation or the ``\code{TABLE\_\_}'' string
occurring in the base IRI.
To solve the problem, a marker has to be included in the IRI which definitely
indicates the beginning of the table name. In addition, this marker will
uniquely identify Entity map OWL class IRIs.
For both aims to be achieved, an escape symbol must be used, which makes
the marker unique at least outside the base IRI part, by escaping the marker
whenever it occurs in the table name.
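The clash described above can be reproduced with a few lines of Python.
This is only an illustrative sketch: the function name
\code{naive\_class\_iri} is hypothetical and not part of \myprog{}, and it
implements the unescaped scheme discussed in this paragraph, not an
escaped one.
\begin{verbatim}
def naive_class_iri(base: str, table: str) -> str:
    # Unescaped scheme: <base:> + "TABLE__" + <table name>.
    return base + "TABLE__" + table

# Base IRI "TABLE__Persons__" with table "Persons" ...
first = naive_class_iri("TABLE__Persons__", "Persons")
# ... and the empty base IRI with table "Persons__TABLE__Persons".
second = naive_class_iri("", "Persons__TABLE__Persons")

print(first)            # TABLE__Persons__TABLE__Persons
print(first == second)  # True: two distinct tables share one IRI
\end{verbatim}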
%

Regarding Identifier map IRI patterns, the IRI resulting from the expansion of
the pattern will contain the column names of the primary key represented by
the respective Identifier map \cite{eng}.
Furthermore, the table name of the database table containing that primary
key has to be included in the IRI pattern, since
two distinct tables may have primary keys with equally named columns.
This will make the IRI pattern a unique Identifier map IRI pattern,
since a database table can be assumed to have only one primary key,
as is the case in common SQL database systems \cite{sql}.
The fact that primary key values are unique for each data record ensures that
unique Identifier map IRI patterns expand to unique IRIs.
Moreover, it has to be ensured that IRIs resulting from the expansion of
Identifier map IRI patterns do not collide with IRIs of other kinds.\\
Taking arbitrary and particularly varying base IRIs into account,
a definite marker has to be included in the IRI pattern and other occurrences
of this marker in the IRI pattern have to be escaped. This uniquely
identifies Identifier map IRI patterns and unambiguously
distinguishes the table name from the rest of the IRI.

Concerning Attribute map OWL property IRIs, they will be unique among their
kind when they include the column name of the database column they
represent in addition to the table name of the table containing it,
since database table names can be assumed to be distinct and
column names can be assumed to be unique within a table, which is
guaranteed in common SQL database systems \cite{sql}.
Furthermore, Attribute map OWL property IRIs have to be prevented from
colliding with IRIs of other kinds.\\
Taking arbitrary and particularly varying base IRIs into account,
definite markers have to be included in the IRI and other occurrences
of these markers in the IRI have to be escaped. This uniquely
identifies Attribute map OWL property IRIs and unambiguously
distinguishes the table name and the column name from the rest of the IRI
and from one another.

Regarding Relation map OWL property IRIs, including the table name and
the column names of both the foreign key represented by the Relation map
(or the containing table, respectively) and the referenced key (or its
containing table, respectively) in the IRI will make it a unique
Relation map OWL property IRI.
Note that including only the table name and the column names of the foreign
key (or its containing table, respectively) would not be sufficient, since
several distinct foreign keys covering exactly the same columns can exist
in a table (this is what the IRI generation scheme of the direct mapping
approach misses). The same applies, of course, to the referenced table and
its columns: several foreign keys can reference them.
Moreover, Relation map OWL property IRIs have to be
prevented from colliding with IRIs of other kinds.\\
Taking arbitrary and particularly varying base IRIs into account,
definite markers have to be included in the IRI and other occurrences
of these markers in the IRI have to be escaped. This uniquely
identifies Relation map OWL property IRIs and in particular their parts
providing the table and column names.
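The need to include the referenced key can be illustrated with a small
Python sketch. The table and column names (\code{orders},
\code{customers}, \code{employees}) are hypothetical, and the two helper
functions merely join names with `\code{/}'; markers and escaping are
deliberately left out to isolate this one point.
\begin{verbatim}
# Each foreign key: (source table, source columns,
#                    target table, target columns).
fk_a = ("orders", ("person_id",), "customers", ("id",))
fk_b = ("orders", ("person_id",), "employees", ("id",))

def source_only(fk):
    # Uses only the table and columns of the foreign key itself.
    src_table, src_cols, _, _ = fk
    return "/".join((src_table, *src_cols))

def source_and_target(fk):
    # Additionally uses the referenced table and columns.
    src_table, src_cols, tgt_table, tgt_cols = fk
    return "/".join((src_table, *src_cols, tgt_table, *tgt_cols))

print(source_only(fk_a) == source_only(fk_b))              # True
print(source_and_target(fk_a) == source_and_target(fk_b))  # False
\end{verbatim}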
%

Concerning Subtype map IRI prefixes, they must include the column name of
the database column containing the values to be declared as belonging to
the subclass. Furthermore, since another database table could contain a
column of the same name, the IRI prefix must include the table name of the
database table containing the column.
This will make Subtype map IRI prefixes unique among their kind.
Note that a Subtype map IRI prefix, similarly to an IRI pattern,
does not specify the final IRI but is subject to expansion.
This expansion can yield the same IRI for different data records, which,
however, is not considered a collision, since this behavior is
intentional: every two data records having the same value in
the respective column, and only those, will get the same IRI.
Additionally, it has to be ensured that IRIs resulting from such an
expansion do not collide with IRIs of other kinds.\\
Taking arbitrary and particularly varying base IRIs into account,
definite markers have to be included in the IRI prefix and other occurrences
of these markers in the IRI prefix have to be escaped. This uniquely
identifies Subtype map IRI prefixes and unambiguously
distinguishes the table name and the column name from the rest of the IRI
and from one another.

For an example in which carelessly or maliciously chosen base
IRIs introduce name clashes, see the paragraph about Entity map
OWL class IRIs at the beginning of this section.

\subsection{The proposed IRI generation scheme}
\label{iris_scheme}
This section introduces an IRI generation scheme meeting the requirements
formulated in Section~\ref{iris_req}.

In this section, the following strings are subsumed under the term
\emph{marking strings}:\\
``\code{TABLE\_\_}'', ``\code{TBL\_\_}'', ``\code{PROP\_\_}'',
``\code{REF\_\_}'' and ``\code{SUBTYPE\_\_}''.\\
The string built by escaping (prefixing) all occurrences of marking strings or
`\textasciitilde' characters in a string $s$ with a `\textasciitilde'
character will be called the \emph{IRI-safe version} of $s$.
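A minimal Python sketch of this definition follows. The function name
\code{iri\_safe} is hypothetical, and the sketch assumes that occurrences
are matched from left to right; since no marking string overlaps with
another one or with itself, this choice is unambiguous.
\begin{verbatim}
import re

# All marking strings plus the escape character '~' itself.
_TO_ESCAPE = re.compile(r"TABLE__|TBL__|PROP__|REF__|SUBTYPE__|~")

def iri_safe(s: str) -> str:
    # Prefix every marking string and every '~' in s with '~'.
    return _TO_ESCAPE.sub(lambda m: "~" + m.group(0), s)

print(iri_safe("Persons"))                  # Persons
print(iri_safe("Persons__TABLE__Persons"))  # Persons__~TABLE__Persons
print(iri_safe("a~b"))                      # a~~b
\end{verbatim}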
%

The IRI generation scheme is presented in Table~\ref{bootstrap_tab_iris}.
Here, \\ \emph{<base:>} refers to the base IRI to be used for
the generated IRI (see Section~\fullref{back_basic}),\\
\emph{<cl. tbl name>} refers to the table name of the database table
concerned (see Section~\ref{iris_clashes}) in its IRI-safe version,\\
\emph{<cl. name 1st pk col>} refers to the name of the first primary key
column concerned (see Section~\ref{iris_clashes}) in its IRI-safe version,\\
\emph{<\code{/}...>} refers to the continuation of the previous pattern
using the remaining primary key or foreign key columns,\\
\emph{<cl. col name>} refers to the name of the column in question
(see Section~\ref{iris_clashes}) in its IRI-safe version,\\
\emph{<cl. src tbl>} refers to the table name of the database table
containing the respective foreign key (see Section~\ref{iris_clashes})
in its IRI-safe version,\\
\emph{<cl. 1st src col>} refers to the name of the first foreign key
column of the respective foreign key (see Section~\ref{iris_clashes})
in its IRI-safe version,\\
\emph{<cl. tgt tbl>} refers to the table name of the table referenced by
the respective foreign key (see Section~\ref{iris_clashes})
in its IRI-safe version and\\
\emph{<cl. 1st tgt col>} refers to the name of the first column referenced
by the respective foreign key (see Section~\ref{iris_clashes})
in its IRI-safe version.

\begin{table}[H]\begin{centering}
  \begin{tabular}{p{6.3cm}|p{9.7cm}}
  \textbf{IRI type} & \textbf{Proposed IRI} \\ \hline
  Entity map OWL class IRI & <base:>\code{TABLE\_\_}<cl. tbl name>\\
  Identifier map IRI pattern & <base:>\code{TBL\_\_}<cl. tbl name>\code{/}<cl. name 1st pk col>\newline \ind{} \code{/\{\$1\}/}<\code{/}...>\\
  Attribute map OWL property IRI & <base:>\code{PROP\_\_}<cl. tbl name>\code{\_\_}<cl. col name>\\
  Relation map OWL property IRI & <base:>\code{REF\_\_}<cl. src tbl>\code{/}<cl. 1st src col>\newline \ind{} <\code{/}...>\code{/}<cl. tgt tbl>\code{/}<cl. 1st tgt col><\code{/}...>\\
  Subtype map IRI prefixes & <base:>\code{SUBTYPE\_\_}<cl. tbl name>\code{\_\_}\newline \ind{} <cl. col name>\code{/}\\
% % Slanted and with underscores and without "clean":
% Entity map OWL class IRI & \textit{<base>}\code{:TABLE\_\_}\textit{<table\_name>}\\
% Identifier map IRI pattern & \textit{<base>}\code{:TBL\_\_}\textit{<table\_name>}\code{/}\textit{<name\_1st\_pk\_col>}\newline \ind{} \code{/\{\$1\}/}\textit{<\code{/}...>}\\
% Attribute map OWL property IRI & \textit{<base>}\code{:PROP\_\_}\textit{<table\_name>}\code{\_\_}\textit{<col\_name>}\\
% Relation map OWL property IRI & \textit{<base>}\code{:REF\_\_}\textit{<src\_table>}\code{/}\textit{<1st\_fk\_src\_col>}\newline \ind{} \textit{<\code{/}...>}\code{/}\textit{<tgt\_table>}\code{/}\textit{<1st\_fk\_tgt\_col>}\textit{<\code{/}...>}\\
% Subtype map IRI prefixes & \textit{<base>}\code{:SUBTYPE\_\_}\textit{<table\_name>}\code{\_\_}\textit{<col\_name>}\code{/}\\
  \end{tabular}
  \caption{Proposed IRIs to be used in OBDA specification map fields}
  \label{bootstrap_tab_iris}
\end{centering}\end{table}
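The rows of Table~\ref{bootstrap_tab_iris} can be read as string
templates. The following Python sketch is one possible reading and not
the implementation used in \myprog{}: all function names and the base IRI
\code{http://example.org/} are hypothetical, and for keys with several
columns it assumes that the continuation repeats the pattern of the first
column (with \code{\{\$2\}}, \code{\{\$3\}} and so on in Identifier map
IRI patterns).
\begin{verbatim}
import re

_TO_ESCAPE = re.compile(r"TABLE__|TBL__|PROP__|REF__|SUBTYPE__|~")

def iri_safe(s):
    # Prefix every marking string and every '~' in s with '~'.
    return _TO_ESCAPE.sub(lambda m: "~" + m.group(0), s)

def entity_class_iri(base, table):
    return base + "TABLE__" + iri_safe(table)

def identifier_pattern(base, table, pk_cols):
    # Assumption: each key column contributes "<col>/{$i}/".
    cols = "".join(iri_safe(c) + "/{$" + str(i) + "}/"
                   for i, c in enumerate(pk_cols, start=1))
    return base + "TBL__" + iri_safe(table) + "/" + cols

def attribute_property_iri(base, table, col):
    return base + "PROP__" + iri_safe(table) + "__" + iri_safe(col)

def relation_property_iri(base, src_tbl, src_cols, tgt_tbl, tgt_cols):
    parts = [src_tbl, *src_cols, tgt_tbl, *tgt_cols]
    return base + "REF__" + "/".join(iri_safe(p) for p in parts)

def subtype_prefix(base, table, col):
    return base + "SUBTYPE__" + iri_safe(table) + "__" + iri_safe(col) + "/"

base = "http://example.org/"  # hypothetical base IRI
print(entity_class_iri(base, "Persons"))
print(identifier_pattern(base, "Persons", ["id"]))
print(attribute_property_iri(base, "Persons", "name"))
print(relation_property_iri(base, "orders", ["person_id"], "Persons", ["id"]))
print(subtype_prefix(base, "Persons", "kind"))
\end{verbatim}
With these definitions, the two tables from the example in
Section~\ref{iris_clashes} no longer receive the same IRI:
\code{entity\_class\_iri("TABLE\_\_Persons\_\_", "Persons")} yields
\code{TABLE\_\_Persons\_\_TABLE\_\_Persons}, whereas
\code{entity\_class\_iri("", "Persons\_\_TABLE\_\_Persons")} yields
\code{TABLE\_\_Persons\_\_\textasciitilde{}TABLE\_\_Persons}.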
%

It is easily verified that the proposed IRI scheme is correct
with regard to the requirements described in Section~\ref{iris_req}:
it provides unique IRIs for all types of IRIs it can create,
regardless of the chosen base IRI
(see proof in Section~\ref{iris_proof}).
Furthermore, the IRI scheme is expressive: ignoring the base
IRI part, the kind of entity identified by the IRI can be
determined from the beginning of the IRI.
In addition, it is regular in that
the name of the containing table always occurs before the
name of the first database column.

Taking the information in Section~\ref{iris_clashes} into account,
it is trivial to observe that the suggested IRI scheme remains correct
even without the requirement to use IRI-safe versions, provided that
all IRIs are generated using the same base IRI.

Note that it is in any case necessary that the beginnings
of all kinds of IRIs be mutually different:
if, for example, an Identifier map IRI pattern also
commenced with ``<base:>\code{TABLE\_\_}'', a table named
``\code{PERSONS/\{17\}}'' (which is a valid table name, for
example, in SQL \cite{sql}) could be assigned an Entity map OWL
class IRI that clashes with the IRI resulting from the
expansion of the IRI pattern
``<base:>\code{TABLE\_\_PERSONS/\{\$1\}}''.

\subsection{Proof of correctness of the proposed IRI scheme}
\label{iris_proof}
As described in Section~\ref{iris_req}, the previously described
IRI scheme is required to generate several types of IRIs without
introducing name clashes, that is, two equal IRIs for two distinct
entities, independently of the chosen base IRIs.
Additionally, the user shall be able to choose, from an infinite set,
additional IRIs that are guaranteed not to collide with IRIs
generated with the scheme.

In this proof, as in Section~\ref{iris_scheme}, the strings
``\code{TABLE\_\_}'', ``\code{TBL\_\_}'', ``\code{PROP\_\_}'',
``\code{REF\_\_}'' and ``\code{SUBTYPE\_\_}'' are called
\emph{marking strings}.\\
Strings prefixed by `\textasciitilde' are referred to as
\emph{escaped}, while strings not prefixed by `\textasciitilde'
are referred to as \emph{unescaped}.

In the following, it is proven that the IRIs of each type do not
clash, neither among themselves nor with IRIs of other types.
Since all generated IRIs begin with a marking string, every IRI \emph{not}
beginning with a marking string, of which there are infinitely many,
is sure not to collide with any of the generated IRIs, and so
the correctness with respect to the stated requirements is then proven.

Each Entity map OWL class IRI (including its base IRI) is of the form
$\alpha$\code{TABLE\_\_}$\beta$, with $\alpha$ not ending with `\textasciitilde'
and $\alpha$ and $\beta$ not containing any unescaped marking strings.\\
Thus, $\beta$ is the table name (in its IRI-safe version), making the IRI
unique among all other Entity map OWL class IRIs (see considerations in
Section~\ref{iris_clashes}).
Because the IRI does not contain any other unescaped marking string,
it cannot collide with any IRI of another type and thus is indeed unique.

The proof for Identifier map IRI patterns, Attribute map OWL property IRIs,
Relation map OWL property IRIs and Subtype map IRI prefixes is entirely
analogous.

\hfill $\Box$