\section{Ontology bootstrapping using OBDA specifications}
\label{bootstrap_spec}
TODO: only mapping, no duplication, r2rml

\subsection{Structure of OBDA specifications}
\label{bootstrap_spec_struc}
An OBDA specification consists of several so-called ``maps'': data
records whose statically defined fields hold data and references to other
maps, each describing a part of the OBDA specification \cite{eng}.
Different aspects of the specification are covered by different map types,
and usually several maps exist for each type.
The map types are \emph{Entity maps} describing database tables,
\emph{Identifier maps} describing database primary keys,
\emph{Attribute maps} describing database columns,
\emph{Relation maps} describing database foreign keys,
\emph{Subtype maps} describing ``is-a'' relationships in the data and
\emph{Translation tables} describing desired translations of data.

The fields of the several types of maps and their interconnection via
references are shown in Figure~\ref{spec_fig_structure}.
Here, each field specifies, in this order, the field label, the field's
long name, the bootstrapping steps in which the field is used and the
field's short name. Fields storing a set of values have both their short
and their long name suffixed with ``\code{...}''.
Note that each reference between two fields is labeled with a short field
name contained in the source of the reference, specifying the field
in which the reference is stored.
What exactly the values of the fields of the several types of maps express
and how they can be used is described in
Section~\ref{bootstrap_spec_using}.
For a full description of the structure of OBDA specifications,
their application and the general idea behind them,
refer to \cite{eng}.
How OBDA specifications, excluding Subtype maps and Translation tables,
can in turn be bootstrapped automatically is
described in Section~\ref{bootstrap_bootstrap}.

\begin{figure}[H]\begin{center}
  \includegraphics[scale=1.0]{Images/specification_structure.pdf}
  \caption[Constitution of an OBDA specification]{
    Constitution of an OBDA specification.
    ``$\rightarrow$'' means ``references''. (from \cite{eng})}
  \label{spec_fig_structure}
\end{center}\end{figure}

Entity maps, Identifier maps, Attribute maps and Relation maps
relate directly to database concepts: each of them describes
exactly one database table, primary key, column or foreign key,
respectively, and vice versa.
Subtype maps and Translation tables, on the other hand,
represent concepts of the bootstrapping process or data to be added to
the target ontology and are somewhat harder to obtain:
Subtype maps represent ``is-a'' relationships in the target ontology
that have to be determined from the source data \cite{eng}, while
Translation tables allow for the transformation of data values,
for example from \code{TRUE} to \code{true} or from \code{No} to
\code{false} \cite{eng}.
Both therefore have to be determined heuristically or
semi-automatically from the source data;
considering the database schema only is not sufficient \cite{eng}.
Note that, because of this, special care has to be taken to keep
Subtype maps and Translation tables synchronized with the data.

The structural description of OBDA specifications in \cite{eng}
does not propose a serialization format in which OBDA specifications can
be stored or read and written by software and human agents.
How this can be done is a subject of this thesis, which introduces the
\oslboth{}, designed exactly for this purpose, in Chapter~\ref{osl}.

\subsection{Using OBDA specifications}
\label{bootstrap_spec_using}
TODO: dirm
TODO: r2rml

As described in Section~\fullref{back_obdaspecs}, using OBDA
specifications provides several benefits for ontology
bootstrapping.
Principally, information about the bootstrapping process is collected in
one place and can be used to manage the tools involved.
This includes the availability of the URIs to be used in the constructed
ontology from a central place, which is a great advantage, since URIs
are central to an ontology TODO.
Additionally, all information on the database schema of the source
database is available.
Using Translation tables, all this information can at will be made
subject to transformations that normalize or correct the data
without changing the database \cite{eng}.

Using a single specification language like the \oslboth{}
to store OBDA specifications
can significantly reduce the expense of converting between different
data formats:
assuming that $n$ different formats have to be handled and no means
to convert between them is provided, the conversion cost decreases from
$\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ by introducing a single central
language TODO.
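
As a rough illustration of this estimate, assume (this is an assumption of
the sketch, not a statement from the literature) that one dedicated
converter is needed for each ordered pair of formats that are to be
converted into each other.
Converting directly between all $n$ formats then requires
\[
  n(n-1) \in \mathcal{O}(n^2)
\]
converters, whereas converting via a single central language requires only
one converter into and one converter out of that language per format, thus
\[
  2n \in \mathcal{O}(n)
\]
converters. For $n = 5$ formats, for instance, this amounts to $20$
converters in the first case and $10$ in the second.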

Besides the structure of OBDA specifications, described in
Section~\ref{bootstrap_spec_struc}, Skjæveland et al. introduce a set
of formal rules defining the bootstrapping process and the mapping of the
source data to the generated ontology \cite{eng}.
Using an OBDA specification, tools implementing these rules can
bootstrap, easily and in a well-defined manner, an ontology and
a mapping from the source data onto this ontology (or duplicate that
source data, if the materialized OBDA approach is used, see
Section~\ref{back_obda}).
The mapping rules produce RDF triples which can be interpreted by R2RML
to establish the mapping (see Section~\fullref{back_basic}).
They are accompanied by SQL statements specifying the
queries over the source database used to link the ontology to the data.
If the data source is not a relational database but another form of
structured data, like \name{CSV} files, a database schema can be
bootstrapped first by applying additional ``database rules'' \cite{eng}.
Afterwards, the process can continue as if the data source were a
database, so this case is not considered further in the following.

The rest of this section describes the information contained
in the various types of OBDA specification maps and how it is used during
bootstrapping, based on the description in \cite{eng}.
Keep in mind that ``bootstrapping'' here refers to the ontology
bootstrapping process specified by the OBDA specification, yielding a
target ontology and mappings
relating it to the source database; it does not refer to the
bootstrapping of the OBDA specification itself.
The text is meant to give a brief explanatory overview of the bootstrapping
process using OBDA specifications.
Thus, it focuses on the information they provide and how they are used to
link the bootstrapped ontology to the source data, leaving out the
SQL statements used to retrieve the datasets.
How exactly the ontology is created is also left out, since this would
involve introducing too many technical details and the topic can be
understood from the description of the mapping alone.
For a detailed description of the bootstrapping process, including ontology
creation and all formal rules to be applied, see \cite{eng}.
For an explanation of how an OBDA specification containing Entity maps,
Identifier maps, Attribute maps and Relation maps can be bootstrapped from
a relational database schema, see Section~\ref{bootstrap_bootstrap}.
For details on URI generation, refer to Section~\fullref{iris}.

\subsubsection{Entity maps}
Entity maps provide information about the tables contained in the
source database or in the intermediate database schema to be constructed,
if the data source is not a database but some other source of structured
information (see Section~\fullref{back_obda}).
The information provided by an Entity map includes the table name,
a label describing the table and a (more detailed) description
of the table.
Furthermore, each Entity map references an Identifier map representing
the table's primary key and a set of Attribute maps representing its
columns.
Finally, an Entity map provides an OWL class URI identifying the
represented table uniquely in the resulting ontology.
As the name suggests, this URI is given to an OWL class which serves as
OWL type (or more precisely: \code{rdf:type}) for all OWL individuals
representing the datasets from the respective table in the target ontology
(see Paragraph~``\nameref{iden}'').

Suppose, for example, that an Entity map for a table ``\code{persons}''
provides the OWL class URI ``\code{mydb:persons}''.
Then, in the target ontology, all OWL individuals representing rows in the
``\code{persons}'' table will be of \code{rdf:type}
``\code{mydb:persons}''.
If a data record in the ``\code{persons}'' table has the identifying URI
pattern ``\code{mydb:person/\{pno\}}'' (see Paragraph~``\nameref{iden}''),
this type information will be expressed by the following RDF triple:\\
\code{mydb:person/\{pno\} rdf:type mydb:persons}.

\subsubsection{Identifier maps}
\label{iden}
Identifier maps describe database primary keys contained in the
source database or in the intermediate database schema to be constructed,
if the data source is not a database (see Section~\ref{back_obda}).
Each Identifier map contains a reference back to the respective
Entity map, representing the database table the primary key belongs to.
Furthermore, each Identifier map references a set of Attribute maps,
representing the database columns the primary key consists of
(see next paragraph).
Finally, an Identifier map provides a URI pattern, allowing OWL
individuals in the bootstrapped target ontology that represent
datasets to be identified.
A URI pattern contains placeholders like ``\code{\{\$1\}}'' for
all primary key columns, which are replaced with the respective column
names, surrounded by curly braces, during the bootstrapping process,
to yield a valid R2RML template \cite{r2rml}.
Since the column name substituted in is itself a placeholder,
the result of this substitution is still a URI pattern and not a URI.
Since such a URI pattern uniquely identifies a dataset from a given
database table once data values are substituted in,
it is called \emph{identifying URI pattern} in the following.

Consider the URI pattern ``\code{mydb:person/\{\$1\}}''.
Replacing the placeholder ``\code{\{\$1\}}'' with the column name
``\code{pno}'' in curly braces yields the following identifying URI
pattern: \\ ``\code{mydb:person/\{pno\}}''.
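
The placeholder substitution can be sketched in a few lines of Python.
This is a minimal illustration and not the software developed in this
thesis: the function name \code{to\_identifying\_pattern} is hypothetical,
and the sketch assumes that placeholders are numbered from 1 in the order
of the primary key columns, which the example above only shows for a
single column.

```python
import re
from typing import List


def to_identifying_pattern(uri_pattern: str, key_columns: List[str]) -> str:
    """Replace each positional placeholder {$i} with {column_name}.

    Assumption of this sketch: {$1} refers to the first primary key
    column, {$2} to the second, and so on.
    """
    def substitute(match: "re.Match") -> str:
        index = int(match.group(1)) - 1  # placeholders assumed to be 1-based
        return "{" + key_columns[index] + "}"

    # Matches the literal text "{$<digits>}" and captures the digits.
    return re.sub(r"\{\$(\d+)\}", substitute, uri_pattern)


print(to_identifying_pattern("mydb:person/{$1}", ["pno"]))  # mydb:person/{pno}
```

Note that the result still contains a placeholder (now a column name), so
it is an R2RML-style template rather than a concrete URI.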

\subsubsection{Attribute maps}
Attribute maps provide information about database columns contained in the
source database.
Each Attribute map carries the column name, the information whether a
value in this column is mandatory for a dataset (in SQL terms: whether the
column has a \code{NOT NULL} constraint), a label describing the column as
well as an extended description of the column.
A database column is represented as a relation in the final ontology, thus,
in OWL terms, as an \code{owl:DataProperty} or an \code{owl:ObjectProperty}.
The Attribute map provides the URI for this OWL property.

Additionally, it specifies the datatype of values in the
represented column in the following manner:
Three fields are provided for this purpose, \code{SQL datatype},
\code{RDF language} and \code{XSD datatype}.
If the \code{XSD datatype} field is nonempty, its value is taken as
the datatype for values in the column the Attribute map represents
(note that OWL only knows XSD datatypes TODO).
Otherwise, if the value of the \code{SQL datatype} field is a standard SQL
type, it is mapped to an XSD datatype and the resulting type is
taken as the datatype for values in the respective column.
If neither of the above is the case and the \code{RDF language} field is
nonempty, values in the respective column are interpreted as strings
with the value of the \code{RDF language} field applied as RDF language tag
(TODO).
If none of the above is the case, values in the respective column are
interpreted as strings without an RDF language tag.
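
The precedence between the three fields can be summarized by the following
Python sketch. It is an illustration under stated assumptions only: the
function name \code{resolve\_datatype} is hypothetical, and the dictionary
\code{SQL\_TO\_XSD} is a small made-up excerpt, since the concrete mapping
from standard SQL types to XSD datatypes is not given here.

```python
from typing import Optional, Tuple

# Illustrative excerpt only (assumption): the actual SQL-to-XSD mapping
# is not specified in the text.
SQL_TO_XSD = {
    "INTEGER": "xsd:integer",
    "BOOLEAN": "xsd:boolean",
    "DATE": "xsd:date",
}


def resolve_datatype(
    sql_datatype: str, rdf_language: str, xsd_datatype: str
) -> Tuple[Optional[str], Optional[str]]:
    """Return (datatype, language_tag) for values of a column.

    A datatype of None stands for a plain string; a language tag of
    None means that no RDF language tag is applied.
    """
    if xsd_datatype:                           # 1. explicit XSD datatype wins
        return xsd_datatype, None
    if sql_datatype.upper() in SQL_TO_XSD:     # 2. standard SQL type is mapped
        return SQL_TO_XSD[sql_datatype.upper()], None
    if rdf_language:                           # 3. string with language tag
        return None, rdf_language
    return None, None                          # 4. string without language tag


print(resolve_datatype("INTEGER", "en", ""))   # ('xsd:integer', None)
```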

Finally, an Attribute map specifies whether the column shall be represented
as an \\ \code{owl:DataProperty} (for non-foreign-key columns) or as an
\code{owl:ObjectProperty} (for foreign key columns).
The field specifying this, \code{Property type}, additionally
distinguishes, for object properties, whether the property's
target URI shall be the column name placed in a URI pattern
provided by the Attribute map,
similarly to URI patterns in Identifier maps (see the example at the end of
this paragraph), or whether it shall simply be the column name, possibly
with a translation from a Translation table specified by the Attribute map
applied.
The former option is useful, for example, when using custom property URIs
to express relations between source data and the target ontology.
If an \code{owl:DataProperty} is generated, it always has the column name
as target URI, without the use of a URI pattern.
Note that it is sufficient to have the column name be the target of the
property, since only a \emph{mapping} to the source data is generated.

Consider an Attribute map representing a column named ``\code{name}'' to
be mapped to an \\ \code{owl:DataProperty}.
Suppose the OWL property URI the Attribute map specifies is\\
``\code{mynamespace:lastName}'' and each dataset containing the
\code{name} column has the identifying URI pattern
``\code{mydb:person/\{pno\}}'' (see Paragraph~``\nameref{iden}'').
Then, during the bootstrapping process,
the following RDF triple will be produced:\\
\code{mydb:person/\{pno\} mynamespace:lastName "\{name\}"}.\\
This triple can easily be interpreted by R2RML, which on request then
retrieves the queried \code{name} from the data source.

Consider, as a more elaborate example,
an Attribute map that represents a column named
``\code{company}'' and is to be mapped to an \code{owl:ObjectProperty}
with the use of the URI pattern ``\code{http://otherdb/\{\$1\}}''.
Suppose the OWL property URI the Attribute map specifies is
``\code{mynamespace:hasSameOwnerAs}'', there is no datatype specified
and each dataset containing the
\code{company} column has the identifying URI pattern
``\code{mydb:company/\{cmpno\}}''
(see Paragraph~``\nameref{iden}'').
Then, the following RDF triple will be produced during the bootstrapping
process:\\
\code{mydb:company/\{cmpno\} mynamespace:hasSameOwnerAs
  http://otherdb/\{company\}}.\\
Note that this only makes sense if R2RML can expand
``\code{http://otherdb/\{company\}}'' to a valid subject
for each value in the \code{company} column of the database,
and if all rows in the respective database table are
indeed entities having the same owner as the company specified in the
\code{company} column.
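
Both examples can be reproduced with the following Python sketch.
It is a simplified illustration, not the formal rules from \cite{eng}:
the function name \code{attribute\_triple} and its parameters are
hypothetical, datatypes and Translation tables are ignored, and the URI
pattern is assumed to contain the single placeholder
``\code{\{\$1\}}'' for the column name.

```python
from typing import Tuple


def attribute_triple(
    identifying_pattern: str,
    property_uri: str,
    column: str,
    property_type: str,
    uri_pattern: str = "",
) -> Tuple[str, str, str]:
    """Build the (subject, predicate, object) template for one column."""
    placeholder = "{" + column + "}"
    if property_type == "DataProperty":
        # Data properties always point to the column value as a literal.
        obj = '"' + placeholder + '"'
    elif uri_pattern:
        # Object property whose target is built from a URI pattern.
        obj = uri_pattern.replace("{$1}", placeholder)
    else:
        # Object property whose target is simply the column value.
        obj = placeholder
    return (identifying_pattern, property_uri, obj)


print(attribute_triple("mydb:person/{pno}", "mynamespace:lastName",
                       "name", "DataProperty"))
print(attribute_triple("mydb:company/{cmpno}", "mynamespace:hasSameOwnerAs",
                       "company", "ObjectProperty", "http://otherdb/{$1}"))
```

In both cases the result is a template: the column placeholders are only
replaced by data values when the mapping is evaluated against the source
database.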

\subsubsection{Relation maps}
Relation maps represent foreign keys contained in the
source database.
Each Relation map references the Entity maps representing the foreign key's
child table and parent table, respectively.
Furthermore, it provides the column names of both the foreign key columns
and the referenced columns and specifies an OWL property URI.
Relation maps allow the relations expressed in the source data via
foreign key relationships to be included in the bootstrapped ontology.
This happens in a simple and straightforward manner:
for each foreign key relationship, exactly one triple is generated which
contains the two identifying
URI patterns representing the source and the target dataset of the foreign
key, respectively, and the OWL property URI specified by the Relation map.

Consider, for example, a Relation map expressing a relation between
datasets with the identifying URI patterns (see Paragraph~``\nameref{iden}'')
``\code{mydb:persons/pno/\{pno\}}'' and
``\code{mydb:companies/cmpno/\{cmpno\}}'', respectively, and specifying
the OWL property URI ``\code{mynamespace:isEmployedAt}''.
This will result in the following RDF triple being generated during the
bootstrapping process:\\
\code{mydb:persons/pno/\{pno\} mynamespace:isEmployedAt
  mydb:companies/cmpno/\{cmpno\}}

\subsubsection{Subtype maps}
Subtype maps provide a means to automatically add subclass-superclass
relationships to the target ontology during the bootstrapping process.
They specify an Entity map and a column name defining a table and a
column, respectively, that exist in the source database and contain the
values to be declared as belonging to the subclass.
Furthermore, they store a prefix, a suffix and possibly a reference to
a Translation table, which are used to generate a URI for that subclass.
Finally, they provide the URI of the superclass.
This can be, for instance, some OWL class being created during the
bootstrapping process or already existing in some imported ontology.
The URI generated for the subclass contains the data value of the
respective database column, thus every distinct value yields its own
(sub)class.
During bootstrapping, an RDF triple declaring the value to belong to that
subclass is generated for each data value of the respective table column in
the source database.
The restriction to the respective value is achieved by a condition in
the SQL statement accompanying the triple, not by
limiting the triple itself to a specific data value.
The actual subclass-superclass relationship is expressed during the
creation of the ontology.
Note the difference from the previously described mapping rules, which
produce triples independent of the data values in the source database.

Consider a Subtype map specifying a table column containing the values
``\code{Purchase}'' and ``\code{Sales}'' with the datasets having the
identifying URI pattern (see Paragraph~``\nameref{iden}'')
``\code{mydb:managers/mno/\{mno\}}''.
Suppose the Subtype map specifies the prefix\\
``\code{mydb:manager/of\_department/}'', no suffix, no Translation table
and the supertype URI ``\code{mynamespace:persons}''.
This will result in the generation of the following triples during the
bootstrapping process:\\
\code{mydb:managers/mno/\{mno\} rdf:type mydb:manager/of\_department/Purchase}\\
\code{mydb:managers/mno/\{mno\} rdf:type mydb:manager/of\_department/Sales},\\
while ``\code{mydb:manager/of\_department/Purchase}'' \\ and
``\code{mydb:manager/of\_department/Sales}'' will be subclasses of class\\
``\code{mynamespace:persons}'' in the target ontology.
The accompanying SQL statement will ensure that, despite the use of the
R2RML template ``\code{mydb:managers/mno/\{mno\}}'', not \emph{every}
manager will be declared as the manager of \emph{every} department.
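
The interplay between the generated triples and the restricting SQL
condition can be illustrated as follows. This Python sketch makes several
assumptions that are not part of the description above: the function name
\code{subtype\_mappings} is hypothetical, the column name
\code{department} is invented for the example, and the form of the SQL
condition is only a guess at what an accompanying statement might
contain. Data values are inserted without escaping, which would not be
acceptable in real code.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def subtype_mappings(
    identifying_pattern: str,
    column: str,
    values: List[str],
    prefix: str,
    suffix: str = "",
) -> List[Tuple[Triple, str]]:
    """Return one (triple, SQL condition) pair per distinct data value."""
    mappings = []
    for value in values:
        subclass_uri = prefix + value + suffix
        triple = (identifying_pattern, "rdf:type", subclass_uri)
        # Hypothetical restriction: only rows holding this value are typed
        # with the corresponding subclass.
        condition = f"{column} = '{value}'"
        mappings.append((triple, condition))
    return mappings


for triple, condition in subtype_mappings(
    "mydb:managers/mno/{mno}",
    "department",                      # hypothetical column name
    ["Purchase", "Sales"],
    "mydb:manager/of_department/",
):
    print(triple, "WHERE", condition)
```

The subject template is identical in every triple; only the condition
decides which rows end up in which subclass.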

\subsubsection{Translation tables}
Translation tables allow for transforming URIs or other strings in
arbitrary ways, by simply mapping each string to be translated to a
target string.

They are not reflected in the target ontology in any form but are used
only during the bootstrapping process.
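
Conceptually, a Translation table is nothing more than a lookup table.
The following Python sketch uses the value pairs mentioned in
Section~\ref{bootstrap_spec_struc}; the function name \code{translate} is
hypothetical, and passing through values without an entry unchanged is an
assumption of this sketch, not a documented behaviour.

```python
from typing import Dict

# Value pairs taken from the examples given earlier in this chapter.
TRANSLATION_TABLE: Dict[str, str] = {
    "TRUE": "true",
    "No": "false",
}


def translate(value: str, table: Dict[str, str]) -> str:
    """Map a source string to its target string.

    Assumption: strings without an entry are returned unchanged.
    """
    return table.get(value, value)


print(translate("TRUE", TRANSLATION_TABLE))  # true
print(translate("No", TRANSLATION_TABLE))    # false
```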

\subsection{Bootstrapping OBDA specifications}
\label{bootstrap_bootstrap}
How OBDA specifications can in turn be bootstrapped from database schemata
is a subject of this thesis and is explained in this section.
For the description of the software developed to automate this, see
Chapter~\fullref{program}.
The description in this section assumes an SQL database as data source.
However, ontology-based data access and OBDA specifications are not limited
to SQL databases, as mentioned in Section~\fullref{back_obda} and in
Section~\fullref{bootstrap_spec_using}.
Furthermore, this section refers to OBDA specifications without assuming
any specific format in which they are represented.
How OBDA specifications are represented internally by the \myprog{}
software is described in Section~\fullref{fine}.
For the description of a format to serialize OBDA specifications
(the output format of \myprog{}),
refer to Chapter~\fullref{osl}.

Subtype maps and Translation tables are not considered in this approach,
since they cannot be bootstrapped from schema information only but have to
be determined from the input data (see Section~\fullref{bootstrap_spec_struc}).
Thus, the bootstrapped OBDA specification does not contain maps of these
types. Including them is a significant challenge in its own right and,
since the use of heuristics or user decisions would be necessary,
would at least require human supervision of the process.
Apart from that, the bootstrapping is an easy and straightforward task
which can be carried out fully automatically TODO.

Recall the structure of an OBDA specification explained in
Section~\fullref{bootstrap_spec_struc}.
The map types considered in this approach are Entity maps, Attribute maps,
Identifier maps and Relation maps.
The assignment of values to their fields is summarized in
Table~\ref{bootstrap_tab_mapping}, which only hints at how the maps are
generated. Both the generation of the maps and the assignment of values
to their fields are described in the rest of this section, with one
exception:
since the generation of URIs (or IRIs) in the context of OBDA specifications
is an essential topic which requires some conceptual effort, it is
described in a separate section, Section~\ref{iris}.

\begin{table}[H]\begin{centering}
  \begin{tabular}{p{3.2cm}|p{4.0cm}|p{8.3cm}}
    \textbf{Map type} & \textbf{Field name} & \textbf{Value} \\ \hline
    Entity map & Table name & SQL table name \\
    Entity map & Label & <empty> \\
    Entity map & \emph{Identifier map} & Identifier map for table \\
    Entity map & \emph{Attribute maps...} & Attribute maps for table columns \\
    Entity map & OWL class URI & URI(table) \\
    Entity map & Description & SQL table description \\ \hline
    Identifier map & \emph{Entity map} & Entity map for corresponding table \\
    Identifier map & \emph{Attribute maps...} & Attribute maps for primary key columns \\
    Identifier map & URI pattern & URIpattern(table) \\ \hline
    Attribute map & Column name & SQL column name \\
    Attribute map & SQL datatype & SQL datatype of column \\
    Attribute map & Mandatory & SQL \code{NOT NULL} property of column\newline(\code{true} or \code{false}) \\
    Attribute map & Label & <empty> \\
    Attribute map & OWL property URI & <empty> for foreign key columns,\newline else URI(table, column) \\
    Attribute map & Property type & ``\code{ObjectProperty}'' for foreign key columns,\newline else ``\code{DataProperty}'' \\
    Attribute map & \emph{Translation} & <empty> \\
    Attribute map & URI pattern & <empty> \\
    Attribute map & RDF language & <empty> \\
    Attribute map & XSD datatype & <empty> \\
    Attribute map & Description & SQL column description \\ \hline
    Relation map & \emph{Source entity map} & Entity map for foreign key child table \\
    Relation map & Source column & Foreign key child columns\newline(SQL column names) \\
    Relation map & \emph{Target entity map} & Entity map for foreign key parent table \\
    Relation map & Target column & Foreign key parent columns\newline (SQL column names) \\
    Relation map & OWL property URI & URI(table, foreignKey) \\
  \end{tabular}
  \caption{Assignment of values to fields of OBDA specification maps}
  \label{bootstrap_tab_mapping}
\end{centering}\end{table}

\subsubsection{Entity maps}
Exactly one Entity map and one Identifier map are generated per table
contained in the source database.
The generated Identifier map is referenced by the Entity map's
\code{Identifier map} field.
Similarly, exactly one Attribute map is generated per table column
and these Attribute maps are referenced by the Entity map's
\code{Attribute maps...} field.
The Entity map's \code{Table name} field is set to the SQL name of
the table, the \code{Label} field remains empty.
A URI identifying the table is generated and stored in the
Entity map's \code{OWL class URI} field.
The SQL table description is copied into the Entity map's
\code{Description} field.

\subsubsection{Identifier maps}
An Identifier map represents exactly one primary key in the source
database and is referenced by the Entity map representing the table
containing the primary key constraint.
In addition, it references this Entity map in its
\code{Entity map} field, so that the reference is bidirectional.
The Attribute maps representing the columns constituting the primary
key are referenced by the Identifier map's \code{Attribute maps...}
field.
A URI pattern, allowing datasets (thus, rows in the source database)
to be identified in the target ontology, is generated and put in
the Identifier map's \code{URI pattern} field.

\subsubsection{Attribute maps}
An Attribute map represents exactly one column in the source database
and is referenced by the Entity map representing the table
containing the column.
The Attribute map's \code{Column name} field is set to the SQL column
name of the column, the \code{SQL datatype} field is set
to its SQL datatype.
The \code{Mandatory} field is set to \code{true} if the column has
the SQL \code{NOT NULL} constraint, otherwise to \code{false}.
If the column is part of a foreign key, the \code{OWL property URI}
field remains empty. Otherwise,
a URI identifying the column is generated and stored in the
Attribute map's \code{OWL property URI} field.
The \code{Property type} field is set to ``\code{ObjectProperty}''
if the column is part of a foreign key, otherwise to
``\code{DataProperty}''.
The SQL column description is copied into the Attribute map's
\code{Description} field.
The remaining fields, \code{Label}, \code{Translation},
\code{URI pattern}, \code{RDF language} and \code{XSD datatype},
remain empty.
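
The field assignment for Attribute maps can be sketched as a small Python
function. This is an illustration only and not the implementation of
\myprog{}: the map is modelled as a plain dictionary keyed by the long
field names from Table~\ref{bootstrap_tab_mapping}, the function names are
hypothetical, and \code{make\_uri} is a mere stand-in for the URI
generation described in Section~\ref{iris}.

```python
from typing import Dict, Union

FieldValue = Union[str, bool]


def make_uri(table: str, column: str) -> str:
    """Placeholder for URI(table, column); the real scheme is not shown here."""
    return f"mydb:{table}/{column}"


def bootstrap_attribute_map(
    table: str,
    column: str,
    sql_datatype: str,
    not_null: bool,
    is_foreign_key: bool,
    description: str = "",
) -> Dict[str, FieldValue]:
    """Fill the fields of an Attribute map from SQL column metadata."""
    return {
        "Column name": column,
        "SQL datatype": sql_datatype,
        "Mandatory": not_null,
        "Label": "",
        # Foreign key columns get no property URI of their own.
        "OWL property URI": "" if is_foreign_key else make_uri(table, column),
        "Property type": "ObjectProperty" if is_foreign_key else "DataProperty",
        "Translation": "",
        "URI pattern": "",
        "RDF language": "",
        "XSD datatype": "",
        "Description": description,
    }


attribute_map = bootstrap_attribute_map(
    "persons", "name", "VARCHAR", not_null=True, is_foreign_key=False
)
print(attribute_map["Property type"])  # DataProperty
```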

\subsubsection{Relation maps}
A Relation map represents exactly one foreign key in the source database.
It contains fields storing the parent and child table of the foreign
key: the \code{Source entity map} field, referencing the
Entity map representing the child table of the foreign key, and
the \code{Target entity map} field, referencing the
Entity map representing the parent table of the foreign key.
The SQL column names of the foreign key columns (thus, column names
in the child table) are copied into the \code{Source column}
field of the Relation map, and the SQL column names of the referenced
columns in the parent table are copied into its \code{Target column}
field.
Note that, in contrast to an Identifier map representing a primary key,
a Relation map is not referenced by any Entity map (or any other map).