
\section{Ontology bootstrapping using OBDA specifications}
\label{bootstrap_spec}
TODO: only mapping, no duplication, r2rml

\subsection{Structure of OBDA specifications}
\label{bootstrap_spec_struc}
An OBDA specification consists of several so-called ``maps'': records
with statically defined fields that hold data and references to other
maps, each describing one part of the OBDA specification \cite{eng}.
Different aspects of the specification are covered by different map types,
and usually several maps exist for each type.
The map types are \emph{Entity maps} describing database tables,
\emph{Identifier maps} describing database primary keys,
\emph{Attribute maps} describing database columns,
\emph{Relation maps} describing database foreign keys,
\emph{Subtype maps} describing ``is-a'' relationships in the data and
\emph{Translation tables} describing desired translations of data.

The fields of the different types of maps and their interconnection via
references are shown in Figure~\ref{spec_fig_structure}.
There, each field specifies, in this order, the field label, the field's
long name, the bootstrapping steps in which the field is used and the
field's short name. Fields storing a set of values have both their short
and their long name suffixed with ``\code{...}''.
Note that each reference between two fields is denoted with a short field
name contained in the source of the reference, specifying the field
in which the reference is stored.
What the values of the fields of the several types of maps express
exactly and how they can be used is described in
Section~\ref{bootstrap_spec_using}.
For a full description of both the structure of OBDA specifications
and their application, as well as the general idea behind them,
refer to \cite{eng}.
How OBDA specifications can in turn be automatically bootstrapped,
excluding Subtype maps and Translation tables, is
described in Section~\ref{bootstrap_bootstrap}.

\begin{figure}[H]\begin{center}
		\includegraphics[scale=1.0]{Images/specification_structure.pdf}
		\caption[Constitution of an OBDA specification]{
			Constitution of an OBDA specification.
			``$\rightarrow$'' means ``references''. (from \cite{eng})}
		\label{spec_fig_structure}
\end{center}\end{figure}

Entity maps, Identifier maps, Attribute maps and Relation maps
directly relate to database concepts and each of them describes
exactly one database table, primary key, column or foreign key,
respectively, and vice versa.
Subtype maps and Translation tables, on the other hand,
represent concepts of the bootstrapping process or data to be added to
the target ontology and are somewhat harder to obtain:
Subtype maps represent ``is-a'' relationships in the target ontology
to be determined from the source data \cite{eng} -- heuristically or
semi-automatically --, while Translation tables allow for the
transformation of data values,
for example from \code{TRUE} to \code{true} or from \code{No} to
\code{false} \cite{eng}.
Therefore, Translation tables also have to be determined heuristically or
semi-automatically from the source data --
considering the database schema only is not sufficient \cite{eng}.
Note that, because of this, special care has to be taken to keep
Subtype maps and Translation tables synchronized with the data.
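
The map types and their mutual references can be pictured as plain records.
The following Python sketch is only an illustration and not the
representation used in \cite{eng} or by the software described in this
thesis: the class and attribute names are assumptions derived from the
field names used in this chapter, and the fields of Subtype maps and
Translation tables follow the prose description in
Section~\ref{bootstrap_spec_using}.

\begin{verbatim}
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TranslationTable:
    # Maps each source string to its target string.
    entries: Dict[str, str] = field(default_factory=dict)


@dataclass
class AttributeMap:
    column_name: str
    sql_datatype: str = ""
    mandatory: bool = False
    label: str = ""
    owl_property_uri: str = ""
    property_type: str = "DataProperty"
    translation: Optional[TranslationTable] = None
    uri_pattern: str = ""
    rdf_language: str = ""
    xsd_datatype: str = ""
    description: str = ""


@dataclass
class IdentifierMap:
    # Back-reference to the owning Entity map; excluded from repr and
    # comparison because the two maps reference each other.
    entity_map: Optional["EntityMap"] = field(
        default=None, repr=False, compare=False)
    attribute_maps: List[AttributeMap] = field(default_factory=list)
    uri_pattern: str = ""


@dataclass
class EntityMap:
    table_name: str
    label: str = ""
    identifier_map: Optional[IdentifierMap] = None
    attribute_maps: List[AttributeMap] = field(default_factory=list)
    owl_class_uri: str = ""
    description: str = ""


@dataclass
class RelationMap:
    source_entity_map: EntityMap
    source_columns: List[str]
    target_entity_map: EntityMap
    target_columns: List[str]
    owl_property_uri: str = ""


@dataclass
class SubtypeMap:
    entity_map: EntityMap
    column_name: str
    prefix: str
    suffix: str
    owl_superclass_uri: str
    translation: Optional[TranslationTable] = None
\end{verbatim}

In this sketch, references between maps are ordinary object references,
so the bidirectional link between an Entity map and its Identifier map
is simply a pair of attributes pointing at each other.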

The structural description of OBDA specifications in \cite{eng}
does not propose a serialization format in which OBDA specifications can
be stored or read and written by software and human agents.
How this can be done is a subject of this thesis, which introduces the
\oslboth{}, designed exactly for this purpose, in Chapter~\ref{osl}.

\subsection{Using OBDA specifications}
\label{bootstrap_spec_using}
TODO: dirm
TODO: r2rml

As described in Section~\fullref{back_obdaspecs}, using OBDA
specifications provides several benefits for ontology
bootstrapping.
Principally, information about the bootstrapping process is collected in
one place and can be used to manage the tools involved.
This includes the availability of the URIs to be used in the constructed
ontology from a central place, which is a great advantage, since URIs
are central to an ontology TODO.
Additionally, all information on the database schema of the source
database is available.
Using Translation tables, all this information can at will be made
subject to transformations that normalize or correct the data
without changing the database \cite{eng}.

By the use of a single specification language like the \oslboth{}
to store OBDA specifications,
the expense of converting between different data formats can be
reduced significantly:
assuming that there are $n$ different formats to be handled with no means
provided to convert between them, the conversion cost decreases from
$\mathcal{O}(n^2)$ to $\mathcal{O}(n)$ by introducing a single central
language TODO.
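
As a rough illustration of this estimate, assume (hypothetically) that
every converter translates in one direction between exactly two formats.
Converting directly between all pairs of $n$ formats then requires
$n(n-1)$ converters, whereas routing every conversion through one central
language requires one converter into and one converter out of that
language per format:
\[
	n(n-1) \in \mathcal{O}(n^2)
	\qquad \mbox{versus} \qquad
	2n \in \mathcal{O}(n).
\]
For $n = 5$, this amounts to $20$ converters compared to $10$.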

Besides the structure of OBDA specifications, described in
Section~\ref{bootstrap_spec_struc}, Skjæveland et al. introduce a set
of formal rules defining the bootstrapping process and the mapping of the
source data to the generated ontology \cite{eng}.
Using an OBDA specification, tools implementing these rules can
easily and in a well-defined manner bootstrap an ontology and
a mapping from the source data onto this ontology (or duplicate that
source data, if the materialized OBDA approach is used, see
Section~\ref{back_obda}).
The mapping rules produce RDF triples which can be interpreted by an R2RML
processor to establish the mapping (see Section~\fullref{back_basic}).
They are accompanied by SQL statements specifying the
queries over the source database used to link the ontology to the data.
If the data source is not a relational database but another form of
structured data, like \name{CSV} files, a database schema can be
bootstrapped first by applying additional ``database rules'' \cite{eng}.
Afterwards, the process can continue as if the data source were a
database, so this case is not considered separately in the following.

The rest of this section describes the information contained
in the various types of OBDA specification maps and how it is used during
bootstrapping, based on the description in \cite{eng}.
Keep in mind that ``bootstrapping'' here refers to the ontology
bootstrapping process specified by the OBDA specification, yielding a
target ontology and mappings
relating it to the source database -- it does not refer to the
bootstrapping of the OBDA specification itself.
The text is meant to give a brief explanatory overview of the bootstrapping
process using OBDA specifications.
Thus, it focuses on the information they provide and how they are used to
link the bootstrapped ontology to the source data, leaving out the
SQL statements used to retrieve the datasets.
How exactly the ontology is created is also left out, since this would
involve introducing too many technical details and the topic can be
understood from the description of the mapping alone.
For a detailed description of the bootstrapping process, including ontology
creation and all formal rules to be applied, see \cite{eng}.
For an explanation of how an OBDA specification containing Entity maps,
Identifier maps, Attribute maps and Relation maps can be bootstrapped from
a relational database schema, see Section~\ref{bootstrap_bootstrap}.
For details on URI generation, refer to Section~\fullref{iris}.

\subsubsection{Entity maps}
Entity maps provide information about the tables contained in the
source database or in the intermediate database schema to be constructed,
if the data source is not a database but some other source of structured
information (see Section~\fullref{back_obda}).
The information provided by an Entity map includes the table name,
a label describing the table and a (more detailed) description
of the table.
Furthermore, each Entity map references an Identifier map representing
its primary key and a set of Attribute maps representing its columns.
Finally, an Entity map provides an OWL class URI identifying the
represented table uniquely in the resulting ontology.
As the name suggests, this URI is given to an OWL class which serves as
OWL type (or more precisely: \code{rdf:type}) for all OWL individuals
representing the datasets from the respective table in the target ontology
(see Paragraph~``\nameref{iden}'').

Suppose, for example, that an Entity map for a table ``\code{persons}''
provides the OWL class URI ``\code{mydb:persons}''.
Then, in the target ontology, all OWL individuals representing rows in the
``\code{persons}'' table will be of \code{rdf:type}
``\code{mydb:persons}''.
If a data record in the ``\code{persons}'' table has the identifying URI
pattern ``\code{mydb:person/\{pno\}}'' (see Paragraph~``\nameref{iden}''),
this type information will be expressed by the following RDF triple:\\
\code{mydb:person/\{pno\} rdf:type mydb:persons}.

\subsubsection{Identifier maps}
\label{iden}
Identifier maps describe database primary keys contained in the
source database or in the intermediate database schema to be constructed,
if the data source is not a database (see Section~\ref{back_obda}).
Each Identifier map contains a reference back to the respective
Entity map, representing the database table the primary key belongs to.
Furthermore, each Identifier map references a set of Attribute maps,
representing the database columns the primary key consists of
(see next paragraph).
Finally, an Identifier map provides a URI pattern, allowing OWL
individuals in the bootstrapped target ontology that represent
datasets to be identified.
A URI pattern contains placeholders like ``\code{\{\$1\}}'' for
all primary key columns, which are replaced with the respective column
names, surrounded by curly braces, during the bootstrapping process,
to yield a valid R2RML template \cite{r2rml}.
Since the column name substituted in is still a placeholder,
the result of this substitution is still a URI pattern and not a URI.
Since such a URI pattern uniquely identifies a dataset from a given
database table -- when data values are substituted in --,
it will be called \emph{identifying URI pattern} in the following.

Consider the URI pattern ``\code{mydb:person/\{\$1\}}''.
Replacing the placeholder ``\code{\{\$1\}}'' with the column name
``\code{pno}'' in curly braces yields the following identifying URI
pattern: \\ ``\code{mydb:person/\{pno\}}''.
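
The substitution can be sketched in a few lines of Python.
The function name \code{to\_identifying\_pattern} is hypothetical, and the
sketch assumes that the placeholder numbered $k$ refers to the $k$-th
primary key column, counting from one; the example above only shows the
case $k = 1$.

\begin{verbatim}
import re


def to_identifying_pattern(uri_pattern, pk_columns):
    """Replace each {$k} by the k-th primary key column in braces."""
    def substitute(match):
        index = int(match.group(1))
        return "{" + pk_columns[index - 1] + "}"

    return re.sub(r"\{\$(\d+)\}", substitute, uri_pattern)


pattern = to_identifying_pattern("mydb:person/{$1}", ["pno"])
\end{verbatim}

The result is again a pattern: the column name in braces is replaced by
actual data values only when the mapping is evaluated against the
source database.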

\subsubsection{Attribute maps}
Attribute maps provide information about database columns contained in the
source database.
Each Attribute map carries the column name, a flag stating whether a
value in this column is mandatory for a dataset (in SQL terms: whether it
has a \code{NOT NULL} constraint), a label describing the column as well as
an extended description of the column.
A database column is represented as a relation in the final ontology, thus,
in OWL terms, as an \code{owl:DatatypeProperty} or an \code{owl:ObjectProperty}.
The Attribute map provides the URI for this OWL property.

Additionally, it specifies the datatype of values in the
represented column in the following manner:
Three fields are provided for this purpose, \code{SQL datatype},
\code{RDF language} and \code{XSD datatype}.
If the \code{XSD datatype} field is nonempty, its value is specified to be
the datatype for values in the column the Attribute map represents
(note that OWL only knows XSD datatypes TODO).
Otherwise, if the value of the \code{SQL datatype} field is a standard SQL
type, it will be mapped to an XSD datatype and the resulting type is
specified as datatype for values in the respective column.
If neither of the above is the case and the \code{RDF language} field is
nonempty, values in the respective column will be interpreted as strings
with the value of the \code{RDF language} field applied as RDF language tag
(TODO).
If none of the above is the case, values in the respective column will be
interpreted as strings without an RDF language tag.
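
The precedence of the three fields can be summarized by the following
Python sketch.
The function name \code{resolve\_literal\_type} is hypothetical, and the
mapping table \code{SQL\_TO\_XSD} is an assumption made for illustration:
it is deliberately incomplete, and SQL types missing from it are simply
treated as non-standard types here.

\begin{verbatim}
# Assumed, incomplete SQL-to-XSD mapping (illustration only).
SQL_TO_XSD = {
    "INTEGER": "xsd:integer",
    "BOOLEAN": "xsd:boolean",
    "DATE": "xsd:date",
    "DOUBLE PRECISION": "xsd:double",
}


def resolve_literal_type(sql_datatype, rdf_language, xsd_datatype):
    """Return (datatype, language_tag); at most one is not None."""
    # 1. An explicit XSD datatype always wins.
    if xsd_datatype:
        return xsd_datatype, None
    # 2. Otherwise, a known SQL type is mapped to an XSD datatype.
    mapped = SQL_TO_XSD.get(sql_datatype.strip().upper())
    if mapped is not None:
        return mapped, None
    # 3. Otherwise, a language tag yields a language-tagged string.
    if rdf_language:
        return None, rdf_language
    # 4. Fallback: plain string without a language tag.
    return None, None
\end{verbatim}

For instance, an Attribute map with the SQL datatype \code{INTEGER} and
an empty \code{XSD datatype} field resolves to \code{xsd:integer} in this
sketch, regardless of the \code{RDF language} field.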

Finally, an Attribute map specifies whether the column shall be represented
as an \\ \code{owl:DatatypeProperty} (for non-foreign-key columns) or as an
\code{owl:ObjectProperty} (for foreign key columns).
The field specifying this, \code{Property type}, also allows a further
distinction for object properties: whether the property's
target URI is the column name placed in a URI pattern
provided by the Attribute map,
similarly to URI patterns in Identifier maps (see the example at the end of
this paragraph), or simply the column name, possibly with a
translation from a Translation table, specified by the Attribute map,
applied.
The former option is useful, for example, when using custom property URIs
to express relations between source data and the target ontology.
If an \code{owl:DatatypeProperty} is generated, it always has the column name
as target URI, without the use of a URI pattern.
Note that it is sufficient to have the column name be the target of the
property, since only a \emph{mapping} to the source data is generated.

Consider an Attribute map representing a column named ``\code{name}'' to
be mapped to an \\ \code{owl:DatatypeProperty}.
Suppose the OWL property URI the Attribute map specifies is\\
``\code{mynamespace:lastName}'' and each dataset containing the
\code{name} column has the identifying URI pattern
``\code{mydb:person/\{pno\}}'' (see Paragraph~``\nameref{iden}'').
Then, during the bootstrapping process
the following RDF triple will be produced:\\
\code{mydb:person/\{pno\} mynamespace:lastName "\{name\}"}.\\
This triple can easily be interpreted by an R2RML processor, which on
request then retrieves the queried \code{name} from the data source.

Consider, as a more elaborate example,
an Attribute map representing a column named
``\code{company}'' that is to be mapped to an \code{owl:ObjectProperty} with the
use of the URI pattern ``\code{http://otherdb/\{\$1\}}''.
Suppose the OWL property URI the Attribute map specifies is
``\code{mynamespace:hasSameOwnerAs}'', there is no datatype specified
and each dataset containing the
\code{company} column has the identifying URI pattern
``\code{mydb:company/\{cmpno\}}''
(see Paragraph~``\nameref{iden}'').
Then, the following RDF triple will be produced during the bootstrapping
process:\\
\code{mydb:company/\{cmpno\} mynamespace:hasSameOwnerAs
	http://otherdb/\{company\}}.\\
Note that this only makes sense if an R2RML processor can expand
``\code{http://otherdb/\{company\}}'' to a valid subject
for each value in the \code{company} column of the database,
and if all rows in the respective database table are
indeed entities having the same owner as the company specified in the
\code{company} column.
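
Both examples follow the same scheme, which can be sketched as a small
Python function.
The function name \code{attribute\_triple} and its parameters are
hypothetical; the sketch covers only the two cases shown above (a data
property, and an object property with a URI pattern) and ignores
Translation tables.

\begin{verbatim}
def attribute_triple(subject_pattern, property_uri, column_name,
                     property_type="DataProperty", uri_pattern=""):
    """Return the pattern-level triple for one column as a string."""
    placeholder = "{" + column_name + "}"
    if property_type == "DataProperty":
        # Literal target: the column placeholder in double quotes.
        target = '"' + placeholder + '"'
    elif property_type == "ObjectProperty" and uri_pattern:
        # URI target: the column placeholder replaces {$1}.
        target = uri_pattern.replace("{$1}", placeholder)
    else:
        raise ValueError("case not covered by this sketch")
    return " ".join([subject_pattern, property_uri, target])


last_name = attribute_triple(
    "mydb:person/{pno}", "mynamespace:lastName", "name")
same_owner = attribute_triple(
    "mydb:company/{cmpno}", "mynamespace:hasSameOwnerAs", "company",
    property_type="ObjectProperty", uri_pattern="http://otherdb/{$1}")
\end{verbatim}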

\subsubsection{Relation maps}
Relation maps represent foreign keys contained in the
source database.
Each Relation map references the Entity maps representing the foreign key's
child table and parent table, respectively.
Furthermore, it provides the column names of both the foreign key columns
and the referenced columns and specifies an OWL property URI.
Relation maps allow the relations expressed in the source data via
foreign key relationships to be included into the bootstrapped ontology.
This happens in a simple and straightforward manner:
for each foreign key relationship, exactly one triple is generated which
contains the two identifying
URI patterns representing the source and the target dataset of the foreign
key, respectively, and the OWL property URI specified by the Relation map.

Suppose, for example, that a Relation map expresses a relation between
datasets with the identifying URI patterns (see Paragraph~``\nameref{iden}'')
``\code{mydb:persons/pno/\{pno\}}'' and
``\code{mydb:companies/cmpno/\{cmpno\}}'', respectively, and specifies
the OWL property URI ``\code{mynamespace:isEmployedAt}''.
This will result in the following RDF triple being generated during the
bootstrapping process:\\
\code{mydb:persons/pno/\{pno\} mynamespace:isEmployedAt
	 mydb:companies/cmpno/\{cmpno\}}
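
What such a pattern-level triple means for an individual pair of rows can
be illustrated by substituting concrete values into the patterns, which is
in essence what happens when an R2RML template is expanded \cite{r2rml}.
In the following Python sketch, the helper names and the sample row are
hypothetical, and the percent-encoding that R2RML prescribes for values
inserted into IRI templates is omitted.

\begin{verbatim}
import re


def relation_triple(source_pattern, property_uri, target_pattern):
    """Pattern-level triple for one foreign key."""
    return (source_pattern, property_uri, target_pattern)


def expand(pattern, row):
    """Replace each {column} in the pattern by the row's value."""
    return re.sub(r"\{(\w+)\}",
                  lambda m: str(row[m.group(1)]), pattern)


triple = relation_triple(
    "mydb:persons/pno/{pno}",
    "mynamespace:isEmployedAt",
    "mydb:companies/cmpno/{cmpno}",
)
# Hypothetical joined row: person 7 references company 42.
row = {"pno": 7, "cmpno": 42}
concrete = tuple(expand(term, row) for term in triple)
\end{verbatim}

The property URI contains no placeholder and is therefore left unchanged
by \code{expand}.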

\subsubsection{Subtype maps}
Subtype maps provide a means to automatically add subclass-superclass
relationships to the target ontology during the bootstrapping process.
They specify an Entity map and a column name defining a table and a
column, respectively, that exist in the source database and contain the
values to be declared as belonging to the subclass.
Furthermore, they store a prefix, a suffix and possibly a reference to
a Translation table which are used to generate a URI for that subclass.
Finally, they provide the URI of the superclass.
This can be, for instance, some OWL class being created during the
bootstrapping process or already existing in some imported ontology.
The URI generated for the subclass contains the data value of the
respective database column, thus every distinct data value gets its own
(sub)class.
During bootstrapping, an RDF triple declaring the value to belong to that
subclass is generated for each data value of the respective table column in
the source database.
The restriction to the desired value is achieved by a condition
in the SQL statement accompanying the respective triple, not by
limiting the triple itself to a specific data value.
The actual subclass-superclass relationship is expressed during the
creation of the ontology.
Note the difference from the previously described mapping rules, which
produced triples independently of the data values in the source database.

Consider a Subtype map specifying a table column containing the values
``\code{Purchase}'' and ``\code{Sales}'' with the datasets having the
identifying URI pattern (see Paragraph~``\nameref{iden}'')
``\code{mydb:managers/mno/\{mno\}}''.
Suppose the Subtype map specifies the prefix\\
``\code{mydb:manager/of\_department/}'', no suffix, no translation table
and the supertype URI ``\code{mynamespace:persons}''.
This will result in the generation of the following triples during the
bootstrapping process:\\
\code{mydb:managers/mno/\{mno\} rdf:type mydb:manager/of\_department/Purchase}\\
\code{mydb:managers/mno/\{mno\} rdf:type mydb:manager/of\_department/Sales},\\
while ``\code{mydb:manager/of\_department/Purchase}'' \\ and
``\code{mydb:manager/of\_department/Sales}'' will be subclasses of class\\
``\code{mynamespace:persons}'' in the target ontology.
The accompanying SQL statement will ensure that, despite the use of the
R2RML template ``\code{mydb:managers/mno/\{mno\}}'', not \emph{every}
manager will be declared as the manager of \emph{every} department.
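
The generation of these triples can be sketched in Python as follows.
The function name \code{subtype\_triples}, the column name
\code{department} and in particular the form of the SQL condition are
assumptions made for illustration; the actual SQL statements are
specified in \cite{eng}.

\begin{verbatim}
def subtype_triples(identifying_pattern, prefix, suffix, column,
                    values, translation=None):
    """Return one (triple, sql_condition) pair per distinct value."""
    result = []
    for value in sorted(set(values)):
        name = translation.get(value, value) if translation else value
        subclass_uri = prefix + name + suffix
        triple = (identifying_pattern, "rdf:type", subclass_uri)
        # Assumed form of the restriction; quotes doubled as in SQL.
        condition = "{} = '{}'".format(
            column, value.replace("'", "''"))
        result.append((triple, condition))
    return result


pairs = subtype_triples(
    "mydb:managers/mno/{mno}", "mydb:manager/of_department/", "",
    "department", ["Sales", "Purchase", "Sales"])
\end{verbatim}

Each triple still uses the full identifying URI pattern; only the
accompanying condition restricts it to the rows carrying the respective
value.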

\subsubsection{Translation tables}
Translation tables allow for transforming URIs or other strings in
arbitrary ways, by simply mapping each string to be translated to a
target string.

They are not reflected in the target ontology in any form but are used
only during the bootstrapping process.
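
In its simplest form, a Translation table is a lookup table.
The following Python sketch uses the example translations mentioned in
Section~\ref{bootstrap_spec_struc}; the names \code{TRANSLATION} and
\code{translate} are hypothetical, and passing strings without an entry
through unchanged is an assumption of this sketch.

\begin{verbatim}
TRANSLATION = {"TRUE": "true", "No": "false"}


def translate(value, table):
    """Return the translated string, or the input if no entry matches."""
    return table.get(value, value)
\end{verbatim}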

\subsection{Bootstrapping OBDA specifications}
\label{bootstrap_bootstrap}
How OBDA specifications can in turn be bootstrapped from database schemata
is a subject of this thesis and is explained in this section.
For the description of the software developed to automate this, see
Chapter~\fullref{program}.
The description in this section assumes an SQL database as data source.
However, ontology-based data access and OBDA specifications are not limited
to SQL databases, as mentioned in Section~\fullref{back_obda} and in
Section~\fullref{bootstrap_spec_using}.
Furthermore, this section refers to OBDA specifications without assuming
any specific format in which they are represented.
How OBDA specifications are represented internally by the \myprog{}
software is described in Section~\fullref{fine}.
For the description of a format to serialize OBDA specifications
-- the output format of \myprog{} --,
refer to Chapter~\fullref{osl}.

Subtype maps and Translation tables are not considered in this approach,
since they cannot be bootstrapped from schema information only but have to
be determined from the input data (see Section~\fullref{bootstrap_spec_struc}).
Thus, the bootstrapped OBDA specification does not contain maps of these
types. Including them is a significant challenge in its own right and,
since the use of heuristics or user decisions would be necessary,
would require at least some human supervision.
Apart from that, the bootstrapping is an easy and straightforward task
which can be carried out fully automatically TODO.

Recall the structure of an OBDA specification explained in
Section~\fullref{bootstrap_spec_struc}.
The map types considered in this approach are Entity maps, Attribute maps,
Identifier maps and Relation maps.
The assignment of values to their fields is summarized in
Table~\ref{bootstrap_tab_mapping}, only hinting at how maps are
generated. Both the generation of the maps and the assignment of values
to their fields are described in the rest of this section, with one
exception:
since the generation of URIs (or IRIs) in the context of OBDA specifications
is an essential topic which requires some conceptual effort, it is
described in a separate section, Section~\ref{iris}.

\begin{table}[H]\begin{centering}
		\begin{tabular}{p{3.2cm}|p{4.0cm}|p{8.3cm}}
			\textbf{Map type} & \textbf{Field name} & \textbf{Value} \\ \hline
			Entity map & Table name & SQL table name \\
			Entity map & Label & <empty> \\
			Entity map & \emph{Identifier map} & Identifier map for table \\
			Entity map & \emph{Attribute maps...} & Attribute maps for table columns \\
			Entity map & OWL class URI & URI(table) \\
			Entity map & Description & SQL table description \\ \hline
			Identifier map & \emph{Entity map} & Entity map for corresponding table \\
			Identifier map & \emph{Attribute maps...} & Attribute maps for primary key columns \\
			Identifier map & URI pattern & URIpattern(table) \\ \hline
			Attribute map & Column name & SQL column name \\
			Attribute map & SQL datatype & SQL datatype of column \\
			Attribute map & Mandatory & SQL \code{NOT NULL} property of column\newline(\code{true} or \code{false}) \\
			Attribute map & Label & <empty> \\
			Attribute map & OWL property URI & <empty> for foreign key columns,\newline else URI(table, column) \\
			Attribute map & Property type & ``\code{ObjectProperty}'' for foreign key columns,\newline else ``\code{DataProperty}'' \\
			Attribute map & \emph{Translation} & <empty> \\
			Attribute map & URI pattern & <empty> \\
			Attribute map & RDF language & <empty> \\
			Attribute map & XSD datatype & <empty> \\
			Attribute map & Description & SQL column description \\ \hline
			Relation map & \emph{Source entity map} & Entity map for foreign key child table \\
			Relation map & Source column & Foreign key child columns\newline(SQL column names) \\
			Relation map & \emph{Target entity map} & Entity map for foreign key parent table \\
			Relation map & Target column & Foreign key parent columns\newline (SQL column names) \\
			Relation map & OWL property URI & URI(table, foreignKey) \\
		\end{tabular}
			\caption{Assignment of values to fields of OBDA specification maps}
			\label{bootstrap_tab_mapping}
\end{centering}\end{table}

\subsubsection{Entity maps}
Exactly one Entity map and one Identifier map are generated per table
contained in the source database.
The generated Identifier map is referenced by the Entity map's
\code{Identifier map} field.
Similarly, exactly one Attribute map is generated per table column
and these Attribute maps are referenced by the Entity map's
\code{Attribute maps...} field.
The Entity map's \code{Table name} field is set to the SQL name of
the table, the \code{Label} field remains empty.
A URI identifying the table is generated and stored in the
Entity map's \code{OWL class URI} field.
The SQL table description is copied into the Entity map's
\code{Description} field.

\subsubsection{Identifier maps}
An Identifier map represents exactly one primary key in the source
database and is referenced by the Entity map representing the table
containing the primary key constraint.
In addition, it references this Entity map in its
\code{Entity map} field, so that the two maps reference each other.
The Attribute maps representing the columns constituting the primary
key are referenced by the Identifier map's \code{Attribute maps...}
field.
A URI pattern, allowing datasets (thus, rows in the source database)
to be identified in the target ontology, is generated and put in
the Identifier map's \code{URI pattern} field.

\subsubsection{Attribute maps}
An Attribute map represents exactly one column in the source database
and is referenced by the Entity map representing the table
containing the column.
The Attribute map's \code{Column name} field is set to the SQL column
name of the column, the \code{SQL datatype} field is set
to its SQL datatype.
The \code{Mandatory} field is set to \code{true} if the column has
the SQL \code{NOT NULL} constraint, otherwise to \code{false}.
If the column is part of a foreign key, the \code{OWL property URI}
field remains empty. Otherwise,
a URI identifying the column is generated and stored in the
Attribute map's \code{OWL property URI} field.
The \code{Property type} field is set to ``\code{ObjectProperty}''
if the column is part of a foreign key, otherwise to
``\code{DataProperty}''.
The SQL column description is copied into the Attribute map's
\code{Description} field.
The remaining fields, \code{Label}, \code{Translation},
\code{URI pattern}, \code{RDF language} and \code{XSD datatype},
remain empty.

\subsubsection{Relation maps}
A Relation map represents exactly one foreign key in the source database.
It contains fields storing the parent and child table of the foreign
key: the \code{Source entity map} field, referencing the
Entity map representing the child table of the foreign key, and
the \code{Target entity map} field, referencing the
Entity map representing the parent table of the foreign key.
The SQL column names of the foreign key columns (thus, column names
in the child table) are copied into the \code{Source column}
field of the Relation map, and the SQL column names of the referenced
columns in the parent table are copied into its \code{Target column}
field.
Note that, in contrast to Identifier maps representing primary keys,
a Relation map is not referenced by any Entity map (or any other map).
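
The steps described in this section can be summarized in the following
Python sketch.
It is not the implementation described in Chapter~\ref{program}: the
input format of the schema, the function names and the example schema are
hypothetical, maps are represented as plain dictionaries keyed by the
field names of Table~\ref{bootstrap_tab_mapping}, and \code{uri} and
\code{uri\_pattern} are mere placeholders for the URI generation
described in Section~\ref{iris}.

\begin{verbatim}
def uri(*parts):
    # Placeholder standing in for URI(...); not the actual scheme.
    return "mydb:" + "/".join(parts)


def uri_pattern(table, pk_columns):
    # Placeholder standing in for URIpattern(table).
    slots = ["{$%d}" % (i + 1) for i in range(len(pk_columns))]
    return uri(table, *slots)


def bootstrap_spec(schema):
    """Build Entity, Identifier, Attribute and Relation maps."""
    entity_maps, relation_maps = {}, []
    for table, info in schema.items():
        fk_columns = {c for fk in info["foreign_keys"]
                      for c in fk["columns"]}
        attribute_maps = {}
        for column, (sql_type, not_null, descr) in info["columns"].items():
            is_fk = column in fk_columns
            attribute_maps[column] = {
                "Column name": column,
                "SQL datatype": sql_type,
                "Mandatory": not_null,
                "Label": "",
                "OWL property URI": "" if is_fk else uri(table, column),
                "Property type": ("ObjectProperty" if is_fk
                                  else "DataProperty"),
                "Translation": None,
                "URI pattern": "",
                "RDF language": "",
                "XSD datatype": "",
                "Description": descr,
            }
        entity_map = {
            "Table name": table,
            "Label": "",
            "Attribute maps": list(attribute_maps.values()),
            "OWL class URI": uri(table),
            "Description": info["description"],
        }
        entity_map["Identifier map"] = {
            "Entity map": entity_map,  # back-reference
            "Attribute maps": [attribute_maps[c]
                               for c in info["primary_key"]],
            "URI pattern": uri_pattern(table, info["primary_key"]),
        }
        entity_maps[table] = entity_map
    # Second pass: every parent table now has its Entity map.
    for table, info in schema.items():
        for fk in info["foreign_keys"]:
            relation_maps.append({
                "Source entity map": entity_maps[table],
                "Source column": list(fk["columns"]),
                "Target entity map": entity_maps[fk["parent_table"]],
                "Target column": list(fk["parent_columns"]),
                "OWL property URI": uri(table, fk["name"]),
            })
    return entity_maps, relation_maps


EXAMPLE_SCHEMA = {
    "companies": {
        "description": "",
        "columns": {"cmpno": ("INTEGER", True, "")},
        "primary_key": ["cmpno"],
        "foreign_keys": [],
    },
    "persons": {
        "description": "",
        "columns": {
            "pno": ("INTEGER", True, ""),
            "name": ("VARCHAR(50)", False, ""),
            "cmpno": ("INTEGER", False, ""),
        },
        "primary_key": ["pno"],
        "foreign_keys": [{
            "name": "fk_company",
            "columns": ["cmpno"],
            "parent_table": "companies",
            "parent_columns": ["cmpno"],
        }],
    },
}
entity_maps, relation_maps = bootstrap_spec(EXAMPLE_SCHEMA)
\end{verbatim}

Relation maps are created in a second pass so that the Entity maps of
both the child and the parent table already exist when they are
referenced, independently of the order in which the tables are listed.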