\chapter{Background and related work}

\section{Background}
\subsection{Basic concepts}
\label{back_basic}
TODO: r2rml, rdf, rdfs, owl, xml, iris, baseiri, end with :

\subsection{Ontology-based data access (OBDA)}
\label{back_obda}
TODO: References

Storing data in relational databases is a very common practice, since the
notion of a relational database is comprehensible and widely known, while the
required software is widely available both commercially and as open-source
software.
Thus, it is easy for a domain expert to set up and populate a database.
Furthermore, relational databases provide significant advantages concerning
performance, data consistency and integrity, integration abilities, support
and general prominence.
These properties played a major role in the success of databases and in the
extensive exploration they have experienced, to the degree that they are now
regarded as the main strengths of databases.
Many -- if not all -- of these strengths trace back to the relatively fixed
and rigid schema that databases embody: a well-defined database schema imposing
strong and clear-cut constraints on the contained data.

However, this principle also induces notable disadvantages.
The database schema, although theoretically changeable, constitutes a
significant burden on TODO the representation of data from dynamic
environments, of incomplete data or of data subject to changing requirements,
especially when dealing with large amounts of data.
The resulting representation of data in unintuitive, suboptimal schemata
makes long and complex query constructs inevitable, so that more elaborate
queries quickly become unmanageable for non-experts
and time-consuming and error-prone even for experts.
Ontologies, on the other hand, are much more flexible regarding incomplete
data or changing requirements or environments and allow for much more
intuitive and abstract query systems, while still being a quite
comprehensible formalism
(see Section~\ref{ontologies} for publications describing
ontologies and the semantic web).
Besides, ontologies provide support for different data records referencing
the same entity, while databases do not \cite{eng}, and for the deduction of
implicit information \cite{eng}, which common database systems offer, if at
all, neither out of the box nor in an easy manner.
Often, however, relational databases are preferred for their advantages
(although the availability of cheaper yet more powerful hardware in some
cases offsets these) or are simply chosen erroneously.
Moreover, in some cases, the migration to ontology-based systems,
even if beneficial, is too costly to be seriously considered.
%, often due to the large amounts of data that would have to be converted.

Ontology-based data access often provides a solution to this collision of
interests:
By adding an ontology-based front end that processes the queries and is
sensibly mapped to the data representation,
the querying facilities of ontology-based systems are introduced,
and changes in the data representation can in most cases be carried out
without breaking existing queries; only the mapping has to be changed
-- in a one-time effort -- and existing queries have to be modified only
when the changed mapping alters the way it presents existing structures to
the user.
The creation of these mappings can in turn be computer-aided or, in
simple cases or up to a certain degree of completeness and accuracy,
the mappings can be completely bootstrapped.
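
The decoupling of queries from the data representation can be illustrated
with a minimal \name{Python} sketch using in-memory \name{SQLite} databases.
It is only an illustration under stated assumptions: the tables, the property
\texttt{ex:name}, the function \texttt{ask} and the dictionary-based mappings
are hypothetical, and real mapping languages are considerably richer.

\begin{verbatim}
import sqlite3

# Hypothetical mappings: ontology property -> SQL query that
# yields (subject, object) pairs for this property.
MAPPING_V1 = {"ex:name": "SELECT id, name FROM employee"}
# After a schema change (table and columns renamed), only the
# mapping is updated, in a one-time effort.
MAPPING_V2 = {"ex:name": "SELECT emp_id, full_name FROM staff"}

def ask(conn, mapping, prop):
    """Answer the ontology-level query 'all pairs of prop'."""
    return sorted(conn.execute(mapping[prop]).fetchall())

# Old schema.
old = sqlite3.connect(":memory:")
old.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
old.execute("INSERT INTO employee VALUES (1, 'Ada')")

# New schema holding the same data in a different structure.
new = sqlite3.connect(":memory:")
new.execute(
    "CREATE TABLE staff (emp_id INTEGER PRIMARY KEY, full_name TEXT)")
new.execute("INSERT INTO staff VALUES (1, 'Ada')")

# The user-facing query term is identical for both schemata.
result_old = ask(old, MAPPING_V1, "ex:name")
result_new = ask(new, MAPPING_V2, "ex:name")
\end{verbatim}

Both calls use the same ontology-level term; only the mapping differs between
the two schema versions, so the query itself does not have to be touched.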

Moreover, in cases where it is too costly or for other reasons infeasible
to carry out a complete data duplication, the data can remain in the
underlying database as is and the query front-end merely acts as an
interface transforming the query into a database query \cite{eng} by
making use of a backward-chaining technique called \emph{query rewriting}
\cite{eng}.
This approach is called \emph{virtual OBDA} or \emph{virtual RDF view}
\cite{eng} and is illustrated in Figure~\ref{back_subfig_virtual}.
The opposite approach of duplicating all data is called
\emph{materialized OBDA} or \emph{materialized RDF view} and is
illustrated in Figure~\ref{back_subfig_materialized}.
Virtual OBDA provides limited abilities compared to materialized OBDA
in that it does not allow for decoupling from the source data, for
example by adding inferred information or applying elaborate transformations,
and does not support ``fragments of OWL for which query
rewriting is not a complete deduction method'' \cite{eng}.
However, the response time of systems is hard to predict solely from the
architectural approach used, so if it is critical, several systems should
be prototyped and evaluated upfront on what are expected to be
typical queries \cite{eng}.
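
The difference between the two architectures can be sketched in a few lines
of \name{Python} using an in-memory \name{SQLite} database.
This is a simplification under stated assumptions: the table \texttt{well},
the property \texttt{ex:name} and all function names are hypothetical, and
the function \texttt{rewrite} merely unfolds a mapping into \name{SQL}; it
omits the ontology reasoning that actual query rewriting involves.

\begin{verbatim}
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE well (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO well VALUES (?, ?)",
                 [(1, "A-1"), (2, "B-7")])

# Hypothetical mapping: property -> (table, key column, value column).
MAPPING = {"ex:name": ("well", "id", "name")}

def materialize(conn):
    """Materialized view: copy all mapped data into a triple set."""
    triples = set()
    for prop, (table, key, col) in MAPPING.items():
        sql = f"SELECT {key}, {col} FROM {table}"
        for k, v in conn.execute(sql):
            triples.add((f"ex:{table}/{k}", prop, v))
    return triples

def rewrite(prop):
    """Virtual view: translate the pattern (?s, prop, ?o) into SQL."""
    table, key, col = MAPPING[prop]
    return f"SELECT 'ex:{table}/' || {key}, {col} FROM {table}"

def query_materialized(triples, prop):
    # Evaluate the pattern on the duplicated data.
    return sorted((s, o) for s, p, o in triples if p == prop)

def query_virtual(conn, prop):
    # Evaluate the pattern directly on the source database.
    return sorted(conn.execute(rewrite(prop)).fetchall())

store = materialize(conn)
# The source data changes after materialization.
conn.execute("INSERT INTO well VALUES (3, 'C-2')")

stale = query_materialized(store, "ex:name")
live = query_virtual(conn, "ex:name")
\end{verbatim}

In this sketch the materialized copy is decoupled from the source and
therefore does not reflect the later insertion, whereas the virtual view
always answers from the current state of the database.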

\begin{figure}[H]\begin{center}
	\begin{subfigure}[b]{0.8\textwidth}
			\includegraphics[scale=1.2]{Images/bootstrapping_materialized.pdf}
			\caption{OBDA system architecture with a materialized \name{RDF} view}
			\label{back_subfig_materialized}
	\end{subfigure}\\~\\
	\begin{subfigure}[b]{0.8\textwidth}
			\includegraphics[scale=1.2]{Images/bootstrapping_virtual.pdf}
			\caption{OBDA system architecture with a virtual \name{RDF} view}
			\label{back_subfig_virtual}
	\end{subfigure}
	\caption[The two basic OBDA system architectures]
		{The two basic OBDA system architectures:
		materialized and virtual \name{RDF} views (from \cite{eng})}
	\label{back_fig_bootstrapping}
\end{center}\end{figure}

Finally, it should be mentioned that ontology-based data access is not
limited to databases.
Although this is the most common scenario and the only one this thesis
deals with, ontology-based data access also works with other sources of
structured information, like \name{ODS}, \name{XLS} or \name{CSV} files,
though some additional preparation might be necessary in these cases \cite{eng}.

\subsection{OBDA specifications}
\label{back_obdaspecs}
TODO: more, maybe shorten introduction

As mentioned in Section~\fullref{motivation}, the mere bootstrapping of
\name{RDF} triples \cite{rdf} or other forms of structured information
from relational database schemata is a relatively well understood topic.
This is outlined more comprehensively in Section~\fullref{related}.

Nonetheless, bootstrapping remains an elaborate process involving complex
tools to be invoked -- possibly in different versions and configurations
and processing different formats -- and working on changing input data
\cite{eng}.
This is why Skjæveland and others proposed the introduction of OBDA
specifications to centralize the task of driving these tools and
to gather in one place all the information describing the desired mapping
between the source database and the target ontology \cite{eng}.
As described in Section~\fullref{related}, their approach is the
foundation of this thesis, which describes and specifies a format for
storing and exchanging such OBDA specifications -- the \oslboth{} -- and
introduces a tool that in turn automatically bootstraps OBDA specifications
from relational database schemata -- the \myprog{} software.

The bootstrapping process using OBDA specifications and \myprog{} is
illustrated in Figure~\ref{intro_fig_bootstrapping}
in Section~\fullref{approach}.

\subsection{The \name{OPTIQUE} project}
The problems addressed in Section~\fullref{motivation} are a major issue
inter alia TODO in the oil and gas industry:
$30 \%$ to $70 \%$ of the working time of engineers is spent on collecting
data or assessing its quality \cite{crompton}.
This led to the launch of the \name{OPTIQUE} project in TODO, which
``advocates for a next generation of the well known Ontology-Based Data Access
(OBDA) approach to address the data access problem [...] [aiming] at
solutions that reduce the cost of data access dramatically'' \cite{optique}.
Thus, the \name{OPTIQUE} project aims to achieve exactly the benefits a
well-developed OBDA system can provide (explained in
Section~\ref{back_obda}):
easy end-user access to data without knowledge of its structure,
while taking advantage of automatic translations \cite{optique2}.
In doing so, identified shortcomings of existing OBDA systems were addressed:
\emph{usability} (for example the need to use formal query languages),
\emph{costly prerequisites} (consider, for example, the disadvantages
of materialized OBDA described in Section~\ref{back_obda}) and
\emph{efficiency} (which was perceived as being insufficiently addressed
in previous approaches) \cite{optique}.

\section{Related work}
\label{related}
\subsection{Ontologies and the semantic web -- publications}
\label{ontologies}

\subsection{OBDA specifications -- publications}
The publication forming the foundation of the work presented in this thesis is
the summarizing and benchmarking work on OBDA specifications by Skjæveland et al.
\cite{eng}, the group that developed them in their present form.

\subsection{OBDA systems -- publications}

\subsection{General ontology bootstrapping -- publications}
Skjæveland, Lian and Horrocks \cite{npd} provided an illustrative description of
the transformation of the \emph{NPD FactPages}, an enormous collection of data
related to oil drilling on the Norwegian continental shelf, provided by
the Norwegian Petroleum Directorate (NPD).

Sequeda et al.\ \cite{survey} provided an overview of different
direct mapping approaches.

Sequeda, Arenas and Miranker \cite{ondirm,autodirm}
formally described the direct mapping of relational databases to \name{RDF}
and \name{OWL}.

Stojanovic, Stojanovic and Volz \cite{sac} published a formal description of
the mapping of relational databases onto ontology-based structures, describing
concepts preceding and/or supplementing \name{OWL} and using \name{F-LOGIC} TODO
as the target language.

\subsection{The \name{OPTIQUE} project -- publications}
Calvanese et al. \cite{optique2} presented the \name{OPTIQUE} project including
its underlying OBDA system and showed limitations of current OBDA systems.

Kharlamov et al. \cite{optique} described the first version of the \name{OPTIQUE}
system, customized for use with the \emph{NPD FactPages} of the
Norwegian Petroleum Directorate (NPD).

Skjæveland and Lian \cite{benefits} summarized the benefits of and the
procedure for converting the \emph{NPD FactPages} to linked data \cite{linked}
and discussed associated terms like linked data, URIs, \name{RDF} and
\name{SPARQL}.

\subsection{Alternative approaches -- publications}
Barrasa, Corcho and P{\'e}rez \cite{r2o} proposed a declarative mapping
language -- \name{r2o} -- able to express a mapping between a relational
database and ontologies represented in the \name{OWL} and \name{RDF} formats.
This approach, however, aims at connecting existing databases
and existing ontologies.

TODO: R2RML, SQL2SW