\chapter{Background and related work}

\section{Background}
\subsection{Basic concepts}
\label{back_basic}
TODO: r2rml, rdf, rdfs, owl, xml, iris, baseiri, end with :

\subsection{Ontology-based data access (OBDA)}
\label{back_obda}
TODO: References

Storing data in relational databases is a very common practice, since the
notion of a relational database is comprehensible and widely known, and the
required software is widely available both commercially and as open-source
software.
Thus, it is easy for a domain expert to set up and populate a database.
Furthermore, relational databases provide significant advantages concerning
performance, data consistency and integrity, integration abilities, support
and general prominence.
These properties played a major role in the success of databases and in the
extensive study they have received, to the degree that they are now regarded
as the main strengths of databases.
Many -- if not all -- of these strengths trace back to the relatively fixed
and rigid schema databases embody: a well-defined database schema imposing
strong and clear-cut constraints on the contained data.

However, this principle also entails notable disadvantages.
The database schema, although theoretically changeable, constitutes a
significant burden on TODO the representation of data of dynamic
environments, incomplete data or changing requirements,
especially when dealing with large amounts of data.
The resulting representation of data in unintuitive, suboptimal schemata
makes long and complex query constructs inevitable, so that more elaborate
queries quickly become unmanageable for non-experts
and time-consuming and error-prone even for experts.
Ontologies, on the other hand, are much more flexible regarding incomplete
data or changing requirements or environments and allow for much more
intuitive and abstract query systems, while still being a quite
comprehensible formalism
(see Section~\ref{ontologies} for publications describing
ontologies and the semantic web).
Besides, ontologies support different data records referencing
the same entity, while databases do not \cite{eng}, and the deduction of
implicit information \cite{eng}, which common database systems offer, if at
all, not out of the box and not in an easy manner.
Often, however, relational databases are preferred, either for their advantages
(although the availability of cheaper yet more powerful hardware in some
cases offsets these) or simply by mistake.
Moreover, in some cases the migration to ontology-based systems,
even if beneficial, is too costly to be seriously considered.
%, often due to the large amounts of data that would have to be converted.

Ontology-based data access often provides a solution to this conflict of
interests:
by adding an ontology-based front-end that processes the queries and is
sensibly mapped to the data representation,
the querying facilities of ontology-based systems are introduced,
and changes in the data representation can most often be carried out
without breaking existing queries; only the mapping has to be changed
-- in a one-time effort -- and existing queries have to be modified only
when the mapping changes the way it presents existing structures to the
user.
These mappings in turn can be created computer-aided or, in
simple cases or up to a certain degree of completeness and accuracy,
can be completely bootstrapped.

Moreover, in cases where it is too costly or for other reasons infeasible
to duplicate the complete data, the data can remain in the
underlying database as is, and the query front-end merely acts as an
interface transforming the query into a database query \cite{eng} by
making use of a backward-chaining technique called \emph{query rewriting}
\cite{eng}.
This approach is called \emph{virtual OBDA} or \emph{virtual RDF view}
\cite{eng} and is illustrated in Figure~\ref{back_subfig_virtual}.
The opposite approach of duplicating all data is called
\emph{materialized OBDA} or \emph{materialized RDF view} and is
illustrated in Figure~\ref{back_subfig_materialized}.
Virtual OBDA provides limited abilities compared to materialized OBDA
in that it does not allow for decoupling from the source data, for
example by adding inferred information or applying elaborate transformations,
and does not support ``fragments of OWL for which query
rewriting is not a complete deduction method'' \cite{eng}.
However, the response time of a system is hard to predict solely from the
architectural approach used, so if it is critical, several systems should
be prototyped and evaluated upfront on what are expected to be
typical queries \cite{eng}.

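As a minimal illustration of query rewriting in the virtual approach,
assume a hypothetical table \texttt{employee(id, name)} and a hypothetical
mapping stating that every row of this table denotes an instance of an
ontology class \texttt{:Employee}; the table, the class and the namespace
are invented for this sketch and are not taken from \cite{eng}.
A \name{SPARQL} query asking for all employees could then be answered by
a rewritten \name{SQL} query along the following lines:

\begin{verbatim}
# SPARQL query posed against the ontology (hypothetical vocabulary)
PREFIX : <http://example.org/>
SELECT ?e WHERE { ?e a :Employee }

-- SQL query a rewriting front-end might send to the database instead
SELECT id FROM employee;
\end{verbatim}

An actual OBDA system would additionally have to construct IRIs from the
returned keys and take the axioms of the ontology into account, so this
is only a sketch of the principle.
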
\begin{figure}[H]\begin{center}
	\begin{subfigure}[b]{0.8\textwidth}
		\includegraphics[scale=1.2]{Images/bootstrapping_materialized.pdf}
		\caption{OBDA system architecture with materialized \name{RDF} view}
		\label{back_subfig_materialized}
	\end{subfigure}\\~\\
	\begin{subfigure}[b]{0.8\textwidth}
		\includegraphics[scale=1.2]{Images/bootstrapping_virtual.pdf}
		\caption{OBDA system architecture with a virtual \name{RDF} view}
		\label{back_subfig_virtual}
	\end{subfigure}
	\caption[The two basic OBDA system architectures]
		{The two basic OBDA system architectures:
		materialized and virtual \name{RDF} views (from \cite{eng})}
	\label{back_fig_bootstrapping}
\end{center}\end{figure}

Finally, it should be mentioned that ontology-based data access is not
limited to databases.
Although this is the most common scenario and the only one this thesis
deals with, ontology-based data access also works with other sources of
structured information, like \name{ODS}, \name{XLS} or \name{CSV} files,
though some additional preparation might be necessary in these cases \cite{eng}.

\subsection{OBDA specifications}
\label{back_obdaspecs}
TODO: more, maybe shorten introduction

As mentioned in Section~\fullref{motivation}, the mere bootstrapping of
\name{RDF} triples \cite{rdf} or other forms of structured information
from relational database schemata is a relatively well understood topic.
This is outlined more comprehensively in Section~\fullref{related}.

Nonetheless, bootstrapping remains an elaborate process that involves
invoking complex tools -- possibly in different versions and configurations
and processing different formats -- and working on changing input data
\cite{eng}.
This is why Skjæveland and others proposed the introduction of OBDA
specifications, which centralize the task of driving these tools and
gather in one place all the information describing the desired mapping
between the source database and the target ontology \cite{eng}.
As described in Section~\fullref{related}, their approach is the
foundation of this thesis, which describes and specifies a format for
storing and exchanging such OBDA specifications -- the \oslboth{} -- and
introduces a tool that in turn automatically bootstraps OBDA specifications
from relational database schemata -- the \myprog{} software.

The bootstrapping process using OBDA specifications and \myprog{} is
illustrated in Figure~\ref{intro_fig_bootstrapping}
in Section~\fullref{approach}.

\subsection{The \name{OPTIQUE} project}
The problems addressed in Section~\fullref{motivation} are a big issue
inter alia TODO in the oil and gas industry:
$30 \%$ to $70 \%$ of the working time of engineers is spent on collecting
data or assessing its quality \cite{crompton}.
This led to the launch of the \name{OPTIQUE} project in TODO, which
``advocates for a next generation of the well known Ontology-Based Data Access
(OBDA) approach to address the data access problem [...] [aiming] at
solutions that reduce the cost of data access dramatically'' \cite{optique}.
Thus, the \name{OPTIQUE} project aims at exactly the benefits a
well-developed OBDA system can provide (explained in
Section~\ref{back_obda}):
easy end-user access to data without knowledge of its structure,
while taking advantage of automatic translations \cite{optique2}.
In doing so, it addresses identified shortcomings of existing OBDA systems:
\emph{usability} (for example the need to use formal query languages),
\emph{costly prerequisites} (consider, for example, the disadvantages
of materialized OBDA described in Section~\ref{back_obda}) and
\emph{efficiency} (which was perceived as being insufficiently addressed
in previous approaches) \cite{optique}.

\section{Related work}
\label{related}
\subsection{Ontologies and the semantic web -- publications}
\label{ontologies}

\subsection{OBDA specifications -- publications}
A publication that forms the foundation of the work presented in this thesis is
the summarizing and benchmarking work on OBDA specifications by Skjæveland et al.
\cite{eng}, the group that developed them in their present form.

\subsection{OBDA systems -- publications}

\subsection{General ontology bootstrapping -- publications}
Skjæveland, Lian and Horrocks \cite{npd} provided an exemplary description of
the transformation of the \emph{NPD FactPages}, an enormous collection of data
related to oil drilling on the Norwegian continental shelf, provided by
the Norwegian Petroleum Directorate (NPD).

Sequeda et al. \cite{survey} provided an overview of different
direct mapping approaches.

Sequeda, Arenas and Miranker \cite{ondirm} \cite{autodirm}
formally describe the direct mapping of relational databases to \name{RDF}
and \name{OWL}.

Stojanovic, Stojanovic and Volz \cite{sac} published a formal description of
the mapping of relational databases onto ontology-based structures, describing
concepts preceding and/or supplementing \name{OWL} and using \name{F-LOGIC} TODO
as target language.

\subsection{The \name{OPTIQUE} project -- publications}
Calvanese et al. \cite{optique2} presented the \name{OPTIQUE} project including
its underlying OBDA system and showed limitations of current OBDA systems.

Kharlamov et al. \cite{optique} described the first version of the \name{OPTIQUE}
system, customized for use with the \emph{NPD FactPages} of the
Norwegian Petroleum Directorate (NPD).

Skjæveland and Lian \cite{benefits} summarized the benefits of and the
procedure for converting the \name{NPD FactPages} to linked data \cite{linked}
and discussed associated terms like linked data, URIs, \name{RDF} and
\name{SPARQL}.

\subsection{Alternative approaches -- publications}
Barrasa, Corcho and P{\'e}rez \cite{r2o} proposed a declarative mapping
language -- \name{r2o} -- able to express a mapping between a relational
database and ontologies represented in the \name{OWL} and \name{RDF} formats.
This approach, however, aims at connecting existing databases
and existing ontologies.

TODO: R2RML, SQL2SW