\chapter{Background and related work}

\section{Background}
\subsection{Ontology-based data access (OBDA)}
TODO: References

Storing data in relational databases is very common practice, since the
notion of a relational database is comprehensible and widely known, while the
required software is widely available both commercially and as open source.
Thus, it is easy for a domain expert to set up and populate a database.
Furthermore, relational databases provide significant advantages concerning
performance, data consistency and integrity, integration abilities, support
and general prominence.
These properties played a major role in the success of databases and in the
extensive research devoted to them, to the degree that they are now regarded
as the main strengths of databases.
Many, if not all, of these strengths trace back to the relatively fixed
and rigid schema a database embodies: a well-defined database schema imposes
strong and clear-cut constraints on the contained data.

However, this principle also induces notable disadvantages.
The database schema, although theoretically changeable, places a
significant burden on the representation of data from dynamic
environments, of incomplete data or of data under changing requirements,
especially when dealing with large amounts of data.
The resulting representation of data in unintuitive, suboptimal schemata
makes long and complex query constructs inevitable, so that
more elaborate queries quickly become unmanageable for non-experts
and time-consuming and error-prone even for experts.
Ontologies, on the other hand, are much more flexible regarding incomplete
data or changing requirements or environments and allow for much more
intuitive and abstract query systems, while still being a quite
comprehensible formalism
(see Section~\ref{ontologies} for publications describing
ontologies and the semantic web).
Besides, ontologies support different data records referencing
the same entity, which databases do not \cite{eng}, and the deduction of
implicit information \cite{eng}, which common database systems offer, if at
all, neither out of the box nor in an easy manner.
Often, however, relational databases are preferred for their advantages
(although the availability of cheaper yet more powerful hardware offsets
these in some cases) or simply by mistake.
Besides, in some cases, the migration to ontology-based systems,
even if beneficial, is too costly to be seriously considered.
%, often due to the large amounts of data that would have to be converted.

Ontology-based data access often provides a solution to this conflict.
By adding an ontology-based front-end that processes the queries and is
sensibly mapped to the data representation,
the querying facilities of ontology-based systems become available,
and changes in the data representation can most often be carried out
without breaking existing queries.
Only the mapping has to be changed, in a one-time effort, and existing
queries have to be adapted only when the mapping changes the way
it presents existing structures to the user.
The creation of these mappings can in turn be computer-aided or, in
simple cases or up to a certain degree of completeness and accuracy,
the mappings can be fully bootstrapped.

Moreover, in cases where it is too costly or for other reasons infeasible
to carry out a complete data duplication, the data can remain in the
underlying database as is, and the query front-end merely acts as an
interface transforming the query into a database query \cite{eng} by
making use of a backward-chaining technique called \emph{query rewriting}.
This approach is called \emph{virtual OBDA} or \emph{virtual RDF view}
\cite{eng} and is illustrated in Figure~\ref{back_subfig_virtual}.
The opposite approach of duplicating all data is called
\emph{materialized OBDA} or \emph{materialized RDF view} and is
illustrated in Figure~\ref{back_subfig_materialized}.
Virtual OBDA provides limited abilities compared to materialized OBDA
in that it does not allow for decoupling from the source data, for
example by adding inferred information or applying elaborate transformations,
and does not support ``fragments of OWL for which query
rewriting is not a complete deduction method'' \cite{eng}.
The response time of a system, however, is hard to predict solely from the
architectural approach used, so if it is critical, several systems should
be prototyped and evaluated upfront on what are expected to be
typical queries \cite{eng}.

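The difference between the two architectures can be made concrete with a
minimal sketch. It is not taken from any of the cited systems and leaves out
the reasoning step proper; it only shows how a request phrased in ontology
terms is either translated into SQL at query time (virtual) or answered from
a copy of the data that was exported once (materialized). The table
\texttt{wellbore}, the prefix \texttt{ex:} and the mapping dictionary are
assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical mapping: ontology property -> (table, subject column, object column).
# All names here are invented for this sketch.
MAPPING = {
    "ex:hasName": ("wellbore", "id", "name"),
    "ex:hasDepth": ("wellbore", "id", "depth"),
}


def make_db():
    """Create a small in-memory source database with toy data."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE wellbore (id INTEGER PRIMARY KEY, name TEXT, depth REAL)"
    )
    con.executemany(
        "INSERT INTO wellbore VALUES (?, ?, ?)",
        [(1, "A-1", 2500.0), (2, "B-7", 3100.5)],
    )
    return con


def virtual_query(con, prop):
    """Virtual view: translate the request into SQL each time it is asked."""
    table, subj, obj = MAPPING[prop]
    # Identifiers come from the fixed MAPPING above, not from user input.
    sql = f"SELECT {subj}, {obj} FROM {table} ORDER BY {subj}"
    return [(f"ex:{table}/{s}", prop, o) for s, o in con.execute(sql)]


def materialize(con):
    """Materialized view: export all mapped data once into a triple list."""
    triples = []
    for prop in sorted(MAPPING):
        triples.extend(virtual_query(con, prop))
    return triples


def materialized_query(triples, prop):
    """Answer the request from the exported copy, without touching the database."""
    return [t for t in triples if t[1] == prop]
```

In this sketch, a later change in the source database is visible through
\texttt{virtual\_query} immediately, whereas \texttt{materialized\_query}
keeps returning the exported state until the export is repeated. The exported
copy, in turn, could be extended with inferred triples without touching the
source, which the virtual variant cannot do.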
\begin{figure}
 \centering
 \begin{subfigure}[b]{0.8\textwidth}
 \includegraphics[scale=1.2]{Images/bootstrapping_materialized.pdf}
 \caption{OBDA system architecture with a materialized \name{RDF} view}
 \label{back_subfig_materialized}
 \end{subfigure}\\~\\
 \begin{subfigure}[b]{0.8\textwidth}
 \includegraphics[scale=1.2]{Images/bootstrapping_virtual.pdf}
 \caption{OBDA system architecture with a virtual \name{RDF} view}
 \label{back_subfig_virtual}
 \end{subfigure}
 \caption[The two basic OBDA system architectures]
 {The two basic OBDA system architectures:
 materialized and virtual \name{RDF} views (from \cite{eng})}
 \label{back_fig_bootstrapping}
\end{figure}

Finally, it should be mentioned that ontology-based data access is not
limited to databases.
Although this is the most common scenario and the only one this thesis
deals with, ontology-based data access also works with other sources of
structured information, like \name{ODS}, \name{XLS} or \name{CSV} files,
though some additional preparation might be necessary in these cases \cite{eng}.

\subsection{OBDA specifications}
As mentioned in Section~\fullref{motivation}, the mere bootstrapping of
\name{RDF} triples \cite{rdf} or other forms of structured information
from relational database schemata is a relatively well understood topic.
This is outlined more comprehensively in Section~\fullref{related}.

Nonetheless, bootstrapping remains an elaborate process that requires
complex tools to be invoked, possibly in different versions and
configurations and processing different formats, while working on changing
input data.
This is why Skjæveland and others proposed the introduction of OBDA
specifications, which centralize the task of driving these tools and
gather in one place all the information describing the desired mapping
between the source database and the target ontology \cite{eng}.
As described in Section~\fullref{related}, their approach is the
foundation of this thesis, which describes and specifies a format for
storing and exchanging such OBDA specifications -- the \oslboth{} -- and
introduces a tool that in turn automatically bootstraps OBDA specifications
from relational database schemata -- the \myprog{} software.

The bootstrapping process using OBDA specifications and \myprog{} is
illustrated in Figure~\ref{intro_fig_bootstrapping}
in Section~\fullref{approach}.

\subsection{The \name{OPTIQUE} project}
The problems addressed in Section~\fullref{motivation} are a major issue
in, among others, the oil and gas industry:
$30\,\%$ to $70\,\%$ of the working time of engineers is spent on collecting
data or assessing its quality \cite{crompton}.
This led to the launch of the \name{OPTIQUE} project in TODO, which
``advocates for a next generation of the well known Ontology-Based Data Access
(OBDA) approach to address the data access problem [...] [aiming] at
solutions that reduce the cost of data access dramatically'' \cite{optique}.
Thus, the \name{OPTIQUE} project aims at exactly the benefits a
well-developed OBDA system can provide (explained in Section~\ref{obda}):
easy end-user access to data without knowledge of its structure,
while taking advantage of automatic translations \cite{optique2}.
In doing so, it addresses identified shortcomings of existing OBDA systems:
\emph{usability} (for example the need to use formal query languages),
\emph{costly prerequisites} (consider, for example, the disadvantages
of materialized OBDA described in Section~\ref{obda}) and
\emph{efficiency} (which was perceived as being insufficiently addressed
in previous approaches) \cite{optique}.

\section{Related work}
\subsection{Ontologies and the semantic web -- publications}
\subsection{OBDA specifications -- publications}
One publication forming the foundation of the work presented in this thesis is
the summarizing and benchmarking work on OBDA specifications by Skjæveland et al.
\cite{eng}, the group that developed them in their present form.

\subsection{OBDA systems -- publications}
\subsection{General ontology bootstrapping -- publications}
Skjæveland, Lian and Horrocks \cite{npd} provided an exemplary description of
the transformation of the \emph{NPD FactPages}, an enormous collection of data
related to oil drilling on the Norwegian continental shelf, provided by
the Norwegian Petroleum Directorate (NPD).

Sequeda et al. \cite{survey} provided an overview of different
direct mapping approaches.

Sequeda, Arenas and Miranker \cite{ondirm,autodirm}
formally described the direct mapping of relational databases to \name{RDF}
and \name{OWL}.

Stojanovic, Stojanovic and Volz \cite{sac} published a formal description of
the mapping of relational databases onto ontology-based structures, describing
concepts preceding and/or supplementing \name{OWL} and using \name{F-LOGIC} TODO
as target language.

\subsection{The \name{OPTIQUE} project -- publications}
Calvanese et al. \cite{optique2} presented the \name{OPTIQUE} project including
its underlying OBDA system and showed limitations of current OBDA systems.

Kharlamov et al. \cite{optique} described the first version of the \name{OPTIQUE}
system, customized for use with the \emph{NPD FactPages} of the
Norwegian Petroleum Directorate (NPD).

Skjæveland and Lian \cite{benefits} summarized the benefits of and the
procedure for converting the \emph{NPD FactPages} to linked data \cite{linked}
and discussed associated terms like linked data, URIs, \name{RDF} and
\subsection{Alternative approaches -- publications}
Barrasa, Corcho and P{\'e}rez \cite{r2o} proposed a declarative mapping
language, \name{r2o}, able to express a mapping between a relational
database and ontologies represented in the \name{OWL} and \name{RDF} formats.
This approach, however, aims at connecting existing databases
with existing ontologies.