Integration Problem to be solved
- Data is held in different databases, even with Linked Open Data sources.
- Misaligned models: different datasets give different meanings to classes and predicates, which need to be aligned.
- Misaligned names for the same concepts.
- Replication is problematic.
- Query definitions and the scope of querying are difficult to define in advance.
- Provenance of data is necessary.
- Cannot depend on inferences being available in advance.
- A scalable architecture requires that all queries are stateless.
Data Cathedrals versus Information Shopping Bazaars
Linked Open Data has been growing since 2007 from a few (12) interconnected datasets to 295 as of 2011, and it continues to grow. To quote “Linked Data is about using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.” (Linked Data, n.d.)
Figure 1: Growth of the Linked Data ‘Cloud’
As impressive as the growth of interconnected datasets is, what matters more is the value of that interconnected data. A corollary of Metcalfe’s law suggests that the benefit gained from integrated information grows geometrically[1] with the number of data communities that are integrated.
Many organizations have their own icebergs of information: operations, sales, marketing, personnel, legal, finance, research, maintenance, CRM, document vaults, etc. (Lawrence, 2012). Over the years there have been various attempts to melt the boundaries between these icebergs. One is the creation of the mother-of-all databases that houses (or replicates) all information. Another is the replacement of disparate applications, each with its own database, by a mother-of-all application that eliminates the separate databases. Neither has really succeeded in unifying any or all data within an organization (Lawrence, Data cathedrals versus information bazaars?, 2012). The result is a ‘Data Cathedral’ through which users have no way to navigate to find the information that will answer their questions.
Figure 2: Users have no way to navigate through the Enterprise’s Data Cathedral
Remediator at the heart of Linked Enterprise Data
Can we create an information shopping bazaar for users to answer their questions without committing heresy in the Data Cathedral? Can we create the same information shopping bazaar as Linked Data within the enterprise: Linked Enterprise Data (LED)? That is the objective of Remediator.
First of all we must recognize that the enterprise will have many structured, aggregated, and unstructured data stores already in place:
Figure 3: Enterprise Structured, Aggregated, and Unstructured Data Icebergs
One of the keys to the ability of Linked Data to interlink 300+ datasets is that they are all expressed as RDF. The enterprise does not have the luxury of replicating all existing data into RDF datasets. However, that is not necessary (although still sometimes desirable), because there are adapters that can make any existing dataset look as if it contains RDF and make it accessible via a SPARQL endpoint. Examples are listed below:
- D2RQ: (D2RQ: Accessing Relational Databases as Virtual RDF Graphs)
- Ultrawrap: (Research in Bioinformatics and Semantic Web/Ultrawrap)
- Ontop: (-ontop- is a platform to query databases as Virtual RDF Graphs using SPARQL)
Attaching these adapters to existing data stores, or replicating existing data into a triple store, takes us one step closer to Linked Enterprise Data:
Figure 4: Enterprise Data Cloud, the first step to integration
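Once every data store answers at a SPARQL endpoint, the same query can be posed to each of them in the same way. The sketch below is only an illustration of that uniformity: it builds (but does not send) a GET request that passes the query text in the `query` parameter, as the SPARQL Protocol allows. The endpoint URLs, the dataset names, and the helper `build_sparql_request` are assumptions made up for this example, not part of any adapter listed above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URLs; a real deployment would use whatever
# address each adapter or triple store publishes.
ENDPOINTS = {
    "crm": "http://crm.example.com/sparql",
    "maintenance": "http://maintenance.example.com/sparql",
}

QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(endpoint_url: str, query: str) -> Request:
    """Build, without sending, a SPARQL Protocol GET request."""
    # Assumes endpoint_url carries no query string of its own.
    url = endpoint_url + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# The identical query text is addressed to every data store.
requests = {name: build_sparql_request(url, QUERY)
            for name, url in ENDPOINTS.items()}
```

Passing one of these requests to `urllib.request.urlopen` would execute it against a live endpoint; that step is left out here because the URLs are placeholders.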
Of course, now that we have harmonized all the data as RDF accessible via a SPARQL endpoint, we can view this as an extension of the Linked Data cloud in which we provide enterprise users access to both enterprise and public data:
Figure 5: Enterprise Data Cloud and Linked Data cloud
We are now closer to the information shopping bazaar, since users would, given appropriate discovery and searching user interfaces, be able to navigate their own way through this data cloud. However, despite the harmonization of the data into RDF, we still have not provided a means for users to ask new questions:
Which Company (and its fiscal structure) are we working with that has a Business Practice of type Maintenance for the target industry of Oil and Gas, with a supporting technology based on Vendor-Application, where this Application is not similar to any of our Applications?
Such questions require pulling information from many different sources within an organization. The Enterprise Data Cloud on its own only gives users the capability to discover such answers for themselves. Would it not be better to allow a user to ask such a question, and let the Linked Enterprise Data determine from where it should pull partial answers, which it can then aggregate into the complete answer to the question? It is like asking a team of people to answer a complex question, each contributing their own part, and then assembling the overall answer rather than relying on a single guru. Remediator plays the role of that team, taking parts of the question and asking each part of the relevant data sources.
Figure 6: Remediator as the Common Entry Point to Linked Enterprise Data (LED)
Thus our question can become:
This decomposition of a question into sub-questions relevant to each dataset is automated by Remediator:
Figure 7: Sub-Questions distributed to datasets for answers
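To make the idea of decomposition concrete, here is a minimal sketch, not Remediator's actual algorithm, which the text does not describe. It assumes each dataset publishes the list of predicates it holds (the kind of listing a VoID description can carry) and routes each triple pattern of the question to every dataset that knows its predicate. All dataset names, the `ex:` vocabulary, and the functions `decompose` and `to_sparql` are hypothetical.

```python
from collections import defaultdict

# Hypothetical namespace for the made-up ex: vocabulary used below.
PREFIX = "PREFIX ex: <http://example.com/ns#>"

# Hypothetical dataset descriptions: the predicates each dataset holds.
DATASET_PREDICATES = {
    "crm": {"ex:worksWith", "ex:fiscalStructure"},
    "practices": {"ex:businessPractice", "ex:targetIndustry"},
    "applications": {"ex:supportingTechnology", "ex:similarTo"},
}

# A question as (subject, predicate, object) triple patterns.
QUESTION = [
    ("?company", "ex:fiscalStructure", "?structure"),
    ("?company", "ex:businessPractice", "?practice"),
    ("?practice", "ex:targetIndustry", "ex:OilAndGas"),
    ("?practice", "ex:supportingTechnology", "?application"),
]


def decompose(patterns, dataset_predicates):
    """Group triple patterns by the datasets that hold their predicate."""
    sub_questions = defaultdict(list)
    for pattern in patterns:
        predicate = pattern[1]
        for dataset, predicates in dataset_predicates.items():
            if predicate in predicates:
                sub_questions[dataset].append(pattern)
    # Patterns whose predicate no dataset holds are simply not routed.
    return dict(sub_questions)


def to_sparql(patterns):
    """Render one sub-question as a SPARQL SELECT over its variables."""
    variables = sorted({t for p in patterns for t in p if t.startswith("?")})
    body = " . ".join(" ".join(p) for p in patterns)
    return f"{PREFIX} SELECT {' '.join(variables)} WHERE {{ {body} }}"


sub_questions = decompose(QUESTION, DATASET_PREDICATES)
sub_queries = {name: to_sparql(p) for name, p in sub_questions.items()}
```

Matching on predicates alone is the simplest possible routing rule; a fuller treatment would presumably also consider classes and the links between datasets.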
Requirements for a Linked Enterprise Data Architecture
- Keep it simple.
- Do not re-invent that which already exists.
- Eliminate replication where possible.
- Avoid the need for prior inferencing.
- Efficient query performance.
- Provide provenance of results.
- Provide optional caching for further slicing and dicing of result-set.
- Use VoID, only VoID, and nothing but VoID to drive the query.
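Two of these requirements, provenance of results and stateless querying, can be illustrated together. The sketch below assumes partial answers arrive as lists of variable bindings, tags each row with the dataset it came from, and joins rows that agree on their shared variables while accumulating the sources. The function names and the sample data are invented for illustration; nothing here is claimed to be how Remediator assembles its answers.

```python
def tag(rows, source):
    """Attach the originating dataset name to each row of bindings."""
    return [(row, {source}) for row in rows]


def join_with_provenance(left, right):
    """Join two tagged result sets on their shared variables.

    Each input is a list of (bindings, sources) pairs. The function is
    pure: it keeps no state between calls, so any worker can run it.
    """
    joined = []
    for l_bind, l_src in left:
        for r_bind, r_src in right:
            shared = l_bind.keys() & r_bind.keys()
            if all(l_bind[v] == r_bind[v] for v in shared):
                # Merge the bindings and record every contributing source.
                joined.append(({**l_bind, **r_bind}, l_src | r_src))
    return joined


# Hypothetical partial answers from two datasets.
crm_rows = [
    {"company": "ex:Acme", "structure": "ex:PLC"},
    {"company": "ex:Globex", "structure": "ex:LLC"},
]
practice_rows = [{"company": "ex:Acme", "practice": "ex:Maintenance"}]

answers = join_with_provenance(tag(crm_rows, "crm"),
                               tag(practice_rows, "practices"))
```

A nested-loop join is used only for brevity; it is quadratic in the number of rows, and a hash join on the shared variables would be the obvious choice for larger result sets.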
[1] If I have 10 database systems running my business that are entirely disconnected, then the benefits are 10 * K, where K is some constant. If I integrate these databases in pairs (operations + accounting, accounting + payroll, etc.), then the benefits increase to 10 * K * 2. If I integrate in threes (operations + accounting + maintenance, accounting + payroll + receiving, etc.), then the benefits increase four-fold (a corollary of Metcalfe’s law) to 10 * K * 4. For quad-wise integration my benefits would be 10 * K * 8, and so on. It might not be exactly eight-fold, but the point is that there is a geometric, not linear, growth in benefits as I integrate all of my information across my organization.
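The footnote's arithmetic follows a simple doubling pattern: each extra system in an integrated group doubles the multiplier. A small sketch of that pattern follows; the function name and parameters are chosen here for illustration, and the doubling itself is the footnote's illustrative assumption rather than a measured result.

```python
def integration_benefit(systems: int, k: float, group_size: int) -> float:
    """Benefit under the footnote's doubling assumption.

    group_size is how many systems are integrated together:
    1 = disconnected, 2 = pairs, 3 = threes, and so on.
    """
    return systems * k * 2 ** (group_size - 1)


# Benefits for 10 systems with K = 1, from disconnected up to quad-wise.
benefits = [integration_benefit(10, 1.0, g) for g in (1, 2, 3, 4)]
```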