Integration Problem to be solved

  • Data resides in different databases, even among Linked Open Data sources.
  • Misaligned models: different datasets have different meanings for classes and predicates that need to be aligned.
  • Misaligned names for the same concepts.
  • Replication is problematic.
  • Query definitions and the scope of querying are difficult to define in advance.
  • Provenance of data is necessary.
  • Cannot depend on inferences being available in advance.
  • Scalable architecture requires that all queries are stateless.

Data Cathedrals versus Information Shopping Bazaars

Linked Open Data has been growing since 2007 from a few (12) interconnected datasets to 295 as of 2011, and it continues to grow. To quote “Linked Data is about using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.” (Linked Data, n.d.) 

Figure 1: Growth of the Linked Data ‘Cloud’

As impressive as the growth of interconnected datasets is, what is more important is the value of that interconnected data. A corollary of Metcalfe’s law suggests that the benefit gained from integrated information grows geometrically[1] with the number of data communities that are integrated.

Many organizations have their own icebergs of information: operations, sales, marketing, personnel, legal, finance, research, maintenance, CRM, document vaults, etc. (Lawrence, 2012). Over the years there have been various attempts to melt the boundaries between these icebergs, including the creation of a mother-of-all database that houses (or replicates) all information, or the replacement of disparate applications, each with its own database, by a mother-of-all application that eliminates the separate databases. Neither of these has really succeeded in unifying any or all data within an organization (Lawrence, Data cathedrals versus information bazaars?, 2012). The result is a ‘Data Cathedral’ through which users have no way to navigate to find the information that will answer their questions.

Figure 2: Users have no way to navigate through the Enterprise’s Data Cathedral

Remediator at the heart of Linked Enterprise Data

Can we create an information shopping bazaar for users to answer their questions without committing heresy in the Data Cathedral? Can we create the same information shopping bazaar as Linked Data within the enterprise: Linked Enterprise Data (LED)? That is the objective of Remediator.

First of all we must recognize that the enterprise will have many structured, aggregated, and unstructured data stores already in place:

Figure 3: Enterprise Structured, Aggregated, and Unstructured Data Icebergs

One of the keys to the ability of Linked Data to interlink 300+ datasets is that they are all expressed as RDF. The enterprise does not have the luxury of replicating all existing data into RDF datasets. However, that is not necessary (although still sometimes desirable) because there are adapters that can make any existing dataset look as if it contains RDF and can be accessed via a SPARQLEndpoint. Examples are listed below:

  1. D2RQ (D2RQ: Accessing Relational Databases as Virtual RDF Graphs)
  2. Ultrawrap (Research in Bioinformatics and Semantic Web/Ultrawrap)
  3. Ontop (-ontop- is a platform to query databases as Virtual RDF Graphs using SPARQL)

Attaching these adapters to existing data-stores, or replicating existing data into a triple store, takes us one step closer to Linked Enterprise Data:

Figure 4: Enterprise Data Cloud, the first step to integration

Of course, now that we have harmonized all the data as RDF accessible via a SPARQLEndpoint, we can view this as an extension of the Linked Data cloud in which we provide enterprise users access to both enterprise and public data:

Figure 5: Enterprise Data Cloud and Linked Data cloud

We are now closer to the information shopping bazaar, since users would, given appropriate discovery and searching user interfaces, be able to navigate their own way through this data cloud. However, despite the harmonization of the data into RDF, we still have not provided a means for users to ask new questions such as:

What Company (and their fiscal structure) are we working with that have a Business Practise of type Maintenance for the target industry of Oil and Gas with a supporting technology based on Vendor-Application and this Application is not similar to any of our Application?

Such questions require pulling information from many different sources within an organization. Even with the Enterprise Data Cloud, one has only provided the capability for users to discover such answers for themselves. Would it not be better to allow a user to ask such a question, and let the Linked Enterprise Data determine from where it should pull partial answers, which it can then aggregate into the complete answer to the question? It is like asking a team of people to answer a complex question, each contributing their own part, and then assembling the overall answer rather than relying on a single guru. Remediator has the role of that team, taking parts of the question and asking each part of the appropriate data-sources.

Figure 6: Remediator as the Common Entry Point to Linked Enterprise Data (LED)

Thus our question can become:

  1. What Business Practise of type Maintenance for the target industry of Oil and Gas?
  2. What Company are we working with?
  3. What Company have a Business Practise of type Maintenance?
  4. What Business Practise with a supporting technology based on Vendor- Application?
  5. What Company (and their fiscal structure)?
  6. What Vendor-Application and this Application is not similar to any of our Application?

This decomposition of a question into sub-questions relevant to each dataset is automated by Remediator:

Figure 7: Sub-Questions distributed to datasets for answers
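As a rough illustration of what such a decomposition involves, the following Python sketch routes each sub-question to the datasets whose vocabulary covers it. It is not Remediator's algorithm: the dataset names, the class names, and the idea of summarising each dataset as a plain set of class names (as might be derived from its VoID description) are all assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical summary of what each dataset contains, e.g. derived from the
# class partitions of its VoID description. All names are invented.
DATASET_CLASSES: Dict[str, Set[str]] = {
    "crm": {"Company"},
    "finance": {"Company", "FiscalStructure"},
    "practices": {"Company", "BusinessPractise", "Industry"},
    "applications": {"BusinessPractise", "Application", "Vendor"},
}

# Three of the sub-questions above, each reduced to the classes it mentions.
SUB_QUESTIONS: Dict[str, Set[str]] = {
    "What Company are we working with?": {"Company"},
    "What Company have a Business Practise of type Maintenance?": {
        "Company",
        "BusinessPractise",
    },
    "What Company (and their fiscal structure)?": {"Company", "FiscalStructure"},
}


def route(sub_questions: Dict[str, Set[str]],
          dataset_classes: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each sub-question to the datasets that cover every class it needs."""
    plan: Dict[str, List[str]] = {}
    for question, needed in sub_questions.items():
        plan[question] = sorted(
            name for name, classes in dataset_classes.items() if needed <= classes
        )
    return plan


if __name__ == "__main__":
    for question, datasets in route(SUB_QUESTIONS, DATASET_CLASSES).items():
        print(f"{question} -> {datasets}")
```

A real implementation would also have to join the partial answers returned by each dataset; the sketch covers only the routing step.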

Requirements for a Linked Enterprise Data Architecture

  • Keep it simple
  • Do not re-invent that which already exists.
  • Eliminate replication where possible.
  • Avoid the need for prior inferencing.
  • Efficient query performance.
  • Provide provenance of results.
  • Provide optional caching for further slicing and dicing of result-set.
  • Use VoID, only VoID, and nothing but VoID to drive the query.

[1] If I have 10 database systems running my business that are entirely disconnected, then the benefits are 10 * K, where K is some constant. If I integrate these databases in pairs (operations + accounting, accounting + payroll, etc.), then the benefits increase to 10 * K * 2. If I integrate in threes (operations + accounting + maintenance, accounting + payroll + receiving, etc.), then the benefits increase four-fold (a corollary of Metcalfe’s law) to 10 * K * 4. For quad-wise integration my benefits would be 10 * K * 8, and so on. Now it might not be eight-fold, but the point is that there is a geometric, not linear, growth in benefits as I integrate all of my information across my organization.
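The arithmetic in this footnote can be written out as a short Python sketch. The doubling rule, benefit = N * K * 2^(g-1) for integration in groups of g, is simply read off the figures quoted above; the function name and the default K = 1 are illustrative assumptions.

```python
def integration_benefit(systems: int, group_size: int, k: float = 1.0) -> float:
    """Benefit of integrating `systems` databases in groups of `group_size`.

    Follows the footnote's illustrative rule: each extra member of a group
    doubles the benefit, so the multiplier is 2 ** (group_size - 1).
    """
    if systems < 1 or group_size < 1:
        raise ValueError("systems and group_size must be positive")
    return systems * k * 2 ** (group_size - 1)


if __name__ == "__main__":
    # Disconnected, pair-wise, three-wise and quad-wise integration of 10 systems.
    for group_size in (1, 2, 3, 4):
        print(group_size, integration_benefit(10, group_size))
```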


OData2SPARQL is an OData proxy protocol converter for any SPARQL/RDF triplestore. To compare SPARQL with OData is somewhat misleading. After all, SPARQL has its roots as a very powerful query language for RDF data, but is not intended as a RESTful protocol. Similarly, OData has its roots as an abstract interface to any type of datastore, not as a specification of that datastore. Some have said “OData is the equivalent of ODBC for the Web”.
The data management strengths of SPARQL/RDF can be combined with the application development strengths of OData with a protocol proxy: OData2SPARQL, a Janus-point between the application development world and the semantic information world.

Figure 1: OData2SPARQL Proxy between Semantic data and Application consumers

What is OData?

OData is a standardized protocol for creating and consuming data APIs. OData builds on core protocols like HTTP and commonly accepted methodologies like REST. The result is a uniform way to expose full-featured data APIs (Odata.org).  Version 4.0 has been standardized at OASIS, and was released in March 2014.

OData RESTful APIs are easy to consume. The OData metadata, a machine-readable description of the data model of the APIs, enables the creation of powerful generic client proxies and tools. Some have said “OData is the equivalent of ODBC for the Web” (OASIS Approves OData 4.0 Standards for an Open, Programmable Web, 2014). Thus a comprehensive ecosystem of applications and development tools has emerged, a few of which are listed below:

  • LINQPad is a tool for building OData queries interactively.
  • OpenUI5 is an open source JavaScript UI library, maintained by SAP and available under the Apache 2.0 license. OpenUI5 lets you build enterprise-ready web applications, responsive to all devices, running on almost any browser of your choice. It’s based on JavaScript, using jQuery as its foundation, and follows web standards. It eases your development with a client-side HTML5 rendering library including a rich set of controls, and supports data binding to different models including OData.
  • Power Query for Excel is a free Excel add-in that enhances the self-service Business Intelligence experience in Excel by simplifying data discovery, access and collaboration.
  • Tableau – an excellent client-side analytics tool – can now consume OData feeds
  • Teiid allows you to connect to and consume sources (odata services, relational data, web services, file, etc.) and deploy a single source that is available as an OData service out-of-the-box.
  • Telerik not only provides native support for the OData protocol in its products, but also offers several applications and services which expose their data using the OData protocol.
  • TIBCO Spotfire is a visual data discovery tool which can connect to OData feeds.
  • SharePoint: Any data you’ve got on SharePoint as of version 2010 can be manipulated via the OData protocol, which makes the SharePoint developer API considerably simpler.
  • XOData is a generic web-based OData Service visualization & exploration tool that will assist in rapid design, prototype, verification, testing and documentation of OData Services. 

OData vs SPARQL/RDF

To compare SPARQL with OData is somewhat misleading. After all, SPARQL has its roots as a very powerful query language for RDF data, and is not intended as a RESTful protocol. Similarly, OData has its roots as an abstract interface to any type of datastore, not as a specification of that datastore. Recently JSON-LD has emerged (Manu Sporny, 2014), providing a method of transporting Linked Data using JSON. However, JSON-LD focusses on the serialization of linked data (RDF) as JSON rather than defining the protocol for a RESTful CRUD interface. Thus it is largely an alternative to, say, the Turtle or RDF/XML serialization formats.

OData and SPARQL/RDF: Contradictory or Complementary?

OData strengths:

  • Schema discovery.
  • OData provides a data-source-neutral web service interface, which means application components can be developed independently of the back-end datasource.
  • Supports CRUD.
  • Not limited to any particular physical data storage.
  • Client tooling support.
  • Easy to use from JavaScript.
  • Growing set of OData productivity tools such as Excel, SharePoint, Tableau and BusinessObjects.
  • Growing set of OData frameworks such as SAPUI5, OpenUI5, and KendoUI.
  • Growing set of independent development tools such as LINQPad and XOData.
  • Based on open (OASIS) standards after being initiated by Microsoft.
  • Strong commercial support from Microsoft, IBM, and SAP.
  • OData is not limited to traditional RDBMS applications. Vendors of real-time data such as OSI are publishing their data as an OData endpoint.

SPARQL/RDF strengths:

  • Extremely flexible schema that can change over time.
  • Vendor independent.
  • Portability of data between triple stores.
  • Federation over multiple, disparate data-sources is inherent in the intent of RDF/SPARQL.
  • Increasingly standard format for publishing open data.
  • Linked Open Data expanding.
  • Identities globally defined.
  • Inferencing allows deduction of additional facts not originally asserted, which can be queried via SPARQL.
  • Based on open (W3C) standards.

OData weaknesses:

  • Was perceived as a vendor (Microsoft) protocol.
  • Built around the use of a static data-model (RDBMS, JPA, etc.).
  • No concept of federation of data-sources.
  • Identities defined with respect to the server.
  • Inferencing limited to sub-classes of objects.

SPARQL/RDF weaknesses:

  • Application development frameworks that are aligned with RDF/SPARQL are limited.
  • Difficult to access from de-facto standard BI tools such as Excel.
  • Difficult to report using popular reporting tools.

Table 1: OData and SPARQL/RDF: Contradictory or Complementary?

OData2SPARQL: OData complementing RDF/SPARQL

The data management strengths of SPARQL/RDF can be combined with the application development strengths of OData with a protocol proxy: OData2SPARQL. OData2SPARQL is the Janus-point between the application development world and the semantic information world.

  • Brings together the strength of a ubiquitous RESTful interface standard (OData) with the flexibility and federation ability of RDF/SPARQL.
  • SPARQL/OData Interop: a proposed W3C interoperation proxy between OData and SPARQL (Kal Ahmed, 2013).
  • Opens up many popular user-interface development frameworks and tools such as OpenUI5.
  • Acts as a Janus-point between application development and data-sources.
  • User interface developers are not, and do not want to be, database developers. Therefore they want to use a standardized interface that abstracts away the database, even to the extent of what type of database it is: RDBMS, NoSQL, or RDF/SPARQL.
  • By providing an OData2SPARQL server, it opens up any SPARQL data-source to the C#/LINQ development world.
  • Opens up many productivity tools such as Excel/PowerQuery and SharePoint to be consumers of SPARQL data such as DBpedia, ChEMBL, ChEBI, BioPAX and any of the Linked Open Data endpoints!
  • Microsoft has been joined by IBM and SAP in using OData as their primary interface method, which means there will be many application developers familiar with OData as the means to communicate with a backend data source.
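To make the idea of a protocol proxy concrete, here is a minimal Python sketch of the kind of translation such a proxy has to perform: an OData request for an entity set with a `$top` option becomes a SPARQL SELECT with a LIMIT. The namespace URI, the function name, the assumption that an entity set is named after an RDF class, and the shape of the generated query are all invented for illustration; this is not how OData2SPARQL itself generates queries.

```python
from urllib.parse import parse_qs, urlparse


def odata_to_sparql(url: str, namespace: str) -> str:
    """Translate `<service>/<EntitySet>?$top=N` into a SPARQL query.

    Assumes (for this sketch only) that the entity set name is the local
    name of an RDF class in `namespace`.
    """
    parsed = urlparse(url)
    entity_set = parsed.path.rstrip("/").split("/")[-1]
    options = parse_qs(parsed.query)

    lines = [
        "SELECT ?entity WHERE {",
        f"  ?entity a <{namespace}{entity_set}> .",
        "}",
    ]
    if "$top" in options:
        # int() rejects anything that is not a plain integer.
        lines.append(f"LIMIT {int(options['$top'][0])}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(odata_to_sparql(
        "http://localhost:8080/odata2sparql/Customer?$top=5",
        "http://example.org/northwind#",
    ))
```

The client sees only the OData URL; the SPARQL text never leaves the proxy, which is what lets OData tools consume a triple store unchanged.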

Consuming RDF via OData: OData2SPARQL

All of the following tools are demonstrated accessing an RDF triple store via the OData2SPARQL protocol proxy.

Development Tools

XOData

A new online OData development tool is XOData (PragmatiQa, n.d.). Unlike other OData tools, XOData renders very useful relationship diagrams. The Northwind RDF model published via an OData2SPARQL endpoint is shown below:

Figure 2: Browsing the EDM model Published by OData2SPARQL using XOData

XOData also allows the construction of queries as shown below:

Figure 3: Querying The OData2SPARQL Endpoints Using XODATA

LINQPad

LINQPad (LINQPad, n.d.) is a free development tool for interactively querying databases using C#/LINQ. Thus it supports Object, SQL, EntityFramework, WCF Data Services, and, most importantly for OData2SPARQL, OData services. Since LINQPad is centered on the Microsoft frameworks (WCF, WPF, etc.), this illustrates how the use of OData can bridge between the Java world of many semantic tools and the Microsoft world of corporate applications such as SharePoint and Excel.

LINQPad shows the contents of the EDM model as a tree. One can then select an entity within that tree, and then create a LINQ or Lambda query. The results of executing that query are then presented below in a grid.

Figure 4:  Browsing and Querying the OData2SPARQL Endpoints Using LINQPad

LINQPad and XOData are good for testing out queries against any datasource. The same approach can therefore be demonstrated against the DBpedia SPARQL endpoint, as shown below:

 Figure 5: Browsing DBPedia SPARQLEndpoint Using LINQPad via OData2SPARQL

Browsing Data

One of the primary motivations for the creation of OData2SPARQL is to allow access to Linked Open Data and other SPARQLEndpoints from the ubiquitous enterprise and desktop tools such as SharePoint and Excel.

Excel/PowerQuery

“Power Query is a free add-in for Excel 2010 and up that provide users an easy way to discover, combine and refine data all within the familiar Excel interface.” (Introduction to Microsoft Power Query for Excel, 2014)

PowerQuery allows a user to build their personal data-mart from external data, such as that published by OData2SPARQL. The user can fetch data from the datasource, add filters to that data, navigate through that data to other entities, and so on with PowerQuery recording the steps taken along the way. Once the data-mart is created it can be used within Excel as a PivotTable or a simple list within a sheet. PowerQuery caches this data, but since the steps to create the data have been recorded, it can be refreshed automatically by allowing PowerQuery to follow the same processing steps. This feature resolves the issue of concurrency in which the data-sources are continuously being updated with new data yet one cannot afford to always query the source of the data. These features are illustrated below using the Northwind.rdf endpoint published via OData2SPARQL:

Figure 6: Browsing the OData2SPARQL Endpoint model with PowerQuery

Choosing an entity set allows one to start filtering and navigating through the data, as shown in the ‘Applied Steps’ frame on the right.

Note that the selected source shows all values as ‘List’, since each property can have zero, one, or more values, as is allowed for RDF DatatypeProperties.

Figure 7: Setting Up Initial Source of Data in PowerQuery

As we expand the data, such as the companyProperty, we see that Applied Steps records the steps taken so that they can be repeated.

Figure 8: Expanding Details in PowerQuery

The above example expanded a DatatypeProperty collection. Alternatively we may navigate through a navigation property such as Customer_orders, the orders that are related to the selected customer:

Figure 9: Navigating through related data with PowerQuery

 Once complete the data is imported into Excel:

Figure 10: Importing data from OData2SPARQL with PowerQuery

Unlike conventional importing of data into Excel, the personal data-mart that was created in the process of selecting the data is still available.

Application Development Frameworks

There are a great number of superb application development frameworks that allow one to create cross platform (desktop, web, iOS, and Android), rich (large selection of components such as grids, charts, forms etc) applications. Most of these are based on the MVC or MVVM model both of which require a systematic and complete (CRUD) access to the back-end data via a RESTful API. Now that OData has been adopted by OASIS, the number of companies offering explicit support for OData is increasing, ranging from Microsoft, IBM, and SAP to real-time database vendors such as OSI. Similarly there are a number of frameworks, one of which is SAPUI5 (UI Development Toolkit for HTML5 Developer Center , n.d.) which has an open source version OpenUI5 (OpenUI5, n.d.).

SAPUI5

SAPUI5 is an impressive framework which makes MVC/MVVM application development easy via the Eclipse-based IDE. Given that OData2SPARQL publishes any SPARQLEndpoint as an OData endpoint, this development environment is immediately available for any semantic application development. The following illustrates a master-detail example against the Northwind.rdf SPARQL endpoint via OData2SPARQL.

Figure 11: SAPUI5 Application using OData2SPARQL endpoint

Yes, we could have cheated and used the Northwind OData endpoint directly, but the QNames of the Customer ID and Order Number reveal that the origin of the data really is RDF.

Handling Contradictions between OData and RDF/SPARQL

RDF is an extremely powerful way of expressing data, so a natural question to ask is what could be lost when that data is published via an OData service. The answer is very little! The following table lists the potential issues and their mitigation:

Issue: OData 3NF versus RDF 1NF
  Description: RDF inherently supports multiple values for a property, whereas OData up to V2 only supported scalar values.
  Mitigation: OData V3+ supports collections of property values, which are supported by the OData2SPARQL proxy server.

Issue: RDF language tagging
  Description: RDF supports language tagging of strings.
  Mitigation: OData supports complex types, which are used to map a language-tagged string to a complex type with a separate language tag and string value.

Issue: DatatypeProperties versus object-attributes
  Description: OWL DatatypeProperties are concepts independent of the class, bound to a class by a domain, range or OWL restriction. Conversely, OData treats such properties as bound to the class.
  Mitigation: In OData2SPARQL the OWL DatatypeProperty is converted to an OData EntityType property for each of the DatatypeProperty domains.

Issue: Multiple inheritance
  Description: OData only supports single inheritance via the OData baseType declaration within an EntityType definition.

Issue: Multiple domain properties
  Description: An OWL ObjectProperty will be mapped to an OData Association. An Association can be between only one FromRole and one ToRole, and the Association must be unique within the namespace.
  Mitigation: OData Associations are created for each domain. OData2SPARQL names these associations {Domain}_{ObjectProperty}, ensuring uniqueness.

Issue: Cardinality
  Description: The capabilities of OData V3 allow all DatatypeProperties to be OData collections. However, the ontology might have further restrictions on the cardinality.
  Mitigation: OData2SPARQL assumes cardinality will be managed by the backend triple store. However, in future versions, if the cardinality is restricted to one or less, then the EntityType property can be declared as a scalar rather than a collection.

Table 2: Contradictions between OData and RDF/SPARQL
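Two of these mitigations are easy to picture in code. The Python sketch below shows a language-tagged string being split into a two-field structure, and associations being named per domain with the {Domain}_{ObjectProperty} pattern. The type name `LangString`, the function names, and the sample values are hypothetical and chosen only for this illustration; they are not taken from the OData2SPARQL code base.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LangString:
    """Hypothetical complex type: a string value plus its language tag."""
    value: str
    lang: Optional[str]


def to_lang_string(literal: str) -> LangString:
    """Split a Turtle-style literal such as '"Chai"@en' into value and tag."""
    text, sep, tag = literal.rpartition("@")
    if sep and text.startswith('"') and text.endswith('"'):
        return LangString(value=text[1:-1], lang=tag)
    # No language tag: keep the whole string and strip the quotes.
    return LangString(value=literal.strip('"'), lang=None)


def association_names(object_property: str, domains: List[str]) -> List[str]:
    """One association per domain, named {Domain}_{ObjectProperty}."""
    return [f"{domain}_{object_property}" for domain in domains]


if __name__ == "__main__":
    print(to_lang_string('"Chai"@en'))
    print(association_names("orders", ["Customer", "Employee"]))
```

Because the domain is part of the name, a property shared by several classes yields several distinct associations, which satisfies the uniqueness requirement described in the table.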

Availability of OData2SPARQL

Two versions of OData2SPARQL are freely available as listed below:

  1. inova8.odata2sparql.v2 : OData V2 based on the Olingo.V2 library supporting OData Version 2
  2. inova8.odata2sparql.v4 : OData V4 based on the Olingo.V4 library supporting OData V4 (in progress)

 


Answering complex queries with easy-to-use graphical interface

The objective of lens2odata is to provide a simple method of OData query construction, driven by the metadata provided by OData services:

  • Provides metamodel-driven OData query construction
    • Eliminates any configuration required to expose any OData service to lens2odata
  • Allows searches to be saved and rerun
    • Allows ease of use by casual users
  • Allows queries to be pinned to ‘Lens’ dashboard panels
    • Provides simple-to-use dashboard
  • Searches can be parameterized
    • Allows for easy configuration of queries
  • Compatible with odata2sparql, a service that exposes any triple store as an OData service
    • Provides a Query-Answering-over-Linked-Data (QALD) interface to any linked data.

Concept of Operation

Lens2odata consists of 3 primary pages with which users interact:

  1. Query: is the page in which users can
    1. add new OData services,
    2. compose queries,
    3. save those queries for reuse, and
    4. pin the queries as result fragments on a Lens
  2. Search: is the page in which users can
    1. select an existing query, and execute that query to explore the results
    2. from where they can navigate to Lens pages for specific entities or collections of entities
  3. Lens: are the pages, composed by users, which display fragments of details, optionally grouped into tabs, about a specific entity or collections of entities. Fragments can either be forms or tables. Other fragment layouts are being added.

Navigation between these pages is shown in the diagram below. Specifically, the navigation paths are:

  1. Toggle between Search and Query to explore how the results would appear to a casual user
  2. Navigate to a concept’s Lens from Query preview hyperlinks
  3. Navigate to a concept’s Lens from Search results’ hyperlinks

Figure 1: lens2odata Navigation

Quick Start

Login to lens2odata

  1. Navigate to the URL provided by your administrator for lens2odata, http://<server>/lens2odata
  2. Enter your user name and password
  3. Since no service has been previously set up, you will be prompted to enter the service display name and URL of that service. Check the option to use the default proxy if it is not a local service
  4. After ‘save’, as long as your service was validated, you will immediately enter the Query page with a new query initialized with the first collection found in the service
  5. Execute ‘Preview’ to populate the Results Preview with a few values from the collection:
  6. You are now ready to:
    1. View the query via Search
    2. Explore the results further via Lens
    3. Expand the query with more values and filters

Let’s explore Search first of all

  1. Click on the icon at the top-left of the Query page to navigate to the Search page:

Search is the page that general users will access. From this page they can select a predefined query, and execute that query to start their discovery journey.

  1. Click on the button to populate the results form:

 

This shows more details than the Query page because, in the absence of any specific definition about what details of the instances of collections should be displayed, search will display whatever it can find.

  1. Click on the Url “Categories(1)” to navigate to the Lens for that type of instance.

The Lens page is an information dashboard that can be constructed for any type of instance that is discovered. In this case, since no specific lens page has been setup for ‘Category’ types of instance, a default page has been used with a single tab.

  1. There is another Uri on this page “Products”. Navigating this link will take you to a similar lens page, but one for a collection of instances:

 

  1. This Lens for Products shows a list of instances of products. The first column is a Uri to the lens page for the individual product. Click one to navigate to its Lens:
  2. You are now discovering information by navigating through the data, having started by querying a collection. Next steps would be:
  • The original query was not particularly specific. Lens2odata allows that query to be further refined by specifying which attributes should be displayed, adding navigation properties to other entities, adding filters to the query to restrict results or even parameterizing the query so that a user can execute the saved query, modifying the results just by supplying parameter values.
  • The Lens dashboard page can be further refined by adding more fragments of information on a tab, or adding more tabs to the lens page to logically group the information.
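The refinements described above (choosing attributes, following navigation properties, filtering, and limiting results) correspond to the standard OData system query options `$select`, `$expand`, `$filter`, and `$top`. The Python sketch below assembles such a request URL and wraps it in a one-parameter function to mimic a parameterized saved query. The service URL, entity set, and property names are hypothetical, and the sketch says nothing about how lens2odata builds its requests internally.

```python
from typing import Optional, Sequence
from urllib.parse import quote


def build_odata_query(service_root: str,
                      entity_set: str,
                      select: Optional[Sequence[str]] = None,
                      expand: Optional[Sequence[str]] = None,
                      filter_expr: Optional[str] = None,
                      top: Optional[int] = None) -> str:
    """Assemble an OData request URL from optional query refinements."""
    options = []
    if select:
        options.append("$select=" + ",".join(select))
    if expand:
        options.append("$expand=" + ",".join(expand))
    if filter_expr:
        # Percent-encode spaces but keep quotes and parentheses readable.
        options.append("$filter=" + quote(filter_expr, safe="'()"))
    if top is not None:
        options.append(f"$top={int(top)}")
    url = f"{service_root.rstrip('/')}/{entity_set}"
    return url + ("?" + "&".join(options) if options else "")


def products_dearer_than(min_price: int) -> str:
    """A 'saved query' with one parameter; all names are hypothetical."""
    return build_odata_query(
        "http://localhost:8080/odata/northwind",
        "Products",
        select=["productName", "unitPrice"],
        expand=["category"],
        filter_expr=f"unitPrice gt {min_price}",
        top=10,
    )


if __name__ == "__main__":
    print(products_dearer_than(20))
```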

 

Availability of Lens2OData

A version of Lens2OData is available on GitHub at com.inova8.lens2odata


SKOS, the Simple Knowledge Organization System, offers an easy to understand schema for vocabularies and taxonomies. However modeling precision is lost when skos:semanticRelation predicates are introduced.

Combining SKOS with RDFS/OWL allows the precision of owl:ObjectProperty to be combined with the flexibility of SKOS. However, clarity is then lost as the number of core concepts (aka owl:Class) grows.
Many models are not just documenting the ‘state’ of an entity. Instead they are often tracking the actions performed on entities by agents at locations. Thus aligning the core concepts to the Activity, Entity, Agent, and Location classes of the PROV ontology provides a generic upper-ontology within which to organize the model details.

Vehicle Manufacturing Example

This example captures information about vehicle manufacturing, as follows:

  1. Manufacturers: the manufacturer of models of cars in various production lines sited at plants
  2. Models: the models that the manufacturer produces
  3. ProductionLines: the production lines set up to produce models of vehicles on behalf of a manufacturer
  4. Plants: the plants that house the production lines

In addition there are different ‘styles’ of manufacturing that occur for various models and various sites:

  1. Manufacturing: the use of a ProductionLine for a particular Model

SKOS Modeling

If we follow a pure SKOS model we proceed as follows by creating a VehicleManufacturingScheme  skos:ConceptScheme

s:VehicleManufacturingScheme
    rdf:type skos:ConceptScheme
.

Then we create skos:topConceptOf Manufacturer, Model, Plant, and Production as follows:

 

s:Manufacturer
    rdf:type owl:Class ;
    rdfs:subClassOf skos:Concept ;
    skos:topConceptOf s:VehicleManufacturingScheme
.

s:Model
    rdf:type owl:Class ;
    rdfs:subClassOf skos:Concept ;
    skos:topConceptOf s:VehicleManufacturingScheme
.

These top-level concepts are being created of type owl:Class and a subClassOf skos:Concept.  This is the pattern recommended in (Bechhofer, et al.)

Finally we can create skos:broader concepts as follows:

s:Ford
    rdf:type s:Manufacturer ;
    skos:broader s:Manufacturer ;
    skos:inScheme s:VehicleManufacturingScheme
.

s:Fusion
    rdf:type s:Model ;
    skos:broader s:Model ;
    skos:inScheme s:VehicleManufacturingScheme
.

The resultant SKOS taxonomy of the VehicleManufacturingScheme  skos:ConceptScheme then appears as follows:

Figure 1: SKOS taxonomy

Why?

By starting with a pure SKOS model we provide access to the underlying concepts in a more accessible style for the less proficient user, as illustrated by the SKOS taxonomy above. Yet we have not sacrificed the ontological precision of owl:Classes.

Thus we can ask questions about all concepts:

SELECT * WHERE
{
    ?myConcepts rdfs:subClassOf+ skos:Concept .
}

Or we can get a list of the concepts that name one of these concepts as their broader concept, that is, its narrower concepts:

SELECT * WHERE
{
    ?myNarrowerConcepts skos:broader s:Model .
}
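To show what these two patterns match, the following Python sketch evaluates them against a handful of the triples defined earlier, using a plain set of tuples in place of a triple store. The helper names are invented for the example, and a real SPARQL engine is of course far more general. Note that a pattern of the form `?x skos:broader s:Model` binds `?x` to concepts that declare s:Model as their broader concept, such as s:Fusion.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# A few of the triples from the Turtle above, written as (subject, predicate, object).
TRIPLES: Set[Triple] = {
    ("s:Manufacturer", "rdfs:subClassOf", "skos:Concept"),
    ("s:Model", "rdfs:subClassOf", "skos:Concept"),
    ("s:Ford", "skos:broader", "s:Manufacturer"),
    ("s:Fusion", "skos:broader", "s:Model"),
}


def subjects(triples: Set[Triple], predicate: str, obj: str) -> Set[str]:
    """Evaluate `?x predicate obj` for a single step."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def transitive_subjects(triples: Set[Triple], predicate: str, obj: str) -> Set[str]:
    """Evaluate `?x predicate+ obj`: follow the predicate one or more times."""
    found: Set[str] = set()
    frontier = {obj}
    while frontier:
        step: Set[str] = set()
        for node in frontier:
            step |= subjects(triples, predicate, node)
        frontier = step - found  # only expand nodes not seen before
        found |= frontier
    return found


if __name__ == "__main__":
    print(sorted(transitive_subjects(TRIPLES, "rdfs:subClassOf", "skos:Concept")))
    print(sorted(subjects(TRIPLES, "skos:broader", "s:Model")))
```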

SKOS+OWL Modeling

Although skos:semanticRelation allows one to link concepts together, this predicate is often too broad when trying to create an ontology that documents specific relations between specific types of concept.

In our VehicleManufacturingScheme we might want to know the following:

  1. isManufacturedBy: which manufacturer manufactures a particular model
  2. operatedBy: which manufacturer operates a particular production facility
  3. performedAt: which plant is the location of a production facility
  4. wasManufacturedAt: which production facility was used to manufacture a particular model

Figure 2: SKOS+OWL model of Relations

 

These predicates can be defined using RDFS as follows:

so:isManufacturerBy
    rdf:type owl:ObjectProperty ;
    rdfs:domain s:Model ;
    rdfs:range s:Manufacturer ;
    rdfs:subPropertyOf skos:semanticRelation
.

so:operatedBy
    rdf:type owl:ObjectProperty ;
    rdfs:domain s:Production ;
    rdfs:range s:Manufacturer ;
    rdfs:subPropertyOf skos:semanticRelation
.

Note that the definition of Model, Manufacturer etc. as subClassOf skos:Concept allows us to precisely define the domain and range.

s:Fusion
    so:isManufacturerBy s:Ford ;
    so:wasManufacturedAt s:Halewood-SmallVehicle ;
.

s:Dagenham-Truck
    so:operatedBy s:Ford ;
    so:performedAt s:Dagenham ;
.

Thus we have used the flexibility of SKOS with the greater modeling precision of RDFS/OWL.

Why?

By building upon the SKOS model, one can ask an expansive question such as what concepts are semantically related to, say, the concept s:Fusion with a simple query:

SELECT * WHERE
{
    s:Fusion ?p ?y .
    ?p rdfs:subPropertyOf* skos:semanticRelation
}

Yet with the same model we can ask a specific question about a relationship of a specific instance:

SELECT * WHERE
{
    s:Camry so:isManufacturerBy ?o .
}
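The expansive query relies on `rdfs:subPropertyOf*`, which matches the root property itself and anything declared beneath it. The Python sketch below reproduces that behaviour over a small in-memory set of triples. The helper names are invented, and the sketch assumes that so:wasManufacturedAt is declared as a sub-property of skos:semanticRelation in the same way as the two properties defined above, which the text does not show explicitly.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

TRIPLES: Set[Triple] = {
    ("so:isManufacturerBy", "rdfs:subPropertyOf", "skos:semanticRelation"),
    # Assumed declaration, by analogy with the properties defined in the text.
    ("so:wasManufacturedAt", "rdfs:subPropertyOf", "skos:semanticRelation"),
    ("s:Fusion", "so:isManufacturerBy", "s:Ford"),
    ("s:Fusion", "so:wasManufacturedAt", "s:Halewood-SmallVehicle"),
    ("s:Fusion", "rdf:type", "s:Model"),
}


def sub_properties(triples: Set[Triple], root: str) -> Set[str]:
    """All ?p with `?p rdfs:subPropertyOf* root`; zero steps means root counts too."""
    result = {root}
    frontier = {root}
    while frontier:
        frontier = {
            s for s, p, o in triples
            if p == "rdfs:subPropertyOf" and o in frontier
        } - result
        result |= frontier
    return result


def related(triples: Set[Triple], subject: str,
            root: str = "skos:semanticRelation") -> Set[Tuple[str, str]]:
    """(predicate, object) pairs of `subject` whose predicate falls under `root`."""
    props = sub_properties(triples, root)
    return {(p, o) for s, p, o in triples if s == subject and p in props}


if __name__ == "__main__":
    for predicate, obj in sorted(related(TRIPLES, "s:Fusion")):
        print(predicate, obj)
```

The rdf:type triple is filtered out because rdf:type is not a sub-property of skos:semanticRelation, which is exactly the narrowing effect the second triple pattern has in the SPARQL query.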

SKOS+OWL+PROV Modeling

One of the attractions of SKOS is that a taxonomy can grow organically. One of the problems of SKOS is that a taxonomy can grow organically!

As the taxonomy grows it can be useful to add another layer of structure beyond a catalog of concepts. Many models are not just documenting the ‘state’ of an entity. Instead they are often tracking the actions performed on entities by agents at locations. Thus aligning the core concepts to the Activity, Entity, Agent, and Location classes of the PROV ontology (Lebo, et al.) provides a generic upper-ontology within which to organize the model details.

Figure 3: PROV model

Thus our VehicleManufacturingScheme has each core PROV concept:

  1. Manufacturers: the Agents who manufacture models, and operate plants
  2. Models: the Entities
  3. ProductionLines: the Activities that produce Models on behalf of Manufacturers.
  4. Plants: the Location at which Activities take place, and Agents and Entities are located.

Figure 4: PROV Model

 

s:Production
    rdfs:subClassOf prov:Activity ;
.

s:Model
    rdfs:subClassOf prov:Entity ;
.

s:Manufacturer
    rdfs:subClassOf prov:Organization ;
.

s:Plant
    rdfs:subClassOf prov:Location ;
.

Similarly we can cast our predicates into the same PROV model as follows:

so:isManufacturerBy
    rdfs:subPropertyOf prov:wasAttributedTo ;
.

so:operatedBy
    rdfs:subPropertyOf prov:wasAssociatedWith ;
.

so:performedAt
    rdfs:subPropertyOf prov:atLocation ;
.

so:wasManufacturedAt
    rdfs:subPropertyOf prov:wasGeneratedBy ;
.

Why?

The PROV model is closer than a simple E-R model to the requirements of most enterprise models that are trying to ‘model the business’. An E-R model concentrates on capturing the attributes of an entity that record the current state of that entity. Often those attributes focus on documenting the process by which the entity gained its current state:

  • The agent that created the entity
  • The activity used to create the entity
  • The location where things were performed
  • The date of the activity, etc.

Superimposing the PROV model formalizes this model, and thus allows a structure within which a more casual user can navigate, rather than a sea of entities.

By building upon the PROV model, one can ask an expansive question such as what entities behave as Agents and in which entities are they involved:

SELECT * WHERE
{
    ?organization a ?Agent .
    ?Agent rdfs:subClassOf* prov:Agent .
    ?entity ?predicate ?organization
}

SKOS+OWL+PROV-Qualified Modeling

Within the structure of PROV, predicates define the relationships between Activities, Entities, Agents, and Locations. However it is sometimes necessary to qualify these relationships. For example, the so:wasManufacturedAt predicate defines that a s:Production facility was used to manufacture a s:Model. When? How was it used? Why?

To extend the model, PROV adds the concept of a qualified influence, which allows the relationship to be further defined.

Figure 5: Qualified PROV for some predicates

We do this first of all by creating sopq:Manufacturing:

sopq:Manufacturing
    rdf:type owl:Class ;
    rdfs:subClassOf skos:Concept ;
    rdfs:subClassOf prov:Generation ;
    skos:topConceptOf s:VehicleManufacturingScheme ;
.

Note that this is declared as an rdfs:subClassOf prov:Generation, the reification of the predicate prov:wasGeneratedBy.

We then add two predicates, one (sopq:wasManufacturedUsing) from the prov:Entity to the prov:Generation, and one (sopq:production) from the prov:Generation to the prov:Activity as follows:

sopq:wasManufacturedUsing
    rdf:type owl:ObjectProperty ;
    rdfs:domain s:Model ;
    rdfs:range sopq:Manufacturing ;
    rdfs:subPropertyOf skos:semanticRelation ;
    rdfs:subPropertyOf prov:qualifiedGeneration ;
.

sopq:production
    rdf:type owl:ObjectProperty ;
    rdfs:subPropertyOf skos:semanticRelation ;
    rdfs:subPropertyOf prov:activity ;
.

Finally we can create a Manufacturing qualified generation concept as follows:

sopq:L-450H_at_Swindon-Hybrid
    rdf:type sopq:Manufacturing ;
    sopq:production s:Swindon-Hybrid ;
    skos:broader sopq:Manufacturing ;
.

s:L-450H
    sopq:wasManufacturedUsing sopq:L-450H_at_Swindon-Hybrid ;
.

In the figure below we can see that these qualified actions simply extend the SKOS taxonomy:

Figure 6: Taxonomy extended with Qualified Actions

Why?

Using qualifiedActions provides a systematic, rather than ad-hoc, way to provide more precision to a model.
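The following Python sketch walks the qualified path for s:L-450H over a few in-memory triples, to show what the extra node buys us. It is a toy illustration only: the prov:atTime value is a hypothetical qualifier added for the example, the function names are invented, and collapsing the qualified form back into so:wasManufacturedAt is an assumption about how one might keep the two forms consistent.

```python
# Triples copied from the example above, plus one hypothetical qualifier.
TRIPLES = [
    ("s:L-450H", "sopq:wasManufacturedUsing", "sopq:L-450H_at_Swindon-Hybrid"),
    ("sopq:L-450H_at_Swindon-Hybrid", "rdf:type", "sopq:Manufacturing"),
    ("sopq:L-450H_at_Swindon-Hybrid", "sopq:production", "s:Swindon-Hybrid"),
    # Hypothetical: a timestamp answering the "When?" question.
    ("sopq:L-450H_at_Swindon-Hybrid", "prov:atTime", "2012-06-01T00:00:00Z"),
]

STRUCTURAL = ("rdf:type", "sopq:production")


def objects(subject, predicate, triples):
    """All objects of the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def qualified_manufacturing(model, triples):
    """Return (production, qualifiers) for each qualified generation of model."""
    results = []
    for generation in objects(model, "sopq:wasManufacturedUsing", triples):
        qualifiers = {
            p: o for s, p, o in triples
            if s == generation and p not in STRUCTURAL
        }
        for production in objects(generation, "sopq:production", triples):
            results.append((production, qualifiers))
    return results


def unqualified_view(triples):
    """Collapse each qualified generation into a direct so:wasManufacturedAt triple."""
    return [
        (model, "so:wasManufacturedAt", production)
        for model, p, generation in triples
        if p == "sopq:wasManufacturedUsing"
        for production in objects(generation, "sopq:production", triples)
    ]
```

The qualified view carries the additional statements about the manufacturing event, while the unqualified view recovers the simple model-to-facility link used in the earlier sections.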

Remaining Issues

  1. The PROV structure does not manifest itself within the taxonomy. Should Activity, Entity, Agent, and Location therefore be ConceptSchemes?

Model

The model files used in this example are included here: model2rdf

  • skos.ttl
  • skos+owl.ttl
  • skos+owl+prov.ttl
  • skos+owl+provqualified.ttl

References

Bechhofer, Sean and Miles, Alistair. Using OWL and SKOS. [Online] W3C. https://www.w3.org/2006/07/SWD/SKOS/skos-and-owl/master.html.

Lebo, Timothy, Sahoo, Satya and McGuinness, Deborah. PROV-O: The PROV Ontology. [Online] W3C. https://www.w3.org/TR/prov-o/.

Remaining Issues


A large community exists “that sees Linked Data, let alone the full Semantic Web, as an unnecessarily complicated technology” (Phil Archer). However, early adopters are always known for zealotry rather than pragmatism: mainframe-to-mini, mini-to-PC, PC-to-web and many more are examples where the zealots often threw the baby out with the bath-water.

Rather than abandoning the semantic web perhaps we should take a pragmatic view and identify its strengths, together with some soul-searching to be honest about its weaknesses:

SPARQL is unlikely ever to be end-user friendly:

However one can say exactly the same of SQL (whose early Oracle command-line front end, the forerunner of SQL*Plus, was, believe it or not, called User-Friendly-Interface, or UFI for short). SQL is now hidden behind much more developer-friendly facades, such as Hibernate/JPA, C#+LINQ, OData, and many more, so that developers can deliver user-friendly experiences. Even so, SQL remains the language of choice for manipulating and querying an RDBMS.

So we still need SPARQL for manipulating semantic data, but we need developer- and user-friendly facades that make the semantic information more accessible. Some initiatives are JPA for RDF, LINQtoRDF, and Odata-SPARQL, but the effort is fragmented, with few standard initiatives.

RDF/OWL is unlikely to supplant Entity-Relational, JSON-Object databases:

However we have all learnt the lesson that the data model is often at the core of any application. Furthermore any changes to that data model can be costly, especially in the later stages of the development lifecycle. Who has not created an object-attribute data model deployed in an RDBMS to retain design flexibility without the need to change the underlying database schema? Are we not reinventing the semantic data model?

So the principles and flexibility of semantic data modeling (RDF, RDFS, and OWL) could be adopted as a data modeling paradigm, representing the evolution that started with Entity-Relationship (ER), went through Object-Role Modeling (ORM or NIAM), and arrived at Semantic Data Modeling (SDM). Used in combination with dynamic REST/JSON, such as OData, one could truly respond to the Dynamic Business Application Imperative (Forrester).

Triple-stores are unlikely to supplant Big-Data:

However, of volume, velocity, and variety, many big data solutions are weaker when handling variety, especially when the variety of data is changing over time as it does with evolving user needs. Most of the time the variety at the data source will be solved by Extract-Transform-Load (ETL) into a big-data store. Is this any different than data warehousing which has matured over the last 20 years, except for the use of different data storage technology? Like any goods in transit, data gets damaged and contaminated when it is moved. It is far better to use the data in-situ if at all possible, but this has been the unachievable Holy Grail of data integration for many years.

So the normalization of any and all data into triples might not be the best way to store data, but it can be the way to mediate variable information from a variety of data sources: Ontology Based Data Access (OBDA). The ontology acts as a semantic layer between the user and the data. The semantics of the ontology are used to enrich the information in the sources and/or cope with incomplete information in them.

In summary …

Semantic technologies will see more success if they pursue the more pragmatic approach of solving those problems that are not satisfactorily solved elsewhere, such as querying and reasoning, dynamic data models, and data source mediation, instead of re-solving the already solved.

References

Phil Archer: http://semanticweb.com/tag/phil-archer

LINQtoRDF: https://code.google.com/p/linqtordf/

JPA/RDF: https://github.com/mhgrove/Empire

ODATA/SPARQL: https://github.com/brightstardb/odata-sparql

Dynamic Business Imperative: http://www.forrester.com/The+Dynamic+Business+Applications+Imperative/fulltext/-/E-RES41397?objectid=RES41397

Ontology Based Data Access: http://obda.inf.unibz.it/


Enterprises create data cathedrals with an enforced dogma to control data purity, causing much information to be outside its walls where informal information bazaars thrive. These information bazaars have suspect quality, uncertain provenance, yet are responsive to users’ needs. Metcalfe’s law suggests that the benefit gained from integrated information grows geometrically[1] with the number of data communities that are integrated. How can we balance the dogma of the data cathedrals and the spontaneity of the information bazaar?

Enterprise database cathedrals reflect corporate dogma. Nothing gets changed without approval from on high. Change is very slow. New database orders get integrated only after a considerable time, and only on the assumption that the new data is 100% squeaky clean. So there are a lot of databases that are entirely outside the database cathedrals’ walls. Badly behaved sources of data might even be excommunicated.

Where does the other data go? It is not as though this other data does not exist, although many would like to pretend so. Instead it is all in the information bazaar. Anyone with any information can set up their own information stall, and store their own data in Excel, Access, anywhere they want. They specialize only in their own data for their own use. This data is pretty good because that is all they need for their business. They share well with others, but on a barter basis. In fact the information bazaar is chaotic, but lively, always changing to meet users’ demands, and a fun place to be.

Why do we have the conflict between the database cathedral and the information bazaars?

The data cathedral offers security, quality, and good provenance. It provides the system of record for users who then should have complete confidence in their decision making. It does this using accurate relational models capturing enterprise information. But a relational model is designed by the cathedral hierarchy based on the closed model: only pure data can be entered into the database; impure data can lead to excommunication. 

The information bazaar has few rules of entry. As demonstrated by the web, it allows anyone to say anything about anything (AAA). Even with this deficiency we will regularly search the web to help us with our decision making, not exploring sources that are suspect, and filtering information that we feel lacks accuracy until we end up with information to support our decision.

Can we resolve these conflicting objectives?

Can we expect the cathedral hierarchy to relax its admittance criteria to let in as much of the information bazaar as possible? Somewhat, but we cannot expect miracles.

Can we expect the information bazaar to become more sober and responsible so that it can securely provide information with guaranteed quality and provenance? Somewhat, but we cannot expect an evangelical conversion.

Really this is not optimal, because the benefit of having data integrated grows geometrically with the number of interconnected sources, yet the database cathedral cannot grow because the information bazaar does not meet its purity dogma.

So how can these conflicting objectives be redeemed?

One path to redemption is to unite the information bazaar through a common semantic model. This allows all information to be available within a universal graph (model). Of course some riff-raff will get in, but again that is an advantage for the semantic model as you can also declare rules that will verify the accuracy of the data even though it is already stored. 

At the same time the data cathedral can continue to expand, hopefully at faster pace, by integrating those graphs that meet their criteria. 

However we allow users to access both the data cathedral, from where they can obtain the system of record, and the information bazaar. We could even report results federated from the two data sources, annotating the information from the information bazaar with its provenance and hence its less certain data quality. Doing this in a standards-compliant way turns existing enterprise information resources into connectable, responsive and interoperable semantic assets.

Harmony

Using this approach we don’t need to force the data cathedral to relax its dogma, nor do we ask the information bazaar to shut down. Yet we can offer users access to 99% of the enterprise information, giving them the ‘Metcalfe’[1] benefits of full integration. As semantic assets grow and connect, they enable a resilient semantic ecosystem of meaningful interactions between people, applications and data irrespective of the differences in structures, data schemas, governance and technologies. The dividing boundaries between the cathedral and the bazaar no longer need to be obstacles to information users. A semantic ecosystem seamlessly embraces and provides integrated access to data cathedrals and information bazaars alike.

 

[1] If I have 10 database systems running my business that are entirely disconnected, then the benefits are 10 * K, where K is some constant. If I integrate these databases in pairs (operations + accounting, accounting + payroll, etc.), then the benefits increase to 10 * K * 2. If I integrate in threes (operations + accounting + maintenance, accounting + payroll + receiving, etc.), then the benefits increase four-fold (a corollary of Metcalfe’s law) to 10 * K * 4. For quad-wise integration my benefits would be 10 * K * 8, and so on. Now it might not be 8-fold, but the point is that there is a geometric, not linear, growth in benefits as I integrate all of my information across my organization.
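The arithmetic in this footnote follows a simple doubling pattern, which the short Python sketch below makes explicit. The function name and the general formula n * K * 2^(g-1) are an extrapolation from the four cases given in the footnote, not a statement of Metcalfe’s law itself.

```python
def integration_benefit(n_systems: int, group_size: int, k: float = 1.0) -> float:
    """Benefit of n_systems integrated in groups of group_size.

    Extrapolates the footnote's pattern: the multiplier doubles each time
    the size of the integrated group grows by one (1, 2, 4, 8, ...).
    """
    if n_systems < 1 or not 1 <= group_size <= n_systems:
        raise ValueError("group_size must be between 1 and n_systems")
    return n_systems * k * 2 ** (group_size - 1)


if __name__ == "__main__":
    # Reproduce the footnote's cases for 10 systems, with K = 1.
    for size in (1, 2, 3, 4):
        print(size, integration_benefit(10, size))
```

With K = 1 this gives 10, 20, 40 and 80 for group sizes 1 to 4, matching the figures in the footnote.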


A key process manufacturing problem yet to be solved is the management of knowledge: the know-how, know-who, and know-when. Just as we have been eliminating data silos by introducing common data repositories, and creating common information (semantic) services by adding structure to the data, we need a common rules repository rather than having rules distributed in Excel spreadsheets, work-flows, documents, government regulations, etc. That distribution causes silos of rules, lack of consistency, difficulty in ensuring consistent compliance, and many more problems.

Semantic models (RDF, RDFS, and OWL) are excellent at federating information from multiple sources, but there are competing approaches for information integration. However semantic models are also the perfect way to express rules (SPIN) because rules are also described as data within the same database, unlike other technologies where the rules have to be expressed as code.

Given the importance of consistent application of rules in the process manufacturing business, is this then the sweet spot for semantic technologies?

Is data transparent?

We have been struggling with managing knowledge about our process plants for many years. We tried to solve this problem starting in the 80’s with sophisticated data collection applications and real-time databases. However a data-centric solution alone only provides a view of the process plant as a very long list of measurement tags, reinforcing the definition of data as “being discrete, objective facts or observations, which are unorganized and unprocessed and therefore have no meaning or value because of lack of context and interpretation.”

Thus our view of the knowledge about the plant, its equipment, its operation, and its performance provides a new definition of ‘transparency’, or lack thereof. If we do want to obtain knowledge we must make use of application programs within which are encoded our knowledge-extraction rules. Most prevalent is the ubiquitous spreadsheet in which there are:

  • Object names written into cells
  • RTDB Tag names in hidden cells
  • A fixed number of feed/product rows
  • New spreadsheet for each unit/report

Consequences of this data-without-information or knowledge are a ‘gagging’ of spreadsheets: information is encoded in the tag names, knowledge is encoded in the Excel layout and formulae, action relies on a user running the application and using their experience to detect problems and deduce remedial actions, uncertainty as to the consistent application of the rules throughout the organization, and many more problems.

Data + Model = Information

The problems of the data-centric approach have driven the process manufacturing industries to seek better information management where information is defined as “organized or structured data, which has been processed in such a way that the information now has relevance for a specific purpose or context, and is therefore meaningful, valuable, useful and relevant.” Recognizing the deficiency of a data-only approach, much effort in the last 20 years has been expended on adding context to this data. This usually takes the form of a database schema which contextualizes the data within a model of the process business and plant.

One of the ways that structure is added to data is to use a relational data model. A cornerstone of the relational model is the use of referential integrity rules. An example of a referential integrity rule within the context of a process plant data model might be that a material movement must have one, and only one, source and destination. However there are limits to what rules can be expressed with referential integrity constraints alone.
To obtain knowledge using this information approach we must ask the database via a database query. The advantage of the information approach over the data approach is that it allows one to ask the database complex questions. For example, to obtain a unit’s material balance:

  • For selected unit
    • Find all feed streams
    • For each feed stream fetch desired criteria

This allows for the use of one report for all units throughout the enterprise, compared with a new spreadsheet for every unit. These reports are also more robust since, for example, if the plant changes in some way then the report will reflect those changes. This certainly tackles some of the inconsistency problems faced by a data-centric approach; however, many problems remain. For example, it would be exceedingly difficult to express the rule that an operator’s training currency expires, say, 2 years after his last training. Thus we still need to resort to complex reports, application programs, Excel macros, etc. into which we can encode our ‘rules’ about the information.
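The generic material-balance lookup described above can be sketched in a few lines of Python. The stream records, unit names and quantities are hypothetical, and a real implementation would query the plant model in the database rather than an in-memory list; the point is only that one function serves every unit.

```python
from collections import namedtuple

# Hypothetical stream records: which unit a stream belongs to, its role, and a quantity.
Stream = namedtuple("Stream", "name unit role quantity")

STREAMS = [
    Stream("crude-feed", "CDU-1", "feed", 100.0),
    Stream("naphtha", "CDU-1", "product", 30.0),
    Stream("diesel", "CDU-1", "product", 45.0),
    Stream("residue", "CDU-1", "product", 24.0),
    Stream("naphtha-feed", "REF-1", "feed", 30.0),
    Stream("reformate", "REF-1", "product", 27.5),
]


def material_balance(unit, streams):
    """For the selected unit, find its feed and product streams and total them."""
    feeds = [s for s in streams if s.unit == unit and s.role == "feed"]
    products = [s for s in streams if s.unit == unit and s.role == "product"]
    total_feed = sum(s.quantity for s in feeds)
    total_product = sum(s.quantity for s in products)
    return {
        "feed": total_feed,
        "product": total_product,
        "imbalance": total_feed - total_product,
    }


if __name__ == "__main__":
    # The same report works for any unit present in the model.
    for unit in ("CDU-1", "REF-1"):
        print(unit, material_balance(unit, STREAMS))
```

If a stream is added to or removed from the model, the result changes without any edit to the report logic, which is the robustness the text refers to.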

Information + Rules = Knowledge

As good as an information-centric approach may be, it still fails to solve many of the business problems that we face that can only be solved by creating knowledge, defined as “know-how, know-who, and know-when; knowledge is action, not a description of action”. Some of the business challenges that can only be solved by a rule-centric solution are shown below.

Business Rule Challenges

Business issue: Business rules are distributed throughout the business
  • Problem: Very difficult to know all of the business rules in place, are they duplicated, are they consistently applied
  • Rule-centric solution advantages: Provide common ‘Rules repository’ for the entire organization. Parallels concept of ‘data warehousing’ of data

Business issue: Many Excel spreadsheets containing knowledge of how to handle information
  • Problem: Data Management left to IT or DBA. End users cannot modify the model or *their* data. Instead they resort to Excel on the path to hell!
  • Rule-centric solution advantages: Use rules server to consistently apply the semantic rules that are then common to all applications. Use Excel for reporting against data, information, and inferred results of rules

Business issue: Custom reports written against custom information models have encoded rules
  • Problem: Reporting languages such as SQL end up containing many business rules. Rules are duplicated in similar but not identical reports
  • Rule-centric solution advantages: Report against information server and inferred results of application of rules against information

Business issue: Manual work processes, some of which are not documented
  • Problem: Exercises such as ISO 9001 remain only as documented procedures with no means of automating those procedures
  • Rule-centric solution advantages: Use common rules repository to define the rules. Documentation of the rules can then be created from the rules repository

Business issue: Regulatory compliance rules exist only as documents
  • Problem: Difficult to assure compliance to regulations when it is left to individuals who must be familiar with entire regulation
  • Rule-centric solution advantages: Translate regulations to rules deposited in the rules repository

Business issue: Loss of skills with aging workforce
  • Problem: Loss of knowledge in the form of the rules (aka experience) as to how to handle situations
  • Rule-centric solution advantages: Capture the experience as rules within rules repository

Business issue: Difficult to audit the adherence to rules
  • Problem: Since the rules are not formalized it is difficult to ensure that procedures are followed. Personnel might be trained on procedures, but if the system does not enforce them then management cannot be certain they are followed.
  • Rule-centric solution advantages: Since most actions are recorded, it is possible to verify that the actions taken comply with the rules even if they were not being forced to follow the rules in the form of a workflow

Business issue: Complexity of data and information
  • Problem: Difficult for users to determine what rules should apply
  • Rule-centric solution advantages: The results of rules become inferred information that is available for reporting

Business issue: Impossible end-user reporting
  • Problem: The Holy-Grail but never achieved. Even with a good informational model that provides context to the data, the ‘knowledge’ that should be the result of a report (‘poor yields’, ‘excessive material loss’, ‘pending equipment failure’) is impossible for end-users to encode into their reports
  • Rule-centric solution advantages: Semantic information can be reported using ‘query by example’: far easier than any other reporting. Inferred results of rules are available for reporting using the same technique

Business issue: Knowledge, which is defined as “know-how, know-who, and know-when”, requires rules about information, which requires contextualized data
  • Problem: Measurement tags disguise the model. Users forced to interact with abstract measurement tags (10FI107.OP). General Information models become too complex. Customers desire to support standards, but competing standards supported by different constituencies: PPDMA/ProdML/WitsML/ISA95/MIMOSA/IEC-CIM etc
  • Rule-centric solution advantages: Although rules can be applied to any informational model, a semantic informational model is a better match to semantic rules

To obtain knowledge from a data-centric approach we encoded many rules into applications such as Excel. Although the information-centric approach could encode certain types of rules into the database schema, such as the referential integrity example above, there are many rules about the business that cannot be expressed in this way. Throughout any business we have many rules distributed throughout spreadsheets, reports, application programs, work-flows, procedures, etc. Below are examples of rules that can be found throughout Process Manufacturing.

Example operational compliance rules throughout the Process Manufacturing Model (PMM)

Rule types:
  • Validation: Validate that the information is consistent with known rules, such as a movement must have a source and a destination
  • Calculation: Calculating another (data) statement, such as the power of a pump is the product of the flow and pressure rise
  • Deduction: Deducing additional (object) statements, such as knowing that the measurement of something downstream is the same as the upstream measurement
  • Invocation: Invoking an external process to ensure the correct action is taken based upon the change of a rule

Materiel Chain Finance
  • (no rules listed)

Safety
  • Validation: Validate that equipment in use has a valid HAZOP assessment
  • Invocation: Initiate the safety review work process after an incident. Initiate the HAZOP review whenever major equipment changed

Docs
  • Invocation: Initiate review of encoded rules when document containing rules is revised.

Assets HME
  • Validation: Validate that the equipment has undergone appropriate repair or upgrade as recommended
  • Calculation: Calculate Overall Equipment Effectiveness (OEE) based on availability, planned, and actual
  • Deduction: Deduce the onset of increased operational risk based on past observations, and planned use of equipment
  • Invocation: Initiate maintenance or repair process

Fixed Plant
  • Calculation: Provide efficiency, energy consumption calculation based on data and model
  • Deduction: Deduce links to MSDS, maintenance records, maintenance procedures and other documents

Technical Infrastructure
  • Validation: Validate that critical equipment has a valid security policy. Validate that users with access to critical equipment have valid training.
  • Deduction: Deduce connectivity between critical equipment through the LAN
  • Invocation: Initiate remedial action to update software and utilities when risk identified.

Security
  • Validation: Validate user has the correct access privileges to perform this action
  • Calculation: Calculate the currency of any user’s privileges to perform action
  • Deduction: Deduce that the building containing critical equipment has secure access controls
  • Invocation: Initiate security review when equipment moved to new location

People
  • Validation: Validate the correct training status of individuals
  • Calculation: Calculate time remaining for currency of their training
  • Deduction: Deduce what assets and facilities an individual has based on their training
  • Invocation: Initiate retraining program when retraining is deduced to become necessary

Consumables
  • Validation: Validate that inventory of consumables matches the measured consumption
  • Deduction: Deduce the route of consumable materials (additives etc) into the product stream based on the topology so that the costs can be correctly calculated

Raw Materials
  • (no rules listed)

Utilities
  • Calculation: Calculate the quantities of utilities in the absence of complete measurements
  • Deduction: Deduce the route of utilities (water, electricity, fuel etc) into the production facilities based on the topology so that the costs can be correctly assigned

Emissions
  • Validation: Validate that no measured emissions are exceeding regulations
  • Calculation: Calculate emissions that are not directly measured. Calculate total emissions
  • Deduction: Deduce the flow of regulated material from the plant topology

Business Procedures
  • Validation: Validate that the correct procedures are being followed.
  • Deduction: Deduce which business processes should be applied in particular situations. For example, if an area is designated as secure, then all processes applied to sub-areas must follow the same designated work processes.
  • Invocation: Initiate a process to update work processes when deviations from recommended work processes are detected.

Intellectual Property
  • (no rules listed)

Product Resource
  • Validation: Validate that there are valid exploration rights associated with options
  • Calculation: Calculate the time remaining to take advantage of exploration rights
  • Invocation: Initiate review of exploration rights prior to their expiration.

Field
  • Validation: Validate that the field has active contracts
  • Calculation: Calculate the royalty payments based on the individual contracts
  • Deduction: Deduce the applicable contract rules
  • Invocation: Initiate contract reviews and payment processes.

Well
  • Validation: Validate that each well has an active contract. Validate that each well is operating in accordance with its operating permits.
  • Calculation: Calculate the actual flow based on pressure and temperature (in absence of flow measurement). Calculate variables required for regulatory reporting.
  • Deduction: Infer the line-up between well and receiving station based on topology of lines.
  • Invocation: Initiate transmission of regulatory reporting requirements

Pipeline
  • Validation: Validate that the nomination and routing information is complete: source and destination, quantity, and quality

Crude Storage
  • Validation: Validate that a new crude from pipeline is not being run into the incorrect storage
  • Calculation: Calculate the overall assay of the inventory based on component assays
  • Deduction: Deduce the assay available at the crude unit based on the line-up of the crude tanks to units
  • Invocation: Initiate rerouting of incoming crude to more appropriate storage. Initiate crude-switchover on crude unit based on assay of new crude tank

Processing
  • Validation: Validate that the configured mode of operation matches the planned or scheduled mode of operation
  • Calculation: Calculate material balances, yields, qualities, efficiencies.
  • Deduction: Deduce measurements of downstream elements based on the operational configuration and knowledge of location of actual measurements. Deduce the operational configuration and flow model of the plant given the material movement and battery limit flows
  • Invocation: Initiate a work flow to switch modes of operation

Storage
  • Validation: Check to ensure that material planned to flow into storage is compatible with in-store material
  • Calculation: Calculate the actual contents of the storage
  • Deduction: Deduce the grade of the stored material based on existing stored material and inbound movements
  • Invocation: Invoke rescheduling of blends when an actual blend is found to be out-of-specification

Pipeline
  • Validation: Validate that material is not planned for a line that would contaminate the contents of the line or the planned material.
  • Calculation: Calculate the material movement based on either source or destination quantity measurements
  • Invocation: Invoke custody transfer dispute when transfer outside of acceptable measurement deviation

Port Storage
  • Validation: Validate that quantity is available for planned shipments
  • Calculation: Calculate inventory remaining after current shipment commitments
  • Deduction: Deduce the grade of material based on mixed assay of storage or stockpile
  • Invocation: Initiate pull-through of more inventory when commitments exceed current inventory and planned receipts

Shipping
  • Validation: Check that the vessel is compatible with the scheduled berth
  • Calculation: Calculate demurrage charges based on agreed rules
  • Deduction: Deduce the stored material destination based on the vessel berth.
  • Invocation: Initiate loading re-schedule in the event that a vessel is delayed

Customer
  • Validation: Validate that customer order has valid contract upon which transfer can be based

 

These rules share the characteristics that caused us to resolve the original data-silo problem. A simple example is that we would want to calculate the corrected custody transfer quantities once, for both operational and financial needs. We can also observe that rules span multiple business areas. For example the currency of an operator’s privileges spans the training records, access control to the building housing the equipment, maintenance records of the equipment, and more. Finally we do not want the rules to be passive. Instead we want any deviation from the rules to initiate, or at least recommend, the correct remedial action. Thus we want our rules and information to be combined to achieve active knowledge, as shown below.
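As a rough illustration of rules held in one repository and applied uniformly, here is a minimal Python sketch covering a validation, a calculation, and an invocation. The Rule structure, the rule names, and the field names (source, destination, flow_m3_s, pressure_rise_pa) are assumptions made for the example; in a SPIN-style repository the rule bodies would themselves be stored as data rather than as Python functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    kind: str          # "validation" or "calculation" in this sketch
    applies_to: str    # the type of fact the rule is evaluated against
    body: Callable[[dict], object]


def movement_has_endpoints(movement: dict) -> bool:
    """Validation: a movement must have a source and a destination."""
    return bool(movement.get("source")) and bool(movement.get("destination"))


def pump_power(pump: dict) -> float:
    """Calculation: power (W) is flow (m3/s) times pressure rise (Pa)."""
    return pump["flow_m3_s"] * pump["pressure_rise_pa"]


# The single shared repository, instead of rules scattered across applications.
RULES = [
    Rule("movement-endpoints", "validation", "movement", movement_has_endpoints),
    Rule("pump-power", "calculation", "pump", pump_power),
]


def apply_rules(fact: dict, rules, on_violation) -> dict:
    """Evaluate every rule that applies to the fact and invoke a remedial
    action (on_violation) whenever a validation rule fails."""
    results = {}
    for rule in rules:
        if rule.applies_to != fact["type"]:
            continue
        outcome = rule.body(fact)
        results[rule.name] = outcome
        if rule.kind == "validation" and outcome is False:
            on_violation(rule, fact)  # the invocation step
    return results
```

A caller supplies on_violation as whatever starts the remediation process, for example a function that opens a workflow item; the rules themselves stay in one place and every application evaluates the same set.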

Knowledge + Action = Results

Even if we have the perfect set of rules, they have no business value unless we act upon the know-how, know-when, and know-who. Thus it is important to close the business loop by taking action on the knowledge to produce the desired results; no action, then no results. This means that the knowledge must have a mechanism for invoking the remediation process.

Realizing a rule-centric solution

How is it possible to abstract the handling of rules away from the individual applications into which they are encoded? Our recommendation is that there should be a separate rules repository; a container that defines all of the rules. Although there are several candidates for describing rules, one favored choice would be to semantically define the rules using something like SPARQL Rules or SPIN. Since information is conveniently modeled semantically, it then makes sense to harmonize the technologies and use the same for the rules repository. The complete rules-centric architecture is shown below.

 

Data:                Raw measurements collected from the instruments and data entry, stored in real-time databases and historians.

Technology:      Real-time historians

Model:             Context and structure added to the data to create information. It takes the form of a database schema in the case of a relational model, or an ontology in the case of a semantic model, plus the configuration that represents the plant: equipment, topology, etc.

Technology:      Relational schema, object structure or semantic ontology

Information:     The combination of data and model manifested as a database, relational, object, or semantic.

Technology:      Relational, object or semantic data store

Rules:               A repository of the rules. Traditional information system architectures fail to treat this as a separate element. Instead the rules are distributed throughout the application systems. We propose that all rules should be held in a common repository, just like data and information. This repository should be able to handle all rules: validations, calculations, deductions, and invocations. The best choice for organizing such a repository is semantically, as this allows both information and rules to share the same technology.

Technology:      Semantic rules data store

Knowledge:     The combination of information and rules manifested as an inference engine capable of executing the rules. However it is unrealistic to expect rules to be executed only within the inference engine, so rules within spreadsheets, workflows, applications, and calculation engines should be synchronized with the rules repository.

Technology:      Rules inference engine together with synchronization interfaces

Action:             The actions invoked by the knowledge manifested as a workflow engine capable of invoking external actions via web-service interfaces.

Technology:      Workflow or temporal rules engine.

Visualize:         A portal through which the data, information, knowledge, and actions can be presented, as well as through which the model and rules can be configured.

Technology:      Portal, preferably one whose presentation is semantically deduced from the action, knowledge, information, and data

Control:            Either part of the visualization portal or a separate application through which the users can execute control based on the actions.

Technology:      Conventional control system interface through which users can control the plant.
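
To make the idea of rules held as data more concrete, here is a minimal sketch in Python rather than in a semantic rules language such as SPIN. The rule names, the 365-day and 180-day thresholds, and the action names are all hypothetical, chosen only to illustrate a shared repository whose rules recommend a remedial action when they are violated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """One entry in the (hypothetical) shared rules repository."""
    name: str
    kind: str                          # validation, calculation, deduction, or invocation
    violated: Callable[[Facts], bool]  # returns True when the facts break the rule
    action: str                        # remedial action to recommend


# Rules live in one place as data, not inside individual applications.
# Thresholds and action names are illustrative assumptions.
RULES: List[Rule] = [
    Rule("operator-training-current", "validation",
         lambda f: f["days_since_training"] > 365,
         "schedule-refresher-training"),
    Rule("pump-maintenance-current", "validation",
         lambda f: f["days_since_maintenance"] > 180,
         "raise-maintenance-work-order"),
]


def recommended_actions(rules: List[Rule], facts: Facts) -> List[str]:
    """Knowledge + Action: evaluate every rule and collect the actions to invoke."""
    return [rule.action for rule in rules if rule.violated(facts)]


facts = {"days_since_training": 400, "days_since_maintenance": 30}
print(recommended_actions(RULES, facts))
```

In a fuller architecture the returned action names would be handed to a workflow engine rather than printed, which is the step that closes the loop between knowledge and results.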

Is it feasible?

One of the key architectural elements is the management of rules as data, along with the closely related model and action elements. Outside the process manufacturing industry, especially in finance and insurance, rules engines have been in use for some time, so there are several vendors, listed below. Of particular interest are TopQuadrant, the sponsor of SPIN (SPARQL Rules), which provides a standards-based way to define rules and constraints for Semantic Web data, and OntoRule, an EU project that brings together leading vendors of knowledge-based systems and a handful of top research institutions to develop the technology that will empower business policies in the enterprise of the future.

  • Corticon: Decision table or rulesheet-centric business rules management system
  • FICO Blaze Advisor: General purpose business rules management system with .Net, Java and COBOL deployment
  • IDIOM: Decision-centric business rules management system
  • IBM ILOG Rules: General purpose business rules management system with .Net, Java and COBOL deployment
  • InRule: .Net-based business rules management system
  • JBoss Drools/JBoss Enterprise BRMS: Open source business rules management system that is working on updating its Decision Tables
  • Modellica: European business rules management system focused on the credit risk business, available in the US through GDS Link
  • OntoRule: Leading vendors of knowledge-based systems and a handful of top research institutions join their efforts to develop the technology that will empower business policies in the enterprise of the future
  • OpenRules Decision Management System: Open source Excel-based business rules management system
  • Pegasystems: A unified business rules and process management environment, now including the Chordiant decision management products
  • Progress BRMS: Drools-based business rules management system acquired with Savvion
  • Sparkling Logic: A “social-logic” platform for managing business rules
  • TopQuadrant: TopBraid Suite™ leverages emerging technology to help its customers connect silos of data, systems and infrastructure and to build flexible applications from linked data models. SPIN is a standards-based way to define rules and constraints for Semantic Web data
  • Visual Rules: Java-based business rules management system from Bosch Innovations
  • XpertRule: Develops advanced Business Rules Management and Expert System software that helps organizations:
    • Capture expertise and skills in risk assessment, advising, and performance improvement as well as in selling and supporting both products and services.
    • Comply with regulations, policies, laws and legislation.
    • Automate process orchestration for intelligent front-end user interface navigation, back-end process flow, and data interchange.
  • Zementis: A cloud-based execution platform for business rules and analytic models


Icebergs of information loiter throughout process manufacturing IT waiting to sink any information integration project. The impact of semantic technologies is being felt in medicine, life sciences, intelligence, and elsewhere but can it solve this problem in process manufacturing? The ability to federate information from multiple data-sources into a schema-less structure, and then deliver that federated information in any format and in accordance with any standard schema uniquely positions semantic technology. Is this a sweet spot for semantic technologies?

Process Manufacturing Application Focus over the Years

Over the years we have been solving problems within process manufacturing IT only to uncover more problems. At first the problem was measurement data in silos, which was solved by the introduction of real-time data historians. However that created the problem of data visibility, solved by the introduction of graphical user interfaces. This introduced data overload, which was partially solved by the introduction of analytical tools to digest the information and produce diagnostics. Unfortunately these tools were difficult to deploy across all assets within an organization, so we have been trying to solve that problem with information models. The current problem is how to convert the diagnostics into actionable knowledge with the use of work-flow engines, while ensuring the sustainability of applications as solutions increase in complexity.

Process Manufacturing Application Problems and Solutions over the Years

  • 1985-1995
    • Problem: Measurement data in silos
    • Industry response: Real-time databases collecting measurements (proprietary)
    • Consequence: Data but no user access
  • 1990-2000
    • Problem: Data access and visualization
    • Industry response: Graphical user interfaces, trending and reporting tools (proprietary)
    • Consequence: Data overload
  • 1995-2005
    • Problem: Analysis and business intelligence
    • Industry response: Analytical tools to digest data into information and diagnostics
    • Consequence: Deployability of analysis to all assets
  • 2000-2010
    • Problem: Contextualized information
    • Industry response: Plant data models (ProdML, ISA-95, ISO15926, IEC 61970/61968, proprietary)
    • Consequence: Interpretation limited to experts
  • 2005-2015
    • Problem: Consistent actioning
    • Industry response: ISO-9001
    • Consequence: Complexity, much more than RTDB, limiting sustainability
  • 2010-
    • Problem: Sustainability
    • Industry response: Outsourcing, standards
    • Consequence: Improved ongoing application benefits

 

However it is not only the increased technological complexity that is causing problems. Business decisions now cross many more business boundaries. When measurement data was trapped in silos we were content with unit-wide or plant-wide data historians. Now a well-performance problem might involve a maintenance engineer located in Houston accessing a Mimosa[1]-based maintenance management system, an operations engineer located in Aberdeen accessing an OPC-UA[2]-based data historian, a production engineer located in London accessing a custom system driven by WITSML[3]-based feeds, and a facilities engineer using an ISO-15926[4] facilities management model. Not only are the participants in different locations and business units, but they also rely on different systems using different models to support their decision making. However they should all be talking about the same well, measured by the same instruments, producing the same flows, and processed by the same equipment.

The problem is that these operational support systems are not simply data silos whose homogeneous data we need to merge into one to answer our questions. In fact these operational support systems are icebergs of information. Above the surface they publish a public perspective focused on the core operational function of the application. However this data needs context, so below the surface is much of the same information that is contained in other systems. This information provides the context to the operational data so that the operational system can perform its required functions. For example the historian needs to know something about the instruments that are the source of its measurements; maintenance management systems need to know not only about the equipment to be maintained but the location of that equipment, physically and organizationally.

Figure 1: Icebergs of Information

Icebergs of information are not limited to the operational data stores deployed in organizations. An essential practice in these days of interoperability requirements is the adoption of model standards. However even these exhibit the same problems, as shown by the diagram below. This diagram maps each available standard to its focus within the hydrocarbon supply chain.

Figure 2: Multiple Overlapping Model Standards

Increasing regulatory and competitive demands on the business are forcing decision making to be more timely, and to be more integrated across the traditional business boundaries. However these icebergs are getting in the way of effective decision making.

One way to make any or all of this information available to consumers is to create the bigger iceberg. ‘Simply’ create the relational database schema that covers every past, current, and future business need, and build adapters to populate this database from the operational data stores. Unfortunately this mega-store can only get more complex as it has to keep up with an expanding scope of information required to support the decision making processes.

Figure 3: Integration using the Bigger Iceberg

Alternatively we can keep building data-marts every time someone has a different business query.  However these do not provide the timeliness required to support operational decision making.

The Need for a Babel-Fish

We cannot meet the decision-making needs of the business with one mega-store, because it will never keep up with changing business requirements. Instead we need a babel-fish (with thanks to The Hitchhiker’s Guide to the Galaxy).

This babel-fish can consume all of the different operational data in different standards, and translate it into any standard that the end-consumer wants. Thus the babel-fish will need to know that OPC UA’s concept ‘hasInstrument’ has the same meaning as Mimosa’s concept of ‘Instrumented’. Similarly 10FIC107 from an OPC UA provider is the same as 10-FIC-107 from Mimosa.

  1. Information providers (operational data stores) within the business will want to provide information according to their capabilities, but preferably using the standards appropriate for their application. For example, measurements should use OPC UA and maintenance should use Mimosa.
  2. Information consumers will want to consume information in the form of one or more standards appropriate for their application.

Figure 4: Integration babel-fish

The Semantic/RDF model comes to the rescue

First of all, a definition: a semantic model means organizing all data and knowledge as RDF triples {subject, property, object}. Thus {:Peter, :hasAge, 21^^:years} and {:Pump101, :manufacturedBy, :Rotek} are examples of RDF triples. RDF triples can be persisted in a variety of ways: SQL tables, custom organizations, NoSQL stores, XML files, and many more. If we were designing a relational database to hold these RDF triples we would need only one ‘table’, so it may appear that we have no schema in the relational database design sense, where key relationships enforce integrity and unique indices enforce uniqueness. However we can add other statements about the data such as {:Pump101, :type, :ReciprocatingPump} and {:ReciprocatingPump, :subClassOf, :Pump}[5]. Used in combination with a reasoner, these let us infer consequences from the asserted facts, such as that :Pump101 is a type of :Pump, and that Peter is not a :Pump, despite rumors to the contrary. These triples can be visualized as the links in a graph, with the subject and object being the nodes of the graph and the property the name of the edge linking these nodes:

Figure 5: RDF Triples as a graph
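
As a rough illustration of how a reasoner derives new facts from asserted triples, the following Python sketch stores the pump statements from the text as plain tuples and applies a single rule: if x has type C and C is a subclass of D, then x also has type D. This is only an assumption-laden toy, not a real RDF store or OWL reasoner, and the function name infer_types is invented for the example.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# The asserted facts from the text, with short labels instead of URIs.
triples: Set[Triple] = {
    (":Pump101", ":manufacturedBy", ":Rotek"),
    (":Pump101", ":type", ":ReciprocatingPump"),
    (":ReciprocatingPump", ":subClassOf", ":Pump"),
}


def infer_types(graph: Set[Triple]) -> Set[Triple]:
    """Apply one rule to a fixed point: x type C, C subClassOf D => x type D."""
    inferred = set(graph)
    while True:
        new = set()
        for (s, p, c) in inferred:
            if p != ":type":
                continue
            for (c2, p2, d) in inferred:
                if p2 == ":subClassOf" and c2 == c:
                    new.add((s, ":type", d))
        if new <= inferred:          # nothing new was derived, so stop
            return inferred
        inferred |= new


closure = infer_types(triples)
print((":Pump101", ":type", ":Pump") in closure)
```

The inferred statement is just another triple in the same graph, which is why asserted and derived knowledge can be queried in exactly the same way.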

Over the years, new modeling metaphors have been introduced to solve perceived or actual problems with their predecessors. For example the Relational Model had perceived difficulties associated with reporting, model complexity, flexibility, and data distribution. A semantic model helps solve these problems.

Figure 6: Evolution of Model Metaphors

  • In response to the perceived reporting issues, OLAP techniques were introduced along with the data warehouse. This greatly eased the problem of user-reporting, and data mining. However it did introduce the problem of data duplication.
    • A semantic model can query against a federated model in which information is distributed throughout the original data sources.
  • In response to the perceived complexity issues, various forms of object-orientated modeling were introduced. There is no doubt that it is easier to think of one’s problem in terms of an object model rather than a complex relational or ER model, especially when there are a large number of entities and relations.
    • The semantic model is built around the very simple concept of statements of facts such as {:Peter, :hasAge, 21^^:years}, and {:Pump101, :manufacturedBy, :Rotek} combined with statements that describe the model such as {:Pump101, :type, :ReciprocatingPump} and {:ReciprocatingPump, :subClassOf, :Pump}.
  • The model flexibility problem occurs when, after the model has been designed, the business needs the model to change. In response to this flexibility issue, the choice is to make the original model anticipate all potential uses but then risk complexity, or use an object-relational approach in which it is possible to add new attributes without changing the underlying storage schema.
    • In semantic models these relationships are expressed in triples, using RDFS, SKOS, OWL, etc. Thus RDF is also used as the physical model (in RDF stores, at least).
  • There have been various responses to data distribution.
    • In the relational world there is not much choice other than to replicate the data from heterogeneous data stores using Extract-Transform-Load (ETL) techniques. In the case of homogeneous but distributed databases, distributed queries are possible, although this requires intimate knowledge of all the schemas in all of the distributed databases.
    • In the object-orientated world we are in a worse situation: it is very difficult to manage a distributed object in which different objects are distributed or attributes are distributed.

The good news is that a semantic approach is the ideal (or even the only) approach that can solve the information integration problem as follows:

  1. Convert to RDF normal form: Convert all source data into RDF. The data can be left at source and fetched on demand (federated) or moved into temporary RDF storage.
    • There are already standard ways of doing this for any spreadsheet, relational database, XML schema, and more. For example, TopBraid Suite (http://www.topquadrant.com/products/TB_Suite.html) provides converters and adapters for all common data sources. It is relatively easy to create further mappings for sources such as OPC UA. The dynamic adapters act as SPARQL endpoints[6].
  2. Federated data model: Create ‘rules’ that map one vocabulary to another.
    • The language of these rules would be RDFS, SKOS and OWL. For example you can declare {OPCUA:hasInstrument, owl:sameAs, Mimosa:Instrumented}. Note that these are simply additional statements expressed in RDF, which are then used by a reasoner to infer consequences such as :FI101 being the same as :10FIC101.
    • More sophisticated rules can also be created directly using RDF and SPARQL. For some examples, see SPIN or SPARQL Rules at http://spinrdf.org/ and http://www.w3.org/Submission/2011/SUBM-spin-overview-20110222/
  3. Chameleon data services: Create consumer queries that extract the information from the combined model into the standard required using SPARQL queries.
    • For example, even though all instrument data is in OPC UA, a consumer could use a Mimosa interface to fetch this data. The results can then be published as web-services for consumption by external applications using SPARQLMotion (http://www.topquadrant.com/products/SPARQLMotion.html).

Figure 7: Federation End-to-End

Let’s look into these steps in detail:

Convert to RDF normal form

Despite the fact that data will be stored in different formats (relational, XML, object, Excel, etc.) according to different schemas, it can always be converted into RDF triples. ‘Always’ is a strong word, but it really does work. There are already ways of doing this for any spreadsheet, relational database, XML schema, and more, and it is relatively easy to create further mappings for sources such as OPC-UA. The data can be left at source and fetched on demand (federated) or moved into temporary RDF storage. For example, TopBraid Suite (http://www.topquadrant.com/products/TB_Suite.html) provides converters and adapters for all common data sources.

Figure 8: Conversion to RDF Normal Form
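
The conversion step can be pictured with a small Python sketch that turns rows of a relational-style table into triples: the key column becomes the subject, each other column name becomes a property, and each cell becomes an object. The table contents, the column names, and the rows_to_triples helper are hypothetical; a production converter would also mint proper URIs and typed literals.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def rows_to_triples(rows: List[Dict[str, str]], key: str, prefix: str) -> List[Triple]:
    """Turn each row into one triple per non-key column: {subject, property, object}."""
    triples: List[Triple] = []
    for row in rows:
        subject = f"{prefix}:{row[key]}"
        for column, value in row.items():
            if column != key and value != "":   # empty cells produce no statement
                triples.append((subject, f"{prefix}:{column}", value))
    return triples


# Hypothetical maintenance table with one row per piece of equipment.
equipment_rows = [
    {"tag": "Pump101", "manufacturedBy": "Rotek", "location": "Unit1"},
    {"tag": "TK102", "manufacturedBy": "", "location": "TankFarm"},
]

for triple in rows_to_triples(equipment_rows, key="tag", prefix="Mimosa"):
    print(triple)
```

Note that a missing value simply yields no triple, so the sparse and irregular data that complicates a relational schema needs no special handling here.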

Federated Data Model

A federated data model allows different graphs (aka databases) to be aggregated by linking the shared objects. This applies to real-time measurements (OPC-UA), maintenance (MIMOSA), production data (ProdML), or any external database. We can visualize this as combining the graphs of the individual operational data stores into a single graph.

Of course there will be vocabulary differences between the different data-sources. For example, in the OPC-UA data-source you might have a property OPCUA:hasInstrument, and in a MIMOSA data-source the equivalent is called Mimosa:Instrumented. So the federated data model incorporates ‘rules’ that map one vocabulary to another. The language of these rules would be RDFS, SKOS, and OWL. For example, in OWL, you can declare {OPCUA:hasInstrument owl:sameAs Mimosa:Instrumented}. Note that these are simply additional statements expressed as RDF triples which are then used by a reasoner to infer consequences such as :FI101 is actually the same as :10FIC101.

There will also be identity differences between the different data-sources. These can also be handled by additional statements, such as {:TANK#102, owl:sameAs, :TK102 }. This allows a reasoner to infer that the statement {:TK102, :has_price, 83^^:$} also applies to :TANK#102, implying {:TANK#102, :has_price, 83^^:$}.

Figure 9: Information Federated from Multiple Data Sources
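
The effect of these mapping statements can be sketched in a few lines of Python. The code below treats owl:sameAs as a symmetric alias and copies every statement across it, whether the aliased term appears as subject, property, or object. It is a deliberately simplified stand-in for a real reasoner: the price literal is reduced to the plain string "83", the statement linking :TK102 to :10FIC101 is a hypothetical addition, and chains of aliases are not followed transitively.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
SAME_AS = "owl:sameAs"

graph: Set[Triple] = {
    (":TANK#102", SAME_AS, ":TK102"),                        # identity mapping
    (":TK102", ":has_price", "83"),
    ("OPCUA:hasInstrument", SAME_AS, "Mimosa:Instrumented"),  # vocabulary mapping
    (":TK102", "OPCUA:hasInstrument", ":10FIC101"),           # hypothetical fact
}


def apply_same_as(triples: Set[Triple]) -> Set[Triple]:
    """Copy statements across owl:sameAs links until nothing new appears."""
    aliases = {(a, b) for (a, p, b) in triples if p == SAME_AS}
    aliases |= {(b, a) for (a, b) in aliases}    # sameAs is symmetric
    result = set(triples)
    while True:
        new = set()
        for (s, p, o) in result:
            if p == SAME_AS:
                continue
            for (a, b) in aliases:
                if s == a:
                    new.add((b, p, o))
                if p == a:
                    new.add((s, b, o))
                if o == a:
                    new.add((s, p, b))
        if new <= result:
            return result
        result |= new


inferred = apply_same_as(graph)
print((":TANK#102", ":has_price", "83") in inferred)
print((":TANK#102", "Mimosa:Instrumented", ":10FIC101") in inferred)
```

Both the identity difference and the vocabulary difference are resolved by the same mechanism, which is the point of expressing the mappings as ordinary triples.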

Chameleon Data Services

To extract information from the federated information, the best choice is SPARQL, the semantic equivalent of SQL, only simpler. Whilst SQL allows one to query the contents of multiple tables within a database, SPARQL matches patterns within the graph. With SQL we need to know to which table each field belongs. With SPARQL we define the graph pattern that we want to match, and the query engine will search throughout the federated graphs to find the matches. In the example illustrated below we do not need to know that the price attribute comes from one data source, whilst the volume comes from another. In fact SPARQL allows even further flexibility. The price attribute for Tank#101 could come from a different data source than the price attribute for Tank#102. This is part of the magic of semantic technology.

  

Figure 10: Graph Pattern matching with SPARQL
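
Graph pattern matching can be imitated with a short Python sketch. It is not SPARQL and not a real query engine, only an illustration of the idea: terms beginning with "?" are variables, and a query succeeds for every assignment of variables that makes all patterns appear in the merged graph. The two source lists, the tank labels, and all values other than the price of 83 are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Two hypothetical sources: one holds prices, the other holds volumes.
finance = [(":Tank101", ":has_price", "83"), (":Tank102", ":has_price", "91")]
operations = [(":Tank101", ":has_volume", "1200"), (":Tank102", ":has_volume", "800")]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if result.setdefault(term, value) != value:
                return None          # variable already bound to something else
        elif term != value:
            return None              # constant term does not match
    return result


def query(patterns: List[Triple], graph: Iterable[Triple]) -> List[Binding]:
    """Find every variable binding that satisfies all patterns."""
    triples = list(graph)
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        extended: List[Binding] = []
        for binding in bindings:
            for triple in triples:
                candidate = match(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


merged = finance + operations        # the query never asks which source holds what
rows = query([("?tank", ":has_price", "?price"),
              ("?tank", ":has_volume", "?volume")], merged)
for row in rows:
    print(row)
```

The query names only the shape of the answer. Which source supplied the price and which supplied the volume is invisible to the consumer.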

SPARQL can be used to query the federated graph directly for reporting purposes; however most consumers of the information will expect to interface to a web-service, with SOAP or REST being the most popular. These services do not have to be programmed. Instead they can be declared using SPARQLMotion (http://www.sparqlmotion.org) to produce easily consumed and adaptable web-services. The designer for SPARQLMotion is shown below:

Figure 11: Example SPARQLMotion
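
To give a feel for what a chameleon data service does, the following Python sketch re-expresses OPC UA statements in the Mimosa vocabulary and serializes the result as JSON, roughly as a REST-style service might return it. This is not how SPARQLMotion works internally; the mapping table, the function names, and the sample statement are assumptions made for illustration, and the re-expression step is only loosely analogous to a SPARQL CONSTRUCT query.

```python
import json
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping from the provider vocabulary to the one the consumer asked for.
OPCUA_TO_MIMOSA: Dict[str, str] = {"OPCUA:hasInstrument": "Mimosa:Instrumented"}


def as_vocabulary(triples: List[Triple], mapping: Dict[str, str]) -> List[Triple]:
    """Re-express triples in the consumer's vocabulary, dropping unmapped properties."""
    return [(s, mapping[p], o) for (s, p, o) in triples if p in mapping]


def to_json(triples: List[Triple]) -> str:
    """Serialize the result in a form a web-service response could carry."""
    payload = [{"subject": s, "property": p, "object": o} for (s, p, o) in triples]
    return json.dumps(payload, indent=2)


# Hypothetical instrument data held in the OPC UA source.
opcua_data = [(":TK102", "OPCUA:hasInstrument", ":10FIC101")]
print(to_json(as_vocabulary(opcua_data, OPCUA_TO_MIMOSA)))
```

The consumer sees Mimosa terms throughout and never needs to know that the data originated in an OPC UA source.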

Semantic/RDF advantages for Process Manufacturing

Despite solving a complex data integration problem, Semantic/RDF is inherently simpler. Can there be anything simpler than storing all knowledge as RDF triples?  Despite this simplicity, we do not lose any expressivity.

There is no predefined schema to limit flexibility. However the schema rules, encoded as tables and keys in the relational model, can still be expressed using RDFS, OWL, and SKOS statements.

Deconstructing all information into statements (triples) allows data from distributed sources to be easily merged into a single graph.

Any information model can be reconstructed from the merged graph using SPARQL and presented as web-services (SOAP or REST).

References

[1] MIMOSA is a not-for-profit trade association dedicated to developing and encouraging the adoption of open information standards for Operations and Maintenance in manufacturing, fleet, and facility environments. MIMOSA’s open standards enable collaborative asset lifecycle management in both commercial and military applications.

[2] The Unified Architecture (UA) is the next-generation OPC standard that provides a cohesive, secure and reliable cross-platform framework for access to real-time and historical data and events.

[3] WITSML™ (Wellsite Information Transfer Standard Markup Language) is an industry initiative to provide open, non-proprietary, standard interfaces for technology and software that monitor and manage wells, completions and workovers.

[4] ISO 15926 provides integration of life-cycle data for process plants, including oil and gas production facilities.

[5] I should really be using URIs instead of text labels for subjects, properties, and objects, but the intent of the semantic model is conveyed more simply if we avoid identifiers like ‘http://www.example.org/equipment#Pump101’ and use :Pump101.

[6] SPARQL is a query language for RDF. A SPARQL endpoint is a protocol service that makes it possible to query a data source using SPARQL. The source itself does not need to be in RDF. It can, for example, be a traditional relational database. Later in this article we will describe SPARQL in more detail and show some query examples.