Creating a DigitalTwin knowledge graph data model confronts the need for access to measurement data, so that the DigitalTwin can produce timely performance metrics, promptly identify performance issues, and so on.

However, the quantity of raw data in an Industrial IoT is staggering. A typical process manufacturing plant might have more than 100,000 measurement points, each streaming data every second or even faster. So how can the raw data be integrated to allow performance analysis?

Just … important😓: 

  • Someone(?) determines a subset of the data that can and should be replicated. 
  • A problem is that it is inevitable someone will ask a yet unformulated question about data that has *not* been replicated. Can you afford to keep revising your ETL?

Just … in case😧: 

  • The bullet is bitten and all data is replicated. 
  • An issue is that the DigitalTwin now has to solve the problem of handling vast quantities of data, a problem that IIoT applications have already solved. Why not stick to solving unsolved problems?

Just … reporting😫: 

  • Someone writes a dedicated application that pulls data from the DigitalTwin and the IIoT only when a report is needed. 
  • A problem is that every new performance metric or calculation requires another dedicated application, even if that same metric has been already created in another application.

Just … in time😂: 

  • The KnowledgeGraph pulls the IIoT data whenever required by an IntelligentGraph calculation. 
  • There are no limits on what IIoT data can be requested, but no storage or replication issues either. Also, the IntelligentGraph calculations can access the results of other IntelligentGraph calculations, simplifying the deployment of metrics.
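
To make the just-in-time option concrete, here is a minimal sketch of such an on-demand calculation script in Groovy (the scripting language used throughout this article). The historian URL and the JSON field name are hypothetical; the point is only the pattern of a script fetching a live IIoT value at query time.

// A hedged sketch: fetch the latest value for a measurement point from a
// (hypothetical) historian REST endpoint at query time, so nothing is replicated.
import groovy.json.JsonSlurper

def response = new URL("http://historian.example.com/api/points/FI-101/latest").text
def reading = new JsonSlurper().parseText(response)
return reading.value  // 'value' is an assumed field of the endpoint's JSON payload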

See this short video demonstration of how easily an #IoT-connected #DigitalTwin of a process plant can be created with #IntelligentGraph:

The IntelligentGraph-way or the Groundhog-way to efficient data analytics.pdf

Data is rather like poor red wine: it neither travels nor ages well. IntelligentGraph avoids data traveling by moving the analysis into the knowledge graph rather than moving the data to the analysis engine, making the groundhog-analysis-way obsolete.

Solving data analysis, the IntelligentGraph-way

Data is streamed to the IntelligentGraph datastore, and then analysis/calculation nodes are added to that graph, which is accessible to all users and applications.

The IntelligentGraph-way of data-analysis is to:

  • Extract-Load (losslessly) the source data into an Intelligent Knowledge Graph
  • Add analysis/calculation nodes to the KnowledgeGraph which calculate additional analysis values, aggregations, etc.
  • Report results, using any standard reporting tool
  • Add more analysis/calculation nodes as additional requests come through. These new calculations can refer to the existing results
  • … relax, you’ve become the star data analyst:-)

Solving data analysis, the Groundhog-way

Data is in the operational data-sources, data is staged in a data-warehouse/mart/lake, and then the analysis is done by the analysis engine (aka Excel), right? And, like poor red wine, constantly moving the data damages it.

The Groundhog-way of data analysis is to:

  • Extract-Transform (aka probably damage the data as well)-Load the source data into a data-warehouse/mart/lake just to make it ‘easier’ to access.
  • Realizing the required analytical results are not in the data-warehouse/mart/lake, extract some data into Excel/PowerBI/BI-tool-of-choice where you can write your analysis calculations, aggregations, etc.
  • Report analysis results, but forget to or cannot put the analysis results back into the data-warehouse/mart/lake.
  • Repeat the same process every time there is a similar (or identical) analysis required.
  • … don’t relax, another analysis request shortly follows 🙁

IntelligentGraph Benefits

IntelligentGraph moves analysis into the knowledge graph rather than moving data to the analysis engine, avoiding the groundhog-analysis-way:

  • Improves analyst performance and efficiency
    • Eliminates the need for analysts to create ETL jobs to move data to the analysis engine. 
  • Simplifies complex calculations and aggregations
    • The PathQL language greatly simplifies navigating and aggregating throughout the graph.
  • Ensures calculation and KPI concurrency
    • Calculations are performed in-situ with the data, so no need to re-export data to the analysis engine to view updated results.
  • Uses familiar scripting language
    • Scripts can be expressed in any of multiple scripting languages, including Python, JavaScript, Groovy, and Java.
  • Improves analysis performance and efficiency
    • Time-to-answer is reduced or eliminated, since analysis is equivalent to reporting
  • Ensures analysis effort is shared with all
    • Analysis results become part of the graph and can be used by others, ensuring consistency.
  • Self-documenting analysis path to raw data
    • The IntelligentGraph contains calculation scripts that define which calculations will be performed on what data (or other calculation results).
  • Improves analysis accuracy by providing provenance of all calculations
    • Trace of any analysis through to raw data is automatically available.
  • Simplifies reporting
    • Reporting tools can be used that focus on report appearance rather than calculation capability since the latter is performed in the IntelligentGraph. 
  • Highly scalable
    • IntelligentGraph is built upon RDF4J, the de-facto standard Java RDF framework, allowing the use of any RDF4J-compliant datastore.
  • Standard support
    • Access to IntelligentGraph for querying, reporting, and more is unchanged from any RDF-based KnowledgeGraph.
  • Evolutionary, not revolutionary modeling and analysis
    • Graph-based models offer the ability to evolve as data analysis needs grow, such as adding new dimensions to the data, unlike a ‘traditional’ data mart or warehouse, which usually requires a rebuild.
  • Creates the Intelligent Internet of Things
    • Scripts can access external data, such as IoT, on-demand allowing IoT-based calculations and analysis to be performed in-situ.
  • Eliminates spreadsheet-hell
    • All spreadsheet calculations and aggregations can be moved into the graph, leaving the spreadsheet as a presentation tool. This eliminates the problem of undocumented calculations and inconsistent calculations in different spreadsheets.

Since IntelligentGraph combines Knowledge Graphs with embedded data analytics, Jupyter is an obvious choice as a data analyst’s IntelligentGraph workbench.

The following are screen-captures of a Jupyter-Notebook session showing how Jupyter can be used as an IDE for IntelligentGraph to perform all of the following:

  • Create a new IntelligentGraph repository
  • Add nodes to that repository
  • Add calculation nodes to the same repository
  • Navigate through the calculated results
  • Query the results using SPARQL
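
As a flavor of the first step, creating an IntelligentGraph repository from Java follows the standard RDF4J stackable-SAIL pattern described later in this article. The sketch below is an assumption-laden outline: the IntelligentGraphSail class name is hypothetical, so consult the notebook for the exact API.

// A minimal sketch, assuming a SAIL class (here called IntelligentGraphSail)
// that stacks over any RDF4J-compliant store, per the stackable-SAIL pattern.
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.memory.MemoryStore;

MemoryStore baseStore = new MemoryStore();                        // any RDF4J store
IntelligentGraphSail sail = new IntelligentGraphSail(baseStore);  // hypothetical class name
Repository repository = new SailRepository(sail);
repository.init();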

GettingStarted is available as a Jupyter Notebook here:

GettingStartedIntelligentGraph.ipynb

This document is available for download here:

IntelligentGraph-Getting Started.pdf


SPARQLing

Using the Jupyter ISparql kernel, we can easily perform SPARQL queries over the same IntelligentGraph created above. 

GettingStarted Using SPARQL

We do not have to use Java to script our interaction with the repository: we can always use SPARQL directly, as described in the following Jupyter Notebook.

GettingStartedSPARQL.ipynb


IntelligentGraph embeds calculation and analysis capability within RDF knowledge graphs, rather than forcing analysis to be undertaken by exporting query results to external analysis applications such as Excel.

IntelligentGraph achieves this by embedding scripts into the RDF knowledge graph which are evaluated when queried with SPARQL.

Scripts, written in a variety of languages such as JavaScript, Java, and Python, access the underlying graph using simple pathPatternQL navigation.

IntelligentGraph=KnowledgeGraph+Embedded Analysis.pdf

Why IntelligentGraph?

At present, calculations over stored data are delivered either by custom code or by exporting the stored data to spreadsheets. The data behind these tools is inevitably tabular. In fact, so dominant are spreadsheets for analysis that the spreadsheet itself often becomes the ‘database’, with the inherent difficulties of syncing that data with the source system of record.

The real world is better represented as a network or graph of interconnected things. Therefore a knowledge graph is a far better storage organization than tables or objects. However, there is still the need to perform ad hoc numerical analysis over this data. 

RDF DataCube can help organize data for analysis, but the analysis still has to be performed externally. Confronted with this dilemma, knowledge graph data would typically be exported in tabular form to a datamart or directly into, yet again, a spreadsheet where the analysis could be performed.

IntelligentGraph turns this approach on its head by embedding the calculations as scripts within the knowledge graph. These scripts are evaluated on query and utilise the data in situ, so there are no concurrency issues. This allows a calculation to have knowledge of its neighbouring nodes and edges, just as an Excel cell can access other cells in the spreadsheet. Access to other nodes within the graph uses pathPatternQL navigation.

Example Data and Analysis

An Industrial Internet of Things (IIoT) application is connecting all the measurements about a process plant, such as an oil refinery, into a knowledge graph that relates the measurements to the material flows through the process equipment.

Although there is an abundance of measurements and laboratory analyses available, the values required for operating and performance monitoring are not (and mostly cannot be) directly measured. 

For example:

  • Stream Mass-Flow: direct mass-flow measurements are rare. Instead, a volume-flow measurement is used in conjunction with a measured material density to calculate the mass-flow.
  • Unit Mass-Flow Throughput: this is calculated by summing either all feed-stream mass-flows or all product-stream mass-flows.
  • Unit Mass Balance: this is calculated by differencing the feed mass-flows from the product mass-flows.
  • Product Stream Yield: this is the ratio of a stream’s mass-flow to the throughput of the unit to which the stream is connected.

Figure 1: Typical Process Flow Sheet

These are simple examples; however, they show the reliance on the knowledge graph structure to perform the analysis.

Solving data analysis, the traditional way

Data is in the database, analysis is done by the analysis engine (aka Excel), right?

Figure 2: Data analysis the traditional way

In this scenario, the local power user sets up a query to export data from the database and converts it to a format that can be imported into Excel. Increasingly complex formulae are then written to wrangle the data into the results that are required.

Why is the spreadsheet approach risky?

  • The analysis is now separated from the data. Data changes will not be reflected in the analysis. Worse still, changes to the analysis might not be propagated to all the spawned copies of the spreadsheet.
  • The data is separated from the analysis. The analysis results are rarely re-imported into the data store, where they could be combined with the source data; instead, even more data is extracted into the spreadsheet.
  • The difficulty of managing the separation of data from analysis becomes so great that in many cases the database is dispensed with entirely and the spreadsheet becomes the de-facto database.

Solving data analysis with an IntelligentGraph

The beauty of Excel is that a cell can contain either a value or a formula that can reference other cells’ values. Why not do the same with a graph: a node can have edges that terminate with a literal value, or with a formula that can reference other nodes’ values.

This is illustrated in the diagram below:

  • The :massFlow property is not measured directly, so a formula is used for its value instead. This formula references _this, the node to which the calculation is attached, and uses the method getFact() to retrieve related values. The argument of getFact() is a pathPatternQL expression.
  • The :totalProduction property is not measured directly, so a formula is used instead, which iterates over all of the ‘stream out’ nodes, retrieving the value of :massFlow for each stream. The :massFlow value is, of course, in turn a calculation.

Figure 3: Intelligent Graph Data Analysis

Why is the IntelligentGraph approach so advantageous?

  • There is no separation between data and analysis, removing the risk of stale and inaccurate data and calculations.
  • The calculations embedded within the graph can take advantage of the knowledge that is contained within that graph. This makes the calculations far simpler than those that need to be embedded in spreadsheets.
  • The calculations automatically utilize the changing knowledge on the fly.

How does IntelligentGraph Work?

Analysis is embedded in an IntelligentGraph simply by adding script literals as object values whose datatype local name identifies the scripting language (groovy, javascript, python, etc.).

The IntelligentGraph engine is provided as an RDF4J Stackable SAIL. This means that its capabilities can be combined with any other RDF4J capabilities. The choice of RDF storage remains the same as for any other RDF4J compliant framework.

Modeling with Scripts

Typically, a graph node will have associated attributes with values, such as a stream with volumeFlow and density values:

Stream Attributes:

:Stream_1
   :density ".36"^^xsd:float ;
   :volumeFlow "40"^^xsd:float .

Of course, in the ‘real world’ these measured values are sourced from outside the KnowledgeGraph and change over time. IntelligentGraph can deal with both of these requirements.

The ‘model’ of the streams can be captured as edges associated with the Unit:

:Unit_1
   :hasProductStream :Stream_1 ;
   :hasProductStream :Stream_2 ;
.

Calculate Mass Flow

The calculations are declared as literals[1] with a datatype whose local name corresponds to one of the installed script languages:

:Stream_1 :massFlow     
    "_this.getFact(':density')*    
     _this.getFact(':volumeFlow');"^^:groovy .

Calculate Total Production

A typical performance metric is to understand the total production from a unit, which is of course not directly measured. However, it can be easily expressed using existing calculated values:

:Unit_1   :totalProduction
    "var totalProduction =0.0;
    for(Resource stream : _this.getFacts(':hasProductStream'))
    {
        totalProduction += stream.getFact(':massFlow');
    }
    return totalProduction; "^^:groovy .

Instead of returning the object literal value (aka the script), IntelligentGraph will return the result of evaluating the script.

We can write this script even more succinctly using the expressive power of PathQL:

:Unit_1  :totalProduction  
    "return _this.getFacts(':hasProductStream/:massFlow').total(); "^^:groovy

However, IntelligentGraph allows us to build upon existing calculations to simply express what would normally be difficult-to-calculate metrics, such as product yield or mass balance.

Calculate Mass Yield

Any production unit has products of differing value, so a key metric is the yield of individual streams. This can easily be calculated as follows, using values that are themselves calculations:

var result= _this.getFact(":massFlow").floatValue()/ 
_this.getFact("^:hasStream/:totalProduction").floatValue();  
result;

Calculate Mass Balance

Measurements are not perfect, nor is the operation of a unit. One of the first indicators of a problem is when the mass flow in does not match the mass flow out. This can be expressed as another calculated property of a Unit:

return  _this.getFacts(":hasFeedStream/massFlow").total() -_this.getFacts(":totalProduction").total();

Querying Results

Access to the calculated values is via standard SPARQL. However, instead of returning the script literal, IntelligentGraph will invoke the script engine and return the calculated result.

Thus to access the :massFlow calculated value, the SPARQL is simply:

select ?massFlow
{
 :Stream_1 :massFlow ?massFlow 
}

If the script literal is required then the object variable can be postfixed with _SCRIPT:

select ?massFlow ?massFlow_SCRIPT 
{
 :Stream_1 :massFlow ?massFlow, ?massFlow_SCRIPT 
}

If a full trace of the calculation, including tracing calls to other scripts, is required then the object variable can be postfixed with _TRACE:

select ?massFlow ?massFlow_TRACE 
{
 :Stream_1 :massFlow ?massFlow, ?massFlow_TRACE 
}
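
Since these queries assume a default namespace, a fully runnable version needs only a prefix declaration; the namespace IRI below is illustrative:

# A complete, runnable version of the :massFlow query; the namespace IRI is an assumption.
PREFIX : <http://inova8.com/intelligentgraph/example/>
select ?massFlow
{
 :Stream_1 :massFlow ?massFlow 
}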

How to Write IntelligentGraph Scripts?

Script Languages

Any Java 9 supported language can be used, simply by making the corresponding language JAR available. 

By default, the JavaScript, Groovy, and Python JARs are installed. The complete list of compliant languages is as follows:

AWK, BeanShell, ejs, FreeMarker, Groovy, Jaskell, Java, JavaScript, JavaScript (Web Browser), Jelly, JEP, Jexl, jst, JudoScript, JUEL, OGNL, Pnuts, Python, Ruby, Scheme, Sleep, Tcl, Velocity, XPath, XSLT, JavaFX Script, ABCL, AppleScript, Bex script, OCaml Scripting Project, PHP, Smalltalk, CajuScript, MathEclipse

Script Context Variables

In addition, each script has access to the following predefined variables that allow the script to access the context within which it is being run.

  • _this, a Thing corresponding to the subject of the triples for which the script is the object. Since this is available, helper functions (described below) are provided to navigate edges to or from this ‘thing’.
  • _property, a Thing corresponding to the predicate or property of the triples for which the script is the object.
  • _customQueryOptions, a HashMap<String, Value> of name/value pairs corresponding to the pairs of additional arguments to the SPARQL extension function. These are useful for passing application-specific parameters.
  • _builder, an RDF4J graph builder object allowing a graph to be constructed (and manipulated) within the script. A graph cannot be returned from a SPARQL function; however, the IRI of the graph can be returned, and any graph created by a script will be persisted.
  • _tripleSource, the RDF4J TripleSource to which the subject, predicate, and object of the triple belong.
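
As an illustration of these context variables, the hedged Groovy sketch below uses _customQueryOptions to parameterize a calculation; the parameter name 'threshold' and the fallback value are purely illustrative:

// A sketch: read an application-specific parameter passed via _customQueryOptions
// (a HashMap<String, Value>), falling back to a default when it is absent.
def option = _customQueryOptions?.get("threshold")
def threshold = (option != null) ? option.floatValue() : 100.0
def flow = _this.getFact(":volumeFlow").floatValue()
return (flow > threshold) ? "HIGH" : "NORMAL"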

Fact and Path Functions

The spreadsheet’s secret sauce is the ability of a cell formula to access the values of other cells, either individually or as a set. IntelligentGraph provides this functionality with several methods associated with Thing, which are applicable to the _this Thing initialized for each script as the subject of the triple.

Thing.getFact(String pathPattern) returns Value

Returns the value of the node referenced by the pathPattern. For example, “:volumeFlow” returns the object value of the :volumeFlow edge relative to the _this node. The pathPattern allows for more complex path navigation.

Thing.getFacts(String pathPattern) returns Values

Returns the values of the nodes referenced by the pathPattern. For example, “:hasProductStream” returns an iterator over all object values of the :hasProductStream edge relative to the _this node. The pathPattern allows for more complex path navigation.

Thing.getPath(String pathQL) returns Path

Returns the first (shortest) path referenced by the pathQL. For example, “:parent{1..5}” returns the path to the first ancestor of the _this node. The pathQL allows for more complex path navigation.

Thing.getPaths(String pathQL) returns PathResults

Returns all paths referenced by the pathQL. For example, “:parent{1..5}” returns an iterator, starting with the shortest path, over all paths to the ancestors of the _this node. The pathQL allows for more complex path navigation.

Path Patterns

Spreadsheets are not limited to accessing just adjacent cells; neither is the IntelligentGraph. PathPatterns provide a powerful way of navigating from one Thing node to another. PathPatterns are inspired by SPARQL and propertyPaths, but a richer, more expressive, PathQL was required for the IntelligentGraph.

Examples

Examples of PathQL patterns are as follows:

_this.getFact(":hasParent")

will return the first parent of _this.

_this.getFact("^:hasParent")

will return the first child of _this.

_this.getFacts(":hasParent/:hasParent")

will return the grandparents of _this.

_this.getFacts(":hasParent/^:hasParent")

will return the siblings of _this.

_this.getFacts(":hasParent[:gender :female]/:hasParent")

will return the maternal grandparents of _this.

_this.getFacts(":hasParent[:gender :female]/:hasParent[:gender :male]")

will return the maternal grandfather of _this.

_this.getFacts(":hasParent[:gender [rdfs:label 'female']]")

will return the mother of _this, but using the label instead of the IRI.

_this.getFacts(":hasParent[eq :Peter]/:hasParent[:gender :male]")

will return the grandfather of _this, who is the parent of :Peter.

_this.getFacts(":hasParent[ne :Peter]/:hasParent[:gender :male]")

will return the grandfathers of _this, who are not the parent of :Peter.

The following diagram visualizes a path through a genealogical graph, from _this to find the parents of a maternal grandfather born in Maidstone:

_this.getFacts("/:parent[:gender :female]/:parent[:gender :male, :birthplace [rdfs:label 'Maidstone']]/:parent")


Figure 4: PathPatternQL Example

How Is Performance?

IntelligentGraph takes the following actions to improve performance:

  1. All intermediate calculation results are cached, keyed by the subjectNode, predicate, and customQueryOptions.
  2. The cache can be cleared using the SPARQL function ClearCache.
  3. The SPARQL function ObjectValue takes as its argument the subject, predicate and objectValue. If the objectValue supplied is not of script datatype, the function will immediately return the objectValue.
  4. Circular functions, in which A calls B calls A, are detected and rejected.
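
The IRIs under which these SPARQL functions are registered are not shown here, so purely as a hedged sketch, clearing the cache might look something like the following; the function prefix is an assumption:

# A sketch only: the namespace under which ClearCache is registered is an assumption.
PREFIX script: <http://inova8.com/script/function/>
select ?result
{
 BIND(script:clearCache() AS ?result)
}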

Can I Debug?

Since IntelligentGraph combines calculations with the knowledge graph, it is inevitable that any evaluation will involve calls to the values of other nodes, which are in turn calculations. For this reason, IntelligentGraph supports tracing and debugging:


Figure 5: Tracing Calculation

How Do I Add Intelligence to my RDFGraph?

Download

The project is located on GitHub, from where the intelligentgraph.jar can be downloaded:

The intelligentgraph.jar does not include all of the scripting-language and other dependencies, so to use it you must ensure that all dependencies are already available.

Install

IntelligentGraph will work only with RDF4J version 3.3.0 and above. 

Copy intelligentgraph.jar to /usr/local/tomcat/webapps/rdf4j-server/WEB-INF/lib/intelligentgraph.jar

The RDF4J server will need to be restarted for it to recognize this new JAR and initiate the scripting engine.

[1] In this case the script uses Groovy, but any Java 9 compliant scripting language can be used, such as JavaScript, Python, Ruby, and many more.

Providing answers to users’ analysis, search, visualization, and other questions about their own data

Creating an overall solution that presents data in a useful way can be challenging, but OData2SPARQL and Lens2OData solve this.

  • RDF-Graph: Data + Model = Information, allowing us to combine your raw data with an adaptable model to create meaningful information
  • OData2SPARQL: Information + Rules = Knowledge, providing the ability to access that information combined with additional rules (SPIN and SHACL) to deliver useful knowledge that can be consumed by applications and BI tools
  • Lens2OData: Knowledge + Action = Results, allowing users to easily navigate, search, explore, and visualize this knowledge in such a way that it is easy to take action and produce results

To see this all in action we have prepared a demonstrator that can be downloaded, and the following videos which illustrate the capabilities of this demonstrator.

  • Explore the provenance of data sources, which is retained by the RDF-graph and OData2SPARQL rather than being lost in typical ETL processing.

  • Explore the Transport for London train lines, stations, and zones, illustrating how easy it is to transform any dataset to an RDF-Graph and immediately get the benefits of OData access and the Lens UI/UX.

To download and run this demonstrator go to Docker hub here

The OData2SPARQL V4 endpoint now publishes any SHACL nodeShapes defined in the model, mapping them to OData complexTypes, alongside the existing capability of publishing RDFS and OWL classes and SPIN queries.

RDF graphs provide the best (IMHO) way to store and retrieve information. However, creating user applications is complicated by the lack of standard RESTful access to the graph, certainly not one that is supported by mainstream UI frameworks. OData2SPARQL solves that by providing a model-driven RESTful API that can be deployed against any RDF graph.

The addition of support for SHACL allows the RESTful services published by OData2SPARQL to not only reflect the underlying model structure, but also match the business requirements as defined by different information shapes.

Since OData is used by many applications and UI/UX frameworks as the 21st-century replacement for ODBC/JDBC, the addition of SHACL support means that user interfaces can be automatically generated that match the SHACL shapes metadata published by OData2SPARQL.

SHACL Northwind Model Example

The ubiquitous Northwind model contains sample data of employees, customers, products, orders, order-details, and other related information. This was originally a sample SQL database, but it is also available in many other formats, including RDF.

OData2SPARQL using RDFS+ model

Using OData2SPARQL to map the Northwind RDF-graph will create entity-sets of Employees, Customers, Orders, OrderDetails, and so on.

OData2SPARQL maps RDFS+

  • Any rdfs:Class/owl:Class to an OData EntityType and EntitySet
  • Any owl:DatatypeProperty to a property of an OData EntityType
  • Any owl:ObjectProperty to an OData navigationProperty
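
For example, under this mapping an owl:Class for orders would surface in the OData $metadata document as something like the following CSDL fragment; the property names are illustrative only:

<EntityType Name="Order">
    <Key>
        <PropertyRef Name="subjectId"/>
    </Key>
    <Property Name="subjectId" Type="Edm.String" Nullable="false"/>
    <Property Name="orderDate" Type="Edm.DateTimeOffset"/>   <!-- from an owl:DatatypeProperty -->
    <NavigationProperty Name="customer" Type="NW.Customer"/> <!-- from an owl:ObjectProperty -->
</EntityType>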

Within RDF it is possible to create an Order-type of thing without that order having a customer or any order-details. Note that this is not a disadvantage of RDF; in fact, it is one of the many advantages of RDF, as it allows an order to be registered before all of its other details are available.

However, when querying an RDF-graph we are likely to ask what orders exist for a customer, made by an employee, and with at least one order line-item (order-detail) that includes the product, quantity, and discount. We could say that this is a qualified order.

If we were to request the description of a particular order using OData2SPARQL:

OData2SPARQL Request:
 …/odata2sparql/northwind/Order('NWD~Order-10248')?$format=json
OData2SPARQL Response:
{
       @odata.context: "http://localhost:8080/odata2sparql/northwind/$metadata#Order/$entity",
       customerId: "NWD~Customer-VINET",
       employeeId: "NWD~Employee-5",
       freight: "32.380001",
       label: "Order-10248",
       lat: 49.2559582,
       long: 4.1547448,
       orderDate: "2016-07-03T23:00:00Z",
       …
}

Now, the above response includes every OData property (aka RDF datatypeProperty) we know about Order(‘NWD~Order-10248’); not all are shown above, for brevity.

However, we might want to include related information: that which is connected via an OData navigationProperty (aka RDF objectProperty). To include this related information we simply augment the request with $select=* and $expand=* as follows:

OData2SPARQL Request: 
…/odata2sparql/northwind/Order('NWD~Order-10248')?$select=*&$expand=*&$format=json
OData2SPARQL Response: 
{
       @odata.context: "http://localhost:8080/odata2sparql/northwind/$metadata#Order(*)/$entity",
       freight: "32.380001",
       lat: 49.2559582,
       long: 4.1547448,
       orderDate: "2016-07-03T23:00:00Z",
       subjectId: "NWD~OrderDetail-10248”,
       …
       employee: {
              birthDate: "1975-03-04",
              employeeAddress: "14 Garrett Hill",
              employeeCity: "London",
              employeeCountry: "UK",
              subjectId: "NWD~Employee-5",
              …
       },
       orderRegion: null,
       shipVia: {
              shipperCompanyName: "Federal Shipping",
              shipperPhone: "(503) 555-9931",
              subjectId: "NWD~Shipper-3"
       },
       customer: {
              customerAddress: "59 rue de l'Abbaye",
              subjectId: "NWD~Customer-VINET",
              …
       },
       hasOrderDetail: [{
              discount: 0,
              orderDetailUnitPrice: 14,
              orderId: "NWD~Order-10248",
              productId: "NWD~Product-11",
              quantity: 12,
              subjectId: "NWD~OrderDetail-10248-11",
              …
       },
       {
              discount: 0,
              orderDetailUnitPrice: 9.8,
              orderId: "NWD~Order-10248",
              productId: "NWD~Product-42",
              quantity: 10,
              subjectId: "NWD~OrderDetail-10248-42",
              …
       },
       {
              discount: 0,
              orderDetailUnitPrice: 34.8,
              orderId: "NWD~Order-10248",
              productId: "NWD~Product-72",
              quantity: 5,
              subjectId: "NWD~OrderDetail-10248-72",
              …
       }],
       …
}

OData2SPARQL with SHACL Shapes

Well, we got what we asked for: everything (a lot of the information returned has been hidden above for clarity). However, this might be a little overwhelming, as the actual question we wanted answered was:

Give me any qualified orders that have a salesperson and customer, with at least one line-item that has product, quantity, and discount defined.

The beauty of RDF is that it is based on an open-world assumption: anything can be said. Unfortunately, the information-transaction world is steeped in referential integrity, so it sees the looseness of RDF as anarchical. SHACL bridges that gap by defining patterns to which RDF-graphs should adhere if they are to be classified as a particular shape. However, SHACL still allows the open world to co-exist.

Closed or Open World? We can have the best of both worlds:

  • The open-world assumption behind RDF-graphs, in which anyone can say anything about anything, combined with
  • The closed-world assumption behind referential integrity, which limits information to predefined patterns

Let’s shape what a qualified order is, using SHACL. Using this shapes constraint language we can say that:

  • A QualifiedOrder comprises
    • An Order, with
      • One and only one Customer
      • One and only one Employee (salesperson)
      • At least one OrderDetail, each with
        • Optionally one discount
        • One and only one product
        • One and only one quantity

OK, this is not overly complex, but the intention is not to befuddle with complexity; rather, it is to illustrate with a simple yet meaningful example.

The above ‘word-model’ can be expressed using the SHACL vocabulary as follows, and can be included with any existing schema/model of the information:

shapes:QualifiedOrder
  rdf:type sh:NodeShape ;
  sh:name "QualifiedOrder" ;
  sh:targetClass model:Order ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:maxCount 1 ;
      sh:minCount 1 ;
      sh:name "mustHaveCustomer" ;
      sh:path model:customer ;
    ] ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:maxCount 1 ;
      sh:minCount 1 ;
      sh:name "mustHaveSalesperson" ;
      sh:path model:employee ;
    ] ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:inversePath model:order ;
      sh:minCount 1 ;
      sh:name "mustHaveOrderDetails" ;
      sh:node [
          rdf:type sh:NodeShape ;
          sh:name "QualifiedOrderDetail" ;
          sh:property [
              rdf:type sh:PropertyShape ;
              sh:maxCount 1 ;
              sh:minCount 0 ;
              sh:name "mayHaveOrderDetailDiscount" ;
              sh:path model:discount ;
            ] ;
          sh:property [
              rdf:type sh:PropertyShape ;
              sh:maxCount 1 ;
              sh:minCount 1 ;
              sh:name "mustHaveOrderDetailProduct" ;
              sh:path model:product ;
            ] ;
          sh:property [
              rdf:type sh:PropertyShape ;
              sh:maxCount 1 ;
              sh:minCount 1 ;
              sh:name "mustHaveOrderDetailQuantity" ;
              sh:path model:quantity ;
            ] ;
          sh:targetClass model:OrderDetail ;
        ] ;
    ] ;
.

OData2SPARQL has been extended to extract not only the RDFS+ model and any SPIN operations, but now also any SHACL shapes. These SHACL shapes are mapped as follows:

OData2SPARQL maps SHACL

  • any SHACL nodeShape to an OData ComplexType, and an OData EntityType and EntitySet
  • any SHACL propertyShape to an OData property with the same restrictions

An OData2SPARQL request for an EntityType derived from a SHACL shape will construct a SPARQL query adhering to the shape’s restrictions, as shown in the example request below:

OData2SPARQL Request: 

…/odata2sparql/northwind/shapes_QualifiedOrder('NWD~Order-10248')?$select=*&$expand=*&$format=json
OData2SPARQL Response: 

{
       @odata.context: "http://localhost:8080/odata2sparql/northwind/$metadata#shapes_QualifiedOrder(*)/$entity",
       QualifiedOrder: {
              hasOrderDetail: [{
                     discount: 0,
                     quantity: 12,
                     product: {
                           subjectId: "NWD~Product-11",
                           …
                     }
              },
              {
                     discount: 0,
                     quantity: 10,
                     product: {
                           subjectId: "NWD~Product-42",
                           …
                     }
              },
              {
                     discount: 0,
                     quantity: 5,
                     product: {
                           subjectId: "NWD~Product-72",
                           …
                     }
              }],
              customer: {
                     subjectId: "NWD~Customer-VINET"
              },
              employee: {
                     subjectId: "NWD~Employee-5",
              }
       },
       subjectId: "NWD~Order-10248",
       …
}

In OData2SPARQL terms this means the following request:

OData2SPARQL Request: 

…/odata2sparql/northwind/shapes_QualifiedOrder?$format=json
OData2SPARQL Response: 

(Returns all orders that match the shape)

The shape does not imply that all orders have to follow the same shape rules. There could still be orders without, for example, any line-items. These are still valid; they simply do not match this shape restriction.

Extending a SHACL shape

SHACL allows shapes to be derived from other shapes. For example, we might further qualify an order as one that has a shipper specified: a ShippingOrder.

This can be expressed in SHACL as follows:

shapes:ShippingOrder
  rdf:type sh:NodeShape ;
  sh:name "ShippingOrder" ;
  sh:node shapes:QualifiedOrder ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:maxCount 1 ;
      sh:minCount 1 ;
      sh:path model:shipVia ;
    ];
.
OData2SPARQL Request: 

…/odata2sparql/northwind/shapes_ShippingOrder('NWD~Order-10248')?$select=*&$expand=*&$format=json
OData2SPARQL Response:

(This is the same structure as the QualifiedOrder with the addition of the Shipper details.)

SHACL Northwind User Interface Example

One of the motivations for using OData2SPARQL to publish RDF is that it brings together the strength of a ubiquitous RESTful interface standard (OData) with the flexibility and federation ability of RDF/SPARQL. This opens up many popular user-interface development frameworks and tools, such as OpenUI5 and SAP WebIDE, and greatly simplifies the creation of user-interface applications.

OpenUI5 already makes it very easy to create a user interface of, say, Orders. OpenUI5 uses the metadata published by OData (and hence the RDF schema published by OData2SPARQL, as illustrated in Really Rapid RDF Graph Application Development).

With the addition of SHACL shapes to OData2SPARQL, we can create a UI/UX derived from the OData metadata, including these SHACL shapes. For example, the QualifiedOrder shape is what we would expect of a typical master-detail UI: the Order is the ‘master’ and the line-items of the order are the ‘detail’. With OpenUI5 it is as simple to publish a SHACL shape as it is to publish any OData EntitySet based on an rdfs:Class.

Requesting all QualifiedOrders will show a list of any (not all) orders that satisfy the restrictions in the SHACL nodeShape. OData2SPARQL does this by constructing a SPARQL query aligned with the nodeShape and containing the same restrictions.

Figure 1: A Grid displaying all QualifiedOrders derived directly from the shape definition

Similarly, requesting everything about a particular QualifiedOrder will show a master-detail view.

  • The master displays required salesperson and customer
  • The detail contains all line items of the order (the shape specifies at least one)
  • Each line item displays product, quantity, and discount

Figure 2: A Master-Detail Derived Directly from the QualifiedOrder Shape

The benefits of OData2SPARQL+SHACL

The addition of SHACL support to OData2SPARQL enables the model-centric approach of RDF-graphs to drive both a model-centric RESTful interface and model-centric user-interface.

OData not only offers a RESTful interface to any datastore for performing CRUD operations, it also provides a very powerful query interface to the underlying datastore, making it a candidate for the ‘universal’ query language. However, the query capability is often neglected. Why?

Simple OData Query Example

It is always easier to start simply. So the question I want answered from a Northwind datastore is:

Find customers located in France

Given an OData endpoint, the query is simply answered with the following URL:

http://localhost:8080/odata2sparql.v4/NW/

Customer?

$filter=contains(customerCountry,'France')

The elements of this URL are as follows:

  1. The first line identifies the OData endpoint, in this case a local OData2SPARQL (http://inova8.com/bg_inova8.com/offerings/odata2sparql/) endpoint that is publishing an RDF/SPARQL triplestore hosting an RDF version of Northwind (https://github.com/peterjohnlawrence/com.inova8.northwind). Alternatively, you could use a publicly hosted endpoint such as http://services.odata.org/V4/Northwind/Northwind.svc
  2. The next line specifies the entity or entitySet, in this case the Customer entitySet that is the start of the query
  3. The final line specifies a filter condition to be applied to the property values of each entity in the entitySet

The partial results are shown below in JSON format, although OData supports other formats.

But that is not much more than any custom RESTful API would provide, right? Actually, OData querying is far more powerful, as illustrated in the next section.

Complex OData Query Example

The question I now want answered from a Northwind datastore is

Get product unit prices for any product that is part of an order placed since 1996 made by customers located in France

One complication is that the terminology used by OData differs from that of RDBMS/SQL, RDF/RDFS/OWL/SPARQL, graph, POJO, etc. The following table shows the corresponding terms that will be used in this description.

OData Terminology Mapping to Relational, RDF, and Graph

OData | Relational | RDF/RDFS/OWL | Graph
Schema Namespace, EntityContainer Name | Model Name | owl:Ontology | Graph
EntityType, EntitySet | Table/View | rdfs:Class, owl:Class | Node
EntityType’s Properties | Table Column | owl:DatatypeProperty | Attribute
EntityType’s Key Properties | Primary Key | URI | Node ID
Navigation Property on EntityType, Association, AssociationSet | Foreign Key | owl:ObjectProperty | Edge
FunctionImport | Procedure | spin:Query | (none)

Given an OData endpoint, the query is answered with the following URL:

http://localhost:8080/odata2sparql.v4/NW/

Customer?

$filter=contains(customerCountry,'France')&

$select=customerCompanyName,customerAddress,customerContactName&

$expand=hasPlacedOrder(

$filter=orderDate gt 1996-07-05;

$select=shipAddress,orderDate;

$expand=hasOrderDetail(

$select=quantity;

$expand=product(

$select=productUnitPrice

)

)

)

The elements of this URL are much the same as before, but now we nest some queryOptions as we navigate from one related entity to another, just as we would when navigating through a graph or object structure.

  1. Customer entitySet is specified as the start of the query
  2. The $filter specifies the same filter condition as before since we want only customers with an address in France.
  3. Next we include a $select to limit the properties (aka RDBMS column, or OWL DatatypeProperty) of a Customer entity that are returned.
  4. Then the $expand is a query option to specify that we want to move along a navigationProperty (aka RDBMS foreign key, Graph edge or OWL ObjectProperty). In this case the navigationProperty takes us to all orders placed by that customer.

Now that we are at an Order entity, we can add queryOptions to this entity:

  1. The $filter specifies that only orders after a certain date should be included.
  2. The $select limits the properties of an Order entity that are returned.
  3. The $expand further navigates the query to the order details (aka line items) of this order.

Yet again we are at another entity, so we can add queryOptions to this Order_Detail entity:

  1. The $select limits to only the quantity ordered
  2. The $expand yet again navigates the query to the product entity referred to in the Order_Detail

Finally we can add queryOptions to this product entity:

  1. The $select limits only the productUnitPrice to be returned.

The partial results are shown below:

You can test the same query at the public Northwind endpoint, the only difference being the modified names of entitySets, properties, and navigationProperties.

http://services.odata.org

/V4/Northwind/Northwind.svc/Customers?

$filter=contains(Country, 'France')&

$select=CompanyName,Address,ContactName&

$expand=Orders(

$filter=OrderDate gt 1996-07-05;

$select=ShipAddress,OrderDate;

$expand=Order_Details(

$select=Quantity;

$expand=Product(

$select=UnitPrice

)

)

)

Structure of an OData Query

The approximate BNF of an OData query is shown below. This is only illustrative, not pure BNF; the accurate version is available here (http://docs.oasis-open.org/odata/odata/v4.0/odata-v4.0-part2-url-conventions.html).

ODataRequest := entity | entitySet ? queryOptions

entity: a single entity
    := Customer('NWD~Customer-GREAL')
    := Customer('NWD~Customer-GREAL')/hasPlacedOrder('NWD~Order-10528')

entitySet: a set of entities
    := Customer
    := Customer('NWD~Customer-GREAL')/hasPlacedOrder

queryOptions: the optional conditions placed on the current entity
    := filter ; select ; expand

filter: specifies a filter to limit which entities should be included
    := $filter = {filterCondition}*
    := $filter = contains(customerCountry,'France')

select: specifies which properties of the current entity should be returned
    := $select = {property}*
    := $select = customerCompanyName,customerAddress,customerContactName

expand: specifies the navigationProperty along which the query should proceed
    := $expand = {navigationProperty(queryOptions)}*
    := $expand = hasPlacedOrder($expand=..)

Perhaps this can be illustrated more clearly (for some!):

Using the same diagram notation, the complex query can be seen as traversing nodes in a graph:

Other Notes

  1. OData publishes the underlying model via a $metadata document, so any application can be truly model-driven rather than hard-coded. This means that generic OData query builders can connect to any OData endpoint and start constructing complex queries.
  2. The examples in this description use OData V4 syntax (http://docs.oasis-open.org/odata/odata/v4.0/odata-v4.0-part2-url-conventions.html). The same queries can be expressed in the older OData V2 syntax (http://www.odata.org/documentation/odata-version-2-0/uri-conventions/), but V4 has changed the syntax to be far more query-friendly.
  3. What about really, really complex queries: do I always have to define them via the URL? Of course not, because each underlying datastore has a way of composing ‘views’ of the data that can then be published as pseudo-entityTypes via OData:
    • RDBMS/SQL has user defined views
    • RDF/RDFS/OWL/SPARQL has SPIN queries (http://spinrdf.org/) that can be added as pseudo EntityTypes via OData2SPARQL
  4. By providing a standardized RESTful interface that also is capable of complex queries we can isolate application development and developers from the underlying datastore:
    • The development could start with a POJO database, migrate through an RDBMS, and then, realizing the supremacy of RDF, end up with a triplestore, all without having to change any of the application code. OData can be a true Janus-point in the development.

Conclusions

OData is much more than JDBC/ODBC for the 21st century: it is also a candidate for the universal query language that allows development and developers to be completely isolated from the underlying datastore technology.

Additional Resources

Odata2SPARQL (http://inova8.com/bg_inova8.com/offerings/odata2sparql/)

Lens2Odata (http://inova8.com/bg_inova8.com/offerings/lens2odata/)

SQL2RDF incremental materialization uses a stream of source data changes (DML) and transforms them to the corresponding changes (inserts, deletes) in the triple store, ensuring concurrency between the RDBMS and the triplestore. SQL2RDF incremental materialization is based on the same R2RML mapping used for virtualization (Ontop, Capsenta) and materialization (In4mium, R2RML Parser), and thus does not incur an additional configuration-maintenance burden.

Mapping an existing RDBMS to RDF has become an increasingly popular way of accessing data when the system-of-record is not, or cannot be, in an RDF database. Despite its attractiveness, virtualization is not always possible for various reasons, such as performance, the need for full SPARQL 1.1 support, or the need to reason over the virtualized as well as other materialized data. However, materialization of an existing RDBMS to RDF is also not an ideal alternative; reasons include the inevitable lack of concurrency between the system-of-record and the RDF copy.

Thus incremental materialization provides an alternative between virtualization and materialization, offering:

  • Easy ETL of the source RDB, as no additional configuration is required other than the same R2RML used for virtualization or bulk materialization.
  • Improved concurrency of data compared with materialization.
  • Significantly less computational overhead than a full materialization of the source data.
  • Improved query performance compared with virtualization especially when reasoning is required.
  • Compatibility with R2RML thus reducing configuration maintenance.
  • Supports insert, update, and delete changes on the source RDBMS.
  • Source transactions can be batched, and the changes to the triplestore are part of a single transaction that can be rolled back in the event of a problem.
  • Supports change logging so that committed changes can be rolled back.
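
Since everything above hinges on a single R2RML mapping, it is worth recalling that R2RML is itself plain Turtle. A minimal sketch for a hypothetical ORDERS table might look like the following; the table, column, and vocabulary names beyond rr: are illustrative:

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix model: <http://example.com/northwind/model/> .

<#OrdersMap>
    rr:logicalTable [ rr:tableName "ORDERS" ] ;
    rr:subjectMap [
        rr:template "http://example.com/northwind/Order-{ORDER_ID}" ;
        rr:class model:Order
    ] ;
    rr:predicateObjectMap [
        rr:predicate model:orderDate ;
        rr:objectMap [ rr:column "ORDER_DATE" ]
    ] .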

Contact inova8 if you would like to try SQL2RDF.