IntelligentGraph embeds calculation and analysis capability within RDF knowledge graphs, rather than forcing analysis to be undertaken by exporting query results to external analysis applications such as Excel.

IntelligentGraph achieves this by embedding scripts into the RDF knowledge graph; the scripts are evaluated when the graph is queried with SPARQL.

Scripts, written in a variety of languages such as JavaScript, Java, and Python, access the underlying graph using simple pathPatternQL navigation.

IntelligentGraph = KnowledgeGraph + Embedded Analysis

Why IntelligentGraph?

At present, calculations over stored data are delivered either by custom code or by exporting the stored data to spreadsheets. The data behind these tools is inevitably tabular. In fact, so dominant are spreadsheets for analysis that the spreadsheet itself becomes the ‘database’, with the inherent difficulties of syncing that data with the source system of record.

The real world is better represented as a network, or graph, of interconnected things. Therefore a knowledge graph is a far better storage organization than tables or objects. However, there is still a need to perform ad hoc numerical analysis over this data.

RDF DataCube can help organize data for analysis, but the analysis still has to be performed externally. Confronted with this dilemma, knowledge graph data is typically exported in tabular form to a datamart or, yet again, directly into a spreadsheet where the analysis can be performed.

IntelligentGraph turns this approach on its head by embedding the calculations as scripts within the knowledge graph. These scripts are evaluated on query and use the data in situ, so there are no concurrency issues. This allows the calculations to have knowledge of their neighbouring nodes and edges, just as Excel cells can access other cells in the spreadsheet. Access to other nodes within the graph uses pathPatternQL navigation.

Example Data and Analysis

An Industrial Internet of Things (IIoT) application is connecting all the measurements about a process plant, such as an oil refinery, into a knowledge graph that relates the measurements to the material flows through the process equipment.

Although there is an abundance of measurements and laboratory analyses available, the values required for operating and performance monitoring are not (and mostly cannot be) directly measured.

For example:

  • Stream Mass Flow: direct mass-flow measurements are rare. Instead, a volume-flow measurement is used in conjunction with a measured material density to calculate the mass flow.
  • Unit Mass Flow Throughput: this is calculated by summing either all feed-stream mass flows or all product-stream mass flows.
  • Unit Mass Balance: this is calculated by taking the difference between the feed and product mass flows.
  • Product Stream Yield: this is the ratio of a stream’s mass flow to the throughput of the unit to which the stream is connected.

Figure 1: Typical Process Flow Sheet

These are simple examples; however, they show the reliance on the knowledge graph structure to perform the analysis.

Solving data analysis, the traditional way

Data is in the database, analysis is done by the analysis engine (aka Excel), right?

Figure 2: Data analysis the traditional way

In this scenario, the local power user sets up a query to export data from the database and converts it to a format that can be imported into Excel. Ever more complex formulae are then written to wrangle the data into the results that are required.

Why is the spreadsheet approach risky?

  • The analysis is now separated from the data. Data changes will not be reflected in the analysis. Worse still, changes to the analysis might not be propagated to all the spawned copies of the spreadsheet.
  • The data is separated from the analysis. The analysis results are rarely re-imported into the data store, where they could be combined with the source data. Instead, even more data is extracted into the spreadsheet.
  • The difficulty of managing the separation of data from analysis becomes so great that in many cases the database is dispensed with entirely and the spreadsheet becomes the de facto database.

Solving data analysis with an IntelligentGraph

The beauty of Excel is that a cell can contain either a value or a formula that can reference other cells’ values. Why not do the same with a graph: a node can have edges that terminate with a literal value, or with a formula that can reference other nodes’ values.

This is illustrated in the diagram below:

  • The :massFlow property is not measured directly, so a formula is used for its value instead. This formula references _this, the node to which the calculation is attached, and uses the method getFact() to retrieve related values. The argument of getFact() is a pathPatternQL expression.
  • The :totalProduction property is not measured directly, so a formula is used instead which iterates over all of the ‘stream out’ nodes, retrieving the value of :massFlow for each stream. The :massFlow value is, of course, in turn a calculation.

Figure 3: Intelligent Graph Data Analysis

Why is the IntelligentGraph approach so advantageous?

  • There is no separation between data and analysis, removing the risk of stale and inaccurate data and calculations.
  • The calculations embedded within the graph can take advantage of the knowledge that is contained within that graph. This makes the calculations far simpler than those that need to be embedded in spreadsheets.
  • The calculations automatically utilize the latest knowledge, on the fly, whenever they are queried.

How does IntelligentGraph Work?

Analysis is embedded in an IntelligentGraph simply by adding script literals as object values, with a datatype identifying the scripting language (Groovy, JavaScript, Python, etc.).

The IntelligentGraph engine is provided as an RDF4J Stackable SAIL. This means that its capabilities can be combined with any other RDF4J capabilities. The choice of RDF storage remains the same as for any other RDF4J compliant framework.

Modeling with Scripts

Typically, a graph node will have associated attributes with values, such as a stream with volumeFlow and density values:

Stream Attributes:

:Stream_1
   :density ".36"^^xsd:float ;
   :volumeFlow "40"^^xsd:float .

Of course, in the ‘real-world’ these measured values are sourced from outside the KnowledgeGraph and change over time. IntelligentGraph can deal with both of these requirements.

The ‘model’ of the streams can be captured as edges associated with the Unit:

:Unit_1
   :hasProductStream :Stream_1 ;
   :hasProductStream :Stream_2 ;
.

Calculate Mass Flow

The calculations are declared as literals[1] with a datatype whose local name corresponds to one of the installed script languages:

:Stream_1 :massFlow     
    "_this.getFact(':density')*    
     _this.getFact(':volumeFlow');"^^:groovy .

Calculate Total Production

A typical performance metric is to understand the total production from a unit, which is not of course directly measured. However, it can be easily expressed using existing calculated values:

:Unit_1   :totalProduction
    "var totalProduction =0.0;
    for (stream in _this.getFacts(':hasProductStream'))
    {
        totalProduction += stream.getFact(':massFlow').floatValue();
    }
    return totalProduction; "^^:groovy .

Instead of returning the object literal value (that is, the script), IntelligentGraph returns the result of evaluating the script.

We can write this script even more succinctly using the expressive power of PathQL:

:Unit_1  :totalProduction  
    "return _this.getFacts(':hasProductStream/:massFlow').total(); "^^:groovy

However, IntelligentGraph allows us to build upon existing calculations to simply express what would normally be difficult-to-calculate metrics, such as product yield or mass balance.

Calculate Mass Yield

Any production unit has products of differing value, so a key metric is the yield of individual streams. This can easily be calculated as follows, using values that are themselves calculations.

var result= _this.getFact(":massFlow").floatValue()/ 
_this.getFact("^:hasStream/:totalProduction").floatValue();  
result;
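Attached to the stream as another script literal, this could look like the following (the property name :massYield is illustrative, not part of the example model above):

:Stream_1 :massYield
    "var result= _this.getFact(':massFlow').floatValue()/
     _this.getFact('^:hasStream/:totalProduction').floatValue();
     result;"^^:groovy .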

Calculate Mass Balance

Measurements are not perfect, nor is the operation of a unit. One of the first indicators of a problem is when the mass flow in does not match the mass flow out. This can be expressed as another calculated property of a Unit:

return _this.getFacts(":hasFeedStream/:massFlow").total() - _this.getFacts(":totalProduction").total();
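As with the other calculations, this could be attached to the unit as a script literal (the property name :massBalance is illustrative):

:Unit_1 :massBalance
    "return _this.getFacts(':hasFeedStream/:massFlow').total()
          - _this.getFacts(':totalProduction').total();"^^:groovy .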

Querying Results

Access to the calculated values is via standard SPARQL. However, instead of returning the script literal, IntelligentGraph invokes the script engine and returns the evaluated result.

Thus to access the :massFlow calculated value, the SPARQL is simply:

select ?massFlow
{
 :Stream_1 :massFlow ?massFlow 
}

If the script literal is required then the object variable can be postfixed with _SCRIPT:

select ?massFlow ?massFlow_SCRIPT 
{
 :Stream_1 :massFlow ?massFlow, ?massFlow_SCRIPT 
}

If a full trace of the calculation, including tracing calls to other scripts, is required then the object variable can be postfixed with _TRACE:

select ?massFlow ?massFlow_TRACE 
{
 :Stream_1 :massFlow ?massFlow, ?massFlow_TRACE 
}

How to Write IntelligentGraph Scripts?

Script Languages

Any Java 9-supported scripting language can be used, simply by making the corresponding language JAR available.

By default, the JavaScript, Groovy, and Python JARs are installed. The complete list of compliant languages is as follows:

AWK, BeanShell, ejs, FreeMarker, Groovy, Jaskell, Java, JavaScript, JavaScript (Web Browser), Jelly, JEP, Jexl, jst, JudoScript, JUEL, OGNL, Pnuts, Python, Ruby, Scheme, Sleep, Tcl, Velocity, XPath, XSLT, JavaFX Script, ABCL, AppleScript, Bex script, OCaml Scripting Project, PHP, Python, Smalltalk, CajuScript, MathEclipse

Script Context Variables

In addition, each script has access to the following predefined variables, which allow the script to access the context within which it is being run.

  • _this, a Thing corresponding to the subject of the triple whose object is the script. Helper functions are provided to navigate edges to or from this ‘thing’, as sketched in the example after this list.
  • _property, a Thing corresponding to the predicate, or property, of the triple whose object is the script.
  • _customQueryOptions, a HashMap<String, Value> of name/value pairs corresponding to the pairs of additional arguments to the SPARQL extension function. These are useful for passing application-specific parameters.
  • _builder, an RDF4J graph builder object allowing a graph to be constructed (and manipulated) within the script. A graph cannot be returned from a SPARQL function; however, the IRI of the graph can be returned, and any graph created by a script will be persisted.
  • _tripleSource, the RDF4J TripleSource to which the subject, predicate, and object of the triple belong.
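For example, a calculation could be parameterized by a custom query option. A minimal Groovy sketch, in which the :temperature property and the ‘units’ option name are purely illustrative:

// _this is the subject Thing; _customQueryOptions carries any extra SPARQL function arguments
def temperature = _this.getFact(":temperature").floatValue()
def units = _customQueryOptions?.get("units")
// Convert to Fahrenheit if the caller asked for it (illustrative option name and value)
if (units != null && units.stringValue() == "F") {
    temperature = temperature * 9 / 5 + 32
}
return temperature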

Fact and Path Functions

The spreadsheet’s secret sauce is the ability of a cell formula to access the values of other cells, either individually or as a set. IntelligentGraph provides this functionality with several methods associated with Thing, available on the _this Thing initialized for each script with the script’s subject.

Thing.getFact(String pathPattern) returns Value

Returns the value of node referenced by the pathPattern, for example “:volumeFlow” returns the object value of the :volumeFlow edge relative to _this node. The pathPattern allows for more complex path navigation.

Thing.getFacts(String pathPattern) returns Values

Returns the values of nodes referenced by the pathPattern, for example “:hasProductStream” returns an iterator for all object values of the :hasProductStream edge relative to _this node. The pathPattern allows for more complex path navigation.

Thing.getPath(String pathQL) returns Path

Returns the first (shortest)  path referenced by the pathQL, for example “:parent{1..5}” returns the path to the first ancestor of _this node. The pathQL allows for more complex path navigation.

Thing.getPaths(String pathQL) returns PathResults

Returns all paths referenced by the pathQL; for example “:parent{1..5}” returns an iterator, starting with the shortest path, for all paths to the ancestors of the _this node. The pathQL allows for more complex path navigation.
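For example, a script might simply count the ancestor paths reachable from _this (a minimal Groovy sketch, assuming only that the returned iterator can be traversed, as described above; the :parent predicate is the one used in the examples below):

// Count the ancestor paths reachable within five :parent hops of _this
def ancestorPaths = 0
for (path in _this.getPaths(":parent{1..5}")) {
    ancestorPaths++
}
return ancestorPaths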

Path Patterns

Spreadsheets are not limited to accessing just adjacent cells; neither is the IntelligentGraph. PathPatterns provide a powerful way of navigating from one Thing node to another. PathPatterns are inspired by SPARQL and propertyPaths, but a richer, more expressive, PathQL was required for the IntelligentGraph.

Examples

Examples of PathQL patterns are as follows:

_this.getFact(":hasParent")

will return the first parent of _this.

_this.getFact("^:hasParent")

will return the first child of _this.

_this.getFacts(":hasParent/:hasParent")

will return the grandparents of _this.

_this.getFacts(":hasParent/^:hasParent")

will return the siblings of _this.

_this.getFacts(":hasParent[:gender :female]/:hasParent")

will return the maternal grandparents of _this.

_this.getFacts(":hasParent[:gender :female]/:hasParent[:gender :male]")

will return the maternal grandfather of _this.

_this.getFacts(":hasParent[:gender [ rdfs:label 'female']]")

will return the mother of _this, but using the label instead of the IRI.

_this.getFacts(":hasParent[eq :Peter]/:hasParent[:gender :male]")

will return the grandfather of _this who is a parent of :Peter.

_this.getFacts(":hasParent[ne :Peter]/:hasParent[:gender :male]")

will return the grandfathers of _this who are not parents of :Peter.

The following diagram visualizes a path through a genealogical graph, from _this to find the parents of a maternal grandfather born in Maidstone:

_this.getFacts("/:parent[:gender :female]/:parent[:gender :male, :birthplace [rdfs:label 'Maidstone']]/:parent")

 

Figure 4: PathPatternQL Example

How Is Performance?

IntelligentGraph takes the following actions to improve performance:

  1. All intermediate calculation results are cached, keyed by the subject node, predicate, and customQueryOptions.
  2. The cache can be cleared using the SPARQL function ClearCache.
  3. The SPARQL function ObjectValue takes as its arguments the subject, predicate, and objectValue. If the objectValue supplied is not of a script datatype, the function immediately returns that objectValue.
  4. Circular references, in which script A calls B, which in turn calls A, are detected and rejected.

Can I Debug?

Since IntelligentGraph combines calculations with the knowledge graph, it is inevitable that any evaluation will involve calls to the values of other nodes, which are in turn calculations. For this reason, IntelligentGraph supports tracing and debugging:

 

Figure 5: Tracing Calculation

How Do I Add Intelligence to my RDFGraph?

Download

The project is located on GitHub, from where the intelligentgraph.jar can be downloaded.

The intelligentgraph.jar does not include all of the scripting-language dependencies, so to use it you must ensure that all dependencies are already available.

Install

IntelligentGraph will work only with RDF4J version 3.3.0 and above.

Copy intelligentgraph.jar to /usr/local/tomcat/webapps/rdf4j-server/WEB-INF/lib/intelligentgraph.jar

The RDF4J server will need to be restarted for it to recognize this new JAR and initiate the scripting engine.

[1] In this case the script uses Groovy, but any Java 9 compliant scripting language can be used, such as JavaScript, Python, Ruby, and many more.

Providing answers to users’ analysis, search, visualization, and other questions about their own data

Creating an overall solution that presents data in a useful way can be challenging, but OData2SPARQL and Lens2OData solve this.

  • RDF-Graph: Data + Model = Information, allowing us to combine your raw data with an adaptable model to create meaningful information.
  • OData2SPARQL: Information + Rules = Knowledge, providing the ability to access that information, combined with additional rules (SPIN and SHACL), to deliver useful knowledge that can be consumed by applications and BI tools.
  • Lens2OData: Knowledge + Action = Results, allowing users to easily navigate, search, explore, and visualize this knowledge in such a way that it is easy to take action and produce results.

To see this all in action we have prepared a demonstrator that can be downloaded, and the following videos which illustrate the capabilities of this demonstrator.

  • Explore the provenance of data sources, which is retained by RDF-Graph and OData2SPARQL rather than being lost in typical ETL processing.

  • Explore the Transport for London train lines, stations, and zones, illustrating how easy it is to transform any dataset to RDF-Graph and immediately get the benefits of OData access and the Lens UI/UX.

To download and run this demonstrator, go to Docker Hub here.

Let’s face it, RDF Graph datastores have not become the go-to database for application development like MySQL, MongoDB, and others have. Why? It is not that they cannot scale to handle the volume and velocity of data, nor the variety of structured and unstructured types.

Perhaps it is the lack of application development frameworks integrated with RDF. After all, any application needs not only to store and query the data but also to provide users with the means to interact with that data, whether that be data entry forms, charts, graphs, visualizations, and more.

However, application development frameworks target popular back-ends accessible via JDBC and, now that we are in the 21st century, OData. RDF and SPARQL are not on their radar … that is, unless we wrap RDF with OData so that the world of these application development environments is opened up to RDF Graph datastores.

OData2SPARQL provides that Janus-inflexion point, hiding the nuances of the RDF Graph behind an OData service which can then be consumed by powerful development environments such as OpenUI5 and WebIDE.

This article shows how an RDF Graph CRUD application can be rapidly developed, yet without losing the flexibility that HTML5/JavaScript offers, from which it can be concluded that there is no reason preventing the use of RDF Graphs as the backend for production-capable applications.

A video of this demo can be found here: https://youtu.be/QFhcsS8Bx-U

Figure 1: OData2SPARQL: the Janus-Point between RDF data stores and application development

Rapid Application Development Environments

There are a great number of superb application development frameworks that allow one to create cross-platform (desktop, web, iOS, and Android), rich (with a large selection of components such as grids, charts, forms, etc.) applications. Most of these are based on the MVC or MVVM model, both of which require systematic and complete (CRUD) access to the back-end data via a RESTful API. Now that OData has been adopted by OASIS, the number of companies offering explicit support for OData is increasing, including Microsoft, IBM, and SAP. Similarly, there are a number of frameworks, one of which is SAPUI5, which has an open-source version, OpenUI5.

OpenUI5

OpenUI5 is an open source JavaScript UI library, maintained by SAP and available under the Apache 2.0 license. OpenUI5 lets you build enterprise-ready web applications, responsive to all devices, running on almost any browser of your choice. It’s based on JavaScript, using JQuery as its foundation, and follows web standards. It eases your development with a client-side HTML5 rendering library including a rich set of controls, and supports data binding to different models (JSON, XML and OData).

With its extensive support for OData, combining OpenUI5 with OData2SPARQL releases the potential of RDF Graph datasources for web application development.

Web IDE

SAP Web IDE is a powerful, extensible, web-based integrated development tool that simplifies end-to-end application development. Since it is built around using OData datasources as its provider, Web IDE can be used as a semantic application IDE when the RDF Graph data is published via OData2SPARQL.

Web IDE runs either as a cloud-based service supplied free by SAP, or as a downloadable Eclipse Orion application. Since the development is probably against a local OData endpoint, the latter is more convenient.

RDF Graph application in 5 steps:

  1. Deploy OData2SPARQL endpoint for chosen RDF Graph store

The OData2SPARQL WAR is available here, together with instructions for configuring the endpoints: odata2sparql.v2

The endpoint in this example is against an RDF-ized version of the ubiquitous Northwind database. This RDF graph version can be downloaded here: Northwind

  2. Install WebIDE

Instructions for installing the Web IDE Personal edition can be found here: SAP Web IDE Personal Edition

  3. Add OData service definition

Once installed, an OData service definition file (for example, NorthwindRDF) can be added to the SAPWebIDE\config_master\service.destinations\destinations folder, as follows:

Description=Northwind
Type=HTTP
TrustAll=true
Authentication=NoAuthentication
Name=NorthwindRDF
ProxyType=Internet
URL=http://localhost:8080
WebIDEUsage=odata_gen
WebIDESystem=NorthwindRDF
WebIDEEnabled=true

  4. Create new application from template

An application can be built using a template which is accessed via File/New/Project from Template. In this example the “CRUD Master-Detail Application” was selected.

The template wizard needs a Data Connection: choose Service URL, select the data service (NorthwindRDF) and enter the path of the particular endpoint (/odata2sparql/2.0/NW/).

Figure 2: WEB IDE Data Connection definition

At this stage the template wizard allows you to browse the endpoint’s entityTypes and properties, or classes and properties in RDF graph-speak.

Since Web IDE and OpenUI5 are truly model-driven, the IDE creates a master-detail application from the entities that you define in the next screen.

The template wizard will ask you for the ‘object’ entityType, which in this example is Category. Additionally, you should enter the ‘line item’; in this case there is only one navigation property (aka object property in RDF graph-speak), which is the products that belong to the category.

Note that this template allows other fields to be defined, so title and productUnitprice were selected.

Figure 3: WEB IDE Template Customization

  5. Run the application

The application is now complete and can be launched from the IDE. Right-click the application and select Run/Run as/As web application:

Figure 4: Application

Even this templated application is not limited to browsing data: it allows full editing of the categories. Note that even the labels are derived from the OData endpoint, but of course they can be changed by editing the application.

Figure 5: Edit Category

Alternatively a new category can be added:

Figure 6: Create New Category

Next Steps

That’s it: a fully functional RDF Graph web application built without a single line of code being written. However, we can choose to use this just as a starting point:

  1. Publish data views, the SPARQL equivalent of a SQL view on the data, to the OData2SPARQL endpoint.
    • These are particularly useful when publishing reports in OpenUI5, Excel PowerQuery, Spotfire, etc.
  2. Modify the model that is published via the OData2SPARQL endpoint
    • The model that is published is extracted from the schema defined in the endpoint configuration. Thus the schema can be changed to suit what one wants in the endpoint metamodel.
  3. Edit the templated application
    • The templated application is just a starting point for adaptation: the code that is created is nothing more than standard HTML5/JavaScript using the OpenUI5 libraries.
  4. Build the application from first-principles
    • A template is not always the best starting point, so an application can always be built from first-principles. However the OpenUI5 libraries make this much easier by, for example, making the complete metamodel of the OData2SPARQL endpoint available within the application simply by defining that endpoint as the datasource.

Integration Problem to be solved

  • Data is in different databases, and even in Linked Open Data sources.
  • Misaligned models: different datasets have different meanings for classes and predicates that need to be aligned.
  • Misaligned names for the same concepts.
  • Replication is problematic.
  • Query definition and the scope of querying are difficult to define in advance.
  • Provenance of data is necessary.
  • Cannot depend on inferences being available in advance.
  • A scalable architecture requires that all queries are stateless.

Data Cathedrals versus Information Shopping Bazaars

Linked Open Data has been growing since 2007 from a few (12) interconnected datasets to 295 as of 2011, and it continues to grow. To quote “Linked Data is about using the Web to connect related data that wasn’t previously linked, or using the Web to lower the barriers to linking data currently linked using other methods.” (Linked Data, n.d.) 

Figure 1: Growth of the Linked Data ‘Cloud’

As impressive as the growth of interconnected datasets is, what is more important is the value of that interconnected data. A corollary of Metcalfe’s law suggests that the benefit gained from integrated information grows geometrically[1] with the number of data communities that are integrated.

Many organizations have their own icebergs of information: operations, sales, marketing, personnel, legal, finance, research, maintenance, CRM, document vaults, etc. (Lawrence, 2012) Over the years there have been various attempts to melt the boundaries between these icebergs, including the creation of a mother-of-all database that houses (or replicates) all information, or the replacement of disparate applications, each with its own database, by a mother-of-all application that eliminates the separate databases. Neither of these has really succeeded in unifying any or all data within an organization. (Lawrence, Data cathedrals versus information bazaars?, 2012) The result is a ‘Data Cathedral’ through which users have no way to navigate to find the information that will answer their questions.

Figure 2: Users have no way to navigate through the Enterprise’s Data Cathedral

Remediator at the heart of Linked Enterprise Data

Can we create an information shopping bazaar for users to answer their questions without committing heresy in the Data Cathedral? Can we create the same information shopping bazaar as Linked Data within the enterprise: Linked Enterprise Data (LED)? That is the objective of Remediator.

First of all we must recognize that the enterprise will have many structured, aggregated, and unstructured data stores already in place:

Figure 3: Enterprise Structured, Aggregated, and Unstructured Data Icebergs

One of the keys to the ability of Linked Data to interlink 300+ datasets is that they are all expressed as RDF. The enterprise does not have the luxury of replicating all existing data into RDF datasets. However, that is not necessary (although still sometimes desirable) because there are adapters that can make any existing dataset look as if it contains RDF, accessible via a SPARQL endpoint. Examples are listed below:

  1. D2RQ (D2RQ: Accessing Relational Databases as Virtual RDF Graphs)
  2. Ultrawrap (Research in Bioinformatics and Semantic Web/Ultrawrap)
  3. Ontop (-ontop- is a platform to query databases as Virtual RDF Graphs using SPARQL)

Attaching these adapters to existing data-stores, or replicating existing data into a triple store, takes us one step further to the Linked Enterprise Data:

Figure 4: Enterprise Data Cloud, the first step to integration

Of course, now that we have harmonized all the data as RDF accessible via a SPARQL endpoint, we can view this as an extension of the Linked Data cloud in which we provide enterprise users access to both enterprise and public data:

Figure 5: Enterprise Data Cloud and Linked Data cloud

We are now closer to the information shopping bazaar, since users would, given appropriate discovery and searching user interfaces, be able to navigate their own way through this data cloud. However, despite the harmonization of the data into RDF, we still have not provided a means for users to ask new questions:

What Company (and their fiscal structure) are we working with that has a Business Practice of type Maintenance for the target industry of Oil and Gas, with a supporting technology based on a Vendor-Application, where this Application is not similar to any of our Applications?

Such questions require pulling information from many different sources within an organization. Even with the Enterprise Data Cloud, we have not yet provided the capability to discover such answers. Would it not be better to allow a user to ask such a question, and let the Linked Enterprise Data determine from where it should pull partial answers, which it can then aggregate into the complete answer to the question? It is like asking a team of people to answer a complex question, each contributing their own part, and then assembling the overall answer, rather than relying on a single guru. Remediator plays the role of that team, taking parts of the question and asking each part of the relevant data source.

Figure 6: Remediator as the Common Entry Point to Linked Enterprise Data (LED)

Thus our question can become:

  1. What Business Practices of type Maintenance are there for the target industry of Oil and Gas?
  2. What Companies are we working with?
  3. What Companies have a Business Practice of type Maintenance?
  4. What Business Practices have a supporting technology based on a Vendor-Application?
  5. What Companies (and their fiscal structure) are there?
  6. What Vendor-Applications are not similar to any of our Applications?

This decomposition of a question into sub-questions relevant to each dataset is automated by Remediator:

Figure 7: Sub-Questions distributed to datasets for answers
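Conceptually, the decomposed sub-questions resemble a federated SPARQL query in which each graph pattern is routed to the dataset that can answer it. This is a sketch only: the endpoint URLs and vocabulary terms are illustrative, prefix declarations are omitted as in the earlier examples, and Remediator derives the routing from VoID descriptions rather than from explicit SERVICE clauses:

SELECT ?company ?practice ?application
WHERE {
  # Sub-question 2: which companies are we working with? (CRM dataset)
  SERVICE <http://example.org/crm/sparql> {
    :OurCompany :workingWith ?company .
  }
  # Sub-questions 1, 3 and 4: which of their business practices target Oil and Gas
  # maintenance, and on which vendor application do they rely? (marketing dataset)
  SERVICE <http://example.org/marketing/sparql> {
    ?company  :hasBusinessPractice  ?practice .
    ?practice :practiceType         :Maintenance ;
              :targetIndustry       :OilAndGas ;
              :supportingTechnology ?application .
  }
  # Sub-question 6: exclude vendor applications similar to one of our own (portfolio dataset)
  FILTER NOT EXISTS {
    SERVICE <http://example.org/portfolio/sparql> {
      ?application :similarTo ?ourApplication .
    }
  }
}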

Requirements for a Linked Enterprise Data Architecture

  • Keep it simple
  • Do not re-invent that which already exists.
  • Eliminate replication where possible.
  • Avoid the need for prior inferencing.
  • Efficient query performance.
  • Provide provenance of results.
  • Provide optional caching for further slicing and dicing of result-set.
  • Use VoID, only VoID, and nothing but VoID to drive the query.

[1] If I have 10 database systems running my business that are entirely disconnected, then the benefits are 10 * K, where K is some constant. If I integrate these databases in pairs (operations + accounting, accounting + payroll, etc.), then the benefits increase to 10 * K * 2. If I integrate in threes (operations + accounting + maintenance, accounting + payroll + receiving, etc.), then the benefits increase four-fold (a corollary of Metcalfe’s law) to 10 * K * 4. For quad-wise integration my benefits would be 10 * K * 8, and so on. Now it might not be 8-fold, but the point is that there is geometric, not linear, growth in benefits as I integrate all of my information across my organization.


If you do not plan where you are going you will not get there, but you will probably get what you deserve. This software development model, which accurately predicts resources and schedule given scope, can greatly help in planning when you will get there and what will be needed on the way. This may not be a perfect model, but perfection is the enemy of the good. What software development model experience can you share?

In a previous posting, Innovation != R&D$, I reasoned that the impact of R&D expenditure is not well correlated with gross operating margins because a significant proportion of R&D monies is absorbed by the less revenue-generating activities of maintenance and support rather than innovation, by which I mean in this context the creation of new software applications. However, it is still important to ensure that whatever monies are invested in innovation are wisely invested. Unfortunately our industry is beset with project delivery problems. We have earned the IT Rule-of-5: 5 times over budget, 5 times schedule, 1/5th functionality. Perhaps an exaggeration, but there is a problem that few would deny.

One problem is that of creating unrealistic expectations. The development team’s answer to a feature request of ‘it is just a small matter of programming’ (one unit of SMOP) gets heard as ‘it will be on my desk tomorrow morning, tested and with a quality and quantity of documentation that would shame Charles Dickens’.  

The answer is a good-looking model. I love models. Not the type you are thinking of, but simple mathematical formulae that allow me to estimate what will happen. Having been long involved with automation, MES, and software development, I always want to know how long a software development will take. ‘As long as a piece of string’ is not the most useful answer when customer expectations or development budgets need to be met. So over the years I have developed a model that estimates total development effort and tracks development progress over the life-cycle of the project with surprising accuracy. Before you say that most models have so many ‘tuning factors’ that you can of course make them always fit, I want to point out that this model has just one factor. I also want to point out that most of this model originates with Capers Jones’s seminal work ‘Applied Software Measurement’.

Estimating Model

The objective of the estimating model is to use a measure of the size of the development and come up with estimates for project duration, the number of project resources, total development effort and average productivity. From these estimates the project cost can be derived.

Tuning factor (J)

This is the only factor you need for this model. Fortunately, Capers Jones also provides a range of suggested values, as shown below. I would suggest 0.4 as a starting point.

Kind of software    Best in class    Average    Worst in class
Systems             0.43             0.45       0.48
Business            0.41             0.43       0.46
Shrink-wrap         0.39             0.42       0.45

Estimated Scope (LOC)

Estimated size for the project (lines-of-code). OK, it can be difficult to come up with a really accurate estimate, but there is much written on the subject. Quick ways are to simply say that this application is approximately the same size as a similar one done in the past. For greater accuracy, Function Point or Story Point estimating can be used and then converted to lines-of-code.

Estimated Function Point (FP)

= Lines-of-Code (LOC) / LOC-per-FP, where LOC-per-FP is taken as 54 for languages such as C#

Estimated Project Duration (D)

= FP^J (months)

The ideal project duration given the ideal number of resources for the project.

Development Resources (R)

= FP^(2*J) / 27 (persons)

The ideal number of full-time project-persons to complete the project.

Total development effort (Effort)

= D * R (person-months)

Average Productivity (P)  

= LOC/Effort (lines-of-code per person-month)

Below is an example of applying this model to an example project:

Development Estimate Model

Factor                        Units               Formula             Example Project
Tuning Factor                 dimensionless       J                   0.4
Lines of Code                 LOC                 LOC                 75000
LOC per Function Point        LOC/FP              LOC-per-FP          54
Function Points               FP                  LOC / LOC-per-FP    1389
Estimated project duration    months              FP^J                18
Development resources         persons             FP^(2*J) / 27       12
Total development effort      person-months       FP^(3*J) / 27       219
Average productivity          LOC/person-month    LOC / Effort        343
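As a quick check, the model can be expressed in a few lines of script. The following Groovy sketch reproduces the example project above (values rounded):

def J = 0.4                          // tuning factor
def LOC = 75000                      // estimated lines of code
def locPerFP = 54                    // LOC per function point for a C#-class language
def FP = LOC / locPerFP              // function points, about 1389
def D = Math.pow(FP, J)              // estimated duration, about 18 months
def R = Math.pow(FP, 2 * J) / 27     // ideal resources, about 12 persons
def effort = D * R                   // total effort, about 219 person-months
def P = LOC / effort                 // average productivity, about 343 LOC/person-month
printf("Duration %.0f months, resources %.0f, effort %.0f person-months, productivity %.0f LOC/pm%n",
       D, R, effort, P)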

 

The graphs below show the project effort and productivity plotted for various sizes of project. As would be expected, we can see the productivity falling off as project size increases (see The Mythical Man-Month).

Tracking Model

The Tracking Model recognizes that the assumptions in the original estimate might not apply in practice. For example the number of resources assigned to the project might change, or the scope decreases or, more likely, increases. Thus the tracking model uses known measurements of the project such as code produced to date, resources actually assigned, and current total project size to calculate what the progress to date should be and to predict into the future the revised project completion.

Estimated Scope (LOC)

Estimated size of the project (lines-of-code). This is the estimate at the beginning of the project, because scope creep and scope additions will inevitably occur.

Scope Creep (Creep)

Percent change month-on-month of the project scope (%). This requires careful measurement because, as has been shown elsewhere, projects can only tolerate a small amount of scope creep before they become ‘runaways’.

Estimated Scope including creep (ELOC)

= LOC * (1+Creep) + Any additional scope     

Estimated Scope (EFP)

= ELOC / LOC-per-FP

Project Duration (D)                              

Duration of project, measured from the original project start, assuming optimal resource allocation, no scope creep, and sustained productivity

= EFP^J

Required Project Resources (R)

Resources that should be allocated to the project for the duration

= EFP^(2*J) / 27

Actual Project Resources (AR)

Actual person-months assigned to the project for the period.

Estimated productivity (P)

Estimated lines-of-code per assigned person-month. Note that productivity reduces as the assigned resources increase (see The Mythical Man-Month).

= (ELOC / EFP) * 27 * (27*AR)^((1-3*J)/(2*J))

Estimated Production (PR)

Estimated lines-of-code produced in period based on assigned resources and estimated productivity.

= P * AR

Accumulated lines-of-code (ALOC)

Accumulated lines of code to-date based on assigned resources and estimated productivity.

Remaining Scope                                

Estimated scope less estimated accumulated code

= ELOC – ALOC

Accumulated Cost (AC)

Cost of actual resources assigned based on annual rate.

Cost per Line-of-code

A metric indicating how much each line-of-code is costing

= Accumulated Cost / Accumulated lines-of-code

Agile Productivity Ratio  

The promise of Agile/SCRUM is that productivity will increase as large teams are split into Scrum teams. As indicated before, productivity reduces as team size increases. This is the ratio of the productivity of a single team versus multiple Scrum teams with the same total resources.

Allowable scope change

Scope creep is a project killer. If each month the project scope is allowed to increase, then the project size will increase. If the project size increases, then the duration and number of project resources follow. As project resources increase, productivity will reduce, further increasing project duration. As project duration increases, more scope changes can accumulate, further increasing project duration, and so on. The tipping point at which the project becomes a runaway has been determined and is calculated as follows:

= 1 / EFP^J
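A sketch of a single period of the tracking model, using the same J and LOC-per-FP as before; the monthly creep and assigned resources are illustrative values only:

def J = 0.4
def locPerFP = 54
def LOC = 75000                                   // original estimated scope
def creep = 0.02                                  // 2% scope creep this month (illustrative)
def ELOC = LOC * (1 + creep)                      // estimated scope including creep
def EFP = ELOC / locPerFP                         // estimated scope in function points
def D = Math.pow(EFP, J)                          // revised ideal duration (months)
def R = Math.pow(EFP, 2 * J) / 27                 // resources that should be allocated
def AR = 14                                       // person-months actually assigned this period (illustrative)
def P = 27 * locPerFP * Math.pow(27 * AR, (1 - 3 * J) / (2 * J))  // estimated productivity (LOC/person-month)
def PR = P * AR                                   // estimated lines of code produced this period
def allowableCreep = 1 / Math.pow(EFP, J)         // monthly scope change beyond which the project runs away
printf("Duration %.1f months, productivity %.0f LOC/pm, production %.0f LOC, allowable creep %.1f%%%n",
       D, P, PR, allowableCreep * 100)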

The graphs below show the results from a typical project. We see the project scope increasing (scope-creep), resulting in more resources being pulled in to tackle the retreating deadline. Note that without scope creep the project would have completed in month 23 even with the same assigned resource profile. This signals the hidden dangers of even mild scope-creep.

The above graph is based on the actual resources assigned. The graph below compares the estimated with the actual production of code over the same period, confirming the accuracy of the model.

Observations

  • Although there is only one factor in this model, estimates are quite sensitive to the value chosen. Therefore it is best to track the estimates with actual measurements to ensure the accuracy of the model and hence the tuning factor.
  • Include test code or not? If the development teams are using test-driven development (TDD) or any form of Agile/SCRUM then I think it is important to include test code along with production code in the estimates, at the same time including the test resources with the overall project resources. Generally I expect to see test code lines-of-code to be approximately 50% of the production code.
  • Counting lines-of-code should use the same tool, such as Visual Studio, for consistency. There is much debate about what should be included and excluded. However I think it is more important to simply be consistent because you will end up deriving your own tuning factor based on your assumptions.
  • There is a difference between development productivity, and productivity of customer expectations. Just because the lines-of-code have been efficiently produced, albeit error free and with great documentation, it does not mean that customer expectations have been met because they might have wanted an entirely different solution.
  • Scaling lines-of-code to function points. Despite the superiority of Function Points (or Story Points) over lines-of-code as a measure of software size, lines-of-code seems to be more tangible to management. Thus although the model is expressed in function points, I expect most will estimate lines-of-code or convert (‘back-cast’ to quote Capers-Jones) from lines-of-code to function points using a factor based on the type of programming language being used. For example there are approximately 54 C# lines-of-code per function point.

Associated Spreadsheet

To those who have got this far, I am sharing a spreadsheet version of this model that you are free to download and use for your own estimating. I hope it works as well for you as it has for me.

References

  1. Capers Jones, Applied Software Measurement: Global Analysis of Productivity and Quality, Third Edition, 2008.
  2. Frederick Brooks, The Mythical Man-Month: Essays on Software Engineering, 1995.


It is known that software R&D expenditure positively impacts the gross operating margins or the market-to-book values of a company. However, the correlation is not strong. Is it because maintenance and sustainment, as well as innovation, are lumped together as R&D, yet the return is not equal throughout the software product life-cycle? This article shows that there may be far less of the high-returning innovation development than management believes. In fact, innovation is being starved out by the need to maintain and sustain the existing software product portfolio.

R&D = Innovation + Maintenance + Sustainment

It is generally agreed that R&D expenditure positively impacts the gross operating margins or the market-to-book values of a company[1], but how strong that correlation is remains hotly debated[2]. What is more difficult is the measure of R&D expenditure productivity, or return on investment. For example, if the majority of R&D expenditure is going into existing product maintenance, then it is unlikely to offer the same ROI as investing in new innovations.

In their study Booz Allen Hamilton presents their results exploring the smart spenders of R&D. They question the degree of correlation between R&D and company financial performance: “There are no significant statistical relationships between R&D spending and the primary measures of financial or corporate success: sales and earnings growth, gross and operating profitability, market capitalization growth, and total shareholder returns. Gross profits as a percentage of sales is the single performance variable with a statistical relationship to R&D spending.”

However Booz Allen Hamilton assumes that R&D is limited to ideation, project selection, product development and commercialization, as illustrated below.

In practice we know that the R&D expense continues throughout the innovated product life-cycle to include sustainment and maintenance, where

  • Sustainment is the addition of new features to an existing product to maintain or gain market share.
  • Maintenance is the fixing of defects to ensure quality and hence customer loyalty.

As important as sustainment and maintenance are, R&D investment into these later phases of a product’s lifecycle is never likely to offer the same returns as investment in innovation, the more disruptive the better[3].

Maintenance and Sustainment is an inevitable consequence of Innovation

To truly evaluate the return-on-investment of R&D expenditure we need to distinguish between the R&D investments at different phases of the product life cycle as they surely offer different ROI.

So what is the typical distribution of expenditure for software products? A simple model, verified by actual observations, reveals some surprises.

  • Sustainment (adding new features) cost is 8-15% per annum of the original development cost and accumulated sustainment investment to date in any supported code. This 8-15% creates additional code that needs to be sustained and maintained in the future. For example if the original development was $100,000, budget $8,000-$15,000 per annum to add new features requested by customers in order to sustain a competitive product. Note that this would lead to a doubling of the code base over the typical 7-year life-cycle of a product.
  • Maintenance (providing bug fixes) cost is 8-15% per annum of development investment to date in any maintained product code. For example if the original development was $100,000, budget $8,000-$15,000 per annum for maintenance in the first year but expect that to grow as sustainment investment increases the code base to be maintained.

So let us apply this model to a start-up company that has decided to invest $150,000 per annum to create their new product. For the first few years this model works well, as the sustainment and maintenance costs are relatively minor. However, after a few years the sustainment and maintenance costs starve out continued innovation until, by Year 7, there is scarcely any innovative development at all. Does this look familiar to you?
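The arithmetic behind this can be sketched in a few lines of Groovy; a 12% rate is used as an illustrative value within the 8-15% band described above:

def budget = 150000.0                 // annual R&D investment
def rate = 0.12                       // sustainment and maintenance rate (illustrative, within 8-15%)
def codeBase = 0.0                    // accumulated innovation plus sustainment investment
(1..7).each { year ->
    def sustainment = codeBase * rate             // new features added to existing code
    def maintenance = codeBase * rate             // bug fixes to existing code
    def innovation = budget - sustainment - maintenance
    if (innovation < 0) innovation = 0            // sustainment and maintenance consume the whole budget
    codeBase += innovation + sustainment          // sustainment code must itself be sustained and maintained
    printf("Year %d: innovation %6.0f, sustainment %6.0f, maintenance %6.0f%n",
           year, innovation, sustainment, maintenance)
}

By Year 7 the combined sustainment and maintenance demands exceed the whole budget, which is the starvation effect described above.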


No wonder the ROI of R&D does not correlate well; it depends where a company is within this cycle. Code created for the initial release (innovation code) allows one to capture new markets or market share, which is surely more valuable than code added to supply additional features (sustainment code) to ensure that customers are satisfied, and that the product retains a competitive position, which is more valuable than code added to fix problems (maintenance code) to ensure quality and hence customer loyalty. Unfortunately the high-performing innovation investment reduces to less than 12% of R&D total. Over the life-cycle of a product, accumulated sustainment and maintenance can be 180-600% of the original innovation investment.

One could argue that a company which had a successful innovation would be growing, so its R&D budget would be growing proportionately. However, even if we modify our investment strategy and decide to maintain the innovation investment at a constant level, innovation as a percent of R&D would be reduced to 35% by Year 7 as shown below:

How to solve the Innovation Dilemma

Starvation of innovation is caused by the need to sustain and maintain existing code. Therefore, aside from making the innovation investment more productive, the following are suggested strategies.

Deprecate old products as soon as possible

Old products do not sit freely on the shelf. They are like volcanoes that have not erupted for some time. Are they extinct? Continuing to sell that old product could be less profitable than it appears, since doing so perpetuates its sustainment and maintenance. However, many careers may be wedded to these old products, so they become very difficult to kill.

Minimize the code investment in the original product (Lean!)

If you own a larger house, then sustainment and maintenance inevitably cost more. Code is no different. Thus any opportunity to downsize that code, yet still meet the functional requirements, will reduce the long-term sustainment and maintenance costs, releasing more R&D spend for future innovations.

  • Code that is created quickly and efficiently allows the product to be released to the market earlier, ensuring an increased internal rate of return or net present value of the investment.
  • Code that adds functionality to solve customer problems is likely to be more valuable than core component code that could be purchased from OEMs.

Capitalize code investment

This will probably make accountants pay attention, but expensing code as it is created disguises the fact that it really behaves like a capital investment; it will need sustainment and maintenance investment over the years to retain its value.

[1] Hall, Bronwyn, Jacques Mairesse & Pierre Mohnen, 2010, Measuring the Returns to R&D, in: Hall, B. and Rosenberg, N, Handbook of the Economics of Innovation, Elsevier, Amsterdam, pp. 1034-1076

[2] Booz Allen Hamilton (2006), Smart Spenders: The Global Innovation 1000

[3] Clayton M. Christensen, The Innovator’s Dilemma, Cambridge, Massachusetts: Harvard Business School Press, 1997


Displacing problems: do information technology ‘solutions’ create more problems than they solve? Do information technology ‘solutions’ simply displace problems to another group? HAZOPs have been used to systematically assess risks in processes and operations. Why not ‘HAZOP’ the supposed information technology solutions we propose?

The information technology track record in process manufacturing

We, and I speak for software and information technologists, have not always had a good record of solving the problems of the process manufacturing industry; instead we solve our own technical problems. Take, for example, the history of time-series data management over the last 30 years:

  • You had a problem of instrument data, and we solved it with historians, but you came back and told us we were not listening because …
  • You had a problem with data visibility, and we solved it with portals, but you came back and told us we were not listening because …
  • You had a problem with analyzing what was going on, and we solved it with analysis servers, but you came back and told us we were not listening because …
  • You had a problem with deploying analysis, and we now solve it with plant models, but you come back and tell us we are not listening because …
  • You had a problem with following through on the results, and we now solve it with work-flow, but you come back and tell us we are not listening because …
  • You do not have an IT department nor the in-house skills to deploy such technology, perhaps we will solve that by moving in-house applications into the ‘cloud’, but what will you come back with to tell us we were not listening?

Perhaps the future problem will be that you do not really have the in-house skills to deploy applications in such a way as to gain their benefit. Thus we must think about solutions that are self-deployable, with built-in best practices. The Apple iPad would not be very successful if, along with the iPad, you had to buy two days of consulting time to understand how to use the applications!

Another problem, if it can be called that, is that you will be employing Gen Y and Gen X, who are used to the iPhone, Twitter, Facebook, and other social applications. How will they respond to our applications, which sometimes resemble work-overs of 1980s mainframe applications?

The pace of change of the architecture, scope, and complexity of process manufacturing application solutions is far from slowing down. So if this thesis is correct, we might expect even bigger problems to be created by these supposed solutions. We need a strategy for anticipating those problems before they occur.

Avoiding future problems: Problem Displacement Assessment

So how do we avoid these mistakes in the future? This seems like an impossible goal, but in the event of a process incident there will be many who say that it should have been anticipated. To respond to this, the process industry has long used HAZOPs during the design or adaptation of critical processes and operations. A hazard and operability study (HAZOP) is a structured and systematic examination of a planned or existing process or operation in order to identify and evaluate problems that may represent risks to personnel or equipment, or prevent efficient operation. Why not apply the same structured examination to the problems that we will create when we solve another problem? HAZOP pivots around examining Deviations from each Intention, feasible Causes, and likely Consequences. We can map deviations, causes, and consequences to the impact of our ‘solutions’ as shown below:

HAZOP versus Problem Displacement Assessment (PRODIS)

Asset (HAZOP) / Application (PRODIS)
  HAZOP example: Heat exchanger
  PRODIS Example 1: Excel Unit production report
  PRODIS Example 2: Database model-driven reporting system

Intention
  HAZOP example: To heat 2.3 kg/s of 96% sulfuric acid from 20°C to 80°C.
  PRODIS Example 1: To provide a report of the balance of feed and products, together with production qualities (achieved using Excel with links to the data sources).
  PRODIS Example 2: To provide a report of the balance of feed and products, together with production qualities (achieved using a database model to deduce the feed and product streams and associated data sources).

Deviation
  HAZOP example: MORE: 20°C to 100°C.
  PRODIS Example 1: INACCURACY: regular imbalance of feed vs. products.
  PRODIS Example 2: INACCURACY: regular imbalance of feed vs. products.

Causes
  HAZOP example: Reduced flow.
  PRODIS Example 1: Excel macro mismatches actual material flows; inaccurate mass-flow measurements; missing measurements.
  PRODIS Example 2: Model mismatches actual material flows; inaccurate mass-flow measurements; missing measurements.

Consequences
  HAZOP example: Dangerously overheated sulfuric acid.
  PRODIS Example 1: Inability to understand the source of the errors; confidence in the report diminishes until it falls out of use.
  PRODIS Example 2: Requires the new skills of a DBA/analyst who can understand the underlying data structures needed to debug the problem; since these skills are not available, confidence in the report diminishes until it falls out of use.

 

So that is the proposal: whenever a new application is deployed or a significant adaptation is to be made to an existing application, let us undertake a systematic Problem Displacement Assessment using the same methodology as HAZOPS:

  • For each application,
    • For each intended behavior of that application,
      • For each possible deviation from the intended behavior,
        • Identify what might be the causes of the deviation,
          • Identify the consequences of the deviation.