A large community exists “that sees Linked Data, let alone the full Semantic Web, as an unnecessarily complicated technology” (Phil Archer). However, early adopters are often known for zealotry rather than pragmatism: mainframe-to-mini, mini-to-PC, PC-to-web, and many more transitions where the zealots threw the baby out with the bathwater.

Rather than abandoning the semantic web, perhaps we should take a pragmatic view, identifying its strengths together with some honest soul-searching about its weaknesses:

SPARQL is unlikely ever to be end-user friendly:

However, one can say exactly the same of SQL (which, believe it or not, used to be called User-Friendly Interface, or UFI for short). SQL is now hidden behind much more developer-friendly facades, such as Hibernate/JPA, C# + LINQ, OData, and many more, so that developers can deliver user-friendly experiences. Even so, SQL remains the language of choice for manipulating and querying an RDBMS.

So we still need SPARQL for manipulating semantic data, but we also need developer- and user-friendly facades that make the semantic information more accessible. There are some initiatives, such as JPA for RDF, LINQtoRDF, and OData-SPARQL, but the effort is fragmented, with few standards-based initiatives.
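As an illustration of what such a facade can look like, here is a minimal sketch using the rdflib Python library and a made-up ex: vocabulary (it is not any of the initiatives above, just the pattern of hiding SPARQL behind a friendlier API):

```python
# Sketch: hide SPARQL behind a small Pythonic facade, much as Hibernate or
# LINQ hide SQL. Assumes rdflib is installed; ex: is a hypothetical vocabulary.
from rdflib import Graph

TTL = """
@prefix ex: <http://example.org/> .
ex:pump101 ex:type ex:Pump ; ex:ratedPower 75 .
ex:pump102 ex:type ex:Pump ; ex:ratedPower 55 .
"""

class EquipmentFacade:
    """Developer-friendly wrapper: callers never see SPARQL."""

    def __init__(self, graph: Graph):
        self.graph = graph

    def pumps_rated_above(self, kw: int):
        query = """
            PREFIX ex: <http://example.org/>
            SELECT ?pump ?power
            WHERE { ?pump ex:type ex:Pump ; ex:ratedPower ?power .
                    FILTER (?power > %d) }
        """ % kw
        return [(str(row.pump), int(row.power)) for row in self.graph.query(query)]

g = Graph()
g.parse(data=TTL, format="turtle")
print(EquipmentFacade(g).pumps_rated_above(60))  # [('http://example.org/pump101', 75)]
```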

RDF/OWL is unlikely to supplant entity-relational or JSON/object databases:

However we have all learnt the lesson that the data model is often at the core of any application. Furthermore any changes to that data model can be costly, especially in the later stages of the development lifecycle. Who has not created an object-attribute data model deployed in an RDBMS to retain design flexibility without the need to change the underlying database schema? Are we not reinventing the semantic data model?

So the principles and flexibility of semantic (RDF, RDFS, and OWL) data modeling could be adopted as a data modeling paradigm, representing the evolution that started with Entity-Relationship (ER) modeling, went through Object-Role Modeling (ORM or NIAM), and arrived at Semantic Data Modeling (SDM). Used in combination with dynamic REST/JSON, such as OData, one could truly respond to the Dynamic Business Applications Imperative (Forrester).
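A small sketch of why a triple-based model is this flexible (rdflib assumed, with a hypothetical ex: vocabulary): a new attribute is just another triple, so the model evolves without an ALTER TABLE or a schema migration.

```python
# Sketch: in a triple store the "schema" is just more data, so new attributes
# can be added at any time. Assumes rdflib; ex: is a hypothetical vocabulary.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

# Initial facts about a piece of equipment.
g.add((EX.pump101, RDF.type, EX.Pump))
g.add((EX.pump101, EX.ratedPower, Literal(75)))

# Later the business needs a new attribute: just add triples, no migration.
g.add((EX.pump101, EX.installedDate, Literal("2012-05-01")))

for s, p, o in g.triples((EX.pump101, None, None)):
    print(p, o)
```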

Triple-stores are unlikely to supplant Big-Data:

However, of volume, velocity, and variety, many big data solutions are weakest when handling variety, especially when the variety of data changes over time as it does with evolving user needs. Most of the time the variety at the data source is resolved by Extract-Transform-Load (ETL) into a big-data store. Is this any different from data warehousing, which has matured over the last 20 years, except for the use of a different data storage technology? Like any goods in transit, data gets damaged and contaminated when it is moved. It is far better to use the data in situ if at all possible, but this has been the unachievable Holy Grail of data integration for many years.

So the normalization of any and all data into triples might not be the best way to store data, but it can be the way to mediate variable information from a variety of data sources: Ontology-Based Data Access (OBDA). The ontology acts as a semantic layer between the user and the data. The semantics of the ontology are used to enrich the information in the sources and/or to cope with incomplete information in them.

In summary …

Semantic technologies will see more success if they pursue the more pragmatic approach of solving those problems that are not satisfactorily solved elsewhere, such as querying and reasoning, dynamic data models, and data-source mediation, instead of re-solving the already solved.

References

Phil Archer: http://semanticweb.com/tag/phil-archer

LINQtoRDF: https://code.google.com/p/linqtordf/

JPA/RDF (Empire): https://github.com/mhgrove/Empire

OData-SPARQL: https://github.com/brightstardb/odata-sparql

The Dynamic Business Applications Imperative (Forrester): http://www.forrester.com/The+Dynamic+Business+Applications+Imperative/fulltext/-/E-RES41397?objectid=RES41397

Ontology-Based Data Access (OBDA): http://obda.inf.unibz.it/


Enterprises create data cathedrals with an enforced dogma to control data purity, causing much information to live outside their walls, where informal information bazaars thrive. These information bazaars have suspect quality and uncertain provenance, yet they are responsive to users’ needs. Metcalfe’s law suggests that the benefit gained from integrated information grows geometrically¹ with the number of data communities that are integrated. How can we balance the dogma of the data cathedrals and the spontaneity of the information bazaar?

Enterprises’ database cathedrals reflect corporate dogma. Nothing gets changed without approval from on high, and change is very slow. New databases get integrated only after a considerable time, and only on the assumption that the new data is 100% squeaky clean. So there are a lot of databases that sit entirely outside the database cathedrals’ walls. Badly behaved sources of data might even be excommunicated.

Where does the other data go? It is not as though this other data does not exist, although many would like to pretend so. Instead it all lives in the information bazaar. Anyone with any information can set up their own information stall and store their own data in Excel, Access, or anywhere else they want. They specialize in their own data for their own use, and that data is pretty good because it is all they need for their business. They share willingly with others, but on a barter basis. The information bazaar is chaotic, but it is lively, always changing to meet users’ demands, and a fun place to be.

Why do we have the conflict between the database cathedral and the information bazaars?

The data cathedral offers security, quality, and good provenance. It provides the system of record for users, who can then have complete confidence in their decision making. It does this using accurate relational models that capture enterprise information. But a relational model is designed by the cathedral hierarchy based on a closed model: only pure data can be entered into the database; impure data can lead to excommunication.

The information bazaar has few rules of entry. As demonstrated by the web, it allows anyone to say anything about anything (AAA). Even with this deficiency we regularly search the web to help us with our decision making, avoiding sources that are suspect and filtering out information that we feel lacks accuracy, until we end up with information that supports our decision.

Can we resolve these conflicting objectives?

Can we expect the cathedral hierarchy to relax its admittance criteria to let in as much of the information bazaar as possible? Somewhat, but we cannot expect miracles.

Can we expect the information bazaar to become more sober and responsible so that it can securely provide information with guaranteed quality and provenance? Somewhat, but we cannot expect an evangelical conversion.

Really this is not optimal, because the benefit of having data integrated grows geometrically with the number of interconnected sources, yet the database cathedral cannot grow because the information bazaar does not meet its purity dogma.

So how can these conflicting objectives be redeemed?

One path to redemption is to unite the information bazaar through a common semantic model. This allows all information to be available within a universal graph (model). Of course some riff-raff will get in, but that is actually an advantage of the semantic model: you can declare rules that verify the accuracy of the data even after it has been stored.

At the same time the data cathedral can continue to expand, hopefully at a faster pace, by integrating those graphs that meet its criteria.

However, we can allow users to access both the data cathedral, from which they obtain the system of record, and the information bazaar. We could even report results federated from the two data sources, annotating the information that comes from the information bazaar with its provenance and hence its less certain data quality. Doing this in a standards-compliant way turns existing enterprise information resources into connectable, responsive, and interoperable semantic assets.

Harmony

Using this approach we neither force the data cathedral to relax its dogma nor ask the information bazaar to shut down. Yet we can offer users access to 99% of the enterprise information, providing the ‘Metcalfe’¹ benefits of full integration. As semantic assets grow and connect, they enable a resilient semantic ecosystem of meaningful interactions between people, applications, and data, irrespective of differences in structures, data schemas, governance, and technologies. The dividing boundaries between the cathedral and the bazaar no longer need to be obstacles to information users. The semantic ecosystem seamlessly embraces and provides integrated access to data cathedrals and information bazaars alike.

 

¹ If I have 10 database systems running my business that are entirely disconnected, then the benefits are 10 * K, for some constant K. If I integrate these databases in pairs (operations + accounting, accounting + payroll, etc.), then the benefits increase to 10 * K * 2. If I integrate in threes (operations + accounting + maintenance, accounting + payroll + receiving, etc.), then the benefits increase four-fold (a corollary of Metcalfe’s law) to 10 * K * 4. For quad-wise integration the benefits would be 10 * K * 8, and so on. It might not be exactly eight-fold, but the point is that there is geometric, not linear, growth in benefits as I integrate all of my information across my organization.
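The footnote’s arithmetic can be written compactly. Assuming N disconnected databases, a constant K, and integration in groups of size g, the pattern above is

$$\text{benefit}(g) \approx N \cdot K \cdot 2^{\,g-1},$$

so pairs (g = 2) double the benefit, threes quadruple it, and quad-wise integration gives an eight-fold gain: geometric rather than linear growth.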


A key process manufacturing problem yet to be solved is the management of knowledge: the know-how, know-who, and know-when. Just as we have been eliminating data silos by introducing common data repositories, and creating common information (semantic) services by adding structure to the data, we need a common rules repository rather than leaving rules distributed across Excel spreadsheets, workflows, documents, government regulations, and so on. That distribution causes silos of rules, a lack of consistency, difficulty in ensuring consistent compliance, and many more problems.

Semantic models (RDF, RDFS, and OWL) are excellent at federating information from multiple sources, although there are competing approaches to information integration. However, semantic models are also an ideal way to express rules (for example with SPIN), because the rules are described as data within the same database, unlike other technologies where the rules have to be expressed as code.

Given the importance of consistent application of rules in the process manufacturing business, is this then the sweet spot for semantic technologies?

Is data transparent?

We have been struggling with managing knowledge about our process plants for many years. We tried to solve this problem starting in the 1980s with sophisticated data collection applications and real-time databases. However, a data-centric solution alone only provides a view of the process plant as a very long list of measurement tags, reinforcing the definition of data as “being discrete, objective facts or observations, which are unorganized and unprocessed and therefore have no meaning or value because of lack of context and interpretation.”

Thus our view of the knowledge about the plant, its equipment, its operation, and its performance provides a new definition of ‘transparency’, or rather the lack thereof. If we do want to obtain knowledge we must make use of application programs within which our knowledge-extraction rules are encoded. Most prevalent is the ubiquitous spreadsheet, in which there are:

  • Object names written into cells
  • RTDB Tag names in hidden cells
  • A fixed number of feed/product rows
  • New spreadsheet for each unit/report

The consequence of this data without information or knowledge is a gaggle of spreadsheets: information is encoded in the tag names, knowledge is encoded in the Excel layout and formulae, action relies on a user running the application and using their experience to detect problems and deduce remedial actions, and there is uncertainty as to whether the rules are applied consistently throughout the organization, among many other problems.

Data + Model = Information

The problems of the data-centric approach have driven the process manufacturing industries to seek better information management where information is defined as “organized or structured data, which has been processed in such a way that the information now has relevance for a specific purpose or context, and is therefore meaningful, valuable, useful and relevant.” Recognizing the deficiency of a data-only approach, much effort in the last 20 years has been expended on adding context to this data. This usually takes the form of a database schema which contextualizes the data within a model of the process business and plant.

One of the ways that structure is added to data is to use a relational data model. A cornerstone of the relational model is the use of referential integrity rules. An example of a referential integrity rule within the context of a process plant data model might be that a material movement must have one, and only one, source and destination. However there are limits to what rules can be expressed with referential integrity constraints alone.
To obtain knowledge using this information approach we must ask the database via a database query. The advantage of the information approach over the data approach is that it allows one to ask the database complex questions. For example, to obtain a unit’s material balance:

  • For selected unit
    • Find all feed streams
    • For each feed stream fetch desired criteria

This allows one report to be used for all units throughout the enterprise, compared with a new spreadsheet for every unit. These reports are also more robust since, for example, if the plant changes in some way then the report will reflect those changes. This certainly tackles some of the inconsistency problems faced by a data-centric approach; however, many problems remain. For example, it would be exceedingly difficult to express the rule that an operator’s training currency expires, say, 2 years after their last training. Thus we still need to resort to complex reports, application programs, Excel macros, etc., into which we can encode our ‘rules’ about the information.
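As an illustration, the unit material-balance question above might be asked of a semantic plant model as follows. This is a sketch only, using the rdflib Python library and a hypothetical ex: vocabulary; the same idea applies to a query against a relational schema.

```python
# Sketch: one query serves every unit; only the selected unit changes.
# Assumes rdflib; ex: is a hypothetical plant vocabulary.
from rdflib import Graph

TTL = """
@prefix ex: <http://example.org/> .
ex:unit1 ex:hasFeed ex:s1, ex:s2 .
ex:s1    ex:measuredFlow 120.5 .
ex:s2    ex:measuredFlow  80.2 .
"""

QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?stream ?flow
WHERE {
  ex:unit1 ex:hasFeed ?stream .       # for the selected unit, find all feed streams
  ?stream  ex:measuredFlow ?flow .    # for each feed stream, fetch the desired criteria
}
"""

g = Graph()
g.parse(data=TTL, format="turtle")
print(sum(float(flow) for _, flow in g.query(QUERY)))  # total feed: 200.7
```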

Information + Rules = Knowledge

As good as an information-centric approach may be, it still fails to solve many of the business problems that we face, problems that can only be solved by creating knowledge, defined as “know-how, know-who, and know-when; knowledge is action, not a description of action”. Some of the business challenges that can only be solved by a rule-centric solution are shown below.

Business Rule Challenges
  • Business rules are distributed throughout the business
    • Problem: It is very difficult to know all of the business rules in place, whether they are duplicated, and whether they are consistently applied.
    • Rule-centric solution: Provide a common ‘rules repository’ for the entire organization, paralleling the concept of data warehousing for data.
  • Many Excel spreadsheets contain knowledge of how to handle information
    • Problem: Data management is left to IT or the DBA; end users cannot modify the model or *their* data, so instead they resort to Excel, on the path to hell!
    • Rule-centric solution: Use a rules server to consistently apply the semantic rules that are then common to all applications, and use Excel for reporting against data, information, and the inferred results of rules.
  • Custom reports written against custom information models have encoded rules
    • Problem: Reporting languages such as SQL end up containing many business rules, and rules are duplicated in similar but not identical reports.
    • Rule-centric solution: Report against the information server and the inferred results of applying the rules to the information.
  • Manual work processes, some of which are not documented
    • Problem: Exercises such as ISO 9001 remain only as documented procedures with no means of automating those procedures.
    • Rule-centric solution: Use the common rules repository to define the rules; documentation of the rules can then be created from the rules repository.
  • Regulatory compliance rules exist only as documents
    • Problem: It is difficult to assure compliance with regulations when it is left to individuals who must be familiar with the entire regulation.
    • Rule-centric solution: Translate the regulations into rules deposited in the rules repository.
  • Loss of skills with an aging workforce
    • Problem: Loss of knowledge in the form of the rules (aka experience) for how to handle situations.
    • Rule-centric solution: Capture the experience as rules within the rules repository.
  • Difficult to audit the adherence to rules
    • Problem: Since the rules are not formalized it is difficult to ensure that procedures are followed. Personnel might be trained on procedures, but if the system does not enforce them then management cannot be certain they are followed.
    • Rule-centric solution: Since most actions are recorded, it is possible to verify that the actions taken comply with the rules even if users were not forced to follow the rules in the form of a workflow.
  • Complexity of data and information
    • Problem: It is difficult for users to determine which rules should apply.
    • Rule-centric solution: The results of rules become inferred information that is available for reporting.
  • Impossible end-user reporting
    • Problem: End-user reporting is the Holy Grail but is never achieved. Even with a good informational model that provides context to the data, the ‘knowledge’ that should be the result of a report (‘poor yields’, ‘excessive material loss’, ‘pending equipment failure’) is impossible for end users to encode into their reports.
    • Rule-centric solution: Semantic information can be reported using ‘query by example’, far easier than any other reporting, and the inferred results of rules are available for reporting using the same technique.
  • Knowledge, defined as “know-how, know-who, and know-when”, requires rules about information, which requires contextualized data
    • Problem: Measurement tags disguise the model, forcing users to interact with abstract measurement tags (10FI107.OP); general information models become too complex; and customers want to support standards, but competing standards are supported by different constituencies (PPDMA/ProdML/WitsML/ISA95/MIMOSA/IEC-CIM etc.).
    • Rule-centric solution: Although rules can be applied to any informational model, a semantic informational model is a better match to semantic rules.

To obtain knowledge from a data-centric approach we encoded many rules into applications such as Excel. Although the information-centric approach can encode certain types of rules into the database schema, such as the referential integrity examples above, there are many rules about the business that cannot be expressed in this way. Throughout any business we have many rules distributed across spreadsheets, reports, application programs, workflows, procedures, etc. Below are examples of rules that can be found throughout process manufacturing.

Example operational compliance rules throughout the Process Manufacturing Model (PMM)
The rule types are: Validation (validate that the information is consistent with known rules, such as that a movement must have a source and a destination), Calculation (calculate another data statement, such as the power of a pump being the product of the flow and the pressure rise), Deduction (deduce additional object statements, such as knowing that the measurement of something downstream is the same as the upstream measurement), and Invocation (invoke an external process to ensure the correct action is taken based upon the change of a rule).

  • Materiel Chain / Finance: no examples listed.
  • Safety
    • Validation: validate that equipment in use has a valid HAZOP assessment.
    • Invocation: initiate the safety review work process after an incident; initiate the HAZOP review whenever major equipment is changed.
  • Docs
    • Invocation: initiate a review of the encoded rules when a document containing rules is revised.
  • Assets / HME
    • Validation: validate that the equipment has undergone the appropriate repair or upgrade as recommended.
    • Calculation: calculate Overall Operational Efficiency (OEE) based on availability, planned, and actual.
    • Deduction: deduce the onset of increased operational risk based on past observations and planned use of equipment.
    • Invocation: initiate the maintenance or repair process.
  • Fixed Plant
    • Calculation: provide efficiency and energy consumption calculations based on data and model.
    • Deduction: deduce links to MSDS, maintenance records, maintenance procedures, and other documents.
  • Technical Infrastructure
    • Validation: validate that critical equipment has a valid security policy; validate that users with access to critical equipment have valid training.
    • Deduction: deduce the connectivity between critical equipment through the LAN.
    • Invocation: initiate remedial action to update software and utilities when a risk is identified.
  • Security
    • Validation: validate that the user has the correct access privileges to perform this action.
    • Calculation: calculate the currency of any user’s privileges to perform an action.
    • Deduction: deduce that the building containing critical equipment has secure access controls.
    • Invocation: initiate a security review when equipment is moved to a new location.
  • People
    • Validation: validate the correct training status of individuals.
    • Calculation: calculate the time remaining for the currency of their training.
    • Deduction: deduce which assets and facilities an individual has access to based on their training.
    • Invocation: initiate a retraining program when retraining is deduced to be necessary.
  • Consumables
    • Validation: validate that the inventory of consumables matches the measured consumption.
    • Deduction: deduce the route of consumable materials (additives etc.) into the product stream based on the topology so that the costs can be correctly calculated.
  • Raw Materials: no examples listed.
  • Utilities
    • Calculation: calculate the quantities of utilities in the absence of complete measurements.
    • Deduction: deduce the route of utilities (water, electricity, fuel, etc.) into the production facilities based on the topology so that the costs can be correctly assigned.
  • Emissions
    • Validation: validate that no measured emissions exceed regulations.
    • Calculation: calculate emissions that are not directly measured; calculate total emissions.
    • Deduction: deduce the flow of regulated material from the plant topology.
  • Business Procedures
    • Validation: validate that the correct procedures are being followed.
    • Deduction: deduce which business processes should be applied in particular situations. For example, if an area is designated as secure, then all processes applied to sub-areas must follow that same designated work process.
    • Invocation: initiate a process to update work processes when deviations from the recommended work processes are detected.
  • Intellectual Property: no examples listed.
  • Product / Resource
    • Validation: validate that there are valid exploration rights associated with options.
    • Calculation: calculate the time remaining to take advantage of exploration rights.
    • Invocation: initiate a review of exploration rights prior to their expiration.
  • Field
    • Validation: validate that the field has active contracts.
    • Calculation: calculate the royalty payments based on the individual contracts.
    • Deduction: deduce the applicable contract rules.
    • Invocation: initiate contract reviews and payment processes.
  • Well
    • Validation: validate that each well has an active contract; validate that each well is operating in accordance with its operating permits.
    • Calculation: calculate the actual flow based on pressure and temperature (in the absence of a flow measurement); calculate the variables required for regulatory reporting.
    • Deduction: infer the line-up between the well and the receiving station based on the topology of the lines.
    • Invocation: initiate the transmission of regulatory reporting requirements.
  • Pipeline
    • Validation: validate that the nomination and routing information is complete: source and destination, quantity, and quality.
  • Crude Storage
    • Validation: validate that a new crude from the pipeline is not being run into the incorrect storage.
    • Calculation: calculate the overall assay of the inventory based on the component assays.
    • Deduction: deduce the assay available at the crude unit based on the line-up of the crude tanks to units.
    • Invocation: initiate rerouting of incoming crude to more appropriate storage; initiate crude switchover on the crude unit based on the assay of the new crude tank.
  • Processing
    • Validation: validate that the configured mode of operation matches the planned or scheduled mode of operation.
    • Calculation: calculate material balances, yields, qualities, and efficiencies.
    • Deduction: deduce measurements of downstream elements based on the operational configuration and knowledge of the location of actual measurements; deduce the operational configuration and flow model of the plant given the material movements and battery-limit flows.
    • Invocation: initiate a workflow to switch modes of operation.
  • Storage
    • Validation: check that material planned to flow into storage is compatible with the in-store material.
    • Calculation: calculate the actual contents of the storage.
    • Deduction: deduce the grade of the stored material based on the existing stored material and inbound movements.
    • Invocation: invoke rescheduling of blends when an actual blend is found to be out of specification.
  • Pipeline
    • Validation: validate that material is not planned for a line that would contaminate the contents of the line or the planned material.
    • Calculation: calculate the material movement based on either source or destination quantity measurements.
    • Invocation: invoke a custody-transfer dispute when a transfer is outside the acceptable measurement deviation.
  • Port Storage
    • Validation: validate that the quantity is available for planned shipments.
    • Calculation: calculate the inventory remaining after current shipment commitments.
    • Deduction: deduce the grade of material based on the mixed assay of the storage or stockpile.
    • Invocation: initiate pull-through of more inventory when commitments exceed current inventory and planned receipts.
  • Shipping
    • Validation: check that the vessel is compatible with the scheduled berth.
    • Calculation: calculate demurrage charges based on agreed rules.
    • Deduction: deduce the stored material destination based on the vessel berth.
    • Invocation: initiate a loading re-schedule in the event that a vessel is delayed.
  • Customer
    • Validation: validate that the customer order has a valid contract upon which the transfer can be based.

 

These rules exhibit the same characteristics that led us to resolve the original data-silo problem. A simple example is that we would want to calculate the corrected custody-transfer quantities for both operational and financial needs. We can also observe that rules span multiple business areas. For example, the currency of an operator’s privileges spans the training records, access control to the building housing the equipment, the maintenance records of the equipment, and more. Finally, we do not want the rules to be passive. Instead we want any deviation from the rules to initiate, or at least recommend, the correct remedial action. Thus we want our rules and information to be combined to achieve active knowledge, as shown below.

Knowledge + Action = Results

Even if we have the perfect set of rules, they have no business value unless we act upon the know-how, know-when, and know-who. Thus it is important to close the business loop by taking action on the knowledge to produce the desired results; no action, then no results. This means that the knowledge must have a mechanism for invoking the remediation process.

Realizing a rule-centric solution

How is it possible to abstract the handling of rules away from the individual applications into which they are encoded? Our recommendation is a separate rules repository: a container that defines all of the rules. Although there are several candidates for describing rules, one favored choice is to define the rules semantically using something like SPARQL Rules (SPIN). Since the information is conveniently modeled semantically, it makes sense to harmonize the technologies and use the same approach for the rules repository. The complete rules-centric architecture is shown below.

 

Data:                Raw measurements collected from the instruments and data entry, stored in real-time databases and historians.

Technology:      Real-time historians

Model:             Context and structure added to the data to create information. It takes the form of a database schema in the case of a relational model, or an ontology in the case of a semantic model, plus the configuration that represents the plant: equipment, topology, etc.

Technology:      Relational schema, object structure or semantic ontology

Information:     The combination of data and model manifested as a database, relational, object, or semantic.

Technology:      Relational, object or semantic data store

Rules:               A repository of the rules. Traditional information-system architectures fail to separate this out as a distinct element; instead the rules are distributed throughout the application systems. We propose that all rules should be held in a common repository, just like data and information. This repository should be able to handle all rules: validations, calculations, deductions, and invocations. The best choice for organizing such a repository is semantic, as this allows both information and rules to share the same technology (see the sketch after this list).

Technology:      Semantic rules data store

Knowledge:     The combination of information and rules, manifested as an inference engine capable of executing the rules. However, it is unrealistic to expect all rules to be executed only within the inference engine, so rules within spreadsheets, workflows, applications, and calculation engines should be synchronized with the rules repository.

Technology:      Rules inference engine together with synchronization interfaces

Action:             The actions invoked by the knowledge manifested as a workflow engine capable of invoking external actions via web-service interfaces.

Technology:      Workflow or temporal rules engine.

Visualize:         A portal through which the data, information, knowledge, and actions can be presented, as well as through which the model and rules can be configured.

Technology:      Portal, preferably one whose presentation is semantically deduced from the action, knowledge, information, and data

Control:            Either part of the visualization portal or a separate application through which the users can execute control based on the actions.

Technology:      Conventional control system interface through which users can control the plant.
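To make the Rules and Knowledge elements concrete, here is a minimal sketch of rules held as data and executed over the information model. It uses plain SPARQL run with the rdflib Python library over a hypothetical ex: vocabulary; SPIN/SPARQL Rules packages such rules more formally, but the principle, rules stored as data alongside the information, is the same.

```python
# Sketch: a validation rule and a deduction rule, both stored and run as data.
# Assumes rdflib; ex: is a hypothetical vocabulary.
from rdflib import Graph

TTL = """
@prefix ex: <http://example.org/> .
ex:mv1 a ex:Movement ; ex:source ex:tank1 ; ex:destination ex:unit1 .
ex:mv2 a ex:Movement ; ex:source ex:tank2 .   # no destination: should be flagged
ex:tank1 ex:feedsInto ex:unit1 ; ex:measuredFlow 50.0 .
"""

# Validation rule: every movement must have a source and a destination.
VALIDATE = """
PREFIX ex: <http://example.org/>
SELECT ?m WHERE {
  ?m a ex:Movement .
  FILTER NOT EXISTS { ?m ex:source ?s . ?m ex:destination ?d }
}
"""

# Deduction rule: a downstream element inherits the upstream measurement.
DEDUCE = """
PREFIX ex: <http://example.org/>
CONSTRUCT { ?down ex:inferredFlow ?f }
WHERE     { ?up ex:feedsInto ?down ; ex:measuredFlow ?f }
"""

g = Graph()
g.parse(data=TTL, format="turtle")

for (movement,) in g.query(VALIDATE):
    print("rule violation:", movement)   # ex:mv2 has no destination

for triple in g.query(DEDUCE):
    g.add(triple)                        # inferred statements join the same graph
```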

Is it feasible?

One of the key architectural elements is the management of rules as data, along with the closely related model and action elements. Outside of the process manufacturing industry, and especially in finance and insurance, rules engines have been in use for some time, so there are several vendors, listed below. Of particular interest are TopQuadrant, the sponsor of SPARQL Rules (SPIN), which provides a standards-based way to define rules and constraints for Semantic Web data, and OntoRule, an EU project that brings together leading vendors of knowledge-based systems and a handful of top research institutions to develop the technology that will empower business policies in the enterprise of the future.

  • Corticon: Decision table or rulesheet-centric business rules management system
  • FICO Blaze Advisor: General-purpose business rules management system with .Net, Java, and COBOL deployment
  • IDIOM: Decision-centric business rules management system
  • IBM ILOG Rules: General-purpose business rules management system with .Net, Java, and COBOL deployment
  • InRule: .Net-based business rules management system
  • JBoss Drools/JBoss Enterprise BRMS: Open-source business rules management system that is working on updating its decision tables
  • Modellica: European business rules management system focused on the credit-risk business, available in the US through GDS Link
  • OntoRule: Leading vendors of knowledge-based systems and a handful of top research institutions join their efforts to develop the technology that will empower business policies in the enterprise of the future
  • OpenRules Decision Management System: Open-source, Excel-based business rules management system
  • Pegasystems: A unified business rules and process management environment, now including the Chordiant decision management products
  • Progress BRMS: Drools-based business rules management system acquired with Savvion
  • Sparkling Logic: A “social-logic” platform for managing business rules
  • TopQuadrant: TopBraid Suite™ leverages emerging technology to help customers connect silos of data, systems, and infrastructure and to build flexible applications from linked data models. SPIN is a standards-based way to define rules and constraints for Semantic Web data
  • Visual Rules: Java-based business rules management system from Bosch Innovations
  • XpertRule: XpertRule develops advanced business rules management and expert system software that helps organizations:
    • Capture expertise and skills in risk assessment, advising, and performance improvement as well as in selling and supporting both products and services
    • Comply with regulations, policies, laws, and legislation
    • Automate process orchestration both for intelligent front-end user interface navigation and for back-end process flow and data interchange
  • Zementis: A cloud-based execution platform for business rules and analytic models


Excel Solver has some limitations for production-ready data reconciliation, despite Excel often being used for process manufacturing production and operational accounting. To overcome these limitations, Resolver is an Excel add-in that uses the same model configuration as Solver, but with a solver that is optimized for the data reconciliation problem. This results in a fast, robust, and very accurate solver for linear, nonlinear, equality, and inequality data reconciliation problems. The combination of Excel and a robust solver could bring ubiquity to process data reconciliation.

Resolver is an Excel add-in that performs data reconciliation using a high-performance, accurate, and robust algorithm optimized for the data reconciliation problem. It uses the same problem setup as the Solver Excel add-in; in fact the problems are interchangeable. You can set up and solve a problem in Resolver and then use the same setup and problem definition to try to solve it with Solver, just in case you want verification of the results!

Resolver came about as a result of trying to use Solver to solve process data reconciliation problems, as reported in “Is Excel Solver suitable for Production Data Reconciliation?”. As noted there, Excel has a lot of advantages for setting up process data reconciliation problems and is widely used for operational and production accounting. Along with those advantages, some issues were revealed, and these have been overcome with Resolver.

Excel Data Reconciliation Problem Setup

Resolver uses a reconciliation model defined via constraint equations expressed as Excel cell formula in the same way as Solver.

Solving with Resolver

The data reconciliation problem is defined in a very similar way to Excel Solver. In fact the model definitions are interchangeable, although Resolver can only solve for the Gaussian objective function, and Solver cannot produce the reconciliation diagnostics.

Resolver writes the results back into the variable cells and also produces a report to help diagnose reconciliation problems. The report provides the well-established gross, constraint, and measurement error diagnostics.

Solver versus Resolver Excel Process Data Reconciliation

Advantages of Resolver-based Data Reconciliation

  • The use of symbolic differentiation of the constraint equations eliminates the need for numeric differentiation and its inherent inaccuracy.
  • Resolver does not rely on Excel to recalculate the objective function, which makes it very fast.
  • The constraint-sequential solution technique ensures solution stability even when the problem is poorly formulated, for example with zero tolerances or redundant constraints.
  • Covariance factorization improves numeric stability and accuracy.
  • Compatible (and interchangeable) with Solver problems.
  • Data reconciliation can take place within the same spreadsheet environment in which many production and operational accounting systems already exist.
  • Constraints are defined as Excel cell formulae, not as strings within Excel cells using the reconciliation application’s own syntax.

Disadvantages of Resolver-based Data Reconciliation

  • There is no visual modeler or designer of the flowsheet or problem.
  • Constraint formulae need to be differentiable.

Advantages of Solver-based Data Reconciliation

  • “Very easy to prepare data within Excel and to post process the results for presentation.”
    • Resolver uses the same setup and problem definition. In fact they are interchangeable.
  • “Solver is not limited to linear-only constraints, thus component balance (bi-linear) reconciliation problems can be handled, although they were not tested here.”
    • Resolver can handle the same range of problems, which includes the ‘traditional’ linear balances, non-linear balances, and inequality constraints. To achieve its accelerated and robust convergence, Resolver uses symbolic differentiation of the constraint formulae provided in the Excel cells and their precedents. At present not all Excel functions are supported; additional functions will be supported in future releases of Resolver.
  • “Solver is not limited to equality constraints, thus for example all estimates can be forced to be positive.”
    • Resolver supports inequality constraints, as well as the convenient option that forces all variables to be positive (Assume Non Negative).

Limitations of Solver-based Data Reconciliation

  • “Solver uses a generic solver but is not optimized for the characteristics of a data reconciliation problem, such as linear or bi-linear problems, and quadratic cost functions.”
    • Resolver uses an algorithm[1] optimized for the data reconciliation problem that is fast, accurate, and robust.
  • “It is easy to define a problem that becomes unstable. In the example, adding the redundant constraint of the ‘world’ causes solution instability.”
    • Resolver uses an algorithm[1] that is very efficient and robust. It is optimized to handle some of the quirks of the process data reconciliation problem such as variables that differ significantly in numeric value, unobservable variables, and variables that we wish to ‘fix’[2].
  • “Irrespective of objective function, Solver took 25~75 iterations to converge. This could be an issue for realistically sized (100~1000’s of variables) reconciliation problems.”
    • Resolver will solve linear data reconciliation problems with equality and inequality constraints immediately. Nonlinear problems are usually solved within a few (5~15) iterations.
  • “There is no access to such intermediate products such as the covariance matrix with which to perform more sophisticated gross error detection.”
    • Resolver produces all of the intermediate diagnostics needed to perform gross error detection, measurement error detection, and constraint error detection. These are made available so users can superimpose their own constraint and measurement removal algorithms.

Excel Data Reconciliation Model Setup

Model Constraints

Resolver uses a reconciliation model defined via constraint equations expressed as Excel cell formula in the same way as Solver.

Thus we can express the reconciliation model constraints as Excel formula within cells using the familiar Excel syntax. As usual Excel will display the numeric result of these formulae as shown below. I chose to organize the formula to show the total feed and total product in adjacent cells, but other formulations are possible.

However Excel also allows us to ‘Show Formulae’, offering the best of both worlds. Since I chose to use named ranges, the constraint formulae are very readable. Not only that, we can use ‘Trace Precedents’, and ‘Trace Dependents’ to track the relationship between constraints, a level of user interaction that is difficult to duplicate in dedicated reconciliation applications.

Also, of course, there is nothing to limit you to linear reconciliation problems in Excel.

Model Variables

Reconciled variables are organized as named cells so that the constraint equations are more readable. These are the variables adjusted by Solver and Resolver. In principle these variables could be distributed throughout the spreadsheet; however, it is more convenient to group them with the tolerances and estimates (measurements) in adjacent cells:

Furthermore, the named ranges in the Labeled Variable column are in fact ‘arrays’ pointing to the Solution, Tolerance, and Estimates.

This is a convenient way of grouping the solutions with the corresponding tolerances and estimates in a way that is compatible with both Solver, which only requires access to the variable value, and Resolver, which needs access to the corresponding measurement and tolerance. As a reminder, to create an array variable select the cells as shown below and complete the edit with ‘Control-Shift-Enter’.

This will be shown as follows in the Excel edit box. Note the curly-brace {…} indicating that this is an array formula.

Reconciliation objective function

Resolver minimizes the straightforward sum-of-weighted-squares (Gaussian) objective, the formulation that most reconciliation systems use. It can be argued that the measurement errors are drawn from a Gaussian distribution; beyond that, the Gaussian formulation is popular mainly because, in the case of linear constraints such as mass balances, the objective function can be minimized without the need for an iterative search.
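In symbols, with measurements m_i, tolerances (standard deviations) σ_i, and reconciled values x_i, the objective is

$$\min_{x}\;\sum_{i}\left(\frac{x_i - m_i}{\sigma_i}\right)^{2}\qquad\text{subject to}\quad g(x)=0,\;\; h(x)\le 0,$$

where g(x) collects the equality constraints (for example, mass balances) and h(x) any inequality constraints.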

For convenience the Resolver add-in includes a Gaussian function that will calculate the same objective function as used by Resolver. When a model is saved, a cell is created that contains a suitable target which Solver can use.
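For readers who want to see the underlying calculation, here is a small sketch (in Python with numpy, not Resolver’s own code) of the linear, equality-constrained case solved in closed form: a single mass balance, feed = product1 + product2.

```python
# Sketch: weighted least-squares reconciliation with one linear constraint,
# solved in closed form. Measurements, tolerances, and A x = b are assumptions
# for illustration only.
import numpy as np

m     = np.array([100.0, 60.0, 45.0])   # measured flows: feed, product1, product2
sigma = np.array([2.0, 1.5, 1.5])       # measurement tolerances (std deviations)
A     = np.array([[1.0, -1.0, -1.0]])   # constraint: feed - product1 - product2 = 0
b     = np.array([0.0])

W = np.diag(sigma**2)                   # measurement covariance
r = A @ m - b                           # constraint residual of the raw measurements
x = m - W @ A.T @ np.linalg.solve(A @ W @ A.T, r)   # reconciled values

print(x)        # ~[102.35, 58.68, 43.68]
print(A @ x)    # ~[0.0]  the reconciled values satisfy the balance
```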

Reconciling with Resolver

Define and Solve a Reconciliation Problem

On the Data tab, in the Resolver group, click Resolver.

  • If the Resolver command is not available, load the Resolver Add-in (add-in: A supplemental program that adds custom commands or custom features to Microsoft Office.)
  1. Click the Microsoft Office Button, click Excel Options, and then click the Add-Ins category.
  2. In the Manage box, click Excel Add-ins, and then click Go.
  3. In the Add-ins available box, select the Resolver Add-in check box, and then click OK.

In the Subject to constraints frame, enter any constraints

(constraints: The limitations placed on a Solver problem. You can apply constraints to adjustable cells, the target cell, or other cells that are directly or indirectly related to the target cell.) that you want to apply.

  • How to add a constraint or a range of constraints
  1. In the Constraint frame, under Subject to the Constraints, click on the ellipsis (…) to identify the cell or range that defines the right-hand or left-hand side of the constraint.
  2. In the Cell Reference box, enter the cell reference (cell reference: The set of coordinates that a cell occupies on a worksheet. For example, the reference of the cell that appears at the intersection of column B and row 3 is B3.) or name (name: A word or string of characters that represents a cell, range of cells, formula, or constant value. Use easy-to-understand names, such as Products, to refer to hard to understand ranges, such as Sales!C20:C30.) of the cell range for which you want to constrain the value.
  3. Repeat for the other side of the constraint, and select an alternative equality if required. This equality will be applied to all of the constraints within the selected range.
  4. Click the Add button to add this constraint to the set of constraints that define the reconciliation problem. The constraint will be checked to ensure the same number of cells is defined in the ranges on the left-hand and right-hand sides of the constraint.
  5. This can be repeated until all constraints have been added to the reconciliation problem.
  • How to update or delete a constraint
  1. Select a constraint to be updated or deleted from the constraints list.
  2. If Edit is chosen, the constraint is transferred to the Constraint frame, where it can be updated.
  3. If Delete is chosen, the constraint is removed from the model.

In the By adjusting variables frame, enter the variables that you want to reconcile.

  • How to add a variable or range of variables
  1. In the Variable frame, click on the ellipsis (…) to identify the cell or range that defines the variables. These are the array formula cells described above. The use of an array formula provides consistency with Solver and gives Resolver access to the corresponding measurement and tolerance values.
  2. Click the Add button to add this variable or variables to the set of variables that define the reconciliation problem. This can be repeated until all variables have been added to the reconciliation problem.
  • How to update or delete a variable
  1. Select a variable or range of variables to be updated or deleted from the variable list.
  2. If Edit is chosen, the variable is transferred to the Variable frame, where it can be updated.
  3. If Delete is chosen, the variable is removed from the model.
  • How to let Resolver guess the reconciled variables
  1. Use the Guess button to use the defined constraints to guess the adjustable (reconciled) variables.
  2. Guess will add any additional variables to the set of variables. It usually does a good job, but success is not guaranteed.
  • How to consolidate the list of reconciled variables
  1. Use the Consolidate button to merge the variables into one or more ranges if the list of variables becomes fragmented.

Click Resolve to initiate the reconciliation.

  • To use the reconciled values on the worksheet, click Keep Resolver Solution in the Resolver Results dialog box.
  • To restore the original state, click the Restore Original Values.
  • To create a report that is based on your solution, you can click Answer Report. The report is created on a new worksheet in your workbook. If Resolver does not find a solution, the report option is unavailable.

Save or load a problem model

  1. In the Resolver Parameters dialog, click Save Model or Load Model.
  2. When you click Save Model, enter the reference for the first cell of a vertical range of empty cells in which you want to place the problem model. The number of cells used depends on the number of constraint ranges defined. The saved model is compatible with Solver, but has an additional cell that defines an equivalent objective function that Solver would need to use as a target.
  3. When you load a model, enter a reference for the first cell.

Control how Resolver finds solutions

  1. In the Resolver Parameters dialog, click Options.

The Option button provides access to the convergence control options available to Resolver. These options are designed to be consistent with Solver.

  • How to control solution time and iterations
    1. In the Max Time box, type the number of seconds that you want to allow for the solution time.
    2. In the Iterations box, type the maximum number of iterations that you want to allow.
  • How to control convergence and accuracy
    1. In the Precision box, type the degree of precision that you want, the smaller the number the higher the precision. The precision is a measure of the magnitude of deviations of constraints.
    2. In the Convergence box, type in the amount of relative change that you want to allow in the objective function (Gaussian) before Resolver stops with a solution. The smaller the number, the less relative change is allowed.
  • How to initialize estimates to improve convergence
    1. Select Initialize Values if you want Resolver to use the current values of the adjusted variables as a starting point. Otherwise it will use the measurements as the starting point wherever they are available.
  • How to add non-negativity constraints
    1. Select Assume Non Negative if you want Resolver to force all variables to be greater than zero. This is equivalent to adding additional inequality constraints, one for each adjusted variable.

Resolver Options

You can control advanced features of the solution process. Each option has a default setting that is appropriate for most problems.

Max time Limits the time taken by the solution process. While you can enter a value as high as 32,767, the default value of 10 (seconds) is adequate for most problems.

Iterations Limits the time taken by the solution process by limiting the number of interim calculations. While you can enter a value as high as 32,767, the default value of 30 is adequate for most problems.

Precision Controls the precision of solutions by using the number that you enter to determine whether a composite measure of the constraint residuals is small enough. Precision must be indicated by a fractional number between 0 (zero) and 1. Higher precision is indicated when the number that you enter has more decimal places — for example, 0.0001 is higher precision than 0.01.

Convergence When the relative change in the target cell value is less than the number in the Convergence box for the last five iterations, Resolver stops. Convergence applies only to nonlinear problems and must be indicated by a fractional number between 0 (zero) and 1. A smaller convergence is indicated when the number that you enter has more decimal places; for example, 0.0001 is less relative change than 0.01. The smaller the convergence value, the more time Resolver takes to reach a solution.

Assume Non-Negative Causes Resolver to assume a lower limit of 0 (zero) for all adjustable variables.

Interpreting the Results

Error Detection Diagnostics

The diagnostics to detect gross measurement and balance errors are just as (if not more) important than the reconciled results. Thus Resolver provides a comprehensive set of diagnostics as follows:

Each metric is reported in two forms: Measured (a priori), calculated from the measurements, and Reconciled (a posteriori), calculated from the reconciled results; in either case reconciled values are used where no measurement is available.

Global
  • Cost
    • Measured Cost: the value of the objective function at initialization.
    • Reconciled Cost: the value of the objective function at convergence, as calculated by Resolver.
  • Critical Value
    • Global Critical Value: the global gross error detection value as defined by Madron[3]. If the Reconciled Cost exceeds this value, then there is likely to be a gross error … somewhere.

Variables
  • Value
    • Measured Value: the value of the measurement, if any.
    • Reconciled Value: the value of the adjusted variable at convergence.
  • Tolerance
    • Measured Tolerance: the tolerance (95% confidence level) of the measurement.
    • Reconciled Tolerance: the tolerance of the result as determined by Resolver.
  • Test
    • Measured Test: the difference between the Measured Value and the Reconciled Value, normalized by the Measured Tolerance. This is used only as an indicator of the magnitude of the measurement adjustment.
    • Reconciled Test: the difference between the Measured Value and the Reconciled Value, normalized by the Reconciled Tolerance. This should not exceed the Measurement Critical Value unless there is a gross error.
  • Critical Value
    • Measurement Critical Value: the measurement gross error detection value as defined by Tamhane[4]. If a measurement error exceeds this value, then it is likely to be a gross error.

Balance
  • Residual
    • Measured Residual: the residual of the constraint calculated using the original Measured Values where available and the Reconciled Values where not.
    • Reconciled Residual: the residual of the constraint using the reconciled results. This should be zero or close to zero; closeness to zero is controlled by the Precision option.
  • Deviation
    • Measured Deviation: the expected variability of residual errors as defined by the measured tolerances.
    • Reconciled Deviation: the expected variability of residual errors as determined by Resolver.
  • Test
    • Measured Test: the Measured Residual normalized by the Measured Deviation. This is used only as an indicator of the magnitude of constraint adjustments.
    • Reconciled Test: the Measured Residual normalized by the Reconciled Deviation. This should not exceed the Constraint Critical Value unless there is a gross error.
  • Critical Value
    • Constraint Critical Value: the constraint or nodal gross error detection value as defined by Mah et al.[5]. If a constraint error exceeds this value, then it is likely to be a gross error caused either by an associated measurement error or by a missing variable from the constraint equation.
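Written out, using the definitions above, with measurement m_i, reconciled value x̂_i, measured and reconciled tolerances σ_i^meas and σ_i^rec, and constraint residual r_j with deviation s_j:

$$\text{Test}^{\text{meas}}_i=\frac{m_i-\hat{x}_i}{\sigma_i^{\text{meas}}},\qquad \text{Test}^{\text{rec}}_i=\frac{m_i-\hat{x}_i}{\sigma_i^{\text{rec}}},\qquad \text{Test}_j=\frac{r_j}{s_j}.$$

A test that exceeds its corresponding critical value flags a likely gross error.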

 

Reconciliation Report

The reconciliation report has five sections that provide both the results and the diagnostics.

  1. Problem
    1. Total Constraints: The total number of linear, nonlinear, equality, and inequality constraints specified in the model.
    2. Linear: Number of linear equality constraints.
    3. Nonlinear: Number of nonlinear equality constraints
    4. Active Linear: Number of linear inequality constraints set to active.
    5. Active Nonlinear: Number of nonlinear inequality constraints set to active.
    6. Variables: Number of adjustable (reconciled) variables.
    7. Fixed variables: Number of variables that are fixed by virtue of a 0.0 tolerance.
  2. Convergence
    1. Converged: True if converged within the specified precision and convergence.
    2. Termination: The condition that caused termination of convergence, whether or not successful.
    3. Time: Time in seconds to achieve convergence within the specified precision and convergence.
    4. Iterations: Number of iterations needed to achieve convergence.
    5. Precision: The precision at convergence.
    6. Convergence: The measure of convergence, the relative change of the cost, at convergence.
  3. Target
    1. Cell: The cell containing the Gaussian objective function that would be minimized by Solver
    2. Reconciled Cost: The value of the objective function at convergence, as calculated by Resolver
    3. Global Critical Value: the global gross error detection value as defined by Madron[3]. If the Cost exceeds this value, then there is likely to be a gross error … somewhere.
    4. Redundancy Degree: the number of degrees of freedom in the defined model.
    5. Measurement Critical Value: the measurement gross error detection value as defined by Tamhane[4]. If a measurement error exceeds this value, then it is likely to be a gross error.
    6. Constraint Critical Value: the constraint or nodal gross error detection value as defined by Mah et al.[5]. If a constraint error exceeds this value, then it is likely to be a gross error caused either by an associated measurement error or by a missing variable from the constraint equation.
  4. Adjusted Variables
    1. Cell: the cell address or name containing the array of solution, tolerance, measurement values.
    2. Reconciled Value: the values of the adjusted variable at convergence.
    3. Measured Value: the value of the measurement, if any.
    4. Solvability: the solvability of the variable:
      1. Unobservable: the variable is neither measured nor deducible from the model.
      2. Observable: the variable is unmeasured but is deducible from the model.
      3. Determined: the variable is measured but there is insufficient information in the model of other measurements to provide alternate estimates.
      4. Redundant: the variable is measured and there is information in the model and other measurements to provide alternate estimates.
      5. Fixed: the variable is effectively constrained by virtue of its 0 (zero) tolerance.
    5. Reconciled Test: the difference between the Measurement Value and Reconciled Value, normalized by the Reconciled Tolerance. This should not exceed the Measurement Critical Value unless there is a gross error.
    6. Measured Test: the difference between the Measurement and Reconciled Value normalized by the Measurement Tolerance. This is used only as an indicator of the magnitude of the measurement adjustment.
    7. Measured Tolerance: the tolerance (95% confidence level), of the measurement.
    8. Reconciled Tolerance: the tolerance of the result as determined by Resolver.
  5. Constraints
    1. Cell: a simplified expression of the two cells and equality that define the constraint.
    2. Active: an indicator of the state of the constraint. Inequality constraints are False until admitted to the active-set.
    3. Measured Residual: the residual of the constraint calculated using the original Measurement Value values where available and the Reconciled Value where not.
    4. Reconciled Residual: the residual of the constraint using the reconciled results. This should be zero or close to zero; closeness to zero is controlled by the Precision option.
    5. Measured Test: the Measured Residual normalized by the Measured Deviation. This is used only as an indicator of the magnitude of constraint adjustments.
    6. Reconciled Test: the Measured Residual normalized by the Reconciled Deviation. This should not exceed the Constraint Critical Value unless there is a gross error.
    7. Reconciled Deviation: the expected variability of residual errors as determined by Resolver.
    8. Formula: the constraint equation spelt out using the variable names, and including any dependent cells.

Resolver Frequently Asked Questions

Does Resolver always converge to the correct solution?

If the problem is linear, then convergence is guaranteed. In fact it will converge to the global minimum in a single iteration.

If the problem includes nonlinearities then convergence might take several iterations before it arrives at the minimum. There is no absolute guarantee that this is the global minimum, but for most data reconciliation-like problems it is invariably the correct solution. 

If the problem includes inequalities as well as nonlinearities then it is possible to define an infeasible problem.

Can Resolver access cells on other sheets?

No. All cells that are directly or indirectly referred to in the constraints equations need to be on the same worksheet.

What are the limits on the constraint equations?

The constraint equations need to involve only differentiable functions: +, -, /, *, ^, SUM(…), PRODUCT(…),… We are working on ways to handle the other, non-differentiable, functions.

Can I use my own user defined functions?

Not yet. Resolver needs to be able to differentiate the constraint, and this is not possible for user-defined functions.

Why do I need a license to run Resolver?

Resolver needs either a trial license valid for 20 days, or a full license usable for an unlimited duration but limited to a single machine. The license price is modest to encourage the use of process data reconciliation.

What algorithm and methods are used by Resolver?

The unique Resolver solution algorithm is best described as constraint sequential. It starts off with a solution surface based on the initial estimates and no constraints. It then takes one of the constraints and adjusts the estimates taking into account this individual constraint. This moves the solution to a new position, and the process is repeated for the next constraint. If the constraints are linear (equality or inequality), then each step is equivalent to ‘Newtonian’ convergence. The advantage of this algorithm is very high numerical stability, due in part to the UDUT covariance factorization, even for badly conditioned numerical problems. However, if the constraints are non-linear, a full ‘Newtonian’ step can be too severe, so the algorithm reduces the size of the step it takes towards the optimum. At the same time the direction of the step swings away from the ‘Newtonian’ direction towards the ‘steepest descent’ direction. As the step size reduces, the algorithm guarantees convergence, albeit more slowly.

One pass through all of the constraints is sufficient for linear-only constraints. If there are non-linear constraints then the solution is used as a new starting point, and the procedure repeated.
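
To illustrate the idea, here is a minimal sketch in Python of a ‘constraint sequential’ pass over linear equality constraints. This is my own illustration, not Resolver's implementation: it ignores the UDUT factorization and the non-linear step-size control, and simply applies each constraint in turn as a measurement-style update of the estimates and their covariance.

    import numpy as np

    def sequential_reconcile(x, P, constraints):
        # x: initial estimates; P: covariance of the estimates;
        # constraints: list of (a, b) meaning a @ x should equal b.
        for a, b in constraints:
            a = np.asarray(a, dtype=float)
            s = a @ P @ a                      # variance of the constraint residual
            k = (P @ a) / s                    # gain
            x = x + k * (b - a @ x)            # move the estimates onto this constraint
            P = P - np.outer(k, a @ P)         # reduce the uncertainty accordingly
        return x, P

    # Example: reconcile feed1 + feed2 = product with equal tolerances.
    x0 = np.array([100.5, 49.2, 151.0])
    P0 = np.diag([1.0, 1.0, 1.0])
    x, P = sequential_reconcile(x0, P0, [(np.array([1.0, 1.0, -1.0]), 0.0)])
    print(x)   # the balance now closes exactly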

Inequality constraints are handled by the use of the active-set methodology. At each stage the most ‘active’ inequality constraints are added until no more inequality constraints are in violation.

References

[1] “Multi-frequency Testing of Nonlinear Systems”; Lawrence, P (Oct 1980); PhD Thesis, University of Wales Institute of Science and Technology, Chap 7

[2] “Filtering and Error Analysis via the UDUT Covariance Factorization”; Thornton, C; Bierman, G (Oct 1978); IEEE Trans AC-23 Issue 5, pp 901-907

[3] “Statistical Analysis of Material Balance of a Chemical Reactor”; Madron, T., Veverka, V., and Venecek, V. (1977); The American Institute of Chemical Engineers Journal, 23, 482-486.

[4] “A Note on the Use of Residuals for Detecting an Outlier in Linear Regression”; Tamhane, A. C. (1982); Biometrika, 69, 488- 489.

[5] “Reconciliation and Rectification of Process Flow and Inventory Data”; Mah, R. S. H., Stanley, G. M., and Downing, D. M (1976); Industrial and Engineering and Chemical Process Design and Development, 15, 175-183.

[6] “Statistical Analysis of Material Balance of a Chemical Reactor”; Madron, T., Veverka, V., and Venecek, V. (1977); The American Institute of Chemical Engineers Journal, 23, 482-486.

[7] “A Note on the Use of Residuals for Detecting an Outlier in Linear Regression”; Tamhane, A. C. (1982); Biometrika, 69, 488- 489.

[8] “Reconciliation and Rectification of Process Flow and Inventory Data”; Mah, R. S. H., Stanley, G. M., and Downing, D. M (1976); Industrial and Engineering and Chemical Process Design and Development, 15, 175-183.


If you do not plan where you are going you will not get there, but you will probably get what you deserve. This software development model, which accurately predicts resources and schedule given scope, can greatly help in planning when you will get there and what will be needed on the way. This may not be a perfect model, but perfection is the enemy of the good. What software development model experience can you share?

In a previous posting Innovation != R&D$, I reasoned that the impact of R&D expenditure is not well correlated with gross operating margins because a significant proportion of R&D monies is absorbed by the less revenue generating activities of maintenance and support rather than innovation, by which I mean in this context the creation of new software applications. However it is still important to ensure that whatever monies are invested in innovation are wisely invested. Unfortunately our industry is beset with project delivery problems. We have earned the IT Rule-of-5: 5 times over-budget, 5 times schedule, 1/5th functionality.  Perhaps an exaggeration but there is a problem that few would deny.

One problem is that of creating unrealistic expectations. The development team’s answer to a feature request of ‘it is just a small matter of programming’ (one unit of SMOP) gets heard as ‘it will be on my desk tomorrow morning, tested and with a quality and quantity of documentation that would shame Charles Dickens’.  

The answer is a good looking model. I love models. Not the type you are thinking of, but simple mathematical formulae that allow me to estimate what will happen. Having been long involved with automation, MES, and software development, I always want to know how long a software development will take. ‘As long as a piece of string’ is not the most useful answer when customer expectations or development budgets need to be met. So over the years I have developed a model that estimates total development effort and tracks development progress over the life-cycle of the project with surprising accuracy. Before you say that most models have so many ‘tuning factors’ that you can of course make them always fit, I want to point out that this model has just one factor. Also I want to point out that most of this model originates with Capers-Jones’ seminal work ‘Applied Software Measurement’.

Estimating Model

The objective of the estimating model is to use a measure of the size of the development and come up with estimates for project duration, the number of project resources, total development effort and average productivity. From these estimates the project cost can be derived.

Tuning factor (J)

This is the only factor you need for this model. Fortunately Capers-Jones also provides a range of suggested values, as shown below. I would suggest 0.4 as a starting point.

Kind of software | Best in class | Average | Worst in class
Systems | 0.43 | 0.45 | 0.48
Business | 0.41 | 0.43 | 0.46
Shrink-wrap | 0.39 | 0.42 | 0.45

Estimated Scope (LOC)

Estimated size for the project (lines-of-code). OK, it can be difficult to come up with a really accurate estimate, but there is much written on the subject. Quick ways are to simply say that this application is approximately the same size as a similar one done in the past. For greater accuracy, Function Point or Story Point estimating can be used and then converted to lines-of-code.

Estimated Function Point (FP)

= Lines-of-Code (LOC) / LOC-per-FP, where LOC-per-FP is taken as 54 for languages such as C#

Estimated Project Duration (D)

= FP^J (months)

The ideal project duration given the ideal number of resources for the project.

Development Resources (R)

= FP^(2*J) / 27 (persons)

The ideal number of full-time project-persons to complete the project.

Total development effort (Effort)

= D * R (person-months)

Average Productivity (P)  

= LOC/Effort (lines-of-code per person-month)

Below is an example of applying this model to an example project:

Development Estimate Model
Factor | Units | Formula | Example Project
Tuning factor | dimensionless | J | 0.4
Lines of code | LOC | LOC | 75,000
LOC per FP | LOC/FP | LOC-per-FP | 54
Function points | FP | LOC / LOC-per-FP | 1,389
Estimated project duration | months | FP^J | 18
Development resources | persons | FP^(2*J)/27 | 12
Total development effort | person-months | FP^(3*J)/27 | 219
Average productivity | LOC/person-month | 27 * LOC-per-FP * FP^(1-3*J) | 343
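
As a quick check, the example column above can be reproduced with a few lines of Python (my own sketch of the formulas, using the same 75,000 LOC of C# and J = 0.4):

    J = 0.4                    # tuning factor
    loc = 75_000               # estimated scope, lines-of-code
    loc_per_fp = 54            # C# lines-of-code per function point

    fp = loc / loc_per_fp                  # function points              ~1389
    duration = fp ** J                     # months                       ~18
    resources = fp ** (2 * J) / 27         # persons                      ~12
    effort = duration * resources          # person-months (= FP^(3J)/27) ~219
    productivity = loc / effort            # LOC per person-month         ~343

    print(f"FP={fp:.0f}, D={duration:.0f} months, R={resources:.0f} persons, "
          f"Effort={effort:.0f} person-months, P={productivity:.0f} LOC/person-month")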

 

The graphs below show the project effort and productivity plotted for various sizes of project. As would be expected, we can see the productivity falling off as project size increases (see The Mythical Man-Month).

Tracking Model

The Tracking Model recognizes that the assumptions in the original estimate might not apply in practice. For example the number of resources assigned to the project might change, or the scope decreases or, more likely, increases. Thus the tracking model uses known measurements of the project such as code produced to date, resources actually assigned, and current total project size to calculate what the progress to date should be and to predict into the future the revised project completion.

Estimated Scope (LOC)

Estimated size for the project (lines-of-code). This is the estimate at the beginning of the project, because scope creep and scope additions will inevitably occur.

Scope Creep (Creep)

Percent change month-on-month of the project scope (%). This requires careful measurement because, as has been shown elsewhere, projects can only tolerate a small amount of scope creep before they become ‘runaways’.

Estimated Scope including creep (ELOC)

= LOC * (1+Creep) + Any additional scope     

Estimated Scope (EFP)

= ELOC / LOC-per-FP

Project Duration (D)                              

Duration of project, measured from the original project start, assuming optimal resource allocation, no scope creep, and sustained productivity

= EFP^J

Required Project Resources (R)

Resources that should be allocated to the project for the duration

= EFP^(2*J) / 27

Actual Project Resources (AR)

Actual person-months assigned to the project for the period.

Estimated productivity (P)

Estimated lines-of-code per assigned person-month. Note that productivity reduces as the assigned resources increase (see The Mythical Man-Month).

= (ELOC / EFP) * 27 * (27 * AR)^((1-3*J)/(2*J))

Estimated Production (PR)

Estimated lines-of-code produced in period based on assigned resources and estimated productivity.

= P * AR

Accumulated lines-of-code (ALOC)

Accumulated lines of code to-date based on assigned resources and estimated productivity.

Remaining Scope                                

Estimated scope less estimated accumulated code

= ELOC – ALOC

Accumulated Cost (AC)

Cost of actual resources assigned based on annual rate.

Cost per Line-of-code

A metric indicating how much each line-of-code is costing

= Accumulated Cost / Accumulated lines-of-code

Agile Productivity Ratio  

The promise of Agile/SCRUM is that productivity will increase as large teams are split into Scrum teams. As indicated before, productivity reduces as team size increases. This is the ratio of the productivity of a single team versus multiple Scrum teams with the same total resources.

Allowable scope change

Scope creep is a project killer. If each month the project scope is allowed to increase, then the project size will increase. If the project size increases, then duration and the number of project resources follow. As project resources increase, productivity will reduce, further increasing project duration. As project duration increases, more scope changes can accumulate, further increasing project duration, and so on. The tipping point at which the project becomes a runaway has been determined and is calculated as follows:

= 1 / EFP^J
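
The tracking model can be simulated month by month. The sketch below is my own illustration: the creep rate and the constant resource profile are made-up inputs, and the productivity formula follows the reconstruction given above (ELOC/EFP is simply LOC-per-FP).

    J = 0.4             # tuning factor
    LOC_PER_FP = 54     # C# lines-of-code per function point
    creep = 0.01        # assumed 1% scope creep per month
    resources = 12      # assumed actual person-months assigned each month

    eloc, aloc, month = 75_000.0, 0.0, 0          # scope and accumulated code
    while aloc < eloc and month < 60:
        month += 1
        eloc *= (1 + creep)                       # estimated scope including creep
        efp = eloc / LOC_PER_FP                   # scope in function points
        # productivity implied by the actually assigned resources (LOC/person-month)
        p = (eloc / efp) * 27 * (27 * resources) ** ((1 - 3 * J) / (2 * J))
        aloc += p * resources                     # estimated production this month

    print(f"completion in month {month}, "
          f"allowable monthly scope change {1 / (eloc / LOC_PER_FP) ** J:.1%}")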

The graphs below show the results from a typical project. We see the project scope increasing (scope-creep), resulting in more resources being pulled in to tackle the retreating deadline. Note that without scope creep the project would have completed in month 23 even with the same assigned resource profile. This signals the hidden dangers of even mild scope-creep.

The above graph is based on the actual resources assigned. The graph below compares the estimated with the actual production of code over the same period, confirming the accuracy of the model.

Observations

  • Although there is only one factor in this model, estimates are quite sensitive to the value chosen. Therefore it is best to track the estimates with actual measurements to ensure the accuracy of the model and hence the tuning factor.
  • Include test code or not? If the development teams are using test-driven development (TDD) or any form of Agile/SCRUM then I think it is important to include test code along with production code in the estimates, at the same time including the test resources with the overall project resources. Generally I expect to see test code lines-of-code to be approximately 50% of the production code.
  • Counting lines-of-code should use the same tool, such as Visual Studio, for consistency. There is much debate about what should be included and excluded. However I think it is more important to simply be consistent because you will end up deriving your own tuning factor based on your assumptions.
  • There is a difference between development productivity and the productivity of meeting customer expectations. Just because the lines-of-code have been efficiently produced, even error-free and with great documentation, it does not mean that customer expectations have been met, because they might have wanted an entirely different solution.
  • Scaling lines-of-code to function points. Despite the superiority of Function Points (or Story Points) over lines-of-code as a measure of software size, lines-of-code seems to be more tangible to management. Thus although the model is expressed in function points, I expect most will estimate lines-of-code or convert (‘back-cast’ to quote Capers-Jones) from lines-of-code to function points using a factor based on the type of programming language being used. For example there are approximately 54 C# lines-of-code per function point.

Associated Spreadsheet

To those who have got this far, I am sharing a spreadsheet version of this model that you are free to download and use for your own estimating. I hope it works as well for you as it has for me.

References

  1. Applied Software Measurement: Global Analysis of Productivity and Quality. Third Edition; Capers Jones; 2008
  2. The Mythical Man-Month: Essays on Software Engineering; Frederick Brooks;1995


Our desire to add features and functionality to applications in order to solve the ever-changing challenges facing Process Manufacturing has driven up complexity. There has been a lot of focus on application product and program management using such techniques as Lean, Agile, and SCRUM, justifiable given the failure rate of IT projects. However our thesis is that sustainability of an application over its 5-7 year deployment life deserves far greater attention; what is the advantage of having an optimal application that becomes shelf-ware after 18 months due to its user-complexity, when a sub-optimal application remains in use throughout its life? Such minimalism has been embraced so successfully at Apple (Steve Jobs and Minimalism).

IT Project Rule of Five

Some may have heard of the IT rule of 5: 5 times projected cost, 5 times projected schedule, 1/5th projected functionality! Not entirely unjustified when The Standish Group reported:

Specifically, 32 percent of IT projects were considered successful, having been completed on time, on budget and with the required features and functions. Nearly one-in-four (24 percent) IT projects were considered failures, having been cancelled before they were completed, or having been delivered but never used. The rest (44 percent) were considered challenged: they were finished late, over budget, or with fewer than the required features and functions.

Thus there is justifiable focus on the development and deployment phases of a Process Manufacturing application with some notable successes. For example we believe that adopting Lean Software Principles, using Agile development methodologies such as SCRUM, provides many advantages over the traditional waterfall approach, allowing usability problems to be detected early and avoiding excessive functionality that is not required by the end-user.

Projects are justified by the ongoing benefits to the organization. However we find that many successfully deployed applications are no longer in use a few years down the road. Ask yourself how many applications that you know have become shelf-ware, no longer used and fulfilling no useful purpose for the organization.

Given that most applications are justified using an assumption of 7-year use, an active life measured in months or a few years puts that application into the category of ‘challenged’ along with those that were late, over budget, or with fewer than the required features and functions.

Declining Application Benefits

What is happening to cause the ongoing benefits to decline? There are many reasons, some of which are as follows:

  • Degradation of quality of application configuration:
    • Applications are becoming increasingly complex so they depend on configuration that is an accurate reflection of the underlying process. As soon as the originally deployed configuration starts to err away from the actual changes that are occurring on the plant, the application becomes less useful and consequently less beneficial. An example might be an enterprise material balancing application whose model no longer reflects reality.
  • Loss of expertise, no business processes:
    • There is usually an internal application champion who sees the application through to successful deployment. This catches the eye of management so the champion is moved on to higher and better things. However without a champion succession plan, the application may lose its appeal, and usage will drop.
  • Failure to adapt to changing business processes:
    • Most applications are designed to reflect the business process, whether or not that is the ‘as-is’ or ‘to-be’. Unfortunately business processes are a moving target. New regulations might change the way you have to do things. Better operational procedures might demand a change in business processes.

Process Manufacturing Application Focus over the Years

Perhaps the problem of sustainability lies in the process manufacturing application focus. Once the problem was that of measurement data in silos, solved by the introduction of real-time historians. However that created the problems of data visibility, solved by the introduction of graphical user interfaces. This introduced data overload which was partially solved by the introduction of analytical tools to digest the information and produce diagnostics. Unfortunately these tools were difficult to deploy across all assets within an organization, so we have been trying to solve that problem with information models. The current problem is how to convert the diagnostics into actionable knowledge with the use of work-flow engines.

As we solve one problem, another emerges. Each layer of technology, however, increases the overall complexity making it more challenging to sustain any application.

Process Manufacturing Application Problems and Solutions over the Years

Era | Problem | Industry response | Consequence | Response now
1985-1995 | Measurement data in silos | Real-time databases collecting measurements (proprietary) | Data but no user access | Composable integration (Semantic Federation of external data)
1990-2000 | Data access and visualization | Graphical user interfaces, trending and reporting tools (proprietary) | Data overload | Composable UI (Portal)
1995-2005 | Analysis and business intelligence | Analytical tools to digest data into information and diagnostics | Deployability of analysis to all assets | Composable analysis (Complex Event Processing, Odata-Entity Descriptions)
2000-2010 | Contextualized information | Plant data models (ProdML, ISA-95, ISO15926, IEC 61970/61968, Proprietary) | Interpretation limited to experts | Composable model (Semantic model)
2005-2015 | Consistent actioning | ISO-9001 | Complexity, much more than RTDB, limiting sustainability | Composable workflow
2010- | Sustainability | Outsourcing, standards | Improved ongoing application benefits | Composable platform and data (PaaS/DaaS), composable applications

Changing Focus to Sustainability

So if we want to sustain an application’s benefits we need to start focusing on those aspects of the application’s technology, development, and deployment that are designed to improve sustainability. This may lead to less than optimal applications, but what would you rather have: an optimal application that yields $100/year for 2 years, or a sub-optimal application that yields $80/year for 7 years?

Sustainability of an application is a multifaceted problem. Below is a catalog of some of the problems, together with the proposed solutions.

All problems lead to sustainability

Problem | Proposed Solution | Issue
We cannot afford experts to create a rigid 5-year plan, nor will we follow it with the staff turnover we have. | Solutions must deliver incremental value, with additional solutions building upon the installed platform. | Agility
We are too small to have a large IT department. | Solutions cannot increase the size of the internal IT; outsource IT. | Affordability
We cannot afford to be good custodians of our data, nor want to manage its increasing complexity. | Outsource data management to the ‘cloud’: Data management as a Service, DaaS = Google Fusion/Dydra. | Capability
We do not have in-house experts to continually adapt to my changing business needs and changing regulations. | We must make the solutions easily adaptable, with built-in best practices (data, presentation, analysis, and work flows). | Adaptability
We are growing, and do not want the IT apps to impede that growth. | We must make the solutions self-configurable, or configurable with the minimum of skill. | Configurability
We do not have the in-house skills to deploy such technology. | We must make the solutions self-deploying, with built-in best practices. | Deployability
We will be employing Gen Y/Gen Z who are used to iPhone, Twitter, Facebook, and other social applications where they can change their own experience. | “A true web2.0 app gets better the more people use it…” Tim O’Reilly. | Sociability
I only want to pay for what I use, and want to avoid large capital investments. | PaaS, SaaS, DaaS: the cloud. | Affordability
I don’t want my people to have to become IT experts to use any application. | Applications should have the same intuitiveness of use as the iPad, Facebook, etc. | Simplicity
My business and how I run it is different than everyone else’s; this is my competitive advantage. | Composability of applications without constructing new, custom code. | Composability
I do not want to put all my bets on a single vendor because they cannot be as innovative as the ‘crowd’. | We should use standards and de-facto standards so that additional applications and utilities can be ‘crowd-sourced’. | Constructability
I want my best practices to be owned and managed by the business rather than individual experts within the company. | Analytics, diagnostics, work-flow automation, event-processing, inferencing, knowledge management. | Quality
Regulations demand that we demonstrate strong governance over information and decisions. | Audited data and transactions, change management procedures. | Auditability

Ensuring Solution/Application Sustainability

In the article ‘Solving or simply displacing problems: Application HAZOPS?’ the issue of problem displacement was discussed: do information technology ‘solutions’ create more problems than they solve, or do they simply displace problems to another group? HAZOPS have been used to systematically assess safety risks in processes and operations. Why not ‘HAZOP’ the supposed information technology solutions that we propose?

Thus given the potential problem of sustainability, we propose conducting a Solution Sustainability Assessment, based on HAZOP principles as shown below. This should be included within the solution evaluation along with conventional criteria such as project and deployment costs and schedules. 

Solution Sustainability Assessment (HAZOP) – Example

Solution Characteristic | Intention | Deviation | Causes | Consequences
Agility | Solution must deliver incremental value, with additional solutions building upon the installed platform. | Business requires extended functionality due to a change of business practice or regulation. | Solution does not offer the required extended functionality. | Add an independent solution offering the extra features and integrate together.
Affordability | Solution cannot increase the size of the internal IT. | Solution has increased internal IT support requirements beyond that which we can sustain. | Reduced IT budget; increased solution portfolio without increasing IT budget. | IT support deteriorates.
Capability | Solution cannot depend on establishing our own good custodianship of our data, nor manage its increasing complexity. | Increased complexity of custodianship of configuration data. | Effective use of the solution requires increased complexity in the custodianship of the configuration data. | Trust in solution results deteriorates.
Adaptability | Solutions easily adaptable with built-in best practices (data, presentation, analysis, and work flows). | Business requires adapted business practices encoded in the solution. | Business practices change in response to new competition and regulations. | Failure to capture benefits of improved business practices.
Configurability | Solutions should be self-configurable, or configurable with the minimum of skill. | User fails to configure the application to meet their changed needs. | User training insufficient to allow self-configuration; solution more complex than expected, inhibiting users from making their own changes. | Users rely on in-house experts or IT to make changes, increasing costs.
Deployability | Solutions should be self-deploying, with built-in best practices. | Upgrade of solution not self-deploying. | Solution is within a domain with rapidly changing regulations. | Solution upgrades delayed, so that users are forced to use an obsolete solution and outdated business practices.
Sociability | Value of solution should increase with more users (see Should Enterprise Applications be evolving to Socialized Enterprise?). | Addition of more users. | Addition of more users simply increases IT load with no improvement of business; users not taking advantage of the intra-solution features (notifications, document sharing, etc.) to improve their business practices. | Solution does not improve interactions between departments as planned.
Affordability | Should only pay for what I use, and want to avoid large capital investments. | Cost not proportional to users or usage. | Licensed by server or site, not by users or usage. | Reluctance to sustain license as cost per user is greater than anticipated.
Simplicity | Solution should not depend on my people becoming IT experts to use any application. | Needs a local expert for effective use. | Solution more complex or intricate than expected, or solution’s configuration became overly complex. | Hidden cost of sustainment.
Composability | Composability of applications without constructing new, custom code. | Need to improve efficiency in use. | User needs to create a new mash-up to avoid viewing separate reports and/or pages. | User’s efficiency degrades.
Constructability | Solution should use standards and de-facto standards so that additional applications and utilities can be ‘crowd-sourced’. | Need another utility. | Need another utility. | Expensive to add a new utility.
Quality | Solution should have built-in process quality management: analytics, diagnostics, work-flow automation, event-processing, inferencing, knowledge management. | Measurability of process quality. | Need to ensure that process quality adheres to Sarbanes-Oxley, ISO-9001, Lean, and other guidelines/regulations. | Lack of confidence in the quality of the processes that are automated by the solution.
Auditability | Audited data and transactions, change management procedures. | Some transactions now need to be audited. | Change of business practice now requires additional transactions to be audited. | Solution may need to be replaced if auditing is mandated.


Excel is the application platform of choice for many process manufacturing applications. Conversely, data reconciliation is not as widely used as it should be despite demonstrable benefits. Can we use Excel Solver to provide production-ready data reconciliation within Excel? This article describes an Excel template that uses Solver for data reconciliation. And the conclusions are …?

Why is data reconciliation not ubiquitous?

Excel is ubiquitous in process manufacturing as the end-user development and reporting platform of choice. It is used for data collection, end of shift reporting, accounting, modeling, scheduling, planning and (too?) much more, often frustrating the efforts of corporate IT who want more formal analysis tools and applications. Conversely over the last 25 years data reconciliation has not achieved the same ubiquitous status although it is an application that provides proven benefits of reduced accounting losses, better yield management, unreported material movement detection, and improved meter management amongst many other operational improvements. Why?

There are many aspects to the answer, but perhaps if data reconciliation were part of Excel rather than Excel just being a host for the raw data and reconciled results it could be used more easily in conjunction with the many data collection, reporting, accounting, modeling, scheduling and planning spreadsheet applications already in use.

The core computational engine of data reconciliation is the solver. Excel has an excellent built-in solver called, not surprisingly, Solver!  Since Solver uses the “Generalized Reduced Gradient (GRG2) nonlinear optimization code, which was developed by Leon Lasdon, University of Texas at Austin, and Alan Waren, Cleveland State University…” it should be well suited for data reconciliation. Is it?

The best way to answer this question is to set up Excel using Solver to reconcile a typical reconciliation problem. Below is a description of my use of Solver, which is contained in the attached Excel template, Reconciliation Example.xlsm.

Reconciliation model constraints

Although we use Excel to prepare reconciliation data or host the results of reconciliation, the reconciliation application will formulate its own model using its own syntax. I based this evaluation on this simple material balance:

By using Solver we can express the reconciliation model constraints as Excel formulae within cells using the familiar Excel syntax. As usual Excel will display the numeric result of these formulae as shown below (extracted from the attached Reconciliation Example.xlsm). I chose to organize the formulae to show the total feed and total product in adjacent cells, but other formulations are possible.

However Excel also allows us to ‘Show Formulae’, offering the best of both worlds. Since I chose to use named ranges, the constraint formulae are very readable. Not only that, we can use ‘Trace Precedents’, and ‘Trace Dependents’ to track the relationship between constraints, a level of user interaction that is difficult to duplicate in dedicated reconciliation applications.

Also, of course, there is nothing to limit you to linear reconciliation problems in Excel.

Reconciled variables

I organized the reconciled variables as named cells so that the constraint equations are more readable. These will be the variables adjusted by Solver. In principle these variables could be distributed throughout the spreadsheet. However I chose to group them with the tolerances and estimates (measurements) in adjacent cells:

 

Furthermore the named ranges in the Labeled Variable column were in fact ‘arrays’ pointing to the Solution, Tolerance, and Estimates;  this is a convenient way of grouping the solutions with the corresponding tolerances and estimates.

Reconciliation objective functions

Solver minimizes (or maximizes or drives to a value) a target cell which contains the value of the objective function. Most reconciliation systems use a straightforward sum-of-weighted-squares (Gaussian) formulation for this function, on the argument that the measurement errors are drawn from a Gaussian distribution. Aside from that, the Gaussian formulation is popular mainly because, in the case of linear constraints such as mass balances, the objective function can be solved without the need for an iterative search. However Solver uses a numeric search algorithm, so we are not restricted to the Gaussian formulation. For this reason I created 4 alternative objective functions as VBA functions that can be called from within Excel. This avoids having to duplicate complex cell formulae, and thus reduces the risk of those pesky cell formula mistakes! The objective functions I chose were as follows, all taking one or more ranges of variables as arguments; the Fair and ContaminatedGaussian functions have additional parameters.

These functions are described below:

Gaussian

The ‘normal’ reconciliation objective function

            e = (measurement-estimate)/tolerance

            Gaussian= minimize sum{ e^2 }

Fair

A ‘robust’ estimator which is a generalization of the maximum-likelihood estimator:

Fair = minimize sum{ c^2 ( abs(e)/c – log( 1 + abs(e)/c)) }

Where c is a tuning parameter with approximate values in the range 10~20

Lorentzian

This objective function tries to avoid weighting gross errors unduly so they do not distort the overall result

Lorentzian = maximize sum{ 1/(1 + e^2/2) }

ContaminatedGaussian or Tjoa-Biegler

This objective function attempts a balance between the probability of a gross error and of a Gaussian error; ‘nu’ is the probability of gross errors, and ‘b’ is the ratio defining the larger variance of gross errors with respect to the normal errors. Thus a value of b=10 means that the variance of gross errors is 100 (10×10) times that of the normal errors. A value of nu=0.05 means there is a 5% probability of a gross error.

ContaminatedGaussian = minimize sum{ – log( (1-nu) * exp(-e^2/2) + nu/b * exp(-e^2/(2*b)) ) + log(sqrt(2*pi) * tolerance) }
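
For reference, here is a minimal Python transcription of the four objective functions above (the template itself implements them as VBA functions; this sketch just restates the math):

    import math

    def errors(measurements, estimates, tolerances):
        return [(m - x) / t for m, x, t in zip(measurements, estimates, tolerances)]

    def gaussian(e):                                   # minimize
        return sum(ei ** 2 for ei in e)

    def fair(e, c=15.0):                               # minimize; c ~ 10-20
        return sum(c ** 2 * (abs(ei) / c - math.log(1 + abs(ei) / c)) for ei in e)

    def lorentzian(e):                                 # maximize
        return sum(1.0 / (1.0 + ei ** 2 / 2.0) for ei in e)

    def contaminated_gaussian(e, tolerances, nu=0.05, b=10.0):   # minimize
        return sum(-math.log((1 - nu) * math.exp(-ei ** 2 / 2)
                             + (nu / b) * math.exp(-ei ** 2 / (2 * b)))
                   + math.log(math.sqrt(2 * math.pi) * t)
                   for ei, t in zip(e, tolerances))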

ErrorCriticalLevel

A helper function that calculates critical error (global, constraint, or measurement) level given the degrees of freedom (number of constraints or measurements), and confidence level required, typically ~ 0.05, or 5%.

Using Solver

To solve, select Solver to display the dialog in which you can define the target cell containing the objective function, the range of cells containing the variables to be adjusted, and the range of cells containing the constraints. This setup can be conveniently saved into a range of cells for re-use.

Options allow various Solver properties to be tuned to optimize solving. However I found that the default settings of Solver would not converge to the actual solution and would drift off to an entirely different solution! I eventually found that the following parameters would usually converge to the expected solution irrespective of starting conditions, which I set to the measurements where available.
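
For comparison, the same Gaussian formulation can be reproduced outside Excel with a generic constrained optimizer. The sketch below is my own illustration (not part of the template), using scipy's SLSQP and made-up measurements for a simple feed1 + feed2 = product balance:

    import numpy as np
    from scipy.optimize import minimize

    measured = np.array([100.5, 49.2, 151.0])     # feed1, feed2, product (made-up)
    tolerance = np.array([2.0, 2.0, 2.0])

    def gaussian(x):
        e = (measured - x) / tolerance
        return float(np.sum(e ** 2))

    # equality constraint: feed1 + feed2 - product = 0
    cons = [{"type": "eq", "fun": lambda x: x[0] + x[1] - x[2]}]

    result = minimize(gaussian, x0=measured, method="SLSQP", constraints=cons)
    print(result.x)     # reconciled values that close the material balance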

Comments on Objective Functions

The objective of this evaluation was to assess the feasibility of using Solver for production data reconciliation, but since I tested different objective functions I want to share my observations, admittedly from a very small sample.

  1. Each objective function was given the same starting point (the measurements where available) and Solver options.
  2. I knew that the recycle was in gross error; it should have been 1505.
  3. Gaussian, Fair, and ContaminatedGaussian all produce approximately the same estimates with the Recycle at 1485.
  4. Lorentzian was far better at ignoring this gross error on the Recycle measurement, but is this what we wanted, given that it simply produced erroneous estimates for the unmeasured variables?
  5. All took about 25~75 iterations to converge which was not an issue given the size of this reconciliation but could be significant for larger problems.
  6. Given the similarity of results between the Gaussian, Fair, and ContaminatedGaussian objective functions, is it worth the complexity of moving away from the Gaussian?

Advantages of Solver-based Data Reconciliation

  • Very easy to prepare data within Excel and to post process the results for presentation.
  • Solver is not limited to linear-only constraints, thus component balance (bi-linear) reconciliation problems can be handled, although they were not tested here.
  • Solver is not limited to equality constraints, thus for example all estimates can be forced to be positive.

Limitations of Solver-based Data Reconciliation

  • The built-in Solver allows only a maximum of 200 variables, although one can upgrade to the full version.
  • Solver is a generic optimizer and is not tuned to the characteristics of a data reconciliation problem, such as linear or bi-linear constraints and quadratic cost functions.
  • It is easy to define a problem that becomes unstable. In the example, adding the redundant constraint of the ‘world’ causes solution instability.
  • The Solver was very sensitive to the estimates and derivative estimation methods when applied to the example problem and objective functions. Unfortunately it still indicated a converged solution even though the results were way off. In the example we know the ‘correct’ answer, allowing us to tune the solver until it gets the ‘right’ answer; in a real problem we do not.
  • Irrespective of objective function, Solver took 25~75 iterations to converge. This could be an issue for realistically sized (100~1000’s of variables) reconciliation problems.
  • There is no access to intermediate products such as the covariance matrix with which to perform more sophisticated gross error detection.

Recommendations

Despite loving the ability to define non-standard objective functions, and the use of Excel formulae and pre- and post-processing, the unreliability of convergence to the known ‘correct’ solution means that I could not recommend it for production use, especially when an ‘expert’ is not on hand to scrutinize the results.

However if we could use the Solver problem formulation within Excel but use a different optimization engine tuned for reconciliation then we would have the perfect production data reconciliation application… watch this blog!


It is known that software R&D expenditure positively impacts the gross operating margins or the market-to-book values of a company. However the correlation is not strong. Is it because maintenance and sustainment, as well as innovation, are lumped together as R&D, even though the return is not equal throughout the software product life-cycle? This article shows that there may be far less of the high-returning innovation development than management believes. In fact innovation is being starved out by the need to maintain and sustain the existing software product portfolio.

R&D = Innovation + Maintenance + Sustainment

It is generally agreed that R&D expenditure positively impacts gross operating margins or the market-to-book values of a company[1], but how strong a correlation is hotly debated[2]. What is more difficult is the measure of R&D expenditure productivity, or return on investment. For example, if the majority of R&D expenditure is going into existing product maintenance then it is unlikely to offer the same ROI as investing into new innovations.

In their study Booz Allen Hamilton presents their results exploring the smart spenders of R&D. They question the degree of correlation between R&D and company financial performance: “There are no significant statistical relationships between R&D spending and the primary measures of financial or corporate success: sales and earnings growth, gross and operating profitability, market capitalization growth, and total shareholder returns. Gross profits as a percentage of sales is the single performance variable with a statistical relationship to R&D spending.”

However Booz Allen Hamilton assumes that R&D is limited to ideation, project selection, product development and commercialization, as illustrated below.

In practice we know that the R&D expense continues throughout the innovated product life-cycle to include sustainment and maintenance, where

  • Sustainment is the addition of new features to an existing product to maintain or gain market share
  • Maintenance is to ensure quality and hence customer loyalty.

As important as sustainment and maintenance are, R&D investment into these later phases of a product’s lifecycle is never likely to offer the same returns as investment in innovation, the more disruptive the better[3].

Maintenance and Sustainment is an inevitable consequence of Innovation

To truly evaluate the return-on-investment of R&D expenditure we need to distinguish between the R&D investments at different phases of the product life cycle as they surely offer different ROI.

So what is the typical distribution of expenditure for software products? A simple model, verified by actual observations, reveals some surprises.

  • Sustainment (adding new features) cost is 8-15% per annum of the original development cost and accumulated sustainment investment to date in any supported code. This 8-15% creates additional code that needs to be sustained and maintained in the future. For example if the original development was $100,000, budget $8,000-$15,000 per annum to add new features requested by customers in order to sustain a competitive product. Note that this would lead to a doubling of the code base over the typical 7-year life-cycle of a product.
  • Maintenance (providing bug fixes) cost is 8-15% per annum of development investment to date in any maintained product code. For example if the original development was $100,000, budget $8,000-$15,000 per annum for maintenance in the first year but expect that to grow as sustainment investment increases the code base to be maintained.

So let us apply this model to a start-up company that has decided to invest $150,000 per annum to create their new product. For the first few years this works well, as the sustainment and maintenance costs are relatively minor. However after a few years the sustainment and maintenance costs starve out continued innovation until, by Year 7, there is scarcely any innovative development at all. Does this look familiar to you?
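
A small simulation makes the starvation effect concrete. The sketch below is my own illustration of the model above, assuming a fixed $150,000 budget and a 12.5% rate (within the 8-15% range) for both sustainment and maintenance:

    BUDGET = 150_000   # annual R&D budget ($)
    RATE = 0.125       # sustainment and maintenance, each, as a fraction of the supported code investment

    code_base = 0.0    # accumulated innovation + sustainment investment in supported code
    for year in range(1, 8):
        sustainment = RATE * code_base
        maintenance = RATE * code_base
        innovation = max(BUDGET - sustainment - maintenance, 0.0)
        code_base += sustainment + innovation      # new code that needs future support
        print(f"Year {year}: innovation ${innovation:9,.0f}, "
              f"sustainment ${sustainment:9,.0f}, maintenance ${maintenance:9,.0f}")

By Year 7 the sustainment and maintenance charges alone consume the whole budget, leaving essentially nothing for innovation.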


No wonder the ROI of R&D does not correlate well; it depends where a company is within this cycle. Code created for the initial release (innovation code) allows one to capture new markets or market share, which is surely more valuable than code added to supply additional features (sustainment code) to ensure that customers are satisfied, and that the product retains a competitive position, which is more valuable than code added to fix problems (maintenance code) to ensure quality and hence customer loyalty. Unfortunately the high-performing innovation investment reduces to less than 12% of R&D total. Over the life-cycle of a product, accumulated sustainment and maintenance can be 180-600% of the original innovation investment.

One could argue that a company which had a successful innovation would be growing, so its R&D budget would be growing proportionately. However, even if we modify our investment strategy and decide to maintain the innovation investment at a constant level, innovation as a percent of R&D would be reduced to 35% by Year 7 as shown below:

How to solve the Innovation Dilemma

Starvation of innovation is caused by the need to sustain and maintain existing code. Therefore, aside from making the innovation investment more productive, the following are suggested strategies:

Deprecate old products as soon as possible

Old products do not freely sit on the shelf. They are like volcanoes that have not erupted for some time. Are they extinct? Continuing to sell an old product can be less profitable than it appears, since every sale perpetuates its sustainment and maintenance. However many careers may be wedded to these old products, so they become very difficult to kill.

Minimize the code investment in the original product (Lean!)

If you own a larger house, then sustainment and maintenance inevitably cost more. Code is no different. Thus any opportunities to downsize that code yet still meet the functional requirements will reduce the long-term sustainment and maintenance costs, releasing more R&D spend for future innovations.

  • Code that is created quickly and efficiently allows the product to be released to the market earlier, ensuring an increased internal rate of return or net present value of the investment.
  • Code that adds functionality to solve customer problems is likely to be more valuable than core component code that could be purchased from OEMs.

Capitalize code investment

This will probably make accountants pay attention, but expensing code as it is created disguises the fact that it really behaves like a capital investment; it will need sustainment and maintenance investment over the years to retain its value.

[1] Hall, Bronwyn, Jacques Mairesse & Pierre Mohnen, 2010, Measuring the Returns to R&D, in: Hall, B. and Rosenberg, N, Handbook of the Economics of Innovation, Elsevier, Amsterdam, pp. 1034-1076

[2] Booz-Allen-Hamilton. (2006), Smart Spenders: The Global Innovation 1000

[3] The Innovator’s Dilemma.  Clayton M. Christensen Cambridge, Massachusetts: Harvard Business School Press, 1997


Icebergs of information loiter throughout process manufacturing IT waiting to sink any information integration project. The impact of semantic technologies is being felt in medicine, life sciences, intelligence, and elsewhere but can it solve this problem in process manufacturing? The ability to federate information from multiple data-sources into a schema-less structure, and then deliver that federated information in any format and in accordance with any standard schema uniquely positions semantic technology. Is this a sweet spot for semantic technologies?

Process Manufacturing Application Focus over the Years

Over the years we have been solving problems within process manufacturing IT only to uncover more problems. Once the problem was that of measurement data in silos, which was solved by the introduction of real-time data historians. However that created the problem of data visibility, solved by the introduction of graphical user interfaces. This introduced data overload, which was partially solved by the introduction of analytical tools to digest the information and produce diagnostics. Unfortunately these tools were difficult to deploy across all assets within an organization, so we have been trying to solve that problem with information models. The current problem is how to convert the diagnostics into actionable knowledge with the use of work-flow engines, and how to ensure the sustainability of applications as solutions increase in complexity.

Process Manufacturing Application Problems and Solutions over the Years

Era | Problem | Industry response | Consequence
1985-1995 | Measurement data in silos | Real-time databases collecting measurements (proprietary) | Data but no user access
1990-2000 | Data access and visualization | Graphical user interfaces, trending and reporting tools (proprietary) | Data overload
1995-2005 | Analysis and business intelligence | Analytical tools to digest data into information and diagnostics | Deployability of analysis to all assets
2000-2010 | Contextualized information | Plant data models (ProdML, ISA-95, ISO15926, IEC 61970/61968, Proprietary) | Interpretation limited to experts
2005-2015 | Consistent actioning | ISO-9001 | Complexity, much more than RTDB, limiting sustainability
2010- | Sustainability | Outsourcing, standards | Improved ongoing application benefits

 

However it is not only the increased technological complexity that is causing problems. Business decisions now cross many more business boundaries. When measurement data was trapped in silos we were content with unit-wide or plant-wide data historians. Now a well performance problem might involve a maintenance engineer located in Houston accessing a Mimosa[1]-based maintenance management system, an operations engineer located in Aberdeen accessing an OPC-UA[2]-based data historian, a production engineer located in London accessing a custom system driven by WITSML[3]-based feeds, and a facilities engineer using an ISO-15926[4] facilities management model. Not only are the participants in different locations and business units, but they also rely on different systems using different models to support their decision making. However they all should be talking about the same well, measured by the same instruments, producing the same flows, and processed by the same equipment.

The problem is that these operational support systems are not simply data silos whose homogeneous data we need to merge into one to answer our questions. In fact these operational support systems are icebergs of information. Above the surface they publish a public perspective focused on the core operational function of the application. However this data needs context, so below the surface is much of the same information that is contained in other systems. This information provides the context to the operational data so that the operational system can perform its required functions. For example the historian needs to know something about the instruments that are the source of its measurements; maintenance management systems need to know not only about the equipment to be maintained but the location of that equipment, physically and organizationally.

Figure 1: Icebergs of Information

Icebergs of information are not limited to the operational data stores deployed in organizations. An essential practice in these days of interoperability requirements is the adoption of model standards. However even these exhibit the same problems, as shown by the diagram below, which maps the available standards to their focus within the hydrocarbon supply chain.

Figure 2: Multiple Overlapping Model Standards

Increasing regulatory and competitive demands on the business are forcing decision making to be more timely, and to be more integrated across the traditional business boundaries. However these icebergs are getting in the way of effective decision making.

One way to make any or all of this information available to consumers is to create the bigger iceberg. ‘Simply’ create the relational database schema that covers every past, current, and future business need, and build adapters to populate this database from the operational data stores. Unfortunately this mega-store can only get more complex as it has to keep up with an expanding scope of information required to support the decision making processes.

Figure 3: Integration using the Bigger Iceberg

Alternatively we can keep building data-marts every time someone has a different business query.  However these do not provide the timeliness required to support operational decision making.

The Need for a Babel-Fish

We cannot meet the decision-making needs of the business with one mega-store because it will never keep up with the changing business requirements. Instead we need a babel-fish (with thanks to The Hitchhiker’s Guide to the Galaxy).

This babel-fish can consume all of the different operational data in different standards, and translate them into any standard that the end-consumer wants. Thus the babel-fish will need to know that OPC UA’s concept ‘hasInstrument’ has the same meaning as Mimosa’s concept of ‘Instrumented’. Similarly 10FIC107 from an OPCUA provider is the same as 10-FIC-1-7 from Mimosa.

  1. Information providers (operational data stores) within the business will want to provide information according to their capabilities, but preferably using the standards appropriate for their application. For example measurements should be OPC UA, maintenance should use Mimosa
  2. Information consumers will want to consume information in the form of one or more standards appropriate for their application.

Figure 4: Integration babel-fish

The Semantic/RDF model comes to the rescue

First of all, a definition: a semantic model means organizing all data and knowledge as RDF triples {subject, property, object}. Thus {:Peter, :hasAge, 21^^:years} and {:Pump101, :manufacturedBy, :Rotek} are examples of RDF triples. RDF triples can be persisted in a variety of ways: SQL tables, custom organizations, NoSQL, XML files, and many more. If we were designing a relational database to hold these RDF triples we would only need one ‘table’, so it may appear that we have no schema in the relational-database design sense, where key relationships enforce integrity and unique indices enforce uniqueness. However we can add other statements about the data, such as {:Pump101, :type, :ReciprocatingPump} and {:ReciprocatingPump, :subClassOf, :Pump}[5]. Used in combination with a reasoner, these asserted facts let us infer consequences such as :Pump101 is a type of :Pump, and Peter is not a :Pump, despite rumors to the contrary. These triples can be visualized as the links in a graph, with the subject and object being the nodes of the graph and the property the name of the edge linking these nodes:

Figure 5: RDF Triples as a graph
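
As a concrete illustration (my own sketch, not from the original text), the example triples can be written down directly with a library such as rdflib; the example.org namespace is hypothetical:

    from rdflib import Graph, Namespace, Literal, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.Peter, EX.hasAge, Literal(21)))
    g.add((EX.Pump101, EX.manufacturedBy, EX.Rotek))
    g.add((EX.Pump101, RDF.type, EX.ReciprocatingPump))
    g.add((EX.ReciprocatingPump, RDFS.subClassOf, EX.Pump))

    # every statement, data or model, is just another edge in the graph
    for s, p, o in g:
        print(s, p, o)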

Over the years, new modeling metaphors have been introduced to solve perceived or actual problems with their predecessors. For example the Relational Model had perceived difficulties associated with reporting, model complexity, flexibility, and data distribution. A semantic model helps solve these problems.

Figure 6: Evolution of Model Metaphors

  • In response to the perceived reporting issues, OLAP techniques were introduced along with the data warehouse. This greatly eased the problem of user-reporting, and data mining. However it did introduce the problem of data duplication.
    • A semantic model can query against a federated model in which information is distributed throughout the original data sources.
  • In response to the perceived complexity issues, various forms of object-orientated modeling were introduced. There is no doubt that it is easier to think of one’s problem in terms of an object model rather than a complex relational or ER model, especially when there are a large number of entities and relations.
    • The semantic model is built around the very simple concept of statements of facts such as {:Peter, :hasAge, 21^^:years}, and {:Pump101, :manufacturedBy, :Rotek} combined with statements that describe the model such as {:Pump101, :type, :ReciprocatingPump} and {:ReciprocatingPump, :subClassOf, :Pump}.
  • The model flexibility problem occurs when, after the model has been designed, the business needs the model to change. In response to this flexibility issue, the choice is to make the original model anticipate all potential uses but then risk complexity, or use an object-relational approach in which it is possible to add new attributes without changing the underlying storage schema.
    • In semantic models these relationships are expressed in triples, using RDFS, SKOS, OWL, etc. Thus RDF is also used as the physical model (in RDF stores, at least).
  • There have been various responses to data distribution.
    • In the relational world there is not much choice other than to replicate the data from heterogeneous data stores using Extract-Transform-Load (ETL) techniques. In the case of homogenous but distributed databases distributed queries are possible, although it does require intimate knowledge of all the schemas in all of the distributed databases.
    • In the object-orientated world we are in a worse situation: it is very difficult to manage a distributed object in which different objects are distributed or attributes are distributed.

The good news is that a semantic approach is the ideal (or even the only) approach that can solve the information integration problem as follows:

  1. Convert to RDF normal form: Convert all source data into RDF. The data can be left at source and fetched on demand (federated) or moved into temporary RDF storage
    • There are already standard ways of doing this for any spreadsheet, relational database, XML schema, and more. For example, TopBraid Suite (http://www.topquadrant.com/products/TB_Suite.html) provides converters and adaptors for all common data sources. It is relatively easy to create more mappings such as OPCUA. The dynamic adapters act as SPARQLEndpoints[6].
  2. Federated data model: Create ‘rules’ that map one vocabulary to another.
    • The language of these rules would be RDFS, SKOS, and OWL. For example, you can declare {OPCUA:hasInstrument, owl:sameAs, Mimosa:Instrumented}. Note that these are simply additional statements expressed in RDF, which are then used by a reasoner to infer consequences such as that :FI101 is actually the same as :10FIC101.
    • More sophisticated rules can also be created using RDF and SPARQL directly; a small example is sketched after this list. For some examples, see SPIN or SPARQL Rules at http://spinrdf.org/ and http://www.w3.org/Submission/2011/SUBM-spin-overview-20110222/
  3. Chameleon data services: Create consumer queries, written in SPARQL, that extract the information from the combined model in whatever standard form the consumer requires.
    • For example, even though all instrument data is in OPC-UA, a consumer could use a MIMOSA interface to fetch this data. The results can then be published as web-services for consumption by external applications using SPARQLMotion (http://www.topquadrant.com/products/SPARQLMotion.html).
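As a minimal sketch of such a SPARQL-based rule (the property names follow the earlier examples and are illustrative), a SPIN-style rule can be written as a CONSTRUCT query whose inferred triples are added back into the federated graph:

    PREFIX :    <http://example.org/plant#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    # Rule: any price recorded against one identifier also applies to
    # anything declared to be the same individual.
    CONSTRUCT { ?other :has_price ?price }
    WHERE {
      ?thing :has_price ?price .
      ?thing owl:sameAs  ?other .
    }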

Figure 7: Federation End-to-End

Let’s look into these steps in detail:

Convert to RDF Normal Form

Despite the fact that data will be stored in different formats (relational, XML, object, Excel, etc) according to different schemas, it can always be converted into RDF triples. Always is a strong word, but it really does work. There are already ways of doing this for any spreadsheet, relational database, XML schema, and more, and it is relatively easy to create further mappings, such as one for OPC-UA. The data can be left at source and fetched on demand (federated) or moved into temporary RDF storage. For example, TopBraid Suite (http://www.topquadrant.com/products/TB_Suite.html) provides converters and adaptors for all common data sources.

Figure 8: Conversion to RDF Normal Form
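Where the data is left at source, each converter or adaptor behaves as a SPARQL endpoint, and a SPARQL 1.1 federated query can fetch the triples on demand. The sketch below assumes a hypothetical endpoint URL wrapping the relational equipment database:

    PREFIX : <http://example.org/plant#>

    SELECT ?pump ?manufacturer
    WHERE {
      # Delegate this pattern to the (hypothetical) endpoint that wraps
      # the relational equipment database, fetching the data on demand.
      SERVICE <http://example.org/sparql/equipment> {
        ?pump a :Pump ;
              :manufacturedBy ?manufacturer .
      }
    }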

Federated Data Model

A federated data model allows different graphs (aka databases) to be aggregated by linking the shared objects. This applies to real-time measurements (OPC-UA), maintenance (MIMOSA), production data (ProdML), or any external database. We can visualize this as combining the graphs of the individual operational data stores into a single graph.

Of course there will be vocabulary differences between the different data-sources. For example, in the OPC-UA data-source you might have a property OPCUA:hasInstrument, and in a MIMOSA data-source the equivalent is called Mimosa:Instrumented. So the federated data model incorporates ‘rules’ that map one vocabulary to another. The language of these rules would be RDFS, SKOS, and OWL. For example, in OWL, you can declare {OPCUA:hasInstrument, owl:sameAs, Mimosa:Instrumented}. Note that these are simply additional statements expressed as RDF triples, which are then used by a reasoner to infer consequences such as that :FI101 is actually the same as :10FIC101.

There will also be identity differences between the different data-sources. These can also be handled by additional statements, such as {:TANK#102, owl:sameAs, :TK102 }. This allows a reasoner to infer that the statement {:TK102, :has_price, 83^^:$} also applies to :TANK#102, implying {:TANK#102, :has_price, 83^^:$}.

Figure 9: Information Federated from Multiple Data Sources
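A hedged sketch of how those mapping statements might look follows. The OPCUA: and Mimosa: namespaces are placeholders for whatever the real adapters expose; ‘#’ is not legal in a prefixed name, so :TANK_102 stands in for :TANK#102; and owl:equivalentProperty is shown because it is the usual way to relate two properties, while owl:sameAs relates two individuals:

    PREFIX owl:    <http://www.w3.org/2002/07/owl#>
    PREFIX OPCUA:  <http://example.org/opcua#>
    PREFIX Mimosa: <http://example.org/mimosa#>
    PREFIX :       <http://example.org/plant#>

    INSERT DATA {
      # Vocabulary mapping: the two sources use different property names
      OPCUA:hasInstrument owl:equivalentProperty Mimosa:Instrumented .

      # Identity mapping: the same tank has a different identifier in each source
      :TANK_102 owl:sameAs :TK102 .
    }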

Chameleon Data Services

To extract information from the federated graph, the best choice is SPARQL, the semantic equivalent of SQL, only simpler. Whilst SQL allows one to query the contents of multiple tables within a database, SPARQL matches patterns within the graph. With SQL we need to know in which table each field resides. With SPARQL we define the graph pattern that we want to match, and the query engine will search throughout the federated graphs to find the matches. In the example illustrated below we do not need to know that the price attribute comes from one data source, whilst the volume comes from another. In fact SPARQL allows even further flexibility: the price attribute for Tank#101 could come from a different data source than the price attribute for Tank#102. This is part of the magic of semantic technology.

  

Figure 10: Graph Pattern matching with SPARQL
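A query corresponding to that graph pattern might look like the following sketch; :has_price follows the earlier example, while :Tank and :has_volume are invented for the illustration. Note that nothing in the query says which data source supplies which triple:

    PREFIX : <http://example.org/plant#>

    SELECT ?tank ?price ?volume
    WHERE {
      ?tank a           :Tank ;     # match any tank in the federated graph
            :has_price  ?price ;    # the price may come from one source...
            :has_volume ?volume .   # ...and the volume from another
    }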

SPARQL can be used to directly query the federated graph for reporting purposes; however, most consumers of the information will expect to interface to a web-service, with SOAP or REST being the most popular. These services do not have to be programmed. Instead they can be declared using SPARQLMotion (http://www.sparqlmotion.org) to produce easily consumed and adaptable web-services. The designer for SPARQLMotion is shown below:

Figure 11: Example SPARQLMotion

Semantic/RDF Advantages for Process Manufacturing

Despite solving a complex data-integration problem, the Semantic/RDF approach is inherently simple. Can there be anything simpler than storing all knowledge as RDF triples? Despite this simplicity, we do not lose any expressivity.

There is no predefined schema to limit flexibility. However the schema rules, encoded as tables and keys in the relational model, can still be expressed using RDFS, OWL, and SKOS statements.

Deconstructing all information into statements (triples) allows data from distributed sources to be easily merged into a single graph.

Any information model can be reconstructed from the merged graph using SPARQL and presented as web-services (SOAP or REST).

References

[1] MIMOSA is a not-for-profit trade association dedicated to developing and encouraging the adoption of open information standards for Operations and Maintenance in manufacturing, fleet, and facility environments. MIMOSA’s open standards enable collaborative asset lifecycle management in both commercial and military applications.

[2] OPC Unified Architecture (OPC UA) is the next-generation OPC standard that provides a cohesive, secure, and reliable cross-platform framework for access to real-time and historical data and events.

[3] WITSML™ (Wellsite Information Transfer Standard Markup Language) is an industry initiative to provide open, non-proprietary, standard interfaces for technology and software that monitor and manage wells, completions and workovers.

[4] ISO 15926 provides integration of life-cycle data for process plants, including oil and gas production facilities.

[5] I should really be using URIs instead of text labels for subjects, properties, and objects, but the intent of the semantic model is conveyed more simply if we avoid identifiers like ‘http://www.example.org/equipment#Pump101’ and use :Pump101.

[6] SPARQL is a query language for RDF. A SPARQL endpoint is a protocol service that makes it possible to query a data source using SPARQL. The source itself does not need to be in RDF; it can, for example, be a traditional relational database. Later in this article we describe SPARQL in more detail and show some query examples.


Henry Ford offered one-size-fits-all, and he wanted to own the entire supply chain. The world has since been turned on its head: every user now wants to personalize everything, including their workspace. How does this new world order fit with process manufacturing applications?

Not so long ago, process manufacturing applications (real-time databases (RTDB), manufacturing execution systems (MES), manufacturing operations management systems (MOM), and enterprise resource planning systems (ERP)) were being developed by an enterprise’s own IT department. The first systems to be truly productized, in the 1990s, were the real-time databases, thanks to OSI, and enterprise resource planning, thanks to SAP.

At that time, in-house computing consisted mainly of big-iron mainframes supporting financial, HR, and similar corporate systems. Networks were limited to supporting remote terminal access. Mini-computers (a ridiculous label given today’s sizes, but they were truly transformational in their day) such as DEC’s PDP series were the best choice for departmental computing initiatives, since corporate IT was not that interested in real-time data, recipe management, blend planning, and all the other detailed work necessary to operate a process plant. The mini-computer was a great choice for many reasons: it could be purchased under the radar of IT; it was a local device and so had no dependency on the immature networks; and it was close to the individuals who understood the actual business problems it was used to solve.

Whether it was the success of these departmental applications, the concern of corporate IT over the increasing expenditure on these departmental initiatives, or growing legislative and governance requirements, these applications were gradually pulled under the control of corporate IT. This was not entirely a bad thing, because it could be argued that the focus should be on operating the plant, not on creating and operating applications and their associated infrastructure. However, local ownership had its advantages as well: users felt a sense of ownership of, and hence responsibility for the quality of, the information; missing applications could be quickly created with Excel spreadsheets or Access databases, even though this contributed to spreadsheet-hell; and locally-created applications could be quickly adapted to changing user requirements. In summary, this was a period when users had a considerable amount of influence over their computing environment. However, this influence has gradually declined over the last 20 years.

At the same time as we have seen a decline in the influence an enterprise application user has over the choice and configuration of applications, we have seen the reverse trend in a user’s personal computing environment. Users can configure their own portals using iGoogle; they can set up their own meeting groups with Facebook; they can browse for information using Bing and Google; and they can assess, locate, and install new applications by browsing app stores.

The challenge facing enterprise IT is to provide a framework within which end-users can create their own environment via a configurable portal, and even their own applications via configurable information, displays, analysis, applications, and work-flows. However, LinkedIn, Facebook, and Google have far more developer resources than enterprise IT can dream of. So instead of trying to emulate these social applications, why not embrace them to provide the capabilities that enterprise end-users need? As well as relieving enterprise IT of the development cost burden, this immediately provides access to those utilities (gadgets for iGoogle, web-parts for SharePoint) with which a user can fashion their own experience.

The shift can be summarized component by component, contrasting the old model (Enterprise IT) with the new model (the Socialized Enterprise):

  • Platform
    • Old model (Enterprise IT): sourced by corporate IT architects, who define and implement the platform upon which corporate applications will run.
    • New model (Socialized Enterprise): cloud-sourcing; a cloud-based PaaS platform reduces or eliminates the need for local platforms. Example: use Google App Engine.
  • Framework
    • Old model (Enterprise IT): sourced by corporate IT, which defines development standards for in-house developed or purchased applications.
    • New model (Socialized Enterprise): developer-sourcing; cloud-based frameworks (data, workflow, BI, portals, etc) and SaaS designed for end-user configuration. Example: Cordys or RunMyProcess, cloud-hosted.
  • Applications
    • Old model (Enterprise IT): sourced by corporate developers or application vendors, who buy or build applications that meet user requirements as determined by requirements analysis.
    • New model (Socialized Enterprise): user-sourcing; user-configured applications that adapt to changing business needs without involving corporate resources: agile. Example: a user composes iGoogle with gadgets from the marketplace.
  • Utilities
    • Old model (Enterprise IT): sourced by users, who locate utilities (Excel, mini-apps, etc) that fill in the functionality gaps and can be deployed ‘under-the-radar’.
    • New model (Socialized Enterprise): crowd-sourcing; user-acquired utilities from the open market that conform to the cloud framework. Example: iGoogle gadgets that solve specific user problems.

 

Perhaps we are just turning the clock back to when the user within the process manufacturing plant had considerable control over their computing environment.