A large community exists “that sees Linked Data, let alone the full Semantic Web, as an unnecessarily complicated technology” (Phil Archer). However, early adopters are known more for zealotry than pragmatism: the mainframe-to-mini, mini-to-PC, and PC-to-web transitions, among many others, are examples where the zealots threw the baby out with the bathwater.
Rather than abandoning the Semantic Web, perhaps we should take a pragmatic view: identify its strengths, and do some honest soul-searching about its weaknesses:
SPARQL is unlikely ever to be end-user friendly:
However, one can say exactly the same of SQL (which, believe it or not, used to be called the User-Friendly Interface, or UFI for short). SQL is now hidden behind much more developer-friendly facades, such as Hibernate/JPA, C# with LINQ, OData, and many more, so that developers can deliver user-friendly experiences. Even so, SQL remains the language of choice for manipulating and querying an RDBMS.
So we still need SPARQL for manipulating semantic data, but we also need developer- and user-friendly facades that make semantic information more accessible. Initiatives include JPA for RDF (Empire), LINQtoRDF, and OData-SPARQL, but the effort is fragmented, with few standardization initiatives.
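To make the facade idea concrete, here is a minimal sketch in Python using rdflib (the FOAF data file and the PersonRepository class are purely illustrative and not taken from any of the initiatives above): the raw SPARQL stays inside a small repository-style wrapper, much as Hibernate/JPA hides SQL.

```python
# Minimal sketch: hide raw SPARQL behind a developer-friendly facade.
# Assumes Python with rdflib and an illustrative FOAF dataset ("people.ttl").
from rdflib import Graph

RAW_QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
    ?person a foaf:Person ;
            foaf:name ?name .
    OPTIONAL { ?person foaf:mbox ?email }
}
"""

class PersonRepository:
    """Facade: callers ask for people; SPARQL remains an implementation detail."""

    def __init__(self, source):
        self.graph = Graph()
        self.graph.parse(source)  # e.g. a local Turtle file or a URL

    def find_people(self):
        for row in self.graph.query(RAW_QUERY):
            yield {"name": str(row.name),
                   "email": str(row.email) if row.email else None}

# Usage: repo = PersonRepository("people.ttl"); print(list(repo.find_people()))
```

The point is not this particular wrapper but the separation: end users and application developers see friendly objects or REST/OData resources, while SPARQL does the heavy lifting underneath.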
RDF/OWL is unlikely to supplant entity-relational or JSON-object databases:
However, we have all learnt the lesson that the data model is often at the core of any application, and that any change to that data model can be costly, especially in the later stages of the development lifecycle. Who has not created an object-attribute (entity-attribute-value) model deployed in an RDBMS to retain design flexibility without having to change the underlying database schema? In doing so, are we not reinventing the semantic data model?
So the principles and flexibility of semantic data modeling (RDF, RDFS, and OWL) could be adopted as a modeling paradigm, the next step in an evolution that started with Entity-Relationship (ER) modeling, went through Object-Role Modeling (ORM, also known as NIAM), and led to Semantic Data Modeling (SDM). Used in combination with dynamic REST/JSON, such as OData, one could truly respond to Forrester's Dynamic Business Applications Imperative.
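As a toy illustration of that flexibility, here is a minimal sketch in Python with rdflib (the example.org vocabulary is made up): the “schema” is itself just triples, so a new attribute added late in the lifecycle requires no migration.

```python
# Minimal sketch: in RDF/RDFS the schema is data, so adding an attribute
# means adding triples rather than running an ALTER TABLE / schema migration.
# The example.org vocabulary below is illustrative only.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/schema#")
DATA = Namespace("http://example.org/data#")
g = Graph()

# "Schema" expressed as triples: a class and a property.
g.add((EX.Customer, RDF.type, RDFS.Class))
g.add((EX.name, RDFS.domain, EX.Customer))

# An instance of that class.
g.add((DATA.alice, RDF.type, EX.Customer))
g.add((DATA.alice, EX.name, Literal("Alice")))

# Later in the lifecycle a new attribute is needed: just add more triples.
g.add((EX.loyaltyTier, RDFS.domain, EX.Customer))
g.add((DATA.alice, EX.loyaltyTier, Literal("gold")))

print(len(g))  # 6 triples; no schema change was required
```

This is exactly the flexibility the object-attribute workaround tries to recover inside an RDBMS, but here it is native to the data model.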
Triple stores are unlikely to supplant Big Data:
However, of volume, velocity, and variety, many big data solutions are weakest when handling variety, especially when the variety of the data changes over time, as it does with evolving user needs. Most of the time, variety at the data source is dealt with by Extract-Transform-Load (ETL) into a big data store. Is this any different from data warehousing, which has matured over the last 20 years, except for the use of different storage technology? Like any goods in transit, data gets damaged and contaminated when it is moved. It is far better to use the data in situ if at all possible, but this has been the unachievable Holy Grail of data integration for many years.
So the normalization of any and all data into triples might not be the best way to store data, but it can be the way to mediate variable information from a variety of data sources: Ontology-Based Data Access (OBDA). The ontology acts as a semantic layer between the user and the data, and its semantics are used to enrich the information in the sources and/or cope with incomplete information in them.
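A hand-rolled sketch of that mediation idea, in Python with the standard-library sqlite3 module (this is not how a real OBDA engine such as Ontop works; the ex: terms, table, and mappings are illustrative): ontology-level questions are rewritten into queries against the source, so the data never leaves it.

```python
# Minimal OBDA-style sketch: ontology terms are mapped to SQL over the source,
# so data stays in situ and is only *viewed* through the semantic layer.
# All names (ex: IRIs, table, columns, mappings) are illustrative.
import sqlite3

# Mapping: ontology class/property -> SQL that yields subjects (and values).
MAPPINGS = {
    "ex:Customer": "SELECT 'ex:cust/' || id AS subject FROM customers",
    "ex:hasEmail": "SELECT 'ex:cust/' || id AS subject, email AS value FROM customers",
}

def instances_of(conn, class_iri):
    """Answer an ontology-level question by rewriting it against the source."""
    return [row[0] for row in conn.execute(MAPPINGS[class_iri])]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'alice@example.org')")

print(instances_of(conn, "ex:Customer"))  # ['ex:cust/1']
```

A real OBDA system adds query rewriting over the ontology's axioms (so that, for example, asking for instances of a superclass also returns instances of its subclasses), but the architectural point is the same: the triples are virtual and the source data stays where it is.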
In summary …
Semantic technologies will see more success if they pursue the more pragmatic approach of solving those problems that are not satisfactorily solved elsewhere, such as querying and reasoning, dynamic data models, and data-source mediation, instead of re-solving the already solved.
References
Phil Archer: http://semanticweb.com/tag/phil-archer
LINQtoRDF: https://code.google.com/p/linqtordf/
JPA for RDF (Empire): https://github.com/mhgrove/Empire
OData-SPARQL: https://github.com/brightstardb/odata-sparql
The Dynamic Business Applications Imperative (Forrester): http://www.forrester.com/The+Dynamic+Business+Applications+Imperative/fulltext/-/E-RES41397?objectid=RES41397
Ontology-Based Data Access (OBDA): http://obda.inf.unibz.it/
Sean Martin says:
At the risk of coming across as one of those baby-out-with-the-bathwater zealots you mention, I thought it might be useful for readers of this interesting post to hear about some more recent innovations in the field of semantic technologies that are beginning to upend widely held assumptions.
While SPARQL can never itself be user friendly, it differs in at least one important regard from its progenitor SQL in that it is both unambiguous and accompanied by a decent logical modeling standard, namely OWL. By unambiguous, I mean that SPARQL systems tend to operate at the logical level, particularly when the predicates are all drawn from an OWL specification of some domain. SQL systems are hamstrung by being anchored to the physical artifacts that hold an RDBMS together (foreign keys, indexes, join tables, etc.), all of which add friction and complexity in practice. The applied effect of this is that it is possible to automatically generate far more complex, many-join queries in OWL-model-guided SPARQL than in SQL, even manually. BI tools based on SPARQL are dramatically better at giving relatively unsophisticated users direct interactive access to relationship- and entity-laden rich data (including even sub-class relationships), and thus more accurate digital representations of some underlying reality, without their having to learn any query language at all. With the SPARQL and OWL combination it is practical both to make far more accurate data models and to make the data in them entirely self-service accessible with decent tooling. Here is a real-life example of this at work, and we do the same trick to enable OData access.
Given the above, it is unclear to me why RDF/OWL systems, if they do not supplant JSON-object/RDBMS databases, will not at least find their place beside them. And yes, it definitely is possible to use OWL as a specification standard that can also be operationalized. Increasingly, industry-wide bodies are adopting it as the modeling standard underlying their own domain model standards, for example FIBO in financial services, CDISC in clinical trial reporting, and HL7 FHIR in health care. These industry domain models can be operationalized either using triple stores or through translation. For example, we are able to generate highly scalable Apache Spark ETL/ELT jobs from source-target mappings that use OWL as a business-understandable, enterprise-level canonical model in a “virtual hub and spoke” style architecture, for organizations looking to implement a semantic layer to guide their data movement. The Spark jobs generate optimized point-to-point data movement instructions and never materialize data as triples unless that is the intended ETL target.
It is certainly true that most big data solutions are weaker when handling variety of data. It is very interesting to see the current crop of Hadoop-family analytics solutions requiring increasing amounts of de-normalization to achieve acceptable performance as they take over tasks previously reserved for relational data warehouses. For the most part they don’t seem to JOIN very well yet, and they suffer from nearly all the same practical data modeling and query writing problems that they inherited by adopting SQL and other RDBMS attributes. However, I would take issue with the statement that triple stores are unlikely to supplant Big Data. They are perfectly capable of being big data handling tools themselves when the right minds and experience are applied. Interactive GOLAP (graph online analytical processing) for line-of-business analytics (and not just some graph analytics niche) at massive scale is now a reality. Again, my example is drawn from near to home. The Anzo GQE is a SPARQL 1.1 engine that can load a trillion triples into a cluster in under 30 minutes directly from cheap object stores like S3, GCS, and HDFS, and perform sophisticated interactive analytics on them. Unlike big data analytics and reporting tools like Spark SQL, Presto, Impala, Drill, etc., model richness is never an issue, and the many-way join performance is off the charts by comparison. GQE is a third-generation, massively parallel, all-in-memory data warehouse assembled by the same folks who delivered Netezza and ParAccel/Redshift. The latest system is built on graph and semantic standards, allowing operations against data entirely in the logical sphere, and like other big data tools it can be erected in minutes through cloud automation.
While I certainly agree that semantics used pragmatically will find a home, given my own experiences I am not ready to call it quits on where those places should be just yet. Just because things take a long time to mature is not a good reason to count them out. I do not feel that the traditional (is big data considered traditional already?) tools have adequately solved the many end-to-end data analytics workflow challenges. Semantic technologies can help across the board and already offer far better solutions than those currently accepted as merely adequate or a compromise.