The Good, the Bad, and the Standard


Standards undeniably have their use in the computer industry, and for good reason: the life of a programmer would be entirely different if computer languages like Java, C++, SQL, or Cobol were not based on standards that vendors (more or less) conform to. We've had our fair share of blunders in the standards world though, and some standardisation efforts are doomed from the beginning. Recently I noticed that the OGC is seeking comments on a candidate GeoSPARQL standard. Since we (Open Sahara) have our own geospatial SPARQL implementation in the IndexingSail, I could not refuse that request. But first things first. Let's learn from the successes and failures of some of the more prominent computer-related standards I've worked with.

C++ and Template Meta Programming

My professional programming days started with the C++ language. It was definitely not the first, or the most elegant, object-oriented computer language. But unlike others, it managed to attract a following in the business world. The standardisation of C++ was not without glitches though. Not only did it take years to get a ratified C++ standard; it took many more years for compiler vendors to be able to implement it. Being able to write C++ code that actually compiled on two different compilers was an engineering skill in those days. The complexity of the compile-time polymorphism features in the C++ programming language not only resulted in non-conformant compilers; it also led to uses that I think none of its inventors had envisioned, now labelled Template Meta Programming (TMP).

It is generally accepted that TMP was an unforeseen by-product of the template system in the C++ standard that just happened to be Turing-complete (thus capable of computing anything that is theoretically computable). According to Wikipedia, Erwin Unruh was the first one to demonstrate this. He wrote a program that computed prime numbers although it did not actually finish compiling: the list of prime numbers was part of an error message generated by the compiler on attempting to compile the code.

The Legacy called Enterprise Java Beans

In the Java world, J2EE's EJB standard is probably the most notable example of a debacle. EJBs were promoted to solve two conflicting issues at the same time: the logical design principle of building a program from smaller components, and the performance optimisation that distributes processing across different computers. Since an optimal distributed architecture can rarely be achieved by splitting a system along a priori functional design components, this was doomed to fail.

And fail it did. This created opportunities for a few alternative technologies: the Spring Dependency Injection framework, and Hibernate. The success of Hibernate eventually led to the Java Persistence API (JPA) standard, which replaces the persistence technology (entity beans) from the EJB standard. The Spring framework positioned itself smartly as a lightweight alternative to the bloated J2EE standards, and managed to become a de facto standard for Dependency Injection. A recent trend, though, might be that software engineers are looking for alternative Dependency Injection frameworks that are less dependent on XML.

XML? Yes, XML, a prime example of a standard that was hyped so much that it is used for just about anything, despite being verbose, hard to read, and frankly quite inelegant. Remember Jelly, the programming language that uses XML as its syntax? XML has its uses though, and Open Sahara uses XML (via Spring) to provide powerful configuration options.

That Sun hasn't really learned from the EJB debacle is apparent from the overly complex JavaServer Faces: another standard from Sun that, according to many Java programmers, should have stayed under the Moon.

GeoSPARQL

But enough of the standards-committee bashing. I wrote this blog post because there is a new shining star on the standards horizon: GeoSPARQL. And since we have some experience with designing geospatial SPARQL features ourselves (see: Geospatial Search for RDF Data), we think we are in a position to discuss some of its aspects.

The previous examples have illustrated some reasons why a standard may fail:

  • Too much complexity.
  • Not enough experience with or demand for the kind of technology that is going to be standardised.
  • Trying to solve too many problems at the same time.
  • Promoting its use for purposes that are better implemented with alternative technologies.

SPARQL is a Graph-Based Query Language

SPARQL is a graph-based query language that works on RDF, which is basically an elegant, standardised way of persisting data structured as graphs. Since geometries are graphs in themselves, it seems only natural to use the graph-based features of SPARQL and RDF when designing something called GeoSPARQL. Right? No! Well, actually it may have been a valid approach, but I am really glad the GeoSPARQL proposal does not take that route. Although it may be a very elegant way to store and query geometries, it is also very complex to index data that is stored like that and to build performant query engines on top of it. Instead, the GeoSPARQL committee decided to store and index geometries as literals (so each geometry is one value, field, property or column), making it possible to build upon well-established and well-researched technology that is also used in, for example, the Simple Features standard for SQL.

The Neo4j community (Neo4j is an open-source graph database) is actually experimenting with both approaches for storing geometry information. It will be interesting to see if and how they manage to do efficient and scalable query processing for geometries that are stored as linked nodes and edges in the database.

With the store-geometries-as-literals approach, the committee has achieved two things: it reduces the complexity of the technology that implements the standard to something manageable, and it makes it possible to base the standard on well-understood, comparable technologies that are used in the relational database world.

On a side note: Open Sahara has made the same design choice as the GeoSPARQL committee, and stores geometries as literals in Well-Known Text (WKT) format (a few other formats are also supported).

A Standard Should Not Force Redundancy

Although in my opinion it's a well-structured and actually quite usable proposal, there are a few things I don't like about GeoSPARQL. One of those issues is best illustrated with some example data from the proposal document itself:

<sfg:Polygon rdf:about="http://somewhere/ApplicationSchema#CExactGeom">
  <geo:asWKT rdf:datatype="http://www.opengis.net/def/dataType/OGC-SF/1.0/WKTLiteral">
     <![CDATA[
       <http://www.opengis.net/def/crs/OGC/1.3/CRS84>
       Polygon((-83.2 34.3, -83.0 34.3, -83.0 34.5,
                -83.2 34.5, -83.2 34.3))
     ]]>
  </geo:asWKT>
</sfg:Polygon>

In addition to a geometry literal, the standard prescribes the use of an RDFS class Geometry (with subclasses such as Polygon in this example), and RDFS classes for Features (things that have a geometry). The geometry literals are supposed to be wrapped inside instances of a Geometry class. Hence, users of the standard are forced to declare the type of their geometry twice: once in the literal, and once in the instance declaration that wraps the literal.

Redundancy has its place in the computer industry, but only to achieve performance gains in the technical implementation of software. In my opinion it's a very bad idea to force redundancy at the user level through a standard. It also introduces complexities that are not addressed in the standard. What if a user erroneously wraps a Point literal in an instance of a Polygon RDFS class?
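The mismatch scenario can be made concrete with a small check. This is only a sketch and not part of GeoSPARQL or the IndexingSail: the helper names `wkt_geometry_type` and `is_consistent` are hypothetical, and it assumes that a literal consists of an optional CRS URI in angle brackets followed by WKT text, as in the example data above.

```python
import re

# Optional "<crs-uri>" prefix, then the WKT geometry keyword (e.g. Polygon).
_WKT_HEAD = re.compile(r"^\s*(?:<[^>]*>\s*)?([A-Za-z]+)")


def wkt_geometry_type(literal: str) -> str:
    """Return the upper-cased geometry keyword found in a WKT literal."""
    match = _WKT_HEAD.match(literal)
    if match is None:
        raise ValueError(f"not a WKT literal: {literal!r}")
    return match.group(1).upper()


def is_consistent(declared_class: str, literal: str) -> bool:
    """Compare the local name of the wrapping class with the literal's type.

    declared_class may be a prefixed name ("sfg:Polygon") or a full URI;
    only the part after the last '#', '/' or ':' is compared.
    """
    local_name = re.split(r"[#/:]", declared_class)[-1]
    return local_name.upper() == wkt_geometry_type(literal)


# A Polygon class wrapping a Polygon literal is fine ...
print(is_consistent("sfg:Polygon", "Polygon((0 0, 1 0, 1 1, 0 0))"))
# ... but nothing stops a user from wrapping a Point literal in it.
print(is_consistent("sfg:Polygon", "Point(-83.1 34.4)"))
```

Note that the type can be read from the literal alone, so the class declaration adds no information here; it only adds a second fact that can contradict the first.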

The reason for this redundancy seems to be that the GeoSPARQL committee wants to add reasoning capabilities as functional properties to geometries, and this is not possible in RDFS without having a Geometry class in addition to a geometry literal. They could have achieved the same goal differently though. And they certainly did not have to duplicate the hierarchy of geometry types between literals and RDFS classes. A single Geometry class that offers a functional property for the type of geometry in the wrapped literal would have achieved the same reasoning capabilities, without forcing users to store redundant facts in an RDF database.

The Missing Layer

The proposal follows a layered design, and implementers can choose to implement only some of the layers. Much to my regret, the base or core layer is defined to be the RDFS class hierarchy for Features and Geometries, and not the geometry literal and the spatial functions that are designed to work on that literal. The Feature and Geometry RDFS classes certainly have no purpose without a way to store the actual geometry. The opposite is certainly not true: our Open Sahara IndexingSail demonstrates that storing and efficiently querying geometry data without a prescribed RDFS ontology is possible.
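To illustrate that a literal plus a spatial function is already useful on its own, here is a minimal sketch of a bounding-box query over WKT point literals that are attached directly to ordinary subjects, with no Feature or Geometry class involved. It does not show how the IndexingSail works: the names `TRIPLES`, `parse_point` and `within_box`, the `ex:` identifiers and the sample data are all made up for illustration, and a linear scan stands in for a real spatial index.

```python
import re

# Optional "<crs-uri>" prefix, then POINT(x y) with plain decimal numbers.
_POINT = re.compile(
    r"^\s*(?:<[^>]*>\s*)?POINT\s*\(\s*"
    r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)

# Hypothetical data: (subject, predicate, WKT literal), no ontology required.
TRIPLES = [
    ("ex:cafe", "ex:location", "POINT(-83.1 34.4)"),
    ("ex:depot", "ex:location", "POINT(10.0 50.0)"),
    ("ex:park", "ex:outline", "POLYGON((0 0, 1 0, 1 1, 0 0))"),
]


def parse_point(literal):
    """Return (x, y) for a WKT point literal, or None for anything else."""
    match = _POINT.match(literal)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def within_box(triples, min_x, min_y, max_x, max_y):
    """Return the subjects whose point literal lies inside the box."""
    hits = []
    for subject, _predicate, literal in triples:
        point = parse_point(literal)
        if point is None:
            continue  # not a point; a real engine would handle all types
        x, y = point
        if min_x <= x <= max_x and min_y <= y <= max_y:
            hits.append(subject)
    return hits


print(within_box(TRIPLES, -83.2, 34.3, -83.0, 34.5))
```

The query needs only the literal and a function that understands it, which is the part of the proposal that I would have liked to see labelled as the core.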

While I can see that such an ontology may be beneficial, and that it can be useful to define reasoning capabilities on top of it, it should not have been defined as the core layer of the standard. The core layer should be what is currently described as a Serialization Extension. This extension basically describes how to store geometries in literals and what functions can be used on them. There are various reasons to define the RDFS ontology as an extension, and the literal serialisation as core, instead of the other way around:

  • It's just not very useful to have an RDFS class Geometry without a means to store the geometry itself. So even without labelling it Core, there exists no useful implementation of the standard that doesn't also offer one of the literal Serialization components.
  • Implementors are required to support RDFS reasoning, while this just doesn't scale well with current implementations (although the standard is quite vague about how complete the reasoning should be, and only has explicit requirements for rdfs:subClassOf transitivity).
  • The forced use of an ontology is not a good idea. Not every potential use of a Geometry maps well to what GeoSPARQL defines as a Feature. Prescribing its use is like saying that the xsd:dateTime literal type can only be used inside classes of type DateBearingFeature. Not a very useful thing to say.

Conclusion

The proposed OGC GeoSPARQL standard has managed to steer clear of some of the pitfalls that have plagued many computer-related standards. This was achieved by building upon well-established and well-researched ideas from spatial applications for relational databases. If it is accepted as a standard, Open Sahara will probably try to incorporate at least some of its ideas in the IndexingSail. Nonetheless, the standard regretfully fails to acknowledge that the most basic use of a geometry is just that: the geometry itself. I hope the proposal will be revised to be useful also for end users who don't require or even want to use prescribed ontologies. If not, the only thing left to say is: Redundancy is redundant.

Comments

Frans Knibbe

Hi Gerjon,

I just read your official comments on the GeoSPARQL proposal. They made a lot of sense. Some of your comments address things that I also wondered about, but you were able to describe the weaknesses much more clearly, and make good suggestions for change too.

I am glad you took the trouble to contribute to getting a good standard for this subject, which I think has a huge potential.

Frans

Open Sahara is an initiative of Talking Trends and the University of Amsterdam