Unconstrained JSON-LD Performance Is Bad for API Specs

I am still arguing fiercely with some of my enterprise architect friends about whether we should use JSON or JSON-LD to define our APIs. I did some research this morning that I think is broadly applicable, so I figured I would share it widely.

You might want to read as background the following 2014 blog post from Manu Sporny, who is one of the architects of JSON-LD:

http://manu.sporny.org/2014/json-ld-origins-2/

Here are a few quotes:

I’ve heard many people say that JSON-LD is primarily about the Semantic Web, but I disagree, it’s not about that at all. JSON-LD was created for Web Developers that are working with data that is important to other people and must interoperate across the Web. The Semantic Web was near the bottom of my list of “things to care about” when working on JSON-LD, and anyone that tells you otherwise is wrong. :)

TL;DR: The desire for better Web APIs is what motivated the creation of JSON-LD, not the Semantic Web. If you want to make the Semantic Web a reality, stop making the case for it and spend your time doing something more useful, like actually making machines smarter or helping people publish data in a way that’s useful to them.

In the vein of Manu's TL;DR above, I will add my own TL;DR for this post:

TL;DR: Using unconstrained JSON-LD to define an API is a colossal mistake.

There is a lot to like about JSON-LD – I am glad it exists. For example, JSON-LD is far better than XML with namespaces, better than XML Schema, and better than WSDL. And JSON-LD is quite suitable for long-lived documents that will be statically stored and have data models that slowly evolve over time, where any processing and parsing is done in batch mode (perhaps like the content behind Google's PageRank algorithm).

But JSON-LD is really bad for APIs that need sub-millisecond response times at scale. Please stop your enterprise architects from making this mistake just so they gain “cool points” at the enterprise architect retreats.

Update: I removed swear words from this post 4-Apr-2016 and added the word "unconstrained" in several places to be clearer. Also I made a sweet web site to show what I mean by "unconstrained JSON-LD" – I called it the JSON-LD API Failground.

Update II: Some real JSON-LD experts (Dave Longley and Manu Sporny) did their own performance tests that provide a lot more detail and better analysis than my own simplistic analysis. Here is a link to their JSON-LD Best Practice: Context Caching – they make the same points as I do but with more precision and detail.

Testing JSON-LD Performance

This is a very simple test simulating the parsing of a JSON-only document versus a JSON-LD document. The code is super-simple. Since JSON-LD requires the document to first be parsed as JSON and then augmented by the JSON-LD processor, running an A/B performance test is easy: we simply turn the additional required JSON-LD step on and off and time it.

This code uses the JSON-LD PHP library from Manu Sporny at:

https://github.com/digitalbazaar/php-json-ld

I use the sample JSON-LD Product document from the JSON-LD Playground at:

http://json-ld.org/playground/

Methodology of the code – it is quite simple:

    require_once "jsonld.php";

    // Read the sample Product document once; parse it 1000 times
    $x = file_get_contents('product.json');
    $result = array();
    for ($i = 0; $i < 1000; $i++) {
        // Step 1: the plain JSON parse (required in both test runs)
        $y = json_decode($x);
        // Step 2: the JSON-LD compaction step (comment out for the JSON-only run)
        $y = jsonld_compact($y, "http://schema.org/");
        // Keep every result so both runs hold 1000 parsed documents in memory
        $result[] = $y;
    }

To run the JSON-only version, simply comment out the `jsonld_compact` call. We reuse the `$y` variable to make sure we don't store any data twice, and we accumulate the 1000 parsed results in an array to get a sense of whether there is a difference in memory size between the parsed JSON and JSON-LD.

I used `/usr/bin/time` on my MacBook Pro 15 with PHP 5.5 to run the test.

Output of the test runs

si-csev15-mbp:php-json-ld-test-02 csev$ /usr/bin/time -l php j-test.php
            0.09 real         0.08 user         0.00 sys
      17723392  maximum resident set size
             0  average shared memory size
             0  average unshared data size
             0  average unshared stack size
          4442  page reclaims
             0  page faults
             0  swaps
             0  block input operations
             6  block output operations
             0  messages sent
             0  messages received
             0  signals received
             0  voluntary context switches
             6  involuntary context switches
si-csev15-mbp:php-json-ld-test-02 csev$ /usr/bin/time -l php jl-test.php
          167.58 real         4.94 user         0.51 sys
      17534976  maximum resident set size
             0  average shared memory size
             0  average unshared data size
             0  average unshared stack size
          4428  page reclaims
             0  page faults
             0  swaps
             0  block input operations
             0  block output operations
         14953  messages sent
         24221  messages received
             0  signals received
          2998  voluntary context switches
          6048  involuntary context switches

Results by the numbers

Memory usage is equivalent - actually slightly lower for the JSON-LD - that is kind of impressive and probably leads to a small net benefit for long-lived document-style data. Supporting multiple equivalent serialized forms may save space at the cost of processing.

Real time for the JSON-LD parsing is nearly 2000X more costly than JSON (167.58 seconds versus 0.09 seconds, roughly 1860X) - more than three orders of magnitude. [*]

CPU time for the JSON-LD parsing is about 70X more costly (roughly 5.45 seconds of user+sys time versus 0.08 seconds) - almost two orders of magnitude. [*]

[*] Some notes for the "Fans of JSON-LD"

To stave off the obvious objections that will arise from the Enterprise-Architect crowd eager to rationalize JSON-LD at any cost, I will simply put the most obvious reactions to these results here in the document:

  1. Of course the extra order of magnitude increase in real time is due to the many repeated re-retrievals of the context documents. JSON-LD evangelists will talk about "caching" (a sketch of what they mean follows this list) - but this is an irrelevant argument, because virtually all shared-hosting PHP servers do not allow caching across requests, so at least in PHP the "caching fixes this" argument is useless. Any normal PHP application in a real production environment will be forced to re-retrieve and re-parse the context documents on every request/response cycle.
  2. The two orders of magnitude increase in CPU time is harder to explain away. The evangelists will claim that a caching solution could cache the post-parsed versions of the documents - but given that the original document is one JSON document and there are five context documents, the additional parsing from string to JSON would only explain a 5X increase in CPU time, not a 70X increase. My expectation is that even with cached pre-parsed context documents, the remaining order of magnitude is due to the need to loop through the structures over and over to detect the many levels of *potential* indirection between prefixes, contexts, and possible aliases for prefixes or aliases.
  3. A third argument about the CPU time might be that `json_decode` is written in C inside PHP while `jsonld_compact` is written in PHP - and that if `jsonld_compact` were rewritten in C, merged into the PHP core, and all of the hosting providers around the world upgraded to PHP 12.0, then perhaps the negative performance impact of JSON-LD would be somewhat lessened. "When pigs fly."
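
For what it is worth, here is roughly what the "caching" counter-argument from point 1 looks like in code - a minimal sketch, assuming the `jsonld_set_document_loader()` / `jsonld_default_document_loader()` hooks from the php-json-ld library I tested (treat the exact names as an assumption). Note that on a typical shared PHP host this cache still evaporates at the end of every request, which is exactly my point.

    require_once "jsonld.php";

    // Hypothetical per-process cache of remote @context documents
    $context_cache = array();

    jsonld_set_document_loader(function ($url) use (&$context_cache) {
        // Fetch each context document at most once per PHP process
        if (!isset($context_cache[$url])) {
            $context_cache[$url] = jsonld_default_document_loader($url);
        }
        return $context_cache[$url];
    });

    // jsonld_compact() now reuses the cached contexts instead of re-fetching them
    $y = jsonld_compact(json_decode(file_get_contents('product.json')),
                        "http://schema.org/");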

Conclusion

Unconstrained JSON-LD should never be used for non-trivial APIs - period. Its out-of-the-box performance is abhorrent.

Some of the major performance failure could be explained away if we could magically improve hosting plans and build the most magical of JSON-LD implementations - but even then, there is over an order of magnitude more cost to parse JSON-LD than to parse JSON, because of the requirement to transform an infinite number of equivalent forms into a single canonical form.
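
To make "equivalent forms" concrete, here is a minimal sketch using hypothetical data, and assuming `jsonld_compact()` will take an inline context object just as it takes a context URL: the same fact can arrive spelled with a short term or with a full IRI, and a compliant processor has to walk the whole document and collapse both into the same compacted shape before an application can reliably look anything up.

    require_once "jsonld.php";

    // The context the API would legislate (hypothetical, one term)
    $ctx = json_decode('{"name": "http://schema.org/name"}');

    // Form A: uses the short term the context defines
    $a = json_decode('{"@context": {"name": "http://schema.org/name"},
                       "name": "Raspberry Pi"}');

    // Form B: the same data with the full IRI spelled out instead
    $b = json_decode('{"http://schema.org/name": "Raspberry Pi"}');

    // Both must compact to the same document - and discovering that means
    // resolving every key against the context(s) on every parse.
    print json_encode(jsonld_compact($a, $ctx)) . "\n";
    print json_encode(jsonld_compact($b, $ctx)) . "\n";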

Ultimately it means that if a large-scale operator started using JSON-LD-based APIs heavily to enable a distributed LMS - to the point where the core servers spend more time servicing standards-based API calls than generating UI markup - it will require somewhere between 10 and 100 times more compute power to support JSON-LD than to simply support JSON.

Frankly, in the educational technology field, if you want to plant a poison pill in the next generation of digital learning systems, I cannot think of a better one than building interoperability standards with JSON-LD as the foundation.

I invite anyone to blow a hole in my logic - the source code is here:

https://github.com/csev/json-ld-fail/blob/master/README.md

A Possible Solution

The only way to responsibly use JSON-LD in an API specification is to have a canonical serialized JSON form that is *the* required specification - it can also be valid JSON-LD, but it must be possible to deterministically parse the API material using only JSON, ignoring the @context completely. If there is more than one @context because of extensions, then the prefixes used to represent the contexts other than the @vocab must also be legislated, so that once again a predictable JSON-only parse of the document, without looking at the contexts, is possible.
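
Here is a minimal sketch of that compromise (the payload is hypothetical, not from any real specification): the spec legislates this exact JSON shape, a plain-JSON consumer calls `json_decode()` and never touches the @context, and a JSON-LD consumer still receives a valid JSON-LD document.

    // The legislated, canonical serialization (hypothetical example payload)
    $body = '{
      "@context": "http://schema.org/",
      "@type": "Product",
      "name": "Raspberry Pi",
      "description": "A small affordable computer"
    }';

    // A JSON-only consumer: no context retrieval, no compaction, no surprises
    $product = json_decode($body);
    echo $product->name . "\n";   // the field is exactly where the spec says it is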

It is also then necessary to build conformance suites that validate all interactions for simultaneous JSON and JSON-LD parse-ability (a sketch of such a check follows below). It is really difficult to maintain sufficient discipline - because if a subset of the interoperating applications start using JSON-LD libraries for serialization and de-serialization, it will be really easy to drift away from the "also meets JSON parse-ability" requirement. Then when those JSON-LD systems interact with systems that use only JSON for serialization and de-serialization, it will get ugly quickly. Inevitably, uninformed JSON-LD advocates will claim they have the moral high ground, refuse to comply with the JSON-only syntax, and tell everyone they should be using JSON-LD libraries instead - and it won't take much of a push for interoperability to descend into finger-pointing hell.
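
As a sketch of what such a conformance check might look like (the helper name and the naive field-order-sensitive comparison are my own assumptions, not part of any existing suite): every message must come out the same whether it is parsed as plain JSON or compacted against the legislated context.

    require_once "jsonld.php";

    // Hypothetical conformance check: a message conforms only if a plain JSON
    // parse and a JSON-LD compaction against the legislated context agree.
    function conforms($body, $legislated_context) {
        $plain = json_decode($body);
        if ($plain === null) return false;   // must be valid JSON at all

        $compacted = jsonld_compact(json_decode($body), $legislated_context);

        // Ignore @context itself; everything else must line up exactly.
        unset($plain->{'@context'});
        unset($compacted->{'@context'});

        // Naive comparison - a real suite would compare field by field.
        return json_encode($plain) == json_encode($compacted);
    }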

So while this compromise seems workable at the beginning, it is just the Semantic Web/RDF camel getting its nose under the proverbial tent. Supporting an infinite number of equivalent serialization formats is neither a bug nor a feature - it is a disaster.

If the JSON-LD community actually wants its work to be used outside the "Semantic Web" backwaters - or outside situations where hipsters make all the decisions and never run their code in production - the JSON-LD community should stand up and publish a best practice for using JSON-LD in a way that maintains compatibility with JSON, so that APIs can be interoperable and performant in all programming languages. This document should be titled "High Performance JSON-LD" and be featured front and center when talking about JSON-LD as a way to define APIs.


Update: The JSON-LD folks wrote a very good blog post that looks at this in more detail: JSON-LD Best Practice: Context Caching. This is a great post, as it goes into more detail on the nature of the performance issues and is a good start towards making JSON-LD more tractable. But to me the overall conclusion is still to use a highly constrained JSON-LD syntax and not to use JSON-LD parsers in high-performance applications.