Monthly Archives: April 2016

More Tsugi Refactoring – Removal of the mod folder

I completed the last of many refactoring steps of Tsugi yesterday, when I moved the contents of the “mod” folder into its own repository. The goal of all this refactoring was to get to the point where checking out the core Tsugi repository does not include any end-user tools – it just includes the administrator, developer, key management, and support capabilities (LTI 2, CASA, ContentItem Store). The key is that this console will also be used by the Java and NodeJS implementations of Tsugi until we build the console functionality in each of those languages, so it made no sense to drag in a bunch of PHP tools if you were just going to use the console. I wrote a bunch of new documentation showing how the new “pieces of Tsugi” fit together:

https://github.com/csev/tsugi/blob/master/README.md

This means that as of this morning if you do a “git pull” in your /tsugi folder – the mod folder will disappear. But have no fear – you can restore it with the following steps:

cd tsugi
git clone https://github.com/csev/tsugi-php-mod mod

And your mod folder will be restored. You will now have to do separate git pulls for both Tsugi and the mod folder.
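For convenience, here is a minimal update sketch based on the layout above – it simply runs the two pulls in sequence (paths assume the checkout described earlier):

cd tsugi
git pull
cd mod
git pull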

I have all this in solid production (with the mod folder restored as above) for my Coursera and on-campus UMich courses. So I am pretty sure it holds together well.

This was the last of a multi-step refactor to modularize this code into multiple repositories so as to better prepare for Tsugi in multiple languages as well as for plugging Tsugi into various production environments.

Ring Fencing JSON-LD and Making JSON-LD Parseable Strictly as JSON

My debate with my colleagues[1, 2] about the perils of unconstrained JSON-LD as an API specification is coming to a positive conclusion. We have agreed to the following principles:

  • Our API standard is a JSON standard and we will constrain our JSON-LD usage so that the API can be deterministically produced and consumed using *only* JSON parsing libraries. During de-serialization, it must be possible to parse the JSON deterministically using a JSON library without looking at the @context at all. During serialization, it must be possible to produce the correct JSON deterministically and add a hard-coded and well-understood @context section that does not need to change.
  • There should never be a requirement in the API specification or in our certification suite that forces the use of JSON-LD serialization or de-serialization on either end of the API.
  • If some software in the ecosystem covered by the standard decides to use JSON-LD serializers or de-serializers and cannot produce the canonical JSON form for our API – that software will be forced to change and generate the precise constrained JSON (i.e. we will ignore any attempts to coerce the rest of the ecosystem using our API into accepting unconstrained JSON-LD).
  • Going forward we will make sure that the sample JSON we publish in our specifications will always be in JSON-LD compacted form with either a single @context or multiple contexts with the default context included as “@vocab”, with all fields in the default context having no prefixes and all fields outside the default context having simple and predictable prefixes.
  • We are hopeful and expect that compacted JSON-LD is so well defined in the JSON-LD W3C specification that all implementations in all languages that produce compacted JSON-LD with the same context will produce identical JSON. If for some strange reason a particular JSON-LD compacting algorithm starts producing JSON that is incompatible with our canonical JSON – we will expect the JSON-LD serializer to change, not our specification.
  • In the case of extending the data model, the prefixes used in the JSON will be agreed upon to maintain predictable JSON parsing. If we cannot pre-agree on the precise prefixes themselves then at least we can agree on a convention for prefix naming. I will recommend they start with “x_” to pay homage to the use of “X-” in RFC-822 and friends.
  • As we build API certification mechanisms we will check and validate incoming JSON to ensure that it is valid JSON-LD, issue a warning for any flawed JSON-LD, but consider that non-fatal and use only the deterministic JSON parsing to judge whether or not an implementation passes certification (a minimal sketch of such a check appears below).

The hope is that for the next 3-5 years we can rely on JSON-only infrastructure while at the same time laying the groundwork for a future set of more elegant and expandable APIs using JSON-LD once the performance and ubiquity concerns around JSON-LD are addressed.
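To make the certification principle concrete, here is a minimal sketch of what such a check could look like. The file name and field names are hypothetical, and the advisory JSON-LD validation is shown as a jsonld_expand() call from the php-json-ld library referenced later in this post – only the plain JSON parse decides pass or fail:

    require_once "jsonld.php";   // digitalbazaar/php-json-ld

    // Hypothetical incoming payload for a certification check
    $body = file_get_contents('incoming.json');

    // The deterministic JSON parse - this is the only thing that decides pass/fail
    $data = json_decode($body, true);
    if ($data === null) {
        die("FAIL: payload is not valid JSON\n");
    }

    // Advisory JSON-LD check - a warning only, never fatal
    try {
        jsonld_expand(json_decode($body));
    } catch (Exception $e) {
        echo "WARNING: payload is not valid JSON-LD: " . $e->getMessage() . "\n";
    }

    // Certification logic looks only at the agreed JSON fields, never at @context
    $name = $data['name'];
    $debug = isset($data['csev:debug']) ? $data['csev:debug'] : null;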

Some Sample JSON To Demonstrate the Point

Our typical serialization starts with the short form for a single default @context as in this example from the JSON-LD playground:

{
  "@context": "http://schema.org/",
  "@type": "Person",
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "telephone": "(425) 123-4567",
  "url": "http://www.janedoe.com"
}

But let’s say we want to extend this with a http://dr-chuck.com/ field – the @context would need to switch from a single string to an object that maps prefixes to IRIs as shown below:

{
  "@context": {
    "@vocab": "http://schema.org/",
    "csev": "http://dr-chuck.com/"
  },
  "@type": "Person",
  "url": "http://www.janedoe.com",
  "jobTitle": "Professor",
  "name": "Jane Doe",
  "telephone": "(425) 123-4567",
  "csev:debug" : "42"
}

If you compact this with only the single http://schema.org/ context – all of the extensions get expanded:

 
{
  "@context": "http://schema.org/",
  "type": "Person",
  "http://dr-chuck.com/debug": "42",
  "jobTitle": "Professor",
  "name": "Jane Doe",
  "telephone": "(425) 123-4567",
  "schema:url": "http://www.janedoe.com"
}

The resulting JSON is tacky and inelegant. If on the other hand you compact with this context:

{
  "@context": {
    "@vocab" : "http://schema.org/",
    "csev" : "http://dr-chuck.com/"
  }
}

You get JSON that is succinct and deterministic with predictable prefixes and that, minus the context, looks like clean JSON that one might design even without the influence of JSON-LD.

 
{
  "@context": {
    "@vocab": "http://schema.org/",
    "csev": "http://dr-chuck.com/"
  },
  "@type": "Person",
  "csev:debug": "42",
  "jobTitle": "Professor",
  "name": "Jane Doe",
  "telephone": "(425) 123-4567",
  "url": "http://www.janedoe.com"
}

What is beautiful here is that when you use the @vocab plus extension prefixes as the @context, our “canonical JSON serialization” can be read by JSON-LD parsers and produced deterministically by a JSON-LD compact process.

In a sense, what we want for our canonical serialization is the output of a jsonld_compact operation – and if you were to run the resulting JSON through jsonld_compact again, you would get the exact same JSON.
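Here is a minimal sketch of that idempotence check using the php-json-ld library referenced later in this post – the file name is hypothetical (it holds the compacted Person example above):

    require_once "jsonld.php";   // digitalbazaar/php-json-ld

    // The agreed context: default vocabulary plus one extension prefix
    $ctx = (object) array(
        "@vocab" => "http://schema.org/",
        "csev" => "http://dr-chuck.com/"
    );

    // person.json is the canonical (already compacted) serialization shown above
    $canonical = file_get_contents('person.json');
    $compacted = jsonld_compact(json_decode($canonical), $ctx);

    // Compare the two as structures so key ordering does not matter
    $same = json_decode($canonical, true) == json_decode(json_encode($compacted), true);
    echo $same ? "identical\n" : "different\n";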

Taking this approach – and pre-agreeing on the official contexts, the prefixes for those contexts, and a prefix naming convention for any and all extensions – means we should be able to use pure-JSON libraries to parse the JSON whilst ignoring the @context completely.

Conclusion

Comments welcome. I expect this document will be revised and clarified over time to ensure that it truly represents a consensus position.

Abstract: Massively Open Online Courses (MOOCs) – Past, Present, and Future

This presentation will explore what it was like when MOOCs were first emerging in 2012 and talk about what we have learned from the experience so far. Today, MOOC providers are increasingly focusing on becoming profitable and this trend is changing both the nature of MOOCs and university relationships with MOOC platform providers. We will also look at how a university can scale the development of MOOCs and use knowledge gained in MOOCs to improve on-campus teaching. Finally, we will look forward at how the MOOC market may change and how MOOC approaches and technologies may ultimately impact campus courses and programs.

Unconstrained JSON-LD Performance Is Bad for API Specs

I am still arguing fiercely with some of my enterprise architect friends about whether we should use JSON or JSON-LD to define our APIs. I did some research this morning that I think is broadly applicable, so I figured I would share it widely.

You might want to read as background the following 2014 blog post from Manu Sporny, who is one of the architects of JSON-LD:

http://manu.sporny.org/2014/json-ld-origins-2/

Here are a few quotes:

I’ve heard many people say that JSON-LD is primarily about the Semantic Web, but I disagree, it’s not about that at all. JSON-LD was created for Web Developers that are working with data that is important to other people and must interoperate across the Web. The Semantic Web was near the bottom of my list of “things to care about” when working on JSON-LD, and anyone that tells you otherwise is wrong. :)

TL;DR: The desire for better Web APIs is what motivated the creation of JSON-LD, not the Semantic Web. If you want to make the Semantic Web a reality, stop making the case for it and spend your time doing something more useful, like actually making machines smarter or helping people publish data in a way that’s useful to them.

In the vein of Manu’s TL;DR above, I will add my own TL;DR for this post:

TL;DR: Using unconstrained JSON-LD to define an API is a colossal mistake.

There is a lot to like about JSON-LD – I am glad it exists. For example, JSON-LD is far better than XML with namespaces, better than XML Schema, and better than WSDL. And JSON-LD is quite suitable for long-lived documents that will be statically stored and have data models that slowly evolve over time, where any processing and parsing is done in batch mode (perhaps like the content behind Google’s PageRank algorithm).

But JSON-LD is really bad for APIs that need sub-millisecond response times at scale. Please stop your enterprise architects from making this mistake just so they gain “cool points” at the enterprise architect retreats.

Update: I removed swear words from this post 4-Apr-2016 and added the word “unconstrained” several places to be more clear. Also I made a sweet web site to show what I mean by “unconstrained JSON-LD” – I called it the JSON-LD API Failground.

Update II: Some real JSON-LD experts (Dave Longley and Manu Sporny) did their own performance tests that provide a lot more detail and better analysis than my own simplistic analysis. Here is a link to their JSON-LD Best Practice: Context Caching – they make the same points as I do but with more precision and detail.

Testing JSON-LD Performance

This is a very simple test simulating the parsing of a JSON-only document versus a JSON-LD document. The code is super simple. Since JSON-LD requires that the document first be parsed as JSON and then augmented by JSON-LD, to run an A/B performance test we simply turn the additional JSON-LD step on and off and time it.

This code uses the JSON-LD PHP library from Manu Sporny at:

https://github.com/digitalbazaar/php-json-ld

I use the sample JSON-LD for the Product example at:

http://json-ld.org/playground/

Methodology of the code – it is quite simple:

    require_once "jsonld.php";   // digitalbazaar/php-json-ld

    // Read the sample Product document once; parse it 1000 times
    $x = file_get_contents('product.json');
    $result = array();
    for($i=0;$i<1000;$i++) {
       $y = json_decode($x);                            // plain JSON parse
       $y = jsonld_compact($y, "http://schema.org/");   // the extra JSON-LD step
       $result[] = $y;                                  // keep all 1000 results to observe memory usage
    }

To run the JSON-only version, simply comment out the `jsonld_compact` call. We reuse the $y variable to make sure we don't double-store any data, and we accumulate the 1000 parsed results in an array to get a sense of whether or not there is a difference in memory footprint between JSON and JSON-LD.

I ran the test with `/usr/bin/time` on my MacBook Pro 15 with PHP 5.5.

Output of the test runs

si-csev15-mbp:php-json-ld-test-02 csev$ /usr/bin/time -l php j-test.php
            0.09 real         0.08 user         0.00 sys
      17723392  maximum resident set size
             0  average shared memory size
             0  average unshared data size
             0  average unshared stack size
          4442  page reclaims
             0  page faults
             0  swaps
             0  block input operations
             6  block output operations
             0  messages sent
             0  messages received
             0  signals received
             0  voluntary context switches
             6  involuntary context switches
si-csev15-mbp:php-json-ld-test-02 csev$ /usr/bin/time -l php jl-test.php
          167.58 real         4.94 user         0.51 sys
      17534976  maximum resident set size
             0  average shared memory size
             0  average unshared data size
             0  average unshared stack size
          4428  page reclaims
             0  page faults
             0  swaps
             0  block input operations
             0  block output operations
         14953  messages sent
         24221  messages received
             0  signals received
          2998  voluntary context switches
          6048  involuntary context switches

Results by the numbers

Memory usage is equivalent - actually slightly lower for the JSON-LD - that is kind of impressive and probably leads to a small net benefit for long-lived document-style data. Supporting multiple equivalent serialized forms may save space at the cost of processing.

Real time for the JSON-LD parsing is nearly 2000X more costly than JSON (167.58 seconds versus 0.09 seconds of real time) - well beyond three orders of magnitude. [*]

CPU time for the JSON-LD parsing is about 70X more costly (5.45 seconds of user+sys time versus 0.08 seconds) - almost two orders of magnitude. [*]

[*] Some notes for the "Fans of JSON-LD"

To stave off the obvious objections that will arise from the Enterprise-Architect crowd eager to rationalize JSON-LD at any cost, I will simply put the most obvious reactions to these results here in the document.

  1. Of course the extra order of magnitude increase in real time is due to the many repeated re-retrievals of the context documents. JSON-LD evangelists will talk about "caching" - this is an irrelevant argument because virtually none of the shared-hosting PHP servers allow caching, so at least in PHP the "caching fixes this" argument is useless (a sketch of what that mitigation would look like follows this list). Any normal PHP application in a real production environment will be forced to re-retrieve and re-parse the context documents on every request / response cycle.
  2. The two orders of magnitude increase in CPU time is harder to explain away. The evangelists will claim that a caching solution would cache the post-parsed versions of the documents - but given that the original document is one JSON document and there are five context documents, the additional parsing from string to JSON would only explain a 5X increase in CPU time - not a 70X increase. My expectation is that even with cached pre-parsed documents the additional order of magnitude is due to the need to loop through the structures over and over to detect many levels of *potential* indirection between prefixes, contexts, and possible aliases for prefixes or aliases.
  3. A third argument about the CPU time might be that json_decode is written in C in PHP while jsonld_compact is written in PHP - and if jsonld_compact were written in C and merged into the PHP core and all of the hosting providers around the world upgraded to PHP 12.0, then perhaps the negative performance impact of JSON-LD would be somewhat lessened - "when pigs fly".
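For reference, here is a minimal sketch of what the "caching" mitigation from item 1 amounts to in practice: fetch and parse the target context once, outside the loop, and hand the already-parsed context to jsonld_compact(). The local file name is hypothetical (a saved copy of the http://schema.org/ context); any @context embedded in the input document may still trigger remote retrievals inside the library, and in a typical PHP request/response cycle even this work would be repeated on every request:

    require_once "jsonld.php";   // digitalbazaar/php-json-ld

    // Hypothetical saved copy of the http://schema.org/ context, parsed once
    $ctx = json_decode(file_get_contents('schema-org-context.json'));

    $x = file_get_contents('product.json');
    $result = array();
    for($i=0;$i<1000;$i++) {
       $y = json_decode($x);
       $y = jsonld_compact($y, $ctx);   // no per-iteration fetch of the target context
       $result[] = $y;
    }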

Conclusion

Unconstrained JSON-LD should never be used for non-trivial APIs - period. Its out-of-the-box performance is abhorrent.

Some of the major performance failure could be explained away if we could magically improve hosting plans and build the most magical of JSON-LD implementations - but even then it costs over an order of magnitude more to parse JSON-LD than to parse JSON because of the requirement to transform an infinite number of equivalent forms into a single canonical form.

Ultimately it means that if a large-scale operator started using JSON-LD based APIs heavily to enable a distributed LMS - to the point where the core servers are spending more time servicing standards-based API calls than generating UI markup - it would require somewhere between 10 and 100 times more compute power to support JSON-LD than to simply support JSON.

Frankly in the educational technology field - if you want to plant a poison pill in the next generation of digital learning systems - I cannot think of a better poison pill than making interoperability standards using JSON-LD as the foundation.

I invite anyone to blow a hole in my logic - the source code is here:

https://github.com/csev/json-ld-fail/blob/master/README.md

A Possible Solution

The only way to responsibly use JSON-LD in an API specification is to have a canonical serialized JSON form that is *the* required specification - it can also be valid JSON-LD, but it must be possible to deterministically parse the API material using only JSON and ignoring the @context completely. If there is more than one @context because of extensions, then the prefixes used by the contexts other than the @vocab must also be legislated so that, once again, a predictable JSON-only parse of the document without looking at the contexts is possible.

It is also then necessary to build conformance suites that validate all interactions for simultaneous JSON and JSON-LD parse-ability. It is really difficult to maintain sufficient discipline - because if a subset of the interoperating applications start using JSON-LD for serialization and de-serialization, it will be really easy to drift away from the "also meets JSON parse-ability" requirement. Then when those JSON-LD systems interact with systems that use JSON only for serialization and de-serialization, it will get ugly quickly. Inevitably the uninformed JSON-LD advocates will claim they have the high moral ground, refuse to comply with the JSON-only syntax, and tell everyone they should be using JSON-LD libraries instead - and it won't take much of a push for interoperability to descend into finger-pointing hell.

So while this compromise seems workable at the beginning - it is just the Semantic Web/RDF Camel getting its nose under the proverbial tent. Supporting an infinite number of equivalent serialization formats is neither a bug nor a feature - it is a disaster.

If the JSON-LD community actually wants its work to be used outside the "Semantic Web" backwaters - or outside situations where hipsters make all the decisions and never run their code in production - the JSON-LD community should stand up and publish a best practice for using JSON-LD in a way that maintains compatibility with JSON, so that APIs can be interoperable and performant in all programming languages. This document should be titled "High Performance JSON-LD" and be featured front and center when talking about JSON-LD as a way to define APIs.


Update: The JSON-LD folks wrote a very good blog post that looks at this in more detail: JSON-LD Best Practice: Context Caching. This is a great post as it goes into more detail on the nature of the performance issues and is a good start towards making JSON-LD more tractable. But to me the overall conclusion is still to use a highly constrained JSON-LD syntax but not to use JSON-LD parsers in high-performance applications.