I’m so old that I came from the age of XML. XML was going to revolutionize simplicity in data transfer. You might still work with it if you use the XML web services provided by Google Cloud. But every API Google offers in XML, it also seems to provide in JSON, so you might as well go with the trend. Especially since XML seems preferred by commercial software, and JSON by free software (and Javascript, the language the format was designed for).
XML will do all of the above if you save the XSD and XSL and change the XML file to reference them. But it won’t transform badly formatted HTML into XML. That is why I created the HTMLparser in .NET 2.0.
But JSON is supported by Python and by your web browser’s Javascript engine, which automatically deserialize it into one of the platform’s native data objects. Javascript’s JSON.parse() turns JSON into a Javascript object. Python turns JSON into a dictionary. This means you don’t have to write code to validate that it is correct and to extract the data into one of the native datatypes (to a certain degree; dates don’t seem to be supported very well).
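In Python, for example, the standard json module does the whole deserialization in one call (the field names here are just made up for illustration):

```python
import json

# One call turns a JSON string into a native Python dict,
# with each JSON type mapped to the matching Python type.
record = json.loads('{"name": "Ada", "scores": [95, 87], "active": true}')

print(record["name"])       # a str
print(record["scores"][0])  # ints inside a list
print(record["active"])     # JSON true becomes Python True
```

No hand-written parsing or type conversion needed; the library picks the native type from the JSON syntax.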
jq is a Linux utility that will query files containing JSON and, if you wish, turn them into another JSON object. If you just wish to search for data, it has filters, and you can turn the output into shorter JSON that is easier to read. But its group-by function is complex, implementing complicated functions in it can get unwieldy, and I don’t think a SQL-style join is supported (although I could be wrong, because all my search results 2 years ago came up negative, or pointed at a MERGE command that worked more like Python’s zip()).
Why do you call it JOIN?
I’m old, and learned SQL and relational database technology, which gave businesses a way to process and examine their data. SQL calls cross-referencing records against specific other records INNER JOIN’ing. This way of working with data was prevalent in my day. And when you have a hammer, everything looks like a nail. When I see mostly rigid datatypes with many objects to be organized into a useful form, I see a nail, and work the data with JOINs and GROUP BYs.
This is probably all old tech nowadays
So I implemented a SQL-style INNER JOIN in Python (jqoin.py), and you can download it here.
There is also an easy way (histfrom.py) to turn JSON lists into a single histogram.
And a way to implement a computed field for as-the-bird-flies distance between 2 coordinates, using an external program (distfrom.py).
And an easier way to represent aggregate calculations (groupbyjson.py), while still being able to separate bins, that is simple enough to customize for your own purposes (meaning “runs slow” vs “implemented fast”).
What is JSON?
It is a data format, invented by Javascript developers, that looks like the source code for a Javascript object declaration, but is much more strict. You can represent primitives such as ints, floats, and strings. Dates are often serialized into strings correctly, but rarely parsed correctly. The more complex types JSON can represent are objects (name/value pairs) and arrays. Arrays and objects can be nested for complex data structures.
{ "field1": "value1", "field2": [2, 3, "value4"], "field3": { "field4": "value5" } }
is a single object, with fields that contain 1) a primitive string, 2) a nested array, and 3) a nested object.
JSON can be as simple as a string primitive:
"I am string"
but what good would that be to anyone?
Why use JSON?
Javascript and other JSON parsers infer the type from the content.
This is an integer:
5
calling JSON.parse("5") turns the string into the int type.
This is a float:
5.4
calling JSON.parse("5.4") turns the string into the float type.
This is an object:
{"a":"c"}
calling JSON.parse('{"a":"c"}') turns the string into an object in Javascript, and a dictionary in Python.
This is an array:
[1,2,3]
calling JSON.parse("[1,2,3]") turns the string into an array.
This usage is fairly consistent across consumers and producers storing data in JSON. And JSON.stringify() creates the same string format every time, given the same datatype. So it makes saving data and making it portable easy: 1 line.
And any system that keeps its data portability this simple is probably easy to debug.
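The same 1-line round trip works in Python, where json.dumps() plays the role of JSON.stringify() and json.loads() plays the role of JSON.parse():

```python
import json

data = {"field1": "value1", "field2": [2, 3, "value4"]}

# Serialize to a portable string, then parse it back: a 2-line round trip
# that recovers exactly the structure we started with.
text = json.dumps(data)
restored = json.loads(text)
```

Because the string format is deterministic for a given structure, debugging usually amounts to reading the string with your own eyes.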
Why not use JSON?
Other than the simple rules above, there is no way to validate that JSON conforms to what your code is expecting. So you had better have a lot of code checking that the data is correct. But its ease of use, and simplicity to understand, made it widespread.
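Here is a minimal sketch of the kind of hand-rolled checking JSON forces on you; the field names and rules are invented for illustration, since every expectation has to live in your own code:

```python
import json

def validate_user(text):
    """Parse a JSON string and check it matches the shape our code expects.
    JSON itself carries no schema, so every rule is written by hand."""
    obj = json.loads(text)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(obj.get("name"), str):
        raise ValueError("'name' must be a string")
    if not isinstance(obj.get("age"), int):
        raise ValueError("'age' must be an integer")
    return obj
```

Multiply this by every field of every message type, and you see why schema languages exist.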
Other technology, such as XML, has XSD to validate that the data format complies with what the receiver expects. This can even be used to generate code. SOAP is often implemented by using the XML schema to create proxy classes via a code generator; then you just set values on the objects the generator created, and the proxy class generates the XML. You can’t do this in JSON.
Error checking is an important feature to have in complex systems, where you want failures to occur as close to the source of the problem as possible, to make them easier to trace.
Why use jq?
If you store lots of JSON data, all formatted the same, and you want a way to process it interactively from the command line rather than writing a Python or Javascript program, jq will read several JSON objects as if they were an array.
{"a":"1"} {"a":"2"} {"a":"3"} {"a":"4"}
is obviously 4 objects, but normally Javascript’s JSON.parse() accepts only 1 value. JSON.parse('{"a":"1"}{"a":"2"}{"a":"3"}{"a":"4"}') throws an error. But add comma delimiters and wrap in [] brackets, JSON.parse('[{"a":"1"},{"a":"2"},{"a":"3"},{"a":"4"}]'), and it will parse as a single array.
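If you wanted to consume such a concatenated stream in Python without jq, one way (a sketch, not anything from the tools described here) is json.JSONDecoder.raw_decode(), which parses one value and reports where it stopped:

```python
import json

def parse_stream(text):
    """Parse concatenated JSON values like '{"a":"1"}{"a":"2"}' into a list."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while pos < len(text):
        # Skip any whitespace between values, then decode the next one.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values
```

raw_decode() returns both the parsed value and the index where it ended, which is exactly what a multi-object stream needs.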
jq, however, will accept multiple objects,
echo '{"a":"1"}{"a":"2"}{"a":"3"}{"a":"4"}' | jq '.'
and it will even turn them into an array, if you ask
echo '{"a":"1"}{"a":"2"}{"a":"3"}{"a":"4"}' | jq -s '.'
but it is at its best when you ask it to return data, so you don’t have to write a long program to get it:
echo '{"a":1}{"a":2}{"a":3}{"a":4}' | jq '.|select(.a > 2)'
Rather than accepting data piped from stdin, you can supply a filename or filename pattern
cat files* | jq '.|select(.a > 2)'
equals
jq '.|select(.a > 2)' file*
Sometimes each of your JSON objects is long, and you only want a small part of each object. Here we display only the “a” field, turning the stream into a list of strings.
echo '{"a":"1","data":"..............................................."}{"a":"2","data":"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"}{"a":"3","data":"#####################################"}{"a":"4","data":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"}' | jq '.a'
Or you can turn it into shorter JSON, if you wish
echo '{"a":"1","data":"..............................................."}{"a":"2","data":"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"}{"a":"3","data":"#####################################"}{"a":"4","data":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"}' | jq '{"id":.a}'
Similar functionality was often used in batch processing for SQL databases, in the old days. Or you can just use it to look up something in your JSON datafile(s).
What is this program you named horribly as “jqoin.py”?
It is a portmanteau of jq and join. I come from the era of Ben Affleck and Jennifer Lopez, when it was cool to refer to them as Bennifer. I’m sorry; it is hard outgrowing some things. It will join 2 streams of JSON files the way SQL inner joins records between 2 tables. You have to specify to jqoin how the two JSON streams correlate to each other, and it will give you the pairs of JSON objects that correlate. SQL supports multiple ways to correlate; jqoin only supports equality on (sub)strings.
echo '{"a":"1","data":"..............................................."}' >a1.txt
echo '{"a":"2","data":"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"}' >a2.txt
echo '{"a":"3","data":"#####################################"}' >a3.txt
echo '{"a":"4","data":"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"}' >a4.txt
echo '{"b":"2","data":"@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"}' >b2.txt
echo '{"b":"3","data":"%%%%%%%%%%%%%%%%%%%%%%%%%%%"}' >b3.txt
jqoin.py 'a*.txt' '.a' 'b*.txt' '.b'
should give you 2 objects, where the “.a” in a*.txt and the “.b” in b*.txt match:
{ "key": "2", "a": { "filename": "a2.txt", "json": {"a":"2","data":"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"} }, "b": { "filename": "b2.txt", "json": {"b":"2","data":"@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"} } }
{ "key": "3", "a": { "filename": "a3.txt", "json": {"a":"3","data":"#####################################"} }, "b": { "filename": "b3.txt", "json": {"b":"3","data":"%%%%%%%%%%%%%%%%%%%%%%%%%%%"} } }
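The underlying idea is the classic hash join: index one side by its key, then stream the other side past the index. This is a simplified sketch of the technique, not the actual jqoin.py code:

```python
def inner_join(left_rows, left_key, right_rows, right_key):
    """SQL-style INNER JOIN on two lists of dicts: build a hash index on
    the left side's key, then probe it with each right-side row."""
    index = {}
    for row in left_rows:
        index.setdefault(row[left_key], []).append(row)
    for row in right_rows:
        for match in index.get(row[right_key], []):
            # Emit one combined record per matching pair, like SQL does.
            yield {"key": row[right_key], "a": match, "b": row}

a_rows = [{"a": "2", "data": "~~~"}, {"a": "3", "data": "###"}]
b_rows = [{"b": "2", "data": "@@@"}, {"b": "3", "data": "%%%"}]
joined = list(inner_join(a_rows, "a", b_rows, "b"))
```

Rows with no partner on the other side simply never appear in the output, which is exactly the INNER JOIN semantics.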
You can “cascade” this, as I gather jq is supposed to support (but I saw few examples, and none back when it mattered to me).
echo '{"c":"2","data":"*************"}' >c2.txt
echo '{"c":"3","data":"-------------"}' >c3.txt
jqoin.py 'a*.txt' '.a' 'b*.txt' '.b' | jqoin.py 'c*.txt' '.c'
Try it, and it will match on the same value as the rest and create a “c” field. You can join on something other than the .key field, but then it will nest the result as { key:"", a:{}, b:{ json:{ key:"", a:{}, b:{} } } }, rather than add a “c” field at the same level.
There are other programs. What are they?
histfrom.py creates a histogram, using the bins provided in the arguments.
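The general technique for binning values (a sketch of the idea, not histfrom.py itself; the half-open-interval convention is my assumption) is a binary search over sorted bin edges:

```python
from bisect import bisect_right

def histogram(values, bins):
    """Count values into half-open intervals [bins[i], bins[i+1]) defined
    by sorted bin edges. Values outside all intervals are ignored."""
    counts = [0] * (len(bins) - 1)
    for v in values:
        i = bisect_right(bins, v) - 1  # index of the interval containing v
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts
```

With edges [0, 10, 20], the values 1 and 5 land in the first bin, 12 in the second, and 25 is dropped.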
distfrom.py creates a computed field with an estimated distance between coordinates, assuming the data has “.lat” and “.lon” fields. You have to provide the coordinates from which to measure.
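The standard way to estimate as-the-bird-flies distance from latitude/longitude pairs is the haversine great-circle formula; a minimal sketch, assuming the coordinates are in decimal degrees (this illustrates the math, not the distfrom.py source):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points,
    using the haversine formula with a mean Earth radius of 6371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))
```

One degree of longitude at the equator comes out to roughly 111 km, a handy sanity check.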
How useful are these?
They are free. How useful is that?
Do you want to run a query fast, with a lot of complexity, on a lot of data, with minimal coding? Then I suggest you load your data into a commercial SQL RDBMS, or even MySQL, and query the data with SQL.