Thoughts on Canonical S-Expressions

October 19, 2019

Datashards currently uses Canonical S-Expressions as its data format, and after using them for a few months, I have some thoughts.

First things first: if you aren't familiar with the format, let me give you a quick rundown. Canonical S-Expressions are a bit like regular S-Expressions, with a twist. If you already know Lisp, none of this will be new, but for the rest of you, there are two kinds of things in an S-Expression: lists and atoms. A list is what it sounds like: a sequence of things. An atom is a single thing. An S-Expression looks like:

(item1 item2 item3 item4)

If you're familiar with Python or JavaScript, you can think of that as roughly the same as:

[item1, item2, item3, item4]

In Canonical S-Expressions (csexp), every atom is actually a byte object, and we state the size of the byte object by prefixing it with the number of bytes, followed by a colon:

(5:hello5:world)

That's a list of two items, 'hello' and 'world'. I'm putting these in quotes but the values aren't strings, they're bytes. That means it's very efficient to put raw binary data in a csexp. If you put binary data in JSON, you'd have to do something like base64 encode it. No need in csexp!
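To make the length-prefix idea concrete, here is a minimal sketch in Python. The helper name encode_atom is made up for illustration and is not taken from any existing csexp library. It also compares the encoded size of 1024 arbitrary bytes against the base64 text you would need for JSON.

```python
import base64


def encode_atom(data: bytes) -> bytes:
    """Hypothetical helper: decimal length, a colon, then the raw bytes."""
    return str(len(data)).encode("ascii") + b":" + data


print(encode_atom(b"hello"))  # b'5:hello'

# 1024 bytes of arbitrary binary data, standing in for an image.
raw = bytes(range(256)) * 4

as_csexp = encode_atom(raw)
as_base64 = base64.b64encode(raw)

print(len(as_csexp))   # 1029: the 1024 raw bytes plus "1024:"
print(len(as_base64))  # 1368: base64 adds roughly a third
```

The csexp atom carries the payload untouched, so the only overhead is the length prefix.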

You can also give a “type hint” in csexp, so if you have a binary object that represents an image, you can stick the mimetype in the csexp, such as:

([image/jpeg]1024:)

You can also store other lists inside of a csexp, such as:

(9:groceries(4:milk5:bread))
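Nesting falls out naturally if you write the generator recursively. The following is a rough sketch, assuming atoms are Python bytes and lists are Python lists or tuples. The function name dump is my own choice, and type hints are left out to keep it short.

```python
def dump(value) -> bytes:
    """Serialize nested lists of bytes into csexp form (sketch, no type hints)."""
    if isinstance(value, bytes):
        # Atom: decimal length, colon, raw bytes.
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, (list, tuple)):
        # List: parentheses around the concatenated items, no separators.
        return b"(" + b"".join(dump(item) for item in value) + b")"
    raise TypeError("csexp values must be bytes or lists, got %r" % type(value))


print(dump([b"groceries", [b"milk", b"bread"]]))
# b'(9:groceries(4:milk5:bread))'
```

Because every atom announces its own length, no whitespace or delimiter is needed between items.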

The Good

The good thing about Canonical S-Expressions is how darn easy they are to write and to write a parser for. You can write a csexp parser/generator in an afternoon. It's really that easy!
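To give a sense of how small such a parser can be, here is a sketch in Python. It is not the Datashards implementation, the names load and _parse are invented for this example, and it deliberately skips type hints. It only handles atoms and nested lists.

```python
def load(data: bytes):
    """Parse one csexp from data into nested lists of bytes (sketch)."""
    value, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after expression")
    return value


def _parse(data: bytes, pos: int):
    """Return (value, next_position) for the expression starting at pos."""
    if pos >= len(data):
        raise ValueError("unexpected end of input")
    if data[pos:pos + 1] == b"(":
        items = []
        pos += 1
        # Keep reading items until the matching close paren.
        while data[pos:pos + 1] != b")":
            if pos >= len(data):
                raise ValueError("unterminated list")
            item, pos = _parse(data, pos)
            items.append(item)
        return items, pos + 1
    # Otherwise this must be an atom: <decimal length>:<raw bytes>
    colon = data.find(b":", pos)
    if colon == -1 or not data[pos:colon].isdigit():
        raise ValueError("expected length prefix at offset %d" % pos)
    start = colon + 1
    end = start + int(data[pos:colon])
    if end > len(data):
        raise ValueError("atom runs past end of input")
    return data[start:end], end


print(load(b"(9:groceries(4:milk5:bread))"))
# [b'groceries', [b'milk', b'bread']]
```

The parser never has to scan the contents of an atom. It reads the length and slices, which is why binary payloads cost nothing extra to parse.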

It's also a very efficient format. Because atoms are raw bytes with a length prefix, you can store image data, text data, anything you want, without escaping or re-encoding it!

And it's extremely versatile. The simplicity is the power!

The Bad

The worst problem I have with csexp is that despite its simplicity, if you want to use it, you're probably going to end up writing your own parser/generator for it. I found a library for Python 2.7, but it didn't work with Python 3, so I had to write my own. My friend Chris Webber wrote the implementation for Racket. As of the time of writing, I don't know of an implementation for JavaScript, Ruby, Golang or Rust. Writing your own library for something this fundamental isn't fun, even if it's not hard.

The second problem that I have with csexps is that they're not very useful for describing data. For example, in Datashards we represent a file size as an integer, say 1000. But in csexp, this is represented as 4:1000, which means that my program has to know to convert the value from bytes to an integer.
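In code, that conversion looks something like the following sketch. The helper names are hypothetical, and it assumes the integer is stored as ASCII decimal digits, as in the 4:1000 example. Nothing in the format itself says so.

```python
def int_to_atom(n: int) -> bytes:
    """Hypothetical helper: write a non-negative int as a csexp atom of ASCII digits."""
    digits = str(n).encode("ascii")
    return str(len(digits)).encode("ascii") + b":" + digits


def atom_to_int(atom: bytes) -> int:
    """Hypothetical helper: turn a parsed atom's bytes back into an int."""
    if not atom.isdigit():
        raise ValueError("atom is not a decimal integer: %r" % atom)
    return int(atom)


print(int_to_atom(1000))     # b'4:1000'
print(atom_to_int(b"1000"))  # 1000
```

The parser hands back b"1000" either way. Only code that already knows this field is a file size can decide to call atom_to_int on it.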

I could use type hints for the type of data, such as [int]4:1000 but this doesn't help in practice because the program reading