emacsen

Datashards is finally getting traction in the world and so it's time to reflect on where we are and where the project is going.

Datashards is a project that offers up a new storage primitive for secure data storage and transmission. With Datashards, data at rest is encrypted and also protected against data shape attacks. Datashards is designed to work either online or offline, and it lets you store your data on someone else's machine even if you don't trust them.

Datashards has the opportunity to be an entirely transformational technology in terms of being able to safely store and transmit data.

We've already proven that the concept works and that it can be implemented in multiple languages: Fixed Datashards (previously Immutable Datashards) is implemented in Racket and Python, and Updatable Datashards (previously Mutable Datashards) is implemented in Racket.

In the next few months, we'll be working to get Updatable Datashards implemented in Python.

We're also working with a talented and dedicated software developer to get a Javascript implementation of Datashards (both Fixed and Updatable), which we hope will open up many new opportunities.

We will be highlighting these libraries on the Datashards website, along with documentation on how Datashards works and implementation guidelines.

In even more exciting news, we're starting work on a protocol built on top of Datashards designed to enable Datashards servers to communicate.

Datashards is a storage primitive. In that way, it's a bit like the concept of a file: useful as a concept, but without implementations and applications, nothing more than an interesting idea. The protocols that we're building on top of Datashards are akin to a filesystem built on top of those primitives, and they will allow developers to build interesting things using Datashards.

To accomplish this task well, Chris and I have been working with possible users of the technology. We've also spent time researching similar systems from the past, along with various peer-to-peer messaging technologies and patterns, in order to build something that is practical, scalable and built on solid engineering principles.

Thoughts on Canonical S-Expressions

Datashards currently uses Canonical S-Expressions as a data format and after using it for a few months, I have some thoughts.

First things first: If you aren't familiar with the format, let me give you a quick rundown. Canonical S-Expressions are a bit like regular S-Expressions, with a twist. If you already know Lisp, none of this will be new, but for the rest of you, there are two kinds of item in an S-Expression: a list and an atom. A list is what it sounds like: a sequence of things. And an atom is a single thing. An S-Expression looks like:

(item1 item2 item3 item4)

If you're familiar with Python or Javascript, you can think of that as the same as:

[item1, item2, item3, item4]

In Canonical S-Expressions (csexp), every atom is actually a byte object, and we state the size of the byte object by prepending it with the number of bytes, followed by a colon:

(5:hello5:world)

That's a list of two items, 'hello' and 'world'. I'm putting these in quotes but the values aren't strings, they're bytes. That means it's very efficient to put raw binary data in a csexp. If you put binary data in JSON, you'd have to do something like base64 encode it. No need in csexp!

You can also give a “type hint” in csexp, so if you have a binary object that represents an image, you can stick the mimetype in the csexp, such as:

([image/jpeg]1024:<bytes>)

You can also store other lists inside of a csexp, such as:

(9:groceries(4:milk5:bread))
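To make the encoding rules above concrete, here is a minimal sketch of an encoder in Python. The function name `encode` is made up for illustration, it assumes atoms are `bytes` and lists are Python lists, and it ignores type hints entirely.

```python
def encode(value):
    """Encode nested lists of bytes as a canonical S-expression.

    Hypothetical helper for illustration; type hints are not handled.
    """
    if isinstance(value, bytes):
        # An atom is its length in decimal, a colon, then the raw bytes.
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        # A list is its encoded items, back to back, inside parentheses.
        return b"(" + b"".join(encode(item) for item in value) + b")"
    raise TypeError(f"cannot encode {type(value).__name__}")


print(encode([b"hello", b"world"]))                  # b'(5:hello5:world)'
print(encode([b"groceries", [b"milk", b"bread"]]))   # b'(9:groceries(4:milk5:bread))'
```

Because the length prefix says exactly how many bytes follow, the atom's contents never need escaping, which is why raw binary data fits without any extra encoding step.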

The Good

The good thing about Canonical S-Expressions is how darn easy they are to write and to write a parser for. You can write a csexp parser/generator in an afternoon. It's really that easy!

It's also a very efficient format. You can store image data, text data, anything you want!

And it's extremely versatile. The simplicity is the power!

The Bad

The worst problem I have with csexp is that despite its simplicity, if you want to use it, you're probably going to end up writing your own parser/generator for it. I found a library for Python 2.7, but it didn't work for Python 3, so I had to write my own. My friend Chris Webber wrote the implementation for Racket. As of the time of writing, I don't know of an implementation for Javascript, Ruby, Golang or Rust. Writing your own library for something this fundamental isn't fun, even if it's not hard.

The second problem that I have with csexps is that they're not very useful for describing data. For example, in Datashards we represent a file size as an integer, say 1000. But in csexp, this is represented as 4:1000, which means that my program has to know to convert the value from bytes to an integer.

I could use type hints for the type of data, such as [int]4:1000 but this doesn't help in practice because the program reading

What I Like

There's a lot to like about Canonical S-Expressions. They're extremely space efficient, very flexible and super easy to parse. Writing a reader for a csexp is fairly trivial. And even if your language doesn't already have a csexp library, you can easily write one in a day, if not an afternoon.
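As a rough illustration of how small a reader can be, here is a sketch of a recursive parser in Python. The name `parse` and its return shape are assumptions of mine, not the Datashards library's API, and the sketch skips type hints and error handling for malformed input.

```python
def parse(data, pos=0):
    """Parse one csexp item starting at data[pos].

    Returns (value, next_pos). Atoms become bytes and lists become
    Python lists. Hypothetical sketch: type hints are not supported.
    """
    if data[pos:pos + 1] == b"(":
        items = []
        pos += 1
        # Keep reading items until the matching closing parenthesis.
        while data[pos:pos + 1] != b")":
            item, pos = parse(data, pos)
            items.append(item)
        return items, pos + 1  # step past the closing parenthesis
    # Otherwise it is an atom: <length>:<bytes>
    colon = data.index(b":", pos)
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length


value, end = parse(b"(9:groceries(4:milk5:bread))")
print(value)  # [b'groceries', [b'milk', b'bread']]
```

Everything comes back as bytes or lists of bytes, so any further meaning (integers, dates, keys) still has to be added by the application.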

The other thing I like about Canonical S-Expressions is that they do what they claim to do and nothing else. They're a binary format that only does byte strings and lists.

What's Not to Like About Canonical S-Expressions

Working with CSEXP data can be a pain. You're always stuck writing a reader for your data. Your reader takes the abstract parsed data and converts it into something your application will actually consume. In some cases this conversion is easy: 3:100 becomes the integer 100. If you want to store more complex data structures, such as associative arrays, however, then you'll need to think about it.

Since CSEXP doesn't have associative arrays, only lists, you'll end up writing the serialization/deserialization format on your own. You could store them as lists of lists, ((key val) (key val)) or the more compact form of (key val key val) or you could (ab)use the hint system, such as ([key]value [key]value). Whatever choice you make, it will be specific to your application and someone who reads the document will need to think about the choices you made beforehand. Or if you're inheriting data in this format, you may end up having to guess at the meaning of the data structure.
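To show why the layout choice matters, here is a small sketch of the reader code for the first two layouts. It assumes the csexp has already been parsed into Python lists of bytes, and the function names are invented for this example.

```python
def pairs_to_dict(pairs):
    """Read the ((key val) (key val)) layout: each item is a two-item list."""
    return {key: value for key, value in pairs}


def flat_to_dict(items):
    """Read the (key val key val) layout: keys and values alternate."""
    if len(items) % 2:
        raise ValueError("flat layout needs an even number of items")
    return dict(zip(items[0::2], items[1::2]))


# Already-parsed csexp data: atoms are bytes, lists are Python lists.
as_pairs = [[b"name", b"milk"], [b"count", b"2"]]
as_flat = [b"name", b"milk", b"count", b"2"]

print(pairs_to_dict(as_pairs) == flat_to_dict(as_flat))  # True
```

Both readers produce the same dictionary, but nothing in the bytes themselves says which layout was used, so the reader has to know in advance.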

This type of step is necessary for many serialization formats. In some, like Protobufs, it's a requirement. In XML, it was not strictly necessary but almost always done, and in some applications using JSON, it may not be necessary at all.

Canonical S-Expressions occupy a strange middle ground: a formal schema is not strictly necessary, since the format is schema-less, but it's also challenging to work without one.

Flexible (Schema-less) Data Serialization Formats

Flexible data formats are a topic of deep discussion and debate. In the 90s, it seemed that the world had converged on XML as the One Format to Rule Them All. The problem with XML is that even though the format is self-documenting in some ways, e.g. <tag></tag>, the value inside the tags needs to be converted by a secondary reader, separate from the parser.

This distinction isn't always clear, so to spell it out: the parser parses the raw data into a machine-readable data structure, while a reader converts that (usually already parsed) data into application-specific data structures.

Canonical S-Expressions have the same problem in regards to needing a reader that XML does, but unlike XML, you don't have the storage or bandwidth issues of the tags.

JSON seems to have won the generic data format wars by offering some types, making writing a reader trivial (or in some cases, unnecessary), but anyone who has ever had to work with JSON knows that its thin layer of types is misleading. As an example, “How do you store a date in JSON?”

You could store it as Unix time, seconds after the epoch, or you could store it as an ISO 8601 formatted string, e.g. "2008-09-15T15:53:00+05:00", or in an RFC 822 date format, or something else entirely. Your parser will happily give you a string, but you're stuck needing a reader to do that final conversion, just like you did with XML.
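Here's a short sketch of that final conversion in Python, using only the standard library. The field names `unix` and `iso` are made up for the example; the two values are meant to describe the same instant.

```python
import json
from datetime import datetime, timezone

doc = json.loads('{"unix": 1221475980, "iso": "2008-09-15T15:53:00+05:00"}')

# The JSON parser only yields an int and a str. The reader has to know
# that both fields are dates and which convention each one follows.
from_unix = datetime.fromtimestamp(doc["unix"], tz=timezone.utc)
from_iso = datetime.fromisoformat(doc["iso"])

print(type(doc["unix"]).__name__, type(doc["iso"]).__name__)  # int str
print(from_unix == from_iso)
```

The comparison is between two timezone-aware values, so it compares instants rather than wall-clock text.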

JSON-LD solves some of this by giving your values semantic meaning, but it makes the parser more complex.

And neither XML nor JSON handles binary data well. To store binary data in either format, you must first convert it to something like Base64, which adds roughly a third to the size of the data in both storage and transmission.

Canonical S-Expressions offer none of the overhead of XML and don't claim to do type conversions. Since you'll need a reader anyway, you can do your type conversions in that step.
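The Base64 cost is easy to measure. This sketch compares the size of some random bytes with their Base64 form and with the same bytes as a csexp atom; the 3000-byte payload is just an arbitrary example size.

```python
import base64
import os

raw = os.urandom(3000)

# Base64 turns every 3 bytes into 4 ASCII characters.
as_base64 = base64.b64encode(raw)

# A csexp atom only adds the decimal length and a colon.
as_csexp_atom = str(len(raw)).encode("ascii") + b":" + raw

print(len(raw), len(as_base64), len(as_csexp_atom))  # 3000 4000 3005
```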

Further Thoughts and Alternatives

In practice, having some type assistance does offer benefits. It makes your reader simpler, and it makes the format more pleasant to work with, and so while I appreciate csexp's simplicity, I find working with it to be more challenging than it should be.

One thought that I keep having while I'm using csexp is to use the type hints to store information such as the data type. Imagine if instead of:

20:2019-10-02T07:11:07Z

We instead stored:

[iso8601]20:2019-10-02T07:11:07Z

That would give us the data type and we could let the reader take some of the work off of our programming logic. This is similar to JSON-LD's method of storing semantic data.
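As a sketch of what such a reader might look like, the snippet below keeps a table of converters keyed by hint. Everything here is hypothetical: the `CONVERTERS` table, the `read_atom` function, the `int` hint, and the assumption that the parser hands back each atom's hint (or `None`) alongside its bytes.

```python
from datetime import datetime, timezone

# Hypothetical registry: type hint -> function that converts raw bytes.
CONVERTERS = {
    b"iso8601": lambda raw: datetime.strptime(
        raw.decode("ascii"), "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc),
    b"int": lambda raw: int(raw),
}


def read_atom(hint, raw):
    """Convert an atom using its hint; unknown or missing hints stay bytes."""
    converter = CONVERTERS.get(hint)
    return converter(raw) if converter else raw


print(read_atom(b"iso8601", b"2019-10-02T07:11:07Z"))
print(read_atom(b"int", b"1000"))
print(read_atom(None, b"plain bytes"))
```

Falling back to raw bytes for unknown hints keeps the reader compatible with ordinary csexp data, at the cost of both sides having to agree on what each hint means.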

I personally like this idea, but it requires changes to the readers to recognize a new “Semantic Canonical S-Expression”.

A simpler idea would be to store some type information alongside the data, so instead of 3:253, you might store I3:253, with “I” indicating that the value is an integer. This is close to what the Bencoding format does: it marks integers with a leading “i”, writing 253 as i253e. Bencoding offers many of the same benefits as CSEXP but, because it also supports types, is a bit easier to work with. The downside, as always, is that this helpfulness comes at the cost of storage and bandwidth.

Other formats exist as well. I previously mentioned Bencoding, but there is also MessagePack, ASN.1, CBOR, and the newest, Preserves. Each of these has a different approach, though they all center on the same problem: making it easy to store arbitrary data, especially binary data, on disk and on the network.
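For comparison, here is a minimal sketch of a Bencoding encoder in Python. Bencoding has four types: integers (i...e), byte strings (length-prefixed, like csexp atoms), lists (l...e) and dictionaries (d...e) with sorted keys. The function name `bencode` is my own, and the sketch assumes dictionary keys are always `bytes`.

```python
def bencode(value):
    """Encode ints, bytes, lists and dicts using Bencoding's type markers."""
    if isinstance(value, bool):
        raise TypeError("Bencoding has no boolean type")
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Keys are assumed to be bytes and are written in sorted order.
        body = b"".join(bencode(key) + bencode(value[key]) for key in sorted(value))
        return b"d" + body + b"e"
    raise TypeError(f"cannot encode {type(value).__name__}")


print(bencode(253))                                  # b'i253e'
print(bencode({b"size": 1000, b"name": b"milk"}))    # b'd4:name4:milk4:sizei1000ee'
```

The dictionary type is the notable difference here: associative arrays come built in, so the reader no longer has to guess at a list layout.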

It's beyond the scope of this post to delve into each of them. I think Preserves is the most interesting of the formats. It's certainly the most expressive despite being compact, but since I haven't used it I don't know if that expressiveness will be something I need or if I could simply use Bencoding or MessagePack to the same effect.

Conclusion

Canonical S-Expressions are a great, flexible, compact data format. They're very fast and efficient. If you have straightforward needs, they're certainly worth checking out. In my use case, Datashards, the format fits our current needs. If we end up wanting to store more complex data structures, such as associative arrays, that will be the time to re-evaluate the choice and see if something else would be a better fit.

On Long-Form Blogging

In 1995, I got my first taste of the World Wide Web. That's a funny thing to think about now, but at the time it was very new and most websites that I found were weird, off the wall, and amazingly amateurish. I found sites about Bonsai Kittens, connecting soda machines to the internet, lucid dreaming and a bunch of vanity websites from people just wanting other people to know they existed.

In 1997, I ran my very own website from my dorm room. It was thanks to Microsoft Personal Web Server, and it let my humble desktop PC present me to the world. I used it to host essays for school... before it crashed.

In the early 2000s, I found LiveJournal and at the time, LiveJournal filled the same role in my life that the Fediverse does now. I had real life friends who followed me on LiveJournal. I had friends from LiveJournal, I met people through people on LiveJournal and was exposed to new thoughts, ideas and experiences through reading about others' lives.

I loved it so much, I was not only a paid subscriber, but I paid for a lifetime membership. ...Until the site was bought out by a Russian company and I closed my account.

When Twitter came onto my radar, it was through geeky friends who had seen it at a conference. It was a Rails project and it was a bridge to SMS texting. It felt more like IRC than blogging. Blog posts were at least a few paragraphs, and they spoke to something about the person's experience. They might be personal, or technical, but they felt intimate and connecting. Twitter was 140 characters.

Mastodon made a choice to be 500 characters, which was more than three times better! But as time has gone on, I've found myself writing posts that span three, four or five toots. This isn't a limitation of ActivityPub- it's a design choice of Mastodon itself to limit itself to microblogging.

But I miss blogging, and if Medium has taught me anything, it's taught me that other people miss it too, and they're even willing to put up with Medium to have it!

So I'm using WriteFreely/Write.as to blog again. With ActivityPub, people can subscribe to my posts just as easily as they could on LiveJournal, either from ActivityPub or RSS. And who knows, maybe this whole thing will take off and I'll be able to feel like I really know people's thoughts and feelings again. Maybe we can bring the humanity back to social networking.