Datamodel

Ssb accepts schemaless data, adhering to a certain format specified in this section. This is the data that application developers care about. Internally, ssb then adds some more metadata to maintain the append-only logs, discussed in the next chapter. This chapter only describes the data model of the schemaless content data.

It is important to clearly distinguish between the abstract data model of a format, and various encodings. For example, the strings 1, 1.0 and 0.1e1 are all different json encodings of the same abstract value. And there can even be different encoding formats for the same abstract value, e.g. the bytes 0xf93c00 also encode the floating point number 1.0, but in cbor rather than json.

Ssb messages all use the same abstract data model, but there are different situations where they use different encodings. An encoding used for signing the data has different requirements than an encoding used for transmitting the data, which again has different requirements than an encoding used for persisten storage.

Encodings for persistent storage are not specified in this document, but is crucial for all ssb implementations to compute the exact same signatures, and to send data in a format that can be understood by other implementations. We call these encodings signing encoding and transport encoding respectively.

The ssb protocol was initially implemented in javascript, and it relied heavily on implicit behavior of the node js runtime. People (including the authors of this specification) have called the message format "bizarre" and worse, and that is entirely justified. But when during reading you inevitably thing "How could anybody end up with this, I could have done a much better job.", then remember: Maybe you could have, but you didn't.

Abstract Data Model

The data model describes all of the free-form data that can be carried in a message. It is close to the json data model, but with a few differences. The definition came about as the set of javascript values created by JSON.parse(json) for which json === JSON.stringify(JSON.parse(json)) (javascript code).

Defined in a language-agnostic way:

Null

null is a value that carries no information.

Booleans

true and false are values, called booleans.

Strings

An ordered sequence of bytes which form valid utf8 of length between 0 and 2^53 - 1 (inclusive) is a value, called a string. Such a string may include null bytes.

Floats

An IEEE 754 double precision (64 bit) floating point number that is none of Infinity, -Infinity, -0 or NaN is a value, called a float.

Arrays

Let n be a natural number less than 2^32, and let v_0, ..., v_n be values.

The ordered sequence [v_0, ..., v_n] is a value, called an array.

Objects

Let n be a natural number less than 2^32, let s_0, ..., s_n be pairwise distinct strings, and let v_0, ..., v_n be values.

The (unordered) set of pairs (s_i, v_i) for all 0 <= i <= n is a value, called a object. The pairs are called entries, the strings are called keys, and the values are called values.

Signing Encoding

The encoding to turn values into a signeable array of bytes is based on json (the set of valid encodings is a subset of ECMA-404 json). There are multiple valid encodings of a single value, because some of the entries of an objects can be encoded in an arbitary order. But up to object entry order, the encoding is unique. When receiving a message over the network, the order of the freely-orderable object entries in the transport encoding is the order that must be used for verifying the signature. Thus the network encoding induces a unique signing encoding to use for signature checking.

The signing encoding is defined as follows:

Signing Encoding Null

null is encoded as the utf-8 string null ([0x6E, 0x75, 0x6C, 0x6C]).

Signing Encoding Booleans

true is encoded as the utf-8 string true ([0x74, 0x72, 0x75, 0x65]).
false is encoded as the utf-8 string false ([0x66, 0x61, 0x6c, 0x73, 0x65]).

Signing Encoding Strings

A string containing the unicode code points c_0, ..., c_n is is encoded as follows:

  • begin with the utf-8 string " (0x22)
  • for each code point c_i in c_0, ..., c_n:
    • if c_i is unicode code point 0x000022 (quotation mark "), append the utf-8 string \" ([0x5C, 0x22])
    • else if c_i is unicode code point 0x00005C (reverse solidus \), append the utf-8 string \\ ([0x5C, 0x5C])
    • else if c_i is unicode code point 0x000008 (backspace), append the utf-8 string \b ([0x5C, 0x62])
    • else if c_i is unicode code point 0x00000C (form feed), append the utf-8 string \f ([0x5C, 0x66])
    • else if c_i is unicode code point 0x00000A (line feed), append the utf-8 string \n ([0x5C, 0x6E])
    • else if c_i is unicode code point 0x00000D (carriage return), append the utf-8 string \r ([0x5C, 0x72])
    • else if c_i is unicode code point 0x000009 (line tabulation), append the utf-8 string \t ([0x5C, 0x74])
    • else if c_i is a unicode code point below 0x000020 (space), append the utf-8 string \u<hex> ([0x5C, 0x75, <hex>]), where <hex> are the two utf-8 bytes of the hexadecimal encoding of the code point, using lower-case letters a - f (0x61 - 0x66) for alphabetic hex digits
    • else append the utf-8 representation of c_i without any modifications
  • append the utf-8 string " (0x22)
Signing Encoding Floats

A float m is encoded as follows:

  • if m == 0, the encoding is the utf-8 string 0 (0x30)
  • else if m is negative, the encoding is the utf-8 string -<abs(m)>([0x2d, <abs(m)>]), where <abs(m)> is the encoding of the same float with the sign bit flipped
  • else (largely quoting from the ECMAScript specification, applying NOTE 2 from here on):
    • let n, k and s be integers such that:
      • k >= 1
      • 10 ^ (k - 1) <= s <= 10 ^ k
      • s * (10 ^ (n - k)) is m (or round-to-even-s to m if it is not precisely representable in a 64 bit float)
      • k is as small as possible
      • if there are multiple values for s, choose the one for which s * (10 ^ (n - k)) is closest in value to m
      • if there are two such possible values of s, choose the one that is even
      • Intuitively, s is the integer you get by removing the point and all trailing zeros from the decimal representation of m, k is the number of digits in the decimal representation of s, and n specifies how to print the number: If n is greater than 0, there are n digits left of the point, else there are abs(n) many zeros right of the point. The choice of s uniquely determines k and n, the tricky part is finding the unique s that rounds correctly and for which k is minimal.
    • if k <= n <= 21, the encoding is the utf-8 string <k_decimals><trailing_zeros>, where <k_decimals> is the utf-8 encoding of the digits of the decimal representation of s, and <trailing_zeros are n - k zero digits (0x30)
    • else if 0 <= n <= 21, the encoding is the utf-8 string <pre_point>.<post_point> ([<pre_point>, 0x2E, <post_point>]), where <pre_point> is the utf-8 encoding of the most significant n digits of the decimal representation of s, and <post_point> is the utf-8 encoding of the remaining k - n digits of the decimal representation of s
    • else if -6 < n <= 0, the encoding is the utf-8 string 0.<zeros><k_decimals> ([0x30, 0x2E, <zeros>, <k_decimals>]), where <zeros> are -n many zero digits (0x30), and <k_decimals> is the utf-8 encoding of the digits of the decimal representation of s
    • else if k == 1, the encoding is <base>e<sign><exp> ([<base>, 0x65, <sign>, <exp>]), where <base> is the utf-8 encoding of the single digit of s, <sign> is the utf-8 string + (0x2B) if n - 1 is positive or the utf-8 string - (0x2D) if n - 1 is negative, and <exp> is the utf-8 encoding of the decimal representation of the absolute value of n - 1
    • else, the encoding is the utf-8 string <pre_point>.<post_point>e<sign><exp> ([<pre_point>, 0x2E, <post_point>, 0x65, <sign>, <exp>]), where <pre_point> is the utf-8 encoding of the most significant digit of the decimal representation of s, <post_point> is the utf-8 encoding of the remaining k - 1 digits of the decimal representation of s, <sign> is the utf-8 string + (0x2B) if n - 1 is positive or the utf-8 string - (0x2D) if n - 1 is negative, and <exp> is the utf-8 encoding of the decimal representation of the absolute value of n - 1
  • good to know: The maximum length for such an encoding is 25 bytes
Signing Encoding Arrays

Let n be a natural number less than 2^32, let v_0, ..., v_n be values, and let e_0, ..., e_n be functions that take a natural number as an argument and return the encodings of v_0, ..., v_n respectively using the supplied number as the indentation level.

At indentation level indent, the array [v_0, ..., v_n] is encoded as follows:

  • if the array is empty (n == 0), the encoding is the utf-8 string [] ([0x5B, 0x5D])
  • else, do the following:
    • begin with the utf-8 string [<line feed> ([0x5B, 0x0A])
    • for each v_i in v_0, ..., v_(n - 1) (so skip this if n == 1):
      • append indent + 2 many space characters (0x20)
      • append e_i(indent + 2)
      • append the utf-8 string ,<line feed> ([0x2C, 0x0A])
    • append indent + 2 many space characters (0x20)
    • append e_n(indent + 2)
    • append the utf-8 string <line feed> (0x0A)
    • append indent many space characters (0x20)
    • append the utf-8 string ] (0x5D)
Signing Encoding Objects

Let n be a natural number less than 2^32, let s_0, ..., s_n be pairwise distinct strings, let v_0, ..., v_n be values, and let e_0, ..., e_n be functions that take a natural number as an argument and return the encodings of v_0, ..., v_n respectively using the supplied number as the indentation level.

At indentation level indent, the object { s_0: v_0, ..., s_n: v_n} is encoded as follows:

  • if the object is empty (n == 0), the encoding is the utf-8 string {} ([0x7B, 0x7D])
  • else, do the following:
    • begin with the utf-8 string {<line feed> ([0x7B, 0x0A])
    • for each pair (s_i, v_i) for i in 0, ..., n - 1 (so skip this if n == 1):
      • append indent + 2 many space characters (0x20)
      • append the encoding of the string s_i
      • append the utf-8 string :<space> ([0x3A, 0x20])
      • append e_i(indent + 2)
      • append the utf-8 string ,<line feed> ([0x2C, 0x0A])
    • append indent + 2 many space characters (0x20)
    • append the encoding of the string s_i
    • append the utf-8 string :<space> ([0x3A, 0x20])
    • append e_i(indent + 2)
    • append the utf-8 string <line feed> (0x0A)
    • append indent many space characters (0x20)
    • append the utf-8 string } (0x7D)
  • The order in which to serialize the entries s_i: v_i is not fully specified, but there are some constraints:
    • intuitively: Natural numbers less than 2^32 - 1 are sorted ascendingly
    • formally:
      • a key is called a more-or-less integer key if it is either the string "0", or a nonzero decimal digit (1 - 9 (0x31 - 0x39)) followed by zero or more arbitrary decimal digits (0 - 9 (0x30 - 0x39)) and consists solely of such digits
      • a key is called an int key if it:
        • is a more-or-less integer key
        • its numeric value is strictly less than 2^32 - 1
          • no, there's no off-by one error here: 0xfffffffe is an int key, whereas 0xffffffff is not
          • Why? We have no idea either
      • all entries with int keys must be serialized before all other entries
      • among themselves, the int keys are sorted:
        • by length first (ascending), using
        • numeric value as a tie breaker (the key whose raw bytes interpreted as a natural number are larger is serialized later)
          • note that this coincides with sorting the decimally encoded numbers by numeric value
    • all other entries may be serialized in an arbitrary order

The string handling is equivalent to ECMAScript 2015 QuoteJSONString, but defined over utf-8 strings instead of utf-16 ones.

The float handling is equivalent to (and quotes from) ECMAScript 2015 ToString Applied to the Number Type, with step 5 replaced as specified in NOTE 2 to result in unique encodings. This spec provides a declarative description of the encoding process, for an algorithmic perspective, there are some papers on the subject such as:

The array and object handling is equivalent to JSON.stringify(value, null, 2), specified in ECMAScript 2015 (except for the object entry ordering, which is not specified in ECMAScript, but implemented this way in v8 and spidermonkey).

Hash Computation

To compute the hash of a value, you can not use the signing encoding directly, but the hash computation is based on it. The signing encoding always results in valid unicode. Represent this unicode in utf-16. This encoding is a sequence of code units, each consisting of two bytes. The data to hash is obtained from these code units by only keeping the less significant byte.

Example: Suppose you want to compute the hash for "ß", the corresponding utf8 is [0x22, 0xC3, 0x9F, 0x22]. In big-endian utf16, this is [(0x22, 0x00), (0xDF, 0x00), (0x22, 0x00)], in little-endian utf16, this is [(0x00, 0x22), (0x00, 0xDF), (0x00, 0x22)]. In both cases, the sequence of less signifiant bytes per code unit is [0x22, 0xDF, 0x22]. That is the byte array over which to compute the hash.

Note that this means that two strings with different utf-8 encodings can result in the same hash, due to the information in the more significant byte of the utf-16 encoding being dropped.

Length Computation

Ssb places a limit on the size of messages. To compute the length of a value, compute the signing encoding (which is always valid unicode), reencode that unicode as utf16, then count the number of code units.

JSON Transport Encoding

In addition to the signing format, messages can be encoded as ECMA-404 json, with the following differences:

  • numbers may not be negative zero
  • numbers may not round to positive infinity, negative infinity or negative zero IEEE 754 64 bit floating point numbers
  • strings may not be longer than 2^53 - 1 bytes
  • arrays and object may not contain more than 2^32 - 1 entries
  • objects may not contain multiple entries with the same key
  • in strings, unicode escape sequences of code points greater than U+FFFF must be interpreted as a single code point, not as an explicit surrogate pair
  • escape sequences of surrogate code points must be matched: Each escape sequence for a high surrogate must be followed by an escape sequence for a low surrogate, and any escape sequence for a low surrogate must be preceded by an escape sequence for a high surrogate

The signing format itself is a subset of this, but this format can be more compact (by omitting all whitespace). This compact form is used by ssb server implementations for message exchange with other servers.