What is a self addressing identifier, a SAID? What does this mean and how is a SAID created and verified? This post answers these questions. We show a generalized process for calculating SAIDs and delve into the encoding format for CESR-compliant self addressing identifiers. Examples with three popular algorithms, SHA2-256, SHA3-256, and Blake3-256, show specifics of applying the general process. This general process can be used for calculating SAIDs with other cryptographic algorithms.
For those who want to skim there are pictures below including bit diagrams that illustrate exactly what is happening.
What is a SAID?
Fundamentally, a SAID is a cryptographic digest of a given set of data and is embedded within the data it is a digest of. A CESR-style SAID pads the digest to 33 bytes and adds a type code into the padded digest to replace resulting Base64 pad characters. It looks like this:
HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6
This is a SHA3-256 digest encoded in the CESR format.
What is the CESR format? It is the Base64 URL Safe encoding of the raw digest along with some front-padding of zero bits and a type code, as shown in detail below. From the above SAID, the ‘H’ character is the type code. The rest of the string is composed of Base64 URL Safe characters.
Why Base64? More Space
Why was Base64 encoding used rather than something like hex encoding? Because Base64 provides a compact text encoding of data using a well-known alphabet of URL-safe characters (0-9, a-z, A-Z, -_). Each Base64 character encodes 6 bits of data, whereas each hexadecimal (“hex”) character encodes only 4 bits, so Base64 can store 50% more data in the same number of characters. This helps reduce bandwidth and power costs, optimizing performance overall.
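As a rough illustration of the size difference, the sketch below encodes the same 33 bytes both ways using only the Python standard library. The sample bytes are arbitrary zero-filled placeholders, not a real digest.

```python
from base64 import urlsafe_b64encode

# 33 placeholder bytes (264 bits), the size of a pre-padded 256 bit digest
sample = bytes(33)

hex_text = sample.hex()                         # 4 bits per character
b64_text = urlsafe_b64encode(sample).decode()   # 6 bits per character

print(len(hex_text))  # 66 characters
print(len(b64_text))  # 44 characters
```

The same 264 bits take 66 hex characters but only 44 Base64 characters.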
Note on Hash or Digest Terminology
A note on terminology: digests are sometimes called hashes or hash values. Technically, the term hash refers to a hash function. Hash functions transform data into a fixed-size string. This fixed-size string is the digest, the output of a hash function.
Back to SAIDs, the fact that a SAID can be embedded in the data it is a digest of is why it is called “self addressing.” The digest is essentially a unique identifier of the data it is embedded in.
A SAID (Self-Addressing Identifier) is a special type of content-addressable identifier based on an encoded cryptographic digest that is self-referential.
Composable Event Streaming Representation ToIP Specification – Section 12.6 – Dr. Samuel M. Smith
What is a content addressable identifier? A content addressable identifier is an identifier derived from the content being stored which makes a useful lookup key in content addressable storage, such as IPFS or a key-value store database like LevelDB, LMDB, Redis, DynamoDB, Couchbase, Memcached, or Cassandra.
Embedding a digest changes the source data and hash, right?
How can the SAID digest be accurate given that placing the SAID in the data it identifies changes the data, thus producing a different hash? SAIDs accomplish this with a two step generation and embedding process.
Two step SAID generation and embedding process
During SAID calculation the destination field of the SAID is filled with pound sign filler characters (“#”) up to the same length of the SAID.
The digest is then calculated, encoded, and placed in the destination field.
The reverse occurs for verification of a SAID.
The SAID is replaced with filler ‘#’ characters up to the same length of the SAID.
The digest is calculated, encoded, and compared with the SAID.
How does the generation step work? This question kicks off a larger discussion about CESR-style encoding of cryptographic digests using pre-padding and type codes. First, let’s start with some code examples that cut right to the chase. You can come back to these examples after reading the post if they don’t make sense to you at first.
Code examples with multiple algorithms
Let’s start with some code examples showing how to create a correct SAID including the appropriate pre-padding characters. For additional understanding come back and review these examples after you have read the sections on 24 bit boundaries, pad characters, and pad bytes.
For now, say you want to use other cryptographic digest algorithms to create your SAIDs. How would you go about doing that?
It is as easy as changing your hashing function and then using the type code from the CESR Master Code Table that corresponds to your desired digest algorithm.
The following code examples in Python illustrate the process for each of the following algorithms: Blake2b-256, Blake3-256, and SHA2-256. The SHA3-256 algorithm is covered by the walkthrough in the main body of the article.
Filler ‘#’ characters in digest ‘d’ field
The following examples all use a raw value whose digest field “d” holds the filler ‘#’ pound sign characters; both the field and the filler are explained later. The “d” digest field must contain the same number of filler characters as the eventual SAID that will replace them.
Creating a Blake2b-256 SAID – Step By Step
For a Blake2b-256 SAID with Python you just change the hash function and specify a digest size.
import hashlib
from base64 import urlsafe_b64encode
raw_value = b'{"d":"############################################","first":"john","last":"doe"}'
digest = hashlib.blake2b(raw_value, digest_size=32).digest() # <-- See the different algorithm blake2b
padded_digest = b'\x00' + digest
encoded = urlsafe_b64encode(padded_digest)
b64_str_list = list(encoded.decode()) # convert bytes to string of chars for easy replacement of 'A'
b64_str_list[0] = 'F' # replace first 'A' character with 'F' type code
b64_str = ''.join(b64_str_list) # convert string of chars to string with .join()
assert b64_str == 'FFfZ4GYhyBRBEP3oTgim3AAfJS0nPcqEGNOGAiAZgW4Q'
assert len(b64_str) == 44 # length should still be 44 characters, 264 base64 bits, a multiple of 24 bits
Creating a Blake3-256 SAID – Step By Step
Blake3-256 is even easier, though it requires the blake3 library:
import blake3
from base64 import urlsafe_b64encode
raw_value = b'{"d":"############################################","first":"john","last":"doe"}'
digest = blake3.blake3(raw_value).digest() # <-- See the different algorithm blake3.blake3
padded_digest = b'\x00' + digest
encoded = urlsafe_b64encode(padded_digest)
b64_str_list = list(encoded.decode()) # convert bytes to string of chars for easy replacement of 'A'
b64_str_list[0] = 'E' # replace first 'A' character with 'E' type code
b64_str = ''.join(b64_str_list) # convert string of chars to string with .join()
assert b64_str == 'EKITsBR9udlRGaSGKq87k8bgDozGWElqEOFiXFjHJi8Y'
assert len(b64_str) == 44 # length should still be 44 characters, 264 base64 bits, a multiple of 24 bits
Creating a SHA2-256 SAID – Step By Step
And finally SHA2-256 is also easy, just changing the hash function used:
import hashlib
from base64 import urlsafe_b64encode
raw_value = b'{"d":"############################################","first":"john","last":"doe"}'
digest = hashlib.sha256(raw_value).digest() # <-- See the different algorithm sha256
padded_digest = b'\x00' + digest
encoded = urlsafe_b64encode(padded_digest)
b64_str_list = list(encoded.decode()) # convert bytes to string of chars for easy replacement of 'A'
b64_str_list[0] = 'I' # replace first 'A' character with 'I' type code
b64_str = ''.join(b64_str_list) # convert string of chars to string with .join()
assert b64_str == 'IDuyELkLPw5raKP32c7XPA7JCp0OOg8kvfXUewhZG3fd'
assert len(b64_str) == 44 # length should still be 44 characters, 264 base64 bits, a multiple of 24 bits
Now on to a visual introduction.
Visual Introduction to SAID
Here is a SAID using the SHA3-256 algorithm on the sample JSON object used in this post.
HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6
Adding this SAID to a document looks like taking the following JSON,
computing the SAID, encoding it, and placing it in the SAID field, or digest field, which is the “d” field in this example:
The ‘H’ character is highlighted here to draw attention to the fact that it is a special character. This special character is the type code from the CESR Master Code Table. It indicates the type of cryptographic algorithm being used, SHA3-256 in this case.
I see a problem…
Those new to calculating and encoding SAIDs often encounter a problem here. If you take the raw Base64 encoded value of the JSON value {"d":"","first":"john","last":"doe"} then you end up with the string value eyJkIjoiIiwiZmlyc3QiOiJqb2huIiwibGFzdCI6ImRvZSJ9, which is nowhere close to the value shown in the picture, HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6. Why are they different?
Doing a plain Base64 encoding of the JSON bytes misses an important step, the encoding step referred to above. The rest of the post dives deep into this encoding as it shows and explains how to construct a correct, CESR-encoded, SAID digest and explains the rationale behind why CESR encoding is designed the way it is.
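To reproduce the mismatch yourself, here is a minimal sketch that performs only the plain Base64 encoding of the JSON bytes, with no hashing, padding, or type code involved.

```python
from base64 import urlsafe_b64encode

# The compact JSON from the example, with an empty digest field
json_bytes = b'{"d":"","first":"john","last":"doe"}'

# Plain Base64 URL Safe encoding of the JSON text itself (not of a digest)
naive = urlsafe_b64encode(json_bytes).decode()

print(naive)  # eyJkIjoiIiwiZmlyc3QiOiJqb2huIiwibGFzdCI6ImRvZSJ9
```

This value is just the JSON text in another alphabet. A SAID, by contrast, encodes a digest of the data rather than the data itself.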
Five parts of a SAID
As mentioned earlier, a SAID is a cryptographic digest. Specifically, it is a kind of digest usable as a content addressable identifier, and it is embedded in the content it identifies. SAIDs were invented by Dr. Samuel Smith as a part of his work on key event receipt infrastructure (KERI), authentic chained data containers (ACDC), and composable event streaming representation (CESR).
To understand how SAIDs work you must learn the interplay of five different concepts including:
Bit boundaries – aligning on 24 bit boundaries using pre-padded bytes on the left/front of raw bytes
Hash values – hashing input bytes with hashing functions to produce output hash values (digests)
Encoding with the URL-safe variant of Base64 encoding,
Using type codes to indicate type of hashing function and size of digest,
The two-pass SAID calculation and embedding process.
This article specifically covers SAIDs that are encoded in the CESR format. These CESR-style SAIDs
use pre-padding of pad bytes for bit padding to align on 24 bit boundaries,
are compatible with a variety of common hashing functions,
are encoded in the URL-safe variant of Base64 encoding (a.k.a. Base64URL),
substitute type codes from the CESR Master Code Table (section 12.4.2) for ‘A’ front zero characters,
and are calculated from and embedded in the data they identify.
How does it work? How are SAIDs calculated?
The easiest way to understand a self addressing identifier is to create one. Starting with the JSON from above we walk through each of the five major concepts required to create a CESR encoded SAID.
7 Steps to Calculate and Embed a SAID
Briefly, the process is listed here. A detailed explanation and example follows this set of steps.
Get an object to calculate a SAID for with a digest field that will hold the SAID. In this case we use the JSON object below and the “d” field will hold the SAID.
The field does not have to be empty though it can be. Prior to digest calculation it will be cleared and filled with the correct number of filler characters.
Calculate the quantity of Base64 characters the final encoded bytes will take up and fill the digest field with that many ‘#’ characters. This value may be looked up from a parse table like the CESR Master Code Table based on the type of hashing function used.
Replace the contents of the digest field, “d” in our case, with pound sign (“#”) characters up to the number of filler characters calculated in step 2.
The calculated size and pad values used for this step are reused in step 4.
Calculate a digest of the object with the filler ‘#’ characters added using the hash function selected.
This will result in a quantity of digest bytes, specifically 32 bytes for the SHA3-256 algorithm.
Calculate the quantity of pad bytes that, when added to the digest bytes, gives a total length that is a multiple of 24 bits. This math is shown below. For us this is 1 pad byte, giving us 33 bytes. This value may be looked up from a parse table like the CESR Master Code Table.
Perform pre-padding by prepending the pad byte to the digest bytes to get padded raw bytes.
Encode the padded raw bytes with the Base64 URL Safe alphabet.
Pre-padding causes some characters at the start of the digest to be encoded as “A” characters which represent zero in the Base64 URL Safe alphabet.
Substitute the type code for the correct number of “A” zero character(s) in the Base64 encoded string according to the CESR encoding rules from the CESR Master Code Table. Use the type code corresponding to the cryptographic hash algorithm used. In our case this is “H” because we are using the SHA3-256 algorithm.
This is your SAID!
Place the Base64 encoded, type code substituted string (your SAID!) into the digest field in your object. This makes your object self-addressing.
3 Steps to Verify a SAID
Start with a SAID from an object you already have.
Calculate the SAID for the object using the process shown above.
Compare the SAID you pulled out of the object with the SAID you calculated.
If they match then the SAID verifies. Otherwise the SAID does not verify.
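The creation and verification steps above can be sketched in Python roughly as follows. This is an illustrative sketch rather than a reference implementation. The function names (serialize, compute_said, saidify, verify) are hypothetical, and it assumes compact JSON serialization with field order preserved, a SHA3-256 digest, and the ‘H’ type code used in this post.

```python
import hashlib
import json
from base64 import urlsafe_b64encode

SAID_LENGTH = 44  # Base64 characters for a 32 byte digest plus 1 pad byte
TYPE_CODE = "H"   # type code this post uses for SHA3-256


def serialize(obj: dict) -> bytes:
    # Assumption: compact JSON, insertion order preserved
    return json.dumps(obj, separators=(",", ":")).encode()


def compute_said(obj: dict, field: str = "d") -> str:
    filled = dict(obj)                     # copy so the caller's object is untouched
    filled[field] = "#" * SAID_LENGTH      # filler characters in the digest field
    digest = hashlib.sha3_256(serialize(filled)).digest()
    encoded = urlsafe_b64encode(b"\x00" + digest).decode()  # pre-pad, then encode
    return TYPE_CODE + encoded[1:]         # swap the leading 'A' for the type code


def saidify(obj: dict, field: str = "d") -> dict:
    result = dict(obj)
    result[field] = compute_said(obj, field)
    return result


def verify(obj: dict, field: str = "d") -> bool:
    # Recompute the SAID over the filler version and compare with the embedded one
    return obj.get(field) == compute_said(obj, field)


doc = saidify({"d": "", "first": "john", "last": "doe"})
print(doc["d"], verify(doc))
```

Because compute_said always overwrites the digest field with filler before hashing, the same function serves both creation and verification.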
An illustration will make clear why and how this process is done. Let’s walk through an example with a small JSON object. The concept applies to any size JSON object and objects of any serialization format such as CBOR, MessagePack, arbitrary text, or otherwise.
Example walkthrough with JSON and SHA3-256
Create Step 1: Get an object with some data and a digest field
Starting with the JSON below we have a “d” field, or digest field, in which the SAID will eventually be placed. In our case it is empty though it could start with the SAID in the “d” field and the process would still work.
JSON being SAIDified:
{
"d": "",
"first": "john",
"last": "doe"
}
Create Step 2: Calculate the quantity of filler ‘#’ characters
The expected final size of the SAID must be known in advance in order to create a JSON object with a stable size. Calculating this quantity requires that you understand a major concept in CESR:
How to calculate pad sizes (quantity of pad bytes) and full sizes of values.
Understanding this calculation will get you most of the way towards understanding another major CESR concept called “fully qualified Base64 representation” of a cryptographic primitive. A digest is a kind of cryptographic primitive.
Knowing the size in advance, and having it be stable, is critical for CESR’s type, length, value (TLV) encoding scheme. This stable size is achieved by filling the digest field with the same number of pound sign ‘#’ characters as the size of the SAID, which looks like this:
Correct number of filler characters added to digest field
{
"d": "############################################",
"first": "john",
"last": "doe"
}
This enables the JSON to have the same size during and after the SAID calculation process, giving a stable size. To know the number of filler characters you must calculate how many Base64 characters will be in the final SAID. Calculating how many Base64 characters are needed involves summing the raw bytes and the pad bytes needed to align on what is called a 24 bit boundary.
Final output has same size since Base64 characters count equals filler length
Aligning on this 24 bit boundary allows the final result with the SAID to have the same length as the version with the filler characters, 44 characters in our case:
{
"d": "HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6",
"first": "john",
"last": "doe"
}
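A quick way to see the stable size is to serialize both versions and compare lengths. The sketch below assumes compact JSON serialization (no whitespace); the helper name compact is hypothetical.

```python
import json


def compact(obj: dict) -> bytes:
    # Assumption: compact JSON with no extra whitespace
    return json.dumps(obj, separators=(",", ":")).encode()


with_filler = compact({"d": "#" * 44, "first": "john", "last": "doe"})
with_said = compact({
    "d": "HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6",
    "first": "john",
    "last": "doe",
})

# Both serializations occupy exactly the same number of bytes
print(len(with_filler) == len(with_said))
```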
Remember when the “encoding” step was mentioned from above? That’s where this filler character and size calculation knowledge comes in. In this encoding step you learn about the CESR-style encoding using pre-padding, pre-conversion. Knowing how many filler characters to use requires understanding the concept of aligning on a 24 bit boundary. Aligning on a 24 bit boundary is where the pre-padding of CESR comes in. This calculation of pad bytes required to align on a 24 bit boundary is the primary difference between raw, or “naive”, Base64 encoding and CESR encoding.
First let’s delve into what a 24 bit boundary is, why it matters to Base64 encoded values, and then look at some diagrams that make Base64 post-padding and CESR pre-padding clear. In doing this we jump ahead a bit and show byte diagrams of the actual encoded digest since that will help introduce later steps.
24 bit boundary – from Base64
The 24 bit boundary comes from the Base64 encoding format standard, RFC4648, specifically section 4. The reason a 24 bit boundary matters is because you can only use whole Base64 characters; there is no such thing as a fractional Base64 character. A Base64 character represents 6 bits of your raw bytes. A single byte is 8 bits. How do you reconcile the 6 bit Base64 character encoding to the 8 bits of your raw bytes? This is where a little math comes in, specifically the least common multiple.
Section 4 of the Base64 RFC 4648 describes the 24-bit groups that are the origin of the 24-bit boundary:
The encoding process represents 24-bit groups of input bits as output strings of 4 encoded characters. Proceeding from left to right, a 24-bit input group is formed by concatenating 3 8-bit input groups. These 24 bits are then treated as 4 concatenated 6-bit groups, each of which is translated into a single character in the base 64 alphabet.
RFC 4648 The Base16, Base32, and Base64 Data Encodings – Section 4
Using these 24-bit groups ensures the value coming out of a Base64 decoder is the same value you put in. Separating raw bits into these 24 bit groups is where the phrase “aligning on 24 bit boundaries” comes from.
Splitting the 8-bit groups up into 6-bit groups requires a little math because 8 does not split evenly into 6. The math to do this is the least common multiple (LCM). LCM is used to determine the lowest number that both 8 and 6 divide into evenly, which is 24, thus the need for 24-bit groups, or 24-bit boundaries. Any value that is encoded into Base64 characters must be padded to reach a multiple of 24 bits. These 24-bit groupings allow you to cleanly convert all of your 8-bit bytes into 6-bit Base64 characters and back to bytes without missing any bits.
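The least common multiple calculation is small enough to check directly. Here is a minimal sketch using only the standard library; the constant names are just for illustration.

```python
from math import gcd

BITS_PER_BYTE = 8
BITS_PER_B64_CHAR = 6

# Least common multiple of 8 and 6, computed via the greatest common divisor
group_bits = BITS_PER_BYTE * BITS_PER_B64_CHAR // gcd(BITS_PER_BYTE, BITS_PER_B64_CHAR)

bytes_per_group = group_bits // BITS_PER_BYTE      # raw bytes per 24 bit group
chars_per_group = group_bits // BITS_PER_B64_CHAR  # Base64 characters per 24 bit group

print(group_bits, bytes_per_group, chars_per_group)  # 24 3 4
```

Every 3 raw bytes map to exactly 4 Base64 characters, which is the grouping the RFC describes.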
Yet, if we have a stream that does not align on a 24 bit boundary then how do we create that alignment?
Pad characters on the END of a string are the answer to this in Base64.
By adding the correct number of pad characters on the end of a Base64 stream then you always end up with a value aligned on a 24 bit boundary. The ‘=’ equals sign pad characters in a plain Base64 encoding indicate the quantity of pad bits that were used in the final Base64 character adjacent to the ‘=’ pad characters.
Pad bytes at the START of the raw bytes are the answer to this in CESR.
By prepending the correct number of pad bytes on the start of a set of raw digest bytes then you always end up with a value aligned on a 24 bit boundary. Since the pad bytes are all zero bits then the resulting encoded value will start with one or more ‘A’ characters since they correspond to all zero bits in the Base64 alphabet.
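To make the contrast concrete, here is a small sketch that encodes a single arbitrary byte (0xff, chosen only for illustration) both ways: once as-is, and once with two zero bytes prepended so that it fills a 24 bit group.

```python
from base64 import urlsafe_b64encode

raw = b"\xff"  # one byte, so two more bytes are needed to reach 24 bits

# Plain Base64: the encoder appends '=' pad characters at the END
post_padded = urlsafe_b64encode(raw).decode()

# CESR-style: zero pad bytes go at the START of the raw bytes before encoding
pre_padded = urlsafe_b64encode(b"\x00\x00" + raw).decode()

print(post_padded)  # _w==
print(pre_padded)   # AAD_
```

Both results are four characters long. The pre-padded one carries leading ‘A’ characters instead of trailing ‘=’ characters.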
Pad characters Calculation
In a plain Base64 encoding when encoding an array of bytes into Base64 that does not align on a 24 bit boundary the correct number of Base64 pad characters ‘=’ must be included. Why? Because in order to avoid data corruption in the decoded value you must know the precise original value, which means knowing how many pad characters to strip off and how many pad bits to strip out of the Base64 character adjacent to the padding. The decoder of your Base64 character needs to know how many bits of the last character used were just padding and how many were a part of your raw value.
You must signal the end of your raw bytes somehow. If, instead, you ignore, drop, or omit pad characters then you will confuse a Base64 decoder into thinking that pad bits were a part of your raw bytes, which you want to avoid because that will give you a different output value than what your input value was, meaning you would experience data corruption.
Pad characters must be included with a plain or “naïve” Base64 encoded value so that a Base64 decoder can strip the correct number of pad bits from the output, giving you your original input bytes when decoding from Base64 characters to raw bytes. This is the purpose that Base64 pad characters serve. The pad characters indicate how many pad bytes were used to encode a value in Base64.
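How a decoder reacts to missing pad characters depends on the implementation. As one data point, the decoder in Python's standard library refuses to guess and raises an error, as this short sketch shows with the one byte value b"M".

```python
import binascii
from base64 import urlsafe_b64decode

# With both pad characters present the single original byte is recovered
decoded = urlsafe_b64decode("TQ==")

# With the pad characters dropped, Python's decoder rejects the input
try:
    urlsafe_b64decode("TQ")
    rejected = False
except binascii.Error:
    rejected = True

print(decoded, rejected)  # b'M' True
```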
CESR uses pad bytes and characters in a similar way, yet on the front, and with pre-conversion padding, so the rules for identifying and stripping pad bits are slightly different.
Yet, let’s stick with Base64 padding for now and come back to CESR padding later. If you are starting to get confused or lost then skip ahead to the diagrams below and come back to this explanation.
ASIDE – Calculating the quantity of Base64 pad characters based on input byte quantity
For a SHA3-256 digest this count is 44 characters. See the math below for an explanation. This number may also be found in the CESR Master Code Table for the type of algorithm used. Since we measure every raw value in terms of bytes (8 bits), there are three possible scenarios, detailed in the Base64 RFC, for the number of pad bytes required and thus pad characters.
A value ending with a single byte (8 bits) beyond a 24 bit boundary requires two bytes (16 bits) to meet a 24 bit boundary. This will have two ‘=’ pad characters.
This means that your 8 raw bits + the 16 padding bits (two bytes) will equal 24 bits, aligning your raw value on a 24 bit boundary.
A value ending with two bytes (16 bits) beyond a 24 bit boundary requires one byte (8 bits) to align on a 24 bit boundary. This will have one ‘=’ pad character.
Take the 16 bits + one pad byte (8 bits) to get to 24 bits to align on the 24 bit boundary.
A value ending with three bytes is already aligned on a 24 bit boundary (3 * 8 = 24)
You can use the modulus operator ‘%’ to determine the number of ending bits you have. For 256 bits (32 bytes * 8 bits per byte) you end up with 16 bits, or two bytes, rule number two above. So we need the equivalent of one pad byte.
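The three rules above can be written as a small function. This is a sketch with a hypothetical name, base64_pad_chars, cross-checked against the standard library encoder.

```python
from base64 import urlsafe_b64encode


def base64_pad_chars(raw_len_bytes: int) -> int:
    """Number of '=' pad characters plain Base64 adds for a value of this length."""
    leftover_bits = (raw_len_bytes * 8) % 24  # bits past the last 24 bit boundary
    leftover_bytes = leftover_bits // 8       # 0, 1, or 2
    return (3 - leftover_bytes) % 3           # bytes missing to reach the boundary


leftover = (32 * 8) % 24          # 16 bits, i.e. two bytes past the boundary
pads = base64_pad_chars(32)       # so one pad byte, one '=' character

# Cross-check against the real encoder on 32 placeholder bytes
actual = urlsafe_b64encode(bytes(32)).count(b"=")

print(leftover, pads, actual)  # 16 1 1
```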
How Base64 handles pad bits
The way that Base64 handles the need for pad bytes is to split the last byte into two characters, add zero bits to the last Base64 character, and then add the correct number of pad ‘=’ equals sign characters to the final output to end up with groups of 4 Base64 characters, which aligns on a 24 bit boundary because 4 * 6 bits per Base64 character = 24 bits.
What this means for a SAID – Calculating Pre-pad Bytes for CESR
In CESR padding is handled a bit differently because it repurposes the pad characters for type codes in its TLV encoding scheme. This means that what would have been zero bits representing ‘A’ characters in the Base64 encoded CESR value gets replaced with the type code, also called derivation code, in the final CESR value. To accomplish this CESR does pre-padding prior to conversion to Base64 characters. What this means for SAIDs is that all digest bytes must be padded at the front of the digest bytes to reach a multiple of 24 bits. Compare this to Base64 padding which occurs at the end of the digest bytes. Both scenarios are pictured below, Base64 padding and CESR padding.
Since the SHA3-256 digest we start with is 32 bytes, or 256 bits (not a multiple of 24), then all we need to add is one byte to get to 264 bits, which is a multiple of 24, or 33 bytes.
Once you know the quantity of bytes that aligns on a 24 bit boundary you can do a simple calculation to get the number of filler characters for your digest field. Since each Base64 character holds 6 bits, you can divide your total number of bits (264) by 6 to get the number of Base64 characters in your final encoded digest.
264 (bits) / 6 (bits per Base64 char) = 44 (Base64 chars)
This means the total length of the resulting SAID will be 44 Base64 characters. So, you need 44 filler ‘#’ pound sign characters in your digest field of your JSON object prior to calculating the SAID.
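The arithmetic in this section fits in a few lines of Python. The function name cesr_sizes is hypothetical, and the sketch only covers the case discussed here, where the type code replaces the leading ‘A’ character(s) and adds no extra length.

```python
def cesr_sizes(raw_len_bytes: int) -> tuple:
    """Return (pad bytes to prepend, Base64 characters in the encoded value)."""
    pad_bytes = (3 - raw_len_bytes % 3) % 3        # bytes needed to reach a multiple of 3 bytes (24 bits)
    total_bits = (raw_len_bytes + pad_bytes) * 8   # 264 for a 32 byte digest
    return pad_bytes, total_bits // 6              # 6 bits per Base64 character


pad_bytes, said_length = cesr_sizes(32)
filler = "#" * said_length  # what goes in the digest field before hashing

print(pad_bytes, said_length)  # 1 44
```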
Fixed width output – why is it needed?
Consistent sizing of the resulting JSON object, for a stable size of the overall output, is the primary reason for filler characters. In order to create the same size output both before and after the SAID is added into the JSON there must be an equal number of pound signs (44 in this case) placed into the same field where the SAID will go. This is used in CESR encoding because CESR data types are encoded according to a type, length, and value scheme (TLV scheme) that simplifies parsing. The size of the overall output is the length, or “L,” in TLV, and it only works if you have data of a known width.
{
"d": "############################################",
"first": "john",
"last": "doe"
}
Now that you know the rules for calculating the number of pad characters, we are ready to illustrate the calculation process with diagrams.
Diagram for plain “naïve” Base64 encoding of SHA3-256 digest
Base64 uses post-padding, post-conversion of pad characters, as shown in the diagram below. You start with the raw digest. All the boxes in this diagram represent the raw bytes of the digest. There is no padding yet because the value is raw and is not yet converted to Base64 characters.
Binary bits of 32 byte SHA3-256 digest of above JSON with ‘#’ filler
For those following along in code the raw bytes of the 32 byte SHA3-256 digest of the JSON above (with the ‘#’ filler characters) are represented in binary as follows:
1111001001011011010101100010111010011111011001101111000110001101000010000000010010000011100010110000000000000001100111110110110000101001010000110100100101001000111110110110011100010001110100110010011010101000010001000100101011100100000011111110100011111010
Take a look at the last two bytes, 11101000 and 11111010. They factor into the last characters adjacent to the pad character, as you see below.
Encode this 32 byte digest to Base64 URL Safe and you get:
What happened here is that four bits (1010) of the last byte (11111010) were encoded into the last character, lowercase ‘o’, adjacent to the pad character. If you look at the value for lowercase ‘o’ in the Base64 alphabet you will see that it has the bit pattern 101000. Yet it only pulled four bits from the last byte, 11111010, so where did the last two bits (00) come from? They were added in by the Base64 encoder. These two pad bits are why the corresponding final value has a single equals sign ‘=’ pad character. That instructs the Base64 decoder to strip two bits from the last character during the decoding process:
IMPORTANT: Base64 does not add the padding to the raw bytes prior to conversion. Instead it adds the padding while converting the 6 bit groups of the raw bytes into Base64 characters.
Because 32 bytes (256 bits) is not a multiple of 24, and so does not align on a 24 bit boundary, the Base64 encoder splits the last byte into two different Base64 characters, since 8 bits do not fit in one 6 bit group and must be spread across two 6 bit groups. Each of these 6 bit groups gets its own Base64 character. In this case, the last two bits of 11101000 and all eight bits of 11111010 get spread across the last two characters ‘P’ (001111) and ‘o’ (101000).
Because of how the math works when splitting the 8-bit byte groups into 6-bit Base64 character groups the ‘o’ character got four bits from the very end of the digest. Yet four bits is not enough for a Base64 character so the Base64 encoder adds two zero bits on the end, signified with white boxes containing zeroes. Before the pad character is added we are at 43 Base64 characters (6 bit groups, 258 bits), which is not a multiple of 24 bits. When the pad character ‘=’ is added we get to 44 characters (264 bits), which is a multiple of 24 bits, meaning the encoding completed successfully.
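For those following along in code, the plain Base64 behavior described above can be checked with a short sketch. The hex string is a transcription of the binary digest shown earlier in this section (an assumption worth double-checking against your own computed digest).

```python
from base64 import urlsafe_b64decode, urlsafe_b64encode

# Hex transcription of the 32 byte SHA3-256 digest shown in binary above
DIGEST_HEX = "f25b562e9f66f18d0804838b00019f6c29434948fb6711d326a8444ae40fe8fa"
digest = bytes.fromhex(DIGEST_HEX)

naive = urlsafe_b64encode(digest).decode()

last_two_bytes = [format(b, "08b") for b in digest[-2:]]

print(last_two_bytes)  # ['11101000', '11111010']
print(naive[-4:])      # 6Po=
print(len(naive))      # 44, counting the single '=' pad character
```

Decoding the value with urlsafe_b64decode strips the two pad bits and returns the original 32 bytes.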
Base64 Encoded SHA3-256 Digest
With the fully padded value you end up with a valid, encoded, Base64 value that looks like the following bit diagram:
The C2 character at the end shares some bits with the raw bytes of the digest and also contains some padding zero bits. The last character, C1, is an equals sign ‘=’ pad character. The fact that there is one pad character indicates to the Base64 decoder that there are two zeroed pad bits to remove from the last character, ‘C2’, during decoding in order to get back to the original digest bytes.
‘=’ is wasted space?
You could consider the pad characters ‘=’ as wasted space that could be useful if repurposed. All of the pad bits used for the equals sign could represent something. This is exactly what CESR does except it moves the padding to the front of the bytes so that it can have a uniform TLV encoding format. TLV encoding formats require the type character to be at the front of the value, so using post-padding like Base64 does would not work.
Along these same lines, SAIDs do not use Base64-style padding because it does not enable separability of individual concatenated values due to the fact that there is no easy way to cleanly and reliably separate individual values out of a Base64 encoded stream of bytes. The CESR specification introduction mentions this:
This Composability property enables the round-trip conversion en-masse of concatenated Primitives between the text domain and binary domain while maintaining the separability of individual Primitives.
Composable Event Streaming Representation ToIP specification – Dr. Sam Smith
Now that you understand how the plain or “naïve” Base64 encoding works then we turn our attention to CESR style pre-padding.
CESR Byte Padding: Pre-padding, Pre-conversion
In CESR the padding of values occurs with the raw bytes prior to encoding to Base64 as shown below in the white box containing ‘B33.’
What this means is that the raw value, prior to conversion, already aligns on a 24 bit boundary. Because of this pre-conversion alignment there will never be any Base64 pad characters ‘=’ in the output.
How many bytes to prepend?
How do you know how many bytes to prepend? With a calculation similar to the one we did above to find the number of filler characters.
Since the SHA3-256 digest we start with is 32 bytes, or 256 bits (not a multiple of 24), then all we need to add is one byte to get to 264 bits, which is a multiple of 24, or 33 bytes.
Again, once you know the quantity of bytes that aligns on a 24 bit boundary you can do a simple calculation to get the number of filler characters for your digest field. Since each Base64 character holds 6 bits, you can divide your total number of bits (264) by 6 to get the number of Base64 characters in your final encoded digest.
264 (bits) / 6 (bits per Base64 character) = 44 (Base64 Characters)
So 44 will be the quantity of filler characters to put into the JSON object in order to calculate a SAID.
What happens when prepending bytes for CESR style encodings?
When encoding a value that requires padding with CESR-style padding (up front), instead of ‘=’ at the end like Base64-style padding would produce you end up with ‘A’ characters on the front of your encoded value. You also end up with the one character adjacent to the ‘A’ character(s) including some pad bits and some raw bits, as shown below in the bit diagram.
The intermediate encoded value looks like the below value that is not yet a SAID. This is not yet a SAID because the ‘A’ character has not yet been replaced with a type code from the TLV scheme indicating this is a SHA3-256 digest.
This ‘A’ character represents all zero bits (000000) in the Base64 alphabet.
In binary the full, pre-padded digest value (all 33 bytes) looks like the following. Notice the zero bits at the front.
000000001111001001011011010101100010111010011111011001101111000110001101000010000000010010000011100010110000000000000001100111110110110000101001010000110100100101001000111110110110011100010001110100110010011010101000010001000100101011100100000011111110100011111010
The first two bytes are 00000000 and 11110010, which get encoded into Base64 as shown below. Six of the zero pad bits get encoded as an ‘A’ character and the remaining two zero pad bits get included in the capital ‘P’ character, which also has four bits from the first byte of the raw digest.
Bit diagram of Base64 encoded, CESR pre-padded raw value.
This diagram illustrates how CESR pre-pads with bytes of zero bits prior to performing a Base64 encoding on the fully padded raw value. The next diagram shows, character by character, what a fully padded, encoded, CESR-style value looks like.
As you can see, the padding is at the front of the encoded value rather than the back like Base64 does. And the character with shared pad and raw bits is adjacent to the pad character at the front of the Base64 encoded value.
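To see the difference between the two padding styles without a diagram, here is a small sketch using only the first two bytes of the digest from the bit string above. Two bytes leave the same remainder mod 3 as the full 32-byte digest, so the padding behaves the same way. The variable names are illustrative only.

```python
from base64 import urlsafe_b64encode

# First two bytes of the digest in the bit string above (11110010 01011011).
raw = bytes([0xF2, 0x5B])

base64_style = urlsafe_b64encode(raw)           # pad character '=' added at the end
cesr_style = urlsafe_b64encode(b"\x00" + raw)   # zero byte prepended before encoding

print(base64_style)  # b'8ls='
print(cesr_style)    # b'APJb'
```

The ‘A’ holds six of the zero pad bits, and the ‘P’ holds the last two pad bits plus the first four bits of the raw data.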
To get to the final SAID you then replace the ‘A’ character with the appropriate type code, or derivation code, but we are getting ahead of ourselves. Let’s now get into the calculation of the digest.
This step showed you how to calculate the appropriate number of filler ‘#’ pound sign characters to put into the digest field in your JSON object. The next step shows you how to calculate a digest of that JSON object.
Creation Step 3: Calculate a digest of the data
To calculate the digest, take the data with the correct number of filler characters in the digest field and hash it. In our case we take a digest of the following, shown here with line breaks for readability:
{
"d": "############################################",
"first": "john",
"last": "doe"
}
In Python taking a digest of this data would be as simple as the following:
import hashlib
raw_value = b'{"d":"############################################","first":"john","last":"doe"}'
digest = hashlib.sha3_256(raw_value).digest()
# hash function ↑↑↑↑
This is a simple step and is very similar for any other algorithm such as SHA2-256, Blake3-256 or otherwise. You use the desired type of hash function.
The only other thing to be aware of here is that if you create a digest that is sized differently than 32 bytes, such as a SHA3-512 digest (64 bytes) then you need to also change the number of pad bytes, which gets into the next step.
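As a rough illustration of how little changes between algorithms, the sketch below hashes the same bytes with three functions from Python's standard hashlib module and prints the digest sizes. Blake3 is not part of the standard library, so it is left out here. The 44-character filler is kept only to compare sizes; a 64-byte digest would need a longer filler.

```python
import hashlib

data = b'{"d":"############################################","first":"john","last":"doe"}'

# Each constructor below is part of Python's standard hashlib module.
digests = {
    "SHA2-256": hashlib.sha256(data).digest(),
    "SHA3-256": hashlib.sha3_256(data).digest(),
    "SHA3-512": hashlib.sha3_512(data).digest(),
}

for name, digest in digests.items():
    print(name, len(digest))   # 32, 32 and 64 bytes respectively
```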
Creation Step 4: Calculate the quantity of pad bytes
The calculation for the quantity of pad bytes is very similar to the calculation for the quantity of filler ‘#’ characters needed in Step 2. In fact, it is a subset of that calculation. The goal with pad bytes is to make sure that the final value aligns on a 24 bit boundary as mentioned above.
For example, since the SHA3-256 digest we start with is 32 bytes, or 256 bits (not a multiple of 24), then all we need to add is one byte to get to 264 bits, which is a multiple of 24, or 33 bytes.
Deeper into Modulus Math for Pad Bytes
To get a bit deeper into the math, one way to do this calculation with the modulus operator is to find out how many bytes are necessary to completely fill a 3 byte group. Since a 3 byte group is exactly 24 bits, you can use a modulus calculation to see how far away you are from filling a three byte group by doing a modulus 3 operation in two steps:
Step 1: take bytes mod 3
32 bytes mod 3 = 2 (bytes)
meaning there are two bytes already in the last group of three (24 bit boundary).
Step 2: subtract bytes in group from group size
So to see how many bytes you must add to get to the 24 bit boundary (3 byte group) you subtract the quantity of bytes you have from the group size:
3 (group size) – 2 (bytes in group) = 1 (pad bytes needed to fill group)
Because of how modulus arithmetic works, bytes mod 3 can only be 0, 1, or 2. Applying one more mod 3 to the result of the subtraction maps the already-aligned case to zero, so you will only ever have three possible values from this equation:
(3 – (bytes mod 3)) mod 3 = 0 (pad bytes), when bytes mod 3 = 0
(3 – (bytes mod 3)) mod 3 = 1 (pad bytes), when bytes mod 3 = 2
(3 – (bytes mod 3)) mod 3 = 2 (pad bytes), when bytes mod 3 = 1
You never have to worry about three pad bytes because a remainder of zero means your raw value is already a multiple of 3 bytes, so it already aligns on a 24 bit boundary and does not need any pad bytes.
So, to review, for us the calculation of (3 – (32 mod 3)) = 1 pad byte gives us a single pad byte to be prepended to our raw value, as shown below in the ‘B33’ box.
As mentioned before, CESR does pre-padding, pre-conversion which means that the pad byte we found we need is added to the front of the array of raw bytes for the SHA3-256 digest. The next step covers encoding this padded raw value.
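The modulus calculation and the prepending can be sketched as follows. Both `pad_size` and `pre_pad` are hypothetical helper names used for illustration, not functions from KERIpy or any other library.

```python
def pad_size(raw_size: int) -> int:
    """Zero bytes to prepend so that raw_size + pad is a multiple of 3 bytes (24 bits)."""
    return (3 - raw_size % 3) % 3


def pre_pad(raw: bytes) -> bytes:
    """Prepend the zero pad bytes to the front of the raw value, CESR style."""
    return b"\x00" * pad_size(len(raw)) + raw


for size in (32, 64):
    print(size, pad_size(size), len(pre_pad(bytes(size))))
# 32 1 33
# 64 2 66
```

The outer `% 3` is what turns the already-aligned case into zero pad bytes rather than three.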
Creation Step 5: Base64 URL Safe Encode the padded raw bytes
Now that the raw value from Step 4 is properly padded, you encode it with Base64 URL Safe encoding. CESR uses Base64 URL Safe encoding rather than plain Base64 encoding so that CESR values can safely be used in URLs and filenames.
import hashlib
from base64 import urlsafe_b64encode
raw_value = b'{"d":"############################################","first":"john","last":"doe"}'
digest = hashlib.sha3_256(raw_value).digest()
padded_digest = b'\x00' + digest
encoded = urlsafe_b64encode(padded_digest)
# encode to base64 ↑↑↑↑
assert encoded == b'APJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6'
assert len(encoded) == 44
Now that you have the Base64 URL Safe encoded value then you are ready to finish off this SAID creation by replacing the ‘A’ pad character at the front of the encoded value with the appropriate value from the CESR Master Code Table.
Creation Step 6: Substitute Type Code for the front ‘A’ character(s)
CESR pre-pads the raw value to a 24 bit boundary so that the otherwise wasted space of the pad character can be repurposed for a type code in CESR’s TLV encoding scheme. The ‘A’ character at the front of the value is considered to be a pad character in this scheme. This pad ‘A’ character is replaced with the appropriate type code, or derivation code in CESR parlance, from the CESR Master Code Table.
For a SHA3-256 digest that type code is ‘H’ as seen in the following subset of the CESR Master Code Table.
The substitution gives us a final value of HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6, as seen in the following substitution diagram.
The substitution of the ‘A’ character with the ‘H’ character is the final part of what is called CESR encoding a raw digest value into a CESR-style self addressing identifier. This SAID is a front-padded, Base64 encoded, and type-code substituted, string of Base64 characters.
The final value can be created by the code as follows:
import hashlib
from base64 import urlsafe_b64encode
raw_value = b'{"d":"############################################","first":"john","last":"doe"}'
digest = hashlib.sha3_256(raw_value).digest()
padded_digest = b'\x00' + digest
encoded = urlsafe_b64encode(padded_digest)
b64_str_list = list(encoded.decode()) # convert bytes to string of chars for easy replacement of 'A'
b64_str_list[0] = 'H' # replace first 'A' character with 'H' type code
b64_str = ''.join(b64_str_list) # convert string of chars to string with .join()
assert b64_str == 'HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6'
assert len(b64_str) == 44 # length should still be 44 characters, 264 base64 bits, a multiple of 24 bits
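Going the other direction can make the substitution easier to trust. The sketch below, which uses illustrative variable names and assumes a one-character type code as in this post, splits the SAID back into its type code and raw digest by putting the ‘A’ pad character back before decoding.

```python
from base64 import urlsafe_b64decode, urlsafe_b64encode

said = "HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6"

code, rest = said[0], said[1:]            # one-character type code, then the other 43 characters
padded = urlsafe_b64decode("A" + rest)    # restore the 'A' pad character and decode to 33 bytes
raw_digest = padded[1:]                   # drop the leading zero pad byte

print(code)                   # H
print(len(padded))            # 33
print(len(raw_digest))        # 32
print(raw_digest[:2].hex())   # f25b
```

The first two digest bytes, f2 and 5b, match the bits that follow the pad byte in the binary string shown earlier.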
Creation Step 7: Place the Front-Padded, Base64 encoded, Type-code Substituted string in the digest field
Now we can take this correctly padded, CESR encoded value and place it into the digest field in our JSON object, replacing the filler ‘#’ characters with the final, valid SAID:
{
"d": "HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6",
"first": "john",
"last": "doe"
}
This takes us back to where we started off, with a valid SAID and a SAIDified JSON object.
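Putting steps 1 through 7 together, a minimal sketch might look like the following. It is not the KERIpy or SAIDify implementation: the `saidify` function and `CODES` table here are hypothetical, it handles only 32-byte digests, and it assumes compact JSON serialization with the field order preserved. The ‘H’ code for SHA3-256 comes from this article; the ‘I’ code for SHA2-256 is my reading of the CESR Master Code Table and should be checked there.

```python
import hashlib
import json
from base64 import urlsafe_b64encode

# Type code -> hash constructor. 'H' is from this article; 'I' is an assumption to verify.
CODES = {"H": hashlib.sha3_256, "I": hashlib.sha256}


def saidify(data, label="d", code="H"):
    """Return (said, sad) for a dict, following creation steps 1 through 7."""
    sad = dict(data)                                        # leave the caller's dict untouched
    sad[label] = "#" * 44                                   # step 2: filler for a 32-byte digest
    ser = json.dumps(sad, separators=(",", ":")).encode()   # compact JSON, no extra whitespace
    digest = CODES[code](ser).digest()                      # step 3: digest of the filled object
    padded = b"\x00" + digest                               # step 4: one pad byte for 32 bytes
    encoded = urlsafe_b64encode(padded).decode()            # step 5: Base64 URL Safe encode
    said = code + encoded[1:]                               # step 6: swap 'A' for the type code
    sad[label] = said                                       # step 7: embed the SAID
    return said, sad


said, sad = saidify({"d": "", "first": "john", "last": "doe"})
print(said)
print(json.dumps(sad, separators=(",", ":")))
```

Because the digest covers the exact serialized bytes, any change in whitespace or field order produces a different SAID, which is why the serialization is pinned down explicitly.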
What about verification?
What is nice about verification is that it is as simple as calculating the SAID again of a JSON object and comparing that to a SAID you are handed.
Verification Step 1: Start with a SAID from the object you already have
Say you are starting with the below object that has already had a SAID calculated and embedded in the digest field, the “d” field here.
{
"d": "HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6",
"first": "john",
"last": "doe"
}
To get the SAID from this object you extract the value of the “d” field, giving you HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6
Verification Step 2: Calculate the SAID of the object using the SAID creation steps
Verification is easy because all you need to do is take steps 1 through 6 above and recalculate the SAID on the JSON object provided, which means putting the ‘#’ filler characters back into the “d” field before taking the digest. Once you have recalculated the SAID, which will be HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6 again, you can perform the comparison in step 3.
Verification Step 3: Compare the SAID from the object to the calculated SAID
If the SAID the object started with matches the SAID you calculated from the object then you know the object has not been changed and that the SAID is valid. Otherwise either your SAID is invalid or the object has changed.
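The three verification steps can be sketched as below. The helper names `compute_said` and `verify` are hypothetical, and the sketch assumes a SHA3-256 SAID (type code ‘H’) over compact JSON; a real library would read the type code from the SAID to pick the hash function.

```python
import hashlib
import json
from base64 import urlsafe_b64encode


def compute_said(sad, label="d"):
    """Recompute a SHA3-256 SAID ('H' type code) over compact JSON."""
    candidate = dict(sad)
    candidate[label] = "#" * 44                    # put the filler back in the digest field
    ser = json.dumps(candidate, separators=(",", ":")).encode()
    padded = b"\x00" + hashlib.sha3_256(ser).digest()
    return "H" + urlsafe_b64encode(padded).decode()[1:]


def verify(sad, label="d"):
    """True when the SAID embedded in sad matches the recomputed one."""
    return sad.get(label) == compute_said(sad, label)


original = {"d": "", "first": "john", "last": "doe"}
original["d"] = compute_said(original)

tampered = dict(original, last="smith")            # same SAID, different data

print(verify(original))   # True
print(verify(tampered))   # False
```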
Review
Calculating a SAID
Now you understand how we SAIDify a JSON object by doing the following seven step process:
Start with a JSON object we want to add a SAID to that has a digest field.
Calculate the quantity of Base64 characters the final, pre-padded, encoded raw digest bytes (SAID) will take up and fill the digest field with that many ‘#’ characters.
Calculate a digest of the bytes of the JSON object after the ‘#’ filler characters are added.
Calculate the quantity of pad bytes needed to align on a 24 bit boundary and prepend that to the raw bytes for a digest.
Encode the padded raw bytes with the Base64 URL Safe alphabet.
Substitute the appropriate type code in place of the ‘A’ character(s) at the front of the encoded string. This final value is your SAID.
Place the final SAID value into the digest field of your JSON object.
Pre-padding prior to Base64 encoding, followed by type code substitution after encoding, is the essence of CESR-style self addressing identifiers. The steps above may seem overwhelming at first, but once you mentally anchor on the fact that CESR pads at the start, and that this padding gives you ‘A’ characters you can reuse for type codes, you have mastered the fundamentals of what makes CESR-style SAIDs work.
Verifying a SAID
Verification of a SAID is easy because you just calculate it again from the original JSON object, or other data object you are using. If the SAIDs match then it verifies; if they don’t then the data changed.
Extra Learning Alert – fully qualified Base64 primitive
And, as a nice side note, you happen to now know what the phrase “fully qualified base64 primitives” in KERIpy means. All that means is that your encoded value has been pre-padded, pre-conversion, and has had its type code added to the front, as we did here with substitution, with the exception that some CESR primitives
Give me a library please! I don’t want to manage these details
In case this article has convinced you that you do not ever again want to worry about the vagaries of aligning on 24 bit boundaries for Base64 or CESR values then you are in luck. There are multiple implementations of the SAID process that can meet your needs in a variety of different languages.
The Python reference implementation in Web Of Trust’s KERIpy, Saider.saidify.
The Human Colossus Foundation’s Rust implementation with WASM bindings for their JavaScript package. See their cool SAID generator and verifier demo, where you can try a whole list of different algorithms.
SAIDify, my own Typescript implementation of the SAID creation process.
Implementations
Web Of Trust KERIpy Python
The Python example below from KERIpy is a unit test showing the usage of the KERIpy Saider.saidify library code to calculate a SAID. The SAID is stored in the .qb64 property of Saider. The term qb64 stands for “qualified base64”, which means a left-padded, Base64 encoded, type code substituted value as described above.
import json
from keri.core.coring import MtrDex, Saider
def test_saidify_john_doe():
    code = MtrDex.SHA3_256
    ser0 = b'{"d": "", "first": "john", "last": "doe"}'
    sad0 = json.loads(ser0)
    saider, sad = Saider.saidify(sad=sad0, code=code)
    assert saider.qb64 == 'HPJbVi6fZvGNCASDiwABn2wpQ0lI-2cR0yaoRErkD-j6'
Human Colossus Foundation Rust SAID demo and test code
Start with their cool demo site of generating and verifying SAIDs:
If you want to dive into their code, the linked test basic_derive_test shows the Rust code for the cool SAD macro #[derive(SAD, Serialize)] that, along with the #[said] field attribute for the SAID digest field, can turn any Rust struct into a self-verifying data structure.
use said::derivation::HashFunctionCode;
use said::sad::SAD;
use said::version::format::SerializationFormats;
use said::SelfAddressingIdentifier;
use serde::Serialize;
#[test]
pub fn basic_derive_test() {
    #[derive(SAD, Serialize)]
    struct Something {
        pub text: String,
        #[said]
        pub d: Option<SelfAddressingIdentifier>,
    }
    let mut something = Something {
        text: "Hello world".to_string(),
        d: None,
    };
    let code = HashFunctionCode::Blake3_256;
    let format = SerializationFormats::JSON;
    something.compute_digest(&code, &format);
    let computed_digest = something.d.as_ref();
    let derivation_data = something.derivation_data(&code, &format);
    assert_eq!(
        format!(
            r#"{{"text":"Hello world","d":"{}"}}"#,
            "############################################"
        ),
        String::from_utf8(derivation_data.clone()).unwrap()
    );
    assert_eq!(
        computed_digest,
        Some(
            &"EF-7wdNGXqgO4aoVxRpdWELCx_MkMMjx7aKg9sqzjKwI"
                .parse()
                .unwrap()
        )
    );
    assert!(something
        .d
        .as_ref()
        .unwrap()
        .verify_binding(&something.derivation_data(&code, &format)));
}
SAIDify
If you want to use a Typescript library that is about 530 lines of code you can go with my SAIDify library. The below example shows how to use the library with Typescript.
Start with an NPM install
npm install saidify
And then you can use the saidify(data, label) function to SAIDify any JavaScript object you have as long as you indicate which field is the digest field, the “label” field, which defaults to the “d” field.
import { saidify, verify } from 'saidify'
// create data to become self-addressing
const myData = {
  a: 1,
  b: 2,
  d: '',
}
const label = 'd'
const [said, sad] = saidify(myData, label)
// said is self-addressing identifier
// sad is self-addressing data
console.log(said)
// ...Vitest test assertion
expect(said).toEqual('ELLbizIr2FJLHexNkiLZpsTWfhwUmZUicuhmoZ9049Hz')
// verify self addressing identifier
const computedSAID = 'ELLbizIr2FJLHexNkiLZpsTWfhwUmZUicuhmoZ9049Hz'
const doesVerify = verify(sad, computedSAID, label) // can verify with original myData or sad
// ...Vitest test assertion
expect(doesVerify).toEqual(true)
Conclusion
The key takeaways from calculating SAIDs are:
Use pre-padded bytes to align on a 24 bit boundary prior to encoding as Base64 characters.
Substitute type codes in for the leading ‘A’ character(s) of a SAID.
It is easy to choose different algorithms for the SAID calculation process. Just make sure you use a code from the CESR Master Code Table if you want to be CESR compliant.
There are multiple implementations of the SAID algorithm you can use.
Now go make some SAIDs!
References:
HCF oca-spec #58
RFC 4648: The Base16, Base32, and Base64 Data Encodings, specifically section 5
Composable Event Streaming Representation (CESR) ToIP Specification, specifically section 12.6
Self Addressing Identifier IETF draft specification
SADs, SAIDs, and ACDCs video presentation by Daniel Hardman