Using the Identity Hash for Template Deduplication

I’ve noticed that certain software licenses are typically shipped as an accompanying file that is identical from project to project except for the name and date, which gives us a great example for deduplication.

Let’s say Alice and Bob each want to publish a CID for an MIT license file, and both worry that at some point there won’t be enough providers of the file. If they published the same CID, the file would be more likely to stay available, but their files differ slightly. They note that the differing portion of each file is small enough to inline in a CID. They therefore agree to use a common raw block for the portion of the file they share (bafkreifoj7xag3idnver5pcj2qqnvn5rbrwtluxp6f7ryc3nxql2osxrim). Each of them then creates a raw identity CID for their differing portion, plus an upper-level UnixFS file block that links the two pieces and represents their file.

Their result:

If more people follow suit, then the availability increases.

Use the IPLD Explorer to look inside the links. Note that in real life the shared CID likely won’t be available on the DHT. Unfortunately, the URLs may be too long to open in an HTTP gateway.

How I did this

This example is only meant to demonstrate a general idea and is not necessarily correct. However, I will explain what I did for the sake of reproducibility.

To create the common block, I copied and pasted the license’s body (including two \ns at the top) into a text editor to get a .txt file. I then used ipfs add --cid-version=1 [file path] to get a raw block. I followed the same procedure for each version of the differing portions, except I also passed --hash=identity.
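To make the identity step concrete: an identity CID carries the block's bytes inside the CID itself instead of a digest. The Python sketch below is only an illustration, not what ipfs add runs internally. It assumes the CIDv1 layout (version 0x01, raw codec 0x55, identity multihash code 0x00, length, payload) and base32 multibase with the b prefix; the helper names are made up for this example. The payload shown is what the first link in Bob's JSON further down decodes to.

```python
import base64

RAW_CODEC = 0x55      # multicodec code for "raw"
IDENTITY_HASH = 0x00  # multihash code for "identity" (no hashing at all)


def identity_raw_cid(payload: bytes) -> str:
    """Build a base32 CIDv1 string that inlines `payload` (sketch only)."""
    if len(payload) > 127:
        # Longer payloads need a multi-byte varint length; omitted here.
        raise ValueError("this sketch only handles payloads up to 127 bytes")
    cid_bytes = bytes([0x01, RAW_CODEC, IDENTITY_HASH, len(payload)]) + payload
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def inlined_payload(cid: str) -> bytes:
    """Recover the inlined bytes from a base32 raw identity CID."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 CIDs")
    body = cid[1:].upper()
    raw = base64.b32decode(body + "=" * (-len(body) % 8))
    if raw[:3] != bytes([0x01, RAW_CODEC, IDENTITY_HASH]):
        raise ValueError("not a raw identity CIDv1")
    return raw[4:4 + raw[3]]


print(identity_raw_cid(b"(c) 3001 Bob"))               # bafkqadbimmusamzqgaysaqtpmi
print(inlined_payload("bafkqadbimmusamzqgaysaqtpmi"))  # b'(c) 3001 Bob'
```

Because nothing is hashed, the CID grows with the payload, which is why this only makes sense for the small differing portion.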

The somewhat tricky part was merging them into one file. The short version is that I adapted this technique for putting two files under a directory so that it instead concatenates two blocks into one file.

The JSON I used for Bob was:

{
	"Data": {
		"/": {
			"bytes": "CAIgDCCACA=="
		}
	},
	"Links": [
		{
			"Hash": {
				"/": "bafkqadbimmusamzqgaysaqtpmi"
			}
		},
		{
			"Hash": {
				"/": "bafkreifoj7xag3idnver5pcj2qqnvn5rbrwtluxp6f7ryc3nxql2osxrim"
			},
			"Tsize": 1024
		}
	]
}

The two bafk...s are the CIDs I concatenated. Typically, each link in a UnixFS file node will have the CID, an empty name (see #8691), and the Tsize, but I chose to omit some of the values to save space.

Creating a top-level file block this way differs from creating a directory block because the Data > bytes field has to be generated separately for each file. ipfs dag put and ipfs dag get will translate much, but not all, of the node back and forth between dag-pb and dag-json. Inside Data > bytes is the UnixFS Data entry, which has to be encoded separately.

To encode the Data entry into protobuf, I used this tool. It does not support the uint64 type, but if the values are small enough, uint32 will yield the same result. The output is in hexadecimal, but it can easily be transformed into the format dag-json takes with this tool.

I used this as the input:

{
  "Type": "File",
  "Data": null,
  "filesize": null,
  "blocksizes": [
    12,
    1024
  ],
  "hashType": null,
  "fanout": null
}

and this as the protobuf definition:

message Data {
	enum DataType {
		Raw = 0;
		Directory = 1;
		File = 2;
		Metadata = 3;
		Symlink = 4;
		HAMTShard = 5;
	}

	required DataType Type = 1;
	optional bytes Data = 2;
	optional uint32 filesize = 3;
	repeated uint32 blocksizes = 4;

	optional uint32 hashType = 5;
	optional uint32 fanout = 6;
}

To save space I left every field blank except for the mandatory Type and blocksizes. Type must be 2 (File) to show that all the linked data is part of the same file, while blocksizes is an array holding the length of the raw data represented by each link, in link order. In theory, blocksizes should make filesize redundant; however, when filesize is left out, webui will think the file size is 0.
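The same bytes can also be produced without the online tools. The following Python sketch hand-encodes only the two fields used above (Type and blocksizes) in the protobuf wire format. The function names are hypothetical, and it assumes blocksizes is written unpacked, one tag per value, which is the default for a proto2 repeated field.

```python
import base64


def varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)


def unixfs_file_data(blocksizes) -> bytes:
    """Encode a UnixFS Data message with Type=File and the given blocksizes."""
    out = bytearray()
    out += varint((1 << 3) | 0) + varint(2)         # field 1 (Type) = 2 (File)
    for size in blocksizes:
        out += varint((4 << 3) | 0) + varint(size)  # field 4 (blocksizes)
    return bytes(out)


encoded = unixfs_file_data([12, 1024])
print(encoded.hex())                       # 0802200c208008
print(base64.b64encode(encoded).decode())  # CAIgDCCACA==
```

The base64 string matches the Data > bytes value in Bob's JSON, so the hexadecimal-to-base64 conversion is the only transformation needed between the two tools.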

Finally, the link was created with ipfs dag put --store-codec=dag-pb --hash=identity [json file]. The number of characters in the CID can be reduced with ipfs cid format -b=base64url [original CID].
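For readers who want to see what this last step does to the bytes, here is a rough Python sketch that assembles Bob's node by hand and wraps it in an identity CID. It is an approximation under stated assumptions, not a replacement for ipfs dag put: it assumes the canonical dag-pb layout (Links as field 2 written before Data as field 1; inside a link, Hash is field 1 and Tsize is field 3), it omits link names as the JSON above does, and its output has not been compared against what Kubo prints. All helper names are invented for the example.

```python
import base64

DAG_PB = 0x70    # multicodec code for dag-pb
IDENTITY = 0x00  # multihash code for the identity "hash"


def varint(n: int) -> bytes:
    """Unsigned varint, as used by both protobuf and CIDs."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def cid_to_bytes(cid: str) -> bytes:
    """Decode a base32 CID string (multibase prefix 'b') to binary."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 CIDs")
    body = cid[1:].upper()
    return base64.b32decode(body + "=" * (-len(body) % 8))


def length_delimited(field: int, value: bytes) -> bytes:
    """Protobuf wire type 2: tag, length, then the raw bytes."""
    return varint((field << 3) | 2) + varint(len(value)) + value


def pb_link(cid: str, tsize=None) -> bytes:
    """Encode one link: Hash (field 1) and, if given, Tsize (field 3)."""
    out = length_delimited(1, cid_to_bytes(cid))
    if tsize is not None:
        out += varint((3 << 3) | 0) + varint(tsize)
    return out


def dag_pb_node(links, data: bytes) -> bytes:
    """Encode the node with Links (field 2) first, then Data (field 1)."""
    return b"".join(length_delimited(2, l) for l in links) + length_delimited(1, data)


def identity_cid(codec: int, block: bytes) -> bytes:
    """CIDv1 whose multihash is the block itself."""
    return varint(1) + varint(codec) + varint(IDENTITY) + varint(len(block)) + block


def as_base32(cid: bytes) -> str:
    return "b" + base64.b32encode(cid).decode("ascii").lower().rstrip("=")


def as_base64url(cid: bytes) -> str:
    return "u" + base64.urlsafe_b64encode(cid).decode("ascii").rstrip("=")


node = dag_pb_node(
    [
        pb_link("bafkqadbimmusamzqgaysaqtpmi"),
        pb_link("bafkreifoj7xag3idnver5pcj2qqnvn5rbrwtluxp6f7ryc3nxql2osxrim", tsize=1024),
    ],
    base64.b64decode("CAIgDCCACA=="),  # the UnixFS Data bytes from the JSON
)
cid = identity_cid(DAG_PB, node)
print(len(as_base32(cid)), as_base32(cid))
print(len(as_base64url(cid)), as_base64url(cid))
```

The two printed lengths show why base64url helps: it packs six bits into each character instead of five, so the same CID bytes take fewer characters.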