Trouble recreating binary representation of "Hello World\n" block

I’m trying to reproduce the binary representation of a simple “Hello World\n” block. I know what the output should be from this…

============================

% echo Hello World | ipfs add
added QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u
…etc.

============================

% ipfs block get QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u | hexdump -C
00000000  0a 12 08 02 12 0c 48 65  6c 6c 6f 20 57 6f 72 6c  |......Hello Worl|
00000010  64 0a 18 0c                                       |d...|
00000014

============================

I created the Protocol Buffers code with this proto file:

syntax = "proto2";

message PBLink {
    optional bytes  Hash  = 1;
    optional string Name  = 2;
    optional uint64 Tsize = 3;
}

message PBNode {
    optional bytes  Data  = 1;
    repeated PBLink Links = 2;
}

message Metadata {
    optional string MimeType = 1;
}

message UnixTime {
    required int64   Seconds               = 1;
    optional fixed32 FractionalNanoseconds = 2;
}

message Data {
    enum DataType {
        Raw       = 0;
        Directory = 1;
        File      = 2;
        Metadata  = 3;
        Symlink   = 4;
        HAMTShard = 5;
    }

    required DataType Type       = 1;
    optional bytes    Data       = 2;
    optional uint64   filesize   = 3;
    repeated uint64   blocksizes = 4;
    optional uint64   hashType   = 5;
    optional uint64   fanout     = 6;
    optional uint32   mode       = 7;
    optional UnixTime mtime      = 8;
}

============================

Here is Python code that used the Protocol Buffers “compiled” code:

#!/usr/bin/env python3

import ipfs_pb2
import subprocess

unixfs = ipfs_pb2.Data()
unixfs.Type = 2
unixfs.Data = b"Hello World\n"

dag = ipfs_pb2.PBNode()
dag.Data = unixfs.SerializeToString()

print(dag.SerializeToString())
print(dag.SerializeToString().hex())

============================

Here is the output of my Python code above…

b'\n\x10\x08\x02\x12\x0cHello World\n'

0a100802120c48656c6c6f20576f726c640a

============================

Notice that the correct output has 2 extra bytes (18 0c) after the “Hello World\n”. I suspect the problem is that I’m not specifying the empty PBLink vector in the PBNode of the DAG. How do I specify that empty object?
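
To pin down exactly where the two encodings differ, here is a small sketch that compares the two hex strings shown above byte by byte. The variable names are only for illustration.

```python
# Bytes returned by `ipfs block get` (from the hexdump above).
expected = bytes.fromhex("0a120802120c48656c6c6f20576f726c640a180c")
# Bytes produced by the Python script above.
actual = bytes.fromhex("0a100802120c48656c6c6f20576f726c640a")

# How many bytes are missing, and which ones.
missing = expected[len(actual):]
print(len(missing), missing.hex())

# Positions where the shared prefix disagrees.
diffs = [(i, hex(a), hex(e))
         for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
print(diffs)
```

Besides the two trailing bytes, the only other difference is the length byte at offset 1 (0x10 versus 0x12), which is consistent with the inner message simply being 2 bytes longer.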

Any help greatly appreciated.

Sincerely,

Chris

Dunno if it’s related, but when I was pulling data out of IPFS it was giving me stuff wrapped in a tarball. I had to extract the first ‘file’ from the tarball to get the raw input bytes.

Blade

Yes, I think that for my hello world example it is likely something more minor that I have to adjust.

If anyone is reading this, I learned how Protocol Buffers encodes data structures and
was able to write some Python code that can decode its binary representations…

G_LEN   = 7
G_MASK  = 0x7f
WT_LEN  = 3
WT_MASK = 0x7
DW_LEN  = 8
W_LEN   = 4
HEXADEC = 16
BASE_2  = 2

def get_key_info(bytes_):
        """
        Extracts naturals, field numbers, wire types and key lengths from bytes.
        """

        nat   = []
        msb   = 1
        index = 0
        while msb:
                msb    = bytes_[index] >> G_LEN
                nat.append(bytes_[index] & G_MASK)
                index += 1
        nat.reverse()
        nat   = int("".join([bin(e)[2:].zfill(G_LEN) for e in nat]), BASE_2)
        info  = nat, (nat >> WT_LEN), (nat & WT_MASK), index

        return info

def get_value_info(bytes_, wire_type):
        """
        Extracts values and value lengths from bytes for given wire types.
        """

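        if wire_type == 0:
                # Varint: reuse the key parser, which decodes a base-128 varint.
                info  = get_key_info(bytes_)
                value = info[0]
                len_  = info[3]
        elif wire_type == 1:
                # 64-bit fixed: little-endian on the wire, returned big-endian.
                value = bytes(reversed(bytes_[:DW_LEN]))
                len_  = DW_LEN
        elif wire_type == 2:
                # Length-delimited: a varint length followed by that many bytes.
                info  = get_key_info(bytes_)
                value = bytes_[info[3]:info[3] + info[0]]
                len_  = info[0] + info[3]
        elif wire_type == 5:
                # 32-bit fixed: little-endian on the wire, returned big-endian.
                value = bytes(reversed(bytes_[:W_LEN]))
                len_  = W_LEN
        else:
                # Group wire types (3, 4) and invalid values are not handled.
                raise ValueError("unsupported wire type: %d" % wire_type)

        return value, len_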

def decode(bytes_):
        """
        Extracts key value pairs from Protocol Buffer object encodings.
        """

        kv_pairs = []
        index    = 0
        while index < len(bytes_):
                key_info    = get_key_info(bytes_[index:])
                index      += key_info[3]
                value_info  = get_value_info(bytes_[index:], key_info[2])
                index      += value_info[1]
                kv_pairs.append((key_info[1], value_info[0]))

        return kv_pairs

I decoded things like this after saving code above in a file called decode.py…

>>> import decode

>>> import subprocess

>>> subprocess.check_output(["ipfs", "block", "get", "QmWATWQ7fVPP2EFGu71UkfnqhYXDYH566qy47CnJDgvs8u"])

b'\n\x12\x08\x02\x12\x0cHello World\n\x18\x0c'

>>> decode.decode(b'\n\x12\x08\x02\x12\x0cHello World\n\x18\x0c')

[(1, b'\x08\x02\x12\x0cHello World\n\x18\x0c')]

>>> decode.decode(b'\x08\x02\x12\x0cHello World\n\x18\x0c')

[(1, 2), (2, b'Hello World\n'), (3, 12)]
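
As a cross-check on the decoder above, here is a more compact sketch of the same two steps: reading a base-128 varint and splitting a key into field number and wire type. The names read_varint and split_key are made up for this example and do not come from any library.

```python
def read_varint(buf, pos=0):
    """Decode one base-128 varint from buf starting at pos.

    Returns (value, next_pos). Each byte carries 7 payload bits,
    least significant group first; the top bit means "more bytes follow".
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_key(key):
    """Split a decoded key into (field_number, wire_type)."""
    return key >> 3, key & 0x7


# The key bytes that appear in the "Hello World\n" block.
for raw in (b"\x0a", b"\x08", b"\x12", b"\x18"):
    key, _ = read_varint(raw)
    print(raw.hex(), split_key(key))
```

So 0a and 12 are length-delimited fields 1 and 2, while 08 and 18 are varint fields 1 and 3, which matches the decoded pairs above.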

I’m happy to elaborate further if anyone needs help with this.

Sincerely,

Chris

FWIW, in my original problem I was missing the filesize field (field 3) in the Data message. Set to the payload length, 12, it accounts for the two trailing bytes 18 0c.
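
To illustrate, here is a minimal sketch that builds the block by hand, without the generated ipfs_pb2 module, so it runs on its own. The helper names (encode_varint, varint_field, bytes_field) are hypothetical; only the field numbers come from the proto file above.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def varint_field(field, value):
    """Key with wire type 0, followed by the varint value."""
    return encode_varint((field << 3) | 0) + encode_varint(value)


def bytes_field(field, payload):
    """Key with wire type 2, followed by a length prefix and the payload."""
    return encode_varint((field << 3) | 2) + encode_varint(len(payload)) + payload


payload = b"Hello World\n"

# UnixFS Data message: Type = File (2), Data = payload, filesize = len(payload).
unixfs = (varint_field(1, 2)
          + bytes_field(2, payload)
          + varint_field(3, len(payload)))

# PBNode with only the Data field set; no Links are emitted.
dag = bytes_field(1, unixfs)
print(dag.hex())
```

This yields the same 20 bytes as the hexdump at the top of the thread. With the generated module, the equivalent change should be setting unixfs.filesize = len(unixfs.Data) before serializing; that is an assumption based on the proto file, not a tested run.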