What CID to use to download the Internet Archive's Democracy's Library from mainnet?

I’m trying to use Lassie to download the End of Term 2020 Dataset, but I can’t find the CID to use in the latest announcement, the annual report, or the previous announcement.

If I download some of the files from AWS S3 and hash them, can I derive the CID from those hashes? Can I start from one CID and find related ones, for example everything sealed around the same time by the same storage provider?

For example, if I start from the first WARC listed, crawl-data/EOT-2020/segments/IA-000/warc/EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz, here are some of its checksums:

$ for f in EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz ; do sha1sum "$f" ; sha256sum "$f" ; sha512sum "$f" ; md5sum "$f" ; crc32 "$f" ; done
72699fbff9f143d9ea1e51389f531287defbbdbf  EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz
9fa2f20ae30cb69a5ebfcc66d5492ce3dfdbbf40a3ac48bdcb5a983046b8ad64  EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz
686711a5cef31131ed597553b40a41529678733eb27dd1bac81aa7f12cccd10df91e9e394cc60c7a936a3b6f2b6b2aca3896b6ed6608db827d38f8223c239b39  EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz
8c89de9ad596515892555dce90ebe29f  EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz
ba07cdea
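
To make the question concrete, here is a minimal sketch of how a SHA-256 digest would map onto a CIDv1, assuming the file were stored as one single raw block (raw codec 0x55 with a sha2-256 multihash). That assumption very likely does not hold for a WARC of this size: large files are normally chunked into a DAG whose root CID hashes a node rather than the file bytes, and Filecoin piece CIDs are built from a different commitment altogether. The function name `raw_cid_from_sha256` is my own, not part of any library.

```python
import base64
import hashlib

CIDV1 = 0x01      # CID version 1
RAW_CODEC = 0x55  # multicodec "raw": the block is the bytes themselves
SHA2_256 = 0x12   # multihash code for sha2-256


def raw_cid_from_sha256(digest: bytes) -> str:
    """Wrap a 32-byte SHA-256 digest as a base32 CIDv1 for a single raw block.

    Assumption: the content is one unchunked raw block. For chunked files
    the real root CID cannot be derived from the whole-file hash.
    """
    if len(digest) != 32:
        raise ValueError("expected a 32-byte SHA-256 digest")
    # <version><codec><hash function><digest length><digest>
    cid_bytes = bytes([CIDV1, RAW_CODEC, SHA2_256, len(digest)]) + digest
    # Multibase base32: lowercase, no padding, prefixed with "b".
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


if __name__ == "__main__":
    # sha256sum output for the WARC listed above
    digest = bytes.fromhex(
        "9fa2f20ae30cb69a5ebfcc66d5492ce3dfdbbf40a3ac48bdcb5a983046b8ad64"
    )
    print(raw_cid_from_sha256(digest))
```

A CID produced this way is only a candidate to try; if the data was packed differently, a retrieval for it will simply find nothing.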

Does someone know a corresponding CID?

I have read “CID concept is broken” (post #16 by hector) and more.

Hi @Federico, I am working on this project at IA. Are you looking to download the complete dataset, i.e. a collection-level CID? We are currently generating DAGs for easier retrieval of this data, and I can post an update here once that’s done (should be ~2 weeks max).

Fantastic. I don’t have 500 TB available, though I’d love to try mirroring the entire thing if there is some way for the mirror to be useful (i.e. to serve retrievals, if I understand the terminology correctly). So I’d like to download some selection, let’s say 8 TB to start. I wasn’t aware of the concept of collection-level CID, it sounds fascinating: if you have some links to further resources to learn about it, please share (source code is fine too).