I’m trying to use Lassie to download the End of Term 2020 Dataset, but I can’t find any mention of the CID to use in the latest announcement, the annual report, or the previous announcement.
If I download some of the files from AWS S3 and hash them, can I derive the CID from those hashes? Can I start from one CID and find related ones, for example everything sealed around the same time by the same storage provider?
For example, if I start from the first WARC listed, crawl-data/EOT-2020/segments/IA-000/warc/EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz, these are some of its checksums:
$ for f in EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz ; do sha1sum $f ; sha256sum $f ; sha512sum $f ; md5sum $f ; crc32 $f ; done
72699fbff9f143d9ea1e51389f531287defbbdbf EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz
9fa2f20ae30cb69a5ebfcc66d5492ce3dfdbbf40a3ac48bdcb5a983046b8ad64 EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz
686711a5cef31131ed597553b40a41529678733eb27dd1bac81aa7f12cccd10df91e9e394cc60c7a936a3b6f2b6b2aca3896b6ed6608db827d38f8223c239b39 EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz
8c89de9ad596515892555dce90ebe29f EOT20-20201009165718-crawl800_EOT20-20201009165718-00000.warc.gz
ba07cdea
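To make the "derive the CID from those hashes" question concrete, here is a minimal sketch of what I would try with the SHA-256 digest above. It wraps the digest in a multihash (sha2-256, code 0x12) and a CIDv1 with the raw codec (0x55), encoded as lowercase base32. This assumes the whole file was stored as one raw block, which is probably not the case: as far as I understand, large files are normally chunked into a DAG, so the root CID would not be a hash of the file bytes and this value may not match anything on the network. The function name is my own, not from any library.

```python
import base64


def cid_v1_raw_from_sha256(hex_digest: str) -> str:
    """Build a CIDv1 (raw codec, sha2-256) string from a hex SHA-256 digest.

    Assumption: the content is a single raw block. Chunked files get a
    different root CID that cannot be computed from the file hash alone.
    """
    digest = bytes.fromhex(hex_digest)
    if len(digest) != 32:
        raise ValueError("expected a 32-byte SHA-256 digest")

    # Multihash: <hash function code 0x12 = sha2-256><digest length 0x20><digest>
    multihash = bytes([0x12, 0x20]) + digest

    # CIDv1: <version 0x01><content codec 0x55 = raw><multihash>
    cid_bytes = bytes([0x01, 0x55]) + multihash

    # Multibase: prefix "b" means lowercase base32 without padding
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


if __name__ == "__main__":
    sha256_hex = "9fa2f20ae30cb69a5ebfcc66d5492ce3dfdbbf40a3ac48bdcb5a983046b8ad64"
    print(cid_v1_raw_from_sha256(sha256_hex))
```

Any CID built this way starts with "bafkrei", because the version, codec, and multihash header bytes are fixed. The other checksums (SHA-1, SHA-512, MD5, CRC32) could not be used the same way unless the data had been added with a matching hash function.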
Does anyone know the corresponding CID?
I have read "CID concept is broken - #16 by hector" and more.