Repo seems stuck at (as-yet nonexistent) version 14 (?) and cannot revert migration [SOLVED with a hack]

So, with Kubo out, I thought it was time to tinker with it and manually compile the latest & greatest version (0.15.0-dev) on my public node, straight from the GitHub sources, using Go 1.18.4 under Ubuntu 22.04.1 LTS.

How naïve of me…

Basically, the compilation didn't work. I did get binaries for ipfs and ipfs-update, and they certainly launched, but they had some issues with the quic-go component (which is not directly under the dev team's control, at least not totally, and possibly requires compiling with earlier versions of Go). This isn't easily fixable (at least not yet), so it was time to download the official binaries for kubo, including ipfs, ipfs-update and (for good measure) fs-repo-migrations, and install them by simply placing them in the proper places (in my case, /usr/bin, owned by the user ipfs).
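For the record, the manual install was roughly this (a sketch from memory; the exact tarball name depends on the version/architecture, and the ownership bits match my setup, where the daemon runs as the user ipfs):

$ tar -xzf kubo_v0.14.0_linux-amd64.tar.gz        # official Linux amd64 tarball from dist.ipfs.io
$ sudo install -o ipfs -g ipfs -m 755 kubo/ipfs /usr/bin/ipfs
$ ipfs version                                    # sanity check: should report 0.14.0

(And the same dance for ipfs-update and fs-repo-migrations.)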

And here is the strange thing that I got:

$ ipfs daemon
Initializing daemon...
Kubo version: 0.14.0
Repo version: 12
System version: amd64/linux
Golang version: go1.18.3

Error: Your programs version (12) is lower than your repos (14).
Please update ipfs to a version that supports the existing repo, or run
a migration in reverse.

See https://github.com/ipfs/fs-repo-migrations/blob/master/run.md for details.

Ok. That was baffling, since at the time of writing, repo version 12 is the latest one. There is no version 14. So where did that come from?!

All right, I read the instructions, and it's not the first time I've had to tweak a migration manually; it's usually a painless experience. So, I tried to do a reverse migration (also as the user ipfs) the following way:

$ fs-repo-migrations --revert-ok
Found fs-repo version 14 at /var/local/ipfs/.ipfs
Do you want to upgrade this to version 12? [y/n] y
2022/08/03 12:16:28 Looking for suitable migration binaries.
2022/08/03 12:16:28 Need 2 migrations, downloading.
2022/08/03 12:16:28 Downloading migration: fs-repo-13-to-14...
2022/08/03 12:16:28 Downloading migration: fs-repo-12-to-13...
2022/08/03 12:16:28 could not get latest version of migration fs-repo-12-to-13: GET https://ipfs.io/ipns/dist.ipfs.io/fs-repo-12-to-13/versions error: 404 Not Found: ipfs resolve -r /ipns/dist.ipfs.io/fs-repo-12-to-13/versions: no link named "fs-repo-12-to-13" under QmcSiPhsEFp8iHZVC1zGj8Yd4FxqHnCMtM5TJ9yMaVneCa
2022/08/03 12:16:28 could not get latest version of migration fs-repo-13-to-14: GET https://ipfs.io/ipns/dist.ipfs.io/fs-repo-13-to-14/versions error: 404 Not Found: ipfs resolve -r /ipns/dist.ipfs.io/fs-repo-13-to-14/versions: no link named "fs-repo-13-to-14" under QmcSiPhsEFp8iHZVC1zGj8Yd4FxqHnCMtM5TJ9yMaVneCa
2022/08/03 12:16:28 Failed to download migrations.
ipfs migration:  failed to download migrations: fs-repo-13-to-14 fs-repo-12-to-13

(fs-repo-migrations is v2.0.2, straight from the official binaries for Linux amd64)

Ok, this makes sense: there are no repo versions above 12, so naturally there are no official 'patches' from 12 to 13 or from 13 to 14. To check that there hadn't been an overnight upgrade for some reason, I tried:

$ fs-repo-migrations -v
12

Again, that's exactly what is expected.

So, how on Earth did IPFS jump its repo version to 14?!?

My only (rational) explanation is that, during my many attempts at compiling a more recent version from the GitHub sources, and testing each run, one of them may have corrupted the repo version number, writing an (as yet) nonexistent version 14 (strangely, not version 13, which would make more sense). It might not even have corrupted the data itself, only the version number; but there is no way I can tell, since the only tool I've got to fix/repair things is ipfs itself, and it's as baffled as I am by a version 'coming from the future'.

I was reading the GitHub notes for fs-repo-migrations, and I now roughly understand how the process works: basically, for every repo version bump, one is expected to provide a plugin which performs the migration but also the reverse migration; such plugins are applications with their own Git (sub-)repository. Taking a peek at the source for fs-repo-migrations, it's clear that the latest and greatest repo version is, indeed, 12, and that all the migration plugins are available, up to 11-to-12, which was developed half a year ago. Again, no surprises there.

There are instructions on how to write one's own migration plugin. I could do that, of course, but then I would have to know what changes were made to the repo in versions 13 and 14, neither of which exists yet. This is really quite intriguing, and I suspect that there aren't actually any changes in my repo at all, except for the wrongly set version number.
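(As an aside, and going only by my reading of run.md rather than having actually done it: each of those plugins is a standalone binary that fs-repo-migrations downloads and runs for you, but which you can apparently also run by hand against a repo path, something along these lines; the flag names are as I recall them from the docs, so double-check before relying on them:

$ fs-repo-11-to-12 -path=/var/local/ipfs/.ipfs -verbose          # forward migration
$ fs-repo-11-to-12 -path=/var/local/ipfs/.ipfs -verbose -revert  # reverse migration

Which is also why a 12-to-13 or 13-to-14 plugin simply cannot exist today.)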

Naturally, I tried to read some of the answers provided here (such as this old one) but none applied to my particular case.

Well, frustrated, all that remained for me to do was to make a repo backup and then manually edit ~/.ipfs/version, which did indeed contain the version number 14, and change it to 12.
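In concrete terms, the whole 'hack' boils down to this (a sketch; the service name is whatever you use to manage the daemon, and ~/.ipfs is /var/local/ipfs/.ipfs in my case):

$ sudo systemctl stop ipfs                       # stop the daemon first
$ tar -czf ipfs-repo-backup.tar.gz -C ~ .ipfs    # full repo backup, just in case
$ cat ~/.ipfs/version
14
$ echo 12 > ~/.ipfs/version
$ sudo systemctl start ipfs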

Huh. That seemed to do the trick. ipfs now launches without complaints and, as far as I can see, all the data I had (namely, the pinned files/directories) seems to be in the right place. The configuration seems to be working well, since all the extra goodies I've got running on top of ipfs seem to be operational. At least nothing shows up in the logs, and, as far as I can test things, they work.

Obviously, I'm happy now :slight_smile: and I'm pretty sure that even if I have lost some data, the built-in mechanisms (checksums/hashes, signatures, etc.) will eventually drop any truncated/mangled blocks, refresh them from one of the peers and, over time, repair everything (if, indeed, anything requires repairing…). However, I cannot be 100% sure that this is the case. I mean, if I went back and edited the version file to read, say, 11, things wouldn't work that easily; at best, ipfs would consider the version to be wrong and attempt a repair, which might not necessarily fail but just 'do nothing', eventually updating the version back to 12 again, and everything would continue to run smoothly.

But it could be much worse than that, and the migration process might simply break everything along the way. And, who knows, the backup might stop working as well; I have no idea whether the repo version is somehow 'embedded' in the blocks themselves. There are too many variables, too many things that can go wrong in different ways, and just because my configuration happens to be working by sheer chance, this is not really a 'fix' but rather a 'hack'.

Naturally, I'm still baffled by what changed the version number to '14'. All I can see is that the last change to the version file was made on July 20. I don't think I was tinkering with ipfs that early, but only much later, i.e. on July 29 or 30, so, strangely, this 'bump' to repo version 14 seems to have happened well before I attempted a new compilation.

And that's a bit scary, IMHO; I don't expect the ipfs daemon to simply 'decide' for itself what the repo version ought to be and change it silently in the background.

Then again, I was running a self-compiled developer version of ipfs before, so perhaps I stumbled upon a bug… but the truth is that the commit history for the ipfs sources doesn't seem to refer to such a bug (it's still possible that it has been fixed and described with an overarching commit message which includes the fix among others; I did not look into the code, mind you).

Has anyone ever encountered such a strange situation, i.e. ipfs suddenly and unexpectedly bumping the repo version number into the future, breaking its own ability to repair/migrate the repo?

And, besides manually editing the version number and praying that it works, is there any other way to get ipfs to check whether the repo structure actually matches the version it thinks it has? (That seems tricky to accomplish, though.)

Can anyone who has delved deep into the code confirm that the only place where ipfs and the other tools look for the repo version is ~/.ipfs/version?

Thanks in advance for any input or suggestions; as I said, my simple hack seems to be working for now, and I'll keep a close eye on the logs to see if any unexpected new errors pop up…

Cheers,

- Gwyn

No it doesn't; all versions of quic-go shipping in Kubo 0.15.0-dev support go1.18 and go1.17 for now (we will bump to go1.19 because I want to ship boringcrypto in dist).

Whatever build issue you had is a different bug; please report it.

which is not directly under the dev team's control

Just so you know, it's mainly being worked on by Marten these days, who is part of the go-libp2p team.

Even if it weren't in our control, we would still care about which libraries we include, and they have to work reasonably well. quic-go is definitely on the edge of what is acceptable (since it only supports the last two versions of Go).

It will also check that the configuration matches what is inside ~/.ipfs/datastore_spec.
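For reference, both are plain-text files at the root of the repo; on a default (flatfs) profile the spec looks roughly like this (the exact contents depend on your datastore configuration):

$ cat ~/.ipfs/version
12
$ cat ~/.ipfs/datastore_spec
{"mounts":[{"mountpoint":"/blocks","path":"blocks","shardFunc":"/repo/flatfs/shard/v1/next-to-last/2","type":"flatfs"},{"mountpoint":"/","path":"datastore","type":"levelds"}],"type":"mount"}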


Switching from 14 to 12 is two side-by-side bit flips (the ASCII characters '4', 0x34, and '2', 0x32, differ in two adjacent bits); that isn't characteristic of a cosmic-ray-induced bit flip, but looks more like your SSD or HDD dying, maybe a flaky cable to your disk, or possibly one of your memory lines (but then I would expect your system to work far less).

You should run a SMART check on your disk.

We have bugs, but no bugs that bump the version number like that afaik.

Compiling the program yourself shouldn't break that; it doesn't make sense to me.

Or maybe it happened at a different time that wasn't recorded in the mtime because of faulty hardware.

Hey @Jorropo !

Thanks for the superfast reply!

Ok, so, your reply scared me: the possibility of a faulty disk (or disks) had not occurred to me, and it would really be an extreme case, i.e. disk hardware failing all over the place. Since 'my' bare-metal server is not new, it would be reasonable to assume that the disks (which are HDDs, not SSDs) were starting to show some wear & tear beyond the ability of automatic repair. And that would be, indeed, a catastrophe waiting to happen.

While I cannot fully exclude that possibility, I ran as many tests as I could in the past hour. The hardware I use has two server-grade Seagate Enterprise HDDs spinning at 7200 rpm, configured as software RAID1, so, in theory at least, if one disk were seriously damaged, the system would be running from just the other one, though possibly with increased activity, which would lead to further wear & tear and possibly a much earlier total failure. And all the monitoring software I've got should have sent me a trillion warnings by now. That's the theory.

In practice, I ran tests to determine bad blocks, the state of the array, the (expected) overall status of both disks (considering their age), the temperature they're running at, and the kinds of faults that have been recorded in the recent past; roughly the checks sketched below.
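(The device and array names here stand in for my actual setup:)

$ cat /proc/mdstat                       # overall state of the RAID1 array
$ sudo mdadm --detail /dev/md0           # per-disk status within the array
$ sudo smartctl -a /dev/sda              # SMART attributes, temperature, error log
$ sudo smartctl -a /dev/sdb
$ sudo badblocks -sv /dev/sda            # read-only surface scan (very slow)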

Fortunately for me (and to ease my anxiety), none of the tools I used detected a single issue. Even the list of bad blocks is empty (and I'd expect to have quite a few by now), but, then again, this is only what the hardware controller reports to Linux; the Seagate Enterprise disks very likely have a hardware-based mechanism that automatically remaps physical bad blocks to spare areas never exposed to the OS anyway. It's only when all of those are used up that the Linux kernel starts detecting new bad blocks and updating its own internal lists. Allegedly, my tests show that the possibility of that happening is still very far in the future. Granted, these are statistical averages; anything might happen that 'suddenly' triggers a major disk fault, so that cannot be dismissed just because the 'statistics' say otherwise.

However, in such a situation, one has to set such edge cases aside, not by denying them, but by understanding that they are statistically unlikely, and trust that the OS, in general, has reasonable information about the real state of the disks. While the many layers between the OS and the physical, spinning platters may give different results (there are all sorts of 'tricks' in place in the many circuits and components that shield the disk from so-called 'direct access' by the Linux kernel), one has to at least trust what the device drivers and the kernel know about the disk status, and assume that 'reality' is not far from the system's understanding of the state the disks are in.

That said, and although I certainly did take your warning to heart, to the best of my knowledge the disks are 100% fine and fully operational, and should remain so for a long time to come despite their age. In other words, looking at the reports, I would look for different causes for any eventual 'errors'.

And in fact the next most likely cause is, of course, human error. There is just one person tinkering with the system, myself, and I can in all seriousness say that I don't remember ever taking a look at the version file (in fact, I wasn't even aware that such a file existed, or what it contained, before today!), but what I nevertheless did was grep version ~/.bash_history, and, sure enough, there it was: echo 14 > version.

DUH!

Now I could kick myself…

Why I ever did that is really beyond my understanding, but at least, from looking at .bash_history, it seems to have happened a few days ago, when I decided to try out Experimental.AcceleratedDHTClient and, afterwards, while trying to make sure that everything was working properly, I was doing some maintenance here and there… and for some utterly stupid and unfathomable reason I 'decided' that overwriting version with 14 was a good idea. It must have been very late indeed (sadly, my .bash_history doesn't keep timestamps…) and I definitely had most of my brain shut down while doing whatever I thought I was doing.
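(For context, enabling that experiment is itself just a one-line config change, no file editing required, so there was never any legitimate reason to touch version by hand:

$ ipfs config --json Experimental.AcceleratedDHTClient true
)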

It's clear from my insane actions that I was browsing through the datastore, found an old (0.12.1) version of ipfs lying around (possibly from some tests), deleted it promptly, then took a look at what version contained, and for some reason disliked what I saw and overwrote the file with '14'… possibly because my deluded mind assumed that the version file there should match the ipfs version for some reason, and I had just deleted 0.12.1, so…

Anyway. Of course things continued to work properly (ipfs doesn't seem to check version very often during regular operation) until the inevitable Linux kernel update forced me to do a (planned) reboot; when all the services came back up, ipfs was the only one failing… and that's what prompted me to try to understand, to no avail, what was wrong with it.

'Unix: 99% of all errors are permission errors; of the rest, 99% of all errors are human errors.'

All right, I'm glad that I'm the only one to blame for being stupid, and I do apologise for wasting your time with me :slight_smile:

Still, your information was accurate, very helpful and, ultimately, led me to the conclusion that there is just one major source of errors affecting the ipfs environment: myself!

On the other hand, I have no stomach to try compiling 0.15.0-dev now :slight_smile: I've lost too much time on this already. Although, like you, I'm very eager to see the implementation of boringcrypto as directly supported by Go 1.19; the PPA I'm using to automate upgrades (longsleep) has just released 1.19, but, in all seriousness, I'll leave that for the weekend… :smiley:

Thanks again for all your time and patience in answering me. I'm really glad that it was all a stupid human error, and that nothing is wrong with either my disks or with ipfs!


Even though this wasn't it in the end, FYI, it's fairly easy to check.

Ideally you run a checksummed FS (btrfs, XFS or ZFS); that will catch all errors automatically.

If you can't, you can very easily do a SMART check; that will only catch errors that are self-reported by your disks, which a dying disk should do.
Google "linux run SMART check disks"; it takes one or two commands to run.
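For instance, with smartmontools installed (adjust /dev/sda to your actual disk):

$ sudo apt install smartmontools
$ sudo smartctl -H /dev/sda           # quick overall health verdict
$ sudo smartctl -t short /dev/sda     # kick off a short self-test
$ sudo smartctl -a /dev/sda           # full attributes and self-test log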