I've been playing around with BitTorrent again lately. One of the things that's been nagging me is where "failed hashes" come from.
BitTorrent transfers files in smaller pieces, usually around 256 KB, and double-checks each one against its own SHA-1 hash (.torrent files are so big because they contain the list of hashes for all the pieces). Sometimes when you download a piece from a peer, it fails the hash check, meaning it was corrupted somewhere along the way.
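A minimal sketch of that verification step. The ".torrent" file's "pieces" field really is a flat string of concatenated 20-byte SHA-1 digests, one per piece; the function names here are illustrative, not from any real client:

```python
import hashlib

def verify_piece(piece_data: bytes, expected_sha1: bytes) -> bool:
    """Return True if a downloaded piece matches its hash from the .torrent file."""
    return hashlib.sha1(piece_data).digest() == expected_sha1

def hash_for_piece(pieces_field: bytes, index: int) -> bytes:
    """Slice the digest for one piece out of the .torrent "pieces" string."""
    return pieces_field[index * 20:(index + 1) * 20]
```

A client runs `verify_piece` on every completed piece; a mismatch is exactly the "failed hash" being discussed.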
One reason that's been documented on the web is that some Internet devices have bugs that corrupt data. D-Link routers have a "gaming" mode that tries to fix certain gaming protocols by rewriting what looks like your NATted IP address inside packet payloads. In 4 billion bytes of random/compressed data, it will occasionally mistake ordinary bytes for an IP address it needs to correct, thereby corrupting the chunk.
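A back-of-the-envelope check on that claim: if the device rewrites any 4-byte sequence matching one specific 32-bit value, the chance of a false match at each byte offset is 2**-32, so over roughly 4 billion offsets you'd expect about one spurious rewrite:

```python
# Expected false matches of one specific 4-byte value in ~4 GB of
# random data. Purely illustrative arithmetic, not a real measurement.
offsets = 4 * 10**9          # ~4 billion byte offsets in 4 GB of traffic
p_match = 1 / 2**32          # chance a given offset matches a fixed 32-bit value
expected_false_matches = offsets * p_match
print(expected_false_matches)  # roughly 0.93, i.e. about one corruption per 4 GB
```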
Another source of corruption is TCP itself. Its 16-bit checksum doesn't catch all multi-bit errors, so it will occasionally report a corrupted packet as good.
Finally, one source I've found is that large, contiguous chunks of a piece can arrive corrupted. I'm guessing that the file got corrupted on the sender's disk, by the file system or the drive itself.
This points to two obvious improvements for BitTorrent clients. First, senders should re-verify pieces before sending them (not just on reception) to catch pieces that have been corrupted on disk in the meantime. Second, clients could easily save the bad chunks and figure out why they were corrupted.
For example, a client could compare a bad chunk with the eventual re-download of a good copy and run tests on the regions where they differ. The nice thing about the TCP checksum algorithm is that you can run it over just those regions: if the corrupted piece and the good piece produce the same TCP checksum over the differing regions, there's a good chance the chunk was corrupted by a network problem.
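A sketch of that diagnostic, under a couple of assumptions I'm adding: region boundaries are rounded out to 16-bit word boundaries so the checksums are comparable, and the checksum helper is the RFC 1071 ones'-complement sum that TCP uses. None of these function names come from a real client:

```python
def inet_csum(data: bytes) -> int:
    """RFC 1071 ones'-complement sum over 16-bit words."""
    if len(data) % 2:
        data += b'\x00'
    s = 0
    for i in range(0, len(data), 2):
        s += (data[i] << 8) | data[i + 1]
        s = (s & 0xFFFF) + (s >> 16)
    return ~s & 0xFFFF

def differing_regions(good: bytes, bad: bytes):
    """Yield (start, end) word-aligned byte ranges where the copies differ."""
    i, n = 0, min(len(good), len(bad))
    while i < n:
        if good[i] != bad[i]:
            start = i - (i % 2)          # round start down to a word boundary
            while i < n and good[i] != bad[i]:
                i += 1
            yield (start, i + (i % 2))   # round end up to a word boundary
        else:
            i += 1

def same_tcp_checksum(good: bytes, bad: bytes) -> bool:
    """True if the corruption would have slipped past the TCP checksum."""
    regions = list(differing_regions(good, bad))
    g = b''.join(good[s:e] for s, e in regions)
    b = b''.join(bad[s:e] for s, e in regions)
    return inet_csum(g) == inet_csum(b)
```

If `same_tcp_checksum` returns True, the corruption is consistent with a packet that TCP delivered as "good," pointing toward a network cause rather than a disk one.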
Likewise, if an entire 4 KB stretch is corrupted, it's likely a disk error; if only a 4-byte stretch differs, it's likely the D-Link bug.
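Those size heuristics could be collapsed into a tiny classifier. The thresholds here are my illustrative guesses based on the sizes mentioned above, not values from any real client:

```python
def guess_cause(region_len: int) -> str:
    """Rough guess at a corruption cause from the size of a differing region."""
    if region_len <= 4:
        return "NAT rewrite (e.g. the D-Link gaming-mode bug)"
    if region_len >= 4096:
        return "disk error (filesystem-block sized corruption)"
    return "network corruption that beat the TCP checksum"
```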