Previous Topic

Next Topic

Book Contents

Book Index

Algorithm

The rsync utility uses an algorithm invented by the Australian computer programmer Andrew Tridgell for efficiently transmitting a structure (such as a file) across a communications link when the receiving computer already has a similar, but not identical, version of the same structure.

The recipient splits its copy of the file into fixed-size non-overlapping chunks and computes two checksums for each chunk: the MD5 hash, and a weaker rolling checksum. (Prior to version 3.0 of the protocol, released with rsync version 3.0.0, it used MD4 hashes rather than MD5). The sender computes the rolling checksum for every chunk of size  in its own version of the file, even overlapping chunks. This can be calculated efficiently because of a special property of the rolling checksum: if the rolling checksum of bytes  through  is R, the rolling checksum of bytes  through  can be computed from , byte , and byte  without having to examine the intervening bytes. Thus, if one had already calculated the rolling checksum of bytes , one could calculate the rolling checksum of bytes  solely from the previous checksum, and from bytes  and .

The rolling checksum used in rsync is based on Mark Adler's adler-32 checksum, which is used in zlib, and is itself based on Fletcher's checksum.

The sender then compares its rolling checksums with the set sent by the recipient to determine if any matches exist. If they do, it verifies the match by computing the hash for the matching block and by comparing it with the hash for that block sent by the recipient.

The sender then sends the recipient those parts of its file that did not match the recipient's blocks, along with information on where to merge these blocks into the recipient's version. This makes the copies identical. However, there is a small probability that differences between chunks in the sender and recipient are not detected, and thus remains uncorrected. This requires a simultaneous hash collision in MD5 and the rolling checksum. It is possible to generate MD5 collisions, and the rolling checksum is not cryptographically strong, but the chance for this to occur by accident is nevertheless extremely remote. With 128 bits from MD5 plus 32 bits from the rolling checksum, and assuming maximum entropy in these bits, the probability of a hash collision with this combined checksum is . The actual probability is a few times higher, since good checksums approach maximum output entropy but very rarely achieve it.

If the sender's and recipient's versions of the file have many sections in common, the utility needs to transfer relatively little data to synchronize the files.

While the rsync algorithm forms the heart of the rsync application that essentially optimizes transfers between two computers over TCP/IP, the rsync application supports other key features that aid significantly in data transfers or backup. They include compression and decompression of data block by block using zlib at sending and receiving ends, and support for protocols such as ssh that enables encrypted transmission of compressed and efficient differential data using rsync algorithm. Instead of ssh, stunnel can also be used to create an encrypted tunnel to secure the data transmitted. Finally, rsync is capable of limiting the bandwidth consumed during a transfer, a useful feature that few other standard file transfer protocols offer.

See Also

What is rsync?

Uses

Examples

Practical applications