
rsync's --fuzzy parameter: what is a 'similar named' file?

rsync has an option called '--fuzzy'. If the destination file is missing, for example because a file was moved or copied on the source side, it tries to find an identical or similar file in the same directory to use as the basis for the diff. It first looks for a file with the same timestamp and the same size as the file being transferred. If that fails, it looks for a 'similar named' file to compare against. However, although thousands of sources tell you that the fuzzy option exists, I didn't find a single site explaining what a 'similar named' file actually is.

So I started checking rsync's source code to verify that my backup plan (one new archive file each day, which is rsync'd to another host) will work as intended. I want rsync to transfer the newest archive using the previous day's file as the basis.

The starting point is the function find_fuzzy(..) in generator.c. The first part performs the timestamp and file size check, which is of no interest to me since both will always differ between my archives.

After that, it does the following (pseudo-code here):

initialize LOWEST_DISTANCE with 25
for each other FILE within the same directory do:
    calculate DISTANCE as 
        fuzzy_distance(FILE.name, INPUTFILE.name) + fuzzy_distance(FILE.suffix, INPUTFILE.suffix)
    if DISTANCE is lower or equal to LOWEST_DISTANCE then:
        set LOWEST_DISTANCE to DISTANCE
        remember FILE
return last remembered FILE

According to util.c, fuzzy_distance(..) is based on the Levenshtein distance.
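To make the pseudocode above easier to play with, here is a minimal Python sketch of the same loop. It rests on assumptions: it uses the plain, unweighted Levenshtein distance, whereas rsync's own fuzzy_distance(..) may weigh edits differently, and the helper names levenshtein, suffix and pick_basis are made up for this illustration and do not appear in rsync.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain, unweighted Levenshtein distance (assumption: rsync's own
    fuzzy_distance() may weigh edits differently)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def suffix(name: str) -> str:
    """Hypothetical helper: everything from the last dot on, or '' if none."""
    dot = name.rfind(".")
    return name[dot:] if dot > 0 else ""


def pick_basis(target, candidates, limit=25):
    """Mirror of the pseudocode: lowest distance wins, ties go to the last file."""
    lowest, chosen = limit, None
    for name in candidates:
        dist = (levenshtein(name, target)
                + levenshtein(suffix(name), suffix(target)))
        if dist <= lowest:  # "lower or equal", so later files win ties
            lowest, chosen = dist, name
    return chosen


if __name__ == "__main__":
    files = ["report-v1.txt", "notes.md", "report-v2.txt"]
    print(pick_basis("report-v3.txt", files))
```

Both "report-v1.txt" and "report-v2.txt" are one edit away from the target here, so the result depends only on the order in which the directory entries are visited.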

Conclusion (or TL;DR)

rsync's --fuzzy algorithm will choose the file with the lowest combined Levenshtein distance of its name and its suffix. If multiple files yield the same value (which might happen with timestamp-named archive files), the last one is taken. So my backup plan should work as intended in most cases, but not in all. For example, it will choose the archive '2013-12-10' when transferring the backup of '2013-12-11'. However, on '2014-01-01' it might pick '2013-12-01' instead of '2013-12-31', or even '2013-01-01', which is completely wrong. This means that I should choose another naming scheme to avoid unnecessary traffic.
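The year-change problem can be checked with a few lines of Python. This is only a sketch under the assumption that the name comparison behaves like a plain Levenshtein distance; the levenshtein function below is a stand-in written for this post, not rsync's fuzzy_distance(..), and the bare date strings stand for the archive names.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance; a stand-in for rsync's fuzzy_distance()."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # Distance between the first i characters of a and the first j of b.
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))
    return d(len(a), len(b))


new = "2014-01-01"
for old in ("2013-01-01", "2013-12-01", "2013-12-31"):
    print(old, levenshtein(old, new))
```

Under that assumption the distances come out as 1, 3 and 4, so the archive from a year earlier is the closest name and the one from the day before is the furthest of the three.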

Alternative approach: Put each archive in a directory named after the date and call it "current.tar", with "parent.tar" in the same directory being a hard link to the previous "current.tar". In conjunction with --hard-links this should work, since the Levenshtein distance between "current.tar" and "parent.tar" is below 25 (it is actually 3). Of course, the source and target filesystems have to support hard links, and you must ensure that "parent.tar" is transmitted before "current.tar".
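A minimal sketch of how the source side of that layout could be prepared, in Python. The function name prepare_day and the directory names are hypothetical, and the sketch only builds the directory with the hard link; creating the archive itself and running rsync with --fuzzy and --hard-links are left out.

```python
import os
import tempfile
from typing import Optional


def prepare_day(root: str, day: str, previous_day: Optional[str]) -> str:
    """Create <root>/<day>/ and hard-link the previous day's current.tar
    into it as parent.tar. Returns the path where the new current.tar
    should be written. (Hypothetical helper, not part of rsync.)"""
    day_dir = os.path.join(root, day)
    os.makedirs(day_dir, exist_ok=True)
    if previous_day is not None:
        prev = os.path.join(root, previous_day, "current.tar")
        # os.link creates a hard link; it fails if the filesystem lacks support.
        os.link(prev, os.path.join(day_dir, "parent.tar"))
    return os.path.join(day_dir, "current.tar")


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    first = prepare_day(root, "2013-12-10", None)
    with open(first, "wb") as f:
        f.write(b"archive contents of day one")
    second = prepare_day(root, "2013-12-11", "2013-12-10")
    print("write the next archive to", second)
```

Since "parent.tar" is a hard link and not a copy, it takes no extra space on the source side, and with --hard-links the receiving side should be able to recreate it as a link to the archive it already has.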