rsync's --fuzzy parameter: what is a 'similar named' file?

rsync has a parameter called '--fuzzy'. If the destination file is not found, for example because you moved or copied a file, it tries to find an identical or similar file within the same directory to diff against. It does this by first looking for a file with the same timestamp and the same size as the file being transferred. If this fails, it tries to find a 'similar named' file to compare against. However, although thousands of sources tell you that the fuzzy option exists, I didn't find a single site explaining what a 'similar named' file actually is.

So I started checking rsync's source code to verify that my backup plan (one new archive file each day, which is rsync'd to another host) will work as intended. I want rsync to transfer the newest archive as a delta against the previous one.

The starting point is the function find_fuzzy(..) in generator.c. The first part does the timestamp and file size check, which is of no interest to me since both will always differ in my archives.

After that, it does the following (pseudo-code here):

initialize LOWEST_DISTANCE with 25
for each other FILE within the same directory do:
    calculate DISTANCE as
        fuzzy_distance(FILE.name, INPUTFILE.name) + fuzzy_distance(FILE.suffix, INPUTFILE.suffix)
    if DISTANCE is lower or equal to LOWEST_DISTANCE then:
        set LOWEST_DISTANCE to DISTANCE
        remember FILE
return last remembered FILE

According to util.c, fuzzy_distance(..) is based on the Levenshtein distance.

Conclusion (or TL;DR)

rsync's --fuzzy algorithm will choose the file with the lowest Levenshtein distance of its name and its suffix. If there are multiple files with the same resulting value (which might happen with timestamp-named archive files), the last one will be taken. So, my backup plan should work as intended in most cases, but not in all. For example, it will choose the archive '2013-12-10' when transferring the backup of '2013-12-11'. However, on '2014-01-01' it might pick '2013-12-01' instead of '2013-12-31', or even '2013-01-01', which is completely wrong. This means that I should choose another naming scheme to avoid unnecessary traffic.
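
To make this behaviour concrete, here is a minimal Python sketch of the selection loop and the date examples above. It is not rsync's code: it uses the textbook Levenshtein distance (the real fuzzy_distance(..) is only based on it and may weight things differently), and it assumes that the suffix is whatever os.path.splitext() returns and that the directory is walked in sorted order. The names levenshtein, name_distance and pick_fuzzy, as well as the '.tar' file names, are made up for illustration.

```python
import os
from typing import List, Optional

MAX_DISTANCE = 25  # initial LOWEST_DISTANCE from the pseudocode


def levenshtein(a: str, b: str) -> int:
    """Textbook Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def name_distance(candidate: str, target: str) -> int:
    """Distance of the full names plus distance of the suffixes."""
    # Assumption: "suffix" means the extension reported by splitext().
    cand_suffix = os.path.splitext(candidate)[1]
    target_suffix = os.path.splitext(target)[1]
    return levenshtein(candidate, target) + levenshtein(cand_suffix, target_suffix)


def pick_fuzzy(target: str, existing: List[str]) -> Optional[str]:
    """Return the remembered file, or None if nothing is within MAX_DISTANCE."""
    lowest = MAX_DISTANCE
    chosen = None
    for name in sorted(existing):  # assumption: sorted directory order
        dist = name_distance(name, target)
        if dist <= lowest:  # "<=" lets a later file win a tie
            lowest, chosen = dist, name
    return chosen


december = [f"2013-12-{day:02d}.tar" for day in range(1, 11)]
# '2013-12-01' and '2013-12-10' tie at distance 1; the later one wins.
print(pick_fuzzy("2013-12-11.tar", december))  # 2013-12-10.tar

new_year = ["2013-01-01.tar", "2013-12-01.tar", "2013-12-31.tar"]
print(pick_fuzzy("2014-01-01.tar", new_year))      # 2013-01-01.tar
print(pick_fuzzy("2014-01-01.tar", new_year[1:]))  # 2013-12-01.tar
```

With plain Levenshtein the year-old archive is only one substitution away from the new name, which is why the date-only naming scheme breaks down at month and year boundaries.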

Alternative approach: Put each archive into a directory named after its date and call it "current.tar", with a "parent.tar" next to it that is a hardlink to the previous "current.tar". In conjunction with --hard-links this should work, since the Levenshtein distance of "current.tar" and "parent.tar" is below 25 (actually it is 3). Of course, source and target filesystems have to support hardlinks, and you must ensure that "parent.tar" is transmitted before "current.tar".
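
The directory layout for this approach could be prepared roughly as follows. This is only a sketch of the source side under my own assumptions: the helper name add_daily_archive, the date-named directories below one root, and the dummy archive contents are all hypothetical, and it does not deal with the transfer order of the two files.

```python
import os
import shutil
import tempfile


def add_daily_archive(root, date, new_archive, previous_date=None):
    """Create <root>/<date>/current.tar and, if possible, parent.tar."""
    day_dir = os.path.join(root, date)
    os.makedirs(day_dir, exist_ok=True)
    shutil.copyfile(new_archive, os.path.join(day_dir, "current.tar"))
    if previous_date is not None:
        previous = os.path.join(root, previous_date, "current.tar")
        # Hardlink: parent.tar shares its inode with the previous current.tar.
        os.link(previous, os.path.join(day_dir, "parent.tar"))
    return day_dir


# Demo in a throwaway directory with two dummy archives.
with tempfile.TemporaryDirectory() as root:
    for name, payload in (("mon.tar", b"monday"), ("tue.tar", b"tuesday")):
        with open(os.path.join(root, name), "wb") as fh:
            fh.write(payload)

    first = add_daily_archive(root, "2013-12-10", os.path.join(root, "mon.tar"))
    second = add_daily_archive(root, "2013-12-11", os.path.join(root, "tue.tar"),
                               previous_date="2013-12-10")

    # True if both paths refer to the same inode.
    print(os.path.samefile(os.path.join(first, "current.tar"),
                           os.path.join(second, "parent.tar")))
```

On the receiving side, "parent.tar" then sits in the same directory as the missing "current.tar", which is where --fuzzy looks for its candidate.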

Annotation based migrations for XStream

While looking for a simple way to migrate outdated Java models serialized with XStream, I found XMT from Robin Shine, described at Migrate Serialized Java Objects with XStream and XMT. Hell, this was exactly what I was looking for! Instead of fiddling with multiple outdated Java classes that only exist for legacy reasons, just modify the DOM document before unmarshaling the thing with XStream!

Use autofs to wake fileserver on demand

This is the continuation of "Suspend NAS when idle".

I want to automatically resume my NAS server from suspend-to-RAM when accessing the media directory, e.g. via XBMC. I'm connecting my clients via sshfs, so if you're using SMB/CIFS, you have to modify the fstype parameters. The implementation is based upon

Required tools: