[Server-devel] Aggressive hardlinking strategy

Martin Langhoff martin.langhoff at gmail.com
Tue Aug 24 18:21:36 EDT 2010


On Fri, Aug 20, 2010 at 12:36 AM, Bernie Innocenti <bernie at codewiz.org> wrote:
>  # du -sh  --exclude datastore-200* /library/backup
>  92G    /library/backup
>
> So, backing up the last versions of all journals would take "just" 92GB,
> which would take more than 4 days on a 2mbit link for the initial
> backup.

Why do you have to exclude the datastore? du should be smart enough to
recognise the hardlinking strategy -- it only counts each inode once.

And I think we can make it much better, if it's true that the same
large files are present in many users' Journals.

There are a few "find identical files and hardlink them" scripts out
there. The "best" ones I've spotted are memory-bound and stupidly
hash every damn file... they will just make a mess on a busy XS.

[ Still, maybe you can test it overnight with one of these
un-optimised scripts... ]

Maybe we can take one and rework it so that it works in several passes:

1 - Run find -type f to list all files, and store them in 'buckets' by
size in bytes, keeping only the full path and inode. Find a smart way
to avoid being memory-bound...

2 - discard any bucket with only one member -- it can't contain duplicates

3 - in each bucket:
     - group by inode -- those are already hardlinked together
     - hash each distinct inode
     - coalesce files with identical hashes into hardlinks to the
       lowest-numbered inode...

Each bucket fits in memory, so step 3 can be done entirely in RAM
(rough sketch below)...
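
Here's a rough sketch of what those three passes could look like in
Python. Everything in it is illustrative (the sha1 choice, the
temp-link-then-rename trick, the lack of any locking against a live
datastore) -- it's not something I've run against an XS:

#!/usr/bin/env python
# Illustrative sketch only -- untested, no locking against a live datastore.
import os
import stat
import sys
import hashlib
from collections import defaultdict

def hash_inode(path, blocksize=1 << 20):
    """SHA-1 of the file contents, read in chunks to keep memory flat."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            h.update(chunk)
    return h.hexdigest()

def coalesce(root):
    # Pass 1: walk the tree, bucket regular files by size,
    # keeping only (inode, path) pairs.
    buckets = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if stat.S_ISREG(st.st_mode) and st.st_size > 0:
                buckets[st.st_size].append((st.st_ino, path))

    # Pass 2: a bucket with a single member cannot contain duplicates.
    candidates = [b for b in buckets.values() if len(b) > 1]

    # Pass 3: per bucket -- group by inode, hash one path per inode,
    # then hardlink identical content onto the lowest-numbered inode.
    for entries in candidates:
        by_inode = defaultdict(list)
        for ino, path in entries:
            by_inode[ino].append(path)

        by_hash = defaultdict(list)          # digest -> [(ino, [paths])]
        for ino in sorted(by_inode):
            digest = hash_inode(by_inode[ino][0])
            by_hash[digest].append((ino, by_inode[ino]))

        for digest, groups in by_hash.items():
            keeper_ino, keeper_paths = groups[0]   # lowest inode wins
            for ino, paths in groups[1:]:
                for path in paths:
                    tmp = path + '.hardlink-tmp'
                    os.link(keeper_paths[0], tmp)
                    os.rename(tmp, path)           # atomic swap, same fs

if __name__ == '__main__':
    coalesce(sys.argv[1])

The by-size bucketing in pass 1 is the part that needs the "smart way
to avoid being memory-bound" -- on a big /library/backup you'd probably
want to spool the (size, inode, path) tuples out to sort(1) or a small
on-disk DB instead of an in-memory dict.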

cheers,



m
-- 
 martin.langhoff at gmail.com
 martin at laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
