Treatise on Formatting FLASH Storage Devices
Mitch Bradley
wmb at laptop.org
Wed Feb 4 15:49:36 EST 2009
I am the author of the page in question. To establish my credentials, I
wrote my first filesystem forensic tool in 1980, to diagnose and repair
a Unix filesystem that had been damaged by a kernel misconfiguration that
made it swap on top of the filesystem. That was when 10 MB disk packs
the size of garbage can lids cost $5000.
Since then I have written filesystem readers, writers, and forensic
tools for UFS, ext2, FAT12/16/32, ISO-9660, NFS, Mac HFS, romfs, and
JFFS2. I have studied, with an eye toward implementation, the data
structures for NTFS, UBIFS, cramfs, and squashfs.
david at lang.hm wrote:
>
> so if the device is performing wear leveling, then the fact that your
> FAT is on the same eraseblock as your partition table should not
> matter in the least, since the wear leveling will avoid stressing any
> particular part of the flash.
That would be true in a perfect world, but wear leveling is hard to do
perfectly. Relocating requires maintaining two copies of the erase
block, as well as hidden metadata that tells you which copy is which,
plus a hidden allocation map. Updating all of these things in a way
that avoids catastrophic loss of the entire device (due to inconsistent
metadata) is tricky. Some FTLs get it (at least mostly) right, many
don't. FTL software is, after all, software, so obscure bugs are always
possible. Making hardware behave stably during power loss is triply
difficult.
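To make the bookkeeping concrete, here is a toy model of power-loss-safe
block relocation. It is a minimal sketch of the general idea, not any
particular FTL; the sequence-number-plus-checksum scheme and all names
are illustrative assumptions. To move a logical block, the FTL copies
the data to a new physical block, writes that copy's metadata (logical
number, next sequence number, checksum), and only then reuses the old
block. If power fails at any point, both copies still exist and the
newest copy with an intact checksum wins.

```python
import zlib

def make_meta(logical, seq, data):
    # Hidden per-copy metadata: which logical block this is, how recent
    # the copy is, and a checksum to detect a torn (interrupted) write.
    return {"logical": logical, "seq": seq, "crc": zlib.crc32(data)}

def pick_current(copies):
    """Given all physical copies of one logical block, return the data
    of the newest copy whose checksum is intact."""
    valid = [c for c in copies if zlib.crc32(c["data"]) == c["meta"]["crc"]]
    return max(valid, key=lambda c: c["meta"]["seq"])["data"]

# Old copy in physical block A; new copy written to block B during relocation:
old = {"data": b"v1", "meta": make_meta(7, seq=1, data=b"v1")}
new = {"data": b"v2", "meta": make_meta(7, seq=2, data=b"v2")}

# Power loss after the relocation completed: the newer valid copy wins.
assert pick_current([old, new]) == b"v2"

# Power loss mid-write: B's data is torn, its CRC fails, so A survives.
torn = {"data": b"vX", "meta": make_meta(7, seq=2, data=b"v2")}
assert pick_current([old, torn]) == b"v1"
```

Getting exactly this invariant to hold across every possible interruption
point, in firmware, on real hardware, is where FTLs go wrong.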
I suspect, based on cryptic hints in various specs and standards that
I've read, that some FTLs have special optimizations for FAT filesystems
with the factory-supplied layout. If the FAT is in a known "nice"
location, you can apply different caching and wear leveling policies to
that known hot-spot, and perhaps even reduce the overall metadata by
using the FAT as part of the block-substitution metadata for the data
area. Many manufacturers couldn't care less about what Linux hackers want
to do - their market being ordinary users who stick the device in a
camera - so such "cheat" optimizations are fair game from a business
standpoint.
>
> as such I see no point in worrying about the partition table being on
> the same eraseblock as a frequently written item.
Many filesystem layouts can recover from damage to the allocation maps,
either automatically or with an offline tool. It's possible to rebuild
ext2 allocation bitmaps from inode and directory information. For FAT
filesystems, there's a backup FAT copy that will at least let you roll
back to a semi-consistent recent state. But there's no redundant copy of
the partition map or the BPB. If you should lose one of those during a
botched write, it's bye-bye to all your data, barring mad forensic skills.
In stress testing of some "LBA NAND" devices, we saw several cases
where, after a fairly long period, the devices completely locked up and
lost the ability to read or rewrite the first block. I had done a bad
job of partitioning those devices, because I wasn't paying enough attention when I
created the test image. It's unclear what the results would have been
had the layout been better - the stress test takes several weeks and the
failures are statistical in nature - but I can't help believing that,
for a device with a known wear-out mechanism and elaborate workarounds
to hide that fact, working it harder than necessary will reduce its
lifetime and possibly trigger microcode bugs that might otherwise cause
no trouble.
>
> as for the block boundary not being an eraseblock boundary if the
> partition starts at block 1
>
> if you use 1k blocks and have 256k eraseblocks, then 1 out of every
> 256 writes will generate two erases instead of one
>
> worst case is you use 4k blocks and have 128k eraseblocks, at which
> point 1 out of every 32 writes will generate two erases instead of one.
>
> to use the intel terminology, these result in write amplification
> factors of approximately 1.005 and 1.03 respectively.
>
> neither of these qualify as a 'flash killer' in my mind.
The main amplification comes not from the erases, but from the writes.
If the "cluster/block space" begins in the middle of a FLASH page, then
a one-block write will involve a read-modify-write of two adjacent pages.
That is four internal accesses instead of one. Each such access takes
between 100 and 200 µs, depending on the degree to which you can
pipeline the accesses - and read-modify-write is hard to pipeline. So
the back-end throughput can easily be reduced by a factor of 4 or even
more. The write-amplification factor is 2 by a trivial analysis, and it
can get worse if you factor in the requirement for writing the pages
within an erase block sequentially. The implied coupling between the
two spanned pages increases the difficulty of replacement-page
allocation, increasing the probability of garbage collection.
The erase amplification factor tracks the write amplification factor.
You must do at least one erase for every 64 page writes - a 256 KB erase
block holds 64 pages of 4 KB each - assuming perfect efficiency of your
page-reassignment algorithm and its metadata. Double the writes, and you
at least double the erases.
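The access counts above can be checked with a few lines. The page and
erase-block sizes are just the example values used here; the
`backend_accesses` helper is mine, assuming whole-page programs and a
read-modify-write for any partially covered page:

```python
PAGE = 4096                # example NAND page size, bytes
ERASE_BLOCK = 64 * PAGE    # 256 KB erase block = 64 pages

def backend_accesses(offset, length=4096):
    """Internal page reads + writes for one host write of `length`
    bytes at byte `offset`, assuming whole-page programs and a
    read-modify-write for each partially covered page."""
    first = offset // PAGE
    last = (offset + length - 1) // PAGE
    writes = last - first + 1
    # Any page the write only partially covers must be read back first.
    reads = sum(1 for p in range(first, last + 1)
                if not (offset <= p * PAGE and (p + 1) * PAGE <= offset + length))
    return reads + writes

print(backend_accesses(0))     # aligned 4 KB write: 1 internal access
print(backend_accesses(2048))  # straddles two pages: 2 reads + 2 writes = 4
print(ERASE_BLOCK // PAGE)     # 64 page writes per erase, at best
```

The misaligned case does four back-end accesses where the aligned case
does one, which is the factor-of-4 throughput loss described above.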
>
> now, if a FAT or superblock happens to span an eraseblock, then you
> will have a much more significant issue, but nothing that is said in
> this document refers to this problem (and in fact, it indicates that
> things like this follow the start of the partition very closely, which
> implies that unless the partition starts very close to the end of an
> eraseblock it's highly unlikely that these will span eraseblocks)
>
> so I still see this as crying wolf.
It has been my experience that USB sticks and SD cards with intact
factory formatting tend to last longer and run faster than ones that
have been reformatted with random layouts. I don't have quantifiable
numbers, but I do have enough accumulated experience - I have over two
dozen FLASH devices within easy reach - to convince me that something
interesting is happening. And I know enough about how these things work
internally to convince me that aligned accesses inherently result in
less internal data traffic than unaligned accesses.
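One practical consequence: before trusting a reformatted layout, you can
at least check that the partition starts on an erase-block boundary. A
sketch, assuming 512-byte sectors; the 4 MB boundary is my guess at a
safe default, since real erase-block geometry varies by device and is
rarely published:

```python
SECTOR = 512

def is_aligned(start_lba, erase_block=4 * 1024 * 1024):
    """True if the partition's byte offset falls on an erase-block
    boundary. A generous boundary such as 4 MB covers most real
    erase-block sizes, which are powers of two at or below that."""
    return (start_lba * SECTOR) % erase_block == 0

print(is_aligned(63))    # classic DOS layout, sector 63: misaligned
print(is_aligned(8192))  # 4 MB boundary: aligned
```

The traditional start-at-sector-63 layout fails this check, which is one
reason reformatted sticks so often end up with every cluster write
straddling pages.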
>
> as for ubifs, that is designed for when you have access to the raw
> flash, which is not the case for any device where you have a flash
> translation layer in place, so it is really only useful on embedded
> system, not on commercially available flash drives of any type.
Indeed. The page in question has nothing whatsoever to do with UBIFS.