Treatise on Formatting FLASH Storage Devices
Mitch Bradley
wmb at laptop.org
Wed Feb 4 15:49:36 EST 2009
I am the author of the page in question. To establish my credentials, I
wrote my first filesystem forensic tool in 1980, to diagnose and repair
a Unix filesystem that had been damaged by a kernel misconfiguration that
made it swap on top of the filesystem. That was when 10 MB disk packs
the size of garbage can lids cost $5000.
Since then I have written filesystem readers, writers, and forensic
tools for UFS, ext2, FAT12/16/32, ISO-9660, NFS, Mac HFS, romfs, and
JFFS2. I have studied, with an eye toward implementation, the data
structures for NTFS, UBIFS, cramfs, and squashfs.
david at lang.hm wrote:
>
> so if the device is performing wear leveling, then the fact that your
> FAT is on the same eraseblock as your partition table should not
> matter in the least, since the wear leveling will avoid stressing any
> particular part of the flash.
That would be true in a perfect world, but wear leveling is hard to do
perfectly. Relocating requires maintaining two copies of the erase
block, as well as hidden metadata that tells you which copy is which,
plus a hidden allocation map. Updating all of these things in a way
that avoids catastrophic loss of the entire device (due to inconsistent
metadata) is tricky. Some FTLs get it (at least mostly) right, many
don't. FTL software is, after all, software, so obscure bugs are always
possible. Making hardware behave stably during power loss is triply
difficult.
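To make the bookkeeping concrete, here is a toy model of power-loss-safe
block relocation. It is a minimal sketch of the general idea, not any
particular FTL; the sequence-number-plus-checksum scheme and all names
are illustrative assumptions. To move a logical block, the FTL copies
the data to a new physical block, writes that copy's metadata (logical
number, next sequence number, checksum), and only then reuses the old
block. If power fails at any point, both copies still exist and the
newest copy with an intact checksum wins.

```python
import zlib

def make_meta(logical, seq, data):
    # Hidden per-copy metadata: which logical block this is, how recent
    # the copy is, and a checksum to detect a torn (interrupted) write.
    return {"logical": logical, "seq": seq, "crc": zlib.crc32(data)}

def pick_current(copies):
    """Given all physical copies of one logical block, return the data
    of the newest copy whose checksum is intact."""
    valid = [c for c in copies if zlib.crc32(c["data"]) == c["meta"]["crc"]]
    return max(valid, key=lambda c: c["meta"]["seq"])["data"]

# Old copy in physical block A; new copy written to block B during relocation:
old = {"data": b"v1", "meta": make_meta(7, seq=1, data=b"v1")}
new = {"data": b"v2", "meta": make_meta(7, seq=2, data=b"v2")}

# Power loss after the relocation completed: the newer valid copy wins.
assert pick_current([old, new]) == b"v2"

# Power loss mid-write: B's data is torn, its CRC fails, so A survives.
torn = {"data": b"vX", "meta": make_meta(7, seq=2, data=b"v2")}
assert pick_current([old, torn]) == b"v1"
```

Getting exactly this invariant to hold across every possible interruption
point, in firmware, on real hardware, is where FTLs go wrong.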
I suspect, based on cryptic hints in various specs and standards that
I've read, that some FTLs have special optimizations for FAT filesystems
with the factory-supplied layout. If the FAT is in a known "nice"
location, you can apply different caching and wear leveling policies to
that known hot-spot, and perhaps even reduce the overall metadata by
using the FAT as part of the block-substitution metadata for the data
area. Many manufacturers couldn't care less about what Linux hackers want
to do - their market being ordinary users who stick the device in a
camera - so such "cheat" optimizations are fair game from a business
standpoint.
>
> as such I see no point in worrying about the partition table being on
> the same eraseblock as a frequently written item.
Many filesystem layouts can recover from damage to the allocation maps,
either automatically or with an offline tool. It's possible to rebuild
ext2 allocation bitmaps from inode and directory information. For FAT
filesystems, there's a backup FAT copy that will at least let you roll
back to a semi-consistent recent state. But there's no redundant copy of
the partition map or the BPB. If you should lose one of those during a
botched write, it's bye-bye to all your data, barring mad forensic skills.
In stress testing of some "LBA NAND" devices, we saw several cases
where, after a fairly long period, the devices completely locked up and
lost the ability to read or rewrite the first block. I had done a bad
job of partitioning those devices, because I wasn't paying enough attention when I
created the test image. It's unclear what the results would have been
had the layout been better - the stress test takes several weeks and the
failures are statistical in nature - but I can't help believing that,
for a device with a known wear-out mechanism and elaborate workarounds
to hide that fact, working it harder than necessary will reduce its
lifetime and possibly trigger microcode bugs that might otherwise cause
no trouble.
>
> as for the block boundary not being an eraseblock boundary if the
> partition starts at block 1
>
> if you use 1k blocks and have 256k eraseblocks, then 1 out of every
> 256 writes will generate two erases instead of one
>
> worst case is you use 4k blocks and have 128k eraseblocks, at which
> point 1 out of every 32 writes will generate two erases instead of one.
>
> to use the intel terminology, these result in write amplification
> factors of approximately 1.005 and 1.03 respectively.
>
> neither of these qualify as a 'flash killer' in my mind.
The main amplification comes not from the erases, but from the writes.
If the "cluster/block space" begins in the middle of a FLASH page, then
a one-block write will involve a read-modify-write of two adjacent pages.
That is four internal accesses instead of one. Each such access takes
between 100 and 200 µs, depending on the degree to which you can
pipeline the accesses - and read-modify-write is hard to pipeline. So
the back-end throughput can easily be reduced by a factor of 4 or even
more. The write-amplification factor is 2 by a trivial analysis, and it
can get worse if you factor in the requirement for writing the pages
within an erase block sequentially. The implied coupling between the
two spanned pages increases the difficulty of replacement-page
allocation, increasing the probability of garbage collection.
The erase amplification factor tracks the write amplification factor.
You must do at least one erase for every 64 page writes - a 256 KB erase
block holds 64 pages of 4 KB each - assuming perfect efficiency of your
page-reassignment algorithm and its metadata. Double the writes, and you
at least double the erases.
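The access counts above can be checked with a few lines. The page and
erase-block sizes are just the example values used here; the
`backend_accesses` helper is mine, assuming whole-page programs and a
read-modify-write for any partially covered page:

```python
PAGE = 4096                # example NAND page size, bytes
ERASE_BLOCK = 64 * PAGE    # 256 KB erase block = 64 pages

def backend_accesses(offset, length=4096):
    """Internal page reads + writes for one host write of `length`
    bytes at byte `offset`, assuming whole-page programs and a
    read-modify-write for each partially covered page."""
    first = offset // PAGE
    last = (offset + length - 1) // PAGE
    writes = last - first + 1
    # Any page the write only partially covers must be read back first.
    reads = sum(1 for p in range(first, last + 1)
                if not (offset <= p * PAGE and (p + 1) * PAGE <= offset + length))
    return reads + writes

print(backend_accesses(0))     # aligned 4 KB write: 1 internal access
print(backend_accesses(2048))  # straddles two pages: 2 reads + 2 writes = 4
print(ERASE_BLOCK // PAGE)     # 64 page writes per erase, at best
```

The misaligned case does four back-end accesses where the aligned case
does one, which is the factor-of-4 throughput loss described above.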
>
> now, if a FAT or superblock happens to span an eraseblock, then you
> will have a much more significant issue, but nothing that is said in
> this document refers to this problem (and in fact, it indicates that
> things like this follow the start of the partition very closely, which
> implies that unless the partition starts very close to the end of an
> eraseblock it's highly unlikely that these will span eraseblocks)
>
> so I still see this as crying wolf.
It has been my experience that USB sticks and SD cards with intact
factory formatting tend to last longer and run faster than ones that
have been reformatted with random layouts. I don't have quantifiable
numbers, but I do have enough accumulated experience - I have over two
dozen FLASH devices within easy reach - to convince me that something
interesting is happening. And I know enough about how these things work
internally to convince me that aligned accesses inherently result in
less internal data traffic than unaligned accesses.
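One practical consequence: before trusting a reformatted layout, you can
at least check that the partition starts on an erase-block boundary. A
sketch, assuming 512-byte sectors; the 4 MB boundary is my guess at a
safe default, since real erase-block geometry varies by device and is
rarely published:

```python
SECTOR = 512

def is_aligned(start_lba, erase_block=4 * 1024 * 1024):
    """True if the partition's byte offset falls on an erase-block
    boundary. A generous boundary such as 4 MB covers most real
    erase-block sizes, which are powers of two at or below that."""
    return (start_lba * SECTOR) % erase_block == 0

print(is_aligned(63))    # classic DOS layout, sector 63: misaligned
print(is_aligned(8192))  # 4 MB boundary: aligned
```

The traditional start-at-sector-63 layout fails this check, which is one
reason reformatted sticks so often end up with every cluster write
straddling pages.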
>
> as for ubifs, that is designed for when you have access to the raw
> flash, which is not the case for any device where you have a flash
> translation layer in place, so it is really only useful on embedded
> system, not on commercially available flash drives of any type.
Indeed. The page in question has nothing whatsoever to do with UBIFS.