Treatise on Formatting FLASH Storage Devices

david at lang.hm david at lang.hm
Thu Feb 5 01:51:39 EST 2009


On Wed, 4 Feb 2009, Mitch Bradley wrote:

> david at lang.hm wrote:
>> 
>> so if the device is performing wear leveling, then the fact that your FAT 
>> is on the same eraseblock as your partition table should not matter in the 
>> least, since the wear leveling will avoid stressing any particular part of
>> the flash.
>
> That would be true in a perfect world, but wear leveling is hard to do 
> perfectly.  Relocating requires maintaining two copies of the erase block, as 
> well as hidden metadata that tells you which copy is which, plus a hidden 
> allocation map.  Updating all of these things in a way that avoids 
> catastrophic loss of the entire device (due to inconsistent metadata) is 
> tricky.  Some FTLs get it (at least mostly) right, many don't.  FTL software 
> is, after all, software, so obscure bugs are always possible.  Making 
> hardware behave stably during power loss is triply difficult.

so it sounds like you are basically saying that if the FAT/superblock gets 
corrupted due to a bug in the FTL software it's easier to recover than if 
the partition table gets corrupted, so isolating the two is a benefit. is 
this a fair reading?

I will note that even if you never write to the partition table, that 
eraseblock will still migrate around the media (the fact that it's never 
written to makes it a good candidate to swap with a high-usage block). it 
will move less often, but it will still move.
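
for what it's worth, here's a toy sketch (python, with made-up structures, 
not how any particular FTL actually does it) of the kind of swap I mean: the 
FTL keeps a logical-to-physical map, and a cold eraseblock like the one 
holding the partition table gets traded with a hot one once the wear gap 
gets big enough.

    # toy static wear-leveling sketch -- hypothetical structures, not a real FTL
    logical_to_physical = {0: 0, 1: 1, 2: 2, 3: 3}       # logical eraseblock -> physical
    erase_counts        = {0: 3, 1: 950, 2: 40, 3: 35}   # erases per physical eraseblock

    def static_wear_level(threshold=500):
        # find the coldest and hottest physical eraseblocks
        cold = min(erase_counts, key=erase_counts.get)
        hot  = max(erase_counts, key=erase_counts.get)
        if erase_counts[hot] - erase_counts[cold] < threshold:
            return
        # which logical blocks currently live there?
        rev = dict((p, l) for l, p in logical_to_physical.items())
        # copy the cold block's data over to the worn block (data move omitted),
        # then swap the map entries so future writes land on the little-used block
        logical_to_physical[rev[cold]], logical_to_physical[rev[hot]] = hot, cold

    static_wear_level()
    print(logical_to_physical)   # the "never written" logical block 0 has moved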

> I suspect, based on cryptic hints in various specs and standards that I've 
> read, that some FTLs have special optimizations for FAT filesystems with  the 
> factory-supplied layout.  If the FAT is in a known "nice" location, you can 
> apply different caching and wear leveling policies to that known hot-spot,

this makes sense

> and perhaps even reduce the overall metadata by using the FAT as part of the 
> block-substitution metadata for the data area.

this I don't understand.

> Many manufacturers couldn't care
> less about what Linux hackers want to do - their market being ordinary users 
> who stick the device in a camera - so such "cheat" optimizations are fair 
> game from a business standpoint.

this is definitely true

>> as such I see no point in worrying about the partition table being on the 
>> same eraseblock as a frequently written item.
>
> Many filesystem layouts can recover from damage to the allocation maps, 
> either automatically or with an offline tool.  It's possible to rebuild ext2 
> allocation bitmaps from inode and directory information.  For FAT 
> filesystems, there's a backup FAT copy that will at least let you roll back 
> to a semi-consistent recent state.  But there's no redundant copy of the
> partition map or the BPB.  If you should lose one of those during a botched 
> write, it's bye-bye to all your data, barring mad forensic skills.

I've recovered from partition table mistakes in the past; it's not that 
hard (and in cases like flash, where the media is small enough that there 
is usually only one partition, it becomes about as close to trivial as 
such things can be).
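
as a rough illustration (python; this assumes a plain MBR-partitioned device 
with 512-byte sectors, and the device path is just an example), the whole 
partition table is four 16-byte entries at offset 446, so dumping it or 
re-creating a single entry by hand is not a lot of work:

    import struct

    def read_mbr_partitions(path):
        # classic MBR: 446 bytes of boot code, then four 16-byte partition entries
        with open(path, 'rb') as f:
            mbr = f.read(512)
        for i in range(4):
            entry = mbr[446 + 16 * i : 446 + 16 * (i + 1)]
            # status byte, skip CHS start, type byte, skip CHS end, LBA start, sector count
            status, ptype, lba_start, sectors = struct.unpack('<B3xB3xII', entry)
            if ptype:
                print("partition %d: type 0x%02x, start LBA %d, %d sectors"
                      % (i + 1, ptype, lba_start, sectors))

    # read_mbr_partitions('/dev/sdb')   # example path; needs read access to the device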

> In stress testing of some "LBA NAND" devices, we saw several cases where, 
> after a fairly long period, the devices completely locked up and lost the 
> ability to read or rewrite the first block.  I had done a bad job of 
> partitioning it, because I wasn't paying enough attention when I created the 
> test image.  It's unclear what the results would have been had the layout 
> been better - the stress test takes several weeks and the failures are 
> statistical in nature - but I can't help believing that, for a device with a 
> known wear-out mechanism and elaborate workarounds to hide that fact, working 
> it harder than necessary will reduce its lifetime and possibly trigger 
> microcode bugs that might otherwise cause no trouble.

an interesting data point, but not something that I would call conclusive 
(especially when some of the elaborate workarounds you are referring to 
are speculation rather than documented behavior)

>> as for the block boundary not being an eraseblock boundary if the partition
>> starts at block 1
>> 
>> if you use 1k blocks and have 256k eraseblocks, then 1 out of every 256 
>> writes will generate two erases instead of one
>> 
>> worst case is you use 4k blocks and have 128k eraseblocks, at which point 1 
>> out of every 32 writes will generate two erases instead of one.
>> 
>> to use the intel terminology, these result in write amplification factors 
>> of approximately 1.005 and 1.03 respectively.
>> 
>> neither of these qualifies as a 'flash killer' in my mind.
>
> The main amplification comes not from the erases, but from the writes.  If 
> the "cluster/block space" begins in the middle of FLASH page, then 1-block 
> write will involve a read-modify-write of two adjacent pages.  That is four 
> internal accesses instead of one.  Each such access takes between 100 and 200 
> uS, depending on the degree to which you can pipeline the accesses - and 
> read-modify-write is hard to pipeline.  So the back-end throughput can easily 
> be reduced by a factor of 4 or even more.  The write-amplification factor is 
> 2 by a trivial analysis,

the write amplification factor is 2 for those blocks that span an 
eraseblock boundary, but since 255 out of 256 data blocks do not span a 
boundary, overall that is an amplification factor of ~1.005.

as I see it the math is (( 1 * 2 ) + 255) / 256

if you are unlucky enough to have a 'hot' block of data in the wrong spot 
it could be more significant, but assuming something close to random 
access it's not that bad.
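
to make the arithmetic concrete, here's a quick python version of the same 
calculation (assuming exactly one block per eraseblock straddles a boundary 
and access is roughly uniform):

    def write_amplification(block_kb, eraseblock_kb):
        blocks_per_eb = eraseblock_kb // block_kb
        # one block per eraseblock straddles a boundary and costs two writes,
        # the other (blocks_per_eb - 1) blocks cost one write each
        return (1 * 2 + (blocks_per_eb - 1)) / float(blocks_per_eb)

    print(write_amplification(1, 256))   # 1k blocks, 256k eraseblocks -> ~1.004
    print(write_amplification(4, 128))   # 4k blocks, 128k eraseblocks -> ~1.03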

> and it can get worse if you factor in the 
> requirement for writing the pages within an erase block sequentially.

I don't understand this statement.

>  The 
> implied coupling between the two spanned pages increases the difficulty of 
> replacement-page allocation, increasing the probability of garbage 
> collection.

other than two eraseblocks having loosely coupled write counts I don't see 
this.

and they would only be loosely coupled because the writes to the other 255 
(or 31 in the worst case) pages happen independently of the other block.

> The erase amplification factor tracks the write amplification factor.  You 
> must do at least one erase for every 64 writes, assuming perfect efficiency 
> of your page-reassignment algorithm and its metadata.  Double the writes, at 
> least double the erases.

double the writes and you double the erases, but you are only doubling the 
writes for a small percentage of them.

this will trigger wear leveling a little sooner, but only by the same 
~1.005/1.03 factor.
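
a quick sketch of that in python, using Mitch's one-erase-per-64-writes 
figure and the same one-straddling-block-per-eraseblock assumption as above:

    writes = 1000000                   # single-block writes, roughly random
    pages_per_erase = 64               # "at least one erase for every 64 writes"

    for blocks_per_eb in (256, 32):    # the 1k/256k and 4k/128k geometries
        # the straddling block turns 1 write into 2, the rest stay at 1
        actual_writes = writes * (blocks_per_eb + 1) / float(blocks_per_eb)
        erases_aligned    = writes / float(pages_per_erase)
        erases_misaligned = actual_writes / pages_per_erase
        print("%3d blocks/eraseblock: %.0f vs %.0f erases (factor %.3f)"
              % (blocks_per_eb, erases_aligned, erases_misaligned,
                 erases_misaligned / erases_aligned))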

>> 
>> now, if a FAT or superblock happens to span an eraseblock, then you will 
>> have a much more significant issue, but nothing that is said in this 
>> document refers to this problem (and in fact, it indicates that things like 
>> this follow the start of the partition very closely, which implies that 
>> unless the partition starts very close to the end of an eraseblock it's 
>> highly unlikely that these will span eraseblocks)
>> 
>> so I still see this as crying wolf.
>
> It has been my experience that USB sticks and SD cards with intact factory 
> formatting tend to last longer and run faster than ones that have been 
> reformatted with random layouts.  I don't have quantifiable numbers, but I do 
> have enough accumulated experience - I have over two dozen FLASH devices 
> within easy reach - to convince me that something interesting is happening. 
> And I know enough about how these things work internally to convince me that 
> aligned accesses inherently result in less internal data traffic than 
> unaligned accesses.

or is it that the people who reformat them tend to be heavy users? it may 
be correlation, not causation.

David Lang



