LBA NAND corruption

Mitch Bradley wmb at
Tue Oct 21 14:41:48 EDT 2008

David Woodhouse wrote:
> On Tue, 2008-10-21 at 12:22 -0400, John Watlington wrote:
>> Mitch,
>>     One of the LBA NAND test machines killed its MBR.
>> It started with a failed comparison of the commonly
>> written blocks, then stopped talking to the device at
>> all.
>> On reboot, fdisk showed no partition table.
>> dd of /dev/lba showed all FFs for the first 16K,
>> then 00 for the next 2K, then data.
>> Suggestions on how to proceed with debugging
>> are welcome.
> This is one of the reasons I'm so concerned about this type of device.

This is indeed a serious concern.  But it has to be balanced against the 
hardware problem that CaFe doesn't work with the next generation of raw 
NAND chips.  Maybe that hardware problem can be solved, maybe not.  At 
the moment, there are no obviously-good solutions.

It seems clear to me that the industry is moving rapidly toward managed 
NAND.  I could be wrong about that, but I don't think I am.  It's pretty 
hard to win by betting against the volume hardware.  If that is true, 
then the winning strategy is some combination of making do with what the 
industry has to offer and influencing them to fix problems.

> When you're dealing with stuff in software, if you have a bug you can
> whip the developers harder. When something goes wrong inside the
> device's internal firmware, there really isn't much you can do about it
> at all.

For an individual, that is true; you have almost no leverage over the 
device vendor.  But as a volume customer, you do have leverage. Sun, 
even in its early days of modest volumes, was able to get bug fixes for 
disk drive and tape drive firmware problems.

On the other side, it's not entirely clear how you "whip harder" FOSS 
developers in general.  It appears to me that it's a hit-or-miss 
proposition as to whether you can get sufficient attention from a given 
expert.  For example, consider yourself.  When you worked for RH, OLPC 
could get lots of your valuable attention because of the OLPC/RH 
connection.  But now that you are associated with Intel, what is the 
situation?  (Perhaps we could in fact get some of your cycles; I'm just 
saying that the answer doesn't seem obvious and straightforward.)

In summary, it looks to me like there are valid arguments on both sides 
of this question.
