#6578 NORM Never A: Cafe NAND: Corrected 1 symbol errors - error recovery should be improved

Wed Feb 27 15:38:35 EST 2008

#6578: Cafe NAND: Corrected 1 symbol errors - error recovery should be improved
--------------------+-------------------------------------------------------
 Reporter:  gnu     |       Owner:  jg            
     Type:  defect  |      Status:  new           
 Priority:  normal  |   Milestone:  Never Assigned
Component:  distro  |     Version:                
 Keywords:          |    Verified:  0             
 Blocking:          |   Blockedby:                
--------------------+-------------------------------------------------------
 I received these kernel messages on my MP G1G1 laptop, running
 update.1-691, today:

 [  124.794431] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors
 [  124.838537] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors
 [  134.750821] msh0: no IPv6 routers present
 [  146.301007] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
 [  157.140197] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
 [  162.128785] eth0: no IPv6 routers present
 [  167.780302] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors
 [  167.961249] msh0: no IPv6 routers present
 [  168.741263] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors

 The message reports only the PCI device address (0000:00:0c.0).

 It does not report anything further about the error -- not the raw block
 or chip address, not the symbol in error, not the high-level inode or
 filename involved.  This makes it hard to diagnose or even recognize
 patterns.
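
 Until the message carries more detail, about all one can do is tally
 what is there.  A minimal sketch, assuming the log line format shown
 above; the function name and the regular expression are illustrative
 only and not part of any existing tool.  The driver name is matched
 loosely because the accented character is often mangled in logs:

```python
import re
from collections import Counter

# Matches lines such as:
#   [  124.794431] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors
# "CAF\S*" tolerates a mangled final character in the driver name.
PATTERN = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+CAF\S*\s+NAND\s+(?P<dev>\S+):\s+"
    r"Corrected\s+(?P<count>\d+)\s+symbol errors"
)


def tally_corrections(lines):
    """Return (symbols corrected per device, timestamps of each report)."""
    per_device = Counter()
    timestamps = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            per_device[m.group("dev")] += int(m.group("count"))
            timestamps.append(float(m.group("time")))
    return per_device, timestamps


# The kernel messages quoted in this ticket.
SAMPLE = """\
[  124.794431] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors
[  124.838537] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors
[  134.750821] msh0: no IPv6 routers present
[  146.301007] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
[  157.140197] ADDRCONF(NETDEV_CHANGE): msh0: link becomes ready
[  162.128785] eth0: no IPv6 routers present
[  167.780302] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors
[  167.961249] msh0: no IPv6 routers present
[  168.741263] CAFÉ NAND 0000:00:0c.0: Corrected 1 symbol errors
"""

per_device, timestamps = tally_corrections(SAMPLE.splitlines())
for dev, total in per_device.items():
    print(f"{dev}: {total} symbols corrected in {len(timestamps)} reports")
```

 This gives counts and timing per PCI device, but it still cannot say
 which block or file was involved, which is the point of this ticket.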

 It apparently does not push the error information up a level into the
 filesystem, so the filesystem is unable to store a corrected copy of the
 file data elsewhere (in case a second error arises in this block, making
 it uncorrectable).  As a result, the filesystem may be rereading this
 block several times, producing this message each time.  Of course, I
 can't tell whether it is rereading one block or encountering errors in
 several different blocks, since the message doesn't say which block.

 There was info at boot time about a "bad block table":

 [   26.816456] NAND device: Manufacturer ID: 0xad, Chip ID: 0xdc (Hynix NAND 512MiB 3,3V 8-bit)
 [   26.850413] 2 NAND chips detected
 [   26.878771] Bad block table found at page 524224, version 0x01
 [   26.879000] Bad block table found at page 524160, version 0x01
 [   26.879154] nand_read_bbt: Bad block at 0x038a0000
 [   26.879172] nand_read_bbt: Bad block at 0x038c0000
 [   26.879206] nand_read_bbt: Bad block at 0x05bc0000
 [   26.879266] nand_read_bbt: Bad block at 0x0b8a0000
 [   26.879284] nand_read_bbt: Bad block at 0x0b8c0000
 [   26.879754] Searching for RedBoot partition table in NAND 512MiB 3,3V 8-bit at offset 0xfd80000
 [   26.920028] No RedBoot partition table detected in NAND 512MiB 3,3V 8-bit
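
 For reference, the byte addresses in the boot log can be turned into
 erase-block numbers.  A minimal sketch; the 128 KiB erase block and
 2 KiB page size are my assumptions (they are not printed in the log,
 though every bad-block address above is a multiple of 0x20000), and the
 helper names are made up for illustration:

```python
ERASE_BLOCK_SIZE = 128 * 1024  # assumption: 128 KiB erase blocks
PAGE_SIZE = 2048               # assumption: 2 KiB pages (64 pages per block)

# Addresses reported by nand_read_bbt in the boot log.
BAD_BLOCK_ADDRESSES = [
    0x038A0000,
    0x038C0000,
    0x05BC0000,
    0x0B8A0000,
    0x0B8C0000,
]

# Page numbers where the two bad block tables were found.
BBT_PAGES = [524224, 524160]


def address_to_block(addr):
    """Convert a byte address to an erase-block index."""
    if addr % ERASE_BLOCK_SIZE != 0:
        raise ValueError(f"{addr:#010x} is not block-aligned")
    return addr // ERASE_BLOCK_SIZE


def page_to_block(page):
    """Convert a page number to the erase-block index that contains it."""
    return (page * PAGE_SIZE) // ERASE_BLOCK_SIZE


bad_blocks = [address_to_block(a) for a in BAD_BLOCK_ADDRESSES]
bbt_blocks = [page_to_block(p) for p in BBT_PAGES]

print("bad blocks:", bad_blocks)
print("BBT blocks:", bbt_blocks)
```

 Under those assumptions the bad blocks are numbers 453, 454, 734, 1477
 and 1478, and the two bad block tables sit in blocks 8191 and 8190.  If
 the two 512 MiB chips form one 1 GiB device, those would be the last two
 of its 8192 blocks.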

 It is not clear whether single-symbol errors like this will cause a block
 of NAND to be added to the "bad block table".  I suggest that blocks
 producing such errors be added to a "provisional bad block table", with a
 count of errors encountered.  If such a block is rewritten with new data
 and continues to produce errors, it should go into the bad block table
 and no longer be used.
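
 To make the suggestion concrete, here is a rough sketch of the
 bookkeeping I have in mind.  It is not how the kernel's bad block table
 works today; every class, method and field name is hypothetical, and the
 policy of retiring a block on its first error after a rewrite is one
 possible choice of threshold:

```python
from dataclasses import dataclass, field


@dataclass
class Suspect:
    errors: int = 0                # corrected errors seen in total
    rewritten: bool = False        # rewritten since it became a suspect?
    errors_after_rewrite: int = 0  # corrected errors since that rewrite


@dataclass
class ProvisionalBadBlockTable:
    # Hypothetical policy knob: post-rewrite errors needed to retire a block.
    retire_after: int = 1
    suspects: dict = field(default_factory=dict)
    bad_blocks: set = field(default_factory=set)

    def record_corrected_error(self, block):
        """Note a corrected error; return True if the block is now bad."""
        if block in self.bad_blocks:
            return True
        entry = self.suspects.setdefault(block, Suspect())
        entry.errors += 1
        if entry.rewritten:
            entry.errors_after_rewrite += 1
            if entry.errors_after_rewrite >= self.retire_after:
                # Errors persisted after fresh data was written: retire it.
                del self.suspects[block]
                self.bad_blocks.add(block)
                return True
        return False

    def record_rewrite(self, block):
        """Note that a suspect block was rewritten with new data."""
        entry = self.suspects.get(block)
        if entry is not None:
            entry.rewritten = True
            entry.errors_after_rewrite = 0


# Block 100 is an arbitrary example, not a block from this laptop.
table = ProvisionalBadBlockTable()
first = table.record_corrected_error(100)   # becomes a suspect
table.record_rewrite(100)                   # fresh data written
second = table.record_corrected_error(100)  # still failing, so retired
print(first, second, sorted(table.bad_blocks))
```

 A block that errs once and then behaves after being rewritten simply
 stays in the provisional table with its count, which is also the
 information needed to spot patterns later.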

-- 
Ticket URL: <http://dev.laptop.org/ticket/6578>
One Laptop Per Child <http://dev.laptop.org>
OLPC bug tracking system