#6532 BLOC Update1: SD Card Corruption

Wed Jun 18 00:52:57 EDT 2008

#6532: SD Card Corruption
----------------------+-----------------------------------------------------
  Reporter:  haralds  |       Owner:  dsaxena          
      Type:  defect   |      Status:  assigned         
  Priority:  blocker  |   Milestone:  Update1.1 (8.1.1)
 Component:  kernel   |     Version:                   
Resolution:           |    Keywords:  release?         
  Verified:  0        |    Blocking:  6893             
 Blockedby:           |  
----------------------+-----------------------------------------------------
Changes (by dsaxena):

 * cc: dilinger (added)

Comment:

 I've spent some time digging deep into the bowels of the VFS and block
 layer and gathering some debug output, and I have an explanation for the
 partition table corruption:

 Upon coming out of resume, the SD code, with CONFIG_MMC_UNSAFE_SUSPEND
 enabled, checks to see if there is a card plugged into the system and
 whether that card is the same as the one that was plugged in at suspend
 time. This is accomplished by reading the card ID of the device and, for
 some reason (very possibly #1339), we fail this detection. In this case,
 the kernel removes the old device from the system, and in this execution
 path the partition information for this device is zeroed.
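
 As a rough illustration of the resume-time check described above, here is
 a minimal Python model. It is not the kernel code: the Card and Partition
 classes, the resume_check name, and the read_cid callable are all
 hypothetical stand-ins, and the only behavior modeled is "compare the card
 ID, and on mismatch remove the device and zero its partition information".

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    start_sector: int   # offset of the partition from the start of the disk
    nr_sectors: int


@dataclass
class Card:
    cid: bytes                      # card ID recorded at suspend time
    partitions: list = field(default_factory=list)
    present: bool = True


def resume_check(card, read_cid):
    """Model of the resume-time check: keep the card only if the CID matches.

    read_cid is a hypothetical callable standing in for the hardware read;
    it returns the CID bytes, or None if the read fails.
    """
    if read_cid() == card.cid:
        return True                 # same card, nothing to do
    # Detection failed: treat it as a different (or missing) card.
    card.present = False
    for part in card.partitions:
        part.start_sector = 0       # partition info is zeroed on removal
        part.nr_sectors = 0
    return False
```

 In this model a failed or garbled ID read is indistinguishable from a
 swapped card, which is the situation suspected in #1339.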

 Even though the device is removed, it is still mounted, and upon unmount
 ext2 syncs the superblock, even if the file system was sync'd beforehand.
 The superblock is block 0 of the partition, and the block layer adds the
 partition start offset to this before submitting the write to the lower
 layers. As the partition information has already been zeroed out, we end
 up writing to block 0 of the disk itself, overwriting the partition table
 and the geometry information. I've verified this both by gathering debug
 output and by comparing 'dd' + 'hexdump' dumps of corrupted and
 uncorrupted media.
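
 The offset arithmetic can be sketched with a toy disk image. This is a
 simplified model under stated assumptions, not the block layer itself: the
 512-byte sector size, the helper names, and the example partition start
 are illustrative, and the 0x55 0xAA bytes at offset 510 are the standard
 MBR boot signature that a hexdump of an intact sector 0 would show.

```python
SECTOR_SIZE = 512
MBR_SIGNATURE = b"\x55\xaa"         # last two bytes of a valid MBR sector


def make_disk(nr_sectors):
    """Return a blank disk image with an MBR boot signature in sector 0."""
    disk = bytearray(nr_sectors * SECTOR_SIZE)
    disk[510:512] = MBR_SIGNATURE
    return disk


def remap(part_start, sector):
    """Translate a partition-relative sector to an absolute disk sector."""
    return part_start + sector


def write_sector(disk, part_start, sector, data):
    """Write one sector of data through the partition remap."""
    assert len(data) == SECTOR_SIZE
    offset = remap(part_start, sector) * SECTOR_SIZE
    disk[offset:offset + SECTOR_SIZE] = data


def has_mbr_signature(disk):
    """Check whether sector 0 still ends with the MBR boot signature."""
    return bytes(disk[510:512]) == MBR_SIGNATURE
```

 With a hypothetical partition start of 16, write_sector(disk, 16, 0, data)
 leaves sector 0 alone and has_mbr_signature(disk) stays true. With the
 start zeroed, the same partition-relative write, write_sector(disk, 0, 0,
 data), lands on sector 0 and the signature is gone.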

 Some interesting points:

   1. We are able to delete a block device even though it is still mounted.
   2. Even though the device has been deleted, the write submitted to it
 does not fail.

 Note that this is still not 100% reproducible, and in certain cases the
 superblock write during unmount does fail with block I/O errors, meaning
 that the queue is properly deleted. As per dilinger's comments on IRC, the
 VFS has lots of refcounts and there is a timing issue/race condition that
 we're hitting. As per #1339, we may be able to add an OLPC-specific hack
 to wait 500ms or so upon resume to get around this. I will try this, but I
 don't think it is acceptable given our suspend/resume requirements.
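
 For illustration only, here is one possible shape for such a wait, written
 as a Python model, not a kernel patch. It is a polling variant of the
 fixed 500ms wait: it re-reads the card ID until it matches or the settle
 window expires. The function name, the 50ms poll interval, and the
 read_cid callable are assumptions made for this sketch.

```python
import time


def read_cid_after_settle(read_cid, expected_cid, settle_ms=500,
                          poll_ms=50, sleep=time.sleep):
    """Poll for the expected CID for up to settle_ms milliseconds.

    Returns True as soon as read_cid() returns expected_cid, or False if
    the settle window expires first. read_cid is a hypothetical callable
    standing in for the hardware read; sleep is injectable for testing.
    """
    waited_ms = 0
    while True:
        if read_cid() == expected_cid:
            return True             # card identified, no need to remove it
        if waited_ms >= settle_ms:
            return False            # give up: treat the card as changed
        sleep(poll_ms / 1000.0)
        waited_ms += poll_ms
```

 Polling returns early when the card answers quickly, but the worst case
 still adds the full settle window to resume time.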

 Something I don't quite understand at the moment is how/when our userland
 environment (the journal specifically, I think?) unmounts the device, as
 I've been testing via command-line suspend, mount, and unmount while
 running in console mode.

 Next steps:

   1. Get an understanding of what is happening with our userland and
 brainstorm with cjb about the possibility of simply unmounting the SD
 device upon suspend. There are issues around this, as we may have files
 open and that will keep us from suspending.
   2. Test adding a timeout to the resume path to see if it solves our
 problem, to validate that it is indeed something related to our HW.
   3. Dig into the unmount/write to non-existing bdev some more and discuss
 this upstream if needed.

 (Adding dilinger to cc:)

-- 
Ticket URL: <http://dev.laptop.org/ticket/6532#comment:34>
One Laptop Per Child <http://laptop.org/>
OLPC bug tracking system