#6532 BLOC Update1: SD Card Corruption
Zarro Boogs per Child
bugtracker at laptop.org
Wed Jun 18 00:52:57 EDT 2008
#6532: SD Card Corruption
----------------------+-----------------------------------------------------
Reporter: haralds     |  Owner: dsaxena
Type: defect          |  Status: assigned
Priority: blocker     |  Milestone: Update1.1 (8.1.1)
Component: kernel     |  Version:
Resolution:           |  Keywords: release?
Verified: 0           |  Blocking: 6893
Blockedby:            |
----------------------+-----------------------------------------------------
Changes (by dsaxena):
* cc: dilinger (added)
Comment:
I've spent some time digging deep into the bowels of the VFS and block
layer and gathering some debug output, and have an explanation for the
partition table corruption:

Upon coming out of resume, the SD code, with CONFIG_MMC_UNSAFE_SUSPEND
enabled, checks to see if there is a card plugged into the system and
whether that card is the same as the one that was plugged into the system
at suspend time. This is accomplished by reading the card ID of the device
and for some reason, very possibly #1339, we fail this detection. In this
case, the kernel removes the old device from the system and in this
execution path, the partition information for this device is zeroed.

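The resume-time check described above can be modeled roughly as follows.
This is a toy Python sketch of the control flow only, not the kernel's MMC
code: Card, resume_check, and read_cid are hypothetical names, and the card
ID bytes, the "mmcblk0p1" label, and the start sector 63 are made up for
illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Card:
    """Toy stand-in for a registered SD card (hypothetical, not a kernel struct)."""
    cid: bytes                                   # card ID recorded before suspend
    partitions: Dict[str, int] = field(default_factory=dict)  # name -> start sector
    present: bool = True


def resume_check(card: Card, read_cid: Callable[[], Optional[bytes]]) -> bool:
    """Return True if the same card is still there; otherwise tear it down.

    read_cid models reading the card ID after resume. It returns None when
    the read fails, for example because the hardware is not ready yet.
    """
    cid_now = read_cid()
    if cid_now == card.cid:
        return True
    # Detection failed: the device is removed and its partition information
    # is zeroed, even though a filesystem on it may still be mounted.
    card.present = False
    for name in card.partitions:
        card.partitions[name] = 0
    return False


card = Card(cid=b"\x01\x02\x03\x04", partitions={"mmcblk0p1": 63})
same = resume_check(card, read_cid=lambda: None)  # simulate a failed CID read
```

In this model a failed read and a genuinely different card are treated the
same way, which is the case that matters here: the partition start is reset
to 0 while the filesystem still believes it is mounted.
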
Even though the device is removed, the device is still mounted and upon
unmount, ext2 syncs the superblock, even if the file system is sync'd
beforehand. The superblock is block 0 of the partition and the block layer
adds to this the partition start offset before submitting the write to the
lower layers. As the partition information has already been zeroed out, we
end up writing to block 0 of the disk itself, overwriting the partition
table and the geometry information. I've verified this by both gathering
debug output and 'dd' + 'hexdump' of corrupted and uncorrupted media.

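To make the offset arithmetic concrete, here is a small Python simulation of
the remapping step. It is a sketch under stated assumptions, not block-layer
code: the helper names are hypothetical, the partition start of 63 sectors
and the 4096-byte filesystem block are illustrative choices, and the "disk"
is just a bytearray. The ext2 magic (0xEF53, 56 bytes into the superblock,
which sits 1024 bytes into the partition) and the 0x55AA boot signature at
bytes 510-511 are the standard on-disk values.

```python
SECTOR = 512
EXT2_MAGIC = b"\x53\xef"   # 0xEF53 as stored little-endian on disk
MBR_SIG = b"\x55\xaa"      # boot signature at bytes 510-511 of sector 0


def make_disk(sectors: int = 256) -> bytearray:
    """Blank in-memory disk image with a valid boot signature."""
    disk = bytearray(sectors * SECTOR)
    disk[510:512] = MBR_SIG
    return disk


def make_block0(block_size: int = 4096) -> bytes:
    """Filesystem block 0, holding an ext2-style superblock at byte 1024."""
    block = bytearray(block_size)
    block[1024 + 56:1024 + 58] = EXT2_MAGIC
    return bytes(block)


def submit_write(disk: bytearray, part_start: int, rel_sector: int, data: bytes) -> int:
    """Remap a partition-relative sector to an absolute one, then write."""
    abs_sector = part_start + rel_sector
    offset = abs_sector * SECTOR
    disk[offset:offset + len(data)] = data
    return abs_sector


def mbr_intact(disk: bytearray) -> bool:
    """Equivalent of eyeballing bytes 510-511 in a hexdump of sector 0."""
    return bytes(disk[510:512]) == MBR_SIG


disk = make_disk()
block0 = make_block0()

# Normal case: the partition starts at sector 63, so sector 0 is untouched.
good_sector = submit_write(disk, part_start=63, rel_sector=0, data=block0)
intact_before = mbr_intact(disk)

# After the failed resume check the start offset is 0, so the same
# superblock write lands on the first sectors of the disk itself.
bad_sector = submit_write(disk, part_start=0, rel_sector=0, data=block0)
intact_after = mbr_intact(disk)
magic_at_disk_start = bytes(disk[1080:1082]) == EXT2_MAGIC
```

On a real card image taken with 'dd', the same two checks (missing 0x55AA at
offset 510, ext2 magic at offset 1080 of the raw device) would be one way to
recognize this particular corruption pattern.
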
Some interesting points:
1. We are able to delete a block device even though it is still mounted.
2. Even though the device has been deleted, the write submitted to it
   does not fail.

Note that this is still not 100% reproducible and in certain cases the
superblock write during unmount does fail with block I/O errors, meaning
that the queue is properly deleted. As per dilinger's comments on IRC, the
VFS has lots of refcounts and there is a timing issue/race condition that
we're hitting. As per #1339, we may be able to add an OLPC-specific hack to
wait 500ms or so upon resume to get around this. I will try this but I
don't think this is acceptable given our suspend/resume requirements.

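The idea behind the delay hack can be illustrated with a simulated
controller. This is only a sketch of the hypothesis to be tested, in Python
rather than driver code: FakeController and cid_after_resume are
hypothetical names, and the 0.3 s readiness time is invented. Whether the
real hardware behaves this way is exactly what the experiment is meant to
find out.

```python
from typing import Optional


class FakeController:
    """Hypothetical controller that only answers CID reads once settled."""

    def __init__(self, cid: bytes, ready_after_s: float) -> None:
        self.cid = cid
        self.ready_after_s = ready_after_s
        self.elapsed_s = 0.0

    def sleep(self, seconds: float) -> None:
        # Advance simulated time; no real delay is involved.
        self.elapsed_s += seconds

    def read_cid(self) -> Optional[bytes]:
        # Reads before the controller has settled fail (modeled as None).
        if self.elapsed_s >= self.ready_after_s:
            return self.cid
        return None


def cid_after_resume(ctrl: FakeController, settle_s: float) -> Optional[bytes]:
    """Wait a fixed settle time after resume, then read the card ID once."""
    ctrl.sleep(settle_s)
    return ctrl.read_cid()


no_wait = cid_after_resume(FakeController(b"\x01\x02", 0.3), settle_s=0.0)
with_wait = cid_after_resume(FakeController(b"\x01\x02", 0.3), settle_s=0.5)
```

The sketch also shows the cost: the fixed wait is paid on every resume,
whether or not the card needed it.
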
Something I don't quite understand at the moment is how/when our userland
env (journal specifically, I think?) unmounts the device, as I've been
testing via command-line suspend, mount, and unmount while running in
console mode.

Next steps:
1. Get an understanding of what is happening with our userland and
   brainstorm with cjb about the possibility of simply unmounting the SD
   device upon suspend. There are issues around this as we may have files
   open and that will keep us from suspending.
2. Test adding a timeout to the resume path to see if it solves our
   problem, to validate that it is indeed something related to our HW.
3. Dig into the unmount/write to non-existing bdev some more and discuss
   this upstream if needed.

(Adding dilinger to cc:)
--
Ticket URL: <http://dev.laptop.org/ticket/6532#comment:34>
One Laptop Per Child <http://laptop.org/>
OLPC bug tracking system