#7458 BLOC 8.2.0 (: Intermitent suspend/resume lockup

Zarro Boogs per Child bugtracker at laptop.org
Tue Jul 22 05:16:39 EDT 2008


#7458: Intermitent suspend/resume lockup
----------------------------+-----------------------------------------------
   Reporter:  dsaxena       |       Owner:  dsaxena                          
       Type:  defect        |      Status:  assigned                         
   Priority:  blocker       |   Milestone:  8.2.0 (was Update.2)             
  Component:  not assigned  |     Version:  Development build as of this date
 Resolution:                |    Keywords:  joyride-2131:-                   
Next_action:  diagnose      |    Verified:  0                                
  Blockedby:                |    Blocking:  7393                             
----------------------------+-----------------------------------------------
Changes (by dsaxena):

 * cc: mbletsas (added)


Comment:

 I've spent some time running many many test iterations to find a pattern
 and get
 a better understanding of where the issue might be. I started by adding
 debug code
 to various paths in the kernel to trace the suspend path as we go down the
 kernel stack
 the main issue I was seeing is a lockup in the Libertas driver's suspend
 path where
 the driver submits a sleep request and the the device never responds.

 I wanted to separate out the EC timeouts from the wireless lockup so
 looked at the few changes in the EC handling code from 2.6.22 to 2.6.25,
 backed them out, and ran the tests again. No luck.

 Next I started looking at the driver itself and trying to see if there
 are any major changes in the suspend/resume path. Since the code has been
 massively updated between stable and 2.6.25, I ended up just replacing the
 full 2.6.25 driver with the 2.6.22 driver. Ran tests again, no luck.

 I was using 2.6.22 as a baseline b/c it is a know quantity and I
 assumed that it did not have the same issues since we're shipping
 703 and 708 based on it. This turned out to be an incorrect assumption.
 When I tested with the 2.6.22 kernel, I was able to reproduce a
 suspend/resume lockup under stress, but somewhat different from
 what I saw with 2.6.25.

 I also wanted to rule out the firmware as a source of the issue, so
 I backed out the firmware that we are including in Joyride (5.110.22.p14)
 and dropped in the firmware from build 703 (5.110.22p1). Still got a lock
 up, but again, somewhat different.

 Following is a summary of the various failure cases I saw in my extensive
 testing. All tests
 were performed with ec_timeout set to 40ms, using XO firmware q2e11. Test
 consists of XO looping continuously and going to sleep with a host
 continuously pinging the XO.

 ----

 === 2.6.25 kernel + 2.6.25 driver + old fw ===

  * Sometimes see long (>> 10000ms) ping delays once we start sleep loop
  *  After a few sleep cycles, kernel stops at olpc_do_sleep(), and does
 not wake back up from WLAN ("Destination Host Unreachable" messages on
 host).
  *  ctrl-c on console does not do anything while stuck
  *  Moving mouse, power button, or other input wakes system back up but
 goes back to waiting state on next sleep cycle.
  *  Can stop the sleep loop by hitting ctrl-c in the small window between
 wake and sleep in above case and system immediately starts responding to
 pings.

 === 2.6.25 kernel  + 2.6.25 driver + new fw ===

 ''Failure Mode 1:''

  * Sporadic ping response times after sleep looping beings, up to 20000ms
  * After several cycles (anywhere from 30 to several 100s), system gets
 stuck waiting for libertas to respond to sleep command.
  * Even though the wireless card did not respond back (or the kernel did
 not receive the response), we are receiving WLAN wake up events from the
 EC:

    {{{
       olpc-ec:  running cmd 0x84
       olpc-ec:  received 0x20
       olpc-pm:  SCI 0x20 received
       olpc-ec:  running cmd 0x84
       olpc-ec:  received 0x0
       olpc-pm:  SCI 0x0 received
    }}}

   Javier, Richard, is it possible to receive a 0x20 SCI if the WLAN chip
 is not actually in suspend mode?

  * Hitting ctrl-c on serial console causes sleep cycle to complete but
 upon waking up we get a list corruption bug in libertas driver's
 lbs_queue_cmd() function. What this points to is that the driver is still
 processing when it should not. At this point the system is completely
 dead.

 ''Failure Mode 2:''

  * System hits olpc_do_sleep(), but we see an EC error when disabling
 EC_SCI_SRC_BATSOC && EC_SCI_SRC_ACPWR events (Not sure why we do
 this...Dilinger?):

 {{{
         /* We don't want to wake up on superfluous events */
         olpc_ec_mask_unset(EC_SCI_SRC_BATSOC | EC_SCI_SRC_ACPWR);
 }}}

 {{{
 [  295.122289] olpc-ec:  running cmd 0x1c
 [  295.122289] olpc-ec:  timeout waiting for EC to read command!
 [  295.122289] olpc-pm:  error getting SCI mask: -5
 }}}

  * Kernel tells EC to sleep:
 {{{
 [  295.122289] olpc-ec:  running cmd 0x32
 [  295.122289] olpc_do_sleep!
 }}}

  * We wake up from wlan event but kernel hangs on trying to wake up the
 wireless device:

 {{{
 +r[  295.122289] olpc-ec:  running cmd 0x34
 [  295.122289] olpc-ec:  running cmd 0x24
 [  295.122289] olpc-ec:  running cmd 0x24
 }}}

 === 2.6.25 kernel + 2.6.22 driver + new fw ===

  * Sleeps once, wakes up, does not sleep again. Driver gets stuck waiting
 for libertas to respond to sleep command
  * Hitting ctrl-c on console causes sleep cycle to complete.
 Touchpad/keybd/power button does not.
  * Waking system back up casues list corruption bug in lbs_queue_cmd(),
 but is not fatal. System still running but network interface is dead at
 this point.

 === 2.6.25 kernel + 2.6.22 driver + old fw ===

  * Sleeps once, wakes up, does not sleep again. Gets stuck waiting for
 libertas to respond to sleep command
  * Hitting ctrl-c on console causes sleep cycle to complete.
  * Touchpad/keybd/power button does not.
  * System does not have network connectivity at this point.
  * Hitting ctrl-c on console at this point does nothing
  * Touchpad/keybd/power causes system to wakeup and crash with a spinlock
 error in the khub kernel driver.

 === 2.6.22 kernel + old fw ===

  * Hangs at olpc_do_sleep() after a number of cycles.
  * Keybd/mouse/pwr wakes system up
  * Can wake system up and hit ctrl-c to stop sleep loop and system
 immediately starts responding to pings.

 === 2.6.22 kernel + new fw ===

  * Hangs at suspend of libertas driver
  * ctrl-c completes suspend, stops at olpc_do_sleep()
  * Waking up with touchpad/keyboard leads to list corruption in
 lbs_queue_cmd()
  * System completely hung at this point.

 ----

 Note that even 2.6.22 kernel + old FW, which is what we are shipping in
 703, does not always successfuly wake-on-lan, it just happens to not be
 fatal.

 I'm fairly certain the root cause of this is a race somewhere in the
 kernel/EC/Libertas suspend stack. The kernel is sending a command to the
 device and it may or not be succeeding but if it is, we're not getting
 notified in time. Due to the race, we're getting out of sync, either
 driver <-> libertas, or driver <-> EC. How this ends up impacting the
 system can drastically change based on timing of the system. For example,
 if I enable full debug output from the libertas driver ({{{echo 0xffffffff
 > /sys/module/libertas/libertas_debug) I get a very different set of
 errors from the driver compared to having it just set to 0.

 I'm not sure what role the EC timeouts play in this whole mess. With the
 40ms timeout, I'm barely seeing them but still running into the same
 crashes so they might be orthogonal to the issues we're seeing here;
 however, they still need to be looked into.

 To really root cause this I think we're going to have to look at debug
 output of all three
 components of the system (kernel, EC, wireless) to see what they are
 sending and how each component is reacting. Our goal is to ship latest
 kernel + firmware, so we need to
 focus on that specific combination.

 Next steps:

  * Javier, can you do some debug on your end to see what's happening in
 the firmware when we hit the condition where the kernel does not get a
 response to sleep? I can post my debug bits for kernel side.

  * I will try to track down the lbs_queue_cmd() list corruption in the
 kernel and look for general racyness in the generic olpc and driver
 suspend/resume path.

  * Richard/Dilinger/me should look into the EC timeouts.

-- 
Ticket URL: <http://dev.laptop.org/ticket/7458#comment:33>
One Laptop Per Child <http://laptop.org/>
OLPC bug tracking system


More information about the Bugs mailing list