#7458 BLOC 8.2.0 (: Intermittent suspend/resume lockup

Zarro Boogs per Child bugtracker at laptop.org
Sun Jul 13 02:26:32 EDT 2008


#7458: Intermittent suspend/resume lockup
----------------------------+-----------------------------------------------
   Reporter:  dsaxena       |       Owner:  dsaxena                          
       Type:  defect        |      Status:  new                              
   Priority:  blocker       |   Milestone:  8.2.0 (was Update.2)             
  Component:  not assigned  |     Version:  Development build as of this date
 Resolution:                |    Keywords:  joyride-2131:-                   
Next_action:  diagnose      |    Verified:  0                                
  Blockedby:                |    Blocking:  7393                             
----------------------------+-----------------------------------------------

Comment(by dsaxena):

 Ugh, this is getting ugly.

 I ran three test runs with the EC timeout set to 40 ms. All three hung,
 but with different behaviour from the hangs described above.

 The first time, after 101 S/R cycles, the system continued to respond to
 pings, but keyboard, serial console, and mouse input did not work and the
 system did not suspend again. The kernel was obviously still running, but
 it looks like userland was spinning on something.

 The second time, after 2361 S/R cycles, the system seems to have
 spontaneously rebooted while I was gone. The console was working but the
 screen was not on. I hit the power button and the system went into
 suspend. It now cannot get out of suspend, because OHM puts the system
 right back to sleep upon wakeup from any source.

 The third time, after 532 cycles, I saw the "+r" from OFW but nothing
 from the kernel.

 None of these cases had EC timeouts or 3strikes in the logs. As soon as
 I set the timeout back to 20 ms, I saw the original failure again after
 13 cycles, so it appears that there are two (or more?) separate
 suspend/resume lockup issues we're dealing with (see #7479 for another
 error).

 I also saw the following after the 3strikes. Chris, is this the original
 error you saw when you reported this?

 {{{
 [  336.063874] libertas: Command 1f timed out
 [  336.078771] libertas: requeueing command 1f due to timeout (#1)
 }}}
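
 For anyone scanning captured logs for this pair of messages, here is a
 minimal sketch in Python. It assumes dmesg-style lines formatted exactly
 like the two quoted above. The function name find_libertas_timeouts is
 hypothetical and is not part of any existing tool.

```python
import re

# Patterns modelled on the two libertas lines quoted in this comment.
# The command ID (e.g. "1f") is assumed to be hexadecimal.
TIMEOUT_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+libertas: Command (?P<cmd>[0-9a-fA-F]+) timed out"
)
REQUEUE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+libertas: requeueing command "
    r"(?P<cmd>[0-9a-fA-F]+) due to timeout \(#(?P<n>\d+)\)"
)


def find_libertas_timeouts(lines):
    """Return (kind, timestamp, command, retry) tuples for matching lines.

    kind is "timeout" or "requeue"; retry is None for plain timeouts.
    """
    events = []
    for line in lines:
        line = line.strip()
        m = TIMEOUT_RE.match(line)
        if m:
            events.append(("timeout", float(m["ts"]), m["cmd"], None))
            continue
        m = REQUEUE_RE.match(line)
        if m:
            events.append(("requeue", float(m["ts"]), m["cmd"], int(m["n"])))
    return events
```

 Feeding it the lines of a saved console log gives the timestamps of each
 timeout, which can then be lined up against the S/R cycle count.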

 Next Steps:

    * For the EC issues with the default 20ms timeout, Richard needs to log
 data on the EC side.

    * I'll go read #2621 in detail to understand the 3strikes issue.

    * I'll rerun with test suspend mode, as was done in #2621.

    * I'll run testing kernel with CONFIG_PROVE_LOCKING to see if it
 catches a deadlock that might be showing up in the 40ms timeout case.

    * Merge KGDB or KDB into testing kernel so I can poke at the system
 state when a hang does occur if no deadlock is found via above.

    * Run joyride with master kernel to see if the same issues show up. Not
 much of the XO-specific code has changed between testing and master, but
 there may be some other change in the x86 codepath that impacts this, so I
 will just do this as a quick sanity check.

 Any other ideas, please throw them my way. :)

-- 
Ticket URL: <http://dev.laptop.org/ticket/7458#comment:16>
One Laptop Per Child <http://laptop.org/>
OLPC bug tracking system

