#7458 BLOC 8.2.0 (: Intermitent suspend/resume lockup

Zarro Boogs per Child bugtracker at laptop.org
Thu Jul 17 20:13:22 EDT 2008


#7458: Intermitent suspend/resume lockup
----------------------------+-----------------------------------------------
   Reporter:  dsaxena       |       Owner:  dsaxena                          
       Type:  defect        |      Status:  assigned                         
   Priority:  blocker       |   Milestone:  8.2.0 (was Update.2)             
  Component:  not assigned  |     Version:  Development build as of this date
 Resolution:                |    Keywords:  joyride-2131:-                   
Next_action:  diagnose      |    Verified:  0                                
  Blockedby:                |    Blocking:  7393                             
----------------------------+-----------------------------------------------

Comment(by dsaxena):

 Was running another test and stumbled upon this scenario. I hit a lockup
 we've seen before, where we got timeouts and at somepoint the system stops
 responding in the middle of going to sleep:

 {{{
 [    2.549039] PM: Syncing filesystems ... done.
 [    2.554305] PM: Preparing system for mem sleep
 [    2.559303] Freezing user space processes ... (elapsed 0.00 seconds)
 done.
 [    2.573643] Freezing remaining freezable tasks ... (elapsed 0.00
 seconds) done.
 [    2.581921] PM: Entering mem sleep
 }}}

 I hit <ctrl>-<c> on serial console and the system came back:


 {{{


 [   56.856441] olpc-ec:  running cmd 0x1c
 [   56.873266] olpc-ec:  received 0xff
 [   56.876778] olpc-ec:  running cmd 0x1b
 [   56.886569] olpc-ec:  sending cmd arg 0xbb
 [   56.907746] olpc-ec:  running cmd 0x32
 [   56.921724] olpc_do_sleep!
 +r[    0.005723] olpc-ec:  running cmd 0x34
 [    0.021556] olpc-ec:  running cmd 0x24
 [    0.042385] olpc-ec:  running cmd 0x24
 [    0.063217] olpc-ec:  running cmd 0x1c
 [    0.080031] olpc-ec:  received 0xbb
 [    0.083541] olpc-ec:  running cmd 0x1b
 [    0.092331] olpc-ec:  sending cmd arg 0xff
 }}}

 When we get to wireless resume, we see:

 {{{
 [    2.158817] list_add corruption. prev->next should be next (cda05884),
 but was cda0587c. (prev=ce3f54d0).
 [    2.168417] ------------[ cut here ]------------
 [    2.173053] Kernel BUG at c04e0e77 [verbose debug info unavailable]
 [    2.179336] invalid opcode: 0000 [#1] PREEMPT
 [    2.183848] Modules linked in: usb8xxx libertas ieee80211 mmc_block
 ieee80211_crypt pcspkr mousedev ehci_hcd sdhci ohci_hcd mmc_core
 cs5535_gpio usbcore
 [    2.197851]
 [    2.199366] Pid: 1809, comm: sleep_test Not tainted (2.6.25.11 #5)
 [    2.205481] EIP: 0060:[<c04e0e77>] EFLAGS: 00010046 CPU: 0
 [    2.210995] EIP is at __list_add+0x34/0x4a
 [    2.215109] EAX: 00000070 EBX: ce3f5504 ECX: ce3c3000 EDX: 00000001
 [    2.221392] ESI: ce3f5504 EDI: cda04420 EBP: ce3c3e30 ESP: ce3c3e1c
 [    2.227675]  DS: 0068 ES: 007b FS: 0000 GS: 0033 SS: 0068
 [    2.233090] Process sleep_test (pid: 1809, ti=ce3c3000 task=cdc6c890
 task.ti=ce3c3000)
 [    2.240762] Stack: c069c646 cda05884 cda0587c ce3f54d0 00000001
 ce3c3e4c d094adf7 cda058bc
 [    2.249349]        00000283 0000001f cda04420 cd328800 ce3c3e84
 d094c951 00000000 cda04420
 [    2.257935]        ce3f5504 ccfea000 d0ad7293 00000020 00000002
 ccf47400 ccfd80c0 cda04420
 [    2.266523] Call Trace:
 [    2.269194]  [<d094adf7>] ? lbs_queue_cmd+0x115/0x1b1 [libertas]
 [    2.275379]  [<d094c951>] ? lbs_prepare_and_send_command+0x14ba/0x164e
 [libertas]
 [    2.283003]  [<d0ad7293>] ? usb_submit_urb+0x1d5/0x1f1 [usbcore]
 [    2.289237]  [<d094402d>] ? lbs_resume+0x64/0xc8 [libertas]
 [    2.294936]  [<d0935b6f>] ? if_usb_resume+0x5f/0xab [usb8xxx]
 [    2.300777]  [<d0ad8e35>] ? usb_resume_interface+0x52/0xf4 [usbcore]
 [    2.307300]  [<d0ad8f8e>] ? usb_external_resume_device+0xb7/0xeb
 [usbcore]
 [    2.314347]  [<d0ad8ff3>] ? usb_resume+0x31/0x37 [usbcore]
 [    2.319993]  [<c051e748>] ? device_resume+0x7a/0x128
 [    2.325038]  [<c0431e66>] ? suspend_devices_and_enter+0x103/0x11f
 [    2.331225]  [<c0431fb4>] ? enter_state+0x110/0x16c
 [    2.336194]  [<c04320a3>] ? state_store+0x93/0xab
 [    2.340971]  [<c0432010>] ? state_store+0x0/0xab
 [    2.345663]  [<c04d99a5>] ? kobj_attr_store+0x1a/0x22
 [    2.350792]  [<c048daa6>] ? sysfs_write_file+0xb0/0xdb
 [    2.356022]  [<c048d9f6>] ? sysfs_write_file+0x0/0xdb
 [    2.361146]  [<c0457cda>] ? vfs_write+0x8c/0x10e
 [    2.365855]  [<c0458196>] ? sys_write+0x3b/0x60
 [    2.370465]  [<c0403812>] ? sysenter_past_esp+0x5f/0x85
 [    2.375778]  =======================
 [    2.379364] Code: 39 d0 74 14 51 50 52 68 f6 c5 69 c0 e8 bd 4b f3 ff 0f
 0b 83 c4 10 eb fe 8b 10 39 ca 74 14 50 52 51 68 46 c6 69 c0 e8 a3 4b f3 ff
 <0f> 0b 83 c4 10 eb fe 89 5a 04 89 13 89 43 04 89 18 8b 5d fc c9
 [    2.399892] EIP: [<c04e0e77>] __list_add+0x34/0x4a SS:ESP 0068:ce3c3e1c
 [    2.406578] ---[ end trace a7fd94e034e86e3b ]---
 [    2.411217] note: sleep_test[1809] exited with preempt_count 1
 [    2.416979] BUG: sleeping function called from invalid context at
 kernel/rwsem.c:21
 [    2.424652] in_atomic():1, irqs_disabled():1
 [    2.428947] Pid: 1809, comm: sleep_test Tainted: G      D  2.6.25.11 #5
 [    2.435497]  [<c040f4a6>] __might_sleep+0x98/0x9f
 [    2.440262]  [<c05f449e>] down_read+0x15/0x23
 [    2.444670]  [<c04330fe>] acct_collect+0x36/0x14c
 [    2.449422]  [<c0417943>] do_exit+0x189/0x5cc
 [    2.453837]  [<c0415a2f>] ? printk+0x15/0x17
 [    2.458182]  [<c0414da2>] ? print_oops_end_marker+0x1e/0x23
 [    2.463834]  [<c04041e9>] die+0x113/0x11b
 [    2.467895]  [<c05f566b>] do_trap+0x7d/0x85
 [    2.472135]  [<c040439c>] ? do_invalid_op+0x0/0x76
 [    2.477001]  [<c0404408>] do_invalid_op+0x6c/0x76
 [    2.481754]  [<c04e0e77>] ? __list_add+0x34/0x4a
 [    2.486448]  [<c05f5184>] ? _spin_unlock_irqrestore+0x17/0x2c
 [    2.492279]  [<c0415455>] ? release_console_sem+0x193/0x19b
 [    2.497933]  [<c04159fd>] ? vprintk+0x2b3/0x2d0
 [    2.502539]  [<d0ad6eeb>] ? usb_hcd_submit_urb+0x83e/0x91b [usbcore]
 [    2.509075]  [<c05f5392>] error_code+0x6a/0x70
 [    2.513575]  [<c041007b>] ? sched_show_task+0x7f/0xa6
 [    2.518706]  [<c0450000>] ? unmap_hugepage_range+0xc/0x42
 [    2.524183]  [<c04e0e77>] ? __list_add+0x34/0x4a
 [    2.528878]  [<d094adf7>] lbs_queue_cmd+0x115/0x1b1 [libertas]
 [    2.534830]  [<d094c951>] lbs_prepare_and_send_command+0x14ba/0x164e
 [libertas]
 [    2.542253]  [<d0ad7293>] ? usb_submit_urb+0x1d5/0x1f1 [usbcore]
 [    2.548435]  [<d094402d>] lbs_resume+0x64/0xc8 [libertas]
 [    2.553935]  [<d0935b6f>] if_usb_resume+0x5f/0xab [usb8xxx]
 [    2.559575]  [<d0ad8e35>] usb_resume_interface+0x52/0xf4 [usbcore]
 [    2.565901]  [<d0ad8f8e>] usb_external_resume_device+0xb7/0xeb
 [usbcore]
 [    2.572747]  [<d0ad8ff3>] usb_resume+0x31/0x37 [usbcore]
 [    2.578203]  [<c051e748>] device_resume+0x7a/0x128
 [    2.583042]  [<c0431e66>] suspend_devices_and_enter+0x103/0x11f
 [    2.589019]  [<c0431fb4>] enter_state+0x110/0x16c
 [    2.593781]  [<c04320a3>] state_store+0x93/0xab
 [    2.598359]  [<c0432010>] ? state_store+0x0/0xab
 [    2.603049]  [<c04d99a5>] kobj_attr_store+0x1a/0x22
 [    2.607980]  [<c048daa6>] sysfs_write_file+0xb0/0xdb
 [    2.613010]  [<c048d9f6>] ? sysfs_write_file+0x0/0xdb
 [    2.618142]  [<c0457cda>] vfs_write+0x8c/0x10e
 [    2.622635]  [<c0458196>] sys_write+0x3b/0x60
 [    2.627046]  [<c0403812>] sysenter_past_esp+0x5f/0x85
 [    2.632160]  =======================
 }}}

 The first BUG in there is from the wireless driver's {{{lbs_queue_cmd()}}}
 function appending to a list that has been corrupted. The second BUG I'm
 still trying to trace down completely, but given that it has to do with
 sleeping at the wrong time, the fact that I could wake the system up from
 being stuck in some context, the list corruption, the message
 "sleep_test[1809] exited with preempt_count 1", and some other behaviour
 I've seen I ran a test with CONFIG_PREEMPT disabled to make 100% sure that
 our issues are not due to a preemption bug. I reproduced the exact same
 crash, so one more variable out of the way.

 Will analyze the crash path and other crashes in detail and see if I can
 find a correlation in all of them other than the EC timeout. One note of
 interest is that -stable kernel is hard coded with 40ms timeout.

-- 
Ticket URL: <http://dev.laptop.org/ticket/7458#comment:32>
One Laptop Per Child <http://laptop.org/>
OLPC bug tracking system


More information about the Bugs mailing list