#11982 BLOC 12.1.0: XO-1.5 os16 runin hang

Wed Jul 25 17:36:07 EDT 2012

#11982: XO-1.5 os16 runin hang
--------------------------------+-------------------------------------------
           Reporter:  Quozl     |       Owner:  dsd                              
               Type:  defect    |      Status:  assigned                         
           Priority:  blocker   |   Milestone:  12.1.0                           
          Component:  kernel    |     Version:  Development build as of this date
         Resolution:            |    Keywords:                                   
        Next_action:  diagnose  |    Verified:  0                                
Deployment_affected:            |   Blockedby:                                   
           Blocking:            |  
--------------------------------+-------------------------------------------

Comment(by dsd):

 Today's update:

 My overnight testing confirms that running the 3.1 kernel and runin
 version from 12.1.0 on top of 11.3.1 does not reproduce the issue. So it
 seems fair to say that this hang is either triggered by a userspace
 application, or is a kernel bug that is only now exposed due to a change
 in userspace.

 We found another case of the hang occurring in read_unlock(&tasklist_lock),
 but this one also printed "dcon_source_switch to CPU" before the call to
 read_unlock() had returned (it never did). This suggests that the problem
 is not actually in the unlocking; it's in some other task that is
 scheduled at that point, because kernel preemption is re-enabled when a
 spinlock is unlocked.

 Disabling kernel preemption helped to confirm this: the hang was then
 postponed until thaw_processes() called schedule() at the end, and
 schedule() never returned. This was confirmed on 2 systems.
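
 To illustrate the reasoning in the two paragraphs above, here is a toy
 userspace model of it. It is not kernel code: every toy_* name is
 hypothetical, and it assumes a simplified rule in which dropping the last
 lock on a preemptible kernel immediately runs the scheduler if a
 reschedule is pending. It only shows why the first task switch moves from
 inside read_unlock() to the explicit schedule() call once preemption is
 turned off.

```c
#include <assert.h>
#include <stdbool.h>

/* Where the first task switch happened in the toy resume path. */
enum toy_site { TOY_NONE, TOY_IN_UNLOCK, TOY_IN_EXPLICIT_SCHEDULE };

/* Hypothetical, heavily simplified per-CPU state. */
struct toy_cpu {
    bool preempt_config;        /* models a preemptible kernel build */
    int preempt_count;          /* > 0 means preemption is held off */
    bool need_resched;          /* a woken task is waiting to run */
    enum toy_site first_switch; /* records the first scheduling point */
};

/* Stand-in for the scheduler: just records where it was first entered. */
static void toy_schedule(struct toy_cpu *cpu, enum toy_site site)
{
    if (cpu->first_switch == TOY_NONE)
        cpu->first_switch = site;
    cpu->need_resched = false;
}

/* Taking the lock holds off preemption. */
static void toy_read_lock(struct toy_cpu *cpu)
{
    cpu->preempt_count++;
}

/* Dropping the lock re-enables preemption; on a preemptible build a
 * pending reschedule is acted on before the unlock returns. */
static void toy_read_unlock(struct toy_cpu *cpu)
{
    assert(cpu->preempt_count > 0);
    cpu->preempt_count--;
    if (cpu->preempt_config && cpu->preempt_count == 0 && cpu->need_resched)
        toy_schedule(cpu, TOY_IN_UNLOCK);
}

/* Toy resume path: wake a task while holding the lock, unlock, then
 * call the scheduler explicitly at the end. */
static enum toy_site toy_resume(bool preempt_config)
{
    struct toy_cpu cpu = { preempt_config, 0, false, TOY_NONE };

    toy_read_lock(&cpu);
    cpu.need_resched = true;    /* a thawed task becomes runnable */
    toy_read_unlock(&cpu);      /* may switch tasks here */
    toy_schedule(&cpu, TOY_IN_EXPLICIT_SCHEDULE);
    return cpu.first_switch;
}
```

 In this model toy_resume(true) reports the first switch inside the unlock,
 and toy_resume(false) reports it at the explicit schedule call. If the
 task being switched to is what hangs, the hang would appear at a different
 point in each configuration, which is consistent with what we observed.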

 (The DCON is not really a suspect, because in the cases where kernel
 preemption was disabled, the DCON was fully unfrozen before tasks were
 restarted.)

 Sam has seen two instances of a hang that occur at a slightly later stage,
 after all tasks have been resumed and even after the wifi card has been
 re-detected. So perhaps we don't always see this immediately during
 resume.

 We are currently testing CONFIG_HARDLOCKUP_DETECTOR (early indications
 suggest that this doesn't catch anything) and a modified resume routine
 where debug info is printed from __schedule(), which should tell us which
 process is scheduled immediately before the hang.

-- 
Ticket URL: <http://dev.laptop.org/ticket/11982#comment:22>
One Laptop Per Child <http://laptop.org/>
OLPC bug tracking system