#2943 NORM Untriag: Kernel oops when pressing Ctrl-C while resuming
Zarro Boogs per Child
bugtracker at laptop.org
Wed Oct 31 16:25:06 EDT 2007
#2943: Kernel oops when pressing Ctrl-C while resuming
-----------------------+----------------------------------------------------
Reporter: jcardona | Owner: dilinger
Type: defect | Status: new
Priority: normal | Milestone: Untriaged
Component: kernel | Version:
Resolution: | Keywords:
Verified: 0 |
-----------------------+----------------------------------------------------
Comment(by dilinger):
Here's what we know:
- Our kernels have CONFIG_PREEMPT enabled. With it enabled, this problem
is effectively impossible to debug; the stack backtraces show up in a
different place each time.
- With CONFIG_PREEMPT disabled, we can get a reproducible stack trace
(shown above).
- The backtrace above is incomplete. usb_resume_interface is actually
calling hub_port_init, which calls hub_port_reset, which calls
msleep(50).
- The bug is triggered via ctrl-c; char/n_tty.c passes SIGINT to the bash
process. The pending signal is stored in the task_struct, and execution
continues. It is not until we're resetting USB, later, that we actually
oops. So, the bug does not appear to be in the signal delivery; it
appears to be elsewhere, and is simply triggered by signal delivery.
- We have a local variable in hub_port_init called 'hdev'. When we return
from hub_port_reset, this variable is NULL (and when we do hdev->speed, we
trigger the oops). Since 'hdev' is on the stack, examining the stack
while it is still correct shows us that prior to the call to
hub_port_reset, 0x38(%esp) has a valid address; after hub_port_reset,
0x38(%esp) and the surrounding values on the stack are all 0's.
- There is no logical reason for hub_port_reset to ever trash a variable
that is local to hub_port_init without some sort of stack corruption.
However, it doesn't appear to be a simple stack overflow; there's not
garbage on the stack, there are 0's.
- From msleep(50), we jump into the kernel scheduler. This is where the
corruption is happening, and it's not simple stack corruption; it is CPU
register corruption. %eip is pointing to the wrong place in memory after
some schedule()s. If we run with CONFIG_PREEMPT enabled, we call
schedule() a lot more often, and in random places; that's why we need to
disable CONFIG_PREEMPT in order to examine this bug.
- If we accept that the bug is in the scheduler, and that ESP is already
garbage by the time we notice the bug, our debugging method is the
following: we save a reference to hdev in a global variable, and we
sprinkle schedule() with code that checks whether hdev is NULL. Every
time we make that check, if hdev is not NULL, we save EIP and ESP. If
hdev *is* NULL, we have hit the bug, and we print out EIP and ESP.
- By the time hdev is NULL, we are in switch_to(), doing a popl %ebp.
This tells me that somewhere in the context switch, we are corrupting
%esp.
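
For anyone who wants to try the same checkpoint approach, here is a rough
userspace sketch of the idea, not the actual kernel patch. All of the names
(watch_state, watch_arm, watch_check) are made up for illustration. It
assumes the global holds the address of the stack slot being watched, and
it records a string label at each checkpoint where the kernel version saved
EIP and ESP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper; none of these names come from the kernel tree. */
struct watch_state {
    void *volatile *slot;   /* address of the pointer being watched */
    const char *last_good;  /* last checkpoint where *slot was non-NULL */
    const char *first_bad;  /* first checkpoint where *slot was NULL */
};

static struct watch_state watch;

/* Start watching a pointer, e.g. the stack slot that holds 'hdev'. */
static void watch_arm(void *volatile *slot)
{
    assert(slot != NULL);
    watch.slot = slot;
    watch.last_good = NULL;
    watch.first_bad = NULL;
}

/*
 * Call this at every point of interest (the ticket sprinkles such checks
 * through schedule()). Returns 0 while the pointer is intact, 1 once it
 * has been seen as NULL. Only the first bad checkpoint is reported.
 */
static int watch_check(const char *label)
{
    if (watch.first_bad != NULL)
        return 1;               /* already reported; keep first report */
    if (watch.slot == NULL)
        return 0;               /* nothing armed yet */
    if (*watch.slot != NULL) {
        watch.last_good = label;  /* remember where things were still sane */
        return 0;
    }
    watch.first_bad = label;
    fprintf(stderr, "watched pointer is NULL at \"%s\" (last good: \"%s\")\n",
            label, watch.last_good ? watch.last_good : "(none)");
    return 1;
}
```

Arm it with the address of the local right before the suspect call, then
call watch_check() with a distinct label at each checkpoint. The corruption
happened somewhere between the reported last-good label and the first-bad
one, which is the same narrowing the saved EIP/ESP values give in the
kernel.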
This is a really difficult problem to figure out. By the time we've
corrupted 'hdev', %esp has already been overwritten. A few more data
points:
- Vserver touches the scheduler quite a bit, for its process accounting
(and possibly other things)
- Removing vserver completely from the kernel makes this problem
disappear completely. It could be that the bug is in vserver, or it could
be that vserver simply triggers a race that still exists without vserver,
but we just don't hit the code path.
- The first known occurrences of this bug showed up around Aug 17th.
Vserver was merged on Aug 14th (548a11da3f5badacdb88995f3ac9feb9070be4d9).
Javier claims that the bug does not happen with kernels from July 20.
Unfortunately, I am running low on time to debug this; 2+ full days is
really too long. However, this bug scares me quite a bit; corrupting CPU
registers is *bad*.
--
Ticket URL: <https://dev.laptop.org/ticket/2943#comment:6>
One Laptop Per Child <https://dev.laptop.org>
OLPC bug tracking system