[Trac #112] Board ..0958 Failure
Zarro Boogs per Child
bugtracker at laptop.org
Sat Oct 7 19:09:53 EDT 2006
#112: Board ..0958 Failure
-------------------------------+--------------------------------------------
Reporter: wmb at firmworks.com | Owner: mfoster
Type: defect | Status: new
Priority: normal | Milestone:
Component: hardware | Resolution:
Keywords: |
-------------------------------+--------------------------------------------
Old description:
> I'm opening this ticket to record information about a particular failed
> board that was sent to me for analysis.
>
> Identification: The small bar code tag on the bottom says 0017C4000958
>
> Failure description: The board had a sticky note that says "Failed
> 9/6/06". There was no indication of the system configuration or the
> software being used when the "failure" was detected. So it's difficult
> to know what "failure" means. Jim says that the people who were using
> the board were involved in wireless testing, so perhaps the wireless
> hardware is the problem. That certainly agrees with the results below.
>
> Configuration: The board had two wireless antennae installed. The SPI
> FLASH contains Insyde BIOS.
>
> Initial tests:
> a) Attach CRT and serial port and power on - Insyde BIOS screen comes on.
>
> b) Try to boot my diags off USB key - VSA USB enumeration does not report
> the existence of the USB key, either directly connected or via a powered
> USB 2.0 hub.
>
> c) Connect a USB keyboard (directly, no hub) and get control of BIOS via
> F1. Change boot order so USB HD is first (instead of floppy first).
>
> d) Disconnect keyboard, insert USB key. VSA USB enumeration still does
> not see it.
>
> e) Plug a ROM emulator into the PLCC socket, loaded with my LB + diagnostic
> payload. It gets about halfway through LB startup, then dies after "Call
> real_mode_switch_call_vsm".
>
> f) Another try dies shortly thereafter, saying "PCI: Sanity check failed".
>
> g) Suspecting memory flakiness, I apply the RAM timing patch described in
> trac #108. (The board has Hynix RAMs).
>
> h) Now my Open Firmware diag boots just fine, turns on the video screen,
> etc.
>
> i) I use it to boot Linux (build from yesterday) from USB key. vmlinuz
> and initrd load just fine, and Linux starts running. But then I get a
> bunch of messages from the USB stack: "hub 1-0:1.0: connect-debounce
> failed, port 4 disabled".
> The messages never stop; they keep coming out every second or so.
>
> So something is funny in USB land.
>
> j) I go back to my Open Firmware diagnostic and look at the USB device
> tree. Aha! The Marvell wireless device is not showing up in the tree.
>
> Next step (after getting some sleep): Figure out what's wrong with the
> Marvell wireless.
>
> Next day:
>
> I stepped through the USB probe sequence for the port connected to the
> Marvell wireless device. It initially identifies as a high-speed device
> (signified by the value in the portsc@ register after the "reset" bit is
> written). The low bits are 0x3, indicating port status change (2) and
> device present (1). Then I clear the 2 bit by writing to it, and about
> 8 milliseconds later, the 2 bit comes back on (another port status
> change) and the 1 bit goes off (device not present).
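A minimal Python sketch of the register behavior described above, assuming the PORTSC bit layout from the EHCI specification (bit 0 = Current Connect Status, bit 1 = Connect Status Change, which is write-1-to-clear). The helper names `decode` and `write_portsc` are hypothetical, and the last step, where the hardware sets the change bit and drops the connect bit, is simulated by hand rather than read from a real controller.

```python
# Low status bits of an EHCI PORTSC register (per the EHCI specification).
CCS = 1 << 0   # Current Connect Status: 1 = device present (read-only)
CSC = 1 << 1   # Connect Status Change: sticky, write 1 to clear
PED = 1 << 2   # Port Enabled/Disabled
PEDC = 1 << 3  # Port Enable/Disable Change: write 1 to clear

WRITE_ONE_TO_CLEAR = CSC | PEDC


def decode(portsc: int) -> dict:
    """Return the low PORTSC status bits as named booleans."""
    return {
        "connected": bool(portsc & CCS),
        "connect_change": bool(portsc & CSC),
        "enabled": bool(portsc & PED),
        "enable_change": bool(portsc & PEDC),
    }


def write_portsc(current: int, value: int) -> int:
    """Model a software write: 1s written to change bits clear them.

    Only the write-1-to-clear change bits are modeled; CCS is driven by
    the hardware, so the write leaves it untouched.
    """
    return current & ~(value & WRITE_ONE_TO_CLEAR)


# Hypothetical replay of the sequence described in the ticket.
reg = 0x3                      # after reset: connect change + device present
print(decode(reg))
reg = write_portsc(reg, CSC)   # software acknowledges the change -> 0x1
print(hex(reg))
reg = (reg | CSC) & ~CCS       # ~8 ms later the hardware reports a disconnect
print(decode(reg))
```

The final state, 0x2, is "change pending, nothing connected", which is exactly the reading that makes a host treat the device as gone.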
>
> This behavior is repeatable - the device "goes away" as far as the USB
> bus is concerned (but the USB protocol is gnarly so just about anything
> could make a device disappear). That result is consistent with the
> repeated "connect debounce" messages from Linux.
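To see why a device that vanishes a few milliseconds after connecting yields an endless series of debounce failures, here is a simplified debounce loop in Python. It is a sketch under assumed parameters (poll every 25 ms, require 100 ms of steady connection, give up after 2 s), loosely modeled on the idea behind the Linux hub driver's connect debounce but not a copy of that code; `debounce`, `steady`, and `flaky` are hypothetical names, and the 50 ms reconnect interval of `flaky` is invented for illustration.

```python
from typing import Callable

# Assumed parameters, loosely modeled on the Linux hub driver's debounce.
STEP_MS = 25       # how often software polls the port
STABLE_MS = 100    # how long the connection must hold steady
TIMEOUT_MS = 2000  # when to give up


def debounce(is_connected: Callable[[int], bool]) -> bool:
    """Return True once the port has read 'connected', with no intervening
    connect change, for STABLE_MS; return False after TIMEOUT_MS.

    is_connected(t) gives the connect status at time t in milliseconds.
    """
    stable = 0
    change_latched = False        # plays the role of the sticky change bit
    prev = is_connected(0)
    for t in range(1, TIMEOUT_MS + 1):
        now = is_connected(t)
        if now != prev:
            change_latched = True
        prev = now
        if t % STEP_MS == 0:      # software polls the port
            if now and not change_latched:
                stable += STEP_MS
                if stable >= STABLE_MS:
                    return True
            else:
                stable = 0
            change_latched = False  # software clears the change bit
    return False


def steady(t: int) -> bool:
    return True


def flaky(t: int) -> bool:
    # Hypothetical device: present for 8 ms out of every 50 ms.
    return t % 50 < 8


print(debounce(steady))
print(debounce(flaky))
```

A port that stays connected passes after four polls; one that drops out every few milliseconds keeps resetting the stability counter and times out, so each reappearance of the device would start (and fail) another debounce attempt.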
>
> Connected a scope to the USBP4 +/- lines (first did it on a good board to
> see what the waveform is supposed to look like). Sure enough, the bad
> board has very different waveforms from the good board. On the bad
> board, as soon as 1 is written to the EHCI CFGFLAG register, the USBP4+
> line starts pulsing from 0 V up to 4.25 V. The pulse width is 3.5 ms and
> the repetition period is 86 ms. Nothing even remotely like that pattern occurs
> on the good board, and it doesn't correspond to any USB protocol
> characteristics of which I'm aware. I wonder if that line is shorted to
> something else?
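The scope numbers can be put in perspective with a little arithmetic. This Python sketch uses only the measured values above plus two reference figures from the USB 2.0 specification (12 Mbit/s full-speed signaling and a 3.6 V maximum output-high level for full/low-speed drivers); treat those two constants as assumptions to verify against the spec.

```python
PULSE_WIDTH_S = 3.5e-3   # measured high time on USBP4+
PERIOD_S = 86e-3         # measured repetition period
PULSE_HIGH_V = 4.25      # measured pulse amplitude

# Reference values (assumed from the USB 2.0 specification).
FS_BIT_RATE = 12e6       # full-speed signaling rate, bits/s
FS_VOH_MAX = 3.6         # max full/low-speed output-high level, V

duty_cycle = PULSE_WIDTH_S / PERIOD_S       # fraction of time the line is high
frequency_hz = 1 / PERIOD_S                 # pulses per second
fs_bit_times = PULSE_WIDTH_S * FS_BIT_RATE  # pulse length in full-speed bits

print(f"duty cycle: {duty_cycle:.1%}")
print(f"repetition rate: {frequency_hz:.1f} Hz")
print(f"pulse length: {fs_bit_times:.0f} full-speed bit times")
print(f"above FS output-high limit: {PULSE_HIGH_V > FS_VOH_MAX}")
```

That works out to a duty cycle of roughly 4%, a repetition rate of about 11.6 Hz, and a pulse lasting about 42,000 full-speed bit times at a level above the normal full-speed signaling range; none of that resembles USB data signaling.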
>
> Another possibility is that the 88W8388 chip is insane. It's a System-
> On-Chip device with its own internal processor, so it could be doing
> arbitrarily strange things. Perhaps its microcode (which is loaded from
> a separate serial FLASH) got corrupted. Without development/debug tools
> for that device, discovering what is happening could be very difficult.
New description:
Hi, Mitch!
This sounds similar to a board I found which had been damaged in shipment.
One of the inductors had broken, and the USB port was flaky as heck.
Please ship the board back to Quanta, and we'll find out exactly what went
wrong with the hardware.
Thanks!
Mark
Comment (by mfoster):
Replying to [ticket:112 wmb at firmworks.com]:
--
Ticket URL: <http://dev.laptop.org/ticket/112#comment:2>
One Laptop Per Child <http://laptop.org/>