Brick Insurance

Jim Gettys jg at laptop.org
Tue Jan 30 23:05:10 EST 2007


On Tue, 2007-01-30 at 09:48 -0700, ron minnich wrote:
> On 1/30/07, Mitch Bradley <wmb at firmworks.com> wrote:
> 
> > For now it won't help much because the EC code, which is a single point
> > of failure, has to be upgraded so often due to changes from Quanta.
> 
> yes, this is another reason I did not bring it up earlier :-)
> 
> > When that settles down, we should do something along the lines that Ron
> > suggests.  There are quite a few ways it can be done.  We'll need to
> > consider the characteristics of the FLASH device and the likely failure
> > modes to pick the most effective strategy.
> 
> What we've seen, over the last couple years, is that the most common
> is flashus interruptus, due to things like power dropping, breaker
> tripping, or, literally, people tripping on a power cable :-) Yes,
> there's a battery, but ... maybe their battery has not charged for
> some reason. Like mine :-)

Yes, and we should probably refuse to reflash unless a battery is
installed and indicating good charge, or else require human
intervention; as you note, batteries may be "broken", as so many of our
B1 batteries have ended up due to our (re)charging problems.  The
battery check avoids the breaker tripping and power cable problems you
face in machine room environments.  We can't presume people read the
directions about having their machines plugged in (or are even able to
plug them in).
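
A minimal sketch of that gating rule, for illustration only. The
BatteryStatus fields, the 50% threshold, and the may_reflash name are
assumptions of this sketch, not anything the EC or firmware is known to
expose:

```python
from dataclasses import dataclass

# Assumed threshold; the right value would depend on how long a reflash takes.
MIN_CHARGE_PERCENT = 50


@dataclass
class BatteryStatus:
    """Hypothetical snapshot of what the battery reports."""
    present: bool
    healthy: bool
    charge_percent: int


def may_reflash(battery: BatteryStatus, operator_override: bool = False) -> bool:
    """Allow a reflash only with a good battery, or with explicit human sign-off."""
    if operator_override:
        # Human intervention path, for machines whose battery is "broken".
        return True
    return (
        battery.present
        and battery.healthy
        and battery.charge_percent >= MIN_CHARGE_PERCENT
    )
```

With this shape, a machine with a missing or dead battery is refused by
default, and only an explicit override lets the reflash proceed.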

> Second is flashing the wrong image, while over-riding the "are you
> sure" question.

Yup.  Good source of bricking.  We do check a checksum and an ID, which
avoids the worst of this failure mode.
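
As a rough illustration of a checksum-plus-ID check, here is a sketch
under assumed conventions.  The header layout (a 4-byte board ID followed
by a CRC32 of the payload) and the function names are hypothetical; the
actual image format may differ:

```python
import struct
import zlib

# Hypothetical header: 4-byte board ID, then little-endian CRC32 of the payload.
HEADER = struct.Struct("<4sI")


def build_image(board_id: bytes, payload: bytes) -> bytes:
    """Prefix a payload with the assumed ID + checksum header."""
    return HEADER.pack(board_id, zlib.crc32(payload)) + payload


def image_is_acceptable(image: bytes, expected_id: bytes) -> bool:
    """Reject images that are truncated, corrupted, or built for another board."""
    if len(image) < HEADER.size:
        return False
    board_id, recorded_crc = HEADER.unpack_from(image)
    payload = image[HEADER.size:]
    return board_id == expected_id and zlib.crc32(payload) == recorded_crc
```

The ID comparison catches the "wrong image" case, and the checksum
catches a damaged or partially downloaded file; neither helps if the
user overrides the check.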

> 
> Third is flashing a flash part that, at about that time, decides it
> has done enough flashing for one lifetime, thank you. In spite of the
> nominal guarantee of 100,000 cycles, we have parts that seem to last
> about 10 cycles, as delivered from a vendor. We theorized that the
> flash parts were recycled from some other application. These are the
> worst, in many ways. If you have a fallback, at least you can live on
> that backup image, and have a usable, albeit outdated, system.
> 
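
For reference, the fallback idea above could look roughly like the
following sketch.  It assumes each slot keeps a CRC32 recorded when the
image was written; the slot representation and the pick_bootable name
are hypothetical, and whether a backup slot fits at all is an open
question:

```python
import zlib
from typing import Optional, Sequence, Tuple

# Hypothetical slot: (image bytes as read back, CRC32 recorded at write time).
Slot = Tuple[bytes, int]


def pick_bootable(slots: Sequence[Slot]) -> Optional[int]:
    """Return the index of the first slot whose contents still match its CRC."""
    for index, (image, recorded_crc) in enumerate(slots):
        if zlib.crc32(image) == recorded_crc:
            return index
    # Nothing verifies: the machine is bricked without outside help.
    return None
```

Listing the newest image first means an interrupted or worn-out primary
slot falls through to the older backup.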

SPI boot flash is a very different animal from flash intended for other
uses; I can easily believe a low cycle life on those parts.

As Mitch notes, there are ROI questions of how far to go: I suspect we
should go a bit further, and leave it at that.  Whether a backup set of
firmware is even possible is not yet clear.  And if we have to trade
that for space for diagnostics that make the system easier to repair in
the field, the diagnostics may easily come first on the priority list.
Having Mitch's touch pad diagnostic has been a life saver this week, to
name a simple example bright in my mind.
                             - Jim

-- 
Jim Gettys
One Laptop Per Child




