[Testing] [sugar] Automated testing, OLPC, code+screencasts.

Michail Bletsas mbletsas at laptop.org
Thu Mar 27 13:53:17 EDT 2008


"Benjamin M. Schwartz" <bmschwar at fas.harvard.edu> wrote on 03/27/2008 01:37:16 AM:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Michail Bletsas wrote:
> | testing-bounces at lists.laptop.org wrote on 03/26/2008 09:19:19 PM:
> |
> |> 2. Many, and perhaps most, of OLPC's remaining difficult bugs are related
> |> to the network.  They are most commonly related to the closed wireless
> |> firmware, which is buggy and lacks key features regarding mesh routing and
> |> multicast.
> |
> | Can you qualify your statement?
> I have seen the wireless hardware silently drop all outgoing packets but
> continue to route incoming packets for several minutes, until forcibly
> reset by the user (about a month ago).  The firmware is so unstable that
> the wireless driver even contains a mechanism to recognize when the
> firmware has wedged and reset it.  This is what I mean by buggy.

So according to your thinking everything that has a reset button is buggy.
I guess that, technically speaking, you are correct ;-)
I also tend to believe that a "thick" firmware like the one that we use on 
the 8388 will always have bugs given that it is several hundred thousand 
lines of code so I don't feel bad for putting the reset functionality 
there in the first place.

You are also very quick to point fingers at the firmware for everything 
that goes wrong with the networking subsystem of the laptop. 
The behavior that you are describing can be explained if the wireless 
firmware stops communicating with the host CPU and is only forwarding 
frames for other mesh nodes. There has also been a major rewrite of the 
(completely open source) driver in use with the laptop, after which we 
started to see that behavior (which was not observed before the rewrite).
It is very easy to point fingers on religious grounds; it is much more 
difficult to fix problems. 


> 
> | What features does it lack when it comes to mesh routing?
> For me, the #1 missing feature is whitelisted wake-on-multicast.  To be
> specific, it should be possible for the firmware to be told which
> multicast addresses refer to this host.  The firmware would then wake up
> the CPU only when a multicast packet arrives with a destination that is on
> the whitelist.  Without this feature, we are forced to choose between
> never waking on multicast, and missing lots of important packets, and
> waking up on every single multicast packet, which essentially means never
> sleeping at all.

First of all, what you are describing is standard WOL (Wake-on-LAN) 
behavior, which was not present in the original spec of the mesh firmware 
in favor of the more general wake-up on broadcast, multicast or unicast. 
Marvell is working on adding that in.
So, no bug here, just an oversight on our part which is going to be remedied. 


Even with that support in place, we will still be "missing lots of 
important packets" unless we decide to wake up on every multicast frame. So 
a more specific filter is required, because you don't want to wake up on 
Avahi announcements but you do want to wake up on traffic from activities 
that you already participate in. You can do that at the application level, 
by stopping the Avahi listener before you suspend; however, that will add a 
lot of time to suspend and resume. 
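
To make the whitelist idea concrete, here is a minimal sketch of the 
filtering decision in Python. It is not the firmware's interface: the names 
`WakeFilter`, `allow` and `should_wake` are hypothetical, the activity group 
239.255.1.2 is made up for illustration, and only IPv4 groups are handled. 
The one non-hypothetical piece is the standard mapping of an IPv4 multicast 
group onto an Ethernet multicast address (01:00:5e plus the low 23 bits of 
the group), which is what a link-layer filter would actually see.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF  # only the low 23 bits survive the mapping
    octets = [0x01, 0x00, 0x5E,
              (low23 >> 16) & 0x7F, (low23 >> 8) & 0xFF, low23 & 0xFF]
    return ":".join(f"{o:02x}" for o in octets)


class WakeFilter:
    """Hypothetical whitelist: wake the host only for listed groups."""

    def __init__(self):
        self._macs = set()

    def allow(self, group: str) -> None:
        # Register a group the host cares about (e.g. a joined activity).
        self._macs.add(multicast_mac(group))

    def should_wake(self, dst_mac: str) -> bool:
        # Decision made per received multicast frame while the CPU sleeps.
        return dst_mac.lower() in self._macs


wake = WakeFilter()
wake.allow("239.255.1.2")                  # made-up activity group
mdns = multicast_mac("224.0.0.251")        # mDNS group used by Avahi
print(wake.should_wake(mdns))              # Avahi announcement: stay asleep
print(wake.should_wake("01:00:5e:7f:01:02"))  # activity traffic: wake up
```

Because 32 IPv4 groups share each Ethernet multicast address, a filter that 
matches only on the destination MAC can still produce occasional spurious 
wake-ups.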



> 
> My #2 missing feature is a control for transmit gain and receive gain.  By
> decreasing gain, the range of each transmission could be reduced, turning
> dense meshes in a single classroom into multihop meshes.  This might
> compensate somewhat for the firmware's simplistic multicast routing.  It's
> not clear that this would work, but at present we cannot even try it.

Why would you ever want to turn a classroom into a multihop mesh?
Just because you have a hammer doesn't mean everything is a nail.
That is exactly the approach that has created all the unrealistic 
expectations about what the mesh can and cannot do.
If you are in a classroom, an AP will always be a lot more efficient since 
it doesn't have to deal with the mesh control plane traffic.

As far as the support for transmit gain and receive gain is concerned, 
transmit power control is definitely supported and the firmware even 
supports per-frame tx power setting. The D/A on the power amplifier used 
on the XO's module is not fast enough for that to work, so one has to 
settle for coarser-grained control. The bottom line is that this is a 
hardware limitation, not a software one.

I don't really understand what receive gain adjustment will buy you in a 
dense scenario. One of the fundamental issues with WiFi radios in general 
is that interference range is much larger than decode range. What you can 
play with is the clear channel assessment threshold; however, that is 
different from receiver gain (usually done via an AGC in the analog 
domain).
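
A back-of-the-envelope sketch of the range argument, using the textbook 
log-distance path loss model. Every number here is an assumption chosen for 
illustration (path loss of 40 dB at 1 m, exponent 3, a -70 dBm decode 
threshold, a -85 dBm level at which a signal still interferes); none of them 
are measurements of the XO radio. Under this model, lowering transmit power 
shrinks both ranges by the same factor, so the interference range remains a 
fixed multiple of the decode range.

```python
import math


def range_m(tx_dbm, threshold_dbm, pl_d0_db=40.0, exponent=3.0, d0_m=1.0):
    """Distance at which received power drops to threshold_dbm.

    Log-distance model: Pr(d) = Ptx - PL(d0) - 10 * n * log10(d / d0).
    All default parameters are illustrative assumptions.
    """
    link_budget_db = tx_dbm - pl_d0_db - threshold_dbm
    return d0_m * 10 ** (link_budget_db / (10.0 * exponent))


DECODE_DBM = -70.0     # assumed level needed to decode a frame
INTERFERE_DBM = -85.0  # assumed level at which a signal still interferes

for tx_dbm in (18.0, 8.0, 0.0):
    decode = range_m(tx_dbm, DECODE_DBM)
    interfere = range_m(tx_dbm, INTERFERE_DBM)
    print(f"tx={tx_dbm:5.1f} dBm  decode={decode:6.1f} m  "
          f"interfere={interfere:6.1f} m  ratio={interfere / decode:.2f}")

# The ratio depends only on the threshold gap and the path loss exponent.
expected_ratio = 10 ** ((DECODE_DBM - INTERFERE_DBM) / (10.0 * 3.0))
assert math.isclose(range_m(0.0, INTERFERE_DBM) / range_m(0.0, DECODE_DBM),
                    expected_ratio)
```

With these assumed numbers the ratio is about 3.2 at every power level, which 
is why shortening hops by turning the power down does not by itself free up 
spectrum in a single room.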


> 
> Smart multicast routing is the other obvious missing feature; I appreciate
> that this is still considered an academic research problem.
It is, and the 802.11s standards committee is also struggling with it. 

> 
> | Can you point me to a better working implementation out there when it
> | comes to multicast routing?
> No, I cannot.
> 
> This wireless firmware may be the best mesh implementation in all of
> history, in the whole world.  It's still disgustingly buggy, and has
> already set the project back months.  Its multicast and wakeup behaviors
> have forced us to drop critical features.  The software team has come to
> regard the wireless system as so unpredictable that any task involving it
> is "science, not programming".

Just looking at the number of bugs in Trac contradicts your statement.
And yes, the wireless subsystem does many things that existing radios 
don't do. 
It is also asked to do "magic" as opposed to what physics realistically 
allows.
That's the main bug with it right now. It just doesn't make spectrum out 
of thin air...


> 
> I am also quite convinced that if OLPC developers were free to read the
> source code and modify it, given access to Marvell's internal
> documentation, we would be much further along.

That is generally true. However, opening the firmware runs against 
long-established practice in the wireless industry, which is enforced for 
some valid and some not so valid reasons.  Unfortunately, there is no 
example in the industry right now of an open, fully functional low-level 
wireless stack, and that will take some time to change. If the XO ends up 
being produced in really high volumes, then we will definitely revisit 
that. The bottom line is that right now the volumes of the devices that 
require their radios to be "closed" are much higher than those of the open 
source devices.



> 
> |> 3. Almost all of OLPC's major bugs are Heisenbugs.  They often don't
> |> appear at all with only one laptop, and appear rarely until one has 12 or
> |> more laptops sharing a wireless mesh.
> |
> | And most of them are due to the fact that our application traffic
> | saturates the wireless spectrum.
> 
> Indeed.  And that is due to a mismatch between Salut, which assumes
> efficient multicast routing, and the firmware, which doesn't provide it.
> I know very little about the ongoing work with Cerebro, but that seems to
> be a very reasonable next step.

Yes, it is.


M.



