[Testing] [sugar] Automated testing, OLPC, code+screencasts.

Thu Mar 27 16:35:39 EDT 2008

On Thu, 2008-03-27 at 13:53 -0400, Michail Bletsas wrote:
> "Benjamin M. Schwartz" <bmschwar at fas.harvard.edu> wrote on 03/27/2008 
> 01:37:16 AM:
> 
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> > 
> > Michail Bletsas wrote:
> > | testing-bounces at lists.laptop.org wrote on 03/26/2008 09:19:19 PM:
> > |
> > |> 2. Many, and perhaps most, of OLPC's remaining difficult bugs are
> > |> related to the network.  They are most commonly related to the closed
> > |> wireless firmware, which is buggy and lacks key features regarding mesh
> > |> routing and multicast.
> > |
> > | Can you qualify your statement?
> > I have seen the wireless hardware silently drop all outgoing packets but
> > continue to route incoming packets for several minutes, until forcibly
> > reset by the user (about a month ago).  The firmware is so unstable that
> > the wireless driver even contains a mechanism to recognize when the
> > firmware has wedged and reset it.  This is what I mean by buggy.
> 
> So according to your thinking everything that has a reset button is buggy.
> I guess that, technically speaking, you are correct ;-)
As you might expect, I am only echoing the driver authors, who tend to
use phrases like "kick the firmware in the head".  I have not scanned
the logs to determine how often this reset is triggered.

> I also tend to believe that a "thick" firmware like the one that we use on 
> the 8388 will always have bugs given that it is several hundred thousand 
> lines of code so I don't feel bad for putting the reset functionality 
> there in the first place.

There are bugs, and then there are bugs that cause the subsystem to
become completely nonfunctional.  The Linux kernel is big and
complicated, and has plenty of bugs, but we don't worry about it
panicking all the time and having to reset our XO.

> 
> You are also very quick to point fingers at the firmware for everything 
> that goes wrong with the networking subsystem of the laptop. 
> The behavior that you are describing can be explained when the wireless 
> firmware doesn't communicate with the host CPU and is only forwarding 
> frames for other mesh nodes. There has also been a major rewrite of the 
> (completely open source) driver in use with the laptop, after which we 
> started to see that behavior (which was not observed before the rewrite).
> It is very easy to point fingers on religious grounds, it is much more 
> difficult to fix problems. 

Indeed.  I am again echoing the driver authors, who have told me that
the relevant codepath in the driver is so simple that it is not likely
to be the source of this problem.

> 
> 
> > 
> > | What features does it lack when it comes to mesh routing?
> > For me, the #1 missing feature is whitelisted wake-on-multicast.  To be
> > specific, it should be possible for the firmware to be told which
> > multicast addresses refer to this host.  The firmware would then wake up
> > the CPU only when a multicast packet arrives with a destination that is
> > on the whitelist.  Without this feature, we are forced to choose between
> > never waking on multicast (and missing lots of important packets) and
> > waking up on every single multicast packet, which essentially means
> > never sleeping at all.
> 
> First of all, what you are describing is standard WOL behavior (Wakeup on 
> LAN) which was not present in the original spec of the mesh firmware in 
> favor of the more general wakeup on broadcast, mcast or unicast. Marvell 
> is working on adding that in.
> So, no bug here, just oversight on our part which is going to be remedied. 

That is wonderful news.
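
To make the requested behavior concrete, here is a minimal sketch in
Python of the wake decision I have in mind.  It is illustrative only:
the class and method names are hypothetical, 239.255.0.1 is a made-up
group address, and nothing here describes how the Marvell firmware is
or will be implemented.  Handling of unicast and broadcast frames is
deliberately left out.

```python
import ipaddress


class WakeFilter:
    """Hypothetical model of a whitelisted wake-on-multicast filter."""

    def __init__(self):
        # Multicast groups that the host has asked to be woken for.
        self._whitelist = set()

    def allow(self, group):
        """Add a multicast group address to the whitelist."""
        addr = ipaddress.ip_address(group)
        if not addr.is_multicast:
            raise ValueError(f"{group} is not a multicast address")
        self._whitelist.add(addr)

    def should_wake(self, destination):
        """Wake the CPU only if the destination group is whitelisted."""
        return ipaddress.ip_address(destination) in self._whitelist


wake_filter = WakeFilter()
wake_filter.allow("224.0.0.251")  # the mDNS group, which Avahi listens on

for dest in ("224.0.0.251", "239.255.0.1"):
    print(dest, "wake" if wake_filter.should_wake(dest) else "stay asleep")
```

With only the mDNS group whitelisted, the first packet wakes the host
and the second is ignored, which is the middle ground between the two
extremes described above.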

> 
> 
> Even with that support in place, we will still be "missing lots of 
> important packets" unless we decide to wake up on every multicast frame. So 
> a more specific filter is required because you don't want to wake up on 
> Avahi announcements but you do want to wake up on traffic from activities 
> that you already participate in. You can do that on the application level, by 
> stopping the Avahi listener before you suspend, however that will add a 
> lot of time to suspend and resume. 

I think we do want to wake up on Avahi announcements.  If my XO is
suspended looking at the mesh view, and someone else shares an Activity,
my XO should wake up, process the Avahi event through the presence
service, draw the icon in the mesh view, and then suspend again.  The
same is true for buddies joining and leaving.  Otherwise, I will never
see shared activities, because my XO is almost always suspended and
ignoring announcements.

> > 
> > My #2 missing feature is a control for transmit gain and receive gain.
> > By decreasing gain, the range of each transmission could be reduced,
> > turning dense meshes in a single classroom into multihop meshes.  This
> > might compensate somewhat for the firmware's simplistic multicast
> > routing.  It's not clear that this would work, but at present we cannot
> > even try it.
> 
> Why would you ever want to turn a classroom into a multihop mesh?
> Just because you have a hammer, doesn't turn everything into a nail.

The best theory I've been given for multicast meltdown is that every
multicast packet is echoed back by all 30 other XOs simultaneously.
Lower transmit power would mean that only a handful of XOs receive the
initial transmission, allowing the flood-fill to work as designed.
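
As a rough illustration of this theory, here is a toy flood-fill in
Python that counts how many nodes retransmit in each round.  It assumes
every node rebroadcasts a packet exactly once, on first receipt, and it
ignores interference entirely.  The topologies and function names are
hypothetical, and this is not a model of the actual firmware's routing.

```python
def flood_rounds(neighbors, source):
    """Return the number of nodes transmitting in each round of a flood."""
    seen = {source}
    frontier = [source]
    rounds = []
    while frontier:
        rounds.append(len(frontier))
        next_frontier = []
        for node in frontier:
            for peer in neighbors[node]:
                if peer not in seen:
                    # First receipt: this peer rebroadcasts next round.
                    seen.add(peer)
                    next_frontier.append(peer)
        frontier = next_frontier
    return rounds


def full_mesh(n):
    """Every node hears every other node (a dense classroom)."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def ring(n, reach):
    """Each node hears only `reach` neighbors on either side."""
    return {
        i: [(i + d) % n for d in range(-reach, reach + 1) if d != 0]
        for i in range(n)
    }


print(flood_rounds(full_mesh(31), 0))  # [1, 30]
print(flood_rounds(ring(31, 2), 0))    # [1, 4, 4, 4, 4, 4, 4, 4, 2]
```

In the dense case all 30 receivers retransmit in the same round; with
reduced range no more than four do so at once, at the cost of more
rounds.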

> That is exactly the approach that has created all the unrealistic 
> expectations about what the mesh can and cannot do.

The problem is not what the mesh can do, but what it _must_ do.  By the
end of this year, many hundreds of thousands of children will have XOs
and no other infrastructure.  The mesh must enable communication. You're
one of many people around here whose job description violates a law of
nature.

More seriously, it seems that the hardware really does have the
capability to make large meshes work. For example, the MPP functionality
suggests that eth0 and msh0 can run on two different channels.  Thus,
one can imagine keeping msh0 at ttl=1, and using eth0 to create a
sparse, longer-distance overlay mesh consisting of a small fraction of
the XOs.  Unfortunately, we can't test new routing proposals (even ones
written by researchers in mesh routing), because we have neither a Free
firmware nor a thin-mac firmware.  I've heard suggestions that one or
the other may be coming, eventually.

> If you are in a classroom, an AP will always be a lot more efficient since 
> it doesn't have to deal with the mesh control plane traffic.

If you don't have a reliable power supply, and your XOs are charged by
foot pedals and cows walking in circles, then you don't have any APs.
Those villages are precisely the ones where OLPC is most important.  I
understand that mesh is difficult, but "buy an AP" is an inappropriate
answer.  Remember that these countries are already taking out loans to
buy the XOs.

> 
> As far as the support for transmit gain and receive gain is concerned, 
> transmit power control is definitely supported and the firmware even 
> supports per frame tx power setting. The D/A on the power amplifier used 
> on the XO's module is not fast enough for that to work, so one has to 
> settle for coarser grain control. The bottom line is that this is a 
> hardware limitation, not software.
> 
> I don't really understand what receive gain adjustment will buy you in a 
> dense scenario. One of the fundamental issues with WiFi radios in general, 
> is that interference range is much larger than decode range.

That is a good explanation for why my proposal won't work.  I didn't
know that.

--Ben