Monday's Testing

John Watlington wad at laptop.org
Thu Mar 27 01:26:52 EDT 2008


Some context, for the larger community: I've been using our new testbed
http://wiki.laptop.org/go/Collaboration_Network_Testbed
to explore why we can't reliably DHCP from the school server over the
mesh (#5963). The problem APPEARS to be that in the presence of
congestion in the mesh, the current route discovery algorithm favors
many-hop routes over single-hop routes. This is a predictable result of
the current algorithm, not an implementation error.

On Mar 26, 2008, at 6:06 PM, Ricardo Carrano wrote:

> We may have to change some metrics in the future. I agree with that
> principle. Today, a 3-hop 54 Mbps path will be preferred over a
> one-hop 1 Mbps path, which makes sense in terms of spectrum resources
> (a 1000-byte frame transmitted at 1 Mbps takes 47 times as long as
> the same frame at 54 Mbps), but it is not optimal in terms of buffer
> space and probability of error. It seems to me that this is such a
> delicate balance that it should be put aside for some time, while we
> do some other, simpler and easier tunings. I suggest this engineering
> approach.
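
The 47x figure checks out once PHY overhead is included. A rough
back-of-envelope in Python (my sketch, not from the captures), assuming
a long DSSS preamble at 1 Mbps and 802.11g OFDM framing at 54 Mbps; the
constants are standard 802.11 values, not testbed measurements:

    # Airtime for a 1000-byte frame, including PHY overhead.
    # Assumptions: long DSSS preamble (192 us) at 1 Mbps; 802.11g OFDM
    # at 54 Mbps (20 us preamble, 16 service bits, 6 tail bits, 4 us
    # symbols carrying 216 data bits each).
    def airtime_dsss_1mbps(nbytes):
        return 192.0 + nbytes * 8        # 1 Mbps -> 1 us per bit

    def airtime_ofdm_54mbps(nbytes):
        bits = 16 + nbytes * 8 + 6       # service + payload + tail
        symbols = -(-bits // 216)        # ceiling division
        return 20.0 + symbols * 4

    slow = airtime_dsss_1mbps(1000)      # 8192 us
    fast = airtime_ofdm_54mbps(1000)     # 172 us
    print(slow / fast)                   # ~47.6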

> I returned to the captures to look specifically at path length,
> using some TCP traffic generated during the tests (btw, is this test
> reproducible?). I find that 91% of this traffic is being delivered in
> a single hop, and the average path length is 1.09. Not bad, is it?

See http://wiki.laptop.org/go/Collaboration_Network_Testbed#Startup_Problems_.28.235963.29
for reproducibility.
Tests starting with the 0323 series that look at this problem use
multiple antennas on the same channel, in an attempt to get a better
picture of what is going on. (Time to stop using active antennas for
packet capture?)

> I also see a strong correlation between data rate and path length.
> Only 3.3% of the direct (one-hop) traffic was transmitted at 1 Mbps,
> but this percentage jumps to 46% for frames with TTL 3. I believe
> this is an indicator of sanity, but it deserves some more thinking.
> What do you guys think?

I thought TTL 3 traffic was mostly retransmissions of RREQs. Why only
at the lowest rate? This might be a sampling artifact: the active
antenna/driver can't keep up with higher packet rates (at 54 Mbps,
short packets arrive fast).
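
Rough numbers, ignoring inter-frame spacing and contention (so an
upper bound), using the same OFDM airtime model as the sketch above:

    # Frame rate on a saturated 54 Mbps channel for short frames.
    def airtime_ofdm_54mbps(nbytes):
        return 20.0 + 4 * -(-(16 + 8 * nbytes + 6) // 216)

    print(1e6 / airtime_ofdm_54mbps(100))   # ~28000 frames/sec possible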

> I assume some of the XOs are placed in another room (not in the
> server room, since there are only 30 there) and this would account
> for at least some of the long paths and lower rates.

Check the above link; there is even a diagram of the laptop locations
(deviations from these locations are noted in some of the
experiments). Many of the ten-laptop tests had all laptops on a single
1.5 m x 0.7 m table.

> I would also like to note that I believe a higher noise floor would
> hurt the higher rates (independent of distance). [Wad, do you think
> differently?] Btw, what is the noise floor for this testbed?

I don't have a spectrum analyzer onsite yet, but we have reason to
believe it is pretty low. Neighbors (such as the document storage
warehouse across the street) are spaced well away. We don't see any
other nets in the laptops' neighborhood view, although I have caught
one beacon packet from a neighbor at least 300 m away when the laptops
are all off.

> So, I really don't see such a big pathology (some, maybe) in the
> protocol decisions. Burstiness caused by path discovery traffic
> seems much scarier to me.

Can this algorithm be tweaked?  Cerebro to the rescue?
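
For concreteness, the kind of tweak I mean is a per-hop penalty added
to an airtime-style link cost, so a slow direct hop can beat several
fast ones. A toy sketch in Python; the penalty value is purely
illustrative, and choosing it well is exactly the open tuning question:

    # Toy path metric: per-link airtime cost plus a fixed per-hop
    # penalty.  HOP_PENALTY is hypothetical, not a firmware parameter.
    HOP_PENALTY = 600.0   # extra cost per hop, arbitrary units

    def link_cost(rate_mbps, frame_bits=8192.0):
        return frame_bits / rate_mbps          # rough airtime in us

    def path_cost(rates_mbps):
        return sum(link_cost(r) + HOP_PENALTY for r in rates_mbps)

    print(path_cost([5.5]))          # ~2089: one direct 5.5 Mbps hop
    print(path_cost([54, 54, 54]))   # ~2255: three 54 Mbps hops lose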

Evenin',
wad

> On Tue, Mar 25, 2008 at 9:52 PM, John Watlington <wad at laptop.org>  
> wrote:
>
> The firmware behavior that seemed broken to me was the fact that
> during the barrages the school server kept trying to set up 3- or
> 4-hop routes. I contend that the routing algorithm is flawed in some
> basic way which leads it to favor multiple hops when faced with
> congestion (which makes the congestion worse). This is even
> predictable: when faced with a high noise floor, closer hops will
> perform better.
>
> Changing the expiry time will help, but it doesn't address the root
> cause. In particular, the initial barrages are currently only around
> two expiry times long!
>
> Can we change the algorithm to be weighted more heavily in favor of
> a minimal hop count?
> wad
>
> On Mar 25, 2008, at 8:41 PM, Ricardo Carrano wrote:
>
> > The test consists of 10 nodes pinging the multicast address
> > 224.0.0.1 in a quiet environment (< -96 dBm noise floor).
> > We vary the expiration time for route entries.
> >
> > EXP TIME   %RREQs   %RREPs   Avg path length   % Retransmissions
> >      1      0.60     0.02        1.33               0.65
> >      5      0.55     0.02        1.31               0.61
> >     10      0.50     0.02        1.22               0.60
> >     15      0.43     0.01        1.20               0.60
> >     20      0.39     0.01        1.12               0.59
> >     30      0.29     0.01        1.12               0.60
> >     60      0.21     0.01        1.09               0.62
> >    300      0.27     0.01        1.09               0.61
> >
> > EXP TIME = route expiration time in seconds
> > %RREQs, %RREPs = percentage of these kinds of frames in the capture
> > Avg path length = the average length of paths between source and
> > destination of unicast traffic
> > % Retransmissions = frames with the retry bit on
> >
> > Comments:
> > * The retry rate is high (the contention window was left at the
> > default values, so it could be half of that). But the important
> > thing is that it stays constant (around 60%), indicating that
> > expiration of routes did not hurt the delivery rate (though the
> > nodes were stationary).
> > * The nodes are within a small area (table-sized), so one would
> > expect the avg path length to be closer to 1. But it is interesting
> > to note that increasing the expiration time leads to more reasonable
> > path lengths (with expiration times of 20 seconds or more, no path
> > was registered with more than 3 hops, but below 20 seconds, 6% to 8%
> > of the paths had 3 or more hops).
> > * 300 seconds and 1 second are there just to bracket the range.
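
(For anyone reproducing the test above: a minimal sketch of the
per-node driver in Python, assuming a Linux ping on the path; the
reply counting is crude, and the 100-ping count simply mirrors the
setup described above:)

    # Send 100 default-size pings to the all-hosts multicast address,
    # then count replies per responding node.
    import re
    import subprocess

    out = subprocess.run(["ping", "-c", "100", "224.0.0.1"],
                         capture_output=True, text=True).stdout

    # Multicast pings draw replies from every node, so count replies
    # per responder instead of using ping's own summary line.
    replies = re.findall(r"bytes from ([\d.]+)", out)
    for node in sorted(set(replies)):
        print(node, replies.count(node), "/ 100 replies")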
> >
> >
> > On Tue, Mar 25, 2008 at 8:11 PM, Ricardo Carrano
> > <carrano at laptop.org> wrote:
> > I've taken a look at the capture for 0321D.
> >
> > What I see is a clearly congested scenario with a clear root cause:
> > path discovery involving the school server [1].
> > This accounts for 71% of the captured frames, and analysis at
> > millisecond resolution reveals many saturated periods.
> >
> > So, my guess is that RREPs to the failing nodes are not coming
> > through due to congestion (btw, they do eventually, by the end of
> > the capture, which indicates a transitory condition).
> >
> > We have a scenario where traffic concentrates toward one point
> > (the school server) and there is no mobility. Increasing the route
> > expiration time from 10 to 20 seconds will bring a lot of
> > improvement. This is another example of the adaptiveness we need
> > in the presence of a school server.
> >
> > I've been studying the impact of this and found an interesting
> > side effect: when you increase the route expiration time, you
> > reduce the average route length (though I fail to see why). (I
> > will send some data in another message.)
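
(One possible mechanism, purely speculative on my part: every expiry
forces a fresh RREQ barrage, and each barrage is another chance to lock
in a bad multi-hop route while the air is congested. The discovery load
itself also scales inversely with expiry time; an order-of-magnitude
sketch, assuming every node rebroadcasts each RREQ once:)

    # Back-of-envelope RREQ load vs. route expiration time: static
    # nodes, each re-discovering its route to the server once per
    # expiry period.  All constants illustrative.
    N_NODES = 10        # XOs talking to the server
    BARRAGE = 4         # RREQ frames per discovery attempt
    WINDOW = 300        # seconds observed

    for expiry in (1, 5, 10, 20, 60):
        discoveries = N_NODES * WINDOW / expiry
        frames = discoveries * BARRAGE * N_NODES   # incl. rebroadcasts
        print(expiry, "s expiry ->", int(frames), "RREQ frames (rough)")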
> >
> > I would like to study the data more carefully before publishing,
> > but the 0321D test results clearly demonstrate that we need to be
> > less chatty in our path discovery if we want to have 40 nodes and a
> > school server.
> >
> > We have two ways of achieving this:
> > 1 - Reduce the RREQ barrage (from 4 to 3 frames, for instance)
> > 2 - Increase the route expiration time
> >
> > Option 1 would involve changes to the firmware. Option 2 is an easy
> > test to perform (iwpriv).
> >
> > In short, I suggest this tweak.
> >
> > [1] wlan_mgt.fixed.action_type == 1 and wlan contains
> > 00:50:43:28:26:2d
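
(The same filter can be applied programmatically to extract the 71%
figure from a capture; a sketch assuming the pyshark wrapper around
tshark is installed, with the capture filename hypothetical:)

    # Fraction of captured frames that are path-discovery (action)
    # frames involving the school server.
    import pyshark

    CAP = "0321D.pcap"   # hypothetical filename
    FILTER = ("wlan_mgt.fixed.action_type == 1 and "
              "wlan contains 00:50:43:28:26:2d")

    total = sum(1 for _ in pyshark.FileCapture(CAP, keep_packets=False))
    disco = sum(1 for _ in pyshark.FileCapture(CAP, display_filter=FILTER,
                                               keep_packets=False))
    print("%d / %d = %.1f%%" % (disco, total, 100.0 * disco / total))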
> >
> > --
> > Ricardo Carrano
> >
> >
> >
> > On Tue, Mar 25, 2008 at 12:43 PM, John Watlington <wad at laptop.org>
> > wrote:
> >
> > On Mar 25, 2008, at 11:59 AM, Ricardo Carrano wrote:
> >
> > > Wad,
> > >
> > > I will insist on #6589 because it seems key to the problem.
> > > From all I've seen, it cannot be caused by stress. It is
> > > something that fails at boot time. So it would not be a protocol
> > > issue, but a bug in the driver/firmware.
> >
> > The reason I don't believe it is #6589 (as it is currently
> > described) is that it is the school server that is not responding
> > (well) to the path requests of failing laptops. But other laptops
> > are successful in getting a response from the server at the same
> > time. What I'm seeing in the traces is that the server responds
> > occasionally to failing laptops, but selects a four-hop path
> > instead of the direct path which it uses for working laptops.
> >
> > For example, see: http://wiki.laptop.org/go/Collab_Network_Test_0321D
> >
> > > Other than that, I agree that we need some tweaking, and I am
> > > working on that now. Setting a bigger contention window, for
> > > instance, is definitely worth trying (please see below).
> > >
> > > The problem with tuning is that bugs get in the way and mask
> > > everything. So we really, really need to get rid of #6589.
> >
> > Agreed.
> >
> > > ======================================
> > > Here are some results for the tests with CW settings.
> > >
> > > Setup:
> > > * Build 695 + firmware 22.p6.
> > > * 10 XOs in a quiet environment (noise floor < -96 dBm)
> > > * The 10 XOs pinging multicast address 224.0.0.1 (100 pings each,
> > > default size, default interval).
> > > * Test was repeated 3 times (alternated)
> > >
> > > Summary:
> > > * There is a strong correlation between the retry rate and the CW
> > > settings.
> > > * Frames with the retry bit set dropped from 12.4% to 5.7% when
> > > the CW size was increased.
> > > * For this particular test, delivery rates improved with the big
> > > CW (from 86% to 92%).
> > >
> > >
> > > Retry bit
> > > =========
> > >
> > > Default CW (CWmin = 7, CWmax = 31)
> > > run 1: 7588/62109 12.2%
> > > run 2: 7312/58398 12.5%
> > > run 3: 7566/60949 12.4%
> > >
> > > Big CW (CWmin = 31, CWmax = 1023)
> > > run 1: 3734/64585 5.8%
> > > run 2: 4042/69035 5.9%
> > > run 3: 3909/70745 5.5%
> > >
> > > ping success rate:
> > > ==================
> > > Default CW (CWmin = 7, CWmax = 31)
> > > run 1: 86%
> > > run 2: 85%
> > > run 3: 88%
> > >
> > > Big CW (CWmin = 31, CWmax = 1023)
> > > run 1: 90%
> > > run 2: 95%
> > > run 3: 91%
> > >
> > > Retries for echo replies:
> > > =================
> > > Default CW: 4260 in 7214 (59%)
> > > Big CW: 2004 in 7235 (28%)
> > >
> > > Cheers!
> > > Ricardo Carrano
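
(A crude sanity check on the CW numbers above: if all 10 nodes drew a
random slot from the initial window at the same time, the chance that a
given sender's slot is also picked by someone else shrinks as the
window grows. A back-of-envelope, ignoring backoff doubling and idle
periods, so only the direction and rough ratio are meaningful:)

    # Collision estimate: n contenders each draw one slot uniformly
    # from [0, CWmin]; a sender collides if anyone else drew its slot.
    def p_collision(cw_min, n=10):
        slots = cw_min + 1
        return 1.0 - (1.0 - 1.0 / slots) ** (n - 1)

    print(p_collision(7))    # ~0.70 for the default CWmin
    print(p_collision(31))   # ~0.25 for the big CWmin
    # Both far above the measured 12%/6% (the nodes aren't all
    # contending at once), but the ~2-3x improvement is the same order
    # as the measured drop in retry rate.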
> >
> >
> >
> >
>
>



