[OLPC Networking] TCP is broken in mesh mode

Polychronis Ypodimatopoulos ypod at mit.edu
Tue Jun 10 15:20:02 EDT 2008


Nice report.

Benjamin M. Schwartz wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Dear Networking experts,
>
> I have been fighting for several months with the fact that invitations
> often seem not to work, when running on a serverless mesh.  The symptoms
> are quite strange.  If an invitation works once between two laptops, it
> continues to work between them reliably.  If it fails once, it continues
> to fail between them consistently. Sometimes, in the same place,
> invitations will work on one mesh channel and not on another.  The same
> two XOs may be reliably successful in a particular high-noise environment,
> and consistently fail in an area of virtual radio silence, as well as the
> reverse.
>
> Even when invitations fail, other presence information continues to flow
> correctly.  Even activity sharing continues to work beautifully.
>
> With some help from Daf, we managed to get a tcpdump trace from two XOs
> exhibiting this behavior at 1CC.  The dumps are attached to ticket #6463.
> What we saw is bizarre, but also consistent with the behavior in the UI.
> The invitations are unicast, implemented using TCP.  When machine A sends
> an invitation to B, we see the following exchange:
>
> 1. A broadcasts an ARP request for B
> 2. B sees the ARP request and replies to A
> 3. A receives the ARP reply from B and sends a TCP SYN to B
> 4. B does not see the SYN packet (it does not appear in B's dump)
> 5. A retries a total of three times, but none of the SYN packets are seen
> by B.
> 3b. In parallel, A broadcasts a presence-info update with mDNS, indicating
> that it has shared the activity.
> 4b. B receives this broadcast, updates its presence-info cache, and even
> assigns A's XO icon a new location in the mesh view
>
> This behavior is fairly frightening.  I have seen it occur in low-noise
> network environments with a total of 3 XOs, so I suspect a serious bug
> somewhere in the lowest levels of the network stack.  Once this failure
> occurs, it is extremely reproducible.  All subsequent invitations will
> continue to fail.  I therefore suspect that the bug involves the driver or
> firmware reaching an invalid state and becoming stuck there.
>   


The driver/firmware may very well have bugs, but keep in mind:

1) The driver does not differentiate between different kinds of TCP/IP
packets (although it may wrongly differentiate between unicast and
broadcast/multicast). When invitations reproducibly fail, try
establishing a separate TCP connection between the same two laptops.

2) The firmware, as far as whether a route exists or not, does not
differentiate between frames. When invitations reproducibly fail, try
pinging the other node.
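
Both checks can be scripted so they are easy to repeat while the failure
is occurring. What follows is only a rough sketch, not a tested tool for
the XO: the function names, the timeouts, and the wording of the
interpretations are my own assumptions, and the ping call assumes the
Linux iputils `ping` with its `-c` and `-W` options. A refused
connection is reported separately from a timeout, since a refusal still
means the SYN reached the other machine.

```python
import socket
import subprocess


def tcp_probe(host, port, timeout=5.0):
    """Try one TCP connection; return 'connected', 'refused' or 'no-reply'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The peer answered with a RST, so our SYN did arrive.
        return "refused"
    except OSError:
        # Timeout, unreachable host, etc.: nothing usable came back.
        return "no-reply"


def ping_ok(host, count=3, timeout=2):
    """Return True if ping exits with status 0, i.e. got an echo reply."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ping binary missing or not executable.
        return False
    return result.returncode == 0


def interpret(tcp_state, icmp_ok):
    """Map the two results to a tentative (not authoritative) reading."""
    if tcp_state in ("connected", "refused"):
        return ("unicast TCP reaches the peer; the failure is probably "
                "specific to the invitation path, not the driver")
    if icmp_ok:
        return ("ping works but TCP SYNs go unanswered; look at TCP "
                "handling or filtering rather than the mesh route")
    return ("no unicast traffic gets through; suspect a missing mesh "
            "route or unicast handling in the driver/firmware")


def diagnose(host, port):
    """Run both checks against the peer and return the tentative reading."""
    return interpret(tcp_probe(host, port), ping_ok(host))
```

Run `diagnose(peer_address, port)` on machine A against machine B while
invitations are failing, with any port you like on B; even a closed port
is informative because of the refused case.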

> Given the variety of critical services that run over TCP, including the
> much-emphasized Read activity, I hope that people familiar with the driver
> and firmware will take a look at this bug.
>
> - --Ben Schwartz
>
> P.S. All this info is present at ticket #6463.  I am writing about it here
> in an attempt to increase awareness.
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.9 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iEYEARECAAYFAkhOz+oACgkQUJT6e6HFtqSVBQCeKPWmqeoKOzVv55JS/HTAgf1r
> bUYAoKCG+z1bBA+isc7Mun0VlQNGDars
> =4w83
> -----END PGP SIGNATURE-----
> _______________________________________________
> Networking mailing list
> Networking at lists.laptop.org
> http://lists.laptop.org/listinfo/networking
>   

-- 
Polychronis Ypodimatopoulos
Graduate student
Viral Communications
MIT Media Lab
Tel: +1 (617) 459-6058
http://www.mit.edu/~ypod/


