[OLPC Networking] TCP is broken in mesh mode
Polychronis Ypodimatopoulos
ypod at mit.edu
Tue Jun 10 15:20:02 EDT 2008
Nice report.
Benjamin M. Schwartz wrote:
> Dear Networking experts,
>
> I have been fighting for several months with the fact that invitations
> often seem not to work, when running on a serverless mesh. The symptoms
> are quite strange. If an invitation works once between two laptops, it
> continues to work between them reliably. If it fails once, it continues
> to fail between them consistently. Sometimes, in the same place,
> invitations will work on one mesh channel and not on another. The same
> two XOs may be reliably successful in a particular high-noise environment,
> and consistently fail in an area of virtual radio silence, as well as the
> reverse.
>
> Even when invitations fail, other presence information continues to flow
> correctly. Even activity sharing continues to work beautifully.
>
> With some help from Daf, we managed to get a tcpdump trace from two XOs
> exhibiting this behavior at 1CC. The dumps are attached to ticket #6463.
> What we saw is bizarre, but also consistent with the behavior in the UI.
> The invitations are unicast, implemented using TCP. When machine A sends
> an invitation to B, we see the following exchange:
>
> 1. A broadcasts an ARP request for B
> 2. B sees the ARP request and replies to A
> 3. A receives the ARP reply from B and sends a TCP SYN to B
> 4. B does not see the SYN packet (it does not appear in B's dump)
> 5. A retries a total of three times, but none of the SYN packets are seen
> by B.
> 3b. In parallel, A broadcasts a presence-info update with mDNS, indicating
> that it has shared the activity.
> 4b. B receives this broadcast, updates its presence-info cache, and even
> assigns A's XO icon a new location in the mesh view
>
> This behavior is fairly frightening. I have seen it occur in low-noise
> network environments with a total of 3 XOs, so I suspect a serious bug
> somewhere in the lowest levels of the network stack. Once this failure
> occurs, it is extremely reproducible. All subsequent invitations will
> continue to fail. I therefore suspect that the bug involves the driver or
> firmware reaching an invalid state and becoming stuck there.
>
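As a rough illustration of the comparison described in the quoted steps,
the sketch below lists packets that show up in the sender's capture but
not in the receiver's. The Pkt record, the kind labels, the ident field,
and the sample entries are all hypothetical; real input would have to be
extracted from the tcpdump files attached to the ticket.

```python
from typing import List, NamedTuple


class Pkt(NamedTuple):
    """Hypothetical one-line summary of a captured packet."""
    src: str
    dst: str
    kind: str   # e.g. "ARP-REQ", "ARP-REPLY", "TCP-SYN", "MDNS"
    ident: int  # assumed per-packet identifier that matches across dumps


def missing_at_receiver(sent: List[Pkt], received: List[Pkt],
                        sender: str) -> List[Pkt]:
    """Return packets from `sender` in `sent` that are absent from `received`."""
    seen = set(received)
    return [p for p in sent if p.src == sender and p not in seen]


# Made-up data shaped like the exchange in the report.
a_dump = [
    Pkt("A", "broadcast", "ARP-REQ", 1),
    Pkt("B", "A", "ARP-REPLY", 2),
    Pkt("A", "B", "TCP-SYN", 3),
    Pkt("A", "B", "TCP-SYN", 4),
    Pkt("A", "B", "TCP-SYN", 5),
    Pkt("A", "224.0.0.251", "MDNS", 6),
]
b_dump = [a_dump[0], a_dump[1], a_dump[5]]

for pkt in missing_at_receiver(a_dump, b_dump, sender="A"):
    print(pkt.kind, pkt.ident)
```

With this sample data, only the three unicast SYNs are reported as
missing, while the broadcast and multicast packets are accounted for.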
Keep in mind that the driver/firmware may very well have bugs, but:
1) The driver does not differentiate between different kinds of TCP/IP
packets (though it may wrongly differentiate between unicast and
broadcast/multicast). Try establishing a separate TCP connection between
the two laptops when invitations reproducibly fail.
2) The firmware does not differentiate between frames when deciding
whether a route exists. Try pinging the other node when invitations
reproducibly fail.
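
One way to run the first check is a small probe that makes a single TCP
connection attempt and reports how it ended. This is only a sketch: the
function name probe_tcp, the default timeout, and the address and port in
the usage comment are assumptions, not anything taken from the XO
software.

```python
import errno
import socket


def probe_tcp(host, port, timeout=5.0):
    """Classify one TCP connection attempt to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"   # full handshake completed
    except socket.timeout:
        return "timeout"         # no answer to our SYNs within `timeout`
    except ConnectionRefusedError:
        return "refused"         # peer answered with RST: nothing listening
    except OSError as exc:
        # Any other failure, e.g. EHOSTUNREACH or ENETUNREACH.
        return "error: " + errno.errorcode.get(exc.errno, str(exc))


# Hypothetical usage, run on A against B's address:
#   print(probe_tcp("192.0.2.10", 22))
```

A "refused" result is still useful here: it means the SYN reached the
other node and a reply came back, so unicast TCP is getting through. A
"timeout" is consistent with the SYNs (or the replies) being lost, which
is what the dumps suggest.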
> Given the variety of critical services that run over TCP, including the
> much-emphasized Read activity, I hope that people familiar with the driver
> and firmware will take a look at this bug.
>
> --Ben Schwartz
>
> P.S. All this info is present at ticket #6463. I am writing about it here
> in an attempt to increase awareness.
--
Polychronis Ypodimatopoulos
Graduate student
Viral Communications
MIT Media Lab
Tel: +1 (617) 459-6058
http://www.mit.edu/~ypod/