TCP is broken in mesh mode

Benjamin M. Schwartz bmschwar at fas.harvard.edu
Tue Jun 10 15:03:06 EDT 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dear Networking experts,

I have been fighting for several months with the fact that invitations
often seem not to work, when running on a serverless mesh.  The symptoms
are quite strange.  If an invitation works once between two laptops, it
continues to work between them reliably.  If it fails once, it continues
to fail between them consistently. Sometimes, in the same place,
invitations will work on one mesh channel and not on another.  The same
two XOs may succeed reliably in a particular high-noise environment and
fail consistently in an area of virtual radio silence, or the reverse.

Even when invitations fail, other presence information continues to flow
correctly.  Even activity sharing continues to work beautifully.

With some help from Daf, we managed to get a tcpdump trace from two XOs
exhibiting this behavior at 1CC.  The dumps are attached to ticket #6463.
What we saw is bizarre, but also consistent with the behavior in the UI.
The invitations are unicast, implemented using TCP.  When machine A sends
an invitation to B, we see the following exchange:

1. A broadcasts an ARP request for B
2. B sees the ARP request and replies to A
3. A receives the ARP reply from B and sends a TCP SYN to B
4. B does not see the SYN packet (it does not appear in B's dump)
5. A retries a total of three times, but none of the SYN packets are seen
by B.
3b. In parallel, A broadcasts a presence-info update with mDNS, indicating
that it has shared the activity.
4b. B receives this broadcast, updates its presence-info cache, and even
assigns A's XO icon a new location in the mesh view.
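
The check behind step 4 can be sketched mechanically: take the packets
recorded on the sender, and list the SYNs that never show up in the
receiver's dump.  This is only an illustration.  The Packet record, the
unseen_syns helper, the host labels and the sample packets are all
hypothetical and are not taken from the dumps on the ticket; real
captures would first have to be parsed into such records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical, simplified record of one captured packet."""
    src: str
    dst: str
    proto: str       # "arp", "tcp" or "mdns"
    flags: str = ""  # TCP flags as tcpdump prints them, "S" for a bare SYN
    seq: int = 0


def unseen_syns(sent, received, src, dst):
    """Return the SYNs from src to dst found in `sent` but not in `received`."""
    def is_syn(p):
        return (p.proto == "tcp" and p.flags == "S"
                and p.src == src and p.dst == dst)

    seen = {p for p in received if is_syn(p)}
    return [p for p in sent if is_syn(p) and p not in seen]


# Made-up sample data shaped like the exchange described above.
arp_request = Packet("A", "broadcast", "arp")
arp_reply = Packet("B", "A", "arp")
syn = Packet("A", "B", "tcp", "S", 1000)  # retransmissions reuse the same seq
presence = Packet("A", "broadcast", "mdns")

dump_a = [arp_request, arp_reply, syn, presence, syn, syn]
dump_b = [arp_request, arp_reply, presence]  # broadcasts arrive, SYNs do not

lost = unseen_syns(dump_a, dump_b, "A", "B")
print(len(lost), "SYN packet(s) sent by A never appear in B's dump")
```

In this sample the ARP and mDNS packets are present on both sides, so
only the unicast TCP packets are reported as missing, which is the
pattern described in the list above.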

This behavior is fairly frightening.  I have seen it occur in low-noise
network environments with a total of 3 XOs, so I suspect a serious bug
somewhere in the lowest levels of the network stack.  Once this failure
occurs, it is extremely reproducible.  All subsequent invitations will
continue to fail.  I therefore suspect that the bug involves the driver or
firmware reaching an invalid state and becoming stuck there.
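
Since a failing pair keeps failing, a small probe run from one XO
against the other may help confirm the state without going through the
invitation UI.  The following is a minimal sketch, not the code Sugar
uses: tcp_probe is a hypothetical helper, and it assumes that something
is listening on the chosen port of the peer.  A result of "timeout"
would be consistent with SYN packets never arriving, whereas "refused"
would mean the packets got through and were answered.

```python
import socket


def tcp_probe(host, port, timeout=3.0):
    """Try one TCP connection and report how the attempt ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The peer answered with a reset, so packets do reach it.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError:
        # Anything else, for example no route to the host.
        return "error"
```

A call would look like tcp_probe(peer_address, port), where both
arguments are placeholders to be filled in for the machines under test.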

Given the variety of critical services that run over TCP, including the
much-emphasized Read activity, I hope that people familiar with the driver
and firmware will take a look at this bug.

- --Ben Schwartz

P.S. All this info is present at ticket #6463.  I am writing about it here
in an attempt to increase awareness.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkhOz+oACgkQUJT6e6HFtqSVBQCeKPWmqeoKOzVv55JS/HTAgf1r
bUYAoKCG+z1bBA+isc7Mun0VlQNGDars
=4w83
-----END PGP SIGNATURE-----


