coping with lost connectivity

Sun Jan 13 13:07:09 EST 2008

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Simon McVittie wrote:
> On Sat, 12 Jan 2008 at 12:00:02 -0500, Benjamin M. Schwartz wrote:
>> This is precisely what I am saying.  Telepathy should only register a disconnect
>> if there is no way to route between two XOs.  The mesh system should be designed
>> so that moving about within the mesh, or handing off between Salut and Gabble,
>> or switching from one internet-connected wireless network to another, does not
>> cause a Telepathy disconnect.
> 
> For the record: Telepathy, which is a standard API used on OLPC and elsewhere,
> does not and will not do this. A Telepathy Connection object represents a
> connection (e.g. a Gabble connection represents a TCP connection to the
> server) and we're not going to mangle the API to behave otherwise.
> 
> You're right that a possible solution for activities' networking would
> be for some lower layer to "paper over the cracks" - Telepathy is the
> wrong layer to be doing this, but the Presence Service could do it, or
> so could a library in the sugar. hierarchy.

OK, I was in the wrong layer.

> 
> However, each activity is fundamentally going to need a way to sync its state
> on initial connection; for at least the medium term, as Sjoerd said, we propose
> that the same mechanism be used to resync after connectivity loss.

I have no argument with this.  If the mesh splits unexpectedly, there is nothing
that can be done.

> 
> If we have enough developer time to be able to work on library code for
> activities' networking, it's likely that the API will involve an
> activity-supplied "resync" callback that's called when initially joining
> an activity, and when connectivity is regained after a connection loss.
> 
>> In each of these cases, the path between XOs
>> remains routable, with a gap of at most a few seconds.
> 
> I'm not convinced this is even technically feasible, given the
> constraints of the quality of the underlying network (if the packets
> aren't arriving, there's nothing we can do about it). It's certainly a
> much lower priority right now than making the servers scale properly.

I agree that this is not a high priority.

> 
> Many of Sjoerd's bugfixes to Salut have been to do with detecting and
> signalling loss of connectivity - not in the sense of "I lost my IP
> address", but in the sense of "I'm sending packets to Fred and he's not
> acknowledging any of them, so either he's not receiving them or I'm not
> receiving the acknowledgements". No amount of programming will fix
> packets just not turning up, so the only improvements we can bring here
> are by tweaking the trade-off of bandwidth use vs timely error recovery.

You misapprehend my use case.  Consider the following scenario:
A user is connected to an access point, or even a multi-AP wireless network.
The user is participating in a shared activity over Gabble.  The user begins
walking, while continuing to participate in the activity.  At some point,
NetworkManager notices that the link quality is becoming dangerously low, but
there is another open network with much higher signal strength.  NetworkManager
could inform Telepathy of this, and Telepathy could tell the other participants:
"Hey, I'm going to be offline for a bit, but no more than 20 seconds, and once
I'm back I might have a different IP address."  All messages bound for this user
would then be queued for 20 seconds.  Once NetworkManager reconnects, Telepathy
can broadcast "I'm back, and my new IP address is: ....".  The other users can
then send all the messages in their queues.
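
A minimal sketch of the sender-side queueing in this scenario, assuming the
"going offline" and "I'm back" announcements are delivered by some lower
layer. PeerOutbox and its method names are hypothetical and are not part of
Telepathy, Gabble or Salut:

```python
import time
from collections import deque


class PeerOutbox:
    """Holds messages for one peer while that peer is in an announced gap.

    Hypothetical helper for illustration only.
    """

    def __init__(self, send, address, clock=time.monotonic):
        self._send = send        # callable(address, message), supplied by caller
        self._address = address  # peer's last announced address
        self._clock = clock      # injectable so the timeout can be tested
        self._deadline = None    # None means the peer is reachable right now
        self._queue = deque()

    def peer_away(self, max_gap_seconds):
        """Peer announced a gap of at most max_gap_seconds (e.g. 20)."""
        self._deadline = self._clock() + max_gap_seconds

    def peer_available(self, address):
        """Peer announced it is back, possibly at a new address: flush."""
        self._address = address
        self._deadline = None
        while self._queue:
            self._send(address, self._queue.popleft())

    def post(self, message):
        """Send immediately, or queue while inside an announced gap.

        Returns False once the announced gap has expired; the caller
        should then treat the peer as disconnected.
        """
        if self._deadline is None:
            self._send(self._address, message)
            return True
        if self._clock() > self._deadline:
            return False
        self._queue.append(message)
        return True
```

If the peer never returns, post() starts returning False after the announced
gap, and the activity falls back to ordinary disconnect handling.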

Similar logic applies to IP address changes due to MPP switching, but in that
case there may not even be a gap.  The XO could easily be routable at both
addresses simultaneously.  It would then simply announce the change of IP
address, wait for ACKs, and relinquish the previous IP address.
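
The announce, wait-for-ACKs, relinquish sequence could be tracked roughly as
follows. This is only a sketch: AddressHandoff and the message layout are
made up for illustration, and how the announcement actually reaches the
peers is left open.

```python
class AddressHandoff:
    """Tracks which peers have acknowledged a change of address.

    Hypothetical helper for illustration only.
    """

    def __init__(self, old_address, new_address, peers):
        self.old_address = old_address
        self.new_address = new_address
        self._pending = set(peers)  # peers that have not ACKed yet

    def announcement(self):
        """Message to send to every peer while both addresses still work."""
        return {
            "type": "address-change",
            "old": self.old_address,
            "new": self.new_address,
        }

    def ack_received(self, peer):
        """Record an ACK; returns True once the old address can be dropped."""
        self._pending.discard(peer)
        return self.can_release_old()

    def can_release_old(self):
        """True when every peer has acknowledged the new address."""
        return not self._pending
```

A real version would also need a timeout, so that one silent peer cannot
keep the old address alive indefinitely.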

Even without foreknowledge of an impending disconnection and reconnection, it
would be possible to extend the current reliable transport system to span
multiple IPs.  Conceptually, one can do TCP where the target's IP address is
replaced by their identity, and the actual IP address changes occasionally.
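
To make the "TCP addressed by identity" idea concrete, here is a toy sketch.
It assumes per-peer sequence numbers and explicit ACKs; the class names are
hypothetical, and retransmission timers, windowing and the wire format are
all omitted. The sender keeps every unacknowledged message and replays it
when the peer's address changes; the receiver discards duplicates and
delivers in order.

```python
class IdentitySender:
    """Reliable sender keyed by peer identity instead of IP address."""

    def __init__(self, transmit):
        self._transmit = transmit  # callable(address, seq, payload)
        self._address = {}         # identity -> current address
        self._next_seq = {}        # identity -> next sequence number
        self._unacked = {}         # identity -> {seq: payload}

    def set_address(self, identity, address):
        """Record a (new) address and replay anything not yet ACKed."""
        self._address[identity] = address
        for seq, payload in sorted(self._unacked.get(identity, {}).items()):
            self._transmit(address, seq, payload)

    def send(self, identity, payload):
        """Number the message, remember it, and transmit if routable."""
        seq = self._next_seq.get(identity, 0)
        self._next_seq[identity] = seq + 1
        self._unacked.setdefault(identity, {})[seq] = payload
        address = self._address.get(identity)
        if address is not None:
            self._transmit(address, seq, payload)
        return seq

    def ack(self, identity, seq):
        """Peer confirmed receipt; the message need not be replayed."""
        self._unacked.get(identity, {}).pop(seq, None)


class InOrderReceiver:
    """Delivers each sequence number exactly once, in order."""

    def __init__(self):
        self._expected = 0
        self._held = {}  # out-of-order messages waiting for a gap to fill

    def receive(self, seq, payload):
        """Return the payloads that become deliverable, oldest first."""
        if seq >= self._expected:
            self._held[seq] = payload
        ready = []
        while self._expected in self._held:
            ready.append(self._held.pop(self._expected))
            self._expected += 1
        return ready
```

Because messages are keyed by identity and sequence number, a change of
address only changes where the replay is sent, not what has to be replayed.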

> 
> This suggests that however much we can improve the API, activities will still
> have to be able to deal with situations where other users "fall off the
> network", in a more or less graceful way - and if you can do that, then you
> can use the same mechanisms to resync after connectivity changes, in the way
> that Sjoerd suggests.

There is no doubt that Activities must accept unexpected connections and
disconnections as gracefully as possible.  I am attempting to design some simple
Python tools that will make this easier for Activity developers.
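
One shape such a tool could take is a thin wrapper around the "resync"
callback Simon describes above. This is a sketch only: ResyncNotifier and
its method names are invented for illustration and do not correspond to any
existing Sugar or Telepathy API.

```python
class ResyncNotifier:
    """Calls an activity-supplied resync callback at the right moments.

    Hypothetical helper for illustration only.
    """

    def __init__(self, resync):
        self._resync = resync  # activity-supplied callable(reason)
        self._connected = False

    def on_joined(self):
        """Initial join of the shared activity: always sync state."""
        self._connected = True
        self._resync("joined")

    def on_connection_lost(self):
        """Lower layer reports that peers are no longer reachable."""
        self._connected = False

    def on_connection_regained(self):
        """Resync once per outage; ignore spurious 'regained' signals."""
        if not self._connected:
            self._connected = True
            self._resync("reconnected")
```

The activity then needs only one code path for fetching shared state,
whether it has just joined or has just come back from an outage.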

> 
> We can probably never make it entirely transparent, because however
> quick it becomes to reconnect after connectivity loss, you can't
> guarantee that you haven't lost messages; and in a message-passing
> system, as soon as you can't make that guarantee, the only way back to a
> consistent state is to ask someone else what's going on, which is exactly
> the "resync" I'm talking about.

TCP handles dropped packets just fine.  It is definitely possible to ensure that
you haven't lost messages in a small gap, when using a reliable protocol.

> 
>     Simon

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHilNNUJT6e6HFtqQRAjkVAKCRrHgcX5bP5U7iAqbuVQCHScPyewCbBOjN
e2uPo1MLx0AupvFkVDnkXNo=
=yLRY
-----END PGP SIGNATURE-----