coping with lost connectivity (was: mesh portal discovery)

Simon McVittie simon.mcvittie at collabora.co.uk
Sun Jan 13 10:58:33 EST 2008


On Sat, 12 Jan 2008 at 12:00:02 -0500, Benjamin M. Schwartz wrote:
> This is precisely what I am saying.  Telepathy should only register a disconnect
> if there is no way to route between two XOs.  The mesh system should be designed
> so that moving about within the mesh, or handing off between Salut and Gabble,
> or switching from one internet-connected wireless network to another, does not
> cause a Telepathy disconnect.

For the record: Telepathy, which is a standard API used on OLPC and elsewhere,
does not and will not do this. A Telepathy Connection object represents a
connection (e.g. a Gabble connection represents a TCP connection to the
server) and we're not going to mangle the API to behave otherwise.

You're right that a possible solution for activities' networking would
be for some lower layer to "paper over the cracks" - Telepathy is the
wrong layer to be doing this, but the Presence Service could do it, or
so could a library in the sugar.* hierarchy.

However, each activity is fundamentally going to need a way to sync its state
on initial connection; for at least the medium term, as Sjoerd said, we propose
that the same mechanism be used to resync after connectivity loss.

If we have enough developer time to be able to work on library code for
activities' networking, it's likely that the API will involve an
activity-supplied "resync" callback that's called when initially joining
an activity, and when connectivity is regained after a connection loss.
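
To make that concrete, here is a rough sketch of what such a callback
API might look like. No such library exists yet: SharedSession,
connectivity_changed and the "initial-join"/"reconnect" reason strings
are all made-up names for illustration, not part of Sugar, the Presence
Service or Telepathy.

```python
from typing import Callable, List


class SharedSession:
    """Hypothetical helper; not an existing Sugar or Telepathy class."""

    def __init__(self, resync: Callable[[str], None]) -> None:
        # Activity-supplied callback, invoked whenever state must be re-fetched.
        self._resync = resync
        self._connected = False
        self._ever_connected = False

    def connectivity_changed(self, connected: bool) -> None:
        """Assumed to be called by a lower layer on each connectivity change."""
        if connected and not self._connected:
            # The same callback serves both the first join and every reconnect.
            reason = "reconnect" if self._ever_connected else "initial-join"
            self._ever_connected = True
            self._connected = True
            self._resync(reason)
        elif not connected:
            self._connected = False


# Usage: record why the activity was asked to resync.
calls: List[str] = []
session = SharedSession(calls.append)
session.connectivity_changed(True)   # initial join
session.connectivity_changed(False)  # connectivity lost
session.connectivity_changed(True)   # connectivity regained
```

The point of the design is that the activity only has to write one code
path, which is the one it needs for joining anyway.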

> In each of these cases, the path between XOs
> remains routable, with a gap of at most a few seconds.

I'm not convinced this is even technically feasible, given the
constraints of the quality of the underlying network (if the packets
aren't arriving, there's nothing we can do about it). It's certainly a
much lower priority right now than making the servers scale properly.

Many of Sjoerd's bugfixes to Salut have been to do with detecting and
signalling loss of connectivity - not in the sense of "I lost my IP
address", but in the sense of "I'm sending packets to Fred and he's not
acknowledging any of them, so either he's not receiving them or I'm not
receiving the acknowledgements". No amount of programming will fix
packets just not turning up, so the only improvements we can bring here
are by tweaking the trade-off of bandwidth use vs timely error recovery.
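
As an illustration of that trade-off only (this is not how Salut is
implemented, and PeerLivenessMonitor with its parameters is an invented
name), a detector that declares a peer lost after a number of missed
probe intervals might look roughly like this. Probing more often costs
bandwidth; probing less often delays the error.

```python
from typing import Dict, List


class PeerLivenessMonitor:
    """Hypothetical sketch; not Salut's actual mechanism."""

    def __init__(self, probe_interval: float, max_missed: int) -> None:
        self.probe_interval = probe_interval  # seconds between probes
        self.max_missed = max_missed          # missed probes tolerated
        self._last_ack: Dict[str, float] = {}

    def ack_received(self, peer: str, now: float) -> None:
        """Record the time at which a peer last acknowledged anything."""
        self._last_ack[peer] = now

    def worst_case_detection_delay(self) -> float:
        """How long a silent peer can go unnoticed with these settings."""
        return self.probe_interval * self.max_missed

    def lost_peers(self, now: float) -> List[str]:
        """Peers that have been silent for longer than the allowed window."""
        deadline = self.worst_case_detection_delay()
        return sorted(
            peer for peer, seen in self._last_ack.items()
            if now - seen > deadline
        )
```

Halving probe_interval halves the detection delay but doubles the probe
traffic; neither setting makes missing packets arrive.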

This suggests that however much we can improve the API, activities will still
have to be able to deal with situations where other users "fall off the
network", in a more or less graceful way - and if you can do that, then you
can use the same mechanisms to resync after connectivity changes, in the way
that Sjoerd suggests.

We can probably never make it entirely transparent, because however
quick it becomes to reconnect after connectivity loss, you can't
guarantee that you haven't lost messages; and in a message-passing
system, as soon as you can't make that guarantee, the only way back to a
consistent state is to ask someone else what's going on, which is exactly
the "resync" I'm talking about.
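
A small sketch of why, assuming (purely for illustration) that each
sender numbers its messages; GapDetector and request_resync are invented
names, not an existing API. Note that a gap is only noticed once a later
message does arrive, which is one more reason to resync unconditionally
after a reconnect.

```python
from typing import Callable, Dict


class GapDetector:
    """Hypothetical sketch: notice missing messages via per-sender numbering."""

    def __init__(self, request_resync: Callable[[str], None]) -> None:
        self._request_resync = request_resync
        self._next_seq: Dict[str, int] = {}

    def on_message(self, sender: str, seq: int) -> bool:
        """Return True if local state can still be trusted after this message."""
        expected = self._next_seq.get(sender)
        if expected is not None and seq < expected:
            # Duplicate or stale message: nothing was lost.
            return True
        self._next_seq[sender] = seq + 1
        if expected is not None and seq > expected:
            # At least one message never arrived; the only way back to a
            # consistent state is to ask someone else what is going on.
            self._request_resync(sender)
            return False
        return True
```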

    Simon


