Salut/avahi/meshview issues

Thu Jan 31 07:32:36 EST 2008

> > 2. It takes up to 10min for avahi even to detect the inactivity of a
> > peer, i.e. if an XO switches channels, for up to 10min avahi won't even
> > know (it used to be 1-2min).
>
> Is this with or without the patch from bug #6162? If without, then the
> time it takes avahi to discover it should still be 2 minutes. I'd like to
> know how you test this. Oh and please file a bug, so we can actually track
> these issues.
>

The patch for 6162, as well as the patch for 5501, is included in the 689/690
that I am testing. So this indeed explains the 10 minutes (actually, I just
found out about this bug).

> > 3. It will take a total of about 30min for the XO to vanish from the
> > mesh view (this is too long!)
>
> Again, file a bug. Needed info here is if there is a time difference
> between when avahi marks something as removed, when salut sends out the
> removed signal and when it actually disappears from the mesh view.
>

This is now filed as 6282, with all dbus-monitor/avahi-browse logs to
compare.
This case is also an example of an avahi/mesh view inconsistency:
icons disappear from the mesh view but remain for about 1h longer in the
avahi cache.
But these details should continue in trac anyway.
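
For anyone who wants to repeat the comparison, here is a rough sketch of how the lag between the two views could be measured. The "HH:MM:SS peer event" line format and the names parse_events and removal_lag are assumptions made for illustration; they are not the real avahi-browse or dbus-monitor output, which would need to be reduced to this form first.

```python
from datetime import datetime

def parse_events(lines):
    """Parse hypothetical 'HH:MM:SS peer event' lines into {peer: removal time}."""
    removed = {}
    for line in lines:
        stamp, peer, event = line.split()
        if event == "removed":
            removed[peer] = datetime.strptime(stamp, "%H:%M:%S")
    return removed

def removal_lag(meshview_lines, avahi_lines):
    """Seconds each peer stayed in the avahi cache after leaving the mesh view.

    A value of None means the peer never left the avahi cache in this log.
    """
    mesh = parse_events(meshview_lines)
    avahi = parse_events(avahi_lines)
    lag = {}
    for peer, left_mesh in mesh.items():
        if peer in avahi:
            lag[peer] = (avahi[peer] - left_mesh).total_seconds()
        else:
            lag[peer] = None
    return lag
```

A peer with a large positive lag, or None, would be an instance of the "icon gone, cache entry still there" case described above.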

>
> > 4. Avahi/mesh view respond independently.
> > The situation used to be that when an entry disappeared in avahi, it
> > disappeared in mesh view, and the same when new peers arrived.
> > This relation was very consistent.
> > However, now we have the following cases:
> > a) an XO will vanish from the mesh view, but remain "indefinitely" in
> > the avahi cache as "failed to resolve"
> > b) sometimes avahi shows a lot fewer peers than the mesh view. The extra
> > peers in the mesh view are definitely active since they properly respond
> > to activity joining/sharing.
> > c) sometimes avahi includes more active peers than the mesh view.
> > Does anyone know why this is happening?
> > Is it a bug?
> > I have logs, if needed, that compare avahi-browse with timestamped
> > dbus-monitor logs, that indicate the inconsistencies.
>
> Well you all list them as undesired behaviour, so i would say they're
> bugs.
>

> > 5. An important improvement is that peers will not generally fail a lot
> > on their own.
> > So, if many XOs join a mesh channel, and no one goes away, they will not
> > start failing. This used to be a common effect after 4-5 XOs. However, I
> > noticed once in 1cc, 61 active XOs in the mesh view!
>
> When you say salut, you actually mean avahi. It would help if you could be
> clear on what you mean :) This improvement is probably caused by the fix
> in #5501.
>

I mean avahi indeed. In the past these two were very tightly coupled.
And I believe that the only direct way to examine salut is by checking the
buddy list in the Analyze activity.
I remember Ricardo had an interesting case where the buddy list included
plenty of XOs, which were also properly sharing in the mesh view, but the
avahi list was empty. Does this seem possible? (Unfortunately no log at the
moment.)

>
> Anyway for all the bugs you should have filed instead of sending this
> mail, I will need tcpdump logs, avahi logs, salut logs and if possible
> meshview logs indicating when contacts are removed from the mesh from a
> machine where you see the behaviour. Preferably with timestamps.

I updated the trac with logs/tcpdumps/dbusmon/screenshots...enjoy!

The reason I sent this email first, before filing tons of bugs, is that I
thought it was necessary to describe the big picture and the current status
of salut, and also to avoid duplicate bugs, or bugs that are in fact
intentional mods.

This conversation was unfortunately directed towards other issues (wireless
difficulties are a sensitive subject at olpc!), but in fact its purpose was
to determine some very specific bugs in salut that have nothing to do, at
this point, with scalability or robustness of the protocol. When these are
resolved, we can proceed with scalability, about which I am very confident.

I believe our current salut/avahi issues are described in the following
points:

1. I was under the impression that when a peer switches channels it sends a
"goodbye signal", and that only peers removed abnormally (after crashes,
poweroffs by pressing the button, etc.) would be slow to disappear from
mesh views. The 10min TTL is not unreasonable, but it should only be used
for a routine check. Peers that leave/arrive should inform the mesh
instantly. In that case the 10min TTL will only affect the mesh points
with noisy links, whose "goodbye" signals get lost. And these
connections are lower priority anyway. Also we could send 2 or 3 "goodbye"
signals to "ensure" delivery.

2. We should definitely decrease the timeout window between a lost peer
being detected and its actual disappearance from the mesh view. This used
to be 10min, now it is 20min, but really, in my experience, if a peer is
away for more than 1-2min it is not coming back.

3. Should we make the above TTL and timeout user specific, or at least
configurable? Will there be a problem if two XOs have different TTLs? I would
assume that there won't. The idea is that it is a waste of our resources to
try to find the ideal values of TTL and timeout by asking the collabora
team to fix, and fix again, whereas we can run the tests here in 1cc and
find for ourselves which values suit us best. Is it easy to implement such a
patch?

4. The 5501 bug (xmas tree effect). This is a very specific bug in the
protocol, and I believe it will be sorted out soon.

5. Why are avahi/salut/mesh view not communicating well? I hope we will have
some answers on that as well.
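
To make points 1-3 a bit more concrete, here is a rough sketch of the timing I have in mind. It is only a toy model, not how salut or avahi are actually implemented: PresenceConfig, worst_case_removal_s and p_all_goodbyes_lost are hypothetical names, it assumes a received goodbye removes the peer immediately, and it assumes goodbye packets are lost independently of each other.

```python
from dataclasses import dataclass

@dataclass
class PresenceConfig:
    ttl_s: int = 600       # 10min TTL, the routine check from point 1
    timeout_s: int = 1200  # 20min window from point 2, after detection
    goodbyes: int = 1      # goodbye packets a leaving peer sends (assumed knob)

def worst_case_removal_s(cfg, goodbye_received):
    """Seconds until a departed peer vanishes from the mesh view.

    Assumption: a received goodbye removes the peer at once; otherwise the
    peer is only detected after the TTL and then waits out the timeout.
    """
    if goodbye_received:
        return 0
    return cfg.ttl_s + cfg.timeout_s

def p_all_goodbyes_lost(cfg, loss_rate):
    """Chance that every goodbye is lost, assuming independent losses."""
    return loss_rate ** cfg.goodbyes
```

With the default values a peer whose goodbye is lost takes 600 + 1200 = 1800 seconds to vanish, which matches the roughly 30min reported above. Making ttl_s and timeout_s per-XO settings is what point 3 asks about, and raising goodbyes to 2 or 3 shrinks the chance of falling back to the TTL at all.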

yianni