Salut and Suspend/Resume issues
gnu at toad.com
Wed Feb 20 05:42:31 EST 2008
OK, children of the world, please calm down. There are a few too many
bugs and egos flaring up to come to a reasonable resolution. This is
an interdisciplinary problem that crosses too many architectural
boundaries for any of us to be comfortable seeing the whole picture.
I filed a bug report about the network failing to wake us on multicast
four months ago (#4616). A key response by dwmw2 a month ago provides
a path forward: http://dev.laptop.org/ticket/4616#comment:20 .
Let me cut to the chase.
Many things are likely to work, if update.1 turns on "wake on
multicast" using the command "ethtool -s eth0 wol um", AND THE MESH IS
NOT IN USE:
* The laptop will suspend much of the time.
* If someone sends it a multicast that it is listening for, it will
wake up and respond to the traffic (possibly dropping one packet).
* Random multicast traffic that the laptop isn't listening for will
NOT wake it up.
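
For anyone testing this, here is a small sketch for confirming that the
setting actually took effect. It assumes the driver reports its wake-up
state through "ethtool eth0" in the usual "Wake-on:" line. The helper
name wol_flags is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Print the active wake-up flags from `ethtool IFACE` output on stdin.
# Only the "Wake-on:" line is matched, not "Supports Wake-on:".
wol_flags() {
    awk '/^[[:space:]]*Wake-on:/ { print $2; exit }'
}

# On a live laptop (as root), with eth0 as in the text above:
#   ethtool -s eth0 wol um
#   ethtool eth0 | wol_flags
# If the setting stuck, the printed flags should include "u" and "m".
```

If the helper prints "d" (disabled) or nothing at all, the driver did
not accept or does not report the setting, and the rest of the test is
not meaningful.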
I hope that the people responsible for Presence and Sharing can
test this, and make sure their protocols work with this "wol" setting.
I don't know that stuff at all. I'm not even sure what protocols
are running in my laptops. They have no school server.
There are three bugs in update.1-691 around this:
* The packet that awakens us doesn't get responded to; it was probably
dropped, rather than passed to the kernel. Assuming the protocols
retry within 60 seconds, we'll see and respond to the second one.
* When the laptop is manually suspended (physically closed), it
should not awaken for any reason except being reopened. Instead,
it awakens for each received multicast packet that it is
listening for, and then goes immediately back to sleep. This is
a power consumption bug. I'd say ship the release and live with it.
* Receiving these multicasts while closed also caused
the laptop to refuse to stay resumed when I reopened it. I had
to hit the power button to get it to stay on. If cjb can
reproduce this reliably, he can fix it. It happened twice
for me. Merely closing and opening didn't fail, but closing,
sending a wakeup ping, then opening, did fail.
All of the above works WHEN USING AN ACCESS POINT.
There are several bugs in the mesh that prevent this from working
over the mesh. I recommend moving existing school deployments to
access points, until we get the bugs out of the mesh.
There appears to be more than one bug in the mesh around multicast. No
wonder people are confused. Using the same setup as in #4616, but
*without* suspending, in update.1-691 I can't get multicast packets
through reliably. Setup:
* Two XO's, MP G1G1s. One is using build 656, the other update.1-691.
* In NetworkManager screen, put both on "Mesh Network 1". Wait a
few minutes for things to settle down. Go to donut screen, make
sure both of them say "Mesh Network 1, Connected to a Simple
* Start a terminal on each laptop. Become root.
* "ping6 -I msh0 ff02::1" on each laptop.
* This will ping the all-nodes multicast address. The laptop that
sends this should get back a unicast IPv6 ping response from each
node on the network. Keep moving the mouse on the update.1-691
laptop to avoid suspending.
* On each laptop, it can see itself (btw, ping6 prints its own
address on its first line of output). It prints a very low
latency response (e.g. 0.154 ms) packet from its own kernel. It
seldom or never sees a ping response from the other laptop.
* Bizarrely, every once in a while, the Build 656 laptop will see
ping responses from the update.1-691 laptop. For about 10 seconds.
Then they will go away again. They say "(DUP!)" because it's the
second response packet from a single outgoing ping packet. Perhaps
these happen after it suspends and I resume it with mouse motion.
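
To make the symptoms above easier to compare between runs, here is a
rough sketch that tallies ping6 replies per responding address and
counts how many of them were flagged "(DUP!)". It assumes the usual
iputils reply format ("64 bytes from ADDR: icmp_seq=..."). The name
summarize_replies is made up for illustration.

```shell
#!/bin/sh
# Read ping6 output on stdin; print one line per responding address
# with the number of replies seen and how many were marked (DUP!).
summarize_replies() {
    awk '/ bytes from / {
             src = $4
             sub(/:$/, "", src)          # strip the trailing colon
             total[src]++
             if (/\(DUP!\)/) dup[src]++
         }
         END {
             for (s in total)
                 printf "%s replies=%d dup=%d\n", s, total[s], dup[s] + 0
         }'
}

# Example use on the mesh interface, 20 pings to all-nodes:
#   ping6 -c 20 -I msh0 ff02::1 | summarize_replies
```

A healthy link should show one line for the local machine and one for
each other node. In the failing mesh case described above, only the
local address would be expected to appear most of the time.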
If I stop the pings, go back into NetworkManager, and associate both
XO's with a local access point (TrendNET TEW432-BRP), and replace
"msh0" with "eth0", the test works. The access point is doing NAT, so
the only nodes on the network are wireless. Oddly, for some reason,
each machine sees TWO packets come back from the other machine (sample
times: 5.51 ms and 6.37 ms). This is not a violation of the IP protocols --
datagrams are free to get replicated -- but it looks like a bug in
either the Libertas or our kernel. So I've found two bugs so far,
and it's only by running simple commands and knowing what to expect.
Now back to what I really wanted to test: whether the driver support
for wake-on-multicast works, and whether it only wakes up when the
multicast packets match the filter. See the month-old comment in #4616,
http://dev.laptop.org/ticket/4616#comment:20 . So, using the access
point setup as above, I run:
* "ethtool -s eth0 wol um" on the update.1-691 laptop.
* I sit and wait for it to suspend. Detected by power LED off and
* Now on the Build 656 machine, I run "ping6 -I eth0 ff02::2". Note
the final "2", not a "1". This pings the "all-routers" address
on the link local network. I'm expecting no answering packets,
because there are no IPv6 routers on the local wireless LAN.
Indeed, not only do I get no answers, but the update.1-691 laptop
remains blissfully suspended.
* I interrupt that and run "ping6 -I eth0 ff02::1", pinging the
"all-nodes" link local address. This immediately wakes the
update.1-691 laptop out of suspend, and I get pings back from
both laptops. The first packet is dropped by the suspended
machine, but I get three response packets, from the second ping onward:
one from the local machine, and two from the formerly suspended machine.
* OK, perhaps the Libertas is braindead enough to know the all-nodes
address hard-wired, but this wouldn't work for a configured
multicast address. So I did "ip maddr" to see which addresses
the kernel has instructed each interface to listen on. I waited
for the update.1-691 laptop to suspend. I pinged
"ff02::1:ff10:a958", which is the address that the suspended
laptop was listening on, PLUS ONE! The laptop did not wake up. I
then pinged the right address, "ff02::1:ff10:a957". It awakened
immediately, and responded to the second ping (not the first, as
above). Working! with a minor bug.
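
The address pinged above has the form of an IPv6 solicited-node
multicast address: ff02::1:ff followed by the low 24 bits of one of the
interface's unicast addresses. Here is a minimal sketch of that
derivation, for picking a test target without reading it off
"ip maddr". The function name and the sample unicast address are
hypothetical, and the function only handles addresses whose last two
groups are written out explicitly.

```shell
#!/bin/sh
# Print the solicited-node multicast address for an IPv6 unicast
# address: ff02::1:ffXX:XXXX, where XX:XXXX are its low 24 bits.
# Limitation: the last two groups must be explicit (no trailing "::").
solicited_node() {
    addr=${1%%"%"*}          # drop a "%iface" zone suffix, if any
    addr=${addr%%/*}         # drop a "/prefixlen" suffix, if any
    last=${addr##*:}         # final 16-bit group
    rest=${addr%:*}
    prev=${rest##*:}         # second-to-last 16-bit group
    [ -n "$last" ] && [ -n "$prev" ] || return 1
    printf 'ff02::1:ff%02x:%x\n' $(( 0x$prev & 0xff )) $(( 0x$last ))
}

# Hypothetical unicast address ending in the same 24 bits as above:
#   solicited_node fe80::217:63ff:fe10:a957    # prints ff02::1:ff10:a957
# Compare against what the kernel is really listening on:
#   ip maddr show dev eth0
```

Pinging the computed address should wake the laptop. Pinging the same
address with the last digit changed is the negative test.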
It sounds like there are several bugs in the mesh, mentioned above,
that will prevent this from working over the mesh. I'll file bug
reports for them.
> The partitioning we made between the network processor and the
> main processor was pretty clean. Unfortunately, it doesn't support
> low power operation. I suggest rethinking the partition for Gen2.
I'm happy to rethink this for Gen2. But I'd like to see more detailed
support for "the partitioning doesn't support low power operation".
Are you talking about the Libertas chip taking too much power by
itself, or about the host CPU having to wake too often? I think it's
still too early to tell whether the partitioning is correct. We've
barely shipped working code, we don't have a working software release
process, and have had no time to optimize for either power or clean