ACPI on XO (or rather, how to use cpuidle on XO)

Mon Nov 3 19:57:19 EST 2008

Last time I took a deep look at this stuff, my design for using cpuidle
on the XO was for the kernel cpuidle framework to awaken a user daemon
at appropriate times.  The daemon would decide whether to suspend.

The interface I was designing would have a cpuidle kernel file that
ohm could write a number of nanoseconds into.  Ohm would then block
(or select, or poll) on a read from a cpuidle file that would only
return data when no process was scheduled to awaken in fewer than that
many nanoseconds.  It would return the actual delay until the next
kernel or process timer wakeup.

Thus:

  echo 2000000000 >/sys/cpuidle/nslatency
  cat </sys/cpuidle/nsidle
[system pauses until no process wants to awaken in less than 2 seconds]
  4987654000

The example would indicate that some process (or kernel object) wants
to be scheduled in almost 5 seconds -- but none before then.  (Of
course, if new I/O occurred, such as arriving network packets,
keypresses, etc, then some processes *would* be scheduled in the
intervening time.)
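
A minimal sketch of how a client such as Ohm might drive these two
files.  The paths are passed in as parameters because the nslatency and
nsidle files are only a proposal here; the function name and the
temporary stand-in files are assumptions for illustration, and a real
read of nsidle would block until the system went idle.

```python
import os
import tempfile


def wait_for_idle(latency_path, idle_path, min_idle_ns):
    """Write the idle threshold, then read back the reported idle time (ns)."""
    with open(latency_path, "w") as f:
        f.write(f"{min_idle_ns}\n")
    # On the proposed interface this read would block until no wakeup
    # is due within min_idle_ns nanoseconds.
    with open(idle_path) as f:
        return int(f.read().strip())


# Demonstration against ordinary files standing in for the sysfs entries.
with tempfile.TemporaryDirectory() as d:
    latency_file = os.path.join(d, "nslatency")
    idle_file = os.path.join(d, "nsidle")
    with open(idle_file, "w") as f:
        f.write("4987654000\n")
    idle_ns = wait_for_idle(latency_file, idle_file, 2_000_000_000)
    print(f"next wakeup in {idle_ns / 1e9:.3f} s")
```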

The Ohm daemon could use this info, and other info, to decide whether
to sleep at all, and how to set up EC or RTC or other timers to resume
from sleep.  It would be useful for other subsystems too, which only
want to run using otherwise-idle CPU cycles.

I've attached three messages discussing the above design and how to
evolve XO power management, only one of which went to devel before.

	John

Message-Id: <200712120625.lBC6P2k2012216 at new.toad.com>
To: cjb at laptop.org, gnu at toad.com
Subject: Measuring and using CPU idleness in Ohm
Date: Tue, 11 Dec 2007 22:25:02 -0800
From: John Gilmore <gnu at toad.com>

Before you started in on ohm, I had cherry-picked the Linus cpuidle
patches into the OLPC kernel with minimal trouble, and had it
compiling.  But the hard part was trying to hook it into the existing
OLPC suspend infrastructure; I never got even close to making that work.

Now that Ohm is the "management" that decides when to suspend, I
suggest it'd be easy to write a little cpuidle interface that would
let Ohm measure demand for the CPU.  This would let it make reasonable
decisions about suspending and awakening, in the presence of CPU load.
Cpuidle wouldn't need to suspend; it would just need to inform Ohm.

I suggest an interface like this:

  /sys/.../cpuidle/latency

Write a value to "latency" to tell cpuidle the shortest delays you care
to hear about, in nanoseconds.

  /sys/.../cpuidle/idle

Read this when you want to sleep until the system is idle for a minimum
of "latency" nanoseconds.  It will return the actual delay until the
next scheduled timer event.

So, merely looping reading "idle" will let you report or log the periods of
idleness that occur.  This will tell you what kind of performance tuning
needs to happen to other parts of the system to reduce wakeups.
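
The logging loop described above could be sketched as follows.  The
read_idle callable stands in for a blocking read of the proposed "idle"
file; the bucket boundaries and all names are assumptions chosen only
for illustration.

```python
from collections import Counter


def bucket(idle_ns):
    """Classify one idle period by rough order of magnitude."""
    if idle_ns < 10_000_000:
        return "under 10 ms"
    if idle_ns < 1_000_000_000:
        return "10 ms to 1 s"
    return "1 s or more"


def log_idle_periods(read_idle, count):
    """Call read_idle() count times and tally the periods it reports."""
    tally = Counter()
    for _ in range(count):
        tally[bucket(read_idle())] += 1
    return tally


# Canned values (nanoseconds) standing in for successive reads of "idle".
samples = iter([3_000_000, 4_987_654_000, 250_000_000, 8_000_000])
for label, n in log_idle_periods(lambda: next(samples), 4).items():
    print(f"{label}: {n}")
```

A tally dominated by the shortest bucket would point at frequent
wakeups somewhere in the system that are worth tracking down.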

Note that if any process has a read pending on "idle", the CPU will
not actually go idle; it will schedule that process.  When that process
goes idle (assuming nothing else happens in the meantime), then the system
will truly be idle.

This interface doesn't generalize well to multiple processes using it.
For example, say you want to have a process that scrubs memory for
ECC, (or scrubs the flash for ECC), but only takes idle cycles.  You'd
like to be able to have a few such things that wouldn't interfere with
each other.

Each would need a separate latency setting.  They'd all be sleeping,
awaiting idleness; the kernel would run one of them.  If-and-when it
goes idle again, re-evaluate which one to wake next.  You obviously
don't want it to run the same one again and again.  Hmm...  If it goes
idle by re-entering a read() of "idle", then you go on to a different
one (you won't reawaken the same process that just said "I want to
wait til the system is idle").  If it goes idle any other way (e.g. by
writing to /sys/power/state, or by sleep(), a network read, or
exit()), it's no longer in the chain of processes hung awaiting
"idle", so naturally cpuidle would trigger a different one.  

Perhaps the API should be that each process merely opens a single file
descriptor ("idle"), writes a latency to it, and reads from it to
await idleness and view the available latency.
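
One way to picture the chain of waiting processes is a simple FIFO.
This is a user-space simulation, not kernel code; the class and method
names are hypothetical.  A process that re-enters read() rejoins at the
tail, so the same one is not woken twice in a row, and a process that
goes idle some other way simply never rejoins.

```python
from collections import deque


class IdleWaiters:
    """Simulated queue of processes blocked in read() on "idle"."""

    def __init__(self):
        self._queue = deque()

    def wait(self, name, latency_ns):
        """A process blocks awaiting idleness: it joins the tail."""
        self._queue.append((name, latency_ns))

    def wake_one(self, next_timer_ns):
        """Wake the first waiter whose latency fits before the next timer."""
        for i, (name, latency_ns) in enumerate(self._queue):
            if latency_ns <= next_timer_ns:
                del self._queue[i]
                return name
        return None


waiters = IdleWaiters()
waiters.wait("scrub_memory", 1_000_000)
waiters.wait("scrub_flash", 1_000_000)

first = waiters.wake_one(next_timer_ns=5_000_000)
waiters.wait(first, 1_000_000)      # it re-enters read() when done
second = waiters.wake_one(next_timer_ns=5_000_000)
print(first, second)
```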

It's not clear to me what decision the kernel should make if, say, it
has three processes waiting, with latency settings of 100, 1000, and
5000 ms.  If the next timer event is 6000 ms away, do you trigger the
lowest, the highest, or one at random?  For ohm's purposes it won't
matter (it'll be just one process), but the kernel developers can haggle
over this as it goes upstream.
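
The three candidate policies can be compared with a small sketch.  The
function and policy names are hypothetical; nothing here says which
policy the kernel should adopt.

```python
import random


def pick_waiter(latencies_ms, next_timer_ms, policy, rng=random):
    """Choose which waiting latency to trigger, or None if none fits."""
    eligible = [ms for ms in latencies_ms if ms <= next_timer_ms]
    if not eligible:
        return None
    if policy == "lowest":
        return min(eligible)
    if policy == "highest":
        return max(eligible)
    if policy == "random":
        return rng.choice(eligible)
    raise ValueError(f"unknown policy: {policy}")


# The case from the text: three waiters, next timer event 6000 ms away.
for policy in ("lowest", "highest", "random"):
    print(policy, pick_waiter([100, 1000, 5000], 6000, policy))
```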

In order to make ohm's automatic suspends work with the timer queue, I
think there should also be an additional value that can be written to
/sys/power/state.  Besides "mem", there should be "timed".  This will
suspend the system only until the next wakeup in the timer queue.
(The kernel would have to ask the EC to wake it, since the RTC is
almost useless, and the normal timers are powered down; but I think
Richard has already implemented that.)  Having the next timer event be
queried directly in the suspend code would eliminate race conditions
and other forms of wasted time.  (Of course, if Ohm itself wants to be
awakened at any particular interval or absolute time, it just needs a
thread that sleeps for that period of time; then Ohm will be in the
timer queue along with everything else.)
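
A rough sketch of the decision a "timed" suspend would have to make,
in user-space Python rather than kernel code.  The function name, the
millisecond units, and the idea of subtracting a fixed
suspend-plus-resume overhead are all assumptions for illustration.

```python
def plan_timed_suspend(now_ms, timer_deadlines_ms, overhead_ms):
    """Return the time at which the EC should wake us, or None.

    None means either that no timer is queued (nothing for "timed" to
    wait for in this sketch) or that the next timer is too close for a
    suspend to pay off once the overhead is subtracted.
    """
    pending = [t for t in timer_deadlines_ms if t > now_ms]
    if not pending:
        return None
    wake_at = min(pending) - overhead_ms
    return wake_at if wake_at > now_ms else None


# Next timer in 5 s, with an assumed 900 ms of suspend/resume overhead.
print(plan_timed_suspend(0, [5000, 12000], 900))
# Next timer in 0.5 s: not worth suspending under the same assumption.
print(plan_timed_suspend(0, [500, 12000], 900))
```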

These low-impact kernel changes should allow Ohm to decide when to
suspend the CPU when it is truly idle, and to do so safely.

(We'd still eventually want a whole pile of improvements to the
900+-ms kernel suspend code, but that's already known and in trac.)

	John

Message-Id: <200803110953.m2B9rw6S031830 at new.toad.com>
To: cjb at laptop.org, dilinger at laptop.org, gnu at toad.com
Subject: Fixing suspend
Date: Tue, 11 Mar 2008 01:53:58 -0800
From: John Gilmore <gnu at toad.com>

Rather than making a kludge for the audio device, shouldn't we be
blocking suspend whenever DMA is in progress by any device driver
(with specified exceptions)?

The obvious initial exceptions are WiFi and the graphics refresh,
which we have custom code to handle.  USB does DMA, but if the only
device on it is the WiFi or is not doing DMA, then we can suspend.

(I thought that was the point of all the device-driver pre-suspend calls --
so the drivers could tell us "nope, we can't suspend now.")
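
As a sketch of that rule: each device reports whether it has DMA in
flight, and suspend is vetoed unless every busy device is on the
exception list.  The function, the device names, and the dictionary
format are hypothetical; real drivers would report this through the
kernel's pre-suspend calls, not through Python.

```python
def may_suspend(dma_active, exempt=("wifi", "graphics")):
    """Return (ok, blockers) given a mapping of device name -> DMA busy."""
    blockers = sorted(
        name for name, busy in dma_active.items()
        if busy and name not in exempt
    )
    return (not blockers, blockers)


# WiFi and graphics refresh are exempt; active audio DMA blocks suspend.
print(may_suspend({"wifi": True, "graphics": True, "audio": False}))
print(may_suspend({"wifi": True, "graphics": True, "audio": True}))
```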

Blocking during DMA should automatically fix the USB ethernet problem
(#1, that suspend powers down the ethernet, taking us off the net;
#2, that without custom code we don't come back nicely on resume.)

I also think that ohm should be asking the kernel when the next
runnable process is due.  The cpuidle infrastructure looked like it
would do a good job of that.  We should wake up to meet that deadline,
using the EC.

This would solve the Distance problem, and it would also get us
started at cleaning up all the short timeouts and polling in the rest
of the daemons and libraries and activities.

We'll also need to keep the system clock from drifting during suspend.
I think this can be done by borrowing one of the MFGPT's that runs
during suspend, to let us know exactly how long we were suspended.  It
can't wake us up at the right time, but it can at least count off our
sleepy time.
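
The arithmetic for counting off the suspended time could look like
this.  The 16-bit counter width, the tick rate, and the assumption that
the counter wraps at most once during the suspend are all assumptions
of this sketch, not facts taken from the hardware documentation.

```python
def suspended_seconds(ticks_before, ticks_after, tick_hz, counter_bits=16):
    """Elapsed seconds between two readings of a free-running counter.

    Handles a single wraparound; a longer suspend would alias.
    """
    modulus = 1 << counter_bits
    delta_ticks = (ticks_after - ticks_before) % modulus
    return delta_ticks / tick_hz


def corrected_clock(clock_at_suspend_s, ticks_before, ticks_after, tick_hz):
    """System time to set on resume so the clock does not drift."""
    return clock_at_suspend_s + suspended_seconds(
        ticks_before, ticks_after, tick_hz
    )


# Counter read 65000 before suspend and 464 after (one wrap), at 1 Hz.
print(suspended_seconds(65000, 464, tick_hz=1))
```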

Putting in the real infrastructure will not only keep us from having
to rewrite it later, it'll also give us correctness now instead of
an endless parade of exceptions to exceptions based on particular
applications.

	John

Message-Id: <200808112102.m7BL2Sle027150 at new.toad.com>
To: dsaxena at laptop.org
cc: John Gilmore <gnu at toad.com>, "C. Scott Ananian" <cscott at laptop.org>,
   "OLPC Developer's List" <devel at lists.laptop.org>,
   Chris Ball <cjb at laptop.org>
Subject: Re: inhibiting suspend via dbus 
In-reply-to: <20080810193113.GA16296 at plexity.net> 
References: <c6d9bea0808091100g7bd2b049lfd9c15c58e1b1fd4 at mail.gmail.com> <200808092234.m79MY1le001151 at new.toad.com> <20080810193113.GA16296 at plexity.net>
Comments: In-reply-to Deepak Saxena <dsaxena at laptop.org>
   message dated "Sun, 10 Aug 2008 12:31:13 -0700."
Date: Mon, 11 Aug 2008 14:02:28 -0700
From: John Gilmore <gnu at toad.com>

>     The problem with the pure heuristic approach is that there may
> be times when we don't have enough knowledge about the system state
> and more importantly the user's behaviour to really make a decision 
> without information from the application. For example, if I am streaming
> music on my computer, and I walk away for some time, it may be perfectly
> OK to do a full suspend, or it may be the case that I don't want to
> go to sleep but it's OK for me to shut down the screen.  Simply looking at
> what is the current process doing (using network and audio resources) is 
> not enough to make a smart judgement on what to do in this situation. 
> There must be a way for the user to tell the application what to do
> and a way for the application to pass that information into the system 
> power manager.

There are already a couple of knobs for users to tweak their
power-control desires with (in the Sugar control panel).  We could add
a few more (though this makes more combinations to test, and to
explain, and that will break some things for some users).  User
controls are orthogonal to whether "Activities" request a suspend or
no suspending, which was the suggestion I was dissing.

> > We should understand why ohm isn't noticing that the activity updater
> > isn't idle.  Should Ohm be looking for a higher cpu idle% in the
> > seconds before it suspends?  Should it be looking for minimal numbers
> > of context switches per second before it suspends? ...
> 
> What happens when the updater is modified?  Do we have to reanalyze
> the behaviour pattern every time we change the updater and then rewrite
> the heuristic code in OHM? What if I'm doing something else at the
> same time as running the updater which completely modifies the behaviour?
> And what about when we change to a new kernel and the scheduler behaviour
> changes (see #7603)?

For an idea of where Linux power management is going, see Richard
Woodruff's presentation from the Ottawa Linux Power Management summit:

  http://www.celinux.org/elc08_presentations/TI_OMAP3430_Linux_PM_reference.ppt
  http://article.gmane.org/gmane.linux.acpi.devel/33176

What OLPC is doing was cutting-edge -- before Richard got started.
And he's not putting kludges into applications -- he's hacking how the
kernel allocates resources (which is its job and its only job).

His code is in mainline kernels, runs on the Texas Instruments OMAP3
system-on-chip, and aims for cellphone-like battery life with
Linux-laptop-like features and performance.  (I have no inside
knowledge, but would be surprised if the OMAP3 wasn't a possible
candidate processor for the XO-2.)  My favorite slide is #3, the graph
of volts and amps -- which sits at zero volts, zero amps most of the
time, and rises to 0.9 volts at 50 mA occasionally, spending 94% of
its time at zero.  This is running Linux on a 550 MHz ARM.

In Richard's model, drivers clock down and/or power down their devices
whenever they aren't in use, and can rapidly power them up and restore
their registers as needed, invisibly to user software.  This includes
the "CPU driver", i.e. the scheduler, dispatcher, and cpuidle
implementation.  The CPU goes through 7 increasingly power-miserly
states, resuming from the deepest (CPU power-off) in about 30ms.

The reason the XO-1 has had so much trouble with suspend is that
it's a first-generation effort, using a processor that was designed
for Winblows' multi-second manual suspends.  Our board design also
took a lot of evolution and still has a few misfeatures.  Things will
be very different in the hardware world shortly.  OLPC gets
significant credit for pushing the chip industry to notice how badly
its power management sucked.  Let's not mess up our code base (or our
mindsets) while they're going through the design cycles to get it
right.

(The XO-1 hardware can do much better than it's doing.  I see in
Joyride-2263 that USB drivers are in modules now, so Ohm can unload
'em when it wants to suspend and resume in only HALF a second instead
of a whole second.  But when I enable "Extreme Power Management", ohm
is not actually powering off the USB bus and unloading the USB modules
yet!  My USB keyboard works if plugged in; and the modules are still
there.  Chris, does this need a new TRAC besides #7113 and #7434?)

	John