OOM manager project

Thu Jul 19 23:41:07 EDT 2007

Hello Jim,

We (PlanetLab) have found that the OOM killer does some relatively bad
things that leave a system in an unusable state.  We replaced it with
something that just panics and reboots rather than letting the system get
into an unrecoverable state, which we need because many of our PL servers
are in remote locations and unattended (kind of like your mini-servers
will be, but unlike your laptops).  We then introduced a user-level OOM
governor, which is probably far more rudimentary than what you are
after.  Our governor, called pl_mom because "she cleans up your mess",
assumes that separate applications/services are instantiated in separate
vservers (slices).  From what I gather, this is definitely the direction
that OLPC is going for the laptop and mini-server gateways, so our
approach might be applicable, at least as a source of ideas.

What does pl_mom do?  At the moment she kills the vserver with the
largest aggregate VSZ (i.e., summed over all processes within that
vserver).  This works for PlanetLab, but might not be the best approach
for OLPC.  We have found that most OOM scenarios are caused by a slow
leaker whose pages get swapped out by kswapd (this happens over a few
hours and is hard to detect with the current vm metrics we peek at).
Since pl_mom does the trick for our usage scenario on PlanetLab for now,
we have not had an incentive to improve it further.  However, one should
definitely look at better vm statistics to make a better choice than
largest aggregate VSZ.
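
To make the selection rule concrete, here is a minimal sketch of "largest
aggregate VSZ".  It is not the actual pl_mom code: the function name
pick_victim and the (slice_name, vsz_kb) input format are hypothetical,
chosen only for illustration, and how the per-process VSZ values are
gathered is left out.

```python
from collections import defaultdict


def pick_victim(processes):
    """Pick the slice with the largest aggregate VSZ.

    `processes` is an iterable of (slice_name, vsz_kb) pairs, one pair per
    process.  This input format is an assumption made for the sketch.
    Returns (slice_name, total_vsz_kb), or None if there are no processes.
    """
    totals = defaultdict(int)
    for slice_name, vsz_kb in processes:
        # Sum VSZ over every process that belongs to the same slice.
        totals[slice_name] += vsz_kb
    if not totals:
        return None
    victim = max(totals, key=totals.get)
    return victim, totals[victim]
```

Note that a slice made of many small processes can outrank a slice with
one large process, since only the per-slice sum matters.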

The code for pl_mom is available via anon cvs from:

cvs -d :pserver:anonymous@cvs.planet-lab.org:/cvs co pl_mom

Take a peek at swap_mom.py and its helper functions in pl_mom.py.

I'm cc'ing Faiyaz Ahmed, who is the person at Princeton who is currently 
maintaining pl_mom.

Best regards,
Marc

Jim Gettys wrote:
> OLPC needs an OOM governor, so that the "right" process gets shot when we
> run low on RAM, and that processes that might get shot know enough to
> save state for restart.  As you know, various problems appear if the
> wrong process is killed, usually resulting in needing a restart.
> 
> Note that the kernel has to be able to recover memory when it needs it,
> or it will deadlock: this is a situation where the kernel must be in
> control, but user space could cooperate much better than it does today,
> by providing appropriate hints.  So don't say: "the kernel shouldn't
> kill processes: user space should"; that design doesn't fly.
> 
> Here's Kimmo Hämäläinen's description of the (current) kernel OOM killer.
> 
> The OOM killer selects a process to kill by assigning a score to each
> process; the process with the highest score is the lucky winner that
> will be killed. The current OOM score for a process is visible in proc.
> The entry is in /proc/PID/oom_score. The starting point of the score is
> the amount of memory consumed by the process and its children. This
> value is adjusted as follows:
> • It is set to zero if the process has no memory management or if the
> process has a negative nice value (this can be used for protecting
> processes from the killer).
> • Divided by the square root of the CPU time consumed by the process.
> • Divided by the square root of the square root of the run time of the
> process.
> • Multiplied by 2 if it is a process with a positive nice value.
> • Divided by 4 if it is a superuser process.
> • Divided by 4 if it is a process with direct hardware access.
> • Finally, the value is adjusted (shifted either left or right) by the
> oom_adj value. It is shifted left in case the value is positive and
> right in case the value is negative.
> This means that a negative oom_adj value will decrease the score and
> also decrease the risk that the particular process will be killed. A
> positive value will have the opposite effect. The value should be no
> smaller than -16 and no larger than 15.
> 
> Please note that you can set the oom_adj value in the proc file system.
> It is located at /proc/PID/oom_adj. For more information about how the
> OOM killer behaves, see the Linux kernel source code, mm/oom_kill.c in
> particular.
> 
> So we need an OOM killer helper.  
> 
> We have the ability to provide the kernel with much of the 
> information it needs for much better behavior, if we choose.
> 
> I see this project evolving through the following incremental
> improvements (and incremental difficulty) as set out below:
> 
> 1) start by setting the oom_adj appropriately so that the processes we
> really care about don't get shot.
> 
> 2) make this a window manager plug in (plug in, as people including us
> may end up using other window managers) that uses the stacking order on
> the screen to rank order the activities that are running.
> 
> 3) provide a mechanism by which applications may be given a hint that
> they might find it good to save enough state for a checkpoint restart,
> because they are likely a good candidate for shooting.
> 
> 4) use the XRes facilities in X (and/or modify X) to provide the kernel
> with the pixmap usage on a process ID basis, for local
> applications/activities.
> 
> 5) see if there are better OOM algorithms than Linux presently has.
> 
> Discussion?  Anyone want to take on this project, or parts of this
> project?
>                                         - Jim
>
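
The scoring rules quoted above can be sketched as follows.  This follows
the prose description only and has not been checked against
mm/oom_kill.c.  The function name, the parameter names, the units (pages
and seconds) and the guard against dividing by a value below 1 are all
assumptions made for illustration.

```python
import math


def oom_score(mem_pages, cpu_seconds, run_seconds, nice=0,
              is_superuser=False, has_hw_access=False, oom_adj=0,
              has_mm=True):
    """Approximate the score described in the quoted text (hypothetical)."""
    # No memory management, or a negative nice value: score is zero.
    if not has_mm or nice < 0:
        return 0
    score = float(mem_pages)
    # Divide by sqrt(CPU time) and by the fourth root of the run time.
    # Clamping each divisor to at least 1 is an assumption of this sketch.
    score /= max(math.sqrt(cpu_seconds), 1.0)
    score /= max(math.sqrt(math.sqrt(run_seconds)), 1.0)
    if nice > 0:
        score *= 2
    if is_superuser:
        score /= 4
    if has_hw_access:
        score /= 4
    score = int(score)
    # Positive oom_adj shifts left (more likely to be killed),
    # negative oom_adj shifts right (less likely).
    if oom_adj > 0:
        score <<= oom_adj
    elif oom_adj < 0:
        score >>= -oom_adj
    return score
```

Because oom_adj is applied as a bit shift, each step doubles or halves
the score, so even a small adjustment can outweigh the other factors.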