Stability and Memory Pressure in 8.2

riccardo riccardo.lucchese at gmail.com
Tue Sep 9 09:29:55 EDT 2008


On Tue, 2008-09-09 at 00:10 -0400, Michael Stone wrote:
> Dear devel@,
> 
> Kim, Greg, and I have concluded that the instability we experience under
> memory-pressure in 8.2-759 and similar is the single "hard" issue that
> we wish to _attempt_ to address before releasing 8.2 on current
> timeframes. (We recognize that there are several other issues marked
> as blocking the release but we are confident that they will be resolved
> satisfactorily or are, in a few cases, beyond help.)
> 
> Since most other aspects of the release seem to be running smoothly, Kim
> asked me to take a more direct role in organizing our efforts to produce a
> release which avoids memory pressure when possible and which is
> better-behaved when it strikes.
> 
> To that end, I would like to ask for your assistance with the following
> questions and tasks:
> 
>   * We need to determine why we encounter low-memory and out-of-memory
>     situations more frequently than in previous releases. 
>     
>     - This means that we need to measure how our memory consumption
>       profile has changed since our previous releases. 
> 
>       (cscott observes that we were unable to attack the F-9 image size
>       issues until we were able to quantify the effect of changes we had
>       made or were considering making. Consequently, he suggests that we
>       will be unable to attack our current space consumption problems
>       until we are able to generate good numbers (and displays).)
>     
>     - We need to think carefully about (or measure) whether our
>       memory-consumption patterns have changed. I am particularly
>       skeptical of our widespread use of tmpfsen since the pages consumed
>       by files stored on tmpfsen are permanently dirty (and are perhaps
>       accounted for differently than pages mapped into process' address
>       spaces?) 
> 
>     - We need to check the configuration of applications like Browse
>       which have configurable caching behavior. (Search for "cache" or
>       "capacity" in about:config; check for important compile-time
>       configuration flags.)
> 
>     - We need to test in a variety of different network configurations
>       in order to determine to what extent the network/presence
>       environment affects memory consumption.
>    
>   * We need to check carefully for memory-leaks. Three mechanisms which
>     occur to me include: 
> 
>       1) running the system for a period of time, then scanning for
>          anomalies either manually or in some automated fashion from
>          userland, kernel-land, or OFW (via SysRq or SMM).
>       
>       2) setting rlimits on various processes and noting what dies
> 
>       3) using debugging tools like the python garbage collection
>          module, guppy/heapy, gdb+macros, valgrind, efence, purify, etc.
>          to look for trouble.
> 
>   * We need to find out why the oom-killer is not killing things fast
>     enough. Based on our results, we might consider configuring
>     /proc/$pid/oom_adj to preferentially kill some processes (e.g., the
>     foreground [or background?] activities.)
> 
>   * We need to determine whether the oom-killer is killing the right
>     processes. (sysctl's vm.oom_dump_tasks can be set to 1 in order to
>     get more verbosity from the oom-killer when it fires).
> 
>   * We ought to ponder whether there are any additional "dirty hacks" we
>     can experiment with in order to reduce memory consumption; for
>     example, running the Shell and Journal (and DS?) in one process or
>     making use of the compressed-caching code published on this list some
>     months ago.
> 
>   * Random other stuff to think about:
>      
>     - rlimits, cgroups, and the memory resource controller
> 
>     - the warnings in the ramfs and tmpfs code about the deadlocks that
>       tmpfsen can generate under low- or no-memory conditions.
> 
>     - whether our kernel "overcommits" when allocation requests are made?
> 
>     - whether we can get Browse to behave intelligently when it receives
>       BadAlloc errors from X?
> 
>     - how to run bootchart on the XO
> 
>     - how to generate decent statistics and graphics (preferably in an
>       automated fashion) concerning memory usage as part of our test
>       suite
>     
>     - system-tap's kmalloc2.stp example
> 
> In conclusion, more to come once I have some actual data; _please_ feel
> free to assist in collecting it! (though be aware that I may 'volunteer'
> you if I need your help. (That means you, Tomeu, Riccardo, Deepak,
> ...)).
> 
> Regards,
> 
> Michael


Here are some (trivial) tools that I've written and used, among others,
to study these issues; you may be interested in them:

 * picker [1]
I found it handier to use than bootchart; it also shows per-process
mem usage.

 * imports timings and alloc statistics [2]
Patch to python that prints timings and mem usage diffs for every
imported module. Original timings patch is from Tomeu.

 * python-allocstatsmodule [3]
Inspired by [2] but can be used inside python scripts to collect
stats on heap usage. 
Note: if you use `allocstats' to measure per-module mem usage by wrapping
import statements, you will get rough, not very useful values because of
import cycles (at least for the most interesting modules ;P).

Example app at
http://dev.laptop.org/~rlucchese/utils/python_mods_import_stats.py
Note that [2] and [3] work best with a python built with
--without-pymalloc.
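
To make the import-wrapping idea (and the import-cycle caveat above)
concrete, here is a minimal sketch. It is not the allocstats API and not
the patch in [2]: the helper names are hypothetical, and it uses the
process's peak RSS from the standard `resource' module as a crude stand-in
for real heap statistics (ru_maxrss is in kilobytes on Linux).

```python
import importlib
import resource
import sys
import time


def peak_rss_kb():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_import(name):
    """Import `name`; return (seconds, peak-RSS growth, newly loaded modules)."""
    before_mods = set(sys.modules)
    rss_before = peak_rss_kb()
    t0 = time.perf_counter()
    importlib.import_module(name)
    elapsed = time.perf_counter() - t0
    # Everything pulled in as a side effect is charged to `name`.
    new_mods = sorted(set(sys.modules) - before_mods)
    return elapsed, peak_rss_kb() - rss_before, new_mods


if __name__ == "__main__":
    for mod in sys.argv[1:] or ["json", "decimal"]:
        secs, grew, pulled_in = measure_import(mod)
        print("%-12s %8.2f ms  +%d KB peak RSS  (%d modules loaded)"
              % (mod, secs * 1e3, grew, len(pulled_in)))
```

The list of newly loaded modules shows why the numbers are rough: whichever
module is imported first gets charged for everything it drags in, and a
module that was already loaded reports roughly zero.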

We measured quite large memory savings from the preload&fork trick (as
expected, btw). I guess enabling it for `all' python processes would give
a good (mem saving)/(work hours) ratio.
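
For those who haven't seen the trick, a minimal sketch follows. It only
illustrates the general idea, not the code we measured: the function names
and the module list are hypothetical. The parent imports the common
modules once and then forks; each child starts with those modules already
loaded and shares their pages copy-on-write with the parent.

```python
import os
import sys


def preload(module_names):
    """Import the shared modules once, in the parent."""
    for name in module_names:
        __import__(name)


def spawn(child_main, *args):
    """Fork a child that runs child_main(*args) with the preloaded modules."""
    pid = os.fork()
    if pid == 0:
        # Child: modules imported by the parent are already in sys.modules,
        # and their pages are shared copy-on-write with the parent.
        status = 1
        try:
            status = int(child_main(*args) or 0)
        finally:
            # Exit without running the parent's cleanup handlers.
            os._exit(status)
    return pid


def child(name):
    # Hypothetical workload: report whether the module is already loaded.
    return 0 if name in sys.modules else 1


if __name__ == "__main__":
    preload(["json", "decimal"])
    pids = [spawn(child, "decimal") for _ in range(3)]
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        print(pid, "exit status", os.WEXITSTATUS(status))
```

The saving only lasts as long as the children do not write to the shared
pages, so how much memory is really saved has to be measured per process.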


thanks,
riccardo


[1] git://dev.laptop.org/activities/picker
[2] http://dev.laptop.org/~rlucchese/patches/python_show_mem_stats_on_module_loading.patch

[3] git://dev.laptop.org/users/rlucchese/python-allocstatsmodule/.git



