Stability and Memory Pressure in 8.2

Michael Stone michael at laptop.org
Tue Sep 9 00:10:53 EDT 2008


Dear devel@,

Kim, Greg, and I have concluded that the instability we experience under
memory pressure in 8.2-759 and similar builds is the single "hard" issue
that we wish to _attempt_ to address before releasing 8.2 on the current
schedule. (We recognize that there are several other issues marked as
blocking the release, but we are confident that they will be resolved
satisfactorily or are, in a few cases, beyond help.)

Since most other aspects of the release seem to be running smoothly, Kim
asked me to take a more direct role in organizing our efforts to produce
a release which avoids memory pressure when possible and which is
better-behaved when it strikes.

To that end, I would like to ask for your assistance with the following
questions and tasks:

  * We need to determine why we encounter low-memory and out-of-memory
    situations more frequently than in previous releases. 
    
    - This means that we need to measure how our memory consumption
      profile has changed since our previous releases. 

      (cscott observes that we were unable to attack the F-9 image size
      issues until we were able to quantify the effect of changes we had
      made or were considering making. Consequently, he suggests that we
      will be unable to attack our current space consumption problems
      until we are able to generate good numbers (and displays).)
    
    - We need to think carefully about (or measure) whether our
      memory-consumption patterns have changed. I am particularly
      skeptical of our widespread use of tmpfsen since the pages consumed
      by files stored on tmpfsen are permanently dirty (and are perhaps
      accounted for differently than pages mapped into processes'
      address spaces?).

    - We need to check the configuration of applications like Browse
      which have configurable caching behavior. (Search for "cache" or
      "capacity" in about:config; check for important compile-time
      configuration flags.)

    - We need to test in a variety of different network configurations
      in order to determine to what extent the network/presence
      environment affects memory consumption.
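
One way to start producing the numbers asked for above is to snapshot
/proc/meminfo before and after a test scenario and compare the two. The
following is only a minimal sketch: the helper names (parse_meminfo,
meminfo_delta, read_meminfo) are made up for illustration, and which
fields best capture tmpfs usage on the XO kernel is left as an open
question.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are
    plain counts. The unit suffix, if any, is ignored.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def meminfo_delta(before, after):
    """Return after - before for every field present in both snapshots."""
    return {k: after[k] - before[k] for k in before if k in after}


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live meminfo file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Calling read_meminfo() before and after launching an activity and
passing both results to meminfo_delta() gives a per-field difference
that can be logged or graphed across builds.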
   
  * We need to check carefully for memory leaks. Three approaches that
    occur to me:

      1) running the system for a period of time, then scanning for
         anomalies either manually or in some automated fashion from
         userland, kernel-land, or OFW (via SysRq or SMM).
      
      2) setting rlimits on various processes and noting what dies

      3) using debugging tools such as the Python garbage collection
         (gc) module, guppy/heapy, gdb+macros, valgrind, efence, purify,
         etc. to look for trouble.
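
As a rough illustration of the third approach, the sketch below uses
Python's standard gc module to count live, gc-tracked objects by type
and to report which types grew between two snapshots. The function
names are hypothetical, and this only sees objects the collector
tracks; it says nothing about memory held by C extensions or by X.

```python
import gc
from collections import Counter


def count_objects_by_type():
    """Count gc-tracked objects, keyed by type name."""
    gc.collect()  # drop collectable garbage first so it is not counted
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, min_increase=1):
    """List (type name, increase) pairs that grew, largest first."""
    diffs = ((name, after[name] - before.get(name, 0)) for name in after)
    grown = [(name, d) for name, d in diffs if d >= min_increase]
    return sorted(grown, key=lambda item: item[1], reverse=True)
```

Taking one snapshot at startup and another after an hour of use, then
looking at the top few entries of growth(), is a cheap first pass
before reaching for heavier tools such as heapy or valgrind.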

  * We need to find out why the oom-killer is not killing things fast
    enough. Based on our results, we might consider configuring
    /proc/$pid/oom_adj to preferentially kill some processes (e.g., the
    foreground [or background?] activities.)
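
For experiments with /proc/$pid/oom_adj, something like the following
could be used. It is a sketch only: set_oom_adj and its proc_root
parameter are invented here (proc_root exists so the function can be
tried against a scratch directory), and it assumes the classic oom_adj
range of -17 (never kill) through 15 (kill first). Newer kernels prefer
/proc/$pid/oom_score_adj, which uses a different scale.

```python
import os

OOM_ADJ_MIN = -17  # special value: exempt the process from the oom-killer
OOM_ADJ_MAX = 15   # most likely to be killed


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write an oom_adj value for pid and return the path written.

    Lowering the value of another user's process, or lowering it at
    all on many kernels, requires root privileges.
    """
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError("oom_adj must be between %d and %d"
                         % (OOM_ADJ_MIN, OOM_ADJ_MAX))
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as f:
        f.write("%d\n" % value)
    return path
```

Whether foreground or background activities should receive the higher
value is exactly the policy question raised above; the helper does not
decide that.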

  * We need to determine whether the oom-killer is killing the right
    processes. (sysctl's vm.oom_dump_tasks can be set to 1 in order to
    get more verbosity from the oom-killer when it fires).
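
The vm.oom_dump_tasks setting can also be flipped from a test script
rather than via the sysctl command, since sysctl names map onto files
under /proc/sys. A minimal sketch, with hypothetical helper names and a
root parameter added purely so it can be exercised outside /proc:

```python
import os


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name such as 'vm.oom_dump_tasks' to its file."""
    return os.path.join(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a sysctl as a stripped string."""
    with open(sysctl_path(name, root)) as f:
        return f.read().strip()


def write_sysctl(name, value, root="/proc/sys"):
    """Set a sysctl; writing under the real /proc/sys needs root."""
    with open(sysctl_path(name, root), "w") as f:
        f.write("%s\n" % value)
```

For example, write_sysctl("vm.oom_dump_tasks", 1) enables the extra
oom-killer output, and read_sysctl("vm.overcommit_memory") reports the
kernel's current overcommit policy.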

  * We ought to ponder whether there are any additional "dirty hacks" we
    can experiment with in order to reduce memory consumption; for
    example, running the Shell and Journal (and DS?) in one process or
    making use of the compressed-caching code published on this list some
    months ago.

  * Random other stuff to think about:
     
    - rlimits, cgroups, and the memory resource controller

    - the warnings in the ramfs and tmpfs code about the deadlocks that
      tmpfsen can generate under low- or no-memory conditions.

    - whether our kernel "overcommits" when allocation requests are made?

    - whether we can get Browse to behave intelligently when it receives
      BadAlloc errors from X?

    - how to run bootchart on the XO

    - how to generate decent statistics and graphics (preferably in an
      automated fashion) concerning memory usage as part of our test
      suite
    
    - SystemTap's kmalloc2.stp example

In conclusion, more to come once I have some actual data; _please_ feel
free to assist in collecting it! (though be aware that I may 'volunteer'
you if I need your help. (That means you, Tomeu, Riccardo, Deepak,
...)).

Regards,

Michael

P.S. - Thanks to cscott and cjb for their advice during our brief
planning session.

P.P.S. - Please follow up if you think I missed any avenues that might
be worth pursuing in order to address this rather large and fuzzy problem
space -- there's plenty of room left for good ideas that didn't occur to
me, Scott, or Chris.


