Oprofile, swap
John Richard Moser
nigelenki at comcast.net
Tue Dec 18 18:27:54 EST 2007
(Note: most of this message probably isn't very useful; it's about
theoretical software architecture that nobody's going to implement,
that I can't prove, and that I'm not really 100% sure about. Still, if
you WANT to read it, hey... remember, bad ideas sometimes get corrected
by people who are smart enough to turn them into GOOD ideas.)
Ivan Krstić wrote:
> On Dec 18, 2007, at 12:27 PM, Jameson Chema Quinn wrote:
>> Has anyone looked at Psyco on the XO?
>
>
> Psyco improves performance at the cost of memory. On a
> memory-constrained machine, it's a tradeoff that can only be made in
> laser-focused, specific cases. We have not done the work -- partly for
It would be wise to throw out the idea of laser-focusing on which
engine to use for which case; think of the memory cost of running
multiple Python implementations at once. Then again, what IS wise?
Any such system needs to use memory efficiently. I like the idea of
one based on Mono, since it has that whole compacting garbage
collector, which (although a cache destroyer by nature) at least
shrinks memory usage back down. Of course, then you still have Mono on
top of it, plus the CIL code that's been generated, reflected, and
JIT'd, which means you (again) have two interpreters in memory (one
written in CIL, one being Mono itself), one of which gets dynamically
recompiled (the Python one, in CIL), and all the intermediary CIL code
gets kept around for later profiling and optimization...
... didn't I say before that I hate the concept of JIT dynamic
compilation? Interpreters just suck by nature, due to dcache problems
(code becomes data; effectively your instruction working set is fixed,
so the load doesn't spread across both icache and dcache as the
program gets bigger...) and due to the fact that you have to do a LOT
of work to decode an insn in software (THINK ABOUT QEMU). Interpreters
for specific languages like Python and Perl have the advantage of not
having to be general CPU emulators, so they can have instructions that
are just function calls straight into native code.
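To make that concrete, here's a toy sketch in Python of
specific-language dispatch (the opcodes and the table are entirely
made up for illustration): each insn is just an index into a table of
native functions, so "decoding" is one lookup, not CPU emulation.

    # Toy dispatch loop (hypothetical opcodes): each opcode maps
    # straight to a native function, so a "CALL" never has to emulate
    # a CPU; it heads directly into native code.
    NATIVE_OPS = {
        0x01: lambda stack: stack.append(stack.pop() + stack.pop()),  # ADD
        0x02: lambda stack: stack.append(len(stack.pop())),           # STRLEN
    }

    def run(bytecode, stack):
        for op in bytecode:
            NATIVE_OPS[op](stack)   # straight into native land
        return stack

    print(run([0x01], [2, 3]))      # [5]
    print(run([0x02], ["hello"]))   # [5]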
So, execution time order (fastest first):
Native code < // *1
JIT < // *2
Specific language interpreter < // *3
General bytecode interpreter < // *4
Parser script interpreter // *5
*1: Native code. C, Obj-C, something compiled; everything else I
could mention is out of date.
*2: Technically JIT output is native code, but there are also extra
considerations with memory use, and cache pressure comes into play
slightly. After the ball gets rolling it just eats more memory, but
cache behavior and execution speed are fine.
*3: A specific-language interpreter might call a native strcpy()
directly, instead of having an insn for CALL that goes into a bytecode
implementation of strcpy(), or an insn for CALL that goes into a
bytecode strcpy() which just sets up a binding and calls the real
native strcpy(). The interpreter heads straight for native land, going
"function foobar() gets assigned token 0x?? and I'll know what to do
when I see it."
*4: A general bytecode interpreter is going to have to be a CPU
emulator. Java and Mono count, for the Java and CIL CPUs. Those CPUs
don't really exist, but the interpreters work that way; they even have
their own assembly.
*5: Some script engines are REALLY FREAKING DUMB and actually send
each line through the parser every time they see it, which is
mega-slow. These usually don't last, or just function as proofs of
concept until a real bytecode translator gets written, turning them
into specific-language interpreters.
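Here's a quick way to feel the gap between *5 and *3 (an illustrative
sketch; the exact numbers will vary by machine): time re-parsing a
line on every execution against compiling it once and reusing the
cached code object.

    # Re-parse on every pass (*5) vs. compile once, execute many (*3).
    import timeit

    line = "x = 1 + 2 * 3"

    def reparse_each_time():
        exec(compile(line, "<line>", "exec"))   # parse + compile every call

    code = compile(line, "<line>", "exec")      # parse once up front
    def precompiled():
        exec(code)                              # reuse cached bytecode

    print(timeit.timeit(reparse_each_time, number=100000))
    print(timeit.timeit(precompiled, number=100000))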
Maybe, MAYBE by twiddling with a JIT, you could convince it to discard
generated bytecode. For example, assuming we're talking about a Python
implementation on top of Mono, and we can modify Mono any way we want
with reasonably little effort:
- Python -> internal tree (let's say Gimple, like gcc)
- Gimple -> optimizer (Python)
- Gimple (opt) -> optimizer (general)
- Gimple (opt) -> CIL data (for reflection)
- FREE: Gimple
- CIL (data) -> Reflection (CIL)
- FREE: CIL data (for reflection)
- CIL -> CIL optimizer
- CIL (opt) -> JIT (x86)
- While (not satisfied)
- The annoying process of dynamic profiling
- CIL (opt, profiled) -> JIT (x86)
- FREE: CIL
NOTE: at the "FREE: CIL data" step, we're talking about the Python
interpreter freeing the CIL data it holds; Mono has by then loaded its
own copy as CIL code, so we don't need to hand it over again; we're
done with it.
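As a sketch of that lifecycle (every stage below is a made-up stub;
only the ordering and the explicit frees matter here):

    # Hypothetical pipeline skeleton; all stage names are invented.
    def parse_to_gimple(source):          return ("gimple", source)
    def optimize_python_level(tree):      return tree
    def optimize_general(tree):           return tree
    def emit_cil(tree):                   return b"\x00CIL"  # fake blob
    def reflect_load(blob):               return {"cil": blob}
    def optimize_cil(module):             return module
    def jit_to_x86(module, profile=None): return b"\x90"     # fake code
    def satisfied(native):                return True   # one pass only
    def run_with_profiling(native):       return {}
    def release_cil(module):              module.pop("cil", None)

    def compile_python_to_native(source):
        tree = parse_to_gimple(source)     # Python -> internal tree
        tree = optimize_python_level(tree)
        tree = optimize_general(tree)
        cil_blob = emit_cil(tree)
        del tree                           # FREE: Gimple
        module = reflect_load(cil_blob)    # Mono now owns its own copy
        del cil_blob                       # FREE: our CIL data
        module = optimize_cil(module)
        native = jit_to_x86(module)
        while not satisfied(native):       # dynamic profiling loop
            native = jit_to_x86(module, run_with_profiling(native))
        release_cil(module)                # FREE: CIL (needs Mono's help)
        return native

    compile_python_to_native("print('hi')")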
At this point we should have:
- A CIL program for a Python interpreter
- A CIL interpreter (Mono)
- x86 native code for the program
Further, you should be able to make the Python interpreter do a number
of things:
- Translate any Python-written libraries via JIT on a method-for-method
basis
- Translate Python bindings (Python calling C) to active CIL bindings
(to avoid calling back to the interpreter)
- Unload most of itself when done (say, when it's been unused for
about 5 minutes of execution time), save for a stub that loads the
Python interpreter back into memory when a new method gets called, so
that it can dynamic-compile it.
Thus, you should be able to achieve the near-total elimination of the
CIL program for the Python interpreter and just leave Mono and the
program itself, already JIT'd to native code, in memory. You would
need a fragment of the Python interpreter resident to handle any entry
back into the Python interpreter, with a single function to load it
again; each re-entry point would just load the whole engine and then
jump to the actual handler in it. This of course isn't much (I'd be
surprised if you needed a whole page of code for it).
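That resident fragment could look something like this sketch
(hypothetical; "json" merely stands in for the big engine module, and
the 5-minute idle timer is left out):

    import importlib, sys

    ENGINE = "json"  # stand-in for the real Python->CIL engine module

    def unload_engine():
        sys.modules.pop(ENGINE, None)      # let the bulk be collected

    def reenter(handler, *args):
        engine = importlib.import_module(ENGINE)  # reload on demand
        return getattr(engine, handler)(*args)    # jump to the handler

    unload_engine()
    print(reenter("dumps", {"hello": "world"}))   # engine comes back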
Mind you, there are a number of flaws in this argument. You probably
noticed most of them.
- IronPython's license is not entirely acceptable, and nobody is going
to write a SECOND Python->CIL dynamic compiler.
- You can get an IL stream for any compiled method, so Mono won't free
CIL stuff. It may actually be small enough not to matter; or not.
I believe it's actually too big to be feasible. MAYBE you could add
something to Mono that allows flushing it permanently, on purpose
(i.e. by the Python interpreter).
- You're still dealing with JIT'd code, which is still not shareable.
Mono seems to put it in WX segments, so by counting those (in
kilobytes) I can ascertain the exact size of the executable code.
Because it's not shared, it doesn't get evicted from memory when
unused the way normal .so files or /bin executables do. I have a
memory analysis script I wrote that does the trick, if I have bash
play with the output; here's what Tomboy looks like on x86-64, about
13MB:
$ echo $(( $(~/memuse.sh 19132 | grep "p:wx" | cut -f1 -d' ' | \
tr -d 'K' | tr -d 'B' | xargs | sed -e "s/ / + /g" ) ))
13544
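(memuse.sh itself ended up as a scrubbed attachment; as a rough,
assumed equivalent of what it does, here's a sketch that sums the
Size: fields of private writable+executable mappings straight out of
/proc/PID/smaps:)

    # Rough stand-in for memuse.sh (Linux-only): total up private
    # W+X mappings for a PID, in kilobytes.
    import sys

    def wx_kilobytes(pid):
        total, in_wx = 0, False
        with open("/proc/%s/smaps" % pid) as smaps:
            for line in smaps:
                fields = line.split()
                if fields and "-" in fields[0]:   # mapping header line
                    perms = fields[1]             # e.g. "rwxp"
                    in_wx = ("w" in perms and "x" in perms
                             and perms.endswith("p"))
                elif in_wx and fields and fields[0] == "Size:":
                    total += int(fields[1])       # value is in kB
        return total

    if __name__ == "__main__":
        print("%d kB of private WX mappings" % wx_kilobytes(sys.argv[1]))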
I like to think of programs like kernels, or kernels like programs.
Either way, I like to treat applications like microkernels. In the
embedded scene this may actually be critical; maybe you should think
that way for the XO, at least in small part. (Re: the part about
unloading the entire Python interpreter except for a little bit that
reloads it if needed...)
> --
> Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | http://radian.org
--
Bring back the Firefox plushy!
http://digg.com/linux_unix/Is_the_Firefox_plush_gone_for_good
https://bugzilla.mozilla.org/show_bug.cgi?id=322367
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memuse.sh
Type: application/x-shellscript
Size: 2548 bytes
Desc: not available
URL: <http://lists.laptop.org/pipermail/devel/attachments/20071218/8413058f/attachment.bin>