Oprofile, swap
John Richard Moser
nigelenki at comcast.net
Tue Dec 18 18:27:54 EST 2007
(Note: most of this message probably isn't very useful; it's about
theoretical software architecture that nobody's going to implement,
that I can't prove, and that I'm not really 100% sure about. Still, if
you WANT to read it, hey... remember, bad ideas sometimes get corrected
by people who are smart enough to turn them into GOOD ideas.)
Ivan Krstić wrote:
> On Dec 18, 2007, at 12:27 PM, Jameson Chema Quinn wrote:
>> Has anyone looked at Psyco on the XO?
>
>
> Psyco improves performance at the cost of memory. On a
> memory-constrained machine, it's a tradeoff that can only be made in
> laser-focused, specific cases. We have not done the work -- partly for
It would be wise to throw out the idea of laser-focusing on which
engine to use for which case; think of the memory cost of running
multiple Python implementations at once. Then again, what IS wise?
Any such system needs to use memory efficiently. I like the idea of
one based on Mono, since it has that whole compacting garbage
collector, which (although a cache destroyer by nature) at least
shrinks memory usage back down. Of course, then you still have Mono on
top of it, plus the CIL code that's been generated, reflected, and
JIT'd, which means you (again) have two interpreters in memory (one
written in CIL, one being Mono itself), one of which gets dynamically
recompiled (the Python one, in CIL), and all the intermediary CIL code
gets kept around for later profiling and optimization...
... didn't I say before that I hate the concept of JIT dynamic
compilation? Interpreters just suck by nature, due to dcache problems
(code becomes data; effectively your instruction working set is fixed,
so the load doesn't spread across both icache and dcache as the
program gets bigger...) and due to the fact that you have to do a LOT
of work to decode an insn in software (THINK ABOUT QEMU). Interpreters
for specific languages like Python and Perl have the advantage of not
having to be general CPU emulators, so they can have instructions that
are just function calls straight into native code.
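To make that concrete, here's a toy sketch in Python of
specific-language dispatch (the opcodes and the table are entirely
made up for illustration): each insn is just an index into a table of
native functions, so "decoding" is one lookup, not CPU emulation.

    # Toy dispatch loop (hypothetical opcodes): each opcode maps
    # straight to a native function, so a "CALL" never has to emulate
    # a CPU; it heads directly into native code.
    NATIVE_OPS = {
        0x01: lambda stack: stack.append(stack.pop() + stack.pop()),  # ADD
        0x02: lambda stack: stack.append(len(stack.pop())),           # STRLEN
    }

    def run(bytecode, stack):
        for op in bytecode:
            NATIVE_OPS[op](stack)   # straight into native land
        return stack

    print(run([0x01], [2, 3]))      # [5]
    print(run([0x02], ["hello"]))   # [5]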
So, execution time order (fastest first):
Native code < // *1
JIT < // *2
Specific language interpreter < // *3
General bytecode interpreter < // *4
Parser script interpreter // *5
*1: Native code. C, Obj-C, something compiled; everything else I
could mention is out of date.
*2: Technically JIT output is native code, but there are also extra
considerations with memory use, and cache pressure comes into play
slightly. After the ball gets rolling it just eats more memory, but
cache behavior and execution speed are fine.
*3: A specific-language interpreter might call a native strcpy()
directly, instead of having an insn for CALL that goes into a bytecode
implementation of strcpy(), or an insn for CALL that goes into a
bytecode strcpy() which just sets up a binding and calls the real
native strcpy(). The interpreter heads straight for native land, going
"function foobar() gets assigned token 0x?? and I'll know what to do
when I see it."
*4: A general bytecode interpreter is going to have to be a CPU
emulator. Java and Mono count, for the Java and CIL CPUs. Those CPUs
don't really exist, but the interpreters work that way; they even have
their own assembly.
*5: Some script engines are REALLY FREAKING DUMB and actually send
each line through the parser every time they see it, which is
mega-slow. These usually don't last, or just function as proofs of
concept until a real bytecode translator gets written, turning them
into specific-language interpreters.
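Here's a quick way to feel the gap between *5 and *3 (an illustrative
sketch; the exact numbers will vary by machine): time re-parsing a
line on every execution against compiling it once and reusing the
cached code object.

    # Re-parse on every pass (*5) vs. compile once, execute many (*3).
    import timeit

    line = "x = 1 + 2 * 3"

    def reparse_each_time():
        exec(compile(line, "<line>", "exec"))   # parse + compile every call

    code = compile(line, "<line>", "exec")      # parse once up front
    def precompiled():
        exec(code)                              # reuse cached bytecode

    print(timeit.timeit(reparse_each_time, number=100000))
    print(timeit.timeit(precompiled, number=100000))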
Maybe, MAYBE by twiddling with a JIT, you could convince it to discard
generated bytecode. For example, assuming we're talking about a Python
implementation on top of Mono, and we can modify Mono any way we want
with reasonably little effort:
- Python -> internal tree (let's say Gimple, like gcc)
- Gimple -> optimizer (Python)
- Gimple (opt) -> optimizer (general)
- Gimple (opt) -> CIL data (for reflection)
- FREE: Gimple
- CIL (data) -> Reflection (CIL)
- FREE: CIL data (for reflection)
- CIL -> CIL optimizer
- CIL (opt) -> JIT (x86)
- While (not satisfied)
- The annoying process of dynamic profiling
- CIL (opt, profiled) -> JIT (x86)
- FREE: CIL
NOTE: at the "FREE: CIL data" step, we're talking about the Python
interpreter freeing the CIL data it holds; Mono has by then loaded its
own copy as CIL code, so we don't need to hand it over again; we're
done with it.
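As a sketch of that lifecycle (every stage below is a made-up stub;
only the ordering and the explicit frees matter here):

    # Hypothetical pipeline skeleton; all stage names are invented.
    def parse_to_gimple(source):          return ("gimple", source)
    def optimize_python_level(tree):      return tree
    def optimize_general(tree):           return tree
    def emit_cil(tree):                   return b"\x00CIL"  # fake blob
    def reflect_load(blob):               return {"cil": blob}
    def optimize_cil(module):             return module
    def jit_to_x86(module, profile=None): return b"\x90"     # fake code
    def satisfied(native):                return True   # one pass only
    def run_with_profiling(native):       return {}
    def release_cil(module):              module.pop("cil", None)

    def compile_python_to_native(source):
        tree = parse_to_gimple(source)     # Python -> internal tree
        tree = optimize_python_level(tree)
        tree = optimize_general(tree)
        cil_blob = emit_cil(tree)
        del tree                           # FREE: Gimple
        module = reflect_load(cil_blob)    # Mono now owns its own copy
        del cil_blob                       # FREE: our CIL data
        module = optimize_cil(module)
        native = jit_to_x86(module)
        while not satisfied(native):       # dynamic profiling loop
            native = jit_to_x86(module, run_with_profiling(native))
        release_cil(module)                # FREE: CIL (needs Mono's help)
        return native

    compile_python_to_native("print('hi')")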
At this point we should have:
- A CIL program for a Python interpreter
- A CIL interpreter (Mono)
- x86 native code for the program
Further, you should be able to make the Python interpreter do a number
of things:
- Translate any Python-written libraries via JIT on a method-for-method
basis
- Translate Python bindings (Python calling C) to active CIL bindings
(to avoid calling back to the interpreter)
- Unload most of itself when done (say, when it's been unused for
about 5 minutes of execution time), save for a stub that loads the
Python interpreter back into memory when a new method gets called, so
that it can dynamic-compile it.
Thus, you should be able to achieve the near-total elimination of the
CIL program for the Python interpreter and just leave Mono and the
program itself, already JIT'd to native code, in memory. You would
need a fragment of the Python interpreter resident to handle any entry
back into the Python interpreter, with a single function to load it
again; each re-entry point would just load the whole engine and then
jump to the actual handler in it. This of course isn't much (I'd be
surprised if you needed a whole page of code for it).
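That resident fragment could look something like this sketch
(hypothetical; "json" merely stands in for the big engine module, and
the 5-minute idle timer is left out):

    import importlib, sys

    ENGINE = "json"  # stand-in for the real Python->CIL engine module

    def unload_engine():
        sys.modules.pop(ENGINE, None)      # let the bulk be collected

    def reenter(handler, *args):
        engine = importlib.import_module(ENGINE)  # reload on demand
        return getattr(engine, handler)(*args)    # jump to the handler

    unload_engine()
    print(reenter("dumps", {"hello": "world"}))   # engine comes back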
Mind you, there are a number of flaws in this argument. You probably
noticed most of them.
- IronPython's license is not entirely acceptable, and nobody is going
to write a SECOND Python->CIL dynamic compiler.
- You can get an IL stream for any compiled method, so Mono won't free
CIL stuff. It may actually be small enough not to matter; or not.
I believe it's actually too big to be feasible. MAYBE you could add
something to Mono that allows flushing it permanently, on purpose
(i.e. by the Python interpreter).
- You're still dealing with JIT'd code, which is still not shareable.
Mono seems to put it in WX segments, so by counting those (in
kilobytes) I can ascertain the exact size of the executable code.
Because it's not shared, it doesn't get evicted from memory when
unused the way normal .so files or /bin executables do. I have a
memory analysis script I wrote that does the trick, if I have bash
play with the output; here's what Tomboy looks like on x86-64, about
13MB:
$ echo $(( $(~/memuse.sh 19132 | grep "p:wx" | cut -f1 -d' ' | \
tr -d 'K' | tr -d 'B' | xargs | sed -e "s/ / + /g" ) ))
13544
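(memuse.sh itself ended up as a scrubbed attachment; as a rough,
assumed equivalent of what it does, here's a sketch that sums the
Size: fields of private writable+executable mappings straight out of
/proc/PID/smaps:)

    # Rough stand-in for memuse.sh (Linux-only): total up private
    # W+X mappings for a PID, in kilobytes.
    import sys

    def wx_kilobytes(pid):
        total, in_wx = 0, False
        with open("/proc/%s/smaps" % pid) as smaps:
            for line in smaps:
                fields = line.split()
                if fields and "-" in fields[0]:   # mapping header line
                    perms = fields[1]             # e.g. "rwxp"
                    in_wx = ("w" in perms and "x" in perms
                             and perms.endswith("p"))
                elif in_wx and fields and fields[0] == "Size:":
                    total += int(fields[1])       # value is in kB
        return total

    if __name__ == "__main__":
        print("%d kB of private WX mappings" % wx_kilobytes(sys.argv[1]))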
I like to think of programs like kernels, or kernels like programs.
Either way, I like to treat applications like microkernels. In the
embedded scene this may actually be critical; maybe you should think
that way for the XO, at least in small part. (Re: the part about
unloading the entire Python interpreter except for a little bit that
reloads it if needed...)
> --
> Ivan Krstić <krstic at solarsail.hcs.harvard.edu> | http://radian.org
--
Bring back the Firefox plushy!
http://digg.com/linux_unix/Is_the_Firefox_plush_gone_for_good
https://bugzilla.mozilla.org/show_bug.cgi?id=322367
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memuse.sh
Type: application/x-shellscript
Size: 2548 bytes
Desc: not available
URL: <http://lists.laptop.org/pipermail/devel/attachments/20071218/8413058f/attachment.bin>