Performance issues
John Richard Moser
nigelenki at comcast.net
Sat Sep 23 01:36:22 EDT 2006
So far the only things I've come up with for performance are hardware
related, which isn't helpful unless you can get some custom toys for the
next revision (i.e. years from now).
Based on a number of things (cachegrind and some math drawing out
multitasking) I'm thinking there's going to be a bare minimum of 33.2%
performance hit due to the low amount of cache on the chip, and probably
a 17.5% hit added by cache thrashing due to multitasking. If I'm
blending these together right (take the 33.2% away, take 17.5% of what's
left away, then find out what's been taken away in total), that's a
total of about 45% performance hit.
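
A quick sketch of that blending math, in case anyone wants to plug in
their own numbers. The function name `combined_hit` is made up for
illustration; the only inputs are the two percentages quoted above.

```python
def combined_hit(*hits):
    """Blend independent performance hits given as fractions (0.332 = 33.2%).

    Each hit takes its share out of whatever performance is still left.
    """
    remaining = 1.0
    for hit in hits:
        remaining *= 1.0 - hit
    return 1.0 - remaining


total = combined_hit(0.332, 0.175)
print(f"total hit: {total:.1%}")
```

With 33.2% and 17.5% this comes out just under 45%, which matches the
rough figure above.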
I think cachegrinding some of the actual apps on the machine may give
more useful numbers but they'll likely be in the same general trend (I
used Rhythmbox playing Ogg Vorbis). I'll make some suggestions for the
next OLPC revision but as far as this one goes you're stuck with what
you picked for the first revision.
I'm assuming that the OLPCs will get newer, better hardware in the
future in a second revision; when this happens I would look to getting
some L2 cache that can be accessed in 2-4 cycles.
I had an interesting idea along these lines. I believe the memory
controller is on the Geode GX's companion north bridge chip; with a
slight alteration to the memory controller, it could become possible to
chop off some memory from the bottom of physical RAM and make it off-die
L2 cache. It would work like this:
* Real physical memory access (hardware going through the memory
controller for DMA, or just the CPU going through the memory
controller) would shift the addresses over a few bits (i.e. 256K,
shift left 18). In hardware I believe you can set up the gates to
always perform the shift, but I'm no EE. The point is you should be
able to do this without performing a calculation.
* L2 cache would be direct addressed, but only so many bits; it would
always touch the lowest physical RAM. Due to this cut-through
architecture, all calculations with TLB entries could be skipped,
and it would be possible to do an L2 cache hit in much less than the
25 cycles an L1 cache miss takes (I'm assuming it would be
equivalent to off-die L2 or L3 cache).
* To add some extra flexibility, you could have the BIOS set the
amount of L2 cache within a range (0-20 bits == 0-1MiB of L2). This
would require a number of gate configurations to perform
calculationless translation for RAM access, but it would be very
cool and increase the chip's marketability (very specialized tasks
may not require L2 at all and may require so little RAM that
spending a meg on L2 would be a waste; however, in some cases, e.g.
OLPC, the 1M taken from 128M will be a friggin' godsend).
* Bonuses in this design: there is NO added cache chip and NO added
refresh clock (static RAM is MUCH more expensive than DRAM), and
thus there is a savings on both extra parts AND power usage.
Physical RAM: [| | | | | | | |]
Amount:       \   127    / \1/
Use:              RAM      L2 cache
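
To make the address handling a bit more concrete, here is a toy software
model of one possible reading of the scheme. Everything in it is an
assumption for illustration: the function names are made up, "0 bits" is
treated as "no L2 reserved", and ordinary RAM accesses are modeled as
being offset past the reserved region. Real hardware would do this with
wiring rather than arithmetic, so this only shows which addresses end up
where.

```python
MAX_CACHE_BITS = 20  # 20 bits == 1 MiB, the upper end of the range above


def cache_size_bytes(bits):
    """Bytes of low physical RAM reserved as L2 (assumed: 0 bits means none)."""
    if not 0 <= bits <= MAX_CACHE_BITS:
        raise ValueError("cache bits must be between 0 and 20")
    return 0 if bits == 0 else 1 << bits


def ram_address(addr, bits):
    """Ordinary RAM access: moved up past the region reserved for L2."""
    return addr + cache_size_bytes(bits)


def l2_address(addr, bits):
    """L2 access: keep only the low bits, so it always lands in the
    reserved region at the bottom of physical RAM."""
    if bits == 0:
        return None  # no L2 configured
    return addr & (cache_size_bytes(bits) - 1)


print(hex(ram_address(0x0, 18)))        # first usable RAM byte with 256K of L2
print(hex(l2_address(0x12345678, 20)))  # where that address lands in a 1 MiB L2
```

The masking in `l2_address` is just the usual direct-mapped trick of
dropping the high address bits; no lookup is involved.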
To do something like that of course you'd have to convince AMD to put
that in their memory controller. This costs money to do, but may be
interesting; it could prove useful in other places like normal desktops
with L3 or L4 cache taken from main RAM and other such nonsense that we
probably don't REALLY care about in the real world because we already
have a meg of L2 on Athlon 64s in that sector anyway. At the very least
it'd be a selling point in future embedded systems where adjusting the
L2 cache to the workload IS feasibly useful.
Please note I'm not an electrical engineer; I'm a software guy and I
have read the data book on the Geode GX (skimmed it for stuff I wanted
to know, like how much cache was around and how expensive cache misses
are). Take anything said above with a grain of salt; I only want my
thoughts out there in case someone can find something useful buried in them.
Cachegrind settings to simulate Geode GX:
valgrind --tool=cachegrind --I1=16384,4,32 --D1=16384,4,32 --L2=64,4,32
There is no L2; assume all L2 hits are L1 misses or just calculate the
L1 miss rate yourself. Cachegrind DEMANDS there be L2.
A cache miss is 25 cycles, a cache hit is 1; to get performance
impact, multiply the percentage of cache misses by 25. That's your
performance hit percent.
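
The rule of thumb above, written out as a small sketch. The helper names
are made up, and the 25-cycle penalty is the figure quoted above; the
miss and reference counts would come from your own cachegrind run.

```python
MISS_PENALTY_CYCLES = 25  # cost of an L1 miss, per the figure above


def miss_rate_percent(misses, refs):
    """L1 miss rate as a percentage of all references."""
    return 100.0 * misses / refs


def performance_hit_percent(miss_rate_pct, penalty=MISS_PENALTY_CYCLES):
    """Rule of thumb from above: miss rate (in percent) times miss penalty."""
    return miss_rate_pct * penalty


# Example with made-up counts: 1,328 misses out of 100,000 references.
rate = miss_rate_percent(1328, 100000)
print(f"miss rate {rate:.3f}% -> hit {performance_hit_percent(rate):.1f}%")
```

Under this rule, a miss rate a little over 1.3% is already enough to
produce a hit in the 33% range.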
Remember that cachegrind doesn't profile multitasking. After
switching around to a bunch of tasks, cache is going to be pretty much
destroyed and you're going to eventually make one pass at 100% cache
misses. This takes 0.7% of a time slice to complete, so add 0.7% to the
miss rate? (I think)
OTHER THOUGHTS:
Software... Python is going to turn script into bytecode. It's then
going to read the bytecode and execute the corresponding code.
Effectively the program's 'code' is accessed as data, and thus the D1
cache will suffer WORSE miss rates because you're working with more
data. However, the I1 cache may not have to miss as much, as you won't
grow the actual program code with bigger programs. I'm not sure how this
will work out overall.
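
To see the "code is data" point for yourself, here is a small sketch
using the standard `dis` module. This assumes CPython; the function
`add_one` is just a throwaway example, and the exact opcodes printed
vary between Python versions.

```python
import dis


def add_one(x):
    return x + 1


# The compiled function body is an ordinary bytes object that the
# interpreter loop reads like any other data.
raw = add_one.__code__.co_code
print(type(raw), len(raw), "bytes of bytecode")

# Decode it into individual instructions.
for ins in dis.get_instructions(add_one):
    print(ins.offset, ins.opname)
```

The interpreter's own machine code stays the same size no matter how big
the script gets; only this bytes data grows.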
It may be possible to teach Python to profile what parts of the bytecode
jump to other parts of the bytecode, and then shift them closer together
to get better cache locality. It may also be possible to just improve
Python's bytecode generator and have it optimize more, shrink things
down more, do some kind of instruction encoding (fully analyze it, turn
some redundant bytecode into dictionary entries in a window, etc.).
I'm looking at writing a new memory allocator to replace Ptmalloc in
glibc and reduce the amount of memory waste. This should increase
available memory, improve cache locality, etc. No promises, I have no
idea how well this will work and I doubt the cache locality thing will
be much of anything. I doubt I can do more than 5% (i.e. nothing
noticeable) performance increase with software; if I can even approach
1% it'll be a miracle.
What else... Nitin Gupta is working on compressed caching in Linux 2.6.
He wants someone to give him funding and has stated he'll try to get it
into mainline if he's funded. He's hitting up Canonical for money but I
don't know if they'll give him any; this may be a good place to burn off
a little extra cash.
--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.
Creative brains are a valuable, limited resource. They shouldn't be
wasted on re-inventing the wheel when there are so many fascinating
new problems waiting out there.
-- Eric Steven Raymond
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension