Performance issues

Sat Sep 23 01:36:22 EDT 2006

So far the only things I've come up with for performance are hardware
related, which isn't helpful unless you can get some custom toys for the
next revision (i.e. years from now).

Based on a number of things (cachegrind and some math drawing out
multitasking), I'm thinking there's going to be a bare minimum 33.2%
performance hit due to the small amount of cache on the chip, and
probably another 17.5% hit from cache thrashing due to multitasking.  If
I'm blending these together right (take the 33.2% away, take 17.5% of
what's left away, then find out what's been taken away in total), that's
a total of about 45% performance hit.
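
The blending described above can be checked with a few lines of Python.
This is only a sketch of the arithmetic; the function name combined_hit
is made up for illustration, and it assumes the two hits compound
multiplicatively as the paragraph describes.

```python
def combined_hit(*hits):
    """Blend fractional slowdowns by applying each one to what is left."""
    remaining = 1.0
    for h in hits:
        remaining *= (1.0 - h)   # fraction of performance left after this hit
    return 1.0 - remaining       # total fraction taken away

total = combined_hit(0.332, 0.175)
print(f"{total:.1%}")            # prints 44.9%
```

So the "about 45%" figure is 1 - (0.668 * 0.825), a little under the 50.7%
you would get by simply adding the two numbers.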

I think cachegrinding some of the actual apps on the machine may give
more useful numbers but they'll likely be in the same general trend (I
used Rhythmbox playing Ogg Vorbis).  I'll make some suggestions for the
next OLPC revision but as far as this one goes you're stuck with what
you picked for the first revision.

I'm assuming that the OLPCs will get newer, better hardware in the
future in a second revision; when this happens I would look to getting
some L2 cache that can be accessed in 2-4 cycles.

I had an interesting idea along these lines.  I believe the memory
controller is on the Geode GX's companion north bridge chip; with a
slight alteration to the memory controller, it could become possible to
chop off some memory from the bottom of physical RAM and make it off-die
L2 cache.  It would work like this:

  * Real physical memory access (hardware going through the memory
    controller for DMA, or just the CPU going through the memory
    controller) would shift the addresses over a few bits (i.e. 256K,
    shift left 18).  In hardware I believe you can set up the gates to
    always perform the shift, but I'm no EE.  The point is you should be
    able to do this without performing a calculation.

  * L2 cache would be direct addressed, but only so many bits; it would
    always touch the lowest physical RAM.  Due to this cut-through
    architecture, all calculations with TLB entries could be skipped,
    and it would be possible to do an L2 cache hit in much less than the
    25 cycles an L1 cache miss takes (I'm assuming it would be
    equivalent to off-die L2 or L3 cache).

  * To add some extra flexibility, you could have the BIOS set the
    amount of L2 cache within a range (0-20 bits == 0-1MiB of L2).  This
    would require a number of gate configurations to perform
    calculationless translation for RAM access, but it would be very
    cool and increase the chip's marketability (very specialized tasks
    may not require L2 at all and may require so little RAM that
    spending a meg on L2 would be a waste; however in some cases, e.g.
    OLPC, the 1M taken from 128M will be a friggin' godsend).

  * Bonuses of this design: there is NO added cache chip (static RAM
    is MUCH more expensive than DRAM) and NO added refresh clock, so
    there is a savings on both extra parts AND power usage.

Physical RAM:  [| | | | | | | |]
 Amount:         \  127  / \1/
 Use:             RAM       L2 cache
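
To make the size bookkeeping concrete, here is a rough Python sketch of
the split pictured above.  It only models the arithmetic, not the
hardware; the names l2_bytes, split_ram and l2_offset are hypothetical,
and treating "0 bits" as "no L2 at all" is my reading of the 0-20 bit
range rather than something spelled out in the text.

```python
MIB = 1 << 20

def l2_bytes(bits):
    # ASSUMPTION: 0 bits means "no L2"; otherwise the region is 2**bits bytes.
    return 0 if bits == 0 else 1 << bits

def split_ram(total_bytes, bits):
    """Return (ordinary RAM, L2 region) sizes for a given address width."""
    cache = l2_bytes(bits)
    return total_bytes - cache, cache

def l2_offset(addr, bits):
    # Direct addressing: keep only the low `bits` bits of the address,
    # so no arithmetic beyond masking is needed to locate the cache slot.
    return addr & ((1 << bits) - 1)

ram, l2 = split_ram(128 * MIB, 20)
print(ram // MIB, l2 // MIB)     # prints 127 1
```

With bits=18 the same functions give the 256K region mentioned in the
first bullet.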

To do something like that, of course, you'd have to convince AMD to put
it in their memory controller.  This costs money to do, but may be
interesting; it could prove useful in other places, like normal desktops
with L3 or L4 cache taken from main RAM, and other such nonsense that we
probably don't REALLY care about in the real world because we already
have a meg of L2 on Athlon 64s in that sector anyway.  At the very least
it'd be a selling point in future embedded systems, where adjusting the
L2 cache to the workload IS actually useful.

Please note I'm not an electrical engineer; I'm a software guy who has
read the data book on the Geode GX (skimmed it for stuff I wanted to
know, like how much cache was around and how expensive cache misses
are).  Take anything said above with a grain of salt; I only want my
thoughts out there in case someone can find something useful buried in
them.

Cachegrind settings to simulate Geode GX:

valgrind --tool=cachegrind --I1=16384,4,32 --D1=16384,4,32 --L2=64,4,32

  There is no L2 on the chip; treat every L2 hit as a miss that goes
all the way to RAM, or just calculate the L1 miss rate yourself.
Cachegrind DEMANDS there be an L2.

  A cache miss is 25 cycles, a cache hit is 1; to get performance
impact, multiply the percentage of cache misses by 25.  That's your
performance hit percent.
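
  As a sketch of that rule of thumb in Python (the function names are
made up, and the 2% miss rate is just an illustrative input, not a
measurement):

```python
MISS_CYCLES = 25   # cost of a cache miss, per the text
HIT_CYCLES = 1     # cost of a cache hit, per the text

def hit_percent(miss_rate_percent):
    """Rule of thumb from the text: miss rate (in percent) times 25."""
    return miss_rate_percent * MISS_CYCLES

def avg_cycles(miss_rate_percent):
    """Average cycles per access, weighting hits and misses by frequency."""
    m = miss_rate_percent / 100.0
    return (1.0 - m) * HIT_CYCLES + m * MISS_CYCLES

print(hit_percent(2.0))   # prints 50.0
```

  Note that avg_cycles works out to 1 + 24 * miss rate, so strictly
speaking the extra cost per access is 24 cycles per miss rather than 25;
the rule of thumb slightly overestimates, which probably doesn't matter
at this level of precision.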

  Remember that cachegrind doesn't profile multitasking.  After
switching around between a bunch of tasks, the cache is going to be
pretty much destroyed, and you're eventually going to make one pass at
100% cache misses.  This takes 0.7% of a time slice to complete, so add
0.7% to the miss rate?  (I think)
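
  The text doesn't say where the 0.7% comes from.  One set of numbers
that lands close to it is sketched below; the 366 MHz clock and the
10 ms time slice are ASSUMPTIONS of this sketch and are not stated
anywhere above, while the cache geometry and miss cost are taken from
the cachegrind settings and the 25-cycle figure.

```python
CACHE_BYTES = 16384       # size of each L1 cache (I1 and D1), from the settings
LINE_BYTES = 32           # cache line size, from the settings
MISS_CYCLES = 25          # cost of one miss, from the text
CLOCK_HZ = 366_000_000    # ASSUMPTION: CPU clock, not stated in the text
SLICE_SECONDS = 0.010     # ASSUMPTION: scheduler time slice, not stated

lines = 2 * CACHE_BYTES // LINE_BYTES     # lines in I1 plus D1
refill_cycles = lines * MISS_CYCLES       # one full pass of misses
slice_cycles = CLOCK_HZ * SLICE_SECONDS   # cycles available per time slice
fraction = refill_cycles / slice_cycles

print(f"{fraction:.2%}")                  # prints 0.70%
```

  If the real clock or time slice differs, the refill cost scales
accordingly, so treat this as a plausibility check only.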

OTHER THOUGHTS:

Software... Python is going to turn the script into bytecode.  It's then
going to read the bytecode and execute the corresponding code.
Effectively the program's 'code' is accessed as data, and thus the D1
cache will suffer WORSE miss rates because you're working with more
data.  However, the I1 cache may not have to miss as much, as you won't
grow the actual program code with bigger programs.  I'm not sure how
this will work out overall.
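
The "code is data" point is easy to see from Python itself.  A minimal
sketch using the standard compile() builtin and the dis module (the
three-line script is just a made-up example):

```python
import dis

source = "total = 0\nfor i in range(3):\n    total += i\n"
code = compile(source, "<example>", "exec")

# The compiled program is an ordinary bytes object that the
# interpreter loop reads like any other data.
print(type(code.co_code))
print(len(code.co_code), "bytes of bytecode")

# Human-readable listing of the same bytes.
dis.dis(code)
```

The exact byte count and opcodes vary between Python versions, so no
particular listing should be expected.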

It may be possible to teach Python to profile what parts of the bytecode
jump to other parts of the bytecode, and then shift them closer together
to get better cache locality.  It may also be possible to just improve
Python's bytecode generator and have it optimize more, shrink things
down more, do some kind of instruction encoding (fully analyze it, turn
some redundant bytecode into dictionary entries in a window, etc.).
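
As a starting point for the locality idea, here is a rough sketch that
lists how far each jump in a function's bytecode travels.  It is a static
approximation, not the runtime profiling described above, and the helper
name jump_distances and the sample function are made up for
illustration.

```python
import dis

def jump_distances(func):
    """Return (offset, target, distance) for each jump in func's bytecode."""
    jumps = []
    for ins in dis.get_instructions(func):
        # hasjrel / hasjabs list the opcodes that carry a jump target;
        # for those, argval is the target offset in bytes.
        if ins.opcode in dis.hasjrel or ins.opcode in dis.hasjabs:
            jumps.append((ins.offset, ins.argval, abs(ins.argval - ins.offset)))
    return jumps

def sample(n):
    total = 0
    for i in range(n):
        if i % 2:
            total += i
    return total

for offset, target, dist in jump_distances(sample):
    print(offset, "->", target, "distance", dist)
```

A real version would need to weight each jump by how often it is taken
at runtime before deciding which blocks to move closer together.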

I'm looking at writing a new memory allocator to replace Ptmalloc in
glibc and reduce the amount of memory waste.  This should increase
available memory, improve cache locality, etc.  No promises, I have no
idea how well this will work and I doubt the cache locality thing will
be much of anything.  I doubt I can do more than 5% (i.e. nothing
noticeable) performance increase with software; if I can even approach
1% it'll be a miracle.

What else... Nitin Gupta is working on compressed caching in Linux 2.6;
he wants someone to give him funding and has stated he'll try to get it
into mainline if he's funded.  He's hitting up Canonical for money but I
don't know if they'll give him any; this may be a good place to burn off
a little extra cash.

-- 
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.
