Performance issues

John Richard Moser nigelenki at comcast.net
Sun Sep 24 02:38:42 EDT 2006



John Richard Moser wrote:
> So far the only things I've come up with for performance are hardware
> related, which isn't helpful unless you can get some custom toys for the
> next revision (i.e. years from now).

[DISCLAIMER:  Remember, I'm not an electrical engineer.  These ideas
   are here for your review, in case anything can be turned into
   something useful.  Occasionally people come to believe I'm some
   sort of genius and know everything; trust me, it doesn't work that
   way.]

[Remember also that for the IMMEDIATE moment any such conjectures
   here are NOT USEFUL; in the future, if the MMU can be altered to do
   these cute hacks, then maybe.]

First thing I'd like to say is that my previous assessment of task
switching placed an added 17.5% performance hit on top of the 33.2%
that cache misses take; those calculations were flawed.  I multiplied
25 cycles * 512 cache lines / 1.83 million cycles per timeslice ==
0.7%, then multiplied that 0.7% by 25 cycles again.  The 0.7% is
already the performance hit from the extra cache misses, NOT the
percentage of extra cache misses added; the hit is 0.7%, NOT 17.5%.
The TOTAL performance hit due to low cache is theoretically 33.7%.

[I HATE being wrong... *mutter*]
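
For anyone who wants to check me this time, here is the corrected
arithmetic as a little Python sketch (the 25-cycle miss penalty, 512
cache lines, and ~1.83 million cycle timeslice are the same
assumptions I used before, not measurements):

  # Sanity check on the corrected task-switch math.
  MISS_PENALTY = 25        # assumed cycles per cache miss
  CACHE_LINES  = 512       # assumed lines refilled after a task switch
  TIMESLICE    = 1.83e6    # assumed cycles per timeslice

  # Extra time burned refilling the cache, as a fraction of the slice.
  task_switch_hit = MISS_PENALTY * CACHE_LINES / TIMESLICE
  print("task switch hit: %.1f%%" % (task_switch_hit * 100))  # ~0.7%

  # The earlier mistake: multiplying that hit by 25 again.
  print("bogus figure: %.1f%%" % (task_switch_hit * 100 * 25))  # 17.5%

  # Fold the 0.7% into the 33.2% base hit to get the 33.7% total.
  base_hit = 33.2
  total = 100 - (1 - task_switch_hit) * (100 - base_hit)
  print("total hit: %.1f%%" % total)                           # ~33.7%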

I've been thinking about the cut-through cache idea and considering
how fast it could be accessed.  Here's what I've come up with so far:

CYCLE 1:  L1 cache miss.
CYCLE 2:  L2 cache access, hit.

-OR-

CYCLE 1:  L1 cache miss.
CYCLE 2:  L2 cache is in the middle of a CAS refresh cycle.  Wait.
CYCLE 3:  L2 cache access, hit.

-OR-

We miss L2, and spend about 26-27 cycles instead of 25 doing a full
look-up through the MMU with all its
calculate-the-address-through-the-TLB glory.

[This does not account for the case where it takes longer than a cycle
   to physically get from the CPU to the RAM and back; it is assumed
   that the MMU's job in this scheme is to sit in between the CPU and
   RAM and do nothing.  *I* am assuming that access to the next
   circuit (i.e. the RAM) is synchronized by the clock and thus only
   takes 1 clock cycle; during a CAS refresh the MMU will change this
   circuitry to actually stop for a clock and wait.]
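
To put a rough number on the average cost of that access path, here is
a quick sketch; the chance of landing in a refresh cycle and the L2
miss rate below are guesses for illustration, not measured values:

  # Expected cycles per L1 miss under the three scenarios above.
  P_REFRESH = 0.5     # guessed chance of colliding with a CAS refresh
  L2_MISS   = 0.002   # ~0.2% L2 miss rate, as on my desktop
  HIT_FAST  = 2       # cycles: L1 miss, L2 hit
  HIT_SLOW  = 3       # cycles: L1 miss, wait out the refresh, L2 hit
  FULL_MISS = 27      # cycles: full MMU/TLB walk out to RAM

  expected = ((1 - L2_MISS) *
              (P_REFRESH * HIT_SLOW + (1 - P_REFRESH) * HIT_FAST) +
              L2_MISS * FULL_MISS)
  print("expected cycles per L1 miss: %.2f" % expected)       # ~2.55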

Assuming a 2 or 3 cycle L2 access, still destroying the cache on every
task switch, and ample (i.e. 0% miss rate; my desktop does 0.2%) L2,
the basic performance hit drops from 33.2% to 2.65% (2-cycle access)
or 3.98% (3-cycle access) for an L2 cache hit.  Multitasking skews
this to about 3.3% or 4.7% instead of 33.7%.  (A 0.2% L2 miss rate
multiplies 0.2% by 25, i.e. adds 5% here and removes
0.002 * CYCLES_FOR_L2_ACCESS.)

[MATH:
  100 - (1 - 0.007) * (100 - (2*33.2/25)) == 3.3
         ^^^ task switch hit  ^^^ new_HzPerMiss * OldHit / OldHzPerMiss
  100 - (1 - 0.007) * (100 - (3*33.2/25)) == 4.7

  This calculation takes the performance loss from 25-cycle cache
  misses, divides it by 25 to spread it over each cycle of miss
  penalty, and then multiplies that by the predicted cost of an L2
  cache hit.  It then shaves 0.7% off the remainder to account for
  multitasking.  We don't account for filling up the extra L2; this
  math works directly on the total L1 cache misses.  L2 can be guessed
  to run at about a 0.2% miss rate, as it does on my desktop.]
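
Here is the same calculation as a tiny Python script, for anyone who
wants to poke at the numbers:

  # Reproduces the 3.3% / 4.7% figures from the MATH block above.
  OLD_HIT         = 33.2    # % hit when every L1 miss costs 25 cycles
  OLD_MISS_CYCLES = 25      # cycles per miss behind that figure
  TASK_SWITCH_HIT = 0.007   # 0.7% extra hit from multitasking

  def new_hit(l2_cycles):
      # Scale the old hit to the new per-miss cost, then fold in the
      # task-switch hit multiplicatively.
      scaled = l2_cycles * OLD_HIT / OLD_MISS_CYCLES
      return 100 - (1 - TASK_SWITCH_HIT) * (100 - scaled)

  print("2 cycle L2 hit: %.1f%%" % new_hit(2))   # ~3.3%
  print("3 cycle L2 hit: %.1f%%" % new_hit(3))   # ~4.7%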

[At best I'm predicting the figure can be lowered to 9.1%]

The cut-through cache idea I had is based on a single-bit adder
circuit with a floating start bit that's adjusted in the BIOS; in
essence, any power-of-two cache size could be set.  It would be a set
of gates used for memory access but not for cache access (the address
lines are routed around them when accessing cache).

[I mumbled a bunch of junk that follows no standard terminology just
   now.  The basic premise is this:  You turn on a flip-flop at bit 18
   (256 KiB).  This circuit starts falling through the gates, turning
   them off, until it hits an address line that's ALREADY off, in
   which case it turns it on and stops.  The output is a set of
   address lines addressing INPUT + 256 KiB.  If you turn that
   flip-flop off and instead turn on a different one (at 16 KiB,
   512 KiB, 1 MiB, 8 MiB...) you skip THAT much.  Accessing addresses
   WITHOUT going through this circuit accesses below there, giving you
   your cache area.]
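
In software terms, here is roughly what that gate behavior does; the
32-bit width and the bit positions are just placeholders to make the
idea concrete:

  # Model of the offset circuit: a carry injected at a chosen
  # power-of-two bit ripples up through set address lines (turning
  # them off) until it finds a clear one, which it turns on.
  # Net effect: output address = input address + 2**start_bit.
  def offset_address(addr, start_bit, width=32):
      pos = start_bit
      while pos < width and (addr >> pos) & 1:
          addr &= ~(1 << pos)   # line was on: turn it off, keep going
          pos += 1
      if pos < width:
          addr |= 1 << pos      # line was off: turn it on and stop
      return addr

  # Flip-flop at bit 18 skips 256 KiB; accesses that bypass the
  # circuit land below that, in the cache area.
  assert offset_address(0x00000, 18) == 0x40000          # 0 + 256 KiB
  assert offset_address(0x12345, 18) == 0x12345 + (1 << 18)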

[I repeat myself a lot for my own benefit, it helps me think.]

Because this effectively removes any commitment to a particular L2
cache size or to any added circuitry (besides a modified MMU, which is
probably infeasible for anyone here to do), the design HAS caught my
interest for the moment, and I am interested in finding an electrical
engineer who knows more about the physical cost of accessing RAM, if
only to improve my understanding of the subject.

[I'll subsequently lose interest.]
-- 
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

    Creative brains are a valuable, limited resource. They shouldn't be
    wasted on re-inventing the wheel when there are so many fascinating
    new problems waiting out there.
                                                 -- Eric Steven Raymond

    We will enslave their women, eat their children and rape their
    cattle!
                  -- Bosc, Evil alien overlord from the fifth dimension


