Performance issues
John Richard Moser
nigelenki at comcast.net
Sun Sep 24 02:38:42 EDT 2006
John Richard Moser wrote:
> So far the only things I've come up with for performance are hardware
> related, which isn't helpful unless you can get some custom toys for the
> next revision (i.e. years from now).
[DISCLAIMER: Remember, I'm not an electrical engineer. These ideas
are here for your review, in case anything can be turned into
something useful. Occasionally people come to believe I'm some sort
of genius who knows everything; trust me, it doesn't work that way.]
[Remember also that for the IMMEDIATE moment any such conjectures
here are NOT USEFUL; in the future, if the MMU can be altered to do
these cute hacks, then maybe.]
First thing I'd like to say is that my previous assessment of task
switching placed an added 17.5% performance hit on top of the 33.2%
that cache misses take; my calculations were flawed. I multiplied 25
cycles * 512 cache lines / 1.83 million cycles per timeslice == 0.7%,
then multiplied that 0.7% by 25 cycles again. But the 0.7% is already
the performance hit from the extra cache misses, NOT the percent of
extra cache misses added; the hit is 0.7%, NOT 17.5%. Compounding
0.7% on top of the 33.2% multiplicatively (same formula as the MATH
block below), the TOTAL performance hit due to low cache is
theoretically 33.7%.
[I HATE being wrong... *mutter*]
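[A quick Python sketch of that corrected arithmetic, using only the
figures above -- the 1.83M-cycle timeslice and 512-line refill are
the same guesses as before, nothing new:]

    # Corrected task-switch math, using the figures from this mail.
    miss_penalty = 25        # cycles per cache miss
    lines_refilled = 512     # cache lines destroyed per task switch
    timeslice = 1.83e6       # cycles per timeslice

    # Cycles lost refilling cache, as a fraction of the timeslice.
    # This IS the performance hit; multiplying by 25 again
    # double-counts the miss penalty.
    switch_hit = miss_penalty * lines_refilled / timeslice  # ~0.007

    base_hit = 33.2          # % hit from cache misses alone
    # Compound the two hits multiplicatively to get the 33.7% total.
    total = 100 - (1 - switch_hit) * (100 - base_hit)       # ~33.7
    print(f"switch: {switch_hit:.1%}  total: {total:.1f}%")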
I've been thinking on the cut-through cache idea, and I'm considering
how fast it could be accessed. Here's what I've come up with so far:
CYCLE 1: L1 cache miss.
CYCLE 2: L2 cache access, hit.
-OR-
CYCLE 1: L1 cache miss.
CYCLE 2: L2 cache is in the middle of a CAS refresh cycle. Wait.
CYCLE 3: L2 cache access, hit.
-OR-
We miss L2 and spend about 26-27 cycles instead of 25 doing a full
look-up through the MMU, with all its
calculate-the-address-through-the-TLB glory.
[This does not account for the case where it takes longer than a
cycle to physically get from the CPU to the RAM and back; it is
assumed that the MMU's job in this scheme is to sit in between the
CPU and RAM and do nothing. *I* am assuming that access to the
next circuit (i.e. the RAM) is synchronized by clock and thus
takes only 1 clock cycle; during a CAS refresh the MMU will change
this circuitry to actually stop for a clock and wait.]
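[If it helps to see the cases side by side, here's a toy Python
model of the three paths above; the cycle counts are my guesses,
nothing measured:]

    # Toy model of the cut-through L2 access paths sketched above.
    L2_HIT = 2           # cycle 1: L1 miss; cycle 2: L2 hit
    L2_HIT_CAS_WAIT = 3  # same, plus one stall for a CAS refresh
    L2_MISS = 27         # fall through to the full MMU/TLB look-up

    def avg_cycles(hits, cas_waits, misses):
        """Average cycles per L1 miss for a given breakdown."""
        total = (hits * L2_HIT + cas_waits * L2_HIT_CAS_WAIT
                 + misses * L2_MISS)
        return total / (hits + cas_waits + misses)

    # e.g. 0.2% of L1 misses also miss L2 (2 in 1000), and half the
    # L2 hits happen to catch a CAS refresh:
    print(avg_cycles(hits=499, cas_waits=499, misses=2))  # ~2.5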
Assuming 2 cycles and still destroying cache on multi-task, as well
as ample L2 (i.e. a 0% miss rate; my desktop does 0.2%), the basic
performance hit drops from 33.2% to 2.65% (2-cycle access) or 3.98%
(3-cycle access) for an L2 cache hit. Multi-tasking skews this to
about 3.3% or 4.7%, instead of 33.7%. (A 0.2% L2 miss rate
multiplies 0.2% by 25 cycles, i.e. adds 0.05 cycles to the average
access, and removes 0.002 * CYCLES_FOR_L2_ACCESS.)
[MATH:
  100 - (1 - 0.007) * (100 - (2*33.2/25)) == 3.3
        ^^^ task switch hit  ^^^ new_HzPerMiss * OldHit / OldHzPerMiss
  100 - (1 - 0.007) * (100 - (3*33.2/25)) == 4.7
 This calculation takes the performance loss due to 25-cycle cache
 misses, divides it by 25 to get the percent of cache misses, then
 multiplies that by the predicted cost (in cycles) of L2 cache
 hits. It then removes another 0.7% of the remaining performance to
 account for multitasking. We don't account for filling up extra
 L2; this math works directly on the total L1 cache misses. L2 can
 be guessed to have about a 0.2% miss rate, as it does on my
 desktop.]
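[The same math as a Python sketch, in case anyone wants to poke at
the inputs:]

    # Projected hit with a cut-through L2, same formula as the MATH
    # block above; inputs are this mail's guesses, not measurements.
    old_hit = 33.2       # % lost to 25-cycle misses with no L2
    old_cycles = 25.0    # cycles per miss today
    switch_hit = 0.007   # task-switch hit from earlier in this mail

    def projected_hit(l2_cycles):
        # Scale the old hit by the new miss latency, then compound
        # the task-switch hit on top.
        new_hit = l2_cycles * old_hit / old_cycles
        return 100 - (1 - switch_hit) * (100 - new_hit)

    print(projected_hit(2))  # ~3.3% for a 2-cycle L2 hit
    print(projected_hit(3))  # ~4.7% for a 3-cycle L2 hit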
[At best I'm predicting the figure can be lowered to 9.1%]
The cut-through cache idea I had is based on a single-bit adder
circuit with a floating start bit that's adjusted in the BIOS; in
essence, any power-of-2 cache size could be set. It would be a set
of gates used for memory access that are bypassed (address lines
routed around them) for cache access.
[I mumbled a bunch of junk that follows no standard terminology
just now. The basic premise is this: you turn on a flip-flop at
bit 18 (256K). The circuit starts falling through gates, turning
them off, until it hits an address line that's ALREADY off, in
which case it turns that line on and stops. The output is a set
of address lines addressing INPUT + 256KiB. If you turn that
flip-flop off and instead turn on a different one (at 16KiB,
512KiB, 1MiB, 8MiB...), you skip THAT much. Accessing addresses
WITHOUT going through this circuit will access below there,
giving you your cache area.]
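[In software terms, the gate behavior is just a ripple-carry add of
the cache size onto the address; here's a toy Python sketch, with
the flip-flop position as the "floating start bit":]

    # Toy model of the floating-start-bit adder. Memory accesses get
    # the cache size ripple-added onto the address; cache accesses
    # are routed around the circuit and reach the bottom 256 KiB.
    def cut_through(addr, start_bit=18):
        """Ripple-add 2**start_bit (bit 18 == 256 KiB) onto addr."""
        bit = start_bit
        while addr & (1 << bit):   # line already on: turn it off,
            addr &= ~(1 << bit)    # carry to the next line up
            bit += 1
        return addr | (1 << bit)   # first off line: turn it on, stop

    # Going through the circuit lands at addr + 256 KiB:
    assert cut_through(0x1234) == 0x1234 + (1 << 18)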
[I repeat myself a lot for my own benefit; it helps me think.]
Because this effectively removes any commitment to any particular
L2 cache size or any added circuitry (besides a modified MMU, which
is probably infeasible for anyone here to do), the design HAS
caught my interest for the moment, and I am interested in finding
an electrical engineer who may know more about the physical cost of
accessing RAM, at least to improve my understanding of this
subject.
[I'll subsequently lose interest.]
--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.
Creative brains are a valuable, limited resource. They shouldn't be
wasted on re-inventing the wheel when there are so many fascinating
new problems waiting out there.
-- Eric Steven Raymond
We will enslave their women, eat their children and rape their
cattle!
-- Bosc, Evil alien overlord from the fifth dimension