performance work

Jordan Crouse jordan at cosmicpenguin.net
Wed Dec 31 11:20:27 EST 2008


Neil Graham wrote:
> On Tue, 2008-12-30 at 20:41 -0700, Jordan Crouse wrote:
>>> I'm curious as to why reads from video memory are so slow.  On standard
>>> video cards it's slow because there is quite a division between the CPU
>>> and the video memory, but on the Geode isn't the video memory shared in
>>> the same SDRAM as main memory?
>> It is, in that they share the same physical RAM chips, but they are 
>> controlled by different entities - one is managed by the system memory 
controller and the other is handled by the GPU.   At startup time, the 
>> memory is carved up by the firmware, and after the top of system RAM is 
>> established, video and system memory behave for all intents and purposes 
>> like separate components.  Put simply, there is no way to directly 
>> address video memory from the system memory.  Access to the video memory 
>> has to happen via PCI cycles, and for obvious reasons the active video 
>> region has the cache disabled, accounting for relatively slow readback.
> 
> That makes my brain melt: you can't address it even though it's on the
> same chip!?!  Even as far back as the PCjr, the deal was that sharing
> video memory cost some performance due to taking turns with cycles, but
> it gave some back with easy access to the memory for all.  Has the
> Geode cunningly managed to provide a system that combines all the
> disadvantages of separate memory with all the disadvantages of shared?
> 
> One wonders what would happen if you wired some lines to the chips so
> that the memory appeared in two places.  Would you get access to the RAM
> (with the usual 'you pays your money, you takes your chances' caveats
> about coherency)?
> 
> I'm not a hardware person, but that all just seems odd.

You are missing the point - this model wasn't designed so that the 
system could somehow sneakily address video memory; it was designed so 
that the system designer could eliminate the added cost and board real 
estate of a separate bank of memory chips.  See also
http://en.wikipedia.org/wiki/Shared_Memory_Architecture.
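
If you want to see the readback penalty for yourself, here is a minimal 
sketch (nothing OLPC-specific; it assumes a framebuffer exposed at 
/dev/fb0 and samples 1 MiB of it) that times reads versus writes through 
an mmap()ed mapping:

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

#define LEN (1024 * 1024)          /* sample 1 MiB of the framebuffer */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    int fd = open("/dev/fb0", O_RDWR);
    if (fd < 0) { perror("open /dev/fb0"); return 1; }

    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t *fb = p;

    uint32_t sink = 0;
    double t0 = now();
    for (size_t i = 0; i < LEN / 4; i++)
        sink ^= fb[i];             /* uncached reads */
    double t1 = now();
    for (size_t i = 0; i < LEN / 4; i++)
        fb[i] = sink;              /* writes (often write-combined) */
    double t2 = now();

    printf("read: %.3f s   write: %.3f s\n", t1 - t0, t2 - t1);
    munmap(p, LEN);
    close(fd);
    return 0;
}

On a machine like this you should expect the read loop to be much slower 
than the write loop, since every load is an uncached access over the PCI 
path described above.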

>> That said, the read from memory performance is still worse than you
>> might expect - I never really got a good answer from
>> the silicon guys as to why. 
>>
> Being hit with the full SDRAM latency on every access, maybe?
> 
> Is it feasible to try with caches enabled and require the software to
> flush as needed?

Ask around - I don't think that you'll find anybody too keen on having 
the X server execute a cache invalidate a half dozen times a second.

Anyway, you are getting distracted and solving the wrong problem.  You 
should be more concerned about limiting the number of times that the X 
server reads from video memory rather than worrying about how fast the 
read is.
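
One classic way to limit those reads (sketched below with made-up names, 
not any real driver API) is a shadow framebuffer: keep a cached 
system-memory copy, write through to the hardware, and satisfy every 
CPU read from the shadow.

#include <stddef.h>
#include <stdint.h>

struct shadow_fb {
    volatile uint32_t *vram;   /* uncached video memory: slow to read */
    uint32_t *shadow;          /* cached system-memory copy */
    size_t pixels;
};

/* All CPU writes go through here, so the shadow stays coherent. */
static void fb_write(struct shadow_fb *fb, size_t i, uint32_t px)
{
    fb->shadow[i] = px;        /* cheap cached write */
    fb->vram[i] = px;          /* write through to the hardware */
}

/* Reads never touch video memory at all. */
static uint32_t fb_read(const struct shadow_fb *fb, size_t i)
{
    return fb->shadow[i];
}

The catch, of course, is that anything the GPU renders lands only in 
video memory, so a shadow alone doesn't cover acceleration; the real win 
is structuring the rendering so that readback isn't needed in the first 
place.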

If I can rant for a second (and this isn't targeted at Neil 
specifically, but just in general): this is another in a list of 
more or less hard constraints that the current XO design has. 
Throughout the history of the project, it seems to me that developers 
have been more biased toward trying to eliminate those constraints 
than toward making the software work in spite of them.  The processor is 
too slow - everybody immediately wants to overclock.  There is too 
little memory - enter a few dozen schemes for compressing it or swapping it.

The XO platform has limitations, most of which were introduced by choice 
for power or cost reasons.  The limitations are clearly documented and 
were known by all, at least when the project started.  The understanding 
was that the software would have to be adjusted to fit the hardware, not 
the other way around.  Over time, we seem to have lost that understanding.

Software engineering is hard - software engineering for 
resource-constrained systems is even harder.  In this day and age, geeks 
like us have become accustomed to always having the latest and greatest 
hardware at our fingertips, and so the software that we write is also for 
the latest and greatest.  And so, when confronted with a system such as 
the XO, our first instinct is to just plop our software on it and watch 
it go.  That attitude is further reinforced by the fact that the Geode is 
x86-based - just like our desktops.  It should just work, right?  We 
know better - or at least, we should know better.

The solution to the performance problems is good old-fashioned elbow 
grease.  We have to take our software that is naturally biased toward 
the year 2007 and make it work for the year 1995.  That's going to 
involve fixing bugs in the drivers, but also rethinking how the 
software works - and finding situations where the software might be 
inadvertently doing the wrong thing.  Let me give you an example - as 
recently as X 1.5, operations involving an a8 alpha mask worked like this:

* Draw a 1x1 rectangle in video memory containing the source color for 
the operation
* Read the source color from video memory
* Perform the mask operation with the source color

This isn't smart for any kind of processor or GPU, running at 2 GHz or 
half a GHz.  The X server knows the source color from the start; why 
don't we just use it?  We get away with it on a modern processor, but 
it kills us on the Geode.  These are the sorts of things that we need to 
find and squash - and yes, it will be very time consuming and a little 
boring.  But if you care about performance - I mean really care about it, 
and aren't just out for the quick fix - these are the sorts of things 
that we need to do.
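
To make that concrete, here it is boiled down to a toy sketch (the vram 
array is a stand-in for real video memory, and the helpers are 
illustrative, not actual X server internals):

#include <stdint.h>
#include <stdio.h>

/* Stand-in for uncached video memory; on the Geode every read of this
   would be a slow PCI cycle. */
static volatile uint32_t vram[1];

/* The X 1.5 behavior, boiled down: round-trip the color through video
   memory just to read back a value we already had. */
static uint32_t source_color_slow(uint32_t color)
{
    vram[0] = color;           /* "draw" the 1x1 solid rectangle */
    return vram[0];            /* uncached readback */
}

/* What it should do: the server knew the color from the start. */
static uint32_t source_color_fast(uint32_t color)
{
    return color;
}

int main(void)
{
    uint32_t c = 0x80ff0000;   /* ARGB: half-opaque red */
    printf("slow: %08x  fast: %08x\n",
           source_color_slow(c), source_color_fast(c));
    return 0;
}

The fix is exactly that observation: when the source is a solid color 
the server already holds, skip the video memory round trip entirely.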

Jordan
