15 computer science collegians looking for a project

Jordan Crouse jordan.crouse at amd.com
Wed Apr 30 11:16:56 EDT 2008

On 30/04/08 10:18 +0200, NoiseEHC wrote:
>> On 29/04/08 17:41 +0200, NoiseEHC wrote:
>>> On this page
>>> http://wiki.laptop.org/go/Geode_LX
>>> I have named some instructions as "Synchronized ops" (in the MMX 
>>> section). Are those real or did I mismeasure something?
>> That section is very difficult to understand.  I'm not sure which
>> operations you have invented this name for.
> As you probably have already noticed I am not a native English speaker (and 
> neither learned advanced English in school, just picked it up). What I 
> wanted to write in that section, every MMX op, whose source/destination 
> operand is an integer register (and not a MOV), will consume absolutely 
> different clock cycles than 2 (2 is listed for almost every MMX op in the 
> databook, at least in my version). Is it real?

I still don't understand what you mean, but the clock timings in the
data sheet are the same as the ones in my documentation.  You would have
to find somebody more skilled than I am to debate whether they are correct.

>>> If those are real then would somebody from AMD just go through the 
>>> databook and fix the instruction clock cycle numbers? Because in that 
>>> case it is sure that they do not match reality and clearly I have better 
>>> things to do than measuring clock cycles.     
>> Clearly you must have some basis for assuming that the numbers are
>> wrong, so you must have done some measurement.  I consulted the
>> secret documentation that you claim I am withholding from you, and the 
>> timings there are the same as in the datasheet.  I believe that
>> you are correct in that these are the clock counts for the instruction to
>> go through the FPU and don't include the stall time for the pipeline
>> to clear up.
> There is a "Test results" section in that page. The first two tests were 
> conducted via email. I have emailed to this list test programs and there 
> were people who run them and emailed back the result. Especially the first 
> test has some stupid bugs because I wrote them essentially blind. The third 
> one is the result of my session logged into a physical machine. It can be 
> that only this "stall time" is missing from the databook but the fact is 
> that I as a programmer am not interested in how many clock cycles does the 
> FPU take to execute some internal operation (which seems the databook to 
> list) but I would like to know the real time consumed.

I think you'll probably have to measure that.  I can't find any further
documentation as to what the penalty is for scheduling two FPU instructions
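As a rough illustration of what such a measurement could look like, here is
a minimal sketch in C.  It times a chain of dependent single-precision
multiplies, so each one has to wait for the previous result and any stall
shows up in the average.  The names now_ns, dependent_fmul_ns and
ns_to_cycles are made up for this example, and the core clock in MHz is
something the caller has to supply.  Which instruction the compiler really
emits depends on the compiler and its flags, so check the disassembly before
trusting the numbers; timing a specific MMX op would need intrinsics or
inline assembly instead of plain C.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/*
 * Average nanoseconds per multiply over `iters` dependent multiplies.
 * Every multiply consumes the previous result, so the loop measures
 * latency plus any stall, not just the issue rate.  Loop overhead is
 * included in the figure.
 */
double dependent_fmul_ns(long iters)
{
    assert(iters > 0);

    volatile float seed = 1.0f;   /* volatile: value unknown at compile time */
    float x = seed;
    const float k = 1.0000001f;

    double t0 = now_ns();
    for (long i = 0; i < iters; i++)
        x *= k;
    double t1 = now_ns();

    volatile float sink = x;      /* keep the result alive */
    (void)sink;

    return (t1 - t0) / (double)iters;
}

/* Convert nanoseconds to core clocks for a caller-supplied frequency. */
double ns_to_cycles(double ns, double core_mhz)
{
    return ns * core_mhz / 1000.0;
}
```

Calling dependent_fmul_ns() with a few million iterations and passing the
result to ns_to_cycles() with the machine's core frequency gives an
approximate clocks-per-operation figure to compare against the data book.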

>> I am not a silicon designer, so I'm not the final word on if they are
>> correct or not, but at least that should prove that there isn't a
>> massive marketing conspiracy to hide the details of the processor
>> from our customers.  If they are lying to you, they are lying to me,
>> and they're not lying to me.
> This conspiracy thing was not serious, I have used a smiley at the end. 
> However from my perspective there is no difference if there is some 
> conspiracy or if there is not. In fact what I think is either that I am 
> mistaken and made some errors measuring this or the technical writer made 
> mistakes years ago and nobody cared to fix it.

You need to be careful when tossing about opinions, especially if you do not
mean them.  My colleagues and I have spent a lot of effort to ensure that
the documentation and software for this processor are open and freely
available.  I would wager it would be rather difficult to find another
x86 processor on the market today with such complete documentation and
software to accompany it (BIOS and operating system).  I take allegations
that we're hiding something very seriously.

>> I don't have any information about L2 cache miss penalties, but they are 
>> easy to calculate. Please see:
>> http://homepages.cwi.nl/~manegold/Calibrator/
> Could you run on your machine and share the results? Currently I do not 
> have access to an XO.

I don't have a machine currently handy to do that test, but I'll try to get 
to it when I do.
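For anyone who wants a quick number without the full tool, the general idea
can be sketched as a pointer chase.  This is not Calibrator's code, only a
minimal sketch of the same kind of measurement: walk a buffer at a fixed
stride, where each load supplies the index for the next one, and compare the
time per load for a buffer that fits in cache against one that does not.
The names now_ns and chase_ns are made up for this example.  On processors
with a hardware prefetcher a fixed stride like this would hide the penalty
and the chain would have to be randomized.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds. */
static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/*
 * Average nanoseconds per load when chasing indices through a buffer of
 * `bytes` bytes, touching one slot every `stride` bytes.  Each load
 * depends on the previous one, so the loads cannot overlap.
 */
double chase_ns(size_t bytes, size_t stride, long loads)
{
    assert(stride >= sizeof(size_t));
    assert(bytes >= 2 * stride);
    assert(loads > 0);

    size_t n = bytes / stride;               /* number of slots      */
    size_t step = stride / sizeof(size_t);   /* slot spacing in words */
    size_t *buf = malloc(n * step * sizeof(size_t));
    assert(buf != NULL);

    /* Link slot i to slot i+1, wrapping, to form one cycle. */
    for (size_t i = 0; i < n; i++)
        buf[i * step] = ((i + 1) % n) * step;

    size_t idx = 0;
    double t0 = now_ns();
    for (long i = 0; i < loads; i++)
        idx = buf[idx];
    double t1 = now_ns();

    volatile size_t sink = idx;              /* keep the chain alive */
    (void)sink;

    free(buf);
    return (t1 - t0) / (double)loads;
}
```

The difference between chase_ns() on a small buffer and on one several
times larger than the cache, with the stride set to at least the cache line
size, is a rough estimate of the miss penalty.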

>> I will talk to somebody about documenting the FP unit pipeline.
>> It does handle 1 instruction per clock from the integer unit.
>> In practice we know that two floating point instructions back to
>> back will stall the IU.  I can also tell you that it is optimized
>> for single precision, so double precision is handled by microcode
>> and needs to go through the path again. 
> Thanks!
> I would also like to know how many ALU units does the FPU have? I mean FMUL 
> costs 1, PFMUL costs 2. Is it because it only has 1 multiply unit and it 
> executes PFMUL serially? If that is the case, does that mean that the 3DNOW 
> support is only compatibility and will not be faster than simple FP?

I believe that is a reasonable assertion to make if you have instructions
that perform similar behavior.  There are some 3DNow! operations that
cannot be performed with a single FP operation, and those will still win.
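To spell out the arithmetic behind that, taking the clock counts quoted
above at face value: PFMUL multiplies two packed single-precision values, so
2 clocks for 2 results comes to the same cost per multiply as FMUL at
1 clock for 1 result.  A tiny sketch, with a helper name invented for this
example:

```c
#include <assert.h>

/* Clocks spent per result for an instruction producing `lanes` results. */
double clocks_per_element(int clocks, int lanes)
{
    assert(clocks > 0 && lanes > 0);
    return (double)clocks / (double)lanes;
}
```

With the numbers from this thread, clocks_per_element(1, 1) for FMUL and
clocks_per_element(2, 2) for PFMUL both come out to one clock per multiply.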


Jordan Crouse
Systems Software Development Engineer 
Advanced Micro Devices, Inc.
