<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-2" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#000000">

<br>

<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">

  <pre wrap="">On 29/04/08 17:41 +0200, NoiseEHC wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">On this page

<a class="moz-txt-link-freetext"

 href="http://wiki.laptop.org/go/Geode_LX">http://wiki.laptop.org/go/Geode_LX</a>

I have named some instructions as "Synchronized ops" (in the MMX 

section). Are those real or did I mismeasure something?

    </pre>

  </blockquote>

  <pre wrap=""><!---->

That section is very difficult to understand.  I'm not sure which

operations you have invented this name for.

  </pre>

</blockquote>

As you have probably already noticed, I am not a native English speaker
(I never learned advanced English in school, I just picked it up).
What I wanted to say in that section is that every MMX op whose
source or destination operand is an integer register (and which is not a
MOV) takes a very different number of clock cycles than 2 (2 is what the
databook lists for almost every MMX op, at least in my version). Is that
real?<br>

<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">

  <blockquote type="cite">

    <pre wrap="">If those are 

real then would somebody from AMD just go through the databook and fix 

the instruction clock cycle numbers? Because in that case it is sure 

that they do not match reality and clearly I have better things to do 

than measuring clock cycles. 

    </pre>

  </blockquote>

  <pre wrap=""><!---->

Clearly you must have some basis for assuming that the numbers are

wrong, so you must have done some measurement.  I consulted the

secret documentation that you claim I am withholding from you, 

and the timings there are the same as in the datasheet.  I believe that

you are correct in that these are the clock counts for the instruction to

go through the FPU and don't include the stall time for the pipeline

to clear up.

  </pre>

</blockquote>

There is a "Test results" section on that page. The first two tests were
conducted via email: I sent test programs to this list, and people ran
them and emailed back the results. The first test in particular has some
stupid bugs because I wrote it essentially blind. The third one is the
result of a session where I was logged into a physical machine. It may be
that only this "stall time" is missing from the databook, but as a
programmer I am not interested in how many clock cycles the FPU takes to
execute some internal operation (which is what the databook seems to
list); I would like to know the real time consumed.<br>

<br>
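To show what I mean by "real time consumed", here is a minimal sketch of
the arithmetic (these are not the actual test programs from the wiki page;
the function names and all numbers are made up for illustration). It
assumes a cycle counter is read around a loop containing the op and around
the same loop without it:<br>
<pre wrap="">
```python
def cycles_per_op(loop_cycles, baseline_cycles, op_count):
    """Average real cost of one op, in clock cycles.

    loop_cycles:     cycle counter delta for a loop containing op_count ops
    baseline_cycles: cycle counter delta for the same loop with no ops
    """
    if op_count == 0:
        raise ValueError("op_count must be nonzero")
    return (loop_cycles - baseline_cycles) / op_count


def stall_cycles(measured_per_op, listed_per_op):
    """Cycles per op that the databook figure does not account for."""
    return measured_per_op - listed_per_op


# Hypothetical numbers, only to show the arithmetic.
measured = cycles_per_op(loop_cycles=9200, baseline_cycles=1200, op_count=1000)
print(measured)                   # 8.0
print(stall_cycles(measured, 2))  # 6.0
```
</pre>
The second number is what I would like to see documented next to the
listed clock count.<br>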

<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">

  <pre wrap="">I am not a silicon designer, so I'm not the final word on if they are

correct or not, but at least that should prove that there isn't a

massive marketing conspiracy to hide the details of the processor

from our customers.  If they are lying to you, they are lying to me,

and they're not lying to me.

  </pre>

</blockquote>

The conspiracy remark was not serious; I put a smiley at the end.
From my perspective, however, it makes no difference whether there is a
conspiracy or not. What I actually think is that either I am mistaken and
made some errors in measuring this, or the technical writer made mistakes
years ago and nobody cared to fix them.<br>

<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">

  <pre wrap="">  </pre>

  <blockquote type="cite">

    <pre wrap="">Also the legend is clearly wrong in several 

cases so probably that would need checking too (like on page 668 note 4 

talks about 3DNOW ops in the table about FP ops).

    </pre>

  </blockquote>

  <pre wrap=""><!---->

That is a mistake - I have let the technical writer know about it.

  </pre>

</blockquote>

Thanks!<br>

Another error:<br>

Page 631 describes this notation:<br>
Conditional jump taken | Conditional jump not taken. (e.g., “4|1” =
four clocks if jump taken, one clock if jump not taken).<br>
However, it is never used in the opcode table.<br>
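If the "taken | not taken" notation were actually used in the table,
estimating the average cost of a conditional jump would be simple. A
minimal sketch using the 4|1 example from the legend (the function name
and the taken fraction are my own assumptions, not databook values):<br>
<pre wrap="">
```python
def average_jump_cycles(notation, taken_fraction):
    """Average clocks for a conditional jump written as "taken|not taken"."""
    taken, not_taken = (int(part) for part in notation.split("|"))
    return taken_fraction * taken + (1.0 - taken_fraction) * not_taken


# A jump that is taken 3 times out of 4 (hypothetical workload).
print(average_jump_cycles("4|1", 0.75))  # 3.25
```
</pre>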

<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">

  <pre wrap="">

  </pre>

  <blockquote type="cite">

    <pre wrap="">absolutely no info about L2 cache miss penalties or mispredicted jumps 

or about the pipeline stages of the FP unit.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

I don't have any information about L2 cache miss penalties, but they 

are easy to calculate. Please see:

<a class="moz-txt-link-freetext"

 href="http://homepages.cwi.nl/%7Emanegold/Calibrator/">http://homepages.cwi.nl/~manegold/Calibrator/</a>

  </pre>

</blockquote>

Could you run it on your machine and share the results? I do not
currently have access to an XO.<br>

<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">

  <pre wrap="">

I will talk to somebody about documenting the FP unit pipeline.

It does handle 1 instruction per clock from the integer unit.

In practice we know that two floating point instructions back to

back will stall the IU.  I can also tell you that it is optimized

for single precision, so double precision is handled by microcode

and needs to go through the path again. 

  </pre>

</blockquote>

Thanks!<br>

I would also like to know how many ALUs the FPU has. I mean, FMUL costs
1 clock and PFMUL costs 2. Is that because it has only one multiply unit
and executes PFMUL serially? If that is the case, does it mean that the
3DNOW support is there only for compatibility and will not be faster
than plain FP?<br>
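The arithmetic behind this question, as a minimal sketch. It assumes the
costs quoted above are throughput figures and that PFMUL multiplies two
packed single precision values per instruction (the function name is
mine):<br>
<pre wrap="">
```python
def cycles_per_multiply(instruction_cycles, values_per_instruction):
    """Clock cycles spent per individual floating point multiply."""
    return instruction_cycles / values_per_instruction


fmul = cycles_per_multiply(1, 1)   # FMUL: one multiply, 1 clock
pfmul = cycles_per_multiply(2, 2)  # PFMUL: two packed multiplies, 2 clocks
print(fmul, pfmul)                 # 1.0 1.0
print(pfmul == fmul)               # True: no gain from packing under these assumptions
```
</pre>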

<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">

  <pre wrap="">  </pre>

  <blockquote type="cite">

    <pre wrap="">See, all I would like to have is enough data that when I look at 

assembly code I could approximately calculate how many clock cycles will 

be consumed. Nothing more and nothing less.

    </pre>

  </blockquote>

  <pre wrap=""><!---->

You have nearly all the information you need, and you can collect the

additional information the same way we do, with careful analysis and

measurement.  In fact, Bernie and Vladimir Makarov have done a lot

of work already in this area, resulting in the Geode specific

code for gcc 4.2.0 and glibc.  Perhaps you can work with them to figure

out the finer details of the FPU scheduling.  I'm sure they would

appreciate it.

Jordan

  </pre>

</blockquote>

<br>

<br>

</body>

</html>