<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-2" http-equiv="Content-Type">
<title></title>
</head>
<body bgcolor="#ffffff" text="#000000">
<br>
<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">
<pre wrap="">On 29/04/08 17:41 +0200, NoiseEHC wrote:
</pre>
<blockquote type="cite">
<pre wrap="">On this page
<a class="moz-txt-link-freetext"
href="http://wiki.laptop.org/go/Geode_LX">http://wiki.laptop.org/go/Geode_LX</a>
I have named some instructions as "Synchronized ops" (in the MMX
section). Are those real or did I mismeasure something?
</pre>
</blockquote>
<pre wrap=""><!---->
That section is very difficult to understand. I'm not sure which
operations you have invented this name for.
</pre>
</blockquote>
As you have probably noticed, I am not a native English speaker (I never
learned advanced English in school, I just picked it up).
What I wanted to say in that section is that every MMX op whose source
or destination operand is an integer register (and which is not a MOV)
takes a completely different number of clock cycles than 2 (2 is listed
for almost every MMX op in the databook, at least in my version). Is
that real?<br>
<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">
<blockquote type="cite">
<pre wrap="">If those are
real then would somebody from AMD just go through the databook and fix
the instruction clock cycle numbers? Because in that case it is sure
that they do not match reality and clearly I have better things to do
than measuring clock cycles.
</pre>
</blockquote>
<pre wrap=""><!---->
Clearly you must have some basis for assuming that the numbers are
wrong, so you must have done some measurement. I consulted the
secret documentation that you claim I am withholding from you,
and the timings there are the same as in the datasheet. I believe that
you are correct in that these are the clock counts for the instruction to
go through the FPU and don't include the stall time for the pipeline
to clear up.
</pre>
</blockquote>
There is a "Test results" section on that page. The first two tests were
conducted via email: I emailed test programs to this list, and people
ran them and emailed back the results. The first test in particular has
some stupid bugs because I wrote it essentially blind. The third one is
the result of a session in which I was logged into a physical machine.
It may be that only this "stall time" is missing from the databook, but
the fact is that, as a programmer, I am not interested in how many clock
cycles the FPU takes to execute some internal operation (which is what
the databook seems to list); I would like to know the real time
consumed.<br>
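<br>
To make clear what I mean by "real time consumed", here is a minimal
sketch of how two wall-clock timings of a test loop can be turned into
an average clocks-per-instruction figure. The function name, its
parameters and the idea of timing an otherwise identical baseline loop
are my own assumptions for illustration; the timings themselves have to
come from a real test run.<br>
<pre>
```python
# Sketch only: converts measured loop timings into average clocks per tested op.
# Assumption: baseline_ns is the time of the same loop WITHOUT the tested ops,
# so the difference is attributable to the tested instructions (stalls included).

def clocks_per_op(loop_ns, baseline_ns, iterations, ops_per_iteration, clock_mhz):
    """Average number of clocks spent on each tested instruction."""
    extra_ns = loop_ns - baseline_ns            # time caused by the tested ops
    total_ops = iterations * ops_per_iteration  # how many ops were executed
    # 1 MHz means 1 clock per 1000 ns, so clocks = ns * MHz / 1000.
    total_clocks = extra_ns * clock_mhz / 1000.0
    return total_clocks / total_ops
```
</pre>
The result includes whatever stall time the instruction causes in the
loop, which is exactly the number I am interested in, rather than the
internal execution count alone.<br>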
<br>
<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">
<pre wrap="">I am not a silicon designer, so I'm not the final word on if they are
correct or not, but at least that should prove that there isn't a
massive marketing conspiracy to hide the details of the processor
from our customers. If they are lying to you, they are lying to me,
and they're not lying to me.
</pre>
</blockquote>
The conspiracy remark was not serious; I put a smiley at the end.
However, from my perspective it makes no difference whether there is a
conspiracy or not. What I actually think is that either I am mistaken
and made some errors while measuring, or the technical writer made
mistakes years ago and nobody cared to fix them.<br>
<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">
<pre wrap=""> </pre>
<blockquote type="cite">
<pre wrap="">Also the legend is clearly wrong in several
cases so probably that would need checking too (like on page 668 note 4
talks about 3DNOW ops in the table about FP ops).
</pre>
</blockquote>
<pre wrap=""><!---->
That is a mistake - I have let the technical writer know about it.
</pre>
</blockquote>
Thanks!<br>
Another error:<br>
On page 631 the legend describes this notation:<br>
Conditional jump taken | Conditional jump not taken. (e.g., “4|1” =
four clocks if jump taken, one clock if jump not taken).<br>
This notation is never used in the opcode table.<br>
<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">
<pre wrap="">
</pre>
<blockquote type="cite">
<pre wrap="">absolutely no info about L2 cache miss penalties or mispredicted jumps
or about the pipeline stages of the FP unit.
</pre>
</blockquote>
<pre wrap=""><!---->
I don't have any information about L2 cache miss penalties, but they
are easy to calculate. Please see:
<a class="moz-txt-link-freetext"
href="http://homepages.cwi.nl/%7Emanegold/Calibrator/">http://homepages.cwi.nl/~manegold/Calibrator/</a>
</pre>
</blockquote>
Could you run it on your machine and share the results? I currently do
not have access to an XO.<br>
<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">
<pre wrap="">
I will talk to somebody about documenting the FP unit pipeline.
It does handle 1 instruction per clock from the integer unit.
In practice we know that two floating point instructions back to
back will stall the IU. I can also tell you that it is optimized
for single precision, so double precision is handled by microcode
and needs to go through the path again.
</pre>
</blockquote>
Thanks!<br>
I would also like to know how many ALU units the FPU has. I mean, FMUL
costs 1 and PFMUL costs 2. Is that because it has only one multiply unit
and executes PFMUL serially? If that is the case, does it mean that
3DNOW support is there only for compatibility and will not be faster
than simple FP?<br>
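<br>
The arithmetic behind this question, as a small sketch. It assumes the
clock counts I quoted (FMUL 1, PFMUL 2) are throughput figures, and it
uses the fact that PFMUL multiplies two packed single precision values
at once. The helper name is made up for illustration.<br>
<pre>
```python
from fractions import Fraction

def results_per_clock(results_per_instruction, clocks_per_instruction):
    """Products delivered per clock if the instruction is issued back to back."""
    return Fraction(results_per_instruction, clocks_per_instruction)

# FMUL: one product per instruction, 1 clock (figure quoted in this thread).
fmul_rate = results_per_clock(1, 1)
# PFMUL: two packed single precision products per instruction, 2 clocks.
pfmul_rate = results_per_clock(2, 2)

# Under these assumptions both rates are equal, so PFMUL would give no
# multiply throughput advantage over plain FMUL.
same_throughput = (fmul_rate == pfmul_rate)
```
</pre>
If the listed numbers are only part of the real cost, this comparison
changes, which is why I am asking about the number of multiply units.<br>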
<blockquote cite="mid:20080429162031.GD1891@cosmic.amd.com" type="cite">
<pre wrap=""> </pre>
<blockquote type="cite">
<pre wrap="">See, all I would like to have is enough data that when I look at
assembly code I could approximately calculate how many clock cycles will
be consumed. Nothing more and nothing less.
</pre>
</blockquote>
<pre wrap=""><!---->
You have nearly all the information you need, and you can collect the
additional information the same way we do, with careful analysis and
measurement. In fact, Bernie and Vladimir Makarov have done a lot
of work already in this area, resulting in the Geode specific
code for gcc 4.2.0 and glibc. Perhaps you can work with them to figure
out the finer details of the FPU scheduling. I'm sure they would
appreciate it.
Jordan
</pre>
</blockquote>
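<br>
For reference, the kind of rough estimate I have in mind could look
like the sketch below. It is only a model under stated assumptions: the
function name, the table format and the stall_clocks parameter are
hypothetical, and the length of the stall caused by two back to back
floating point instructions is not given anywhere in this thread, so it
has to be measured and passed in.<br>
<pre>
```python
# Sketch only: sums listed clock counts and adds a fixed stall whenever one
# floating point op directly follows another one.
# Assumptions: listed_clocks comes from the databook, stall_clocks from
# measurement; neither value is supplied by this code.

def estimate_clocks(instructions, listed_clocks, fp_ops, stall_clocks):
    """Approximate clocks consumed by a straight-line instruction sequence."""
    total = 0
    previous_was_fp = False
    for name in instructions:
        total += listed_clocks[name]      # clocks listed for the op itself
        is_fp = name in fp_ops
        if is_fp and previous_was_fp:
            total += stall_clocks         # back to back FP ops stall the IU
        previous_was_fp = is_fp
    return total
```
</pre>
Cache misses and mispredicted jumps are not modelled here; they would
need their own measured penalties.<br>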
<br>
<br>
</body>
</html>