The "iGoogle bug"

Tue Sep 18 01:11:20 EDT 2007

Please, dismiss my previous posting.  I was testing a new a Thunderbird
plugin for using external editors, but it's evidently buggy.

Here's the correct body again:

---cut---

This is #3352: https://dev.laptop.org/ticket/3352

What's special about this page is that the clock applet uses two
very wide bitmaps of 3200 pixels each to represent the clock arms in all
possible positions.  (that's over 2MB of RAM wasted, nice!)

Of course, EXA promptly tries to upload these huge bitmaps to video
memory.  It's surprising we still had this much unallocated memory
to afford it.

So the request gets through the amd_drv upload hook, and eventually
we reach gp_color_bitmap_to_screen_blt(), whose purpose is to do
the actual uploading:

  http://dev.laptop.org/git?p=xf86-amd-devel;a=blob;f=src/cim/cim_gp.c;hb=HEAD#l1225

The LX blitter uses a circular command buffer to avoid stalling the
CPU for every command it issues.

The wrapping check for the ring buffer in gp_color_bitmap_to_screen_blt()
is very peculiar:

   /* CHECK IF A WRAP WILL BE NEEDED */

   gp3_cmd_next = gp3_cmd_current + size_dwords;
   if ((gp3_cmd_bottom - gp3_cmd_next) <= GP3_MAX_COMMAND_SIZE) {
       gp3_cmd_next = gp3_cmd_top;

       /* WAIT FOR HARDWARE */

       GP3_WAIT_WRAP(temp);
       WRITE_COMMAND32(0, GP3_DATA_LOAD_HDR_TYPE |
           GP3_DATA_LOAD_HDR_WRAP | GP3_DATA_LOAD_HDR_ENABLE);
   } else {
       /* WAIT FOR AVAILABLE SPACE */

       GP3_WAIT_PRIMITIVE(temp);
       WRITE_COMMAND32(0, GP3_DATA_LOAD_HDR_TYPE |
           GP3_DATA_LOAD_HDR_ENABLE);
   }

More insight into the above logic can be found in this comment from a
kissing cousin of the wrap check in gp_declare_blt():

   /* SET ADDRESS OF NEXT COMMAND */
   /* A summary of the command buffer logic is as follows:           */
   /*  - If after a basic BLT we will not have room for the largest  */
   /*    command (a full line of host source data), we set the wrap  */
   /*    bit.  This will waste up to a whopping 8K of command buffer */
   /*    space, but it simplifies the logic for all commands.        */
   /* -  If we are wrapping, we have extra logic to ensure that we   */
   /*    don't skip over the current GP read pointer.                */

Other code confirms the statement in this comment: GP3_MAX_COMMAND_SIZE
is defined to be 8K.  However, this limit is arbitrary: I couldn't find
anywhere in the databook a reason why the blitter couldn't copy more
than 8K of data.  The actual limit is 64K of DWORDS.  I guess 8KB was
just chosen as a reasonable waste of buffer space.

Moreover, the GPU is very well capable of wrapping its command pointer at
arbitrary positions, even in the middle of a command.  And so should the
software.  I strongly disagree with the claim in the comment that this
strategy simplifies anything.

Later on, gp_color_bitmap_to_screen_blt() carries on the actual bitmap
upload:

   /* WRITE DWORD COUNT */

   WRITE_COMMAND32(4, GP3_HOST_SOURCE_TYPE | total_dwords);

   /* WRITE DATA */

   WRITE_COMMAND_STRING32(8, data, srcoffset, dword_count);
   WRITE_COMMAND_STRING8(8 + (dword_count << 2), data,
       srcoffset + (dword_count << 2), byte_count);

This code is repeated for each line of the source bitmap.  dword_count
is the actual width in 32bit words.  I had to re-read the code several
time because I couldn't believe my eyes.

What this code does, is *bulk copying* raw bitmap data, a line at a time,
into the command buffer, as revealed by unwrapping the macros:

 #define WRITE_COMMAND32(offset, value) \
        (*(unsigned long *)(cim_cmd_ptr + (offset))) = (value)

 #define WRITE_COMMAND_STRING32(offset, dataptr, dataoffset, dword_count) \
 {                                                                        \
         unsigned long i;                                                     \
         unsigned long tempdata = (unsigned long)dataptr + (dataoffset);      \
         unsigned long byte_off = 0;                                          \
         for (i = 0; i < dword_count; i++, byte_off += 4)                     \
                 WRITE_COMMAND32 ((offset) + byte_off,                            \
                                *((unsigned long *)(tempdata + byte_off)));  \
 }

For reference:

  http://dev.laptop.org/git?p=xf86-amd-devel;a=blob;f=src/cim/cim_defs.h#l316

There are actually three flavors of WRITE_COMMAND_STRING32. The one we're
really using is hand optimized with inline assembly.  But the substance is
that they're just reimplementations of memcpy() in disguise, with a hidden
global pointer as destination.

What happens when we get a very wide bitmap, let's say 3200 pixels wide?
If the bitmap is 32bpp, we break the 8KB rule and (sometimes) stomp outside
the command buffer.  BOOM!

Another interesting consideration is: why are we copying bitmaps to the
command buffer with the CPU in the first place?  My guess is that getting
the blitter to read the data from paged virtual memory was hard.
But then, if we have to copy all the data with the CPU once, wouldn't it
be faster and easier to just write it right to its final destination in
video memory?

So my current impression is: this is just too broken to be fixed.  And
even if we try to work around it now, I bet we'll soon discover several
other ways to break it.  I hereby volunteer to simplify it in small
incremental steps by:

 - Removing all of the asm wizardry (useless IMHO, maybe even
   counter-productive)

 - Implementing access macros for the ring buffer using the normal,
   plain wrapping policy of all ring buffers

 - Killing the WRITE_COMMAND32() and WRITE_COMMANDSTRING32() abstractions.

 - Removing gp_declare_blt(), which needs to be called before starting
   any blitting operation

 - Seeing if we can get the blitter to read source data directly from system
   memory.  I'd be very surprised if there was no way to make it work
   with virtual memory enabled, because, without such a mechanism, the
   blitter would be less than fully useful.

In the short term, one quick and dirty fix would be to disable the EXA
upload hook.  The driver may even become somewhat faster because we
avoid the extra copy!

- 
   //  Bernardo Innocenti - http://www.codewiz.org/
 \X/ One Laptop Per Child - http://www.laptop.org/