The "iGoogle bug"

Bernardo Innocenti bernie at codewiz.org
Tue Sep 18 00:50:34 EDT 2007


Subject:   The "iGoogle bug"

To:        Jordan Crouse <jordan.crouse at amd.com>,OLPC Development <devel at lists.laptop.org>

Cc:        

Bcc:       

-=-=-=-=-=-=-=-=-=# Don't remove this line #=-=-=-=-=-=-=-=-=-

This is #3352: https://dev.laptop.org/ticket/3352



What's special about this page is that the clock applet uses two

very wide bitmaps of 3200 pixels each to represent the clock arms in all

possible positions.  (that's over 2MB of RAM wasted, nice!)



Of course, EXA promptly tries to upload these huge bitmaps to video

memory.  It's surprising we still had this much unallocated memory

to afford it.



So the request gets through the amd_drv upload hook, and eventually

we reach gp_color_bitmap_to_screen_blt(), whose purpose is to do

the actual uploading:



  http://dev.laptop.org/git?p=xf86-amd-devel;a=blob;f=src/cim/cim_gp.c;hb=HEAD#l1225



The LX blitter uses a circular command buffer to avoid stalling the

CPU for every command it issues.



The wrapping check for the ring buffer in gp_color_bitmap_to_screen_blt()

is very peculiar:



   /* CHECK IF A WRAP WILL BE NEEDED */



   gp3_cmd_next = gp3_cmd_current + size_dwords;

   if ((gp3_cmd_bottom - gp3_cmd_next) <= GP3_MAX_COMMAND_SIZE) {

       gp3_cmd_next = gp3_cmd_top;



       /* WAIT FOR HARDWARE */



       GP3_WAIT_WRAP(temp);

       WRITE_COMMAND32(0, GP3_DATA_LOAD_HDR_TYPE |

           GP3_DATA_LOAD_HDR_WRAP | GP3_DATA_LOAD_HDR_ENABLE);

   } else {

       /* WAIT FOR AVAILABLE SPACE */



       GP3_WAIT_PRIMITIVE(temp);

       WRITE_COMMAND32(0, GP3_DATA_LOAD_HDR_TYPE |

           GP3_DATA_LOAD_HDR_ENABLE);

   }



More insight into the above logic can be found in this comment from a

kissing cousin of the wrap check in gp_declare_blt():



   /* SET ADDRESS OF NEXT COMMAND */

   /* A summary of the command buffer logic is as follows:           */

   /*  - If after a basic BLT we will not have room for the largest  */

   /*    command (a full line of host source data), we set the wrap  */

   /*    bit.  This will waste up to a whopping 8K of command buffer */

   /*    space, but it simplifies the logic for all commands.        */

   /* -  If we are wrapping, we have extra logic to ensure that we   */

   /*    don't skip over the current GP read pointer.                */





Other code confirms the statement in this comment: GP3_MAX_COMMAND_SIZE

is defined to be 8K.  However, this limit is arbitrary: I couldn't find

anywhere in the databook a reason why the blitter couldn't copy more

than 8K of data.  The actual limit is 64K of DWORDS.  I guess 8KB was

just chosen as a reasonable waste of buffer space.



Moreover, the GPU is very well capable of wrapping its command pointer at

arbitrary positions, even in the middle of a command.  And so should the

software.  I strongly disagree with the claim in the comment that this

strategy simplifies anything.



Later on, gp_color_bitmap_to_screen_blt() carries on the actual bitmap

upload:



   /* WRITE DWORD COUNT */



   WRITE_COMMAND32(4, GP3_HOST_SOURCE_TYPE | total_dwords);



   /* WRITE DATA */



   WRITE_COMMAND_STRING32(8, data, srcoffset, dword_count);

   WRITE_COMMAND_STRING8(8 + (dword_count << 2), data,

       srcoffset + (dword_count << 2), byte_count);



This code is repeated for each line of the source bitmap.  dword_count

is the actual width in 32bit words.  I had to re-read the code several

time because I couldn't believe my eyes.



What this code does, is *bulk copying* raw bitmap data, a line at a time,

into the command buffer, as revealed by unwrapping the macros:



 #define WRITE_COMMAND32(offset, value) \

        (*(unsigned long *)(cim_cmd_ptr + (offset))) = (value)



 #define WRITE_COMMAND_STRING32(offset, dataptr, dataoffset, dword_count) \

 {                                                                        \

         unsigned long i;                                                     \

         unsigned long tempdata = (unsigned long)dataptr + (dataoffset);      \

         unsigned long byte_off = 0;                                          \

         for (i = 0; i < dword_count; i++, byte_off += 4)                     \

                 WRITE_COMMAND32 ((offset) + byte_off,                            \

                                *((unsigned long *)(tempdata + byte_off)));  \

 }



For reference:



  http://dev.laptop.org/git?p=xf86-amd-devel;a=blob;f=src/cim/cim_defs.h#l316



There are actually three flavors of WRITE_COMMAND_STRING32. The one we're

really using is hand optimized with inline assembly.  But the substance is

that they're just reimplementations of memcpy() in disguise, with a hidden

global pointer as destination.



What happens when we get a very wide bitmap, let's say 3200 pixels wide?

If the bitmap is 32bpp, we break the 8KB rule and (sometimes) stomp outside

the command buffer.  BOOM!



Another interesting consideration is: why are we copying bitmaps to the

command buffer with the CPU in the first place?  My guess is that getting

the blitter to read the data from paged virtual memory was hard.

But then, if we have to copy all the data with the CPU once, wouldn't it

be faster and easier to just write it right to its final destination in

video memory?



So my current impression is: this is just too broken to be fixed.  And

even if we try to work around it now, I bet we'll soon discover several

other ways to break it.  I hereby volunteer to simplify it in small

incremental steps by:



 - Removing all of the asm wizardry (useless IMHO, maybe even

   counter-productive)



 - Implementing access macros for the ring buffer using the normal,

   plain wrapping policy of all ring buffers



 - Killing the WRITE_COMMAND32() and WRITE_COMMANDSTRING32() abstractions.



 - Removing gp_declare_blt(), which needs to be called before starting

   any blitting operation



 - Seeing if we can get the blitter to read source data directly from system

   memory.  I'd be very surprised if there was no way to make it work

   with virtual memory enabled, because, without such a mechanism, the

   blitter would be less than fully useful.



In the short term, one quick and dirty fix would be to disable the EXA

upload hook.  The driver may even become somewhat faster because we

avoid the extra copy!



-- 

   //  Bernardo Innocenti - http://www.codewiz.org/

 \X/ One Laptop Per Child - http://www.laptop.org/




More information about the Devel mailing list