More 16 vs 24 bpp profiling

Tue Sep 11 13:44:22 EDT 2007

On Tue, 2007-09-11 at 13:03 -0400, Bernardo Innocenti wrote:
> (adding xorg-devel@ on Cc)
> 
> On 09/11/2007 11:29 AM, Jordan Crouse wrote:
> > On 11/09/07 13:05 +0200, Stefano Fedrigo wrote:
> >> I've done some more profiling on the 16 vs. 24 bpp issue.
> >> This time I used this test:
> >> https://dev.laptop.org/git?p=sugar;a=blob;f=tests/graphics/hipposcalability.py
> >>
> >> A simple speed test: I measured the time required to scroll down and up
> >> one time all the generated list.  Not extremely accurate, but I repeated the
> >> test a few times with consistent results (+- 0.5 secs).  Mean times:
> >>
> >> xserver 1.4
> >> 16 bpp: 37.9
> >> 24 bpp: 40.7
> >>
> >> xserver 1.3
> >> 16: 46.4
> >> 24: 50.1
> >>
> >> At 24 bpp we're a little slower.  1.3 is 20% slower than 1.4. The pixman
> >> migration patch makes the difference: 1.3 spend most of that 20% in memcpy().
> >>
> >> The oprofile reports are from xserver 1.4.  I don't see much difference
> >> between 16 and 24, except that at 24 bpp, less time is spent in pixman and more
> >> in amd_drv.  At 16 bpp pixman_fill() takes twice the time.
> >>
> >> Unfortunately without a working callgraph it's not very clear to me what's
> >> happening in amd_drv.  At 24bpp gp_wait_until_idle() takes twice the time...
> > 
> > What can we do to fix this?  I would really like to know who is calling
> > gp_wait_until_idle().
> 
> I think the invocation in lx_get_source_color() can safely
> go away, as exaGetPixmapFirstPixel() has always done
> correct locking even in 1.3.
> 
> But because the1x1 source pixmap used as solid color is
> still being uploaded to the framebuffer, I'd expect
> exaGetPixmapFirstPixel() to indirectly call the driver
> download hook and, thus, stall the GPU anyway.
> 
> If this tiny pixmap was at least reused, the second
> time it would be already in system memory.  And it
> seems that Cairo is trying to cache patterns in the CR.
> 
> Problem is, many GTK widgets like to create a new CR on every
> repaint event, thus rendering the cache quite effective for a
> typical workload of a window with several small widgets in it.
> But I've stumbled in the caching code a few months ago while
> debugging something else, so I may very well be mistaken.
> 
> On git's master, Michel Dänzer has recently been pushing
> a long run of EXA performance patches.  I've had only a
> quick glance, but it seems they may cure 
> 
> $ git-log 8cfcf9..e8093e | git-shortlog
> Michel Dänzer (14):
>       EXA: Track valid bits in Sys and FB separately.
>       Add DamagePendingRegion.
>       EXA: Support partial migration of pixmap contents between Sys and FB.
>       EXA: Hide pixmap pointer outside of exaPrepare/FinishAccess whenever possible.
>       EXA: Improvements for trapezoids and triangles.
>       EXA: exaImageGlyphBlt improvements.
>       EXA: Improvements for 1x1 pixmaps.
>       EXA: RENDER improvements.
>       EXA: Remove superfluous manual damage tracking.
>       EXA: exaGetImage improvements.
>       EXA: exa(Shm)PutImage improvements.
>       EXA: Use exaShmPutImage for pushing glyphs to scratch pixmap in exaGlyphs.
>       EXA: exaFillRegion{Solid,Tiled} improvements.
>       EXA: Exclude bits that will be overwritten from migration in exaCopyNtoN.
> 
> 
> Aleph, I guess it may be useful to re-run tests after applying
> these patches.  In case merging them on the 1.4 branch happens
> to be difficult, using the code from master should be ok.
> They don't seem to have diverged too much, yet.
> 
> 
> > Also, I think we're spending way too much time in
> > gp_color_bitmap_to_screen_blt() - is there any way we
> > can get more indepth profiling in that one function?
> 
> Good idea!
> 
> Meanwhile, I looked at gp_color_bitmap_to_screen_blt() and it
> seems we're issuing a separate blit per horizontal line of the
> source data.   That is correct for the general case, where the
> destination width may not match the source pitch.
> 
> However, when we invoke gp_color_bitmap_to_screen_blt() for
> uploads, I'd expect the destination buffer to match the source,
> so a single blit would work.
> 
> If my guess is right, special casing "pitch == width*bpp"
> would be a big win.  Anyone minds adding an ErrorF()?
> 
> 
> NOTE ALEPH: I think we stopped development in the xf86-amd-devel
> repo some time ago.  The correct driver nowadays would be the
> fd.o one.  Jordan, do you confirm this?

Would be nice to know since I'll need to know where to get the sources
for the RPM updates from...  But ISTR mail about this to this to
xorg-devel@ last month, I think you're right.

Dan