Where the OLPC machine spends time when using a web browser

Adam Jackson ajackson at redhat.com
Thu Mar 15 16:38:05 EDT 2007


On Thu, 2007-03-15 at 12:02 -0700, Carl Worth wrote:
> On Tue, 2007-03-13, Adam Jackson wrote:
> > On Mon, 2007-03-12 at 18:59 -0400, William Cohen wrote:
> ...
> > > 6514     68.1096  libfb.so                 fbFetchTransformed
> >
> > Wow.  I think that's the first time I've ever seen this actually show up
> > on a profile.  I didn't think anything used Render's transformations,
> > on account of how painfully slow they are.
> 
> But it's a new, cairo world now, Adam, and we prefer to use Render
> whenever possible.  So I expect you'll be seeing these paths in the X
> server get exercised more and more, and we really do need to make this
> software much faster (and also use the GPU whenever possible to avoid
> the software path altogether--obviously not an option on the OLPC
> machine).

There's just no making software texture interpolation fast enough.  You
probably don't have the memory bandwidth, you certainly don't have the
interpolation in silicon, and (on CPUs that actually have one) you end
up blowing out your dcache.  It's the wrong tool for the job.  This is
_why_ we invented GPUs.
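
For reference, the per-pixel work for one bilinear fetch looks roughly
like this (a sketch, not the actual fbFetchTransformed code, which does
much the same thing in fixed-point, per channel, on packed ARGB):

/* One bilinear sample: four texel reads plus three lerps, per channel,
 * per destination pixel.  Single 8-bit channel for clarity; assumes
 * 0 <= x < w and 0 <= y < h. */
static unsigned char
bilinear_sample(const unsigned char *img, int stride, int w, int h,
                double x, double y)
{
    int x0 = (int)x, y0 = (int)y;
    int x1 = (x0 + 1 < w) ? x0 + 1 : x0;
    int y1 = (y0 + 1 < h) ? y0 + 1 : y0;
    double fx = x - x0, fy = y - y0;

    /* Four reads, with poor locality whenever the transform walks the
     * source diagonally; this is what blows out the dcache. */
    double p00 = img[y0 * stride + x0], p10 = img[y0 * stride + x1];
    double p01 = img[y1 * stride + x0], p11 = img[y1 * stride + x1];

    double top = p00 + fx * (p10 - p00);
    double bot = p01 + fx * (p11 - p01);
    return (unsigned char)(top + fy * (bot - top));
}

Four reads and a pile of multiplies for every pixel you emit, before
you even get to the blend.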

That Render includes transforms and filters is very nearly academic
levels of solipsism.  You just can't make them go fast in software, and
we already had a working API for doing them in hardware.  Using Render
for untransformed alpha blending is at least reasonable to do in
software; Render ought to have stopped there.
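
For contrast, one channel of an untransformed OVER is a single source
read and a couple of integer multiplies; something like this, assuming
premultiplied alpha:

/* Premultiplied Porter-Duff OVER for one 8-bit channel:
 *   dst = src + (1 - alpha(src)) * dst
 * No neighbouring pixels, no filtering, just a multiply and an add. */
static unsigned char
over_channel(unsigned char src, unsigned char srca, unsigned char dst)
{
    unsigned int t = src + (((255u - srca) * dst + 127u) / 255u);
    return (unsigned char)(t > 255u ? 255u : t);
}

That you can make tolerable in software; the filtered fetch above, you
can't.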

This is why my earlier comments were about how to cheat.  Nearest
scaling will go reasonably fast and probably look okay.  We might be
able to squeeze passable speed out of bilinear scaling in the special
case where the transformation matrix is an integer multiple of the
identity matrix, by constant-folding the scale factors into the blend
equations [1]; see the sketch below.  But bilinear is just never going
to go as fast as you want it to: you're inflating your memory bandwidth
requirements by a factor of 4 when we already don't have any memory
bandwidth to spare, and the X image format is intrinsically non-tiled
and therefore cache-unfriendly.  Worse, X has no way of knowing how an
image is going to be used in the future, so we don't even have a
reasonable heuristic for caching the transformed image.  If we want
gecko to scale X images, we should be doing the scaling in libpr0n and
only ever handing the server the pre-scaled image.
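
To make the constant-folding concrete: for a straight 2x upscale the
bilinear weights degenerate to copies and averages, so the whole filter
collapses into shifts and adds.  A sketch, one channel, edge pixels and
the exact sample-point convention hand-waved:

/* 2x bilinear upscale with the filter weights constant-folded: every
 * output pixel is a source pixel, the average of two, or the average
 * of four.  dst is (2*sw) x (2*sh); the last output row and column
 * are left unwritten for brevity. */
static void
scale2x_bilinear(const unsigned char *src, int sw, int sh,
                 unsigned char *dst)
{
    int dstride = 2 * sw;
    for (int y = 0; y < sh - 1; y++) {
        for (int x = 0; x < sw - 1; x++) {
            unsigned char p = src[y * sw + x];
            unsigned char r = src[y * sw + x + 1];
            unsigned char d = src[(y + 1) * sw + x];
            unsigned char q = src[(y + 1) * sw + x + 1];
            unsigned char *o = dst + (2 * y) * dstride + 2 * x;

            o[0]           = p;                        /* copy       */
            o[1]           = (p + r + 1) >> 1;         /* horizontal */
            o[dstride]     = (p + d + 1) >> 1;         /* vertical   */
            o[dstride + 1] = (p + r + d + q + 2) >> 2; /* centre     */
        }
    }
}

No multiplies left at all, and the source walk is purely sequential, so
this is about as kind to the cache as scaling ever gets.  It still
touches every source pixel four times, though, which is the bandwidth
problem all over again.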

I'm all for optimising our code, but there's a physical limit here.  We
need to stop pretending that software Render can ever be fast.  It can
only be tolerable.

[1] - Yes, I'm aware of dynamic code generation.  It doesn't exist in
the X we have, it's not going to exist by the time gen1 ships, and it
still won't solve the bandwidth or cache problems.

- ajax



