XO 1.5 performance testing.
Daniel Drake
dsd at laptop.org
Mon Mar 14 13:12:02 EDT 2011
On 14 March 2011 07:28, Jon Nettleton <jon.nettleton at gmail.com> wrote:
> It has been a fun and fulfilling weekend tracking down performance
> regressions when using DRI on the XO 1.5 platform. One place I was
> looking at specifically was blitting solids from userspace to the
> kernel. I found that copy_from_user was really chewing up a
> significant amount of cpu. Looking further I found that for VIA C7
> processors these two options, X86_INTEL_USERCOPY and
> X86_USE_PPRO_CHECKSUM, are not enabled.
>
> I have not done thorough testing on it, but after patching the kernel
> with the attached patch my gtkperf run dropped from 63 seconds down to
> 50 seconds.
Wow! Nice work!
But I think your gains must have come from elsewhere.
Please correct me if you spot something I've missed.
First, X86_INTEL_USERCOPY.
The function that decides whether the alternative codepath enabled by
this config option actually gets used is __movsl_is_ok() in
arch/x86/lib/usercopy_32.c.
This function has to return 0 for the interesting new functions such
as __copy_user_zeroing_intel() to be called.
The only way this function can return 0 is:

    if (n >= 64 && ((a1 ^ a2) & movsl_mask.mask))
        return 0;
The "return 0" above will never happen on our platform:
movsl_mask.mask is always 0 (on Intel platforms it is initialized in
vendor-specific code in arch/x86/kernel/cpu/intel.c).
So, X86_INTEL_USERCOPY doesn't seem to have any effect for us.
I tried hacking movsl_mask.mask to 7, as Intel does, and it resulted in
a 0.9% speedup in copy_to_user (possibly just noise) when doing
unaligned writes. It's important to realise that the codepaths enabled
by this option are only an optimization for unaligned accesses;
well-aligned accesses are unchanged.
X86_USE_PPRO_CHECKSUM looks like it is worth doing: it speeds up
csum_partial() calls by a factor of 1.5. But I'd be surprised if this
is having any direct impact on your graphics work (this
checksum-calculating function seems to have only a handful of users
outside of networking), unless your new DRI code is calling it
directly?
Your performance gains are very exciting but I think they must have
resulted from something else.
My benchmark code is here (including a hack to enable the unaligned
write optimization when X86_INTEL_USERCOPY is set):
http://dev.laptop.org/~dsd/20110314/benchmark-copy_from_user-and-csum_partial.patch
Daniel