XO 1.5 performance testing.
Daniel Drake
dsd at laptop.org
Mon Mar 14 13:12:02 EDT 2011
On 14 March 2011 07:28, Jon Nettleton <jon.nettleton at gmail.com> wrote:
> It has been a fun and fulfilling weekend tracking down performance
> regressions when using DRI on the XO 1.5 platform. One place I was
> looking at specifically was blitting solids from userspace to the
> kernel. I found that copy_from_user was really chewing up a
> significant amount of cpu. Looking further I found that for VIA C7
> processors these two options, X86_INTEL_USERCOPY and
> X86_USE_PPRO_CHECKSUM, are not enabled.
>
> I have not done thorough testing on it, but after patching the kernel
> with the attached patch my gtkperf run dropped from 63 seconds down to
> 50 seconds.
Wow! Nice work!
But I think your gains must have come from elsewhere.
Please correct me if you spot something I've missed.
First, X86_INTEL_USERCOPY.
The function that decides whether the alternative codepath enabled by
this config option actually gets used is __movsl_is_ok() in
arch/x86/lib/usercopy_32.c.
This function has to return 0 for the interesting new functions such
as __copy_user_zeroing_intel() to be called.
The only way this function can return 0 is:

    if (n >= 64 && ((a1 ^ a2) & movsl_mask.mask))
        return 0;
The "return 0" above will never happen on our platform:
movsl_mask.mask is always 0 (on Intel platforms it is initialized in
vendor-specific code in arch/x86/kernel/cpu/intel.c).
So, X86_INTEL_USERCOPY doesn't seem to have any effect for us.
I tried hacking movsl_mask.mask to 7, as Intel does, and it resulted in
a 0.9% speedup in copy_to_user (possibly just noise) when doing
unaligned writes. It's important to realise that the codepaths enabled
by this option are only an optimization for unaligned accesses;
well-aligned accesses are unchanged.
X86_USE_PPRO_CHECKSUM looks like it is worth doing: it speeds up
csum_partial() calls by a factor of 1.5. But I'd be surprised if this
is having any direct impact on your graphics work (this
checksum-calculating function seems to have only a handful of users
outside of networking), unless your new DRI code is calling it
directly?
Your performance gains are very exciting but I think they must have
resulted from something else.
My benchmark code is here (including a hack to enable the unaligned
write optimization when X86_INTEL_USERCOPY is set):
http://dev.laptop.org/~dsd/20110314/benchmark-copy_from_user-and-csum_partial.patch
Daniel