Announce: Compressed cache alpha-007 (kernel 2.6.18)

Wed Sep 20 10:00:14 EDT 2006

On Wed, 2006-09-20 at 18:23 +0530, Nitin Gupta wrote:

> >
> > +       /*
> > +        * This allocation must never sleep and must never fail.
> > +        * -- Doing GFP_ATOMIC will guarantee that I doesn't sleep
> > +        * but alloc may fail!
> > +        * -- Doing GFP_KERNEL giver higher chances that alloc will
> > +        * be successfull but it may sleep (and hence doesn't work)!
> > +        * -- What to do??
> > +        */
> > +       comp_page = alloc_page(GFP_ATOMIC);
> >
> > unsigned long pflags = current->flags;
> > current->flags |= PF_MEMALLOC;
> > alloc_page(GFP_NOWAIT);
> > current->flags = pflags;
> >
> A dumb question: IIRC 'current' points to currently executing task?
> What if 'current' changes between these statements -- any locking
> required to prevent this (like disable irq to keep 'current' current)?

current cannot change from the point of view of the execution thread.

> > or steal my __GFP_EMERGENCY patch.
> >
> It's in mainline now? or plan to be mainlined soon?

Not in mainline, I hope to get it there, but not looking to bright atm.
But that is mainly the rest of the patches, the __GFP_EMERGENCY
bit looks like so:

Index: linux-2.6/include/linux/gfp.h
===================================================================

--- linux-2.6.orig/include/linux/gfp.h
+++ linux-2.6/include/linux/gfp.h
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_ZERO	((__force gfp_t)0x8000u)/* Return zeroed page on success */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_EMERGENCY  ((__force gfp_t)0x40000u) /* Use emergency reserves */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,7 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_EMERGENCY)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -971,8 +972,8 @@ restart:
 
 	/* This allocation should allow future memory freeing. */
 
-	if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
-			&& !in_interrupt()) {
+	if ((((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+			&& !in_interrupt()) || (gfp_mask & __GFP_EMERGENCY)) {
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */

> > +       tmp_page = alloc_pages(GFP_KERNEL, 1);  /* page may expand */
> >
> > 1 order alloc? No way to not do that? Compression will fail if you don't
> > provide enough output space, right? perhaps just discard the page,
> > negative savings just don't make sense.
> >
> Yes. 1 order alloc!! A failure point for ccache. The pages that expand
> are already discarded but problem is to provide temporary buffer to
> get compressed output in first place. Alternate is to insert size
> checks within compressor tight loop -- even this seemed hard in
> strange LZO code.

Ah, so the compressor requires it, too bad. Then perhaps preallocate a
few 1-order pages; if you could ensure the max concurrency to be NR_CPUS
that would be great, otherwise you'd have to stick in a fallback path.

> > Also:
> >
> > +struct chunk {
> > +       void *start_addr;       /* page addr + offset within page
> > +                                * where chunk starts */
> > +       unsigned short size;    /* size: 12 LSB bits, flags: rest 4 bits */
> > +       struct chunk *next;     /* link to next 'related' chunk */
> > +       struct list_head chunks;        /* 'master chunk list': every
> > +                                        * chunk is linked */
> > +};
> >
> > that unsigned short is a waste, the compiler will align the next field
> > to 32 bit anyway (unless you add the packed attribute to the struct) and
> > the size of struct chunk would be 20 bytes and struct chunk is
> > kmalloc'ed and the smallest slab is 32 bytes, so you waste heaps of
> > memory here, just be clear and don't fiddle with bitfields, it doesn't
> > gain you anything.
> >
> 
> So, there's no way to allocate just 20 bytes? I need to keep this
> struct as small as possible. 32 bytes will be too much for stuct
> chunk! Declaring it as packed and then kmalloc() will still use
> 32-bytes. right?

yep, you could perhaps declare your own slab (kmem_cache) for this
purpose (not quite sure on the slab size restrictions, it might be that
32 is actually the smallest possible). (if you do pack it, put the short
at the end, that way the rest will stay aligned)

Either that, or embed struct chunk within the actual data space.

> > Don't like the bit_spin_trylock/bit_spin_unlock in handle_ccache_fault;
> > what's wrong with TestSetPageLocked() and unlock_page() ?
> >
> 
> If (PageCompressed(page)) is true then 'page' is 'struct chunk_head'
> not 'struct page' and you cannot use unlock_page() on chunk_head.

ClearPageLocked()

> > PageWillCompress() doesn't look necessary, can't you just add a
> > parameter to pageout() instead?
> >
> hm...PageWillCompress() will now disappear :)

goodness, :-)

> > Why not make ccache a proper address_space and modify do_swap_page()
> > that should get rid of a lot of the intrusiveness in the find_get_page()
> > family. Of course that would mean finding another bit in the swp_entry_t
> > (perhaps you can use some of the page_migration bits and make ccache and
> > page migration mutually exclusive).
> >
> 
> This would make it hard to handle page-cache pages. So, these
> intrusive changes in 4-5 lookup functions looked better to me....

Right, pinning the file's address_space will be hard,.. /me goes think
on it.