[OLPC-devel] Re: Software action items and status

Fri Jun 9 09:51:49 EDT 2006

On Thu, 2006-06-08 at 23:28 -0400, Jim Gettys wrote:
> 1) Flash interface alternatives for slow flash reading. (dwmw2/jg)
>    Dave and Jim to get together a list of possible alternatives here,
>    so Mark can cost out and see what might be done to improve this. 

Right... here are the basic alternatives:

1). Leave it as it is, using the CS5536 NAND controller. We get access
    to the flash about an order of magnitude slower than it should be;
    2.7MiB/s instead of 26MiB/s.

2). Leave the hardware as it is but switch to 66MHz PCI. We ought to get
    about 3.5MiB/s from it -- still fairly crap. And there are power
    consumption implications of switching to 66MHz -- do we know
    precisely what difference that makes?

3). The IDE MDMA hack that Tom has been looking at. The documentation on
    the IDE timing MSR (ATAC_CH0D0_DMA) seems to be explicit that it's 
    66MHz cycles, so by setting tKR and tDR both to zero (i.e. one cycle
    for each of 'active' and 'recovery' time) we ought to be able to 
    do 60ns per cycle.

    Unfortunately, we get 16 bits of data for each IDE cycle and the
    chip provides only 8 bits. We have to post-process the buffer,
    picking out the alternate bytes that we actually want, and
    discarding the line noise on the upper 8 bits of each transfer.

    Tom -- does that look like an accurate summary? And am I right in
    interpreting the docs as saying it's _always_ measured in 66MHz
    cycles, even when PCI is at 33MHz? 
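
    As a rough sketch of that post-processing step (not actual driver
    code): it assumes each 16-bit transfer lands in memory
    little-endian, so the valid NAND byte sits at the even offset and
    the junk at the odd one. The function name is made up for
    illustration.

```python
def extract_nand_bytes(raw: bytes) -> bytes:
    """Keep the low byte of each 16-bit IDE transfer, drop the high byte.

    Assumption: transfers are stored little-endian, so the byte the
    NAND chip drove is at the even offset of each 2-byte pair.
    """
    if len(raw) % 2:
        raise ValueError("expected a whole number of 16-bit transfers")
    return raw[0::2]


# Four transfers; 0xEE stands in for the noise on the upper 8 bits.
raw = bytes([0x10, 0xEE, 0x20, 0xEE, 0x30, 0xEE, 0x40, 0xEE])
page = extract_nand_bytes(raw)
print(page.hex())
```

    Half of every buffer we DMA in is thrown away, and the copy touches
    each byte of it, which is where the dcache cost comes from.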

    We'd get just under 16MiB/s raw _buffer_ read speed from that, under
    ideal conditions. That's not _quite_ a number we can compare 
    directly with the above 2.7MiB/s and 3.5MiB/s figures, since it
    doesn't include command time or anything like that, and neither does
    it include picking the alternate bytes from the buffer when it's
    arrived (which will also mean bringing it into dcache from RAM).
    For comparison, raw _buffer_ read speed from the flash chip would
    be about 38MiB/s. So we're still a way off the ideal.
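
    The arithmetic behind those figures, as a quick sketch. It assumes
    one useful byte per 60ns IDE cycle, as described above; the
    per-byte time for the flash chip is merely what the 38MiB/s figure
    implies, not a number taken from a datasheet.

```python
MIB = 1024 * 1024  # bytes per MiB


def mib_per_s(useful_bytes, ns_per_cycle):
    """Throughput in MiB/s for `useful_bytes` moved every `ns_per_cycle`."""
    return useful_bytes * 1e9 / ns_per_cycle / MIB


# IDE MDMA hack: 16 bits per 60ns cycle, but only 8 of them are data.
ide_hack = mib_per_s(1, 60)

# Per-byte time implied by the quoted 38MiB/s raw flash buffer speed.
flash_ns_per_byte = 1e9 / (38 * MIB)

# Fraction of the chip's raw buffer speed that the hack could reach.
fraction = ide_hack / 38

print(f"IDE hack: {ide_hack:.1f} MiB/s")
print(f"flash:    {flash_ns_per_byte:.1f} ns/byte implied")
print(f"fraction: {fraction:.0%}")
```

    So the hack tops out at a bit over 40% of the raw chip speed, before
    any command overhead, byte-picking or software ECC.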

    It also doesn't account for the fact that we then have to do ECC
    in software, and there are other 'interesting' details about the 
    abuse of the IDE interface which may slow us down. Tom should have
    more details on precisely how this goes together for us, some time 
    soon.

4). Our own CPLD/FPGA
5). Our own ASIC

    I'll bundle these together since I'm not in a position to make a
    distinction between the two. That choice comes down to up-front
    costs vs. per-unit costs, plus scheduling (and testing)
    constraints.
    The technical issues from my PoV are very similar.

    Basically, the idea is that we attach our own NAND flash
    controller to replace the one in the CS5536. It's relatively
    simple -- just a FIFO and some Reed-Solomon ECC calculation.
    Thomas has a working implementation of this in a CPLD which 
    is freely licensed, and would just need adapting to interface
    to our board. Getting it all working in a CPLD and then transferring
    that to an ASIC should be relatively low-risk.

    We ought to be able to get _very_ close to the full 26MiB/s read
    speed of the chip (and also to its write speed) by doing this, and
    I think it's the option that I'd prefer; cost permitting.

    One question which I hope the AMD guys can help us with is _how_
    we interface to the CPU. Do we do it as a PCI device? A GeodeLink
    device? Something else?

    One possibility is that we could follow on from Tom's idea of
    abusing IDE. Except that with our own CPLD/FPGA/ASIC the whole thing
    is far less Heath Robinson; we can have proper 16-bit transfers, we
    can still have hardware ECC, etc. One advantage of this is that the
    OLPC board is already laid out to allow a chip to sit between the
    IDE interface and the NAND chip. We can even do UDMA -- Thomas says
    that "it should be no big deal to hack the VHDL glue for that".

6). Abandon direct access to flash, and use the PS3002 which Quanta put
    down pads for.

    There are serious problems with the approach of letting something
    like the PS3002 'fake' a normal 512-byte-sector block device using
    NAND flash. By layering a 'normal' file system on top of the
    internal pseudo-filesystem implemented by the PS3002, we end up with
    a fairly inefficient mess. We don't even get to tell the PS3002 when
    certain 'sectors' are no longer used by the file system, so it'll
    continue to garbage collect those sectors; copying them around on
    the NAND flash even though they're no longer used. We also end up
    with two 'layers' of journalling, and the journal of something like 
    the ext3 file system would be horridly inefficient atop the PS3002's
    "block device", as we repeatedly write sectors to the 'disk' twice
    in quick succession rather than just dealing with the underlying
    flash directly.
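
    A toy model of the garbage-collection point, purely illustrative:
    the function name and the numbers (a 64-sector erase block with 16
    sectors still in use) are invented, and say nothing about how the
    PS3002 really works internally.

```python
def gc_copies(block_sectors, fs_live, knows_fs_state):
    """Count the sectors copied when one erase block is reclaimed.

    block_sectors:  logical sector numbers stored in the block.
    fs_live:        sector numbers the file system still uses.
    knows_fs_state: True if the layer doing the garbage collection can
                    see fs_live (a flash file system can), False if it
                    cannot (a device behind a block interface).
    """
    if knows_fs_state:
        return sum(1 for s in block_sectors if s in fs_live)
    # Every sector ever written looks valid, so all of them are copied.
    return len(block_sectors)


block = list(range(64))   # sectors held in one erase block
live = set(range(16))     # sectors the file system still cares about

blind = gc_copies(block, live, knows_fs_state=False)
aware = gc_copies(block, live, knows_fs_state=True)
print(blind, aware)
```

    In this made-up case the blind layer copies four times as many
    sectors as one that knows which are dead.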

    I think this is the absolute last resort; we'd probably be better
    off with even option #1 or #2 than this.

Overall, I think #5 is the best answer -- test it out with a CPLD and
then commit it to silicon. Comments? 

-- 
dwmw2