[OLPC-devel] SoC update for Sunday, June 18th, 2006

mharriso at student.umass.edu
Sun Jun 18 20:21:11 EDT 2006



Hello folks.  I've got my current code for the soc-eds posted in the git
repository at http://crank.laptop.org/git.do .  At this point, it's mostly a
mess and not terribly useful unless you're me, but I suspect it will get better
as I iron things out.

Here's the rundown on what's happening with my project:
    *  After much time spent drinking coffees and lattes, I've made some good
progress with parsing the Project Gutenberg catalog.

    *  The file of interest is soc-eds/modules/gutenberg/gutenberg.py.  It
downloads the PG catalog, trims off excess data from the beginning and end of
it, and begins to parse the etexts that are numbered 10,001 and above.  PG
adopted a new format for listings starting with #10,001, so there are (at
least) two different formats to deal with.

    *  It parses each line and splits it into language, title, author, and etext
#.  I'm still looking for the best way to then search the catalog.

    *  I've also got a few methods outlined for searching and downloading etexts
which will be implemented later.
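
To make the parsing step concrete, here is a rough sketch of the idea, not the
actual code in gutenberg.py.  The line layout it assumes ("Title, by Author"
followed by whitespace and the etext #, with an optional "[Language: ...]" line
underneath) is my guess at the newer listing format, and the names Etext,
ENTRY_RE, LANGUAGE_RE, and parse_catalog are made up for illustration.  It also
assumes that an entry with no language line is in English.

```python
import re
from typing import NamedTuple


class Etext(NamedTuple):
    """One catalog record (hypothetical structure, for illustration only)."""
    number: int
    title: str
    author: str
    language: str


# ASSUMED layout: "<title>, by <author>" then 2+ spaces, then the etext number.
ENTRY_RE = re.compile(
    r"^(?P<title>.+?), by (?P<author>.+?)\s{2,}(?P<number>\d+)\s*$"
)
# ASSUMED layout: an optional follow-up line such as " [Language: French]".
LANGUAGE_RE = re.compile(r"^\s*\[Language:\s*(?P<language>[^\]]+)\]\s*$")


def parse_catalog(lines, minimum_number=10001):
    """Return Etext records for entries numbered minimum_number and above."""
    records = []
    last_kept = False  # True when the most recent entry line was stored
    for line in lines:
        entry = ENTRY_RE.match(line)
        if entry:
            number = int(entry.group("number"))
            last_kept = number >= minimum_number
            if last_kept:
                records.append(
                    Etext(
                        number=number,
                        title=entry.group("title").strip(),
                        author=entry.group("author").strip(),
                        language="English",  # assumed default
                    )
                )
            continue
        language = LANGUAGE_RE.match(line)
        if language and last_kept:
            # Attach the language to the entry it follows.
            records[-1] = records[-1]._replace(
                language=language.group("language").strip()
            )
    return records


sample = [
    "The Example Book, by Jane Doe                          10002",
    " [Language: French]",
    "Another Title, by John Roe                             10001",
    "Old Entry, by Someone                                   9999",
    " [Language: German]",
    "not a catalog line",
]
records = parse_catalog(sample)
```

Lines that match neither pattern are skipped, and entries below 10,001 are
ignored here because they use the older format and would need their own pattern.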

I hope to spend the next week parsing the etexts numbered below 10,001 and
start implementing the search function.  After that, downloading the texts from
PG's servers is fairly trivial.
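
For the search function, one simple option is a case-insensitive substring
match over the parsed records.  This is only a sketch of one possible approach,
not a decision about how the search will work; the Etext record and the search
function below are hypothetical names, and the sample entries are made up.

```python
from typing import NamedTuple


class Etext(NamedTuple):
    """One parsed catalog record (hypothetical structure)."""
    number: int
    title: str
    author: str
    language: str


def search(records, query, fields=("title", "author")):
    """Return records whose chosen fields contain query, ignoring case."""
    needle = query.casefold()
    return [
        record
        for record in records
        if any(needle in getattr(record, field).casefold() for field in fields)
    ]


# Made-up entries, just to exercise the function.
catalog = [
    Etext(10001, "Another Title", "John Roe", "English"),
    Etext(10002, "The Example Book", "Jane Doe", "French"),
]
by_author = search(catalog, "doe")
by_language = search(catalog, "english", fields=("language",))
```

A linear scan like this is probably fine for a catalog of a few tens of
thousands of entries; an index could come later if it turns out to be too slow.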

-Matthew Harrison




