[OLPC-devel] Weekly work summary [8/12]

Eric Astor eastor1 at swarthmore.edu
Sun Jul 23 22:58:26 EDT 2006


OEPC: Tools for Wikipedia article selection and export

Git Tree: (web) http://dev.laptop.org/git.do?p=projects/soc-oepc
          (git)  git://dev.laptop.org/projects/soc-oepc

Status update:

* Markenstein: I've made some more progress on my parser, and discovered a
fundamental issue - parsing of /italics/ is shaky, as there's no good way to
keep it from interfering with Linux path specifications. Suggestions for
alternate italics delimiters are welcome. (Note: we currently use *bold*,
/italics/, and _underline_.) I've also done further benchmarking and begun
profiling - it parses between 150K and 200K characters per second on my
laptop, so even if performance scales linearly with clock speed, I'd expect
at least 25K characters per second on the OLPC laptop. Considering that a
traditional print page holds a maximum of 5.6K characters (about 0.22
seconds per page at that rate), this will hopefully be fast enough with
proper buffering.
     - BisonGen: Meanwhile, I found that certain important features of
                 Markenstein are impossible to implement with BisonGen,
                 so this path is set aside unless the current parser
                 proves unacceptably slow.
* WikiFilter: I attempted to extract the parser from WikiFilter, to help in
converting wikitext to Markenstein, and got reasonably far with porting and
re-architecting the system. However, an alternative has emerged that
should give us access to a well-proven, full-featured wikitext parser:
Magnus Manske's wiki2xml parser (written in PHP) uses a structure *very*
similar to that of the Martel parsing framework, and Martel could easily
yield a faster parser with a smaller footprint.
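
To illustrate the /italics/ versus path problem described above, here is a
minimal sketch of one possible heuristic. It is not Markenstein's actual
parser; the names ITALIC_RE and render_italics, and the <i> output tags, are
assumptions made for this example. The idea is to accept a slash as a
delimiter only when it sits at a word boundary and the enclosed text
contains no further slashes.

```python
import re

# Hypothetical heuristic, not the Markenstein implementation:
#  - the opening "/" must not follow a word character or another "/"
#  - the enclosed text must start and end with a non-space character
#  - the enclosed text must not contain "/" or a newline
#  - the closing "/" must not be followed by a word character or "/"
ITALIC_RE = re.compile(r'(?<![\w/])/(?=\S)([^/\n]+?)(?<=\S)/(?![\w/])')


def render_italics(text):
    """Wrap /italic/ spans in <i> tags, leaving most paths untouched."""
    return ITALIC_RE.sub(r'<i>\1</i>', text)


if __name__ == "__main__":
    print(render_italics("this is /really/ important"))
    print(render_italics("see /usr/bin/python for details"))
    print(render_italics("cd /tmp/ first"))
```

Multi-component paths such as /usr/bin/python pass through unchanged, but a
single-component path with a trailing slash, such as /tmp/, is still read
as italics. The heuristic reduces the interference without removing it,
which is why a different delimiter may still be the cleaner fix.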

Next week's plans:

* Re-implement the wiki2xml parser in Python, using Martel.
* Continue development of Markenstein and its parsers.

- Eric Astor
 
