[Wikireader] Wikipedia to TomeRaider perl scripts

Erik Zachte erikzachte at infodisiac.com
Fri May 2 21:46:33 EDT 2008

The newest Perl scripts for preparing the TomeRaider input file are online.
There may be some useful snippets for other projects.


The scripts implement 95% of the Wikimedia syntax and produce html output
from a wikidump in a form suited for TomeRaider:
basically one article title on a separate line, followed by the
article content (html and css) on the subsequent line.

<new>title x\n
article x content\n
<new>title y\n
article y content\n
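
As a rough illustration of the layout above, here is a minimal Python sketch
(not the Perl code from the scripts). The function name
write_tomeraider_input and the choice to collapse line breaks inside titles
and content into spaces are assumptions made for this example.

```python
import io

def write_tomeraider_input(articles, out):
    """Write (title, html) pairs as '<new>title' / content line pairs."""
    for title, html in articles:
        # Each record must occupy exactly two lines, so embedded line
        # breaks are collapsed to spaces (an assumption of this sketch).
        title_line = " ".join(title.split())
        content_line = " ".join(html.splitlines())
        out.write("<new>" + title_line + "\n")
        out.write(content_line + "\n")

# Small in-memory demonstration.
buffer = io.StringIO()
write_tomeraider_input([("title x", "<p>article x</p>\n<p>more</p>")], buffer)
```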

The last major addition was code to resolve templates including most but not
all conditional syntax (no #expr).
Today I would never start this project, now that a static html
dump is (sometimes) available, but in 2003 rebuilding the Wikimedia parser
did not seem a big deal, and the project grew over the years.


A part that might be useful and can be run alone is
> WikiToTomeToolGenerateMathImages <, which generates png's from embedded
<math>..</math> using dvipng, with several options and error handling.
I started the file, but a retired German math teacher, Norbert Jaspers,
adopted the file and greatly expanded and improved it.
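
For readers unfamiliar with the general approach, the following Python
sketch shows one way to pull <math> bodies out of wikitext and hand them to
latex and dvipng. It is not taken from WikiToTomeToolGenerateMathImages: the
function names, the minimal LaTeX preamble, and the choice of the dvipng
options -T tight, -D and -o are all assumptions of this example.

```python
import re
import subprocess

MATH_RE = re.compile(r"<math>(.*?)</math>", re.DOTALL)

def extract_math(wikitext):
    """Return the formula text of every <math>..</math> element."""
    return [m.strip() for m in MATH_RE.findall(wikitext)]

def latex_document(formula):
    """Wrap one formula in a minimal LaTeX document (assumed preamble)."""
    return ("\\documentclass{article}\n"
            "\\pagestyle{empty}\n"
            "\\begin{document}\n"
            "$" + formula + "$\n"
            "\\end{document}\n")

def render_png(basename, dpi=120):
    """Run latex, then dvipng, on basename.tex. Returns True on success."""
    steps = [
        ["latex", "-interaction=nonstopmode", basename + ".tex"],
        ["dvipng", "-T", "tight", "-D", str(dpi),
         "-o", basename + ".png", basename + ".dvi"],
    ]
    for cmd in steps:
        try:
            result = subprocess.run(cmd, capture_output=True)
        except FileNotFoundError:
            return False  # tool not installed
        if result.returncode != 0:
            return False  # formula did not compile or convert
    return True
```

render_png expects basename.tex to exist already, for example written from
latex_document(formula).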


In > WikiToTomeImages.pl < there is code to download and resize images from
the Wikimedia servers. 
The code is rather complicated, and part of a larger job. Still, there may be
something to learn from it.

Images are downloaded and stored locally in nested folders similar to those
on the Wikimedia servers (based on positions 1 and 2 of the md5 hash of the
file name).
This process is incremental: images retained from earlier runs are not
downloaded again (yes, re-uploads are missed this way).
Images are resized using two tools: nconvert.exe (a command-line tool that
comes with XnView) and convert.exe (ImageMagick).
This resize process is also incremental: images resized with similar options
on earlier runs are not overwritten.
For png images the compression ratio of the image is checked. If compression
is above a certain threshold the image probably contains a map or diagram
and is not resized (text would become unreadable), unless it is larger than
x Mb.
Image metadata (exif) is removed.
Transparent png's are recolored (transparent to white; on Pocket PC a
transparent background would be shown as black, so this is actually a bug
workaround).
Png's are compressed in two ways and the best result is retained.
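
The nested folder layout and the incremental download check can be sketched
in Python as follows. This mirrors the layout used on the Wikimedia servers
(first hex digit of the md5 hash, then the first two hex digits); whether
WikiToTomeImages.pl builds its paths in exactly this way is an assumption,
and the function names are hypothetical.

```python
import hashlib
import os

def local_image_path(root, file_name):
    """Return root/<h1>/<h1h2>/<name>, where h1h2 start the md5 hex digest."""
    # Wikimedia hashes the name with spaces replaced by underscores;
    # that the scripts do the same is an assumption of this sketch.
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[0], digest[:2], name)

def needs_download(root, file_name):
    """Incremental check: skip any file already present from an earlier run."""
    return not os.path.exists(local_image_path(root, file_name))
```

Because the check looks only at the file name, a re-uploaded image with the
same name is not fetched again, which matches the limitation noted above.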

On any error a special file with the image name is written where the byte
size explains the error. 

Feel free to reuse for any purpose. Attribution would be neat.

Erik Zachte

Results of the download/resize process are logged as follows:

Image Processing Results:

Column O = Original image:
  NE = Image name invalid. No extension
  UE = Ignore image, unsupported file extension
  C  = Use custom sized image 'as is', never resize
  -  = Image downloaded successfully on previous run
  =  = Image downloaded successfully on this run
  D! = Image download failed
  S! = Image download failed on previous run. Skip download
  R! = Image download failed on previous run. Retry failed again.
  R  = Image download failed on previous run. Retry was successful
  M! = Image metrics could not be retrieved. Corrupt image

Column R = Resize image:
  N  = Never resize
  C  = No resize needed. Compression rate < 0.1
  S  = No resize needed. Small image
  F  = No resize needed (specified in custom filter)
  T  = No resize wanted (probably EasyTimeline time chart)
  L  = Resize proved ineffective: result larger than original
  -  = Image resized successfully on previous run
  =  = Image resized successfully on this run
  !! = Image resize failed on this run
  O! = Resize failed (no output found)
  I! = Resize failed (no image info could be retrieved)
  U! = Resize failed (unknown reason)
  !  = Image resize failed on previous run

Column U = Use image:
  O  = Use original image
  R  = Use resized image
  N  = Use none. Discard image

Special sizes for 'image files' on disk:
Original images folder:
  Size 1 = Download failed
  Size 2 = Image corrupt or image info could not be retrieved
  Size 3 = Resize operation produced no output. Makes original image
           suspicious too. Discard.
Resized images folder:
  Size 0 = No valid original image to start with
  Size 1 = No resize needed. Compression rate < 0.1 (might well be map or diagram)
  Size 2 = No resize needed. Very small image
  Size 3 = No resize needed (specified in custom filter)
  Size 4 = Resize proved ineffective: result larger than original
  Size 5 = Resize failed altogether, no output found
  Size 6 = Resize image corrupt, no image info could be retrieved
  Size 7 = No resize wanted. Probably EasyTimeline time chart
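
The idea of a placeholder file whose byte size encodes the outcome can be
sketched in Python like this. The size-to-meaning tables are copied from the
lists above; the function names and the fill byte are assumptions of this
example, not details of the Perl scripts.

```python
import os
import tempfile

ORIGINAL_MARKERS = {
    1: "Download failed",
    2: "Image corrupt or image info could not be retrieved",
    3: "Resize operation produced no output",
}
RESIZED_MARKERS = {
    0: "No valid original image to start with",
    1: "No resize needed. Compression rate < 0.1",
    2: "No resize needed. Very small image",
    3: "No resize needed (specified in custom filter)",
    4: "Resize proved ineffective: result larger than original",
    5: "Resize failed altogether, no output found",
    6: "Resize image corrupt, no image info could be retrieved",
    7: "No resize wanted. Probably EasyTimeline time chart",
}

def write_marker(path, size):
    """Write a placeholder of exactly `size` bytes in place of an image."""
    with open(path, "wb") as handle:
        handle.write(b"x" * size)  # fill byte is arbitrary (assumption)

def read_marker(path, markers):
    """Return the meaning of a placeholder, or None for a real image."""
    return markers.get(os.path.getsize(path))

# Demonstration in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    marker = os.path.join(folder, "Example.png")
    write_marker(marker, 4)
    status = read_marker(marker, RESIZED_MARKERS)
```

This works because no valid image file is only a handful of bytes long, so
the tiny sizes cannot collide with real images.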

FSO = File size original image
FSR = File size resized image
CRO = Compress ratio original image
WHO = Width/Height original image
WHR = Width/Height resized image
DB  = Depth in bits (32 = transparency layer present) (new, does not
      function flawlessly)

Resize images to max 240 x 180 pixels (landscape) or 180 x 240 (portrait)
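
The target size for this rule can be computed as in the Python sketch below.
The function name fit_size is hypothetical. Rounding down and never
enlarging small images are assumptions, chosen because they reproduce the
WHO/WHR pairs in the sample log that follows; the actual resizing is done by
nconvert.exe and convert.exe.

```python
def fit_size(width, height, long_side=240, short_side=180):
    """Scale (width, height) down to fit the landscape or portrait box."""
    if width >= height:
        box_w, box_h = long_side, short_side   # landscape: 240 x 180
    else:
        box_w, box_h = short_side, long_side   # portrait: 180 x 240
    if width <= box_w and height <= box_h:
        return width, height                   # already fits, keep as is
    # Integer comparison of aspect ratios decides which side is limiting.
    if width * box_h >= height * box_w:
        return box_w, max(1, height * box_w // width)
    return max(1, width * box_h // height), box_h
```

For example, fit_size(960, 398) gives (240, 99) and fit_size(299, 310) gives
(180, 186).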


        O  R  U     FSO     FSR   CRO       WHO       WHR DB

      1 -  -  R   83939    8079 0.162   480x360   240x180 24
      2 -  -  R   18155    6032 0.187   180x180   180x180 24
      3 -  -  R   94566    2688 0.083   960x398    240x99 24
      4 S! N  N                                            0
      5 -  -  R  452677    5044 0.192  1024x768   240x180 24
      6 -  -  R  106150    4671 0.382   299x310   180x186 24
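
A log row like the ones above can be read back with a short Python sketch.
The function names are hypothetical. The compress ratio formula (file size
divided by width x height x 3 bytes per pixel) is not stated in this post;
it is inferred here because it reproduces the CRO values of the sample rows.

```python
def parse_row(line):
    """Parse one log row into a dict; size columns may be absent."""
    fields = line.split()
    row = {
        "nr": int(fields[0]),
        "O": fields[1],
        "R": fields[2],
        "U": fields[3],
        "DB": int(fields[-1]),
    }
    if len(fields) == 10:  # full row with sizes and dimensions
        row["FSO"] = int(fields[4])
        row["FSR"] = int(fields[5])
        row["CRO"] = float(fields[6])
        row["WHO"] = tuple(int(n) for n in fields[7].split("x"))
        row["WHR"] = tuple(int(n) for n in fields[8].split("x"))
    return row

def compress_ratio(file_size, width, height, bytes_per_pixel=3):
    """File size relative to raw pixel data (inferred formula, see above)."""
    return file_size / (width * height * bytes_per_pixel)

sample = parse_row("      1 -  -  R   83939    8079 0.162   480x360   240x180 24")
ratio = round(compress_ratio(sample["FSO"], *sample["WHO"]), 3)
```

Rows for failed downloads, such as row 4, carry only the status columns and
the DB column, so the size keys are simply left out of the result.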
