[Wikireader] Wikipedia to TomeRaider perl scripts

Fri May 2 21:46:33 EDT 2008

The newest Perl scripts for preparing the TomeRaider input file are online.
They may contain some useful snippets for other projects.

http://infodisiac.com/Wikipedia/TomeRaider/WikiToTome_4_2_a.zip

The scripts implement 95% of the MediaWiki syntax and produce HTML output
from a wiki dump in a form suited to TomeRaider: basically one article title
on a separate line, followed by the article content (HTML and CSS) on the
subsequent line.

<new>title x\n
article x content\n
<new>title y\n
article y content\n
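
As an illustration of the two-lines-per-article layout above, here is a
minimal Python sketch (not taken from the Perl scripts). The function names,
the UTF-8 encoding and the choice to collapse embedded line breaks into
spaces are assumptions made for the example only.

```python
def write_tomeraider_input(articles, path):
    """Write (title, html) pairs, one title line and one content line each."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for title, html in articles:
            # A record must stay on exactly two lines, so line breaks inside
            # the content are collapsed to spaces (assumption for this sketch).
            body = " ".join(html.splitlines())
            out.write(f"<new>{title}\n{body}\n")


def read_tomeraider_input(path):
    """Read the file back into a list of (title, html) pairs."""
    articles = []
    title = None
    with open(path, encoding="utf-8") as src:
        for line in src:
            line = line.rstrip("\n")
            if title is None:
                # Expect a title line; anything else is skipped.
                if line.startswith("<new>"):
                    title = line[len("<new>"):]
            else:
                # The line right after a title is the article content.
                articles.append((title, line))
                title = None
    return articles
```

Reading a file written by the first function returns the same titles, with
multi-line content flattened to a single line.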

The last major addition was code to resolve templates, including most but not
all conditional syntax (no #expr).
Today I would never start this project, now that a static HTML
dump is (sometimes) available, but in 2003 rebuilding the MediaWiki parser
did not seem a big deal, and the project grew over the years.

----------------------------------------------------------------------------

One part that might be useful and can be run on its own is
> WikiToTomeToolGenerateMathImages <, which generates PNGs from embedded
<math>..</math> using dvipng, with several options and error handling.
I started the file, but a retired German math teacher, Norbert Jaspers,
adopted it and greatly expanded and improved it.
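
To give an idea of the general approach (this is not the code in the zip
file), the Python sketch below pulls the <math>..</math> fragments out of
some wikitext and prepares, without running them, the latex and dvipng
commands that could render each one. The LaTeX preamble, the md5-based file
names and the default resolution are assumptions chosen for the example.

```python
import hashlib
import re

MATH_RE = re.compile(r"<math>(.*?)</math>", re.DOTALL)

# Minimal LaTeX wrapper; the preamble is an assumption for this sketch.
LATEX_TEMPLATE = (
    "\\documentclass{article}\n"
    "\\usepackage{amsmath,amssymb}\n"
    "\\pagestyle{empty}\n"
    "\\begin{document}\n"
    "$%s$\n"
    "\\end{document}\n"
)


def extract_math(wikitext):
    """Return the formulas found between <math> and </math>."""
    return [m.strip() for m in MATH_RE.findall(wikitext)]


def render_plan(formula, dpi=120):
    """Return (key, tex_source, commands) for one formula; nothing is run."""
    # Naming files after a hash of the formula is a hypothetical choice that
    # lets identical formulas share one image.
    key = hashlib.md5(formula.encode("utf-8")).hexdigest()
    tex_source = LATEX_TEMPLATE % formula
    commands = [
        ["latex", "-interaction=nonstopmode", f"{key}.tex"],
        # -D sets the resolution, -T tight crops to the formula, -o names the output.
        ["dvipng", "-D", str(dpi), "-T", "tight", "-o", f"{key}.png", f"{key}.dvi"],
    ]
    return key, tex_source, commands
```

A caller would write tex_source to the .tex file, run the two commands in
order (for example with subprocess.run) and check the exit codes, which is
where the error handling mentioned above would come in.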

----------------------------------------------------------------------------

In > WikiToTomeImages.pl < there is code to download and resize images from
the Wikimedia servers.
The code is rather complicated and is part of a larger job. Still, there may
be something to learn from it.

Images are downloaded and stored locally in nested folders similar to those
on the Wikimedia servers (based on positions 1 and 2 of the md5 hash of the
file name).
This process is incremental: images retained from earlier runs are not
downloaded again (yes, re-uploads are missed this way).
Images are resized using two tools: nconvert.exe (a command-line tool that
comes with XnView) and convert.exe (ImageMagick).
This resize process is also incremental: images resized with similar options
on earlier runs are not overwritten.
For PNG images the compression ratio of the image is checked. If the
compression is stronger than a certain threshold, the image probably contains
a map or diagram and is not resized (text would become unreadable) unless it
is more than x MB large.
Image metadata (EXIF) is removed.
Transparent PNGs are recolored (transparent to white; on Pocket PC a
transparent background would be shown as black, so this is actually a bug
workaround).
PNGs are compressed in two ways and the best result is retained.
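
The folder layout and the incremental download check can be sketched in
Python as follows. This is an illustration, not the Perl code: the function
names are made up, and replacing spaces with underscores before hashing
follows MediaWiki's file naming and is assumed to apply here as well.

```python
import hashlib
import os


def hashed_subpath(file_name):
    """Return 'a/ab/Name.ext', where 'ab' starts the md5 of the file name."""
    # MediaWiki stores file names with underscores instead of spaces.
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{name}"


def needs_download(root, file_name):
    """Incremental check: only download when no local copy exists yet."""
    local = os.path.join(root, *hashed_subpath(file_name).split("/"))
    return not os.path.exists(local)
```

Because the check only looks at whether a local file exists, a re-uploaded
image with the same name is never fetched again, which is the limitation
noted above.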

On any error, a special file with the image name is written; its size in
bytes explains the error.

Feel free to reuse for any purpose. Attribution would be neat.

Erik Zachte

----------------------------------------------------------------------------
Results of the download/resize process are logged as follows:

Image Processing Results:

Column O = Original image:
  NE = Image name invalid. No extension
  UE = Ignore image, unsupported file extension
  C  = Use custom sized image 'as is', never resize
  -  = Image downloaded successfully on previous run
  =  = Image downloaded successfully on this run
  D! = Image download failed
  S! = Image download failed on previous run. Skip download
  R! = Image download failed on previous run. Retry failed again.
  R  = Image download failed on previous run. Retry was successful
  M! = Image metrics could not be retrieved. Corrupt image

Column R = Resize image:
  N  = Never resize
  C  = No resize needed. Compression rate < 0.1
  S  = No resize needed. Small image
  F  = No resize needed (specified in custom filter)
  T  = No resize wanted (probably EasyTimeline time chart)
  L  = Resize proved ineffective: result larger than original
  -  = Image resized successfully on previous run
  =  = Image resized successfully on this run
  !! = Image resize failed on this run
  O! = Resize failed (no output found)
  I! = Resize failed (no image info could be retrieved)
  U! = Resize failed (unknown reason)
  !  = Image resize failed on previous run

Column U = Use image:
  O  = Use original image
  R  = Use resized image
  N  = Use none. Discard image

Special sizes for 'image files' on disk:
Original images folder:
  Size 1 = Download failed
  Size 2 = Image corrupt or image info could not be retrieved
  Size 3 = Resize operation produced no output. Makes original image
           suspicious too. Discard.
Resized images folder:
  Size 0 = No valid original image to start with
  Size 1 = No resize needed. Compression rate < 0.1 (might well be map or
           diagram)
  Size 2 = No resize needed. Very small image
  Size 3 = No resize needed (specified in custom filter)
  Size 4 = Resize proved ineffective: result larger than original
  Size 5 = Resize failed altogether, no output found
  Size 6 = Resize image corrupt, no image info could be retrieved
  Size 7 = No resize wanted. Probably EasyTimeline time chart
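
The marker-file idea can be illustrated with a short Python sketch: a file
of a few bytes stands in for the image, and its length is looked up in the
tables above. The function names and the filler byte are assumptions; the
size codes are the ones listed in this legend.

```python
import os

ORIGINAL_MARKERS = {
    1: "Download failed",
    2: "Image corrupt or image info could not be retrieved",
    3: "Resize operation produced no output",
}

RESIZED_MARKERS = {
    0: "No valid original image to start with",
    1: "No resize needed. Compression rate < 0.1",
    2: "No resize needed. Very small image",
    3: "No resize needed (specified in custom filter)",
    4: "Resize proved ineffective: result larger than original",
    5: "Resize failed altogether, no output found",
    6: "Resize image corrupt, no image info could be retrieved",
    7: "No resize wanted. Probably EasyTimeline time chart",
}


def write_marker(path, code):
    """Write a placeholder file whose length in bytes is the status code."""
    with open(path, "wb") as fh:
        fh.write(b"\0" * code)  # the byte value itself is arbitrary here


def read_marker(path, table):
    """Return the status text for a marker file, or None for a real image."""
    return table.get(os.path.getsize(path))
```

No real image is only a handful of bytes long, so a size lookup is enough to
tell a marker from image data on a later run.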

FSO = File size original image
FSR = File size resized image
CRO = Compress ratio original image
WHO = Width/Height original image
WHR = Width/Height resized image
DB  = Depth in bits (32 = transparency layer present) (new, does not
      function flawlessly)
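
The CRO values in the sample rows at the end of this log are consistent with
file size divided by the uncompressed size at 3 bytes per pixel (for
example 83939 / (480 * 360 * 3) = 0.162). The Python sketch below uses that
inferred formula; it is a reading of the sample data, not a statement about
the Perl code. The 0.1 threshold comes from the legend, while
size_limit_bytes is a hypothetical parameter standing in for the
unspecified "x Mb".

```python
def compress_ratio(file_size, width, height, bytes_per_pixel=3):
    """File size relative to raw pixel data (lower = better compressed)."""
    return file_size / (width * height * bytes_per_pixel)


def skip_png_resize(file_size, width, height, size_limit_bytes, threshold=0.1):
    """True if a PNG looks like a map or diagram and should stay unresized."""
    # Very large files are resized regardless of how well they compress.
    if file_size > size_limit_bytes:
        return False
    return compress_ratio(file_size, width, height) < threshold
```

For instance, a 20000-byte PNG of 800x600 pixels has a ratio of about 0.014
and would be left at its original size, provided it is under the size limit.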

Resize images to max 240 x 180 pixels (landscape) or 180 x 240 (portrait)
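
A minimal Python sketch of this size rule follows. It assumes that the
aspect ratio is kept, that images are never enlarged, that square images use
the landscape box, and that fractions are rounded down; with those
assumptions it reproduces the WHR values of the sample rows below (for
example 960x398 becomes 240x99). The function name is made up.

```python
def target_size(width, height, long_side=240, short_side=180):
    """Fit an image into 240x180 (landscape) or 180x240 (portrait)."""
    if width >= height:
        box_w, box_h = long_side, short_side
    else:
        box_w, box_h = short_side, long_side
    # Images that already fit are left alone (no upscaling).
    if width <= box_w and height <= box_h:
        return width, height
    # Integer comparison of aspect ratios: which side hits the box first?
    if width * box_h > height * box_w:
        return box_w, max(1, height * box_w // width)
    return max(1, width * box_h // height), box_h
```

Integer arithmetic is used throughout so that results do not depend on
floating-point rounding.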

=========================================================================

        O  R  U     FSO     FSR   CRO       WHO       WHR DB

      1 -  -  R   83939    8079 0.162   480x360   240x180 24 6/62/Zwijntje.JPG
      2 -  -  R   18155    6032 0.187   180x180   180x180 24 2/20/Zulia-Palafitos.jpg
      3 -  -  R   94566    2688 0.083   960x398    240x99 24 2/21/Zubia_jun.jpg
      4 S! N  N                                            0 7/74/Zoyus_Palpa.jpg
      5 -  -  R  452677    5044 0.192  1024x768   240x180 24 b/b7/Zona_Purace.jpg
      6 -  -  R  106150    4671 0.382   299x310   180x186 24 8/8c/Zionpictographs.jpg
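
For anyone who wants to post-process such a log, here is a rough Python
sketch of a row parser. It assumes that each row and its file path have been
joined on one line, that paths contain no spaces, and that the metric
columns are either all present or all empty, as in the sample rows; rows
with only some metrics are not handled. All names are made up for the
example.

```python
from typing import NamedTuple, Optional, Tuple


class ResultRow(NamedTuple):
    index: int
    original: str                     # column O
    resize: str                       # column R
    use: str                          # column U
    fso: Optional[int]                # file size original
    fsr: Optional[int]                # file size resized
    cro: Optional[float]              # compress ratio original
    who: Optional[Tuple[int, int]]    # width, height original
    whr: Optional[Tuple[int, int]]    # width, height resized
    depth: int                        # column DB
    path: str


def _size(text):
    width, height = text.split("x")
    return int(width), int(height)


def parse_row(line):
    """Parse one joined log row into a ResultRow."""
    tokens = line.split()
    index, original, resize, use = int(tokens[0]), tokens[1], tokens[2], tokens[3]
    depth, path = int(tokens[-2]), tokens[-1]
    metrics = tokens[4:-2]
    if len(metrics) == 5:
        fso, fsr, cro = int(metrics[0]), int(metrics[1]), float(metrics[2])
        who, whr = _size(metrics[3]), _size(metrics[4])
    else:
        # Failed downloads and similar rows carry no metrics.
        fso = fsr = cro = who = whr = None
    return ResultRow(index, original, resize, use, fso, fsr, cro, who, whr, depth, path)
```

Rows can then be filtered on the status columns, for example to list every
image whose U column is N.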