[Wikireader] Wikipedia to TomeRaider perl scripts
Erik Zachte
erikzachte at infodisiac.com
Fri May 2 21:46:33 EDT 2008
The newest Perl scripts for preparing the TomeRaider input file are online.
There may be some useful snippets for other projects.
http://infodisiac.com/Wikipedia/TomeRaider/WikiToTome_4_2_a.zip
The scripts implement 95% of the Wikimedia syntax and produce HTML output
from a wiki dump, in a form suited for TomeRaider,
which is basically one article title on a separate line followed by the
article content (HTML and CSS) on the subsequent line.
<new>title x\n
article x content\n
<new>title y\n
article y content\n
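As an illustration of the layout above, here is a minimal Python sketch (not
the Perl code from the zip file) that writes (title, html) pairs in this
form. The function names, the UTF-8 encoding and the collapsing of newlines
inside an article body are assumptions made for the example.

```python
import re

def to_tomeraider_lines(articles):
    """Yield input-file lines for (title, html) pairs: '<new>title', then content."""
    for title, html in articles:
        # Assumption: an article body must fit on one line, so collapse newlines.
        body = re.sub(r"\s*\n\s*", " ", html.strip())
        yield f"<new>{title.strip()}"
        yield body

def write_tomeraider_input(articles, path):
    """Write all articles to 'path', one title line and one content line each."""
    # Assumption: UTF-8 output; the encoding TomeRaider expects is not stated here.
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for line in to_tomeraider_lines(articles):
            out.write(line + "\n")
```

Usage: write_tomeraider_input([("title x", "<p>article x content</p>")], "wiki.txt")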
The last major addition was code to resolve templates including most but not
all conditional syntax (no #expr).
Nowadays I would never have started this project, now that a static html
dump is (sometimes) available, but in 2003 it seemed not a big deal to
rebuild the Wikimedia parser, and the project grew over the years.
----------------------------------------------------------------------------
One part that might be useful and can be run on its own is >
WikiToTomeToolGenerateMathImages <, which generates PNGs from embedded
<math>..</math> using dvipng, with several options and error handling.
I started the file but a retired German math teacher, Norbert Jaspers,
adopted the file and greatly expanded and improved it.
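For readers who only want the general idea, the following Python sketch shows
one way to go from <math>..</math> to a png via latex and dvipng. It is not
taken from WikiToTomeToolGenerateMathImages; the LaTeX preamble, the 120 dpi
default and all function names are assumptions, and the real script has more
options and error handling.

```python
import re
import subprocess
from pathlib import Path

# Assumed minimal LaTeX wrapper; the real script's preamble is not shown in this post.
TEX_TEMPLATE = (
    "\\documentclass{article}\n"
    "\\usepackage{amsmath,amssymb}\n"
    "\\pagestyle{empty}\n"
    "\\begin{document}\n"
    "$FORMULA$\n"
    "\\end{document}\n"
)

def extract_math(wikitext):
    """Return the contents of all <math>..</math> elements."""
    return re.findall(r"<math>(.*?)</math>", wikitext, flags=re.S)

def tex_source(formula):
    """Wrap one formula in a standalone LaTeX document."""
    return TEX_TEMPLATE.replace("FORMULA", formula.strip())

def dvipng_command(dvi_path, png_path, dpi=120):
    """Build a dvipng call: tight bounding box, given resolution, named output."""
    return ["dvipng", "-T", "tight", "-D", str(dpi), "-o", str(png_path), str(dvi_path)]

def render_formula(formula, workdir, name="formula"):
    """Run latex and dvipng in workdir; return (png_path, None) or (None, error)."""
    workdir = Path(workdir)
    (workdir / f"{name}.tex").write_text(tex_source(formula), encoding="utf-8")
    try:
        subprocess.run(["latex", "-interaction=nonstopmode", f"{name}.tex"],
                       cwd=workdir, check=True, capture_output=True)
        subprocess.run(dvipng_command(f"{name}.dvi", f"{name}.png"),
                       cwd=workdir, check=True, capture_output=True)
    except (OSError, subprocess.CalledProcessError) as err:
        return None, str(err)
    return workdir / f"{name}.png", None
```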
----------------------------------------------------------------------------
In > WikiToTomeImages.pl < there is code to download and resize images from
the Wikimedia servers.
The code is rather complicated, and part of a larger job. Still there may be
something to learn from it.
Images are downloaded and stored locally in nested folders similar to those
on the Wikimedia servers (based on positions 1 and 2 of the MD5 hash of the
file name).
This process is incremental: images retained from earlier runs are not
downloaded again (yes, re-uploads are missed this way).
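The folder layout and the incremental check can be sketched in Python as
follows. This is an illustration, not the code from WikiToTomeImages.pl: the
function names are made up, and replacing spaces with underscores before
hashing is an assumption about how the file name is normalized.

```python
import hashlib
from pathlib import Path

def hashed_subpath(file_name):
    """Return 'h/hh/name', where h and hh are the first 1 and 2 hex digits of the MD5."""
    # Assumption: spaces become underscores before hashing.
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{name}"

def needs_download(root, file_name):
    """Incremental rule: only download when no local copy exists yet."""
    return not (Path(root) / hashed_subpath(file_name)).exists()
```

Because only the existence of the local file is checked, a re-uploaded image
with the same name is never fetched again, as noted above.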
Images are resized using two tools: nconvert.exe (a command line tool that
comes with XnView) and convert.exe (ImageMagick).
This resize process is also incremental: images resized with similar options
on earlier runs are not overwritten.
For PNG images the compression ratio of the image is checked. If the image
compresses better than a certain threshold, it probably contains a map or
diagram and is not resized (text would become unreadable) unless it is
larger than x MB.
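A rough Python sketch of this check follows. The CRO values in the sample
log at the end of this post are consistent with file size divided by
width x height x 3 bytes (for example 83939 / (480 x 360 x 3) = 0.162), so
that formula is used here; it is inferred from the log, not copied from the
script. The 0.1 threshold comes from the log legend, while max_bytes stands
in for the unspecified "x MB" and is purely a placeholder.

```python
def compression_ratio(file_size, width, height, bytes_per_pixel=3):
    """Compressed size relative to the raw pixel data (inferred formula)."""
    return file_size / (width * height * bytes_per_pixel)

def should_resize_png(file_size, width, height,
                      ratio_threshold=0.1, max_bytes=1_000_000):
    """Skip resizing for highly compressible PNGs unless the file is very large."""
    highly_compressed = compression_ratio(file_size, width, height) < ratio_threshold
    # max_bytes is a hypothetical value for the 'x MB' limit mentioned above.
    if highly_compressed and file_size <= max_bytes:
        return False  # probably a map or diagram: keep the original
    return True
```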
Image metadata (EXIF) is removed.
Transparent PNGs are recolored (transparent to white; on Pocket PC a
transparent background would be shown as black, so this is a bug workaround).
PNGs are compressed in two ways and the best result is retained.
On any error a special file with the image name is written where the byte
size explains the error.
Feel free to reuse for any purpose. Attribution would be neat.
Erik Zachte
----------------------------------------------------------------------------
Results of the download/resize process are logged as follows:
Image Processing Results:
Column O = Original image:
  NE = Image name invalid. No extension
  UE = Ignore image, unsupported file extension
  C  = Use custom sized image 'as is', never resize
  -  = Image downloaded successfully on previous run
  =  = Image downloaded successfully on this run
  D! = Image download failed
  S! = Image download failed on previous run. Skip download
  R! = Image download failed on previous run. Retry failed again
  R  = Image download failed on previous run. Retry was successful
  M! = Image metrics could not be retrieved. Corrupt image
Column R = Resize image:
  N  = Never resize
  C  = No resize needed. Compression rate < 0.1
  S  = No resize needed. Small image
  F  = No resize needed (specified in custom filter)
  T  = No resize wanted (probably EasyTimeline time chart)
  L  = Resize proved ineffective: result larger than original
  -  = Image resized successfully on previous run
  =  = Image resized successfully on this run
  !! = Image resize failed on this run
  O! = Resize failed (no output found)
  I! = Resize failed (no image info could be retrieved)
  U! = Resize failed (unknown reason)
  !  = Image resize failed on previous run
Column U = Use image:
  O  = Use original image
  R  = Use resized image
  N  = Use none. Discard image
Special sizes for 'image files' on disk:
Original images folder:
  Size 1 = Download failed
  Size 2 = Image corrupt or image info could not be retrieved
  Size 3 = Resize operation produced no output. Makes original image
           suspicious too. Discard.
Resized images folder:
  Size 0 = No valid original image to start with
  Size 1 = No resize needed. Compression rate < 0.1 (might well be map or
           diagram)
  Size 2 = No resize needed. Very small image
  Size 3 = No resize needed (specified in custom filter)
  Size 4 = Resize proved ineffective: result larger than original
  Size 5 = Resize failed altogether, no output found
  Size 6 = Resized image corrupt, no image info could be retrieved
  Size 7 = No resize wanted. Probably EasyTimeline time chart
FSO = File size original image
FSR = File size resized image
CRO = Compression ratio original image
WHO = Width/Height original image
WHR = Width/Height resized image
DB  = Depth in bits (32 = transparency layer present) (new, does not
      function flawlessly)
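The special file sizes listed above work as status markers: a file of a few
bytes stored under the image name records why no usable image is there. A
minimal Python sketch of writing and reading such markers, with the meanings
taken from the legend above; the function names and the filler byte are
assumptions, not taken from the Perl scripts.

```python
import os

ORIGINAL_FOLDER_CODES = {
    1: "Download failed",
    2: "Image corrupt or image info could not be retrieved",
    3: "Resize operation produced no output",
}

RESIZED_FOLDER_CODES = {
    0: "No valid original image to start with",
    1: "No resize needed. Compression rate < 0.1",
    2: "No resize needed. Very small image",
    3: "No resize needed (specified in custom filter)",
    4: "Resize proved ineffective: result larger than original",
    5: "Resize failed altogether, no output found",
    6: "Resized image corrupt, no image info could be retrieved",
    7: "No resize wanted. Probably EasyTimeline time chart",
}

def write_marker(path, code):
    """Create a placeholder file whose size in bytes equals the status code."""
    with open(path, "wb") as marker:
        marker.write(b"!" * code)  # filler byte is arbitrary (assumption)

def read_marker(path, codes):
    """Return the status text for a marker file, or None for a real image."""
    return codes.get(os.path.getsize(path))
```

A real image is always larger than a few bytes, so sizes 0 to 7 cannot
collide with actual image files.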
Resize images to max 240 x 180 pixels (landscape) or 180 x 240 (portrait)
=========================================================================
 #  O   R  U     FSO   FSR    CRO  WHO       WHR      DB  File
 1  -   -  R   83939  8079  0.162  480x360   240x180  24  6/62/Zwijntje.JPG
 2  -   -  R   18155  6032  0.187  180x180   180x180  24  2/20/Zulia-Palafitos.jpg
 3  -   -  R   94566  2688  0.083  960x398   240x99   24  2/21/Zubia_jun.jpg
 4  S!  N  N       0                                      7/74/Zoyus_Palpa.jpg
 5  -   -  R  452677  5044  0.192  1024x768  240x180  24  b/b7/Zona_Purace.jpg
 6  -   -  R  106150  4671  0.382  299x310   180x186  24  8/8c/Zionpictographs.jpg
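The target sizes in the WHR column can be reproduced with a short Python
sketch: fit the image into a 240 x 180 box (landscape) or 180 x 240 box
(portrait), keep the aspect ratio, and never enlarge. Rounding down is
inferred from the sample rows (240x99 and 180x186); nconvert and ImageMagick
may round differently in other cases, and the function name is made up.

```python
def target_size(width, height, long_side=240, short_side=180):
    """Return (width, height) after fitting into the landscape or portrait box."""
    if width >= height:
        box_w, box_h = long_side, short_side   # landscape (or square)
    else:
        box_w, box_h = short_side, long_side   # portrait
    if width <= box_w and height <= box_h:
        return width, height                   # already small enough: no upscaling
    # Integer arithmetic avoids floating-point surprises; results are rounded down.
    if width * box_h >= height * box_w:
        return box_w, height * box_w // width  # width is the limiting dimension
    return width * box_h // height, box_h      # height is the limiting dimension
```

Usage: target_size(960, 398) gives (240, 99), matching row 3 above.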