#1570 NORM Untriag: datastore fail to index many pdfs

Thu May 24 13:54:49 EDT 2007

#1570: datastore fail to index many pdfs
-----------------------+----------------------------------------------------
 Reporter:  tomeu      |       Owner:  bcsaller 
     Type:  defect     |      Status:  new      
 Priority:  normal     |   Milestone:  Untriaged
Component:  datastore  |     Version:           
 Keywords:             |    Verified:  0        
-----------------------+----------------------------------------------------
 Trying to index an entry with this pdf:
 http://fredrik.hubbe.net/plugger/test.pdf gives this error:
 {{{
 Failed example:
     ds.update(uid, dict(title="Same entry with some other content in
 pdf"), 'test2.pdf')
 Exception raised:
     Traceback (most recent call last):
       File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/doctest.py",
 line 1248, in __run
         compileflags, 1) in test.globs
       File "<doctest sugar_demo_may17.txt[29]>", line 1, in ?
         ds.update(uid, dict(title="Same entry with some other content in
 pdf"), 'test2.pdf')
       File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-
 packages/olpc/datastore/datastore.py", line 229, in update
         self.querymanager.update(uid, props, filename)
       File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-
 packages/olpc/datastore/query.py", line 111, in update
         if file: self.fulltext_index(content, file)
       File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-
 packages/olpc/datastore/query.py", line 478, in fulltext_index
         self._ft_index(content.id, fp, piece)
       File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-
 packages/olpc/datastore/query.py", line 481, in _ft_index
         doc = [piece(p) for p in fp]
       File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-
 packages/olpc/datastore/converter.py", line 41, in next
         def next(self): return self.filter(self._fp.next())
       File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-
 packages/olpc/datastore/converter.py", line 45, in filter
         return line.encode('utf-8')
     UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position
 42: ordinal not in range(128)
 }}}

 This also happens with many of the PDFs that come up when searching
 Google for "pdf test".

 pdftotext seems to extract the text correctly, though.
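
 One plausible reading of the traceback: in Python 2, calling
 .encode('utf-8') on a byte string first decodes it implicitly as ASCII,
 which fails on any non-ASCII byte such as 0xe2. The sketch below is
 written for current Python and only illustrates the idea. It assumes the
 text coming from pdftotext is UTF-8, which the ticket does not confirm,
 and to_utf8 and implicit_ascii_decode are hypothetical helpers, not the
 actual code in converter.py.

```python
def implicit_ascii_decode(raw):
    """Mimic the implicit step Python 2 performs before str.encode().

    Raises UnicodeDecodeError for any byte outside the ASCII range.
    """
    return raw.decode("ascii")


def to_utf8(line, source_encoding="utf-8"):
    """Return `line` as UTF-8 encoded bytes.

    `source_encoding` is an assumption about the extractor's output.
    Bytes are decoded explicitly, and undecodable bytes are replaced
    with U+FFFD instead of aborting the whole indexing run.
    """
    if isinstance(line, bytes):
        line = line.decode(source_encoding, "replace")
    return line.encode("utf-8")


if __name__ == "__main__":
    # 0xe2 0x80 0x9c is the UTF-8 encoding of a left curly quote.
    sample = b"some text \xe2\x80\x9cquoted\xe2\x80\x9d"
    try:
        implicit_ascii_decode(sample)
    except UnicodeDecodeError as exc:
        print("implicit ASCII decode fails:", exc)
    print(to_utf8(sample))
```

 Replacing undecodable bytes loses a little text but keeps the rest of the
 document indexable; whether that trade-off suits the datastore is an open
 question.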

-- 
Ticket URL: <http://dev.laptop.org/ticket/1570>
One Laptop Per Child <http://laptop.org/>