#1570 NORM Untriag: datastore fail to index many pdfs
Zarro Boogs per Child
bugtracker at laptop.org
Thu May 24 13:54:49 EDT 2007
#1570: datastore fail to index many pdfs
-----------------------+----------------------------------------------------
Reporter: tomeu | Owner: bcsaller
Type: defect | Status: new
Priority: normal | Milestone: Untriaged
Component: datastore | Version:
Keywords: | Verified: 0
-----------------------+----------------------------------------------------
Trying to index an entry with this pdf:
http://fredrik.hubbe.net/plugger/test.pdf gives this error:
{{{
Failed example:
    ds.update(uid, dict(title="Same entry with some other content in pdf"), 'test2.pdf')
Exception raised:
    Traceback (most recent call last):
      File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/doctest.py", line 1248, in __run
        compileflags, 1) in test.globs
      File "<doctest sugar_demo_may17.txt[29]>", line 1, in ?
        ds.update(uid, dict(title="Same entry with some other content in pdf"), 'test2.pdf')
      File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-packages/olpc/datastore/datastore.py", line 229, in update
        self.querymanager.update(uid, props, filename)
      File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-packages/olpc/datastore/query.py", line 111, in update
        if file: self.fulltext_index(content, file)
      File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-packages/olpc/datastore/query.py", line 478, in fulltext_index
        self._ft_index(content.id, fp, piece)
      File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-packages/olpc/datastore/query.py", line 481, in _ft_index
        doc = [piece(p) for p in fp]
      File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-packages/olpc/datastore/converter.py", line 41, in next
        def next(self): return self.filter(self._fp.next())
      File "/home/tomeu/sugar-jhbuild/build/lib/python2.4/site-packages/olpc/datastore/converter.py", line 45, in filter
        return line.encode('utf-8')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 42: ordinal not in range(128)
}}}
This also happens with many of the PDFs that turn up when searching
Google for "pdf test".
pdftotext itself appears to extract the text correctly, though.
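
A possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the converter hands filter() a byte string, and under Python 2 calling .encode('utf-8') on a byte string first decodes it implicitly as ASCII, which fails on byte 0xe2 (a lead byte of UTF-8 sequences such as curly quotes). The sketch below decodes explicitly before encoding. The name to_utf8 and the assumption that the extracted text is UTF-8 are hypothetical, not taken from the datastore code, and the sketch is written for a current Python interpreter rather than 2.4.

```python
def to_utf8(line, source_encoding="utf-8"):
    """Return `line` as UTF-8 encoded bytes.

    Hypothetical helper: assumes byte input is encoded in
    `source_encoding` (UTF-8 here, which is an assumption).
    """
    text_type = type(u"")
    if not isinstance(line, text_type):
        # Byte string: decode explicitly instead of relying on an
        # implicit ASCII decode. Undecodable bytes become U+FFFD.
        line = line.decode(source_encoding, "replace")
    return line.encode("utf-8")


# Bytes containing 0xe2, as in the reported error.
sample = b"quote: \xe2\x80\x9cpdf test\xe2\x80\x9d"
result = to_utf8(sample)
```

Using "replace" trades fidelity for robustness: a malformed byte no longer aborts indexing, at the cost of a replacement character in the indexed text.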
--
Ticket URL: <http://dev.laptop.org/ticket/1570>
One Laptop Per Child <http://laptop.org/>