OLPC Mesh with Wikipedia - P2P-based search service needed?
Michael Christen
mc at anomic.de
Mon Nov 20 21:56:31 EST 2006
Hello SJ,
> This sounds fascinating; we definitely want to pursue such
> approaches. Can you tell me more about the project and how it is
> currently being used?
The project aims to produce a distributed web search engine. The
target is currently the public internet, not local intranets. We want
to build a completely independent search engine without any central
server, in which all clients of the YaCy network have equal rights to
add content to the web search index.
So far we have about 100 clients online at any given time (not always
the same clients), with a total index of 300 million web pages. Each
YaCy installation can easily hold several million web pages and the
index of all words on them. Some users have 20 million web pages in a
single installation.
YaCy consists of the following parts:
- a web spider/crawler with parsers for many content types (html, pdf,
word doc, open office, etc.)
- an integrated database engine
- a search interface based on web pages and an integrated web server
- a p2p interface to other peers. Indexes are distributed in a DHT
(distributed hash table)
- some communication tools (built-in web proxy, blog & wiki system, a
DNS extension based on P2P routing)
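
To illustrate what "indexes are distributed in a DHT" can mean, here is a
minimal sketch in Python. It is not YaCy's code: the SHA-1 hash, the 2**32
ring, the peer names, and the names ToyDHT, ring_position and distribute are
all assumptions made for the example. Each word is hashed to a ring position
and assigned to the next peer on the ring, so every peer can compute which
peer holds the entries for a given word without asking a central server.

```python
import hashlib
from bisect import bisect_left


def ring_position(key: str) -> int:
    """Map a string to a position on a 2**32 ring (hash choice is an assumption)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class ToyDHT:
    """Hypothetical ring of peers; a word belongs to the next peer on the ring."""

    def __init__(self, peer_names):
        self.ring = sorted((ring_position(name), name) for name in peer_names)
        self.positions = [pos for pos, _ in self.ring]

    def responsible_peer(self, word: str) -> str:
        # Find the first peer at or after the word's position, wrapping around.
        idx = bisect_left(self.positions, ring_position(word.lower()))
        return self.ring[idx % len(self.ring)][1]


def distribute(index, dht):
    """Split a word -> set(urls) index into one share per responsible peer."""
    shares = {}
    for word, urls in index.items():
        peer = dht.responsible_peer(word)
        shares.setdefault(peer, {}).setdefault(word, set()).update(urls)
    return shares
```

A searching peer would call responsible_peer("wikipedia") and send the query
for that word to the peer it names.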
YaCy does not depend on any other software; only Java 1.4 is
required. Its size is about 5 MB (without extra parsers). The parser
library needs 20 MB of disk space.
People use YaCy for several purposes:
- web search (not as good as Google... not yet). Everybody can start
a web crawl.
- social bookmarking (integrated bookmark system with public link
rating; bookmarks can be tagged and made public)
- as a web proxy (pages that pass through YaCy are indexed and can be
searched afterwards)
- running their own blog and/or wiki pages. Changes to one's own
blog/wiki are distributed as 'YaCy News' to other peers.
> We are currently planning the P2P system for
> - discovering new materials published at nodes (by laptop users), and
> - discovering material available 'on the web' but also accessible
> via the local mesh [when the internet at large is not available].
'new materials':
If this means 'new files', then that cannot be done with YaCy at this
time, because we concentrate on web pages available on the public
internet. But the changes needed for the crawler to also index files
on a local file system wouldn't be too big.
But if you are thinking of blog and/or wiki content: this would be
instantly available, since YaCy has a built-in wiki system and blog
system.
"discovering material available 'on the web'":
I don't know what you mean by 'discovering', but this sounds like a
web search. That's what we do.
Regarding 'discovering': our idea was that interesting web pages are
those that other people have viewed in their web browsers. Therefore
we wanted to index all pages that people see; the technical approach
is that everybody uses a web proxy, and every page that passes through
the proxy is indexed. This technique is built into YaCy.
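
The proxy-based indexing described above can be sketched roughly as follows.
This is only an illustration in Python, not YaCy's implementation: the class
ProxyIndex and its methods page_passed and search are hypothetical names, and
the tokenizer is deliberately naive. The idea is that the proxy hands every
page it relays to the indexer, which records which URLs contain which words.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags; tag names and attributes are dropped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


class ProxyIndex:
    """Hypothetical inverted index fed by a web proxy."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of URLs

    def page_passed(self, url, html):
        # Called by the (assumed) proxy for every page it relays to a browser.
        extractor = _TextExtractor()
        extractor.feed(html)
        extractor.close()
        text = " ".join(extractor.chunks)
        for word in re.findall(r"\w+", text.lower()):
            self.postings[word].add(url)

    def search(self, query):
        # Return the URLs that contain every word of the query.
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result
```

After a few pages have passed through, search("mesh laptops") returns only
the URLs of pages that someone on the network has actually viewed and that
contain both words.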
> Our network will consist of laptops for every child and teacher;
> servers at each school, with more space, memory, and computing power;
> and fileservers at each region, with still more space -- suitable for
> backups and local caches of significant size.
>
> Where in this network do you see different parts of YaCy running?
YaCy is implemented in Java and needs about 64 MB of RAM, though the
RAM can be more or less depending on what you do with it. Maybe it's
possible to run YaCy on every single laptop, if each restricts its
own index to no more than a few thousand web pages. If YaCy runs a
web crawl, it needs strong CPU and I/O power. My first guess for the
usage in your network would be:
every child's laptop:
maybe 'material' content indexing; then a search could be done even
without the server at school. At the least, every child could use
their own wiki and blog, and changes to the content would be
distributed as 'news' to other children.
server at school:
web indexing within a cluster of all other servers, proxy for
laptops, search interface for children. The search could be
restricted to the content of the children's notebooks, or widened to
all content of other schools, or to the internet.
fileserver in region:
can do what the 'server at school' can do. It could also do backup
management.
It could hold a bigger web index; for example, to hold a web index of
10 million web pages I would recommend at least 500 MB RAM (which is
not that much these days), better 1 GB RAM.
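
For sizing a server, the recommendation above can be turned into a rough
per-page figure. The sketch below is only arithmetic on the numbers in this
mail; it assumes decimal megabytes and that RAM need grows linearly with the
number of indexed pages, neither of which is stated here, and the function
names are made up for the example.

```python
MB = 10**6  # decimal megabytes (assumption; the mail does not say which unit)


def implied_bytes_per_page(ram_bytes, pages):
    """RAM per indexed page implied by a recommendation."""
    return ram_bytes / pages


def ram_for_pages(pages, bytes_per_page):
    """RAM in bytes for an index of `pages` pages, assuming linear scaling."""
    return pages * bytes_per_page


minimum = implied_bytes_per_page(500 * MB, 10_000_000)       # "at least 500 MB"
comfortable = implied_bytes_per_page(1000 * MB, 10_000_000)  # "better 1 GB"

# Under the same assumptions, a 20 million page index at the minimum rate:
ram_20m_mb = ram_for_pages(20_000_000, minimum) / MB
```

With these assumptions the recommendation works out to 50 to 100 bytes of RAM
per indexed page; real requirements depend on what the installation is doing.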
Most of these things could be done right now with the current version
of YaCy. Anything involving indexing of web pages inside an intranet
or of files in a local file system would need (most probably minor)
software modifications.
> You might want to continue this discussion on the olpc-devel
> mailing list:
> http://mailman.laptop.org/mailman/listinfo/devel
OK, I have subscribed to that list and am sending a copy of this mail to
devel at laptop.org
With kind regards
Michael Christen
http://yacy.net/yacy
see also:
http://en.wikipedia.org/wiki/YaCy
>
> Warmly,
> SJ Klein
> One Laptop per Child
> +1 617 529 4266
>
>
> On Mon, 20 Nov 2006, Jennifer Lucien wrote:
>
>> ---------- Forwarded message ----------
>> Date: Nov 20, 2006 9:47 AM
>> Subject: OLPC Mesh with Wikipedia - P2P-based search service needed?
>> To: info at laptop.org
>>
>> Dear OLPC Project,
>>
>> In your OLPC News from 2006-08-05 you state that you are thinking about
>> using a mesh to host snapshot portions of Wikipedia on OLPC clients.
>>
>> In this context I would find it useful if you could use a P2P-based
>> search service to search this Wikipedia content, or probably many
>> more web pages. There is a technology that implements a distributed
>> web search, called YaCy.
>>
>> I am the project leader of the YaCy-Project, a web search engine
>> licensed under the GPL. I would like to offer my service to implement
>> a distributed search service on OLPC-Computers, based on the YaCy
>> distributed web search engine.
>> (This is working software, with a great user community in Germany
>> and many publications about it in German computer magazines.)
>>
>> With kind regards
>> Michael Christen
>> http://yacy.net/yacy
>>
>>
>> --
>>
>>
>> Jennifer Lucien
>> One Laptop per Child
>> jenn at laptop.org
>>