OLPC Mesh with Wikipedia - P2P-based search service needed?

Michael Christen mc at anomic.de
Mon Nov 20 21:56:31 EST 2006


Hello SJ,

> This sounds fascinating; we definitely want to pursue such  
> approaches. Can you tell me more about the project and how it is  
> currently being used?

The project aims to produce a distributed web search engine. The  
target is currently the public internet, not local intranets. We want  
to produce a completely independent search engine, without any
central server, and we want all clients of the YaCy network to have
equal rights to add content to the web search index.

So far we have about 100 clients online at any given time (not always
the same ones), with a combined index of 300 million web pages. Each
YaCy installation can easily hold several million web pages and the
index of all words on them. We have users who have 20 million web
pages in a single installation.

YaCy consists of the following parts:
- a web spider/crawler with parsers for many content types (html, pdf,
word doc, open office, etc.)
- an integrated database engine
- a search interface based on web pages and an integrated web-server
- a p2p interface to other peers. Indexes are distributed in a DHT  
(distributed hash table)
- some communication tools (built-in web proxy, blog & wiki system, a  
DNS extension based on P2P routing)
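
To illustrate the DHT idea only: the sketch below assigns each word of an
index to one peer by hashing both words and peer names onto a ring. This is
not YaCy's actual distribution scheme; the SHA-1 hash, the peer names and the
function names are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_left


def ring_position(key: str) -> int:
    """Map a string to a position on the hash ring (SHA-1 is an assumed choice)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


def responsible_peer(word: str, peers: list[str]) -> str:
    """Return the peer whose ring position is the first at or after the word's,
    wrapping around to the first peer at the end of the ring."""
    ring = sorted((ring_position(p), p) for p in peers)
    positions = [pos for pos, _ in ring]
    i = bisect_left(positions, ring_position(word)) % len(ring)
    return ring[i][1]


peers = ["peer-a", "peer-b", "peer-c"]  # hypothetical peer names
for word in ["wikipedia", "laptop", "mesh"]:
    print(word, "->", responsible_peer(word, peers))
```

Every peer computes the same assignment from the same peer list, so a search
for a word can be sent directly to the peer that holds its index entries,
without any central server.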

YaCy does not depend on any other software; only Java 1.4 is
required. Its size is about 5 MB (without extra parsers). The parser
library needs 20 MB of disk space.

People use YaCy for several purposes:
- web search (not as good as Google... not yet). Anybody can start
a web crawl.
- social bookmarking (integrated bookmark system with public link  
rating; bookmarks can be tagged and public)
- as a web proxy (pages that pass YaCy are indexed and can be  
searched afterwards)
- running their own blog and/or wiki pages. Changes to one's own
blog/wiki are distributed as 'YaCy News' to other peers.


> We are currently planning the P2P system for
>  - discovering new materials published at nodes (by laptop users), and
>  - discovering material available 'on the web' but also accessible  
> via the local mesh [when the internet at large is not available].

'new materials':
if this means 'new files', then that cannot be done with YaCy at this
time, because we concentrate on web pages available on the public
internet. But the changes needed for the crawler to also index files
on a local file system wouldn't be too big.
But if you are thinking of blog and/or wiki content: this would be
instantly available, since YaCy has a built-in wiki and blog system.

"discovering material available 'on the web'":
I don't know what you mean by 'discovering', but this sounds like a
web search. That's what we do.
Regarding 'discovering': our idea was that the interesting web pages
are the ones other people have viewed in their web browsers.
Therefore we wanted to index all pages that people see; the technical
approach is that everybody uses a web proxy, and every page that
passes the proxy is indexed. This technique is built into YaCy.
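
As a rough illustration of 'pages that pass the proxy are indexed': the
sketch below keeps an in-memory word index that is fed whenever a page passes
through. It is not YaCy's code; the class name `ProxyIndex`, its methods and
the simple tokenizer are hypothetical.

```python
import re
from collections import defaultdict


class ProxyIndex:
    """Toy inverted index fed by pages that pass through a proxy."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of URLs containing it

    def page_passed(self, url, text):
        """Called for each page the proxy delivers; indexes every word on it."""
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[word].add(url)

    def search(self, query):
        """Return the URLs that contain all words of the query."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        return set.intersection(*(self.postings.get(w, set()) for w in words))


index = ProxyIndex()
index.page_passed("http://example.org/a", "Mesh networks for laptops")
index.page_passed("http://example.org/b", "Laptops in schools")
print(sorted(index.search("laptops")))
```

Only pages that somebody has actually viewed end up in such an index, which
is the point of the approach.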


> Our network will consist of laptops for every child and teacher;
> servers at each school, with more space, memory, and computing power;
> and fileservers at each region, with still more space -- suitable for
> backups and local caches of significant size.
>
> Where in this network do you see different parts of YaCy running?

YaCy is implemented in Java, and needs about 64MB of RAM. But the RAM
can be more or less, depending on what you do with it. Maybe it's
possible to run YaCy on every single laptop, if they restrict their
own index to no more than a few thousand web pages. If YaCy runs a
web crawl, it needs a lot of CPU and I/O capacity. My first guess for
the usage in your network would be:

every child's laptop:
maybe 'material' content indexing; then a search could be done even  
without the server at school. At least, every child could use its own  
wiki and blog, and changes to the content are distributed as 'news'  
to other children.

server at school:
web indexing within a cluster of all other servers, proxy for  
laptops, search interface for children. The search could be  
restricted to the content of the children's notebooks, or be less
restricted (covering the content of other schools, or the internet).

fileserver in region:
can do what the 'server at school' can do. It could also do backup
management.
It could hold a bigger web index; for example, to hold a web index of
10 million web pages I would recommend at least 500MB RAM (which is
not that much these days), or better 1GB RAM.
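
Taken at face value, the recommendation above works out to a per-page RAM
budget as in the sketch below. Reading MB and GB as decimal units is an
assumption, as is the helper name `ram_per_page`; nothing here says that
memory use scales linearly with index size.

```python
def ram_per_page(ram_bytes: int, pages: int) -> float:
    """Average RAM budget per indexed page, in bytes."""
    return ram_bytes / pages


PAGES = 10_000_000  # index size from the recommendation above

for label, ram_bytes in [("500MB", 500 * 10**6), ("1GB", 10**9)]:
    print(label, "->", ram_per_page(ram_bytes, PAGES), "bytes of RAM per page")
```

That is roughly 50 to 100 bytes of RAM per indexed page, so the bulk of the
index evidently has to be kept on disk rather than in memory.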


Most of these things could be done right now with the current version  
of YaCy. Everything about indexing of web pages inside an Intranet  
and of files in a local file system would need (most probably minor)  
software modifications.

> You might want to continue this discussion on the olpc-devel  
> mailing list:
> http://mailman.laptop.org/mailman/listinfo/devel

OK, I have subscribed to that list and am sending a copy of this mail
to devel at laptop.org

With kind regards
Michael Christen

http://yacy.net/yacy
see also:
http://en.wikipedia.org/wiki/YaCy

>
> Warmly,
>   SJ Klein
>   One Laptop per Child
>   +1 617 529 4266
>
>
> On Mon, 20 Nov 2006, Jennifer Lucien wrote:
>
>> ---------- Forwarded message ----------
>> Date: Nov 20, 2006 9:47 AM
>> Subject: OLPC Mesh with Wikipedia - P2P-based search service needed?
>> To: info at laptop.org
>>
>> Dear OLPC Project,
>>
>> In your OLPC News from 2006-08-05 you state that you think about
>> using a mesh to host snapshot-Portions of Wikipedia on OLPC-clients.
>>
>> In this context I would find it useful, if you could use a P2P-based
>> search service to search this Wikipedia-content, or probably much
>> more Web pages. There is a technology that implements a distributed
>> web search, called YaCy.
>>
>> I am the project leader of the YaCy-Project, a web search engine
>> licensed under the GPL. I would like to offer my service to implement
>> a distributed search service on OLPC-Computers, based on the YaCy
>> distributed web search engine.
>> (This is a working software, with a great user community in Germany
>> and many publications about it in German Computer Magazines)
>>
>> With kind regards
>> Michael Christen
>> http://yacy.net/yacy
>>
>>
>> -- 
>>
>>
>> Jennifer Lucien
>> One Laptop per Child
>> jenn at laptop.org
>>



