
Re: Building a search engine

  • Subject: Re: Building a search engine
  • From: Roy Schestowitz <newsgroups@schestowitz.com>
  • Date: Sat, 16 Jul 2005 04:47:51 +0100
  • Newsgroups: alt.internet.search-engines
  • Organization: schestowitz.com / Manchester University
  • References: <42D87390.7801D43C@uoft.com>
  • Reply-to: newsgroups@schestowitz.com
  • User-agent: KNode/0.7.2
MiniME wrote:

> Hi all
> I am looking for a free software that would help me to build a search
> engine.
> It is not something that has to be state of the art but I am looking for
> scalability.
> My target is to index around 500 sites. I don't know yet the equivalent
> size
> of these in HTML pages (I mean the number of indexed pages).
> I was looking at Nutch when I came across to this page:
> http://www.searchtools.com/index.html
> There is a "Tool list" that impressed me and left me with no words. I am
> new to this domain
> and I have no idea what would be the best fit for my purposes.
> I am planning to run it on Linux on a regular PC. It must be free, easy
> to setup and maintain.
> Thank you in advance
> MiniMe

First of all, allow me to congratulate you on this initiative. Never forget
that Google started by creating an engine to index and search the Stanford
University Web site.

You made an excellent choice by going with Linux. The best tools will be
available to you and the price will be nil. As far as I can tell, even
Brin was using Python and other tools that were *NIX-related. You will
first need to look into 'wget' or equivalent. For example:

wget -r -l1 -H -t1 -nd -N -np -A.html -erobots=on http://site.org

See

$ man wget

for more details. Make sure you honour the robots.txt file and pace your
requests, or else you can hammer the server and get reported for abuse.
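
If you later write your own fetcher rather than relying on wget, the same
robots.txt check can be done in code. The following is only a rough sketch
using Python's standard urllib.robotparser module. The crawler name
"MyCrawler", the sample robots.txt contents and the site.org URLs are
made up for illustration; a real crawler would download the file from the
site first.

```python
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt contents, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = make_parser(SAMPLE_ROBOTS)

# "MyCrawler" is an assumed user-agent name, not anything standard.
allowed = parser.can_fetch("MyCrawler", "http://site.org/index.html")
blocked = parser.can_fetch("MyCrawler", "http://site.org/private/a.html")
delay = parser.crawl_delay("MyCrawler")  # seconds to wait between requests

print(allowed, blocked, delay)
```

Sleeping for the crawl delay (or a second or two when none is given)
between requests keeps the load on the remote server reasonable.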

You then need to scan the HTML files (or whatever type suits your purpose)
for tokens (e.g. words). Use tools like lex or YACC to do that. You
probably want to create some token tables and perform some analysis upon
them. If you wish to index just 500 sites, consider starting with just 5
for testing purposes. Also make sure you have enough storage space, plenty
of RAM, high computational power, but above all, _bandwidth_. You cannot do
much unless you have a Gigabit backbone or fast Ethernet.
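
To give a rough idea of the token-table step, here is a minimal sketch in
Python. Note that it swaps lex/YACC for Python's built-in html.parser and
a simple regular expression, which is an assumption on my part about what
is good enough for a first test with a handful of sites. The names
TextExtractor, tokenize and build_index are invented for this example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def tokenize(html):
    """Return lower-cased alphanumeric tokens found in the page text."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    return re.findall(r"[a-z0-9]+", text)


def build_index(pages):
    """Map each token to {url: occurrence count} (a small inverted index)."""
    index = defaultdict(lambda: defaultdict(int))
    for url, html in pages.items():
        for token in tokenize(html):
            index[token][url] += 1
    return index


# Made-up pages standing in for files fetched by wget.
pages = {
    "http://site.org/a.html": "<h1>Search engines</h1><p>Build a search engine.</p>",
    "http://site.org/b.html": "<p>Linux tools</p><script>var x = 1;</script>",
}
index = build_index(pages)
print(dict(index["search"]))
```

Looking up a word in the index gives the pages that contain it along with
a count, which is the simplest thing you can rank results by.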

In the future, please use a subject line that is meaningful. "check this
out", "help please" or "how do I do this" make it impossible for readers to
discern posts where they can help from those which are utter gibberish or
jargon to them.

Hope it helps,


Roy S. Schestowitz
