__/ [h7qvnk7q001@xxxxxxxxxxxxxx] on Monday 09 January 2006 00:41 \__
> When Google started, its spiders attempted to find and index _all_
> content on the Internet: only the relative ranking in search results
> was based on references. Google did an excellent job, and they quickly
> took over for AltaVista, which was previously the "best" search engine.
True statement. The intent was to crawl the entire World Wide Web; pages
that date back to '98 confirm that. Schmidt now intends to "index all of
human knowledge", whatever that means. I don't think pattern recognition
and the indexing of books are enough to accomplish that.
> Now sites must be "popular" to get indexed at all by Google.
>
> I have several sites where "rare" information is available that are no
> longer indexed in Google. Searching for a specific phrase returns
> nothing in Google.
> The information has been on the Internet at the same
> location for over 5 years, so it is not a matter of discovery.
There is a problem due to growth. Web space is rather cheap, so you see
plenty of sites around that contain millions of pages. How can this ever
be handled 'fairly' without flooding the results with noise?
> In my opinion, this selective indexing eliminates Google as a true
> research tool. I often search for obscure information, and come up with
> nothing in Google. Now I realize that the information may be actually
> out there on the Internet, but just not indexed due to lack of
> "popularity".
Do you know how vast the Web has become? With all these auto-generated
doorway pages, you could never crawl everything, not even with Google's
hardware.
It's a cat-and-mouse game. If you have an SE that crawls *everything*,
Webmasters will be encouraged to spew even more junk onto their Web
space. It's not hard: scraping results pages and FTP'ing them up with a
templated, SEO-friendly structure can be done by brute force with an
hour of programming. One could write a script to do just that and add
about 1 million pages per hour with decent hardware and a fast Ethernet
connection. If plagiarism is a concern, be aware that some blackhat
sites scrape small random bits from many sites, link to them (for
attribution), and then glue everything together. Since it's all fetched
from a given SERP, the content is almost coherent, too.
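
Just to illustrate how little effort that takes, here is a minimal
Python sketch. The template, keywords, and snippet strings are made-up
placeholders; a real junk-page setup would scrape the snippets from a
results page and push the output to a host with ftplib rather than
writing files locally.

# Minimal sketch only: the keywords and snippets below are placeholders.
# A real junk-page generator would scrape them from a results page and
# upload the output with ftplib instead of writing to a local directory.
import os

TEMPLATE = """<html><head><title>%(keyword)s</title></head>
<body><h1>%(keyword)s</h1><p>%(body)s</p></body></html>"""

keywords = ["some keyword", "another keyword"]             # placeholder
snippets = ["scraped snippet one", "scraped snippet two"]  # placeholder

os.makedirs("junk", exist_ok=True)
for i, keyword in enumerate(keywords):
    page = TEMPLATE % {"keyword": keyword, "body": " ".join(snippets)}
    with open(os.path.join("junk", "page%06d.html" % i), "w") as out:
        out.write(page)   # one templated page per keyword

Loop that over a big keyword list and the million-pages-per-hour figure
is not far-fetched.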
> Does anybody know of any search engine that attempts to search and
> index the _whole_ Internet? Or is that considered to be impossible?
>
> I'd like to know what the next great search engine will be, now that
> Google seems to have given up on maintaining that position.
There is the issue of the duplicability of data. It costs no money or
labour to produce Web pages and deliver them. It is the same issue with
software, whose value depends on the number of copies it sells; there
are no production costs. Therefore, what you suggest is impossible, and
some street-wise reasoning shows why. What if everyone uploaded his/her
hard drive to the Web? The search engines would need storage capacity
equal to everything we already hold... not to mention vast tables for
indexing purposes.
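
A made-up back-of-the-envelope, just to show the order of magnitude
(both figures below are arbitrary assumptions, not anything measured):

# Arbitrary illustrative figures; only the order of magnitude matters.
users = 1000000000      # say, a billion people online
gb_per_user = 100       # say, a 100 GB drive each
total_gb = users * gb_per_user
print(total_gb / 1e9, "exabytes")   # -> 100.0 exabytes

No crawler is going to keep a copy of all that, let alone index it.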
Roy
--
Roy S. Schestowitz
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
5:30am up 29 days 12:41, 14 users, load average: 0.45, 0.66, 0.68
http://iuron.com - next generation of search paradigms