Re: Alexa Internet archive crawler gone wild?

Subject: Re: Alexa Internet archive crawler gone wild?
From: Carol W <from_you@xxxxxxxxxx>
Date: Tue, 03 Jan 2006 02:31:15 -0500
Newsgroups: alt.internet.search-engines
Organization: AllTheNewsgroups.com
References: <41u6opF1g7kd9U1@news.dfncis.de> <UCluf.153350$dP1.511345@newsc.telia.net> <dpd09e$1mof$2@godfrey.mcc.ac.uk> <pu6kr1dma2ues8pt3c3391uij88r2c7ll1@4ax.com> <dpd7vk$2vs$1@godfrey.mcc.ac.uk>
Xref: news.mcc.ac.uk alt.internet.search-engines:73901

On Tue, 03 Jan 2006 07:09:01 +0000, Roy Schestowitz
<newsgroups@xxxxxxxxxxxxxxx> wrote:

>__/ [Carol W] on Tuesday 03 January 2006 06:41 \__
>
>> On Tue, 03 Jan 2006 04:57:44 +0000, Roy Schestowitz
>> <newsgroups@xxxxxxxxxxxxxxx> wrote:
>> 
>> [snip]
>> 
>>>What's the benefit of permitting Alexa to crawl though? Having the site
>>>archived for someone to look back at deleted content in the future? Have
>>>we not learned the lesson yet?
>> 
>> Actually it can be helpful to have an archived copy - even if that
>> particular content or site becomes deleted at a later date. I have
>> used the web archive to help locate some information or data that had
>> been deleted or removed from the web.
>> 
>> Carol
>
>...But if the size of a site does not exceed gigabytes (particularly when
>compressed), why not make use of private storage, which is often very
>cheap. You can stack up a progressive backup for just a few quid. If
>resilience is important, you can duplicate the content periodically. The
>Web Archive is slower to access and it tends to mix objects that were
>collected at different timepoints.

Well, that is an option, and it would work if people snagged copies of
things while the site still existed. However, some sites close up with
next to no warning. Or people, when first visiting a site, may not
need a copy of that particular information at the time, but later
want to find a copy of it because of something they are researching.

You know what they say about hindsight being 20/20 though ... 

>Another issue is people having access to content which was *accidentally*
>made public, or even finding the roots of a site whose 'image' has evolved.

I think the web archive folks may take the circumstances into
consideration when deciding whether to remove some content at the
request of the site owner. I mean, even Google Groups/Deja News
offer/offered people a way to remove some of their own posts that
they no longer cared to have in that archive.

>Having said that, the Web Archive can be useful to the user. 

*nod* Archival sites can be helpful indeed. What makes them useful is
that they will archive what is accessible (within the scope of what
they want to archive) and willingly store it for others to make use
of at a later date.

[snip] 
>Lastly, as the OP points out, Alexa can have a noticeable cost, whether
>that cost is latency when serving visitors and search engines (crawlers)
>or even the traffic (hosting) bill. Rarely is there something to be
>gained.

That can be said of any spider or bot that "goes nuts" on a site. I
seem to recall some folks complaining about spiders from [name of any
one of the Top 3 search engines inserted here] to the point that they
added snippets to their robots.txt files to turn away or try to curb
those spiders/bots for a while. However, the OP also said that this
behavior has only been displayed for the past couple of days, so it is
unusual by his own observation too. If that is the case, then he could
temporarily deny Alexa's spider access to his site.
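
For what it's worth, here is a minimal sketch of such a temporary
block. It assumes Alexa's crawler identifies itself as ia_archiver
(the user-agent token it has historically used; check your own logs
before relying on it). The robots.txt body, the example.com URL and
the is_allowed helper are all hypothetical. Python's standard
urllib.robotparser is used only to sanity-check the rules before
uploading them:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. "ia_archiver" is assumed to be the token
# Alexa's crawler sends; the catch-all section leaves other bots alone.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/index.html"  # placeholder site
    print("ia_archiver allowed:", is_allowed(ROBOTS_TXT, "ia_archiver", url))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Removing the ia_archiver section again once things calm down restores
the previous behavior. Keep in mind that robots.txt is only a request,
so a bot that ignores it would have to be blocked at the server instead.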

Carol
