Re: robots.txt

Subject: Re: robots.txt
From: Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx>
Date: Fri, 21 Apr 2006 03:15:34 +0100
Newsgroups: alt.internet.search-engines
Organization: schestowitz.com / MCC / Manchester University
References: <443cf71c$1@news.broadpark.no> <uaup32hf6p0ltl7p09cteht0sbj8smk79o@4ax.com> <443cfa7d$1@news.broadpark.no> <gpp%f.43185$8Q3.14413@fe1.news.blueyonder.co.uk>
Reply-to: newsgroups@xxxxxxxxxxxxxxx
User-agent: KNode/0.7.2

__/ [ Eric Johnston ] on Thursday 13 April 2006 11:07 \__

> 
> "Per-Erik Skramstad" <webmaster@xxxxxxxxxxxxxxxxxxxxxx> wrote in message
> news:443cfa7d$1@xxxxxxxxxxxxxxxxxxxx
>> Big Bill wrote:
>>> On Wed, 12 Apr 2006 14:48:30 +0200, Per-Erik Skramstad
>>> <webmaster@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>>> What am I to write in the robots.txt file to make robots ignore a whole
>>>> folder? I tried Disallow: /foldername/ but it didn't seem to help
>>>
>>> Show us your robots.txt and we'll have a look at it.
>>
>> http://www.rsil.no/robots.txt
>>
>>
>> --
>>
>> Per-Erik Skramstad
>> http://www.korrekturavdelingen.no
> 
> Just some guesses / ideas ...
> 
> Try removing the spurious space character at the end of the line
>     Disallow: /WORDogPDF/
> but retain a carriage return or line feed character to properly terminate
> the line.
> 
> There may be some confusion about your URL name:
> ht tp://www.rsil.n o/WORDogPDF  gets changed to ht tp://rsil.n o/WORDogPDF/
> ht tp://www.rsil.n o/WORDogPDF/  stays as ht tp://www.rsil.n o/WORDogPDF/
> Why ?
> You use relative addressing for your page-to-page links, and once someone or
> something has gone to  ht tp://rsil.n o/ there is an entire duplicate of your
> site for them to browse.  This will confuse search engines and may
> incur a duplicate-site penalty.  It has something to do with server
> configuration and DNS setup.  None of your pages gives any indication of
> what the URL of the site is.  I think you can use something like
> <base href="ht tp://ww w.demo.com/" /> in the head to clarify.
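
The trailing-space suggestion quoted above can be checked mechanically
rather than by eye. A rough sketch in Python, assuming the file looks
something like the sample below (the User-agent line is my assumption; only
the Disallow line is mentioned in this thread, and both function names are
made up for illustration):

```python
def find_trailing_whitespace(robots_txt):
    """Return the 1-based numbers of lines that end in a space or tab."""
    return [
        number
        for number, line in enumerate(robots_txt.splitlines(), start=1)
        if line != line.rstrip(" \t")
    ]


def strip_trailing_whitespace(robots_txt):
    """Drop trailing spaces/tabs but keep one line feed at the end of each line."""
    lines = [line.rstrip(" \t") for line in robots_txt.splitlines()]
    return "\n".join(lines) + "\n"


# Hypothetical sample: note the stray space after the final slash.
sample = "User-agent: *\nDisallow: /WORDogPDF/ \n"

print(find_trailing_whitespace(sample))
print(repr(strip_trailing_whitespace(sample)))
```

Whether a given crawler actually trips over the stray space is another
matter; many parsers trim each line anyway, so treat this as tidying up
rather than a guaranteed fix.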


I don't want to step on anybody's toes, but the one-stop place for quick
reference is probably:

        http://www.robotstxt.org/wc/robots.html

It explains everything that is supported and does so fairly well.
Note that the standard itself does not define wildcards in Disallow paths;
some crawlers accept them as an extension, so don't rely on them.
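
The prefix matching described there can be tried out offline before you
wait for a crawler to come round. A minimal sketch, assuming Python's
standard urllib.robotparser module and a two-line robots.txt (the
User-agent line, the "ExampleBot" name and the is_allowed helper are my
own assumptions, not taken from the real file):

```python
from urllib.robotparser import RobotFileParser

# Assumed contents; only the Disallow line is known from the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /WORDogPDF/
"""


def is_allowed(robots_txt, url, agent="ExampleBot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for url in (
    "http://www.rsil.no/WORDogPDF/file.pdf",  # inside the disallowed folder
    "http://www.rsil.no/index.html",          # outside it
    "http://www.rsil.no/wordogpdf/file.pdf",  # different case, so a different path
):
    print(url, is_allowed(ROBOTS_TXT, url))
```

This only tells you how Python's parser reads the file. Real crawlers have
their own parsers, so take it as a sanity check, not as proof of what any
particular search engine will do.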


> If a search engine read the content before you added the robots.txt it may
> take many months or several years for it to be removed from the search
> engine records.  Try asking the search engines to delete.  Google have a
> special file deletion process that works well, but you need to check again
> about 6 months later in case it re-appears.  If other sites have already
> put links to the unwanted file you can expect calls for the file for ever
> more....
> 
> Best regards, Eric.


Have a look here:

        http://services.google.com:8882/urlconsole/controller

It takes its removal instructions from your robots.txt.

Best wishes,

Roy

-- 
Roy S. Schestowitz      |   Apache: commercial software's days are numbered
http://Schestowitz.com  |    SuSE Linux    ¦     PGP-Key: 0x74572E8E
  3:10am  up 43 days 16:53,  6 users,  load average: 1.18, 1.04, 1.03
      http://iuron.com - next generation of search paradigms
