__/ [ danish ] on Monday 18 September 2006 17:05 \__
> I have a site that uses PHP session IDs. I know that total
> elimination of these from the URL is what is recommended for optimal
> bot crawling and I am working on that, but is there any way, for now,
> to include a line in robots.txt that would ignore the "PHPSESSID"
> parameter?
>
> For example, the site works just fine when you visit this page:
>
>
> http://fixmyfamily.com/search_details.php?cid=41
>
>
> But by default it generates a URL like this:
> http://fixmyfamily.com/search_details.php?cid=41&PHPSESSID=0d8ff46dbd...
>
> What can be done right now so that Google doesn't crawl these session
> IDs and then store them and want to come back to them? Thanks in
> advance for your help. BTW, I don't want to disallow all
> "search_details.php" URLs..
Hi, this would probably be handled best by altering the generation of URLs
in the CMS, either by omitting the session parameter from links altogether
or by moving such pages into a (virtual) directory structure so that
robots.txt can exclude them (it can't/shouldn't do wildcards, though Google
is pushing towards breaking/'extending' the standards and conventions).
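As a rough sketch (assuming the site relies on PHP's built-in session
handling rather than something CMS-specific), the usual fix is to stop PHP
from rewriting links with the session ID in the first place, e.g. in
php.ini (or the php_flag equivalent in .htaccess):

    ; don't append PHPSESSID to URLs; rely on the session cookie instead
    session.use_trans_sid = 0
    session.use_only_cookies = 1

And if you want to experiment with Google's non-standard wildcard extension
to robots.txt (other crawlers will simply ignore it), it would look roughly
like this:

    User-agent: Googlebot
    # '*' is a Googlebot-only extension, not part of the robots.txt standard
    Disallow: /*PHPSESSID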
Session IDs are tricky. Are you sure the bots are being assigned a cookie?
I know that spyware-type tools will be passed such URLs, but I don't think
search engines will browse (crawl) with a cookie, which is exactly when PHP
falls back to putting the ID in the URL. There were similar questions in
this newsgroup before (session IDs and duplicates), so it's definitely
worth browsing the archive. It's also worth looking at the logs, filtering
by crawler type (or IP address), to see what is going on underneath the
surface. Another possibility is to view what Google has cached, e.g. using
"site:yoursite.suffix".
Best wishes,
Roy
--
Roy S. Schestowitz | /earth: file system full
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
5:20pm up 60 days 5:32, 7 users, load average: 0.40, 0.54, 0.64
http://iuron.com - Open Source knowledge engine project