__/ [Andrew] on Friday 04 November 2005 16:49 \__
> In comp.infosystems.www.authoring.html on Tuesday 01 November 2005 05:56,
> Roy Schestowitz wrote:
>
>> __/ [Leif K-Brooks] on Tuesday 01 November 2005 01:16 \__
>>
>>> Roy Schestowitz wrote:
>>>> This will not prevent spyware like the Alexa/A9/Amazon toolbars (among
>>>> others) from crawling your password-protected pages, but it will at
>>>> least turn away human visitors who ought to remain outside. To
>>>> understand how this works (it is essentially JavaScript), look at the
>>>> source, change it and save it.
>>>
>>> Anyone with a clue can turn off JavaScript support in their browser.
>>> Security should not depend on clueless attackers.
>>
>> No JavaScript, no entry. *smile*
>>
>> ...still better than ActiveX
>>
>> ActiveX enabled, anybody in (including hijackers)
>>
>> Roy
>
> There is a flaw with this method, which is that if someone is visiting your
> "private" page and then goes to a completely different website, the private
> URL will be passed as the referrer to that website. Many websites' referrer
> logs are publicly available (this may or may not be with the
> intention/knowledge of the webmaster) and therefore, potentially, the links
> could be accessed by search engines, so your private content could appear
> in a search engine's results.
I never thought about this route. Thanks for pointing that out.
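To make the leak concrete, here is a minimal sketch (URLs are hypothetical, and the header object only simulates what the browser fills in automatically): when a visitor on a "private" page follows a link to an external site, the request to that site carries the private page's address in the Referer header, and from there it lands in the external site's logs.

```javascript
// Sketch of the outbound request a browser makes when a visitor clicks
// an external link from a password-protected page. The browser, not any
// page code, sets the Referer header; URLs here are hypothetical.
const privatePage = "http://example.org/private/secret-page.html";

const outboundRequestHeaders = {
  Host: "example.com", // the unrelated site being visited next
  // Filled in automatically by the browser with the page the visitor
  // came from -- i.e. the "secret" URL:
  Referer: privatePage,
};

// This is the value that ends up in example.com's referrer log.
console.log(outboundRequestHeaders.Referer);
```

If that log is public and gets crawled, the "secret" URL becomes just another link for a search engine to follow.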
> A partial solution, which I recommend you use, is to put the following in
> the head section of each private page.
>
> <meta name="ROBOTS" content="NOINDEX,NOFOLLOW,NOARCHIVE">
If the page contains sensitive content, I suppose 'shielding' it would indeed
be worthwhile. I would only like to stress that the information I 'hide' is
not confidential; it simply should not be easily accessible. Private material
like Palm data has always been password-protected.
> This only works with some search engines (but the major ones should all act
> on it).
>
> The preferred method of controlling search engine spiders is to use a
> robots.txt file but this will have two drawbacks:
>
> 1. You might not have access to the root directory of the domain or
> subdomain, which is where the robots.txt needs to go.
> 2. In any event, some people look at a site's robots.txt to "discover"
> directories the site owner would rather weren't known about, hence it is
> definitely *not* recommended for your situation.
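To illustrate the second drawback: robots.txt is itself a public file that anyone can fetch, so (with hypothetical paths) it reads like a map of exactly what the site owner wanted kept quiet:

```
# Fetchable by anyone at http://example.org/robots.txt
# Each Disallow line advertises the very path it is meant to hide.
User-agent: *
Disallow: /private/
Disallow: /palm-data/
```

Well-behaved crawlers will skip those directories, but a curious human will go straight to them.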
Yes, I once thought about that. Pages and sections where I deny crawlers
access at the robots.txt level are either:
- Sections that contain names, which I would rather people did not 'Google'
  (or 'Yahoo', etc.)
- Sections that are too extensive to crawl, as they would only add 'noise' to
  the search engines' indices.
Roy
--
Roy S. Schestowitz | Useless fact: A dragonfly only lives for one day
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
5:15am up 2 days 1:13, 4 users, load average: 0.25, 0.46, 0.42
http://iuron.com - next generation of search paradigms