
Re: Fuzzy Matches for Content Mirrors

Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> wrote:

> __/ On Friday 26 August 2005 11:29, [John Bokma] wrote : \__
>>> Sorry, but I must disagree. Let us say that T is the original page
>>> and F (false) is the copy.
>>> If F = T + A where A is some extra content, then you have problems
>> Not really, you can define similarities based on sentences, words,
>> etc. You don't have to look for exact matches. Similar is close
>> enough. 
> ...and very computationally-expensive.

Today, maybe. Tomorrow? Who knows. I can imagine something like the Soundex 
algorithm, but for sentences or even whole paragraphs. E.g. a code or vector 
can be calculated for each paragraph when a page is fetched. Fetching and 
comparing those vectors within a database is not that much harder than the 
duplicate-content check which is already happening.
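A minimal sketch of such a per-paragraph fingerprint, using a SimHash-style scheme (the function names are my own, and this is just one possible "Soundex for paragraphs" — not a description of what any search engine actually does):

```python
import hashlib

def simhash(text, bits=64):
    """Compute a SimHash-style fingerprint of a paragraph.
    Each word votes +1/-1 on every bit position via its hash;
    the sign of the total decides the bit. Similar paragraphs
    tend to end up at a small Hamming distance."""
    weights = [0] * bits
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Comparing two fingerprints is a cheap XOR plus a popcount."""
    return bin(a ^ b).count("1")

original = "the quick brown fox jumps over the lazy dog"
near_copy = "the quick brown fox leaps over the lazy dog"

# Fingerprints are computed once at fetch time; later comparisons
# only need the integers, not the original text.
d = hamming(simhash(original), simhash(near_copy))
```

The point of the sketch is that the expensive part (hashing every word) happens once per fetched page, while the comparison against stored fingerprints is a trivial integer operation.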

> Search engines are having a
> hard time indexing billions of pages and picking up key words. Now you
> ask them to calculate similarities in a graph with billions of nodes?!

Isn't that already happening with duplicate content? That step only needs 
to be refined. Maybe in two years it costs no more than the current 
duplicate check does today, so it's certainly within reach, and maybe even 
already developed.

>> I am sure there has already been a lot of research done. For example,
>> students copy papers written by others.
> Yes, I know, but people mocked it for being unreliable. Besides, you
> can easily run filters that will do some permutations and replace
> words with proper equivalents. Brute force would do the job.

Yes, and hence it will get harder and harder. 

>>> To a black hat SEO it would be no problem to automate this and
>>> deceive the search engines. it is much easier to carry out a robbery
>>> than it is for the police to spot the crook in a town of millions.
>> You don't do exact matches in cases like this, just fuzzy matches.
> Using that analogy again, that's like doing a house-to-house search
> and questioning all the residents.

But you have already eliminated all houses that certainly have nothing 
to do with it. The trick is to minimize both the number of false 
positives and the number of false negatives (hence improving the 
certainty). Those techniques are already used for spam filtering, virus 
detection, etc., and I am sure they will become more and more important 
in stopping things like lyric sites, Usenet archives, and free-content 
cloning.
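The house-to-house analogy can be sketched as a two-stage filter: a cheap test eliminates the houses that certainly have nothing to do with it, and only the survivors get the expensive fuzzy comparison. The function names and thresholds below are illustrative, not any engine's actual method:

```python
import difflib

def jaccard(a, b):
    """Cheap word-overlap score, used only to eliminate obvious non-matches."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def find_near_copies(original, candidates, prefilter=0.2, threshold=0.7):
    """Two-stage search: drop candidates that share almost no vocabulary
    with the original, then run the expensive fuzzy comparison only on
    the survivors."""
    hits = []
    for page in candidates:
        if jaccard(original, page) < prefilter:
            continue  # "house" eliminated without questioning the residents
        ratio = difflib.SequenceMatcher(None, original, page).ratio()
        if ratio >= threshold:
            hits.append((page, round(ratio, 2)))
    return hits
```

Both thresholds are knobs: raising `prefilter` cuts cost but risks false negatives, lowering `threshold` catches more copies but risks false positives — the same trade-off as in spam and virus detection.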

John                       Perl SEO tools: http://johnbokma.com/perl/
                 Experienced (web) developer: http://castleamber.com/
Get a SEO report of your site for just 100 USD:
