Archive for the ‘SEO’ Category

C-Block Penalties

Google Cookie

WE have begun to hear more and more about link filtering, selective links, rel='nofollow' and link penalties in recent days. It appears as though sites might be penalised for having certain links associated with them. I am skeptical nonetheless, and hereby I present a case study argument as to why no search engine could ever (successfully) deploy such an approach.

Think of a Web designer, Mr. X, who has built commercial sites for Mrs. Y and Mrs. Z.

Since Mr. X knows his Web host rather well and wants to centralise his bills, he registers the sites for his clients himself (and possibly ownership is also attributed to his own business, if the sites are not hosted locally). Once done, he does not neglect to add the new sites to his portfolio page. Moreover, he remembers to include a footer in his clients’ sites, which links back to him and potentially attracts some clients who liked his work.

Will a search engine penalise X, Y and Z as a consequence? Will they all run out of business because they work together, acknowledging one another reciprocally? Some links are exchanged for the benefit of the visitor (as illustrated above). Cohesiveness and communities are the way our Internet is built, and research at IBM has shown as much.

The Web is not an isolated set of Web sites. Penalising for cross-site relationships would be a chaotic mistake. In fact, the only way to ever resolve this is to look for off-site links that depart from a ‘community’. But what if these are not relevant? What if all Chinese sites linked to one another because their native language is the same? That’s a community. A country can be a community. Bloggers are a community. You can never penalise for clannish patterns, even if the registrar happens to be identical.

These communal patterns may have led to questioning of the PageRank system in the past. From an old article on PageRank, for example:

As Gary Stock noted here last May, Google “didn’t foresee a tightly-bound body of wirers. They presumed that technicians at USC would link to the best papers from MIT, to the best local sites from a land trust or a river study – rather than a clique, a small group of people writing about each other constantly. They obviously bump the rankings system in a way for which it wasn’t prepared.”

Although it’s tempting to suggest that bloggers broke PageRank™ it might equally be the case that the Blog Noise issue is emblematic rather than causal. Blog Noise – in the form of ‘trackbacks’, content-free pages and other chaff – is the most visible manifestation, but mindless list-generators are also to blame for Google’s poor performance.

While on the subject, another article from July presents a few more speculations about such C-block-reliant penalties.

Google’s possible purpose for filtering new links

While Google’s algorithm is not made public, it’s generally thought that Google intends to clamp down on link sales for PageRank and for ranking in the SERPs. Also on Google’s hit list are multiple interlinked sites, existing on the same ip c block, entirely for the purposes of link popularity and PageRank enhancement.

Purchased links tend to be added to a website in medium to large quantities, and often all at one time. Large quantities of incoming links, appearing all at once, might indeed trip a filter.

Google could suspect a high volume of links added at one time to be purchased, and therefore suspect. The possibility would be in keeping with Google’s strongly suspected policy of discouraging link sales. After all, Google’s guidelines point out that any type of linking schemes are against its policies.

The IP C block is the third series of numbers in an IP address. For example, in 123.123.xxx.12 the C block is denoted by xxx. Google is able to readily identify such links.
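
To make the mechanics concrete, here is a minimal sketch (in Python, with invented sample data) of how a search engine could group backlinks by their C block and flag clusters of interlinked sites. Nothing below reflects Google’s actual implementation.

    # Illustrative only: group hypothetical backlinks by the /24 "C block"
    # of the linking IP address and flag suspiciously large clusters.
    from collections import defaultdict

    def c_block(ip: str) -> str:
        """Return the first three octets, e.g. '123.123.45.12' -> '123.123.45'."""
        return ".".join(ip.split(".")[:3])

    # made-up data: (IP address of the linking site, site it links to)
    backlinks = [
        ("123.123.45.12", "client-y.example"),
        ("123.123.45.77", "client-z.example"),
        ("123.123.45.90", "designer-x.example"),
        ("198.51.100.4", "unrelated.example"),
    ]

    by_block = defaultdict(list)
    for ip, target in backlinks:
        by_block[c_block(ip)].append(target)

    for block, targets in by_block.items():
        note = "  <- many interlinked sites on one C block" if len(targets) > 2 else ""
        print(f"{block}.x: {targets}{note}")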

Collaborative Effort to Crawl the Web

Iuron

ONLY a few days ago, somebody made me aware of the Majestic-12 distributed search engine. The idea behind the engine is persistent use of other people’s computer power and bandwidth. The goal in mind is to crawl and potentially index the Web reasonably well.
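
For illustration only, the sketch below captures the general idea behind such distributed crawling: a volunteer node fetches a batch of pages with its own bandwidth and reports the outgoing links it finds. This is not Majestic-12’s actual protocol, and the coordinator address in the comment is invented.

    # A hypothetical volunteer crawler node. In a real client the batch of URLs
    # would come from a coordinator (e.g. http://coordinator.example/work) and
    # the results would be posted back; here we simply crawl a hard-coded sample.
    import re
    import urllib.request

    def fetch(url: str) -> str:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")

    def extract_links(html: str) -> list[str]:
        return re.findall(r'href="(http[^"]+)"', html)

    def crawl_batch(urls: list[str]) -> dict[str, list[str]]:
        results = {}
        for url in urls:
            try:
                results[url] = extract_links(fetch(url))
            except OSError:
                results[url] = []  # unreachable page: skip it
        return results

    if __name__ == "__main__":
        print(crawl_batch(["http://example.com/"]))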

This sudden ‘enlightenment’, to me at least, provided somewhat of an insight. It bears on the practicability of my Open Source project, Iuron, which is in its early stages and more of a proposal at this stage. As explained before, Iuron does not index pages; it aspires to gain actual knowledge from the Internet instead. This can potentially make PageRank (or equivalents) obsolete, I believe, thereby reducing spam and search engine cheats.

Within a few days, I will be meeting the person who is arguably the father of the Semantic Web. My project will be difficult to lift off the ground without some support. Nonetheless, this now appears to be a hindrance with a simple solution. It is, after all, the kind of project where the vast requirement for bandwidth and computer power can be met in more or less the same way as in Majestic-12. Since it is Open Source, the public’s willingness to participate should not be a considerable obstacle.

On an unrelated topic, namely paranoia, I recently noticed a reduction in referrals from Google. It became conspicuously significant in recent days, so I thought it was an attempt to silence me. It finally turns out to have been merely a side-effect of a large-scale update at Google’s end. Many Web sites were in fact affected by this and distress became apparent in a few newsgroups. It was even pointed out that msn.com was assigned PageRank 2!

The Demise of PageRank

PageRank versus traffic
The number of sites with PageRank 10 is tiny when compared to the number of sites with PageRank 0. Conversely, traffic is largely centralised in sites with a high PR (more details)

A little tour around Google has led me to an arguably out-of-date article from The Register. The article is 2 years old, but it seems truer than ever these days because scraping and link-related spam/attacks are constantly on the rise.

Google has made no secret of its goal to “understand” the web, an acknowledgement that its current brute-force text index produces search results with little or no context. The popularity of Teoma demonstrates that even a small index can produce superior results for certain kind of searches. Teoma leans on existing classification systems.

While Google relied on PageRank™ to provide context, all was well. But PageRank is now widely acknowledged to be broken, so new, smarter tricks are required.

Regarded as heresy when we raised the issue last spring, now some of Google’s warmest admirers, MetaFilter’s Matt Haughey and web designer Jason Kottke have acknowledged the problem.

As Gary Stock noted here last May, Google “didn’t foresee a tightly-bound body of wirers. They presumed that technicians at USC would link to the best papers from MIT, to the best local sites from a land trust or a river study – rather than a clique, a small group of people writing about each other constantly. They obviously bump the rankings system in a way for which it wasn’t prepared.”

The interesting fact is that Google themselves acknowledge the problem, and I am sure difficulties have intensified, if anything, in the past 2 years. The specific reference to bloggers proves that very point, as a new blog is set up every second these days.
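
To see why a tight clique ‘bumps’ the rankings, consider a toy power-iteration PageRank (a sketch only; the graph, names and constants below are invented) in which three blogs link exclusively to one another while an outside paper receives no links at all:

    # Toy PageRank by power iteration; illustrates how mutual linking within a
    # small clique concentrates rank regardless of outside endorsement.
    DAMPING = 0.85

    def pagerank(graph: dict[str, list[str]], iterations: int = 50) -> dict[str, float]:
        pages = list(graph)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - DAMPING) / len(pages) for p in pages}
            for page, outlinks in graph.items():
                if not outlinks:  # dangling page: spread its rank evenly
                    for p in pages:
                        new_rank[p] += DAMPING * rank[page] / len(pages)
                else:
                    for target in outlinks:
                        new_rank[target] += DAMPING * rank[page] / len(outlinks)
            rank = new_rank
        return rank

    graph = {
        "blog-a": ["blog-b", "blog-c"],
        "blog-b": ["blog-a", "blog-c"],
        "blog-c": ["blog-a", "blog-b"],
        "paper": ["blog-a"],  # the paper links out, but nothing links back to it
    }

    for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
        print(f"{page}: {score:.3f}")

The three interlinked blogs end up holding nearly all of the rank, which is precisely the behaviour the quote above describes.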

Regarding the point about pages being indexed rather than learned from or understood, that is one of the catalysts that led me to starting Iuron. That site has attracted tremendous levels of interest since the idea was conceived on the night of October 9th. I set up the Web site and made an official announcement the following day. Yesterday I finished a 1-page formal proposal and I contacted the person who is perceived by some as the father of the Semantic Web. He was once my lecturer.

WikiMirror – Vile Ripoff

Sky scrapers
Content scrapers: where is the original content? Which one is the ripoff?

MUCH of what we see on the Internet these days is mirrored content, although we are rarely aware of it. Several crooks make good money out of it. Some search engines are crippled by the fact that they have no knowledge as to which sites are known ‘mirror culprits’ and which ones can be trusted. Consequently, search engines like Yahoo tend to return many references to content scrapers, which is a deterrent.
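
One way an engine could at least detect such mirrors is to compare overlapping word ‘shingles’ between pages and measure their Jaccard similarity. The sketch below is illustrative only (the sample texts are invented and this is not any engine’s actual method):

    # Flag near-duplicate pages by comparing sets of overlapping word n-grams.
    def shingles(text: str, size: int = 4) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    original = "The free encyclopedia that anyone can edit, covering many topics."
    mirror = "The free encyclopedia that anyone can edit, covering many topics."
    unrelated = "A weblog about search engines, operating systems and spam."

    print(jaccard(shingles(original), shingles(mirror)))     # 1.0: almost certainly a mirror
    print(jaccard(shingles(original), shingles(unrelated)))  # near 0: unrelated content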

As I went about looking at some SERP‘s, I suddenly came across a commercial site — a Wikipedia mirror — which had been registered in January 2005. Its description in Google (judge for yourselves):

Wikimirror.com – Free encyclopedia search a b c d e f g h i j k l m n o p q r s t u v w x y z _ · Google Web Encyclopedia.
 

WTF? Is the Creative Commons Licence an invitation to massive, large-scale ripoffs? Is a community of WikiPedians around the world voluntarily spending time in vain? To get somebody else rich(er)? I have had a look around the site in question. It appears to be a complete mirror, which, I must stress, cannot be edited, unlike its much superior source. I recently discussed the issue of people mirroring the CIA Factbook, which is public content, in a relevant newsgroup. When will this end? And why has Google not banned Wikimirror.com yet? Does the domain name not say something? Helloooooo…?

I spoke to Chris Pirillo about Blogspot spam yesterday. After our discussion he posted an item that makes a nice little read. That item is titled Google: Kill Blogspot Already!!!, which is a venturous and strong title to be used by somebody as prominent as Chris.

Also while on the subject, have a look at Networkmirror.com [rel='nofollow']. In its defence, this mirror (one among many Slashdot mirrors) does not archive content and it serves a defensible purpose — that of mirroring sites before they go down due to the Slashdot Effect. That way, mirrored sites at least get exposure while they cannot cope with the demand.

UPDATE: 1-script argues that I may have been a little hasty in posting this item. The site states:

Content Credit

Wikimirror financially supports the Wikimedia Foundation. Displaying this page does not burden Wikipedia hardware resources. This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License. Contact: info [AT] wikimirror [DOT] com

As a side note, I still think that Wikipedia contributors (me included) ought to be aware of these facts (or the existence of this mirror), which may imply that their contributions become commercial and thus money-making.

Comparing SERP’s (Search Engine Results Pages)

THERE appears to be some exciting development over at the Google Page Rank Comparison Tool. Earlier today I noticed the addition of several new features, which allow users to evaluate and compare different sites returned for a Google query.

If you are highly competitive when it comes to domination of popular Google searches (frequently the case in eCommerce), have a quick go and use the tool. Enter the SERP (search engine results page) that you aim for and see details of your competing URL‘s, even in a side-by-side view. For each URL, you can now see:

  • PageRank of target page
  • PageRank of domain’s front page
  • Number of IBL‘s, also known as BackLinks
  • Ditto for domain’s front page
  • Google Saturation (total # of pages indexed)
  • MSN Rank

The script is doing a lot of ‘leg work’, making it a valuable tool to run in the background (in a tab or a separate window). It can take up to 2 minutes to complete the most comprehensive query, which investigates the top 50 (maximum) entries for a given SERP.
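
For readers unfamiliar with the output, the sketch below mocks up the kind of side-by-side record the tool assembles per URL. The figures and URLs are invented, purely to show the shape of the comparison.

    # Hypothetical per-URL record mirroring the metrics listed above.
    from dataclasses import dataclass

    @dataclass
    class SerpEntry:
        url: str
        page_pr: int     # PageRank of the target page
        home_pr: int     # PageRank of the domain's front page
        page_ibl: int    # inbound links (backlinks) to the page
        home_ibl: int    # inbound links to the front page
        saturation: int  # total number of pages indexed by Google
        msn_rank: int    # position in MSN's results

    entries = [
        SerpEntry("shop-a.example/widgets", 4, 5, 120, 900, 3400, 7),
        SerpEntry("shop-b.example/widgets", 3, 6, 80, 2100, 15000, 2),
    ]

    print(f"{'URL':<26}{'PR':>4}{'homePR':>8}{'IBL':>6}{'homeIBL':>9}{'pages':>8}{'MSN':>5}")
    for e in entries:
        print(f"{e.url:<26}{e.page_pr:>4}{e.home_pr:>8}{e.page_ibl:>6}"
              f"{e.home_ibl:>9}{e.saturation:>8}{e.msn_rank:>5}")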

Related items: PageRank Prediction and SEO Tools


Non-Evolving Search Engines and Operating Systems

3 Monkeys

Refuse to explore and stay a monkey forever

Google’s algorithms are very complex at present. At their very core remains PageRank — a mechanism that often gets misused and leads to disasters (referrer spam, among other link spamming techniques; a small filtering sketch follows the list below). For my recent zombie attacks I blame:

  • Google – for unintentionally leading to ‘link greed’, and for not understanding or anticipating street smarts and the penetration of second- and third-world countries into the Internet
  • ISP‘s – for apathetically harbouring traffic that is pure spam or targeted attacks
  • Microsoft – for creating an operating system that is so easy for crooks to capture
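
As for the referrer spam mentioned above, here is a minimal filtering sketch. The log line and blacklist are hypothetical, and real deployments typically do this in .htaccess or at the server level rather than in Python.

    # Drop log entries whose referrer matches a small spam blacklist.
    import re

    BLACKLIST = ["casino", "viagra", "texas-holdem"]  # invented patterns
    SPAM_RE = re.compile("|".join(BLACKLIST), re.IGNORECASE)

    def is_referrer_spam(log_line: str) -> bool:
        # Apache combined log format: the referrer is the second-to-last quoted field.
        fields = re.findall(r'"([^"]*)"', log_line)
        referrer = fields[-2] if len(fields) >= 2 else ""
        return bool(SPAM_RE.search(referrer))

    sample = ('1.2.3.4 - - [12/Oct/2005:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
              '"http://cheap-casino.example/" "Mozilla/4.0"')
    print(is_referrer_spam(sample))  # True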

Google’s algorithms have become like a horse that has lots of decorations upon it, but is nothing more than a brute-force horse underneath. You can take a pig, put it in a dress and take it out for dinner. But it’s still a pig in a dress, not a girlfriend.

Search must evolve. A flawed or limited principle at the heart of something is bound to fail no matter how many bits you hang on top to patch it up and improve it. This is why traditional page indexing is not a good method for approaching the problem of information extraction and discovery. Microsoft’s operating system, for instance, suffers from the very same problem, where a flawed and overly complex operating system was built from ‘code spaghetti’. It was recently heard through the grapevine that Longhorn was thrown away and development reverted to ground zero, to be based on the XP-related Server 2003 code. This goes to show that weekly updates were merely patching a morbid mess. Microsoft recognise their inability to compete with Linux’s performance (uptime, flexibility — a reason for Monad). Linux took the right approach — the right paradigm, if you like — all along, and was therefore able to sweep up all the best programmers in the world.

Returning to Google, by relying heavily on PageRank and making ad-hoc improvements, no real innovation will be made. This is why I introduced Iuron a few days ago. It ought to turn a large pool of indexed pages into actual knowledge and provide definitive answers rather than a linear scatter of related pages.
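
To illustrate the distinction between indexing and knowledge (this is not Iuron’s design, which remains a one-page proposal, merely a toy in Python): instead of indexing the words of a sentence, one can extract a fact from it and answer a question directly.

    # Naive 'X is a Y' fact extraction; real knowledge extraction is far harder.
    import re

    def extract_triple(sentence: str):
        match = re.match(r"(?P<subject>.+?) is an? (?P<object>.+?)\.?$", sentence.strip())
        if match:
            return (match.group("subject"), "is-a", match.group("object"))
        return None

    knowledge = {}
    for sentence in ["Linux is an operating system.",
                     "PageRank is a link-analysis algorithm."]:
        triple = extract_triple(sentence)
        if triple:
            subject, _, obj = triple
            knowledge[subject] = obj

    # A direct answer, rather than a linear scatter of pages containing the words:
    print("What is Linux?", "->", knowledge.get("Linux"))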

Also coming to mind are AltaVista and other antiquated search engines with very fundamental and not-so-cunning methods for scanning pages. These were very quickly replaced by backlink-based engines, i.e. link counting, in 1998. No progress has been made in nearly 8 years, however. One may begin to conceive of a Google killer rather than a Windows killer. The required resources, in particular data centres, make the (financial) entry barrier too high to pose a substantial enough threat. Proprietary technologies, however, are no concrete barrier, contrary to the case with operating systems. So, I remain optimistic, and I might soon meet some professors whose expertise is the Semantic Web.

With reference to the famous 3-monkey image on top (I also have one on top of my monitor), those who refuse to evolve (Ballmer) show ‘zoo symptoms’ already. I vividly recall the day when Scoble quoted Microsoft CEO, Steve Ballmer, saying that RSS has no future. This was roughly 7 months ago. Ballmer also said that Google would vanish in 5 years and promised that MSN search was bound to ‘kill’ Google. I say: live in the past, be the past.

Moving on to a different topic, the latest article with the theme of aging at Microsoft came out on the day when I first composed this item: Pity poor Microsoft’s midlife crisis. Another recent article I have just been informed of is At 30, Microsoft Grapples With Growing Up.

Recommended reading:

Q: What about all the people in the corporate environments who are forced to use MS products and aren’t allowed the option/choice to use Mac/Linux/UNIX?

A: Kick your boss’s ass, or choose to work for a company whose decisions you like.

Google Does Its Laundry

Laundry machines

GOOGLE’S Search Quality Team pushes on with an initiative to remove pages which contain hidden text, JavaScript re-directs and the like. The main intention is to penalise sites that use questionable, if not black-hat, SEO techniques. It is rather encouraging to know that much effort is being put into the refinement of search results — an effort considerable enough to justify the formation of an entirely independent team.

"While we were indexing your webpages, we detected that some of your pages were using techniques that were outside our quality guidelines, which can be found here: [link]

In order to preserve the quality of our search engine, we have temporarily removed some webpages from our search results. Currently pages from [url removed] are scheduled to be removed for at least 30 days.”
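
For the curious, below is one naive way such hidden text could be detected: foreground text whose colour matches the page background. It is a heavily simplified illustration with invented sample HTML, not Google’s actual method.

    # Flag pages where some text colour equals the declared background colour.
    import re

    page = '''
    <body bgcolor="#ffffff">
      <p>Welcome to our shop.</p>
      <span style="color:#ffffff">cheap widgets cheap widgets cheap widgets</span>
    </body>
    '''

    def has_hidden_text(html: str) -> bool:
        background = re.search(r'bgcolor="(#[0-9a-fA-F]{6})"', html)
        if not background:
            return False
        return bool(re.search(r'color:\s*' + re.escape(background.group(1)), html))

    print(has_hidden_text(page))  # True: the span's text blends into the background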
