Archive for the ‘SEO’ Category

URL Ambiguity and Duplicate-Content Attacks

Iron links

Attacking the competition using links

PERHAPS it is a case of stating the obvious, but this will be of interest to those Webmasters who habitually create directories with an index.html inside. This approach often produces more elegant and graceful URLs, yet it may entail a hidden cost.

I believe I have identified an issue that badly affects my site and, in particular, its WordPress blog component. The deficiency is associated with links that point to a given page whose URL structure is built artificially using Apache’s mod_rewrite. Several URLs can point at the same page (even without appending a question mark to pass extra file request arguments). It turns out that http://example.com/Example is different from http://example.com/Example/ (note the extra slash), as perceived by the large search engines. In the former case, “Example” is assumed to be an object residing in the main directory; in the latter case, it is definitely a directory, so a structural ambiguity does exist.

Apache redirections handle the two URLs separately as well. If inbound links arrive in both forms (created on whichever external sites), there is a chance of ambiguity, which then leads to duplication of pages. This means that pages might be penalised as duplicates in the search engines’ caches. I begin to wonder whether a vandal could maliciously point to ‘alias addresses’, thereby having pages duplicated in search engine indices. In this way, the vandal could trigger penalties that are imposed on other sites, namely those Web sites where dirty tricks are believed to be employed. Such penalties do not involve the Webmaster and cannot truly be avoided, but they can help someone knock down the competition in the SERPs.
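The risk can be reduced at the site’s end by settling on a single canonical form and folding every variant into it. What follows is only a minimal Python sketch of that normalisation step; the function name and the decision to always append the trailing slash are assumptions made for illustration, not behaviour prescribed by Apache or WordPress.

    from urllib.parse import urlsplit, urlunsplit

    def canonicalise(url, trailing_slash=True):
        """Collapse the two ambiguous forms of a URL into one canonical form."""
        scheme, netloc, path, query, fragment = urlsplit(url)
        if path in ("", "/"):
            path = "/"
        elif trailing_slash and not path.endswith("/"):
            path = path + "/"
        elif not trailing_slash:
            path = path.rstrip("/")
        return urlunsplit((scheme.lower(), netloc.lower(), path, query, fragment))

    # The ambiguous pair discussed above maps to a single key:
    assert canonicalise("http://example.com/Example") == canonicalise("http://example.com/Example/")

In practice the same policy would be expressed as a single permanent (301) redirect, so that crawlers only ever record one of the two addresses.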

DDoS attacks are another, totally separate matter because, in the worst circumstances, they can slow down the Web server, thus slowing down crawling and ruining a site’s profile. That aside, DDoS attacks are illegal and they require a lot of brute force and excellent connectivity. Link creation does not.

Related item: Aftermath of a Zombie Attack

Criticism of Today’s Web

Internet

Two recent articles, both definitely worth reading, are quoted below.

Search engines extract too much of the Web’s value, leaving too little for the websites that actually create the content. Liberation from search dependency is a strategic imperative for both websites and software vendors.

To you who are toiling over an AJAX- and Ruby-powered social software product, good luck, God bless, and have fun. Remember that 20 other people are working on the same idea. So keep it simple, and ship it before they do, and maintain your sense of humor whether you get rich or go broke. Especially if you get rich. Nothing is more unsightly than a solemn multi-millionaire.

This reminded me of a fun blog which is purely dedicated to Web 2.0 bashing.

Search Engines Dig Deeper

SEARCH engines are constantly finding new ways to improve their performance. While there are many methods involved, some are less ethical than others. Think of the following perplexing scenario: refinement of one’s results by means of crawling and utilising an opponent’s data, which is ‘exposed’ to everyone. In legitimate cases this is known as “harvesting”; put negatively, it carries the connotation of “scraping”. The root of the idea is the use of one search engine to improve and refine the results of another. Scroogle and Webcrawler, for instance, depend purely on this concept. There is a cyclic trap here, too; search engine ‘poisoning’ comes to mind.

Tractor arm

Think of MSN using Google Directories, or Google using del.icio.us, a social linkage database that is now owned by Yahoo. Moreover, directories like DMOZ are non-profit, yet they are often open for use (or misuse) by profit-making companies. Link bases like del.icio.us have the potential to refine results. As they are publicly available, could anyone truly restrict rivals from accessing and using the potentially valuable data? These links are contributed and managed by the public and carry no prescribed copyrights. The depth of exploration available to search engines does not seem to be limited, which is worrying.
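To make the mechanics concrete without describing any particular engine’s internals, the following is a small, hypothetical Python sketch of the meta-search pattern: results gathered from several sources are pooled, de-duplicated by a canonical URL and re-ranked by how many sources agree on them. The fetch_results helper is a placeholder of my own invention; it stands in for whatever crawling or querying a real aggregator performs.

    from collections import defaultdict

    def fetch_results(source, query):
        """Placeholder: a real aggregator would query or scrape the source here."""
        raise NotImplementedError

    def merge_results(query, sources):
        """Naive meta-search: pool results, then rank by cross-source agreement."""
        votes = defaultdict(int)   # canonical URL -> number of sources listing it
        best_rank = {}             # canonical URL -> best (lowest) rank seen
        for source in sources:
            for rank, url in enumerate(fetch_results(source, query)):
                key = url.rstrip("/").lower()
                votes[key] += 1
                best_rank[key] = min(best_rank.get(key, rank), rank)
        # URLs endorsed by more sources come first; ties are broken by best rank.
        return sorted(votes, key=lambda u: (-votes[u], best_rank[u]))

Services of the Scroogle variety are, in essence, this loop wrapped around somebody else’s index, which is exactly why the practice raises the questions above.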

There seems to be a certain ethical and legal border where crawling images and serving them within frames (as Google Images does) becomes questionable, let alone the indexing of public forums, Usenet included (as in Google Groups). How deep should one be allowed to crawl the data in existence, and how should it be attributed to its source? To dare is to win, but often this leads to a demise catalysed by public opinion. There is no doubt that company acquisitions are intended for more extensive data collection, assuming that the information, even if obtained in a manner akin to spying, is available. Information can become powerful, but it has a cost, which is often the death of privacy.

As I carry on with my drivel on the infiltration of search engines, I also discover the belated arrival of a European search engine.

In his New Year’s address outlining his administration’s plans for 2006, French President Jacques Chirac focused on plans for a European search engine to rival US internet companies such as Yahoo and Google. Some of the top tech labs in France and Germany are reportedly working on the ‘Quaero’ (Latin for ‘to search’) search engine.

Designing With Flash

Sparkle

In my own mind, there is one golden rule for design with Flash:

Non-Flash browsers should miss no information. Their users should only miss out on the flash (no capital ‘F’ here), but never any content. If proprietary formats like Flash are made a requirement for information extraction, the outlook for the Web seems grim.

It has been a long time since I last designed with Flash. This goes back to 2001, in fact. At the time, little did I know about the SEO impact, which is why I no longer bother with Flash. Originally, its use was not my choice either; it was the client’s.

As a final word of advice (or caution), menus and text should remain in pure-text form, never embedded in something that requires pattern analysis or closed-source software. The same rule applies to graphics (images), where the alt attribute must be used as a surrogate, just in case images are not (or cannot be) displayed.
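To make that advice testable, here is a rough Python sketch (a construction of my own, not a real SEO or accessibility tool) which flags img elements whose alt attribute, the surrogate mentioned above, is missing or empty:

    from html.parser import HTMLParser

    class AltAudit(HTMLParser):
        """Collects img tags whose alt attribute is missing or empty."""
        def __init__(self):
            super().__init__()
            self.missing_alt = []

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                attrs = dict(attrs)
                if not (attrs.get("alt") or "").strip():
                    self.missing_alt.append(attrs.get("src", "<unknown source>"))

    audit = AltAudit()
    audit.feed('<p><img src="logo.png"><img src="chart.png" alt="Monthly visits"></p>')
    print(audit.missing_alt)   # ['logo.png'] -- the surrogate text is absent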

Accessibility-Friendly Search Engines

Shop sign
A mixed message is delivered to site visitors

AS time goes by, the needs of the disabled are better realised. The Web has become not only a mainstream phenomenon, but also a necessity. To many, banking, shopping and even the social aspects of life depend on the Internet. Currently, search engines tend to concentrate on content, not on style and graphics, let alone the validity of code or issues pertaining to accessibility. Might this change?

It would not be surprising if a search engine emerged which only bothered with pages that are pure text or are built to possess good accessibility traits. Blind and handicapped people, for example, could opt for this niche-serving search engine. Large players such as Google have already catered for specific types of searches, such as localised search (Google Local) and blog-exclusive search. Accessibility-oriented search may soon become a reality.

Tools such as The SEO Analyzer would perhaps be valuable for ranking sites, and incorporating modules such as these into crawlers seems a worthwhile move. Moreover, rather than separating the engine types altogether, the user could tick a box that says ‘display only lean, stripped-down pages’1 or ‘rank pages for accessibility and sort by quality’. This would encourage better Web standards and open ‘HTTP cyberspace’ to a larger audience.

1 This exists already, I suspect. Page size in the results page provides a clue as well.
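Purely as an illustration of what such a ranking module might look like, the toy Python sketch below scores pages by size, by alt-less images and by the ratio of visible text to markup; the heuristics and weights are invented for this example and are not taken from The SEO Analyzer or any real engine.

    import re

    def accessibility_score(html):
        """Toy heuristic: smaller pages, fewer alt-less images and a higher
        proportion of real text relative to markup all raise the score."""
        size_kb = len(html.encode("utf-8")) / 1024
        images = re.findall(r"<img\b[^>]*>", html, flags=re.IGNORECASE)
        alt_less = [img for img in images if "alt=" not in img.lower()]
        text = re.sub(r"<[^>]+>", " ", html)
        text_ratio = len(text.split()) / max(len(html.split()), 1)
        return round(10 * text_ratio - 0.05 * size_kb - 2 * len(alt_less), 2)

    # A results page could then offer to re-sort candidate URLs by this score:
    pages = {"a.html": "<html><body><p>Plain text</p></body></html>",
             "b.html": "<html><body><img src='x.png'><p>Hi</p></body></html>"}
    ranked = sorted(pages, key=lambda url: accessibility_score(pages[url]), reverse=True)
    print(ranked)   # 'a.html' outranks the image-heavy, alt-less 'b.html'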

Blog Plagiarism

Laundry machines
Help the search engines clean up the Web.
Report duplicates.

I recently mentioned site scrapers in the context of Internet plagiarism. Nowadays I hear more and more often about blogs being copied systematically.

Blog plagiarism is a growing phenomenon, or so it seems on the surface. This even happens to me sometimes, but I refuse to spend my time or lose sleep over it. The process needed to remove stolen content is unnecessarily cumbersome. As an example, Podz and Mike Little, who are both WordPress developers, had people copy their entire sites, post by post. This can ultimately lead to mirror/duplicate penalties, which deter search engines. As far as I know, they had to engage in a lengthy process of correspondence before action was taken. The best one can do is keep an eye on the dodgy sites and report abuse when it all blows out of proportion. As long as a site is public, it is susceptible to copyright infringement and can, in due time, become a victim.

One example of stolen content is RSS Site Map, an item that was once copied verbatim and in full. If I recall correctly, a Blogger member was the culprit. A subtle link was at least present, but no real attribution was made.

Other content thieves scrape random bits and stick them together to form ‘doorway pages’. These pages serve as a mechanism which hogs search engine referrals. This is one among many popular black-hat SEO practices, which are a form of spam by any definition.

Frequently-Asked Questions (or Useful Facts)

  • Q: How does one copy content systematically?
    A: RSSBlog and the like. Magpie can do this via RSS when misused.
  • Q: How does one detect plagiarism?
    A: Tools such as Copyscape appear to do the trick. I imagine that they run a series of Web searches using long sentences as queries and then attempt to identify excessive overlap across sites on the Internet. These Web-based tools simplify and automate, at a high level at least, an old method for detecting duplicates, one which I still recall from my days as an undergraduate. (A minimal sketch of that overlap test appears after this list.)
  • Q: How does one report plagiarism?
    A: Probably the most suitable response is contacting the host of the offending site. Examples are needed to support the complaint/s.
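Here is the minimal sketch promised above. It approximates the overlap test with word ‘shingles’ and a Jaccard similarity; the shingle length and the 0.5 threshold are arbitrary choices of mine and certainly not how Copyscape is actually implemented.

    def shingles(text, size=5):
        """Break text into overlapping word sequences ('shingles')."""
        words = text.lower().split()
        return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

    def overlap(doc_a, doc_b, size=5):
        """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
        a, b = shingles(doc_a, size), shingles(doc_b, size)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    original = "Blog plagiarism is a growing phenomenon, or so it seems on the surface."
    suspect = "Blog plagiarism is a growing phenomenon, or so it seems on the surface today."
    if overlap(original, suspect) > 0.5:       # the threshold is arbitrary
        print("Excessive overlap -- worth a closer look")

A real detector would, of course, first use Web searches on long sentences to find candidate pages and only then run a pairwise comparison like this one.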

Search Engines and Biased Results

Google Cookie

This one comes from a blog, but it is still worth reading.

The New York Times reported that Google will give AOL preferred placement for AOL’s videos in Google’s video search in Google’s new Google Video search site. In addition, Google will include links to AOL videos on the Google Video home page — and won’t label any of those links advertising, or call the preferred listings advertising, even though they clearly are ads.
