Archive for the ‘Cyberspace’ Category

Hackers, Insults and Error Logs


SEVERAL times in the past I whined about the state of the Internet. It is too susceptible to various faces of evil — something which is finally recognised at a higher level and is attributed to the way the Internet was initially conceived, engineered and set up. Blame it on Al Gore if you wish, for he is the one who “invented the Internet”.

My main domain continues to suffer from zombie attacks and brute-force hacking attempts, all of which are unsuccessful. Such attacks may seem like benign inconveniences when properly filtered, yet all such attempts contribute to ‘noise’. They also require a lot of work to circumvent and defeat.

If a Web page, let us say /foo/bar/, includes the word “guestbook” (especially in the page title), one may find errors in the site logs which resemble a particular pattern. These would be requests for common sensitive addresses such as /foo/bar/addentry.php or /foo/bar//addentry.php, which indicate an attempt to spam en masse. The culprits are lazy spammers who scan a page (often a search results page) and run some scripts. The aim is to exploit widely-known vulnerabilities, which have already been patched in most cases. There are rarely open sores in Open Source, but large-scale spam continues to pose a risk and devours precious bandwidth.

As an example of spamming attempts, I find many requests that are similar to:

[Tue Jan 31 07:33:56 2006] [error] [client 69.31.80.114] File does not exist: /home/schestow/public_html/Weblog/archives/2005/07/addentry.php

These are, of course, automated attempts which are directed at pages containing the word “guestbook”. The attacks are thrown at many sites simultaneously, regardless of what software is actually used.
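As a rough illustration, probes of this kind can be tallied straight from the error log. Below is a minimal sketch in Python, not my actual setup; the log path and the list of probed filenames are assumptions for the sake of the example:

    import re
    from collections import Counter

    # Hypothetical log location; adjust for your own host.
    LOG_PATH = "/var/log/apache2/error.log"

    # Filenames commonly probed by guestbook spammers (an assumption
    # based on the addentry.php requests shown above).
    PROBES = ("addentry.php", "guestbook.php")

    pattern = re.compile(r"\[client ([\d.]+)\] File does not exist: (\S+)")

    hits = Counter()
    with open(LOG_PATH) as log:
        for line in log:
            match = pattern.search(line)
            if match and match.group(2).endswith(PROBES):
                hits[match.group(1)] += 1  # one more probe from this client

    for ip, count in hits.most_common(10):
        print(ip, count)

The output is a list of the ten most persistent clients, which can then feed a blocklist or a re-direction rule.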

In other circumstances, hacking attempts involve the hijacking of a content management system or an entire Web site, which is worse than spam. These are attempts to deface, the equivalent of UseNet defamation or outright name mocking, crossposted for public humiliation (an example).

I used to worry a great deal about people’s ability to write self-derogatory blog comments, newsgroup posts, and mailing list messages ‘on behalf’ of somebody else. I saw it happening many times before. The least one can do is embrace PGP signatures. Not everyone can spot IP addresses and track them, and people can nymshift without any restrictions.
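For what it is worth, clear-signing a message with GnuPG takes a single command (assuming a key pair has already been generated):

    gpg --clearsign message.txt    # writes a signed copy to message.txt.asc
    gpg --verify message.txt.asc   # anybody can then check the signature

A forged ‘on behalf of’ post would then fail verification, which is the whole point.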

If manners are the glue of on-line communities, what are the motives of such vandals? When did cracking (as opposed to “hacking”) become popular? The motives must be a boost to ego and clan vanity (or “klan” rather). Sometimes, Web sites are captured and then re-directed in order to steal the rankings which search engines have accredited.

What have I done on the matter? Not much so far, but I found a neat solution to the Windows zombies. Many common attempts to hack are being redirected to this page, to which I referred in a previous blog item. Errors and attempts to hack can be suppressed using re-directions on common URLs which characterise vulnerable components, or the exploitation of scripts for mass-mailing and spam. All in all, after much work, Web malice has been reduced to a manageable level.
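To give a flavour of the approach, such re-directions can be expressed with mod_alias rules in an Apache .htaccess file. This is only a sketch: the probed paths and the trap URL are hypothetical, not my actual rule set.

    # Send requests for commonly-probed exploit paths to a trap page.
    # The paths and destination below are illustrative assumptions.
    RedirectMatch 301 /addentry\.php$ http://example.com/trap/
    RedirectMatch 301 /formmail\.pl$ http://example.com/trap/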

Newsgroups Statistics


THIS morning I found (and began playing with) a GPL’d newsgroup statistics tool (homepage of the project). I compiled it from source code without the Qt GUI and off I went experimenting. In case you choose to run that tool as well, there is an ample manual for its command-line mode.

I have reported a serious documentation bug to the author — a bug that cost me a fair bit of time. To quote the report, in case this helps somebody else:

In Turquoise 2.2, the help bit says:

Usage: turqstat [options] outputfile areadef(s)

Shouldn’t the outputfile precede the [options]? It took me a very long time to realise that.

Having run it successfully ‘off-line’, I made the script invocation a weekly cron job (i.e. a job which a daemon schedules to run repeatedly); a sketch of the crontab entry follows the list below. I decided that I can only post its output to forums where:

  • My participation is noticeable and decent
  • My presence goes back a long time
  • The amount of involvement and activity is high; otherwise there is not much to gauge and the statistics become uninteresting
  • Nobody else already generates statistics
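For reference, the weekly scheduling mentioned above amounts to a single crontab entry along these lines; the binary path, output file and area definition are placeholders rather than my actual configuration:

    # Run turqstat every Monday at 06:00; per the usage line quoted
    # above, the output file precedes the area definition(s).
    0 6 * * 1 /usr/local/bin/turqstat /home/user/stats.txt <areadef>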

For the time being, I will generate and publish statistics for the search engines and Palm newsgroups only.

Nanny Country Snatches Search Logs


“Smile! Big Brother is watching you.”

MSN, AOL and Yahoo have handed over log data to the U.S. government. The controversial move has seen strong resistance from Google, however.

Yahoo acknowledges handing over search data requested in a subpoena from the Bush administration, which is hoping to use the information to revive an anti-porn law that was rejected by the U.S. Supreme Court.

Exposure of one’s search history is nothing new. In fact, exposure through search giants and third parties extends beyond this. The same companies maintain mail accounts and even statistics from other Web sites (Google Analytics).

Given sheer demand from up above, will they carry on caving in and exposing their customers’ data? Also, what about the new laws regarding data retention by ISPs? Everything you do gets logged, unless you use encryption of course. Being watched may be acceptable, but a so-called ‘nanny country’ is not, at least in my humble opinion.


Challenge/Response Gets Blacklisted


LAST night, Brad Templeton pointed out that mail servers which run autoresponders or challenge/response filters could get blacklisted by spamcop.net. This is a database-driven Web site, which various spam filters rely on as a knowledgebase-type service. It also banned our LUG’s mailing list earlier today.

I have been aware of the problems with such anti-spam tactics for quite some time, but never thought it could lead to this. As some commenters pointed out, other services may indirectly abolish anti-spam practices such as challenge/response, as well as lead to banishment from people’s inboxes. To put it in Brad’s words:

I learned a couple of days ago my mail server got blacklisted by spamcop.net. They don’t reveal the reason for it, but it’s likely that I was blacklisted for running an autoresponder, in this case my own custom challenge/response spam filter which is the oldest operating one I know of.

My personal solution, as posted in reply to the article, is to use a spam filter ‘on top’ of the challenge/response component. The intent is to lower the number of challenges. One can reduce the likelihood of banishment in this way, as well as become less of a nuisance to the Net. In other words, it is possible to rule out cases where messages are rather obviously spam. This leads to a lower volume of messages being dispatched, which in turn can avoid blacklisting.

I use SpamAssassin, which is active at a layer higher than challenge/response (in this case BoxTrapper). Whatever gets scored as spam will be put aside in a mail folder which is reserved for spam. Only messages not marked as spam (and not in the whitelist either) will have a challenge delivered. This cuts down the number of challenges by about 70% in my case. It never entails any false positives because I set the thresholds rather high.
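The layering can be summarised with a short sketch, assuming SpamAssassin’s spamc client is installed; the whitelist here is a hypothetical stand-in for the challenge/response component’s own list:

    import subprocess

    WHITELIST = {"friend@example.com"}  # hypothetical list of known senders

    def classify(message: bytes, sender: str) -> str:
        """Return 'spam', 'deliver' or 'challenge' for a raw message."""
        # Layer 1: SpamAssassin. 'spamc -c' reports score/threshold and
        # exits non-zero when the message crosses the spam threshold.
        check = subprocess.run(["spamc", "-c"], input=message,
                               capture_output=True)
        if check.returncode != 0:
            return "spam"      # filed in the spam folder; no challenge sent

        # Layer 2: challenge/response, reached only by apparent ham.
        if sender in WHITELIST:
            return "deliver"
        return "challenge"     # only now is a challenge dispatched

Only the last branch ever generates outgoing mail, which is what keeps the volume of challenges down.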

Blog Plagiarism

Help the search engines clean up the Web. Report duplicates.

I recently mentioned site scrapers in the context of Internet plagiarism. Nowadays I hear more and more often about blogs being copied systematically.

Blog plagiarism is a growing phenomenon, or so it seems on the surface. This even happens to me sometimes, but I refuse to spend my time or lose sleep over it. The process needed to remove stolen content is unnecessarily cumbersome. As an example, Podz and Mike Little, who are both WordPress developers, had people copy their entire sites post by post. This can ultimately lead to mirror/duplicate penalties, which deter search engines. As far as I know, they had to engage in a lengthy process of correspondence before action was taken. The best one can do is keep an eye on the dodgy sites and report abuse when it all blows out of proportion. As long as a site is public, it is susceptible to copyright infringement and can, in due time, become a victim.

As one example of stolen content, RSS Site Map was once copied verbatim and in full. If I recall correctly, a Blogger member was the culprit. A subtle link was at least there, but no real attribution was made.

Other content thieves scrape random bits and stick them together to form ‘doorway pages’. These pages serve as a mechanism which hogs search engine referrals. It is one among many popular aspects of black-hat SEO practices, which are a form of spam by any definition.

Frequently-Asked Questions (or Useful Facts)

  • Q: How does one copy content systematically?
    A: RSSBlog and the like. Magpie can do this via RSS when misused.
  • Q: How does one detect plagiarism?
    A: Tools such as Copyscape appear to do the trick. I imagine that they run a series of Web searches with long sentences as the queries, then attempt to identify excessive overlap across sites on the Internet. These Web-based tools simplify and automate, at a high level at least, an old-styled method for detection of duplicates which I can still recall from my days as an undergraduate (a sketch follows this list).
  • Q: How does one report plagiarism?
    A: Probably the most suitable response is contacting the host of the offending site. Examples are needed to support the complaint(s).
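As a sketch of that old-styled overlap test (word ‘shingles’ compared across two texts; the shingle size and the threshold are arbitrary assumptions):

    def shingles(text, n=8):
        """Return the set of overlapping n-word windows in the text."""
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def overlap(a, b):
        """Jaccard similarity between the two texts' shingle sets."""
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb) if sa and sb else 0.0

    # A score near 1.0 suggests verbatim copying; 0.5 is an arbitrary cut-off.
    if overlap(open("original.txt").read(), open("suspect.txt").read()) > 0.5:
        print("Likely duplicate; worth reporting")

Copyscape and its kind presumably add the Web-search step on top, in order to find the candidate pages in the first place.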
