
Re: Strip all but URL's from Web Pages [was: ping John, OT]

__/ [Borek] on Tuesday 20 December 2005 08:12 \__

> On Tue, 20 Dec 2005 05:09:51 +0100, Roy Schestowitz
> <newsgroups@xxxxxxxxxxxxxxx> wrote:
> 
>>> You know, it's finals week at my university, but give me a week or so
>>> and I can write one for you.  You're talking about wgetting a page like
>>> http://www.google.com/search?hl=en&lr=&q=i+love+seo right?
>>
>> I can't speak for Borek, but I would love to have something generic.
>> Unwanted links can be stripped off the list manually, or even filtered
>> out based on some criterion, e.g.
>>
>> fgrep 'google.' list_of_newline-separated_links.txt >google_links.txt
> 
> That will be ideal.
> 
> The truth is, in such cases I usually write some C/C++ code
> to handle them. As Linux/bash are not my native environment
> (I grew up in DOS), I often learn later that what I did in C can be
> easily done using some fancy combination of awk/grep/sort and so on.
> 
> So C is ready as a last resort, but Perl would be geekier ;)
> 
> Best,
> Borek

I am not entirely sure what you are trying to achieve. However, if Google
SERPs are all you handle, why not parse feeds, or even use them directly?
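
If, on the other hand, it is the generic "strip all but URLs" tool you
both described, Perl will oblige in a dozen lines. A minimal sketch, not
a polished tool: the script name is made up, and it assumes the stock
libwww-perl modules (LWP::Simple and HTML::LinkExtor) are installed:

#!/usr/bin/perl
# striplinks.pl (hypothetical name) -- print every <a href> in a
# page, one per line
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;

my $url  = shift or die "usage: $0 <url>\n";
my $html = get($url) or die "could not fetch $url\n";

my @links;
my $parser = HTML::LinkExtor->new(
    sub {
        my ($tag, %attr) = @_;
        # keep only <a href="...">; LinkExtor also reports
        # <img>, <form> and friends
        push @links, $attr{href} if $tag eq 'a' and $attr{href};
    },
    $url,    # base URL, so every extracted link comes out absolute
);
$parser->parse($html);
$parser->eof;

print "$_\n" for @links;

Pipe its output through your fgrep filter, or through sort -u, and you
have the newline-separated list you described.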

I assume you are trying to automate some type of analysis:

http://www.benhammersley.com/projects/google_to_rss.html

There are equivalents for Yahoo, MSN and many, many others.
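
Once the results arrive as a feed, extracting the links is even simpler.
Another minimal sketch, this time with XML::RSS; the feed URL is just a
placeholder for whatever such a gateway hands you:

#!/usr/bin/perl
# feedlinks.pl (hypothetical name) -- print the <link> of every
# item in an RSS feed
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS;

my $feed = shift or die "usage: $0 <feed-url>\n";
my $xml  = get($feed) or die "could not fetch $feed\n";

my $rss = XML::RSS->new;
$rss->parse($xml);

# each item is a hashref with title/link/description keys
print "$_->{link}\n" for @{ $rss->{items} };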

Hope it's helpful,

Roy

-- 
Roy S. Schestowitz      | "Stand for nothing and you will fall for anything"
http://Schestowitz.com  |    SuSE Linux     |     PGP-Key: 0x74572E8E
  2:40am  up 10 days  9:51,  6 users,  load average: 0.00, 0.03, 0.04
      http://iuron.com - next generation of search paradigms
