
Re: Strip all but URL's from Web Pages [was: ping John, OT]

__/ [Borek] on Tuesday 20 December 2005 08:06 \__

> On Tue, 20 Dec 2005 02:09:32 +0100, Roy Schestowitz
> <newsgroups@xxxxxxxxxxxxxxx> wrote:
> 
>> I'd be /very/ interested in an answer/solution to that too, John. I need
>> to generate files that contain newline-separated URLs rather than copy
>> and paste from Web pages. The closest I could ever get to minimal manual
>> labour was:
>>
>> less search.html  | grep http://
> 
> In Google's case it will not help: the whole response is one line.

What complicates matters is variation in the markup, like:

<A href="foo.bar"></A>
<a title="linky thing" href="foo.bar"></A>
<a  HREF='./foo.bar'></a>

To make something that covers all cases, you can't just lazily scan the
text and spew out whatever sits between '<a href="' and the next '"'. As
the examples show, the tag and attribute names may be in either case, other
attributes may come before href, and the value may be in single quotes.
That's where standards like XHTML and standards-compliant pages come in
handy. As far as I know, Google is very fond of standards, which is rare
among large companies.
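
One way around the quoting and case problems is to let a real HTML parser
find the anchors instead of matching text. Below is a minimal sketch,
assuming Python and its standard html.parser module (neither comes from
this thread); the names LinkCollector and extract_links are made up for
illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in the order seen."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lower case,
        # so <A HREF=...> and <a href=...> look the same here.
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors whose href is missing or empty.
            if name == "href" and value:
                self.links.append(value)


def extract_links(html_text):
    """Return a list of href values found in html_text (hypothetical helper)."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Printing "\n".join(extract_links(page_text)) would give the
newline-separated list asked for earlier in the thread. Since this works
on tags rather than on lines, a page delivered as one long line should not
matter. Relative links such as ./foo.bar come out as written; resolving
them against a base URL is left out of this sketch.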

Roy

-- 
Roy S. Schestowitz
http://Schestowitz.com  |    SuSE Linux     |     PGP-Key: 0x74572E8E
  2:25pm  up 9 days 21:36,  5 users,  load average: 0.05, 0.19, 0.16
      http://iuron.com - next generation of search paradigms
