[H]omer <spam@xxxxxxx> espoused:
> Verily I say unto thee, that Mark Kent spake thusly:
>
>> Has anyone tried to do anything like this already and perhaps has
>> solutions for these issues?
>
> How about running this on a leafnode spool?:
>
> ######
> #!/usr/bin/perl -w
> # parse-urls.pl -- extract URIs from stdin and fetch each with lynx
>
> use strict;
> use URI::Find;
>
> my @uris;
> my $finder = URI::Find->new(
>     sub {
>         my ($uri, $orig_uri) = @_;
>         push @uris, $uri;   # collect each URI found
>         return $orig_uri;   # leave the input text unchanged
>     });
>
> while (<>) {
>     $finder->find(\$_);
> }
>
> # system(), not exec() -- exec would replace this process with lynx
> # on the first URI and never reach the rest
> for my $uri (@uris) {
>     system("lynx", "-source", $uri) == 0
>         or warn "lynx -source failed for $uri: $?";
> }
>
> 1;
> ######
>
> - http://search.cpan.org/dist/URI-Find/
>
> I'll play around with this, and see about adding URI verification.
>
> Also IMHO the final output should be something like:
>
> Article Name: <html title>
> Archive Date: <date fetched>
> Article URI : <orig_uri>
> Article Body: <output from parse-urls.pl>
>
> Getting the *real* posting date for an upstream article is a more
> difficult proposition, since that info is not always available.
>
> Also, for a proper citation, the upstream article *author* should be
> included, where possible.
>
Homer - it's a great start, but part of my issue was about dealing with
the web-page end, where many articles are broken up over multiple pages.
Still, let me know how you get along.
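In the meantime, the citation layout you proposed can be mocked up with core
Perl alone. A rough sketch -- the function name, the naive single-line
<title> regex, and the canned HTML are just my placeholders, not anything
from your script:

```perl
#!/usr/bin/perl -w
# cite-article.pl -- assemble the proposed citation format for one article
use strict;

sub make_citation {
    my ($uri, $html, $body) = @_;
    # Naive title extraction; only handles a simple <title>...</title> pair
    my ($title) = $html =~ m{<title[^>]*>(.*?)</title>}is;
    $title = '(no title)' unless defined $title;
    # "Archive Date" here is simply the fetch date, in UTC
    my @now  = gmtime();
    my $date = sprintf "%04d-%02d-%02d",
        $now[5] + 1900, $now[4] + 1, $now[3];
    return "Article Name: $title\n"
         . "Archive Date: $date\n"
         . "Article URI : $uri\n"
         . "Article Body: $body\n";
}

# Demo with canned HTML; a live version would feed in lynx -source output
print make_citation(
    "http://www.example.com/story",
    "<html><head><title>Test Story</title></head></html>",
    "story text here");

1;
```

A live run would pipe each URI through lynx -source for the HTML and
lynx -dump for the body, but that still leaves your point about the real
posting date and author unanswered.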
--
| Mark Kent -- mark at ellandroad dot demon dot co dot uk |
| Cola faq: http://www.faqs.org/faqs/linux/advocacy/faq-and-primer/ |
| Cola trolls: http://colatrolls.blogspot.com/ |