Verily I say unto thee, that Mark Kent spake thusly:
> Has anyone tried to do anything like this already and perhaps has
> solutions for these issues?
How about running this on a leafnode spool?
######
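#!/usr/bin/perl -w
# parse-urls.pl - fetch the source of every URI found in the input
use strict;
use URI::Find;

my @uris;

# The callback records each URI and leaves the text itself unchanged.
my $finder = URI::Find->new(
    sub {
        my ($uri, $orig_uri) = @_;
        push @uris, "$uri";
        return $orig_uri;
    });

while (my $text = <>) {
    @uris = ();
    $finder->find(\$text);
    for my $uri (@uris) {
        # system() returns to the loop (exec would replace this process
        # after the first line); the list form keeps the shell out of it.
        system('lynx', '-source', $uri) == 0
            or warn "lynx failed for $uri\n";
    }
}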
######
- http://search.cpan.org/dist/URI-Find/
I'll play around with this, and see about adding URI verification.
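
One cheap reading of "URI verification" is a syntax filter applied before anything is fetched. The sketch below is only that: `looks_fetchable` is a hypothetical helper, not part of URI::Find, and the rule it applies (absolute http/https with a host, no user-info part) is an assumption about what is worth archiving. It says nothing about whether the page still exists.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: accept only absolute http/https URIs with a host part.
# This checks syntax only; it does not prove the page is still reachable.
sub looks_fetchable {
    my ($uri) = @_;
    return $uri =~ m{\Ahttps?://[A-Za-z0-9.-]+(?::\d+)?(?:[/?#]\S*)?\z}i ? 1 : 0;
}

# Show the decision for a few sample URIs.
for my $uri ('http://example.com/news?id=1',
             'mailto:someone@example.com',
             'ftp://example.com/file') {
    printf "%-32s %s\n", $uri, looks_fetchable($uri) ? 'fetch' : 'skip';
}
```

A filter like this could sit inside the URI::Find callback, so that mailto: and news: links in an article never reach lynx.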
Also IMHO the final output should be something like:
Article Name: <html title>
Archive Date: <date fetched>
Article URI : <orig_uri>
Article Body: <output from parse-urls.pl>
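
A rough sketch of how one such record might be assembled, using only core Perl. The helper names `html_title` and `format_record`, the date format, and the sample URI are all assumptions made for illustration. The title is pulled out with a regex, which is a shortcut; a real HTML parser would cope better with odd markup.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: pull the <title> text out of fetched HTML.
sub html_title {
    my ($html) = @_;
    return $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is ? $1 : '(untitled)';
}

# Hypothetical helper: build one archive record in the four-line layout.
# Expects uri, fetched (epoch seconds) and html (the fetched page source).
sub format_record {
    my (%arg) = @_;
    return join "\n",
        "Article Name: " . html_title($arg{html}),
        "Archive Date: " . strftime('%Y-%m-%d', gmtime($arg{fetched})),
        "Article URI : $arg{uri}",
        "Article Body: $arg{html}",
        '';
}

print format_record(
    uri     => 'http://example.com/story.html',
    fetched => time,
    html    => '<html><head><title>Example Story</title></head>'
             . '<body>...</body></html>',
);
```

Here the archive date is the fetch time in UTC, which is the one date the script can always know for certain.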
Getting the *real* posting date for an upstream article is a more
difficult proposition, since that info is not always available.
Also, for a proper citation, the upstream article *author* should be
included, where possible.
--
K.
http://slated.org - Slated, Rated & Blogged
.----
| "Future archaeologists will be able to identify a 'Vista Upgrade
| Layer' when they go through our landfill sites" - Sian Berry, the
| Green Party.
`----
Fedora Core release 5 (Bordeaux) on sky, running kernel 2.6.19-1.2288.fc5
16:32:02 up 25 days, 3:57, 3 users, load average: 0.50, 0.73, 0.74