Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> espoused:
> __/ [ Mark Kent ] on Friday 16 March 2007 08:23 \__
>
>> Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> espoused:
>>> __/ [ Mark Kent ] on Thursday 15 March 2007 16:28 \__
>>>
>>>> Roy Schestowitz <newsgroups@xxxxxxxxxxxxxxx> espoused:
>>>>> Giving Back [to Linux]
>>>>>
>>>>> ,----[ Quote ]
>>>>>| My contribution to Vector Linux is quite small even when compared to
>>>>>| some of the other volunteer packagers. It's minuscule compared to
>>>>>| the core developers. The point is that the strength of the Open
>>>>>| Source community is that lots of people give back and contribute
>>>>>| what they can. Lots of little contributions make a huge impact.
>>>>> `----
>>>>>
>>>>> http://www.oreillynet.com/linux/blog/2007/03/post_3.html
>>>>>
>>>>> [H]omer, for example, builds RPMs for Fedora. And look what we have as a
>>>>> result:
>>>>>
>>>>> Four good reasons to switch to RHEL 5
>>>>>
>>>>> ,----[ Quote ]
>>>>>| Sometimes you don't want the hassle of the big upgrade. For example,
>>>>>| there is no good reason to "upgrade" Windows to Vista. On the other
>>>>>| hand, there are upgrades like Red Hat Enterprise Linux 5 (RHEL) that
>>>>>| give you some darn good reasons to make the jump.
>>>>> `----
>>>>>
>>>>> http://www.linux-watch.com/news/NS6991009676.html
>>>>
>>>> I would suggest that the [News] postings provide a useful service to
>>>> the Community, too. In fact, I suspect that if you were to google all
>>>> the digests, you'd have more linux-related news URLs than any amount of
>>>> manual googling would bring. I have noticed though that some of the
>>>> older references seem to disappear in the end. Perhaps we should be
>>>> saving whole articles somewhere?
>>>
>>> If the URL merely changes, then doing a Web search with the title/snippet
>>> should bring up a mirror/identical article. Some time ago OpenAddict moved
>>> from one CMS to another and many URLs broke. I sent an E-mail to the
>>> Webmaster and this will be corrected, but I agree that we can't rely on
>>> the Internet Archive (Wayback Machine) and search engine caches. One
>>> option is to save every page before I post it (CTRL+S+ENTER), but it
>>> doesn't make the previous links live. It also doesn't make it public. It
>>> does, on the other hand, allow me to grep back to life any article which I
>>> post here. Those who volunteered in Groklaw and Slashdot did a huge favour
>>> to society, IMHO. The deposition tapes, for example, were immortalised by
>>> a Groklawian who chose to remain anonymous.
>>
>> I suppose I could start storing/hosting stuff here. These are only text
>> articles anyway, so the storage requirement would not be vast.
>>
>>>
>>> Maybe one day we'll have 'monopoly deniers' (now they have 'climate
>>> deniers', with the negative connotation), so all this evidence is very
>>> important. It will help write history properly, going past the scope of
>>> Gates' 'Museum of Computing' and charitable work (READ: investment).
>>>
>>
>> Indeed, and I remain somewhat concerned that although we get to keep the
>> short snippets of articles in google and elsewhere, we might be losing
>> the originals.
>
> True. We should learn from history that when evidence goes away, people
> conveniently ignore the past. I suggest we make use of a tool that parses
> [News]-tagged posts, extracts the URLs, and then curls/wgets them
> systematically, maybe putting them under a directory named just like the
> msg-id. I would have gladly implemented this myself if I were any good at
> Perl parsing, but it sounds like you could reuse a lot from your current
> digest-generating script. You already pick up the tags to isolate some
> posts from the rest and then extract values from the message headers.
> Since information excess is not a major issue (it's just archiving),
> simply wgetting everything (even tinyurls and related posts) which begins
> with "http" might actually work. If the formatting of the posts needs to
> change to support this, that'll be a non-issue; I'd also make it
> convenient for Ed to parse as he creates local copies on his BSD server.
>
It's an interesting possibility. To be honest, you could do this in a
simple sense quite easily from a bash script, since wget can read its
URLs from a file anyway; there are also some good regexes for pulling
URLs out of files in "urlview", so it might just be a case of launching
wget from something like urlview.

The issue that causes me a little concern is that many articles are
split across multiple pages, and deliberately designed so that you
cannot easily pull them down in one go, so in practice you're looking
at something more like web-crawler technology.
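
Something along these lines might do as a first pass -- completely
untested, and the paths, the archive layout and the way the [News]
posts are picked out are only guesses at how the setup would look:

  #!/bin/sh
  # Rough sketch, untested.  POSTS and ARCHIVE are placeholders --
  # point them at wherever the saved [News] posts live and wherever
  # the local copies should end up.
  POSTS=${POSTS:-$HOME/News/cola}
  ARCHIVE=${ARCHIVE:-$HOME/cola-archive}

  grep -l '^Subject:.*\[News\]' "$POSTS"/* | while read -r post; do
      # name the directory after the Message-ID, as Roy suggests
      msgid=$(sed -n 's/^Message-ID: *<\(.*\)>.*/\1/p' "$post" | head -n1)
      dir="$ARCHIVE/$msgid"
      mkdir -p "$dir"
      # crude URL extraction -- urlview's regexes would do a better job
      grep -Eo 'http://[^ >"]*' "$post" | sort -u | while read -r url; do
          wget -q -P "$dir" "$url"
      done
  done

For the multi-page articles, swapping that wget line for something like

  wget -q -r -l 1 -np -k -P "$dir" "$url"

would follow one level of links from each page, which gets part of the
way there without turning into a full crawler, though it will also drag
in a fair amount of noise.
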
Has anyone tried to do anything like this already and perhaps has
solutions for these issues?
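
As for urlview itself, I believe its configuration is just a COMMAND
line in ~/.urlview, so pointing that at wget rather than a browser
might be all it takes -- again untested, and the archive path below is
only a placeholder:

  # ~/.urlview -- hand every selected URL to wget instead of a browser
  COMMAND wget --no-clobber --directory-prefix=$HOME/cola-archive %s
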
--
| Mark Kent -- mark at ellandroad dot demon dot co dot uk |
| Cola faq: http://www.faqs.org/faqs/linux/advocacy/faq-and-primer/ |
| Cola trolls: http://colatrolls.blogspot.com/ |