__/ [moma] on Sunday 08 January 2006 11:58 \__
> fritz-bayer@xxxxxx wrote:
>> Hi,
>>
>> I'm trying to mirror the homepages returned by a Google search. The
>> problem is that wget does not retrieve the homepages from the search.
>>
>> Does anybody know how to use wget so that it will download the
>> homepage for each search result returned by the search?
>>
> An idea,
>
> $ lynx --dump http://test.com (<- place your search-engine URL here)
>
> ....
> References
> Visible links
> 1. http://test.com/servlet/com.test.servlet.account.Login
> 2. javascript:popWin('/phoenix/tour_home.htm')
> 3. http://test.com/phoenix/contact_general.htm
> 4. http://test.com/phoenix/so_1_2.htm
> 5. http://test.com/phoenix/so_1_3.htm
> 6. http://test.com/phoenix/so_1_4.htm
> 7. http://test.com/phoenix/so_1_5.htm
> 8. http://test.com/phoenix/so_2_2.htm
> ....
$ wget -O ~/serp http://test.com
With the page retrieved, e.g. a SERP (search engine results page), you can
also extract the list of links, which you then pass to wget for traversal:
$ cat ~/serp | perl -ne '@url=m!(http://[^>"]+)!g;print "$_\n" foreach @url' \
    > ~/googleurls
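
Once the URLs are in a file like that, wget can read them back in with -i
(one URL per line), e.g. something along these lines, picking flags to taste:

$ wget -T5 -t2 -p -k -i ~/googleurls
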
> The listing begins after "References" and "Visible links" words. Study
> $ man lynx
> for other options. ($ sudo apt-get install lynx)
>
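If you only want that link listing, something like this should trim off
everything above it (untested):

$ lynx --dump http://test.com | sed -n '/Visible links/,$p'
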
> Of course you can pipe the output to
> | grep -e "^\ *[0-9]*\." and
> | grep "http://" and
> | sort | uniq and
> | sed etc.
> for further processing (a combined pipeline is sketched further down)
>
> and finally do wget -r -l1 -k -T5 SOME_HTTP_URL
> like this
> $ wget -r -l2 -k -T5 http://www.futuredesktop.org
>
> which downloads the web-site up to two link-levels deep (-l2).
> -k (--convert-links) will make the links in the downloaded HTML point to
> your local copies.
>
> $ man wget
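
Putting those pieces together, something like this (untested) should leave
just the result URLs, one per line:

$ lynx --dump http://test.com \
    | grep -e '^ *[0-9]*\. http' \
    | sed -e 's/^ *[0-9]*\. //' \
    | sort | uniq > ~/googleurls
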
I tend to use wget in the following fashion for downloads that are 'kind':

wget -r -l1 -H -t1 -nd -N -np -A.htm,.html,.php -erobots=off -i ~/url_list.txt

The list should be newline-separated.
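For example, ~/url_list.txt might then look like this (made-up URLs):

http://www.example.com/index.html
http://www.example.org/articles/intro.htm
http://www.example.net/faq.php
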
Hope it helps,
Roy
--
Roy S. Schestowitz
http://Schestowitz.com | SuSE Linux | PGP-Key: 0x74572E8E
6:35pm up 29 days 1:46, 14 users, load average: 0.51, 0.70, 0.66