Thursday, 15 January 2015

web crawler - Cannot download using wget (wget retry "unlimitedly") -




I tried to crawl the website http://docbao.com.vn/ using wget, and wget prints this message:

HTTP request sent, awaiting response... No data received. Retrying.

For example, when I crawled the pages in the category http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec , the result was:

congnh@congnh-pc:~/source/datasection/congnh-crawler/sh$ wget "http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec" -O -
--2013-02-20 23:53:16--  http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
Resolving docbao.com.vn (docbao.com.vn)... 123.30.51.174
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:17--  (try: 2)  http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:19--  (try: 3)  http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:22--  (try: 4)  http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:27--  (try: 5)  http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:32--  (try: 6)  http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:38--  (try: 7)  http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:45--  (try: 8)  http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

--2013-02-20 23:53:53--  (try: 9)  http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
Connecting to docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
HTTP request sent, awaiting response... No data received.
Retrying.
...

Why does wget retry "unlimitedly"? What is the problem? Thanks, Cong

Sorry for stating the obvious, but wget retries because it does not receive any data. It sends the HTTP request headers, and the remote host closes the connection right after that. I can only guess that this is non-standard behaviour caused by a misconfiguration on the server side, or maybe a deliberate one.

After poking around a bit, I found out that the content is served once you signal that you can handle a gzip-encoded response. You can do this by adding --header="Accept-Encoding: gzip" to the wget command. This is again problematic for crawling with wget, since it cannot recurse into gzipped content. You will need to write a script to handle this situation, or use a tool that can handle such content.
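As a rough idea of what such a script could look like, here is a minimal Python sketch using only the standard library. It sends the Accept-Encoding: gzip header, decompresses the body when the server answers with Content-Encoding: gzip, and collects the links a crawler would follow next. The names decode_body, fetch, LinkCollector and extract_links are made up for this illustration, and the UTF-8 default for decoding the page is an assumption; I have not verified how the site behaves today.

```python
import gzip
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


def decode_body(raw, content_encoding):
    """Decompress the body if the server says it is gzip-encoded."""
    if content_encoding and content_encoding.lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(url, timeout=30):
    """Request url while advertising gzip support; return the HTML bytes."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html_bytes, base_url, encoding="utf-8"):
    """Return every link found in the page, in document order.

    The utf-8 default is an assumption; undecodable bytes are replaced.
    """
    collector = LinkCollector(base_url)
    collector.feed(html_bytes.decode(encoding, errors="replace"))
    return collector.links
```

Calling extract_links(fetch(url), url) on the category URL from the question should give the list of pages to visit next. A real crawler would also need a set of already visited URLs and some delay between requests.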

On a side note: please be aware that not all websites allow scraping of their content. Please check the ToS before doing so.

wget web-crawler
