web crawler - Cannot download using wget (wget retries "unlimitedly")
I am crawling the website http://docbao.com.vn/ using wget, and wget keeps printing this message:
http request sent, awaiting response... no info received. retrying.
For example, when I crawled the pages in the category http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec , the result was:
congnh@congnh-pc:~/source/datasection/congnh-crawler/sh$ wget "http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec" -o -
--2013-02-20 23:53:16-- http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
resolving docbao.com.vn (docbao.com.vn)... 123.30.51.174
connecting docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
http request sent, awaiting response... no info received. retrying.
--2013-02-20 23:53:17-- (try: 2) http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
connecting docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
http request sent, awaiting response... no info received. retrying.
--2013-02-20 23:53:19-- (try: 3) http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
connecting docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
http request sent, awaiting response... no info received. retrying.
--2013-02-20 23:53:22-- (try: 4) http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
connecting docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
http request sent, awaiting response... no info received. retrying.
--2013-02-20 23:53:27-- (try: 5) http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
connecting docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
http request sent, awaiting response... no info received. retrying.
--2013-02-20 23:53:32-- (try: 6) http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
connecting docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
http request sent, awaiting response... no info received. retrying.
--2013-02-20 23:53:38-- (try: 7) http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
connecting docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
http request sent, awaiting response... no info received. retrying.
--2013-02-20 23:53:45-- (try: 8) http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
connecting docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
http request sent, awaiting response... no info received. retrying.
--2013-02-20 23:53:53-- (try: 9) http://docbao.com.vn/chuyenmuc/muc-1/quoc_te.dec
connecting docbao.com.vn (docbao.com.vn)|123.30.51.174|:80... connected.
http request sent, awaiting response... no info received. retrying.
...
Why does wget retry "unlimitedly"? Or what is the problem?
Thanks,
Cong
Sorry for stating the obvious, but wget retries because it does not receive any data. It sends the HTTP header, and the remote host closes the connection right after that. I can only guess that this is non-standard behaviour caused by a misconfiguration on the server side, or maybe a deliberate one.
After poking around a bit, I found out that the content is only served once you signal that you can handle a gzip-encoded response. You can do that by adding --header="Accept-Encoding: gzip" to your wget command. This is once again problematic for crawling with wget, since it cannot recurse into gzipped content. You will need to write a script that handles this situation, or use a tool that can handle such content.
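To illustrate what such a script might look like, here is a minimal Python sketch using only the standard library. It assumes the server really does answer once the request carries Accept-Encoding: gzip, and that the pages are UTF-8; neither is confirmed beyond the observation above. The names decode_body, fetch, LinkCollector and extract_links are made up for this example.

```python
import gzip
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


def decode_body(raw, content_encoding):
    """Decompress the body if the server labelled it as gzip, else return it unchanged."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(url, timeout=30):
    """Download `url` while advertising gzip support (assumed to be what the server wants)."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_bytes, base_url, encoding="utf-8"):
    """Return the links found in an already decompressed HTML page."""
    parser = LinkCollector(base_url)
    parser.feed(html_bytes.decode(encoding, errors="replace"))
    return parser.links
```

Calling fetch() on the category URL from the question and passing the result to extract_links() would give the next URLs to visit. A real crawler would also need a visited set, a delay between requests, and a restriction to the target host.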
On a side note: please be aware that not all websites allow scraping of their content. Please check the ToS before doing so.
wget web-crawler