python - Scrapy doesn't crawl
I'm trying to get a recursive crawl running, and since the one I wrote wasn't working, I pulled an example from the web and tried that instead. I don't know where the problem is, because the crawl doesn't display any errors. Can anyone help me with this?
Also, is there a step-by-step debugging tool that would help me understand the crawl flow of a spider?
Any help regarding this is appreciated.
macbook:spiders hadoop$ scrapy crawl craigs -o items.csv -t csv
/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/zope/__init__.py:1: UserWarning: Module pkg_resources was already imported from /System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/pkg_resources.pyc, but /Library/Python/2.6/site-packages is being added to sys.path
  __import__('pkg_resources').declare_namespace(__name__)
/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/zope/__init__.py:1: UserWarning: Module site was already imported from /System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site.pyc, but /Library/Python/2.6/site-packages is being added to sys.path
  __import__('pkg_resources').declare_namespace(__name__)
2013-02-08 20:35:55+0530 [scrapy] INFO: Scrapy 0.16.4 started (bot: myspider)
2013-02-08 20:35:55+0530 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-02-08 20:35:55+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-02-08 20:35:55+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-02-08 20:35:55+0530 [scrapy] DEBUG: Enabled item pipelines:
2013-02-08 20:35:55+0530 [craigs] INFO: Spider opened
2013-02-08 20:35:55+0530 [craigs] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-02-08 20:35:55+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-02-08 20:35:55+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-02-08 20:35:58+0530 [craigs] DEBUG: Crawled (200) <GET http://sfbay.craigslist.org/npo/> (referer: None)
2013-02-08 20:35:58+0530 [craigs] INFO: Closing spider (finished)
2013-02-08 20:35:58+0530 [craigs] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 230,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 7291,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2013, 2, 8, 15, 5, 58, 415553),
     'log_count/DEBUG': 7,
     'log_count/INFO': 4,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2013, 2, 8, 15, 5, 55, 343482)}
2013-02-08 20:35:58+0530 [craigs] INFO: Spider closed (finished)
The code I have used is as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
#from craigslist_sample.items import CraigslistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/npo/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("d00\.html",), restrict_xpaths=('//p[@id="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.select("a/text()").extract()
            item["link"] = titles.select("a/@href").extract()
            items.append(item)
        return(items)
Modify the SgmlLinkExtractor as payala suggested, and remove the restrict_xpaths section of the link extractor.
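For example, the rule from the question could be reduced to something like this (an untested sketch, keeping the same allow pattern and callback):

    rules = (
        Rule(SgmlLinkExtractor(allow=("d00\.html",)),
             callback="parse_items",
             follow=True),
    )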
These changes fix the issue being experienced. I'd also like to make the following suggestion for the XPath used to select the titles, to remove the empty items that occur because the next page links are also being selected.
def parse_items(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select("//p[@class='row']")
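With that selector in place, the rest of parse_items from the question can stay as it was; a rough sketch of the full method (assuming the CraigslistSampleItem import in the question is uncommented) would be:

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        # select only the listing rows, so the next-page links don't produce empty items
        titles = hxs.select("//p[@class='row']")
        items = []
        for title in titles:
            item = CraigslistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items

Running the same command as before (scrapy crawl craigs -o items.csv -t csv) should then write the title/link pairs to items.csv.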
python web-crawler scrapy