python - Scrapy crawl from a script always blocks script execution after scraping
I followed this guide http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script to run Scrapy from a script. Here is part of the script:
from twisted.internet import reactor
from scrapy import log
from scrapy.crawler import Crawler
from scrapy.settings import Settings

# `settings` and `spider_name` are defined earlier in the script
crawler = Crawler(Settings(settings))
crawler.configure()
spider = crawler.spiders.create(spider_name)
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
print "It can't be printed out!"
It works as it should: it visits the pages, scrapes the needed info and stores the output JSON where I told it to (via FEED_URI). But when the spider finishes its work (I can see the number in the output JSON), the execution of my script does not resume. Probably this isn't a Scrapy problem, and the answer should be somewhere in Twisted's reactor. How can I release the thread execution?
You need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal:
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
from testspiders.spiders.followall import FollowAllSpider

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')
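As a side note (not part of the original answer): with pydispatch, the handler's signature controls which signal arguments it receives, so stop_reactor can also accept the spider and the close reason if you want to log them before shutting down. A small variant, assuming the same imports as above:

def stop_reactor(spider, reason):
    # spider_closed is sent with `spider` and `reason` keyword arguments;
    # pydispatch passes along only the ones the receiver declares
    log.msg('Spider %s closed (%s), stopping reactor.' % (spider.name, reason))
    reactor.stop()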
Running the script, the command-line log output might look like this:
stav@maia:/srv/scrapy/testspiders$ ./api
2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 23934,...}
2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
stav@maia:/srv/scrapy/testspiders$
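For reference, newer Scrapy releases (1.0 and later) make the manual signal handling unnecessary: CrawlerProcess starts the reactor itself and stops it once every scheduled crawl has finished, so the script resumes on its own. A minimal sketch, assuming a modern Scrapy running inside the testspiders project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# the spider can be referenced by name; keyword arguments are passed
# to the spider's constructor, as with FollowAllSpider above
process.crawl('followall', domain='scrapinghub.com')
process.start()  # blocks until the crawl finishes, then stops the reactor
print("This line is reached once the spider closes.")

Whichever approach you use, note that a Twisted reactor cannot be restarted once stopped, so it should only be started once per process.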
python twisted scrapy