Monday, 15 April 2013

python - Scrapy crawl from script always blocks script execution after scraping

I followed this guide, http://doc.scrapy.org/en/0.16/topics/practices.html#run-scrapy-from-a-script, to run Scrapy from a script. Here is the relevant part of the script:

crawler = Crawler(Settings(settings))
crawler.configure()
spider = crawler.spiders.create(spider_name)
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
print "It can't be printed out!"

It works as it should: it visits the pages, scrapes the needed info and stores the output JSON where I told it to (via FEED_URI). But when the spider finishes its work (I can see the number of items in the output JSON), execution of the script doesn't resume. I guess it isn't a Scrapy problem, and the answer should be somewhere in Twisted's reactor. How can I release the thread so execution continues?
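What you are seeing is standard Twisted behaviour: reactor.run() hands the main thread over to the event loop and only returns after reactor.stop() is called from inside it. Here is a minimal sketch to illustrate, using plain Twisted and nothing from the question's code:

from twisted.internet import reactor

def shut_down():
    reactor.stop()  # the only thing that makes reactor.run() return

reactor.callLater(2, shut_down)  # schedule the stop two seconds from now
reactor.run()                    # blocks here until shut_down() fires
print "reached only after reactor.stop() was called"

In the Scrapy case, nothing ever calls reactor.stop(), so the script blocks forever after the crawl finishes.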

You need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal:

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.xlib.pydispatch import dispatcher
from testspiders.spiders.followall import FollowAllSpider

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Running reactor...')
reactor.run()  # the script will block here until the spider is closed
log.msg('Reactor stopped.')

And the command-line log output might look like this:

stav@maia:/srv/scrapy/testspiders$ ./api
2013-02-10 14:49:38-0600 [scrapy] INFO: Running reactor...
2013-02-10 14:49:47-0600 [followall] INFO: Closing spider (finished)
2013-02-10 14:49:47-0600 [followall] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 23934,...}
2013-02-10 14:49:47-0600 [followall] INFO: Spider closed (finished)
2013-02-10 14:49:47-0600 [scrapy] INFO: Reactor stopped.
stav@maia:/srv/scrapy/testspiders$
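As a side note, Scrapy 0.17 and later expose a per-crawler SignalManager, so the same hook can be registered without scrapy.xlib.pydispatch. A sketch of that variant, with the rest of the setup unchanged from the answer above:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings

def stop_reactor():
    reactor.stop()

crawler = Crawler(Settings())
# connect on this crawler's own SignalManager instead of the global dispatcher
crawler.signals.connect(stop_reactor, signal=signals.spider_closed)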

Tags: python, twisted, scrapy
