Friday, 15 February 2013

Sequentially crawl website using scrapy

Is there a way to tell Scrapy to stop crawling based on a status found on a second-level page? I am doing the following:

I have a start_url to begin with (the first-level page). In parse(self, response) I extract a set of URLs from the start_url, then queue those links as Requests with the callback parseDetailPage(self, response). It is in parseDetailPage (the second-level page) that I find out whether crawling should stop or not.

Right now I am using CloseSpider() to accomplish this, but the problem is that the URLs to be parsed have already been queued by the time I start crawling the second-level pages, and I do not know how to remove them from the queue. Is there a way to sequentially crawl the list of links and be able to stop in parseDetailPage?

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.exceptions import CloseSpider

# The following attributes and methods live inside the spider class.

start_urls = ["http://sfbay.craigslist.org/sof/"]

def __init__(self):
    self.job_in_range = True

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('//blockquote[@id="toc_rows"]')
    items = []
    if results:
        links = results.select('.//p[@class="row"]/a/@href')
        for link in links:
            if link == self.end_url:  # end_url is set elsewhere in the spider
                break
            nextUrl = link.extract()
            isValid = WPUtil.validateUrl(nextUrl)
            if isValid:
                item = WoodPeckerItem()
                item['url'] = nextUrl
                item = Request(nextUrl, meta={'item': item}, callback=self.parseDetailPage)
                items.append(item)
    else:
        self.log('Could not parse the document')
    return items

def parseDetailPage(self, response):
    if self.job_in_range is False:
        raise CloseSpider('End date reached - no more crawling ' + self.name)
    hxs = HtmlXPathSelector(response)
    print response
    body = hxs.select('//article[@id="pagecontainer"]/section[@class="body"]')
    item = response.meta['item']
    item['postDate'] = body.select('.//section[@class="userbody"]/div[@class="postinginfos"]/p')[1].select('.//date/text()')[0].extract()
    item['jobTitle'] = body.select('.//h2[@class="postingtitle"]/text()')[0].extract()
    # The title must be extracted before it can be checked.
    if item['jobTitle'] == 'admin':
        self.job_in_range = False
        raise CloseSpider('Stop crawling')
    item['description'] = body.select('.//section[@class="userbody"]/section[@id="postingbody"]').extract()
    return item
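One way to make the crawl strictly sequential is to schedule only one request at a time: parse() requests just the first link and passes the full list along in meta, and each detail callback schedules the next link only if the stop condition has not been met. Because nothing else is ever queued, raising CloseSpider stops the crawl immediately. This is a minimal sketch, assuming a single listing page; buildItem and shouldStop are hypothetical placeholders for the extraction and stop logic above.

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.exceptions import CloseSpider

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    links = hxs.select('//blockquote[@id="toc_rows"]//p[@class="row"]/a/@href').extract()
    # Request only the first link; the remaining links travel along in meta.
    if links:
        yield Request(links[0], meta={'links': links, 'index': 0},
                      callback=self.parseDetailPage)

def parseDetailPage(self, response):
    links = response.meta['links']
    index = response.meta['index']
    item = self.buildItem(response)  # hypothetical: fill the item as above
    if self.shouldStop(item):        # hypothetical: e.g. the 'admin' title check
        # No other request has been scheduled, so the spider stops cleanly.
        raise CloseSpider('stop condition met')
    yield item
    # Only now is the next link scheduled.
    if index + 1 < len(links):
        yield Request(links[index + 1],
                      meta={'links': links, 'index': index + 1},
                      callback=self.parseDetailPage)

The trade-off is that the crawl loses all concurrency, which is exactly what sequential processing requires.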

Do you mean that you want to stop the spider and resume it later without parsing the URLs that have already been parsed? If so, you may try setting the JOBDIR setting. That setting keeps the requests.queue in a specified directory on disk.
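For reference, this is the documented way to enable it; the spider and directory names below are just examples:

# settings.py - persist the scheduler queue between runs
JOBDIR = 'crawls/woodpecker-1'

# Or per run, from the shell:
#   scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-1
# Press Ctrl-C to pause; rerunning the same command resumes from the saved queue.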

scrapy
