Wednesday, 15 August 2012

python - Issue with designing function for data scraping using BS4




The data I need is present under two different combinations of tag + class. I want a function that searches under both combinations and returns whatever is found under either one. The two combinations are mutually exclusive: if one combination is present, the other is absent.

The code I am using is:

# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import urllib
import time
from bs4 import BeautifulSoup
from itertools import islice

def match_both2(arg1, arg2):
    if arg1 == 'div' and arg2 == 'detailinternetfirstcontent empty openpostit':
        return True
    if arg1 == 'p' and arg2 == 'connection':
        return True
    return False

page = urllib2.urlopen('http://www.sfr.fr/mobile/offres/toutes-les-offres-sfr?vue=000029#sfrintid=v_nav_mob_offre-abo&sfrclicid=v_nav_mob_offre-abo').read()
soup = BeautifulSoup(page)
# This is the line that raises the TypeError: match_both2 is called with one argument
datas = soup.findAll(match_both2(0), {'class': match_both2(1)})
print datas

Right now I am trying to use the match_both2 function to achieve this, but it gives me a TypeError because I am passing it one argument and it requires two. I don't know how to pass two arguments to it in this case, the way I would if I called it as match_both2(example1, example2). I cannot think of a method that solves this problem.

Please help me resolve this issue.

When you use a function as a filter for matching elements, you pass in a reference to the function, not its result. In other words, you are not supposed to call it before passing it to .findAll().
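The difference between passing a function and passing its result can be sketched without BeautifulSoup at all. The snippet below is only an illustration: select and is_even are hypothetical names that do not come from the question, and select stands in for any routine that accepts a filter function.

```python
def is_even(n):
    # Predicate: receives one item and returns True or False.
    return n % 2 == 0

def select(items, predicate):
    # Hypothetical stand-in for a search routine that takes a filter
    # function and calls it once per item.
    return [item for item in items if predicate(item)]

numbers = [1, 2, 3, 4]

# Correct: pass the function object itself; select() does the calling.
evens = select(numbers, is_even)

# Incorrect: is_even() is called immediately, with no argument, so a
# TypeError is raised before select() even starts.
error = None
try:
    select(numbers, is_even())
except TypeError as exc:
    error = str(exc)

print(evens)
print(error)
```

The second call fails for the same reason as match_both2(0) in the question: the function is invoked too early and with the wrong number of arguments.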

The function is called with one argument, the element itself. Moreover, the class attribute has already been split into a list. To match your specific elements, you need to define the match function as:

def match_either(tag):
    if tag.name == 'div':
        # at *least* these three classes must be present
        return {'detailinternetfirstcontent', 'empty', 'openpostit'}.issubset(tag.get('class', []))
    if tag.name == 'p':
        # at *least* this one class must be present
        return 'connection' in tag.get('class', [])
    # any other tag never matches
    return False

This function returns True for a p tag with the connection class, or for a div tag with all three classes present.

Pass it to findAll() without calling it:

datas = soup.findAll(match_either)
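To try the filter without fetching the live page, here is a minimal sketch that runs it against a small made-up HTML snippet. SAMPLE_HTML and its contents are assumptions for illustration only, not markup taken from the real site. It uses find_all, which is the bs4 spelling of findAll, and the built-in html.parser so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page is not reproduced here.
SAMPLE_HTML = """
<div class="detailinternetfirstcontent empty openpostit">offer A</div>
<div class="empty">ignored</div>
<p class="connection">offer B</p>
<p class="other">ignored</p>
"""

def match_either(tag):
    # bs4 exposes the class attribute as a list of class names.
    classes = tag.get('class', [])
    if tag.name == 'div':
        # All three classes must be present on the div.
        return {'detailinternetfirstcontent', 'empty', 'openpostit'}.issubset(classes)
    if tag.name == 'p':
        # The p tag only needs the connection class.
        return 'connection' in classes
    return False

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
# Pass the function itself; bs4 calls it once for every tag.
matches = soup.find_all(match_either)
texts = [tag.get_text() for tag in matches]
print(texts)
```

Only the first div and the first p should be selected, since the other two elements lack the required classes.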

python python-2.7 beautifulsoup
