Hee: scripting - How to scrape HTML tags spread over multiple lines in Python?

Wednesday, 15 June 2011

I am trying to scrape a webpage in Python. I am able to get results for tags that sit on a single line, but for tags spread over multiple lines my code cannot retrieve anything.

In the HTML source, single-line tags appear as:

<td>john matthew falletta, md

and multi-line tags appear as:

<td>division:    </td><td>hematology/oncology</td>

Here is what I wrote:

patfinderfullname = re.compile('<span class="facultyname">(.*)</span>')
fullname = re.findall(patfinderfullname, webpage)  # works fine

patfinderdivision = re.compile('<span class="label">division:</span>&nbsp;&nbsp;</td><td>(.*)</td>')
partition = re.findall(patfinderdivision, webpage)  # doesn't work

Here the webpage variable contains the source of the URL being scraped. Can anyone point out what I am missing, or what is wrong?

I highly recommend using BeautifulSoup. It is a Python library for parsing HTML documents.
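As a rough illustration of that suggestion, here is a minimal sketch. It assumes the bs4 package is installed and that the page uses the markup implied by the regexes in the question: a span with class "facultyname" for the name, and a span with class "label" inside one table cell followed by the value in the next cell. The sample HTML and the helper name extract_faculty_info are made up for this example and are not from the original page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the patterns in the question.
# Note that the label cell opens on one line and closes on the next.
SAMPLE_HTML = """
<table>
  <tr><td><span class="facultyname">John Matthew Falletta, MD</span></td></tr>
  <tr>
    <td><span class="label">Division:</span>&nbsp;&nbsp;
    </td><td>Hematology/Oncology</td>
  </tr>
</table>
"""


def extract_faculty_info(html):
    """Return (full_name, division); either part is None if not found."""
    soup = BeautifulSoup(html, "html.parser")

    # The name is the text of the first span with class "facultyname".
    name_tag = soup.find("span", class_="facultyname")
    full_name = name_tag.get_text(strip=True) if name_tag else None

    division = None
    for label in soup.find_all("span", class_="label"):
        if label.get_text(strip=True) == "Division:":
            # The value sits in the <td> that follows the label's own <td>.
            value_cell = label.find_parent("td").find_next_sibling("td")
            if value_cell is not None:
                division = value_cell.get_text(strip=True)
            break
    return full_name, division


full_name, division = extract_faculty_info(SAMPLE_HTML)
print(full_name, "|", division)
```

Because the parser works on the tag tree rather than on raw text, line breaks and extra whitespace between tags do not affect the lookup.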

P.S.: If you want to stick with your own code, use \s* to skip whitespace (including line breaks) in the regex.

patfinderdivision = re.compile(r'<span class="label">division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*)</td>')
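To see why the \s* version helps, here is a small self-contained sketch using only the standard re module. The sample string is hypothetical: it assumes the closing </td> of the label cell falls on the line after the &nbsp; entities, which is the kind of layout the question describes. The names strict and tolerant are illustrative only.

```python
import re

# Hypothetical sample: the label cell closes on a different line than it opens.
SAMPLE_HTML = (
    '<td><span class="label">Division:</span>&nbsp;&nbsp;\n'
    '        </td><td>Hematology/Oncology</td>'
)

# Pattern in the style of the question: it demands "</td>" immediately after
# the entities, so the newline and indentation in between break the match.
strict = re.compile(
    r'<span class="label">Division:</span>&nbsp;&nbsp;</td><td>(.*)</td>'
)

# \s* matches any run of whitespace, newlines included, so the line break is
# tolerated. (.*?) is non-greedy and stops at the first closing </td>.
tolerant = re.compile(
    r'<span class="label">Division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*?)</td>'
)

strict_matches = strict.findall(SAMPLE_HTML)
tolerant_matches = tolerant.findall(SAMPLE_HTML)

print(strict_matches)
print(tolerant_matches)
```

The non-greedy group is a small addition to the pattern above: with a plain (.*), a line holding several cells would be captured up to the last </td> on that line rather than the first.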

python scripting web-scraping
