Wednesday, 15 June 2011

scripting - How to scrape HTML tags spread over multiple lines in Python?




I am trying to scrape a webpage in Python. I am able to get results for tags that sit on a single line, but for tags spread over multiple lines, my code cannot retrieve anything.

In the HTML source, single-line tags are present as:

<td><span class="facultyname">john matthew falletta, md</span>

And multi-line tags are present as:

<td><span class="label">division:</span>
&nbsp;&nbsp;
</td><td>hematology/oncology</td>

Here is what I wrote:

import re

patfinderfullname = re.compile('<span class="facultyname">(.*)</span>')
fullname = re.findall(patfinderfullname, webpage)  # works fine
patfinderdivision = re.compile('<span class="label">division:</span>&nbsp;&nbsp;</td><td>(.*)</td>')
partition = re.findall(patfinderdivision, webpage)  # doesn't work

Here the webpage variable holds the page fetched from the URL to be scraped. Can anyone point out what I am missing or doing wrong?

I highly recommend using BeautifulSoup. It is a Python library for parsing HTML documents.
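As a rough sketch of what that could look like for the two snippets in the question: the sample markup below is an assumption modeled on those snippets (the surrounding table rows and the line breaks are made up for illustration), and it relies on the third-party bs4 package being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, modeled on the snippets in the question.
# The <table>/<tr> wrappers and the line breaks are assumptions.
webpage = """
<table><tr>
<td><span class="facultyname">john matthew falletta, md</span></td>
</tr><tr>
<td><span class="label">division:</span>
&nbsp;&nbsp;
</td><td>hematology/oncology</td>
</tr></table>
"""

soup = BeautifulSoup(webpage, "html.parser")

# The name is the text directly inside the "facultyname" span.
fullname = soup.find("span", class_="facultyname").get_text(strip=True)

# The division value is in the <td> that follows the label's own <td>,
# so line breaks inside the label cell no longer matter.
label = soup.find("span", class_="label", string="division:")
division = label.find_parent("td").find_next_sibling("td").get_text(strip=True)

print(fullname)
print(division)
```

Because the parser works on the tag tree rather than on raw text, whitespace and newlines between tags do not affect the lookup.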

P.S.: If you want to stick with your own code, use \s* to skip whitespace (including newlines) in the regex.

patfinderdivision = re.compile(r'<span class="label">division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*)</td>')
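To see the difference, here is a small self-contained comparison of the strict pattern from the question and the whitespace-tolerant one. The sample string is an assumption (the line breaks are guessed from the question's description), and the non-greedy group (.*?) is a small variation of mine so the match stops at the first closing </td>.

```python
import re

# Hypothetical page source; the positions of the newlines are assumed.
webpage = (
    '<td><span class="facultyname">john matthew falletta, md</span>\n'
    '<td><span class="label">division:</span>\n'
    '&nbsp;&nbsp;\n'
    '</td><td>hematology/oncology</td>\n'
)

# Pattern from the question: expects the tags to touch each other exactly.
strict = re.compile(
    r'<span class="label">division:</span>&nbsp;&nbsp;</td><td>(.*)</td>'
)

# Pattern with \s* wherever the source may contain spaces or newlines.
tolerant = re.compile(
    r'<span class="label">division:</span>\s*&nbsp;&nbsp;\s*</td><td>(.*?)</td>'
)

strict_result = strict.findall(webpage)
tolerant_result = tolerant.findall(webpage)

print(strict_result)    # [] because the newlines break the literal match
print(tolerant_result)  # ['hematology/oncology']
```

Since \s matches newlines as well as spaces and tabs, no extra flag such as re.DOTALL is needed for this particular case.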

python scripting web-scraping
