Saturday, 15 September 2012

python - Anonymize data in column 1 and column 3 of tables in an HTML file




I have an HTML file with several tables in it. I want to replace the info in columns 1 and 3 of each table with name+number, where the number increments after each row is updated. So this:

    <!doctype html public "-//w3c//dtd html 4.01//en">
    <html lang="en">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>blah blah doc</title>
    <style type="text/css">
    ...
    ...
    </style>
    </head>
    <body>
    <!-- lots of html tags (p, h1, h2, ul, etc.) but no tables; these can be skipped -->
    <table id="something" summary="...">
    <thead>
    <th ...</th>
    ...
    </thead>
    <tbody>
    <tr>
    <td>mark jones</td>
    <td>blah blah</td>
    <td>mark jones</td>
    <td>blah blah</td>
    <td>11/12/2009</td>
    <td>blah blah</td>
    </tr>
    ...

Would become:

    ...
    <tr>
    <td>name1</td>
    <td>blah blah</td>
    <td>name1</td>
    <td>blah blah</td>
    <td>11/12/2009</td>
    <td>blah blah</td>
    </tr>

There are lots of other HTML tags and text before, after, and between the tables.

The above is an illustration of one row; the name and the other column info are different in each row. The whitespace is how it appears when I view source. I'm reasonably comfortable with Perl and Python, but I don't know enough to tackle this.

Assuming that's what is in a table and you have lxml installed (and with the caveat that I haven't had my morning dose of coffee yet!):

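    from itertools import count

    import lxml.etree
    import lxml.html

    html = """
    <table>
    <tr>
    <td>mark jones</td>
    <td>blah blah</td>
    <td>mark jones</td>
    <td>blah blah</td>
    <td>11/12/2009</td>
    <td>blah blah</td>
    </tr></table>"""

    # Parse the fragment; the root element is the <table> itself.
    tree = lxml.html.fromstring(html)

    # The default argument is evaluated once, so every call shares one counter.
    next_name = lambda count=count(1): 'name{}'.format(next(count))

    for tr in tree.findall('tr'):
        tds = tr.findall('td')
        anon_name = next_name()
        # Columns 1 and 3 are indexes 0 and 2.
        tds[0].text = anon_name
        tds[2].text = anon_name

    # encoding='unicode' returns text rather than bytes.
    print(lxml.etree.tostring(tree, encoding='unicode'))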

Gives you:

<table><tr><td>name1</td> <td>blah blah</td> <td>name1</td> <td>blah blah</td> <td>11/12/2009</td> <td>blah blah</td> </tr></table>
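
The snippet above only looks at `tr` elements that are direct children of a single `<table>`. The file in the question has several tables, with rows under `<tbody>` and header rows under `<thead>`, so here is a rough sketch of how the same idea might be stretched to a whole document. The function name `anonymize_tables`, its parameters, and the sample markup (including the made-up name "jane doe") are illustration only and not from the question. It also assumes the counter keeps running across tables, and that each name cell holds plain text with no nested tags.

```python
from itertools import count

import lxml.html


def anonymize_tables(html_text, name_cols=(0, 2), prefix="name"):
    """Replace the text of the chosen columns in every table row.

    Hypothetical helper: name_cols are zero-based, so (0, 2) means
    columns 1 and 3.
    """
    tree = lxml.html.fromstring(html_text)
    numbers = count(1)  # one counter shared by all tables (an assumption)

    # iter("tr") walks the whole tree, so rows under <tbody> are found too.
    for row in tree.iter("tr"):
        cells = row.findall("td")
        if len(cells) <= max(name_cols):
            # Header rows (only <th>) and short rows are left alone.
            continue
        anon = "{}{}".format(prefix, next(numbers))
        for col in name_cols:
            # Setting .text only swaps the leading text of the cell;
            # child elements inside the <td> would be left in place.
            cells[col].text = anon

    return lxml.html.tostring(tree, encoding="unicode")


# Made-up sample with two tables and other tags around them.
sample = """<html><body>
<h1>report</h1>
<table id="first">
<thead><tr><th>who</th><th>note</th><th>who</th></tr></thead>
<tbody>
<tr><td>mark jones</td><td>blah blah</td><td>mark jones</td><td>11/12/2009</td></tr>
</tbody>
</table>
<p>text between tables</p>
<table id="second">
<tr><td>jane doe</td><td>blah blah</td><td>jane doe</td></tr>
</table>
</body></html>"""

if __name__ == "__main__":
    print(anonymize_tables(sample))
```

One thing to watch: serializing the root element this way does not write the original doctype back out, so for a real file you may want to parse with `lxml.html.parse` and write the resulting tree instead.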

python perl sed
