python - Anonymize data in column 1 and column 3 of tables in an HTML file
I have an HTML file with several tables in it. I want to alter the info in columns 1 and 3 to a name plus a number, where the number increments after each row is updated. So this:
    <!doctype html public "-//w3c//dtd html 4.01//en">
    <html lang="en">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>blah blah doc</title>
    <style type="text/css"> ... ... </style>
    </head>
    <body>
    <!-- lots of html tags p h1, h2 ul etc no tables skipped on -->
    <table id="something" summary="...">
    <thead> <th ...</th> ... </thead>
    <tbody>
    <tr>
        <td>mark jones</td>
        <td>blah blah</td>
        <td>mark jones</td>
        <td>blah blah</td>
        <td>11/12/2009</td>
        <td>blah blah</td>
    </tr>
    ...
would become:
    ...
    <tr>
        <td>name1</td>
        <td>blah blah</td>
        <td>name1</td>
        <td>blah blah</td>
        <td>11/12/2009</td>
        <td>blah blah</td>
    </tr>
There are lots of other HTML tags and text before, after, and between the tables.
The above is just an illustration of one row; the name and the other column info are different in each row. The whitespace is how it appears when I view the source. I'm reasonably comfortable with Perl and Python, but I don't know enough to tackle this.
Assuming that's in a table and you have lxml installed (and with the caveat that I haven't had my morning dose of coffee yet!):
    html = """
    <table>
        <tr>
            <td>mark jones</td>
            <td>blah blah</td>
            <td>mark jones</td>
            <td>blah blah</td>
            <td>11/12/2009</td>
            <td>blah blah</td>
        </tr></table>"""

    import lxml.html
    import lxml.etree
    from itertools import count

    tree = lxml.html.fromstring(html)
    # The default argument binds one shared counter, so every call returns the next name.
    next_name = lambda count=count(1): 'name{}'.format(next(count))
    for trs in tree.findall('tr'):
        tds = trs.findall('td')
        anon_name = next_name()
        tds[0].text = anon_name  # column 1
        tds[2].text = anon_name  # column 3
    # encoding='unicode' returns text rather than bytes, so this prints cleanly on Python 3 too.
    print(lxml.etree.tostring(tree, encoding='unicode'))
gives you:
<table><tr><td>name1</td> <td>blah blah</td> <td>name1</td> <td>blah blah</td> <td>11/12/2009</td> <td>blah blah</td> </tr></table>
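The snippet above only looks at `tr` elements that are direct children of a single `table`. The file in the question wraps its rows in `<tbody>`, has a `<thead>` with `<th>` cells, and contains several tables, so here is a rough sketch of how the same idea might be stretched to a whole document. The function name `anonymize_tables` and its `columns` and `prefix` parameters are made up for this illustration, and it assumes the target cells hold plain text with no nested tags.

```python
import itertools

import lxml.html


def anonymize_tables(html_text, columns=(0, 2), prefix="name"):
    """Replace the text of the chosen <td> columns with prefix + row number.

    Hypothetical helper: assumes each target cell contains plain text only.
    """
    doc = lxml.html.fromstring(html_text)
    counter = itertools.count(1)
    # iter("tr") walks the whole tree, so rows are found in every table and
    # regardless of whether they sit inside <thead> or <tbody>.
    for row in doc.iter("tr"):
        cells = row.findall("td")
        # Header rows (only <th>) and short rows do not have enough <td> cells.
        if len(cells) <= max(columns):
            continue
        anon_name = "{}{}".format(prefix, next(counter))
        for index in columns:
            cells[index].text = anon_name
    return lxml.html.tostring(doc, encoding="unicode")
```

You would read the file into a string, pass it to `anonymize_tables`, and write the returned string back out. The counter keeps running across tables, so the first row of the second table continues the numbering rather than restarting at 1. Since lxml re-serializes the whole document, details such as the doctype line or the original whitespace may not come back exactly as they went in.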
python perl sed