Saturday, 15 March 2014

regex - Pulling valid data from bytestring in Python 3




Given the following bytestring, how can I remove any characters matching \xff and similar bytes, and create a list object from what's left (by splitting on the removed areas)?

b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"

Desired result:

["~", "pts/5", "/5", "user"]

The above string is just an illustration - I'd like to remove all \x.. (non-decoded) bytes.

I'm using Python 3.2.3, and would prefer to use standard libraries only.

>>> a = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
>>> import re
>>> re.findall(rb"[^\x00-\x1f\x7f-\xff]+", a)
[b'~', b'pts/5', b'/5', b'user']

The results are still bytes objects. If you want the results to be strings:

>>> [i.decode("ascii") for i in re.findall(rb"[^\x00-\x1f\x7f-\xff]+", a)]
['~', 'pts/5', '/5', 'user']

Explanation:

[^\x00-\x1f\x7f-\xff]+ matches one or more (+) characters that are not in the ranges ([^...]) between ASCII 0 and 31 (\x00-\x1f) or between ASCII 127 and 255 (\x7f-\xff).
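As a minimal sketch, the pattern explained above can be wrapped in a small helper. The names PRINTABLE_ASCII_RUN and extract_ascii_runs are made up for illustration and do not come from the original answer. The pattern is written with the br prefix, which is equivalent to rb and is also accepted by Python versions before 3.3.

```python
import re

# Hypothetical names, for illustration only.
# One or more bytes that are NOT control bytes (0-31) and NOT 127-255.
PRINTABLE_ASCII_RUN = re.compile(br"[^\x00-\x1f\x7f-\xff]+")


def extract_ascii_runs(data):
    """Return every run of printable ASCII bytes in data as a str."""
    return [run.decode("ascii") for run in PRINTABLE_ASCII_RUN.findall(data)]


sample = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
print(extract_ascii_runs(sample))  # ['~', 'pts/5', '/5', 'user']
```

Note that the space character (ASCII 32) lies outside both excluded ranges, so runs containing spaces are kept as a single item.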

Be aware that this approach only works if the "embedded texts" are ASCII. It will remove all extended alphabetic characters (like ä, é, etc.) from strings encoded in an 8-bit codepage like Latin-1, and it will effectively destroy strings encoded in UTF-8 and other Unicode encodings, because those do contain byte values between 0 and 31 or between 127 and 255 as parts of their character codes.
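The caveat can be seen in a short sketch. The sample strings below are made up for illustration; they encode the same kind of text in Latin-1, UTF-8, and UTF-16 and run the ASCII-only pattern over each.

```python
import re

pattern = re.compile(br"[^\x00-\x1f\x7f-\xff]+")

# Illustrative inputs (not from the original question).
latin1 = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
utf8 = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9'
utf16 = "user".encode("utf-16-le")       # b'u\x00s\x00e\x00r\x00'

print(pattern.findall(latin1))  # [b'caf']  (the accented letter is dropped)
print(pattern.findall(utf8))    # [b'caf']  (both bytes of the letter are dropped)
print(pattern.findall(utf16))   # [b'u', b's', b'e', b'r']  (split at every NUL byte)
```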

Of course, you can always manually fine-tune the exact ranges you want to remove, following the illustration given in this answer.
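One possible adjustment, sketched under the assumption that the embedded text is Latin-1: exclude only the control ranges 0-31 and 127-159, so that bytes 160-255 survive, and decode with "latin-1" instead of "ascii". The function name and sample data are hypothetical.

```python
import re

# Assumption: the data holds Latin-1 text, so only control bytes are removed.
LATIN1_RUN = re.compile(br"[^\x00-\x1f\x7f-\x9f]+")


def extract_latin1_runs(data):
    """Return every run of non-control bytes in data, decoded as Latin-1."""
    return [run.decode("latin-1") for run in LATIN1_RUN.findall(data)]


sample = b"\x00caf\xe9\x00\x00user\x01"
print(extract_latin1_runs(sample))  # ['café', 'user']
```

If the data actually uses a different encoding, the ranges and the decode call would need to change accordingly.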

regex python-3.x python-3.2 bytestring
