regex - Pulling valid data from bytestring in Python 3
Given the following bytestring, how can I remove any characters matching \xff, and create a list object from what's left (by splitting on the removed areas)?
b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
Desired result:
["~", "pts/5", "/5", "user"]
The above string is just an example; I'd like to remove any \x.. (non-decoded) bytes.
I'm using Python 3.2.3, and would prefer to use standard libraries only.
>>> a = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"
>>> import re
>>> re.findall(br"[^\x00-\x1f\x7f-\xff]+", a)
[b'~', b'pts/5', b'/5', b'user']
The results are still bytes objects. If you want the results as strings:
>>> [i.decode("ascii") for i in re.findall(br"[^\x00-\x1f\x7f-\xff]+", a)]
['~', 'pts/5', '/5', 'user']
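Since the question phrases the task as splitting on the removed bytes, here is a minimal sketch of the same idea using re.split with the complementary character class. The variable names are illustrative only, and the sketch assumes the embedded text is plain ASCII.

```python
import re

raw = b"\x07\x00\x00\x00~\x10\x00pts/5\x00\x00/5\x00\x00user\x00\x00"

# Split on runs of control bytes (0x00-0x1f) and high bytes (0x7f-0xff).
# The br"" prefix is accepted by every Python 3 release.
pieces = re.split(br"[\x00-\x1f\x7f-\xff]+", raw)

# Separators at the start and end of the input produce empty bytes objects,
# so drop those before decoding (assumes the remaining text is ASCII).
strings = [p.decode("ascii") for p in pieces if p]
print(strings)  # ['~', 'pts/5', '/5', 'user']
```

Compared with re.findall, this variant needs the extra `if p` filter, which is why findall is usually the shorter choice here.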
Explanation:
[^\x00-\x1f\x7f-\xff]+ matches one or more (+) characters that are not in the range ([^...]) between ASCII 0 and 31 (\x00-\x1f) or between ASCII 127 and 255 (\x7f-\xff).
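One way to check which byte values that character class actually keeps is to run it over all 256 possible bytes. This is only a quick sanity check, not part of the solution; the names `pattern`, `all_bytes` and `kept` are made up for the illustration.

```python
import re

pattern = re.compile(br"[^\x00-\x1f\x7f-\xff]+")

# Every possible byte value, in ascending order.
all_bytes = bytes(range(256))

kept = pattern.findall(all_bytes)

# Exactly one run survives: the printable ASCII bytes 32 (space) to 126 (~).
print(len(kept), kept[0][0], kept[0][-1])  # 1 32 126
```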
Be aware that this approach only works if the "embedded texts" are ASCII. It will remove all extended alphabetic characters (like ä, é, € etc.) from strings encoded in an 8-bit codepage like latin-1, and it will destroy strings encoded in UTF-8 and other Unicode encodings, because those contain byte values between 0 and 31 or between 127 and 255 as parts of their character codes.
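To make that caveat concrete, the following sketch feeds the same pattern a few non-ASCII encodings. The sample word and the choice of encodings are assumptions picked for the illustration.

```python
import re

pattern = re.compile(br"[^\x00-\x1f\x7f-\xff]+")

latin1 = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
utf8 = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9'
utf16 = "user".encode("utf-16-le")       # b'u\x00s\x00e\x00r\x00'

# The accented letter is dropped in both 8-bit and UTF-8 input.
print(pattern.findall(latin1))  # [b'caf']
print(pattern.findall(utf8))    # [b'caf']

# UTF-16 interleaves NUL bytes, so even pure-ASCII text is torn apart.
print(pattern.findall(utf16))   # [b'u', b's', b'e', b'r']
```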
Of course, you can always manually fine-tune the exact ranges you want to remove, following the example given in this answer.
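As one possible adjustment, suppose (purely as an assumption for this sketch) that the embedded text is known to be Latin-1. Then only the control ranges 0x00-0x1f and 0x7f-0x9f need to be treated as separators, and the bytes 0xa0-0xff can be kept and decoded as latin-1. The name `LATIN1_RUN` and the sample data are hypothetical.

```python
import re

# Assumption: the data is Latin-1, so bytes 0xa0-0xff are printable characters
# and only the control ranges 0x00-0x1f and 0x7f-0x9f act as separators.
LATIN1_RUN = re.compile(br"[^\x00-\x1f\x7f-\x9f]+")

raw = b"\x00\x00caf\xe9\x00\x00user\x00"

strings = [m.decode("latin-1") for m in LATIN1_RUN.findall(raw)]
print(strings)
```

The same pattern would be wrong for UTF-8 input, where continuation bytes fall in 0x80-0xbf, so the ranges have to match the encoding you actually expect.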