python - Extract multiple line data between two symbols - Regex and Python3
I have a huge file from which I need to get information for specific entries. The file structure is:

>entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>entry3.2
#size=1777
and so on-----------
What I have to accomplish is to extract all the lines (the complete record) for certain entries. For example, I need the record for entry1.1, so I can use the name of the entry '>entry1.1' and the next '>' as markers in a regex to extract the lines in between. I do not know how to build such complex regex expressions. Once I have such an expression, I will put it in a for loop:
for entry in entrylist:
    get record from big_file
    processing
    write in result file
What would the regex be to perform such an extraction of records for specific entries? Is there a more pythonic way to accomplish this? I would appreciate your help on this.
AK
With regex:

import re

ss = '''
>entry1.1
#size=1688
704 1 1 1 4
979 2 2 2 0
1220 1 1 1 4
1309 1 1 1 4
1316 1 1 1 4
1372 1 1 1 4
1374 1 1 1 4
1576 1 1 1 4
>entry2.1
#size=6251
6110 3 1.5 0 2
6129 2 2 2 2
6136 1 1 1 4
6142 3 3 3 2
6143 4 4 4 1
6150 1 1 1 4
6152 1 1 1 4
>entry3.2
#size=1777
and so on-----------
'''

# %s is replaced by the entry reference typed by the user
patbase = r'(>entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'

while True:
    x = input('What entry do you want ? : ')
    # re.escape keeps the dot of a reference such as 1.1 literal
    found = re.findall(patbase % re.escape(x), ss, re.DOTALL)
    if found:
        print('found ==', found)
        for each_entry in found:
            print('\n%s\n' % each_entry)
    else:
        print('\n ** There is no such entry **\n')
Explanation of '(>entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))':

1) %s receives the reference of the entry: 1.1, 2, 2.1, etc.
2) The portion (?![^\n]+?\d) performs a verification.

(?![^\n]+?\d) is a negative look-ahead assertion which says that what comes after %s must not be [^\n]+?\d, that is to say any characters [^\n]+? before a digit \d.

I write [^\n] to mean "any character except a newline \n". I am obliged to write this instead of simply .+? because I set the flag re.DOTALL, and the pattern portion .+? would then keep acting until the end of the entry. However, I only want to verify that, after the entered reference (represented by %s in the pattern), there are no supplementary digits before the end of the LINE that were left out by error.

All of this is because if there is an entry2.1 but no entry2, and the user enters only 2 because he wants entry2 and no other, the regex would detect the presence of entry2.1 and would yield it, though the user really wanted entry2.
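As a small illustration of this verification, the sketch below runs the same pattern against a reduced sample. The helper name find_entries, the sample strings, and the STRICT variant are assumptions made for the example and are not part of the original answer.

```python
import re

# Pattern from the answer, written as a raw string.
PATBASE = r'(>entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'
# Hypothetical variant: '*' instead of '+' inside the look-ahead.
STRICT = r'(>entry *%s(?![^\n]*\d).+?)(?=>|(?:\s*\Z))'

sample = ">entry2.1\n#size=6251\n6110 3 1.5 0 2\n>entry3.2\n#size=1777\n"

def find_entries(ref, text, pattern=PATBASE):
    """Return every record whose header matches the reference `ref`."""
    return re.findall(pattern % re.escape(ref), text, re.DOTALL)

# '2' is rejected: the header line still contains '.1' after the reference.
print(find_entries("2", sample))      # []
# '2.1' is accepted: nothing but the newline follows the reference.
print(find_entries("2.1", sample))

# Caveat: '+' needs at least one character before the digit, so a single
# extra digit right after the reference is not caught by the look-ahead.
print(find_entries("2", ">entry22\n#size=5\n"))           # matches entry22
print(find_entries("2", ">entry22\n#size=5\n", STRICT))   # []
```

The last two calls suggest that, if headers such as entry22 can occur, replacing + by * in the look-ahead may be worth considering; that is a suggestion under the assumptions above, not something the answer states.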
3) At the end of '(>entry *%s(?![^\n]+?\d).+?)', the part .+? catches the complete block of the entry, because the dot represents any character, the newline \n included. That is the reason why I set the flag re.DOTALL: it makes the following pattern portion .+? capable of passing over the newlines until the end of the entry.
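The effect of the flag can be checked on a two-entry string. This is only a sketch: the simplified pattern and the variable names are chosen for the example.

```python
import re

record = ">entry1.1\n#size=1688\n704 1 1 1 4\n>entry2.1\n"

# Without re.DOTALL the dot cannot match '\n', so .+? cannot get past the
# header line and the search finds nothing.
without_flag = re.search(r'>entry1\.1.+?(?=>)', record)

# With re.DOTALL the dot also matches '\n', so .+? runs on until just
# before the next '>'.
with_flag = re.search(r'>entry1\.1.+?(?=>)', record, re.DOTALL)

print(without_flag)              # None
print(repr(with_flag.group()))   # the whole entry1.1 block
```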
4) I want the matching to stop at the end of the desired entry, not inside the next one, so that the group defined by the parentheses in (>entry *%s(?![^\n]+?\d).+?) catches exactly what we want. Hence, I put at the end a positive look-ahead assertion (?=>|(?:\s*\Z)), which says that the character before which the running ungreedy .+? must stop matching is either > (the beginning of the next entry) or the end of the string \Z.

As it is possible that the end of the last entry is not exactly the end of the entire string, I put \s*, which means "possible whitespaces before the very end". So \s*\Z means "there can be whitespaces before bumping into the end of the string". Whitespaces are a blank, \f, \n, \r, \t, \v.
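To connect this back to the loop in the question, here is a minimal sketch that applies the pattern to a list of references instead of asking for input. The function name extract_records and the shortened sample text are assumptions for the example; writing the records to a result file is left out.

```python
import re

PATBASE = r'(>entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'

def extract_records(text, refs):
    """Map each reference in `refs` to the list of records found for it."""
    return {ref: re.findall(PATBASE % re.escape(ref), text, re.DOTALL)
            for ref in refs}

text = (
    ">entry1.1\n#size=1688\n704 1 1 1 4\n"
    ">entry3.2\n#size=1777\n  \n"   # last entry, followed by whitespace only
)

records = extract_records(text, ["1.1", "3.2", "9"])
for ref, found in records.items():
    print(ref, found)
```

In this sample the record for 1.1 stops right before the next '>', the record for 3.2 stops before the trailing whitespace because of \s*\Z, and the unknown reference 9 gives an empty list.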