Hee: python - Extract multiple line data between two symbols

Friday, 15 March 2013

python - Extract multiple line data between two symbols - Regex and Python3 -

I have a huge file from which I need the information for specific entries. The file structure is:

    >entry1.1
    #size=1688
    704 1   1   1   4
    979 2   2   2   0
    1220    1   1   1   4
    1309    1   1   1   4
    1316    1   1   1   4
    1372    1   1   1   4
    1374    1   1   1   4
    1576    1   1   1   4
    >entry2.1
    #size=6251
    6110    3   1.5 0   2
    6129    2   2   2   2
    6136    1   1   1   4
    6142    3   3   3   2
    6143    4   4   4   1
    6150    1   1   1   4
    6152    1   1   1   4
    >entry3.2
    #size=1777
    and so on...

What I have to accomplish is to extract all the lines (the complete record) for certain entries. For example, I need the record for entry1.1, so I can use the name of the entry, '>entry1.1', and the next '>' as markers in a regex to extract the lines in between. But I do not know how to build such complex regex expressions. Once I have such an expression, I will put it in a loop:

    for entry in entrylist:
        get the record from big_file
        process it
        write it to the result file

What would be the regex to perform such an extraction of the record for specific entries? Is there a more Pythonic way to accomplish this? I would appreciate your help on this.

With regex:

    import re

    ss = '''
    >entry1.1
    #size=1688
    704 1   1   1   4
    979 2   2   2   0
    1220    1   1   1   4
    1309    1   1   1   4
    1316    1   1   1   4
    1372    1   1   1   4
    1374    1   1   1   4
    1576    1   1   1   4
    >entry2.1
    #size=6251
    6110    3   1.5 0   2
    6129    2   2   2   2
    6136    1   1   1   4
    6142    3   3   3   2
    6143    4   4   4   1
    6150    1   1   1   4
    6152    1   1   1   4
    >entry3.2
    #size=1777
    and so on-----------
    '''

    # %s is replaced by the reference typed by the user (1.1, 2.1, ...)
    patbase = r'(>entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'

    while True:
        x = input('What entry do you want ? : ')
        # re.DOTALL lets the dot match newlines, so .+? can span the whole record
        found = re.findall(patbase % x, ss, re.DOTALL)
        if found:
            print('found ==', found)
            for each_entry in found:
                print('\n%s\n' % each_entry)
        else:
            print('\n ** There is no such entry **\n')

Explanation of '(>entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))':

%s receives the reference of the entry: 1.1, 2, 2.1, etc.

The portion (?![^\n]+?\d) performs a verification.

(?![^\n]+?\d) is a negative look-ahead assertion. It says that what comes after %s must not be [^\n]+?\d, that is to say, any characters [^\n]+? followed by a digit \d.

I write [^\n] to mean "any character except a newline \n". I am obliged to write this instead of simply .+? because I set the flag re.DOTALL, and with that flag the pattern portion .+? would keep going until the end of the entry. However, I only want to verify that after the entered reference (represented by %s in the pattern) there are no supplementary digits before the end of the line, which would mean the reference was entered by error.

All this is because if there is an entry2.1 but no entry2, and the user enters only 2 because he wants entry2 and no other, the regex would otherwise detect the presence of entry2.1 and yield it, though the user really wanted entry2.
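The check described above can be sketched as follows. This is a minimal Python 3 illustration under stated assumptions, not the original code: the names `sample` and `find_entry` are made up for the example, and the sample data is a shortened, hypothetical version of the file.

```python
import re

# Hypothetical shortened data: it contains entry2.1 but no entry2.
sample = """>entry1.1
#size=1688
704 1 1 1 4
>entry2.1
#size=6251
6110 3 1.5 0 2
"""

# Same pattern as in the answer, written as a raw string for Python 3.
PATBASE = r'(>entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'

def find_entry(ref, text):
    """Return every record whose header matches the reference `ref`."""
    return re.findall(PATBASE % ref, text, re.DOTALL)

# The header line of entry2.1 still has a digit after "2",
# so the negative look-ahead rejects it.
print(find_entry('2', sample))
# The full reference matches, and the whole record is returned.
print(find_entry('2.1', sample))
```

With this sample, asking for 2 returns an empty list, while asking for 2.1 returns the entry2.1 record.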

At the end of '(>entry *%s(?![^\n]+?\d).+?)', the part .+? catches the complete block of the entry, because the dot represents any character, including a newline \n. It is for this aim that I set the flag re.DOTALL: it makes the pattern portion .+? capable of passing over newlines until the end of the entry.
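To see the effect of the flag in isolation, here is a small sketch (Python 3; the variable names and the simplified pattern `>entry.+` are only for illustration) that matches the same text with and without re.DOTALL:

```python
import re

text = ">entry1.1\n#size=1688\n704 1 1 1 4\n"

# Without re.DOTALL the dot does not match "\n", so the match ends
# at the end of the header line.
without_flag = re.match(r'>entry.+', text).group()

# With re.DOTALL the dot also matches "\n", so the match runs on
# through the following lines.
with_flag = re.match(r'>entry.+', text, re.DOTALL).group()

print(repr(without_flag))
print(repr(with_flag))
```

The first match covers only the header line; the second covers the whole text.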

I want the matching to stop at the end of the desired entry, not inside the next one, so that the group defined by the parentheses in (>entry *%s(?![^\n]+?\d).+?) catches exactly what we want. Hence, I put at the end a positive look-ahead assertion (?=>|(?:\s*\Z)), which says that the character before which the running ungreedy .+? must stop matching is either > (the beginning of the next entry) or the end of the string \Z. As it is possible that the end of the last entry is not exactly the end of the entire string, I put \s*, which means "possible whitespace before the very end". So \s*\Z means "there can be whitespace before bumping into the end of the string". Whitespace characters are the blank, \f, \n, \r, \t and \v.
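Both stopping conditions can be observed in a short sketch (Python 3). The data is hypothetical: the row under entry3.2 is invented, and extra blank lines are appended on purpose so that the last entry does not end exactly at the end of the string.

```python
import re

PATBASE = r'(>entry *%s(?![^\n]+?\d).+?)(?=>|(?:\s*\Z))'

# Hypothetical data with trailing whitespace after the last entry.
text = (">entry1.1\n#size=1688\n704 1 1 1 4\n"
        ">entry3.2\n#size=1777\n12 1 1 1 4\n\n  \n")

# Stops right before the '>' that opens the next entry.
first = re.search(PATBASE % '1.1', text, re.DOTALL).group(1)

# Stops before the trailing whitespace, thanks to \s*\Z.
last = re.search(PATBASE % '3.2', text, re.DOTALL).group(1)

print(repr(first))
print(repr(last))
```

The first record keeps its final newline because the match only stops at the following >, while the last record ends at its last data character and leaves the trailing whitespace out.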

python regex
