java - Regex for almost JSON but not quite
Hello, I'm trying to parse a pretty well-formed string into its component pieces. The string is very JSON-like, but it's not JSON strictly speaking. The strings are formed like so:
createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='test test', source="region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
The output should be just the chunks of text, with nothing more special having to be done at this point:
createdAt=Fri Aug 24 09:48:51 EDT 2012
id=238996293417062401
text='test test'
source="region"
entities=[foo, bar]
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
Using the following lookahead, I am able to get almost all of the fields separated out:
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))
which splits on all the commas not in quotes of either type, but I can't seem to make the leap to a version that also avoids splitting on commas inside brackets or braces.
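For illustration, here is a minimal, self-contained Java sketch of that split applied to an abbreviated input (the class name and the shortened input are just for illustration). It shows the comma inside quotes being protected while the comma inside the brackets still splits:

import java.util.regex.Pattern;

public class SplitDemo {
    public static void main(String[] args) {
        // the lookahead from above, written as a Java string literal
        String regex = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))"
                     + "(?=(?:[^']*'[^']*')*(?![^']*'))";
        String input = "text='test, test', source=\"region\", entities=[foo, bar]";
        for (String field : Pattern.compile(regex).split(input)) {
            System.out.println(field.trim());
        }
        // prints:
        //   text='test, test'   <- comma inside quotes is protected
        //   source="region"
        //   entities=[foo       <- but the comma inside brackets still splits
        //   bar]
    }
}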
Because you want to handle nested parens/brackets, the "right" way to handle them is to tokenize them separately and keep track of your nesting level. So instead of a single regex, you really need multiple regexes for your different token types.
This is Python, but converting it to Java shouldn't be too hard (a rough Java sketch follows after the list of shortcuts below).
import re

# comma
sep_re = re.compile(r',')
# open paren, bracket, or brace
# (braces added so the {...} blocks in the question's input nest too)
inc_re = re.compile(r'[[({]')
# close paren, bracket, or brace
dec_re = re.compile(r'[)\]}]')
# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')

# This class could've been a generator function, but I couldn't
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
    def __init__(self):
        self.pos = 0

    def _match(self, regex, s):
        m = regex.match(s, self.pos)
        if m:
            self.pos += len(m.group(0))
            self.token = m.group(0)
        else:
            self.token = ''
        return self.token

    def tokenize(self, s):
        field = ''  # the field we're currently working on
        depth = 0   # how many parens/brackets/braces deep we are
        while self.pos < len(s):
            if not depth and self._match(sep_re, s):
                # in Java, change the "yields" to appends to a list, and
                # you'll have something equivalent (but non-lazy)
                yield field
                field = ''
            else:
                if self._match(inc_re, s):
                    depth += 1
                elif self._match(dec_re, s):
                    depth -= 1
                elif self._match(chunk_re, s):
                    pass
                else:
                    # everything else consumes one character at a time
                    self.token = s[self.pos]
                    self.pos += 1
                field += self.token
        yield field
Usage:

>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']
This implementation takes a few shortcuts:

- The string escapes are lazy: it only supports \" in double-quoted strings and \' in single-quoted strings. That is easy to fix.
- It only keeps track of the nesting level. It does not verify that a paren is matched with a paren (rather than a bracket). If you care about that, you can change depth into a sort of stack and push/pop the parens/brackets onto it.
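And since the question is tagged Java, here is a rough, untested sketch of what one translation could look like. It collects the fields into a list instead of yielding them, as suggested in the comments above; the class and method names are my own, and it includes the same brace handling added to the Python version:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer {
    // comma
    private static final Pattern SEP = Pattern.compile(",");
    // open paren, bracket, or brace
    private static final Pattern INC = Pattern.compile("[\\[({]");
    // close paren, bracket, or brace
    private static final Pattern DEC = Pattern.compile("[\\])}]");
    // string literal, with the same lazy escaping as the Python version
    private static final Pattern CHUNK =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\\")*\"|'(?:[^'\\\\]|\\\\')*'");

    private int pos = 0;
    private String token = "";

    // Try to match the regex at the current position; advance past it on success.
    private String match(Pattern regex, String s) {
        Matcher m = regex.matcher(s);
        m.region(pos, s.length());
        if (m.lookingAt()) {
            token = m.group();
            pos += token.length();
        } else {
            token = "";
        }
        return token;
    }

    // The Python "yields" become appends to a list: the non-lazy equivalent.
    public List<String> tokenize(String s) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder(); // the field we're working on
        int depth = 0;                             // how many levels deep we are
        while (pos < s.length()) {
            if (depth == 0 && !match(SEP, s).isEmpty()) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                if (!match(INC, s).isEmpty()) {
                    depth++;
                } else if (!match(DEC, s).isEmpty()) {
                    depth--;
                } else if (!match(CHUNK, s).isEmpty()) {
                    // a whole string literal was consumed as one token
                } else {
                    // everything else consumes one character at a time
                    token = String.valueOf(s.charAt(pos));
                    pos++;
                }
                field.append(token);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(new Tokenizer()
                .tokenize("foo=(3,(5+7),8),bar=\"hello,world\",baz"));
        // [foo=(3,(5+7),8), bar="hello,world", baz]
    }
}

Like the Python version, a Tokenizer instance is single-use, since pos is never reset between calls.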