Friday, 15 July 2011

C# - Replace with Regex


In our application, users enter text from MS Word into an ASP.NET textarea control, and the text is saved in SQL Server. For some reason, there are a few junk characters that look like little squares when viewed in SQL Server Management Studio.

This is causing an error while generating Crystal Reports.

I need a regex to strip such characters, along with bullets. The valid input is:

a-z, A-Z, 0-9, ~ ! @ # % $ ^ & * ( ) _ + | ` - = \ {}:">? < [ ] ; ' , . /

Also, tabs should be replaced with a single space. The Enter key (a new line) is allowed.

Currently I am using

    Regex.Replace(data, @"[^\u0000-\u007F]", " ");

but this won't remove bullets or tabs.

Can a regex ninja help me with this problem? Thanks in advance.

You can use two regexes. The first uses the pattern "\t|<bullet>" (where <bullet> stands for the representation of the bullet) to replace tabs and bullets with spaces (" "). The second uses a negated character class containing the list of valid characters to replace invalid characters with the empty string (""), that is, to get rid of them. Since you need to keep the CR and LF characters (and the space), these must be added to the set of valid characters:

    using System;
    using System.Text.RegularExpressions;

    static class Program
    {
        public static void Main()
        {
            // Step 1: tabs become a single space.
            string pattern1 = @"\t";
            Regex regex1 = new Regex(pattern1, RegexOptions.Compiled);

            // Step 2: remove everything outside the whitelist from the question
            // (plus space, CR and LF). '@' and '%' are included because the
            // question lists them as valid.
            string pattern2 = @"[^a-zA-Z0-9~!@#$%^&*()_+|`\-=\\{}:"">?<\[\];',./ \r\n]";
            Regex regex2 = new Regex(pattern2, RegexOptions.Compiled);

            string input = "abzABZ09~!#$^&*()_+|`-=\\{}:\">?<[];',./ \r\nárvíztűrő\ttükörfúrógép";
            string temp = regex1.Replace(input, " ");
            string output = regex2.Replace(temp, "");

            Console.WriteLine(input);
            Console.WriteLine(output);
            Console.ReadKey(true);
        }
    }

Output:

    abzABZ09~!#$^&*()_+|`-=\{}:">?<[];',./
    árvíztűrő tükörfúrógép
    abzABZ09~!#$^&*()_+|`-=\{}:">?<[];',./
    rvztr tkrfrgp

Note that the tab after árvíztűrő has been replaced with a single space.
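The two steps can also be wrapped in a small reusable helper. The following is only a minimal sketch, not part of the original answer: the names `TextCleaner` and `Clean` are hypothetical, and the whitelist is assumed to be the one listed in the question plus space, CR and LF.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names TextCleaner and Clean are illustrative only.
public static class TextCleaner
{
    // Step 1: every tab becomes a single space.
    private static readonly Regex Tabs =
        new Regex(@"\t", RegexOptions.Compiled);

    // Step 2: anything outside the whitelist (plus space, CR, LF) is dropped.
    private static readonly Regex Invalid = new Regex(
        @"[^a-zA-Z0-9~!@#$%^&*()_+|`\-=\\{}:"">?<\[\];',./ \r\n]",
        RegexOptions.Compiled);

    public static string Clean(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        string withSpaces = Tabs.Replace(input, " ");
        return Invalid.Replace(withSpaces, "");
    }
}
```

Keeping the compiled `Regex` objects in static fields means they are built once rather than on every call, which may matter if the method runs for every saved textarea value.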

About bullets:

I made a bulleted list in Word and copied it into a textarea on a web page. I saved the HTML and found that the bullets were saved as the UTF-8-encoded character E2 80 A2 (U+2022, BULLET). This is what I called the "representation of the bullet" above. You should figure out the binary representation of every possible bullet character and add them to the first pattern: either OR them with the tab character (alternation), or put them together with it in a character class:

    using System;
    using System.Text;
    using System.Text.RegularExpressions;

    static class Program
    {
        public static void Main()
        {
            // The UTF-8 bytes E2 80 A2 decode to U+2022 (BULLET).
            byte[] bulletBytes = new byte[] { 0xE2, 0x80, 0xA2 };
            string bullet = Encoding.UTF8.GetString(bulletBytes);

            // Step 1: tabs and bullets become a single space.
            string pattern1 = @"[\t" + bullet + "]";
            Regex regex1 = new Regex(pattern1, RegexOptions.Compiled);

            // Step 2: remove everything outside the whitelist from the question
            // (plus space, CR and LF). '@' and '%' are included because the
            // question lists them as valid.
            string pattern2 = @"[^a-zA-Z0-9~!@#$%^&*()_+|`\-=\\{}:"">?<\[\];',./ \r\n]";
            Regex regex2 = new Regex(pattern2, RegexOptions.Compiled);

            string input = bullet + "abzABZ09~!#$^&*()_+|`-=\\{}:\">?<[];',./ \r\n" + bullet + "árvíztűrő\ttükörfúrógép";
            string temp = regex1.Replace(input, " ");
            string output = regex2.Replace(temp, "");

            // Needed so that the console can print the bullet.
            Console.OutputEncoding = Encoding.UTF8;
            Console.WriteLine(input);
            Console.WriteLine(output);
            Console.ReadKey(true);
        }
    }

Output (you should change the console font to Lucida Console to see the bullet):

    •abzABZ09~!#$^&*()_+|`-=\{}:">?<[];',./
    •árvíztűrő tükörfúrógép
     abzABZ09~!#$^&*()_+|`-=\{}:">?<[];',./
     rvztr tkrfrgp

Now, in addition to the tab, the bullet at the beginning of each line has been replaced with a space.
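To find out which characters actually need handling, it may help to list the code points of everything in the stored text that falls outside printable ASCII. The sketch below is an illustration under assumptions, not part of the original answer: `CharInspector`, `DescribeSuspects` and `ReplaceTabsAndBullets` are hypothetical names, and U+2022 is simply the bullet found in the Word test described here; other bullets may well use different code points.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical diagnostic helper; all names here are illustrative only.
public static class CharInspector
{
    // Returns "U+XXXX" for each distinct UTF-16 code unit that is neither
    // printable ASCII (0x20-0x7E) nor CR/LF, in order of first appearance.
    public static List<string> DescribeSuspects(string text)
    {
        var seen = new HashSet<char>();
        var result = new List<string>();

        foreach (char c in text)
        {
            bool printableAscii = c >= 0x20 && c <= 0x7E;
            if (printableAscii || c == '\r' || c == '\n') continue;

            // HashSet.Add returns false for characters already reported.
            if (seen.Add(c)) result.Add("U+" + ((int)c).ToString("X4"));
        }

        return result;
    }

    // Once a code point is known, it can be written as a \uXXXX escape
    // inside the character class instead of decoding raw bytes.
    public static string ReplaceTabsAndBullets(string text)
    {
        return Regex.Replace(text, @"[\t\u2022]", " ");
    }
}
```

Running `DescribeSuspects` over a sample of the saved data should show which code points to add to the first pattern; each one can then be appended to the character class in `ReplaceTabsAndBullets` as another `\uXXXX` escape.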

c# regex
