Wednesday, 15 August 2012

encoding - In Perl, How do I replace UTF8 characters, such as \x91, \x{2018}, \x{2013}, \x{2014} with simple ASCII chars?




I'm working with various articles, and the problem I'm having is that the various authors use different characters for punctuation.

For example, several documents I'm working with have characters such as:

\x91 \x92 \x{2018} \x{2019}

All of these characters represent a simple quote (').

What I want to do is simplify the articles so that they all have the same formatting style.

Does anyone know of a module, or a method, for converting these characters and similar ones (like double quotes, dashes, etc.) to simple ASCII characters?

I'm doing things like:

    sub fix_chars_in_document {
        my $document = shift;

        $document =~ s/\xa0/ /g;
        $document =~ s/\x91/'/g;
        $document =~ s/\x92/'/g;
        $document =~ s/\x93/"/g;
        $document =~ s/\x94/"/g;
        $document =~ s/\x97/-/g;
        $document =~ s/\xab/"/g;
        $document =~ s/\xa9//g;
        $document =~ s/\xae//g;

        $document =~ s/\x{2018}/'/g;
        $document =~ s/\x{2019}/'/g;
        $document =~ s/\x{201c}/"/g;
        $document =~ s/\x{201d}/"/g;
        $document =~ s/\x{2022}//g;
        $document =~ s/\x{2013}/-/g;
        $document =~ s/\x{2014}/-/g;
        $document =~ s/\x{2122}//g;

        return $document;
    }

But this is hard, because I have to find the characters manually and replace them one by one.

First, your solution would benefit from a hash.

    # Lookup table: character to replace => ASCII replacement.
    # An empty string deletes the character, as in the question's code.
    my %asciify = (
        chr(0x00a0) => ' ',
        chr(0x0091) => "'",
        chr(0x0092) => "'",
        chr(0x0093) => '"',
        chr(0x0094) => '"',
        chr(0x0097) => '-',
        chr(0x00ab) => '"',
        chr(0x00a9) => '',
        chr(0x00ae) => '',

        chr(0x2018) => "'",
        chr(0x2019) => "'",
        chr(0x201c) => '"',
        chr(0x201d) => '"',
        chr(0x2022) => '',
        chr(0x2013) => '-',
        chr(0x2014) => '-',
        chr(0x2122) => '',
    );

    # Build one character class that matches any key of the table.
    my $pat = join '', map quotemeta, keys %asciify;
    my $re  = qr/[$pat]/;

    sub fix_chars {
        my ($s) = @_;
        $s =~ s/($re)/$asciify{$1}/g;
        return $s;
    }
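
As a usage sketch of the hash-based approach: the table is keyed on characters (code points), so it can only match if the text has already been decoded from its byte encoding. The example below assumes the input arrives as raw UTF-8 bytes and decodes it with the core Encode module first. The shortened table and the sample string are illustrative only, not part of the original answer.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Abbreviated, illustrative copy of the lookup table; extend as needed.
my %asciify = (
    chr(0x2018) => "'",
    chr(0x2019) => "'",
    chr(0x201C) => '"',
    chr(0x201D) => '"',
    chr(0x2013) => '-',
    chr(0x2014) => '-',
);

# One character class matching any key of the table.
my $pat = join '', map quotemeta, keys %asciify;
my $re  = qr/[$pat]/;

sub fix_chars {
    my ($s) = @_;
    $s =~ s/($re)/$asciify{$1}/g;
    return $s;
}

# Raw UTF-8 bytes, as they might come from a file read without a decoding layer
# (assumption: the input encoding is UTF-8).
my $bytes = "\xE2\x80\x9Cit\xE2\x80\x99s\xE2\x80\x9D \xE2\x80\x94 ok";

# Decode bytes to characters so that chr(0x2019) etc. can match.
my $text = decode('UTF-8', $bytes);

print fix_chars($text), "\n";    # prints: "it's" - ok
```

If the file is opened with an encoding layer such as '<:encoding(UTF-8)', the explicit decode step is not needed, because the strings read from the handle are already decoded.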

That said, you want Text::Unidecode.

To convert just the punctuation characters:

    use Text::Unidecode qw( unidecode );

    s/(\p{Punct}+)/ unidecode($1) /eg;
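
A minimal sketch contrasting the two ways of calling Text::Unidecode, assuming the module has been installed from CPAN (it is not part of core Perl). The sample string and the variable names are illustrative. The first call transliterates the whole string, accented letters included; the second touches only runs of Unicode punctuation and leaves the letters alone.

```perl
use strict;
use warnings;
use Text::Unidecode qw( unidecode );

# Encode output as UTF-8, since one result below still contains non-ASCII letters.
binmode STDOUT, ':encoding(UTF-8)';

# Illustrative sample: curly double quotes, an en dash, and accented letters.
my $text = "\x{201C}caf\x{E9}\x{201D} \x{2013} na\x{EF}ve";

# Transliterate everything to ASCII, including the accented letters.
my $all_ascii = unidecode($text);

# Transliterate only runs of punctuation; the accented letters are kept.
(my $punct_only = $text) =~ s/(\p{Punct}+)/unidecode($1)/eg;

print "whole string:     $all_ascii\n";
print "punctuation only: $punct_only\n";
```

Which variant fits depends on whether accented letters in the articles should survive. For normalizing punctuation alone, the second form is the more conservative choice.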

perl encoding character-encoding ascii non-ascii-characters
