encoding - In Perl, how do I replace UTF-8 characters such as \x91, \x{2018}, \x{2013}, \x{2014} with simple ASCII chars?
I'm working with various articles, and the problem I'm having is that the various authors use different characters for punctuation.
For example, several documents I'm working with have characters such as:
\x91 \x92 \x{2018} \x{2019}
and these characters all represent a simple quote (').
What I want is to simplify the articles so that they all have the same formatting style.
Does anyone know of a module, or a method, for converting these characters and similar ones (double quotes, dashes, etc.) to simple ASCII characters?
I'm currently doing things like:
    sub fix_chars_in_document {
        my $document = shift;
        $document =~ s/\xa0/ /g;
        $document =~ s/\x91/'/g;
        $document =~ s/\x92/'/g;
        $document =~ s/\x93/"/g;
        $document =~ s/\x94/"/g;
        $document =~ s/\x97/-/g;
        $document =~ s/\xab/"/g;
        $document =~ s/\xa9//g;
        $document =~ s/\xae//g;
        $document =~ s/\x{2018}/'/g;
        $document =~ s/\x{2019}/'/g;
        $document =~ s/\x{201c}/"/g;
        $document =~ s/\x{201d}/"/g;
        $document =~ s/\x{2022}//g;
        $document =~ s/\x{2013}/-/g;
        $document =~ s/\x{2014}/-/g;
        $document =~ s/\x{2122}//g;
        return $document;
    }
But this is hard, because I have to find the characters manually and replace them one by one.
First, your solution would benefit from a hash.
    # Map each character to its ASCII replacement.
    # An empty string deletes the character, as in the substitutions above.
    my %asciify = (
        chr(0x00a0) => ' ',
        chr(0x0091) => "'",
        chr(0x0092) => "'",
        chr(0x0093) => '"',
        chr(0x0094) => '"',
        chr(0x0097) => '-',
        chr(0x00ab) => '"',
        chr(0x00a9) => '',
        chr(0x00ae) => '',
        chr(0x2018) => "'",
        chr(0x2019) => "'",
        chr(0x201c) => '"',
        chr(0x201d) => '"',
        chr(0x2022) => '',
        chr(0x2013) => '-',
        chr(0x2014) => '-',
        chr(0x2122) => '',
    );

    # Build one character class from the hash keys.
    my $pat = join '', map quotemeta, keys %asciify;
    my $re  = qr/[$pat]/;

    sub fix_chars {
        my ($s) = @_;
        $s =~ s/($re)/$asciify{$1}/g;
        return $s;
    }
That said, what you probably want is Text::Unidecode.
To convert just the punctuation characters:
    use Text::Unidecode qw( unidecode );
    s/(\p{Punct}+)/ unidecode($1) /eg;
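Since the question mentions having to hunt for the offending characters by hand, a small tally of the non-ASCII characters in a document may help decide what still needs a mapping. The following is only a sketch using core modules: the helper name `non_ascii_report` and the sample string are made up for illustration, and it assumes the text has already been decoded into Perl characters.

```perl
use strict;
use warnings;
use charnames ();   # core module; viacode() maps a code point to its Unicode name

# Hypothetical helper: count every non-ASCII character in a decoded string.
sub non_ascii_report {
    my ($text) = @_;
    my %count;
    $count{$1}++ while $text =~ /([^\x00-\x7F])/g;
    return \%count;
}

# Illustrative sample: curly single quotes and an en dash.
my $sample = "\x{2018}quoted\x{2019} \x{2013} and \x{2018}again\x{2019}";
my $report = non_ascii_report($sample);

# Print code point, Unicode name, and number of occurrences.
for my $char (sort keys %$report) {
    printf "U+%04X %-30s x%d\n",
        ord($char),
        charnames::viacode(ord $char) // 'unnamed',
        $report->{$char};
}
```

Running something like this over a batch of articles gives a list of code points to add to the hash, or to check against what Text::Unidecode produces.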
perl encoding character-encoding ascii non-ascii-characters