Wednesday, 15 September 2010

UTF-8 - Scala - Converting from ISO-8859-1 to UTF-8 gives foreign character strangeness




Here's my problem: I have an InputStream that I've converted to a byte array, but I don't know the character set of the InputStream at runtime. My original thought was to do everything in UTF-8, but I see strange issues with streams that are encoded as ISO-8859-1 and contain foreign characters. (Those crazy Swedes.)

Here's the code in question:

IOUtils.toString(inputstream, "UTF-8") // fails on ISO-8859-1 foreign characters

To simulate this, I have:

new String("\u00F6")
// returns ö as expected, since the default encoding is UTF-8

new String("\u00F6".getBytes("UTF-8"), "UTF-8")
// returns ö as expected

new String("\u00F6".getBytes("ISO-8859-1"), "UTF-8")
// returns \ufffd, the replacement (unknown) character

What am I missing?

You should have a source of information that tells you the encoding. If that cannot happen, you either need to reject the input or guess the encoding when it's not UTF-8.
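To see why the encoding has to be known, it helps to look at the bytes behind the simulation in the question. The sketch below is only an illustration: the class name `EncodedBytesDemo` and the `hex` helper are made up for this example, and it assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.StandardCharsets;

class EncodedBytesDemo {
    // Hypothetical helper: formats bytes as space-separated uppercase hex
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00F6"; // ö
        // One byte in ISO-8859-1, two bytes in UTF-8
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1))); // F6
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));      // C3 B6
        // A lone 0xF6 is not a valid UTF-8 sequence, so the lenient String
        // constructor substitutes U+FFFD instead of failing
        String wrong = new String(new byte[] {(byte) 0xF6}, StandardCharsets.UTF_8);
        System.out.println(wrong.equals("\uFFFD")); // true
    }
}
```

The same character is a different byte sequence in each encoding, so decoding with the wrong charset either fails or silently produces the wrong text.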

For western languages, guessing ISO-8859-1 if it's not UTF-8 is going to work most of the time:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import org.apache.commons.io.IOUtils;

ByteBuffer bytes = ByteBuffer.wrap(IOUtils.toByteArray(inputstream));
CharBuffer chars;

try {
    // A new decoder reports malformed input by throwing instead of replacing it
    chars = Charset.forName("UTF-8").newDecoder().decode(bytes);
} catch (CharacterCodingException e) {
    // Also covers MalformedInputException and UnmappableCharacterException.
    // The failed attempt may have advanced the buffer, so start over.
    bytes.rewind();
    chars = Charset.forName("ISO-8859-1").decode(bytes);
}
System.out.println(chars.toString());

All the boilerplate is for getting encoding exceptions and being able to read the same data multiple times.
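If the data is already in a byte array, the same idea can be wrapped in a small helper so that each attempt starts from the beginning of the data. This is a minimal sketch using only the JDK (Java 7 or later assumed); `FallbackDecoder` and `decode` are hypothetical names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class FallbackDecoder {
    // Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback guess
    static String decode(byte[] data) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Every byte value maps to a character in ISO-8859-1,
            // so this second attempt cannot fail
            return new String(data, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // ö encoded as ISO-8859-1 (F6) and as UTF-8 (C3 B6)
        System.out.println(decode(new byte[] {(byte) 0xF6}));
        System.out.println(decode(new byte[] {(byte) 0xC3, (byte) 0xB6}));
    }
}
```

Both calls in `main` should yield the single character ö, because the first input is rejected by the strict UTF-8 decoder and falls through to ISO-8859-1.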

You can also use Mozilla chardet, which uses more sophisticated heuristics to determine the encoding if it's not UTF-8. It's not perfect, though; for instance, I recall it detecting Finnish text in Windows-1252 as Hebrew Windows-1255.

Also note that arbitrary binary data is valid in ISO-8859-1, which is why you detect UTF-8 first (it is extremely likely that if the data passes as UTF-8 without exceptions, it is UTF-8), and which is why you cannot try to detect anything else after ISO-8859-1.
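The claim that ISO-8859-1 accepts anything can be checked directly. The sketch below (hypothetical class and method names, Java 7 or later assumed) decodes all 256 possible byte values as ISO-8859-1 and confirms that re-encoding gives back the same bytes, so a decode in that charset never signals an error.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1AcceptsEverything {
    // True if decoding as ISO-8859-1 and re-encoding returns the same bytes
    static boolean roundTrips(byte[] data) {
        String decoded = new String(data, StandardCharsets.ISO_8859_1);
        return Arrays.equals(data, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Builds an array containing every byte value from 0x00 to 0xFF once
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(allByteValues())); // true
    }
}
```

Because nothing is ever rejected, a successful ISO-8859-1 decode tells you nothing about the real encoding, which is why it only makes sense as the last guess.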

scala utf-8 character-encoding iso-8859-1
