Scala - Converting from ISO-8859-1 to UTF-8 gives foreign character strangeness
Here's my problem: I have an InputStream that I've converted to a byte array, but I don't know the character set of the InputStream at runtime. My original thought was to do everything in UTF-8, but I see unusual issues with streams that are encoded as ISO-8859-1 and have foreign characters. (Those crazy Swedes.)
Here's the code in question:

    IOUtils.toString(inputstream, "utf-8")
    // fails on ISO-8859-1 foreign characters
To simulate this, I have:

    new String("\u00f6")
    // returns ö as expected, since the default encoding is UTF-8

    new String("\u00f6".getBytes("utf-8"), "utf-8")
    // returns ö as expected

    new String("\u00f6".getBytes("iso-8859-1"), "utf-8")
    // returns \uffff, an unknown character
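To make the simulation concrete, here is a minimal Java sketch of the bytes involved. The class and method names (`EncodingBytesDemo`, `hex`, `misdecode`) are made up for illustration. It shows that ö is the single byte F6 in ISO-8859-1 but the two bytes C3 B6 in UTF-8, so a lone F6 is not a valid UTF-8 sequence. On a standard JVM the String constructor should substitute the replacement character U+FFFD for it rather than throw.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytesDemo {
    // Formats a byte array as space-separated uppercase hex, e.g. "C3 B6".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text as ISO-8859-1, then (wrongly) decodes those bytes as UTF-8.
    static String misdecode(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(hex("\u00f6".getBytes(StandardCharsets.UTF_8)));      // C3 B6
        System.out.println(hex("\u00f6".getBytes(StandardCharsets.ISO_8859_1))); // F6
        // The lone F6 byte becomes the replacement character U+FFFD.
        System.out.println(misdecode("\u00f6").charAt(0) == '\uFFFD');           // true
    }
}
```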
What am I missing?
You should have the source of the data tell you the encoding. If that cannot happen, you either need to reject the data or guess the encoding when it's not UTF-8.
For Western languages, guessing ISO-8859-1 when it's not UTF-8 is going to work most of the time:
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import org.apache.commons.io.IOUtils;

    ByteBuffer bytes = ByteBuffer.wrap(IOUtils.toByteArray(inputstream));
    CharBuffer chars;

    try {
        // A decoder from newDecoder() throws on bad input instead of replacing it
        chars = Charset.forName("UTF-8").newDecoder().decode(bytes);
    } catch (CharacterCodingException e) {
        // Also covers MalformedInputException and UnmappableCharacterException
        bytes.rewind(); // the failed attempt may have advanced the buffer position
        chars = Charset.forName("ISO-8859-1").newDecoder().decode(bytes);
    }
    System.out.println(chars.toString());
All this boilerplate is for getting encoding exceptions and being able to read the same data multiple times.
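As a rough illustration of why the decoder is needed to get exceptions at all, the sketch below contrasts the two behaviours. The class and method names (`StrictVersusLenient`, `lenient`, `strict`, `isValidUtf8`) are hypothetical. The String constructor silently substitutes U+FFFD for malformed bytes, while a decoder obtained from `newDecoder()` reports errors by default and so throws.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class StrictVersusLenient {
    // Lenient: malformed bytes are replaced with U+FFFD, nothing is thrown.
    static String lenient(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Strict: a fresh decoder reports malformed input, so bad bytes throw.
    static String strict(byte[] data) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .decode(ByteBuffer.wrap(data))
                .toString();
    }

    // True when the bytes decode as UTF-8 without any coding error.
    static boolean isValidUtf8(byte[] data) {
        try {
            strict(data);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "sm\u00f6rg\u00e5s".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "sm\u00f6rg\u00e5s".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));                  // true
        System.out.println(isValidUtf8(latin1));                // false
        System.out.println(lenient(latin1).indexOf('\uFFFD') >= 0); // true
    }
}
```

Wrapping the bytes in a fresh ByteBuffer for each attempt, as done here, is one way to sidestep the buffer-position problem when the same data has to be decoded more than once.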
You can also use Mozilla chardet, which uses more sophisticated heuristics to determine the encoding if it's not UTF-8. It's not perfect, though: for instance, I recall it detecting Finnish text in Windows-1252 as Hebrew Windows-1255.
Also note that arbitrary binary data is valid in ISO-8859-1. This is why you detect UTF-8 first (it is extremely likely that if the data passes UTF-8 decoding without exceptions, it is UTF-8) and why you cannot try to detect anything else after ISO-8859-1.
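The ordering argument can be checked with a small sketch; the class and method names below (`Latin1AcceptsEverything`, `decodesAsLatin1`, `allByteValues`, `utf8ReadAsLatin1`) are invented for this example. Even a strict ISO-8859-1 decoder accepts every possible byte value, so it can never signal "wrong encoding". UTF-8 data read as ISO-8859-1 simply comes out as mojibake, with no exception.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class Latin1AcceptsEverything {
    // True when a strict (error-reporting) ISO-8859-1 decoder accepts the bytes.
    static boolean decodesAsLatin1(byte[] data) {
        try {
            StandardCharsets.ISO_8859_1.newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Builds an array containing every byte value from 0x00 to 0xFF once.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // Encodes the text as UTF-8, then (wrongly) decodes it as ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodesAsLatin1(allByteValues())); // true
        // C3 B6 read as ISO-8859-1 is U+00C3 followed by U+00B6, not U+00F6.
        System.out.println(utf8ReadAsLatin1("\u00f6").equals("\u00c3\u00b6")); // true
    }
}
```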
scala utf-8 character-encoding iso-8859-1