java - HWPFDocument / XWPFDocument New Lines
I am trying to pull information out of Microsoft Word, translate it into an SQL statement, and insert it into an Oracle database.
When the information in MS Word contains a new line created with [Shift-Enter], not Enter, the text contains an icon that looks like a box with a question mark.
Where ET is a standard new line using the Enter key, and ST is a new line using the Shift-Enter combination. When I generate the SQL and insert it into Oracle, Oracle counts it not as text, but as hex.
My question is, how do I replace the line breaks created with [Shift-Enter] with the standard '\n'?
Thanks
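Update: how I get the text information:

    import java.io.FileInputStream;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(file));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String text = we.getText();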
Update with the answer: this was a bug in poi-3.6. In poi-3.8 the break shows up as \r.
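For the original question, one way to end up with plain '\n' everywhere is to normalize the extracted string before building the SQL. The sketch below is only an illustration: the class name LineBreaks is made up, and it assumes the Shift-Enter break arrives either as a vertical tab (0x0B, which is how Word's binary format commonly stores a manual line break) or as \r, as reported above for poi-3.8.

```java
class LineBreaks {
    // Assumption: a Shift-Enter (manual) line break reaches Java as a vertical tab, 0x0B.
    private static final char SOFT_BREAK = 0x0B;

    // Turns every kind of break into a single '\n'.
    static String normalize(String text) {
        return text
                .replace("\r\n", "\n")      // Windows-style pairs first, so they do not become two breaks
                .replace('\r', '\n')        // lone carriage returns
                .replace(SOFT_BREAK, '\n'); // Shift-Enter breaks
    }

    public static void main(String[] args) {
        String sample = "first line" + SOFT_BREAK + "second line\rthird line";
        System.out.println(normalize(sample));
    }
}
```

Calling LineBreaks.normalize(text) on the string returned by getText() should leave only '\n' breaks in the value that goes into the SQL statement.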
What you're seeing are "fields" in the Word document, special blocks of text such as links, macros, etc.
Option number 1 is to go on using WordExtractor, but call stripFields(String) on the resulting text before using it. That'll remove these fields from the text for you.
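To make the idea of stripping fields concrete, here is a rough sketch of what it involves. It is not POI's implementation. It assumes the field marks of the Word binary format (0x13 begin, 0x14 separator, 0x15 end) show up as raw characters in the extracted text, and the class name FieldStripper is made up for illustration. The field code (for example the HYPERLINK instruction) is dropped and the visible result is kept.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FieldStripper {
    // Field marks of the Word binary format (assumed to survive in the extracted text).
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    // Removes field codes and field marks, keeping only the visible field results.
    static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while we are still inside its code part.
        Deque<Boolean> openFields = new ArrayDeque<Boolean>();
        int fieldsInCodePart = 0;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                fieldsInCodePart++;
            } else if (c == FIELD_SEPARATOR) {
                // The innermost field switches from its code part to its result part.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    fieldsInCodePart--;
                }
            } else if (c == FIELD_END) {
                if (!openFields.isEmpty() && openFields.pop()) {
                    fieldsInCodePart--;
                }
            } else if (fieldsInCodePart == 0) {
                out.append(c); // ordinary text, or the visible result of a field
            }
        }
        return out.toString();
    }
}
```

In real code the stripFields call mentioned above is the better choice, since it is maintained alongside the extractor; the sketch is only meant to show why the raw text contains odd characters around links and similar items.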
The other alternative is to use a different way of getting the text out. WordToTextConverter is part of Apache POI, and is more complex code that handles more of the format and should skip these (WordExtractor is pretty simple and low level). The other is to use Apache Tika, which provides a common way of extracting text from a number of file formats. It has proper code to deal with fields, and as an added bonus it'll be trivial to support .docx or .pdf when your requirements change!
java apache-poi