
JTidy and UTF-8 (international characters)

Tuesday, December 15th, 2009

To make JTidy handle UTF-8 strings and international characters correctly, feed it a reader that decodes the input as UTF-8:

JAVA:

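  // Assumes an existing Tidy instance `tidy`, the input String `html`, and a logger `log`.
  Document doc = Tidy.createEmptyDocument();
  try {
      // Encode the string as UTF-8 explicitly: the one-argument IOUtils.toInputStream(html)
      // uses the platform default charset, which may not match the UTF-8 reader.
      doc = tidy.parseDOM(new InputStreamReader(IOUtils.toInputStream(html, "UTF-8"), "UTF-8"), new NullWriter());
  } catch (IOException e) {
      // Also covers UnsupportedEncodingException, which is a subclass of IOException.
      log.error(e);
  }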
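
The reader above decodes bytes as UTF-8, so the bytes handed to it need to have been encoded as UTF-8 too. Here is a minimal sketch of what happens when the two sides disagree. It uses only the JDK, not JTidy, and the class name `CharsetRoundTrip` and the helper `roundTrip` are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {

    // Hypothetical helper: encodes text to bytes with one charset, then
    // decodes those bytes through an InputStreamReader using another charset.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decodeWith)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Gr\u00fc\u00dfe, \u4f60\u597d</p>";

        // Same charset on both sides: the text survives unchanged.
        String matched = roundTrip(html, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(matched.equals(html));    // prints "true"

        // Encoded as ISO-8859-1 but decoded as UTF-8: the non-ASCII characters are lost.
        String mismatched = roundTrip(html, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(mismatched.equals(html)); // prints "false"
    }
}
```

The ASCII markup survives in both cases, which is why this kind of bug tends to show up only once international characters appear in the input.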