01 Dec 2006

UTF-7

Encoding problems

Character encoding is one of those things that needs careful attention all the time. My usual take is: “UTF-8 is the answer. What was the question?” UTF-8 has some handy properties, such as being compatible with ASCII, while being able to store any Unicode character. It’s described nicely in What Is UTF-8 And Why Is It Important? and more generally in the Java Internationalization book.

But other encodings are available, and I recently had to deal with UTF-7. This was a new one on me, but if you have to deal with low-level email encoding, you’ll find it’s surprisingly popular. The big thing about UTF-7 is that it only uses 7-bit ASCII characters, so content can be included in SMTP without having to wrap it in base64 or some other transfer encoding.
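To make that concrete, here is a rough sketch of the core trick in UTF-7: a run of non-ASCII characters is written as "+", then the unpadded base64 of its UTF-16BE bytes, then "-". The class and method names are made up for illustration, and this is deliberately not a full encoder (it ignores the rules for characters that can be sent directly and for a literal "+"). It assumes Java 8 or later for java.util.Base64.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative only: encodes a single run of characters as a UTF-7 shifted sequence.
class Utf7Sketch {
    static String shiftedRun(String run) {
        // UTF-7 base64-encodes the UTF-16 big-endian bytes, without "=" padding
        byte[] utf16 = run.getBytes(StandardCharsets.UTF_16BE);
        String base64 = Base64.getEncoder().withoutPadding().encodeToString(utf16);
        return "+" + base64 + "-";
    }

    public static void main(String[] args) {
        // U+263A WHITE SMILING FACE, written as an escape so this source stays ASCII
        System.out.println("Hi Mom -" + shiftedRun("\u263A") + "-!");
        // prints: Hi Mom -+Jjo--!
    }
}
```

The output matches the "Hi Mom" example given in RFC 2152, and every character in it is plain ASCII, which is the whole point.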

UTF-7 is supposed to be dead because, despite its plus points, we now have things like 8BITMIME. In practice, though, it’s not at all dead, and for the Java programmer this is a problem because the platform does not support UTF-7.

No panic: the platform has an extension mechanism in the CharsetProvider class. Go find an open source library and drop it in. There are a couple: Zimbra has a Mozilla-licensed implementation (thanks to Andy for spotting this), and there’s a GPL version called JCharSet that is also available under a commercial license. I went with the Zimbra one:

$ cvs -z6 -d :pserver:anonymous:@cvs.zimbra.com:/usr/local/cvsroot co main
$ cd main/ZimbraCharset
$ ant

… which produces build/zimbra-charset.jar, which looks like this:

[Image: JAR folder layout]

Drop the jar into your project, and you’re away with code like:

Charset utf7 = Charset.forName("UTF-7");
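For anyone wondering how a jar can teach Charset.forName a new name, here is a minimal sketch of the CharsetProvider extension point. It is not the Zimbra code: DemoCharsetProvider and the name "X-DEMO" are invented for illustration, and rather than implementing a real encoder and decoder it simply maps the made-up name onto the built-in US-ASCII charset. A provider is registered by listing its fully qualified class name in a file called META-INF/services/java.nio.charset.spi.CharsetProvider inside the jar.

```java
import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical provider: answers to the invented name "X-DEMO" with US-ASCII.
class DemoCharsetProvider extends CharsetProvider {
    static final String DEMO_NAME = "X-DEMO";

    @Override
    public Charset charsetForName(String charsetName) {
        // Return null for names this provider does not know about
        if (DEMO_NAME.equalsIgnoreCase(charsetName)) {
            return Charset.forName("US-ASCII");
        }
        return null;
    }

    @Override
    public Iterator<Charset> charsets() {
        // The charsets this provider offers, used when listing available charsets
        return Collections.singletonList(Charset.forName("US-ASCII")).iterator();
    }
}
```

A real provider such as a UTF-7 one would return its own Charset subclass with a working newEncoder() and newDecoder().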

However, drop that library and code into a web application, and it won’t work. Didn’t for me, with Tomcat. You might be able to drop the library into the JRE/lib/ext folder or similar places, but I don’t like that option. Instead I thought I’d try to understand what’s going on. The short answer is: I don’t know.

The documentation says: “Charset providers may be installed in an instance of the Java platform as extensions, that is, jar files placed into any of the usual extension directories. Providers may also be made available by adding them to the applet or application class path or by some other platform-specific means. Charset providers are looked up via the current thread’s context class loader.” I’d have thought that WEB-INF/lib is a good extension directory, but I suspect the phrase “extension directory” is being used in a specific technical sense so perhaps WEB-INF/lib doesn’t apply. Maybe it’s a security issue, or perhaps it’s a bug.
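One way to poke at this is to ask which CharsetProvider registrations a given class loader can actually see. The sketch below is a diagnostic only, assuming Java 6 or later for java.util.ServiceLoader; ProviderProbe is a made-up name, and this is not a claim about how Charset.forName resolves names internally. Run from inside a servlet, it would at least show whether the jar's service file is visible to the context class loader and to the system class loader.

```java
import java.nio.charset.spi.CharsetProvider;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

// Diagnostic sketch: lists the CharsetProvider classes a class loader can discover.
class ProviderProbe {
    static List<String> visibleProviders(ClassLoader loader) {
        List<String> names = new ArrayList<String>();
        // ServiceLoader reads the META-INF/services registrations visible to this loader
        for (CharsetProvider provider : ServiceLoader.load(CharsetProvider.class, loader)) {
            names.add(provider.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        System.out.println("context loader: " + visibleProviders(context));
        System.out.println("system loader:  " + visibleProviders(ClassLoader.getSystemClassLoader()));
    }
}
```

If the provider shows up under one loader but not the other, that narrows down where the lookup is going wrong.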

So for now, I’ve used an explicit request along the lines of:

final Charset charset;
if (false == Charset.isSupported("UTF-7"))
{
 // Not visible through the normal lookup, so ask the provider directly
 ZimbraCharsetProvider zcpi = new ZimbraCharsetProvider();
 charset = zcpi.charsetForName("UTF-7");
}
else
{
 charset = Charset.forName("UTF-7");
}

It’s a pain, but at least that works.
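If the same workaround is needed in more than one place, it can be wrapped up in a small helper. This is a sketch under assumptions: CharsetLookup and forNameOrFallback are names invented here, and the fallback provider is passed in by the caller so that nothing depends on a particular library's class names.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;
import java.nio.charset.spi.CharsetProvider;

// Hypothetical helper: prefer the platform lookup, fall back to an explicit provider.
class CharsetLookup {
    static Charset forNameOrFallback(String name, CharsetProvider fallback) {
        if (Charset.isSupported(name)) {
            // The platform (or an installed provider) already knows this name
            return Charset.forName(name);
        }
        // Otherwise ask the supplied provider directly, bypassing provider discovery
        Charset charset = fallback.charsetForName(name);
        if (charset == null) {
            throw new UnsupportedCharsetException(name);
        }
        return charset;
    }
}
```

With the Zimbra jar on the class path, the call would look something like CharsetLookup.forNameOrFallback("UTF-7", new ZimbraCharsetProvider()).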

Given that Java is going GPL, I presume I can take the JCharSet UTF-7 implementation and submit it as a bug fix and get this sorted once and for all in the platform…