Character encoding is one of those things that needs careful attention all the time. My usual take is: “UTF-8 is the answer. What was the question?” UTF-8 has some handy properties, such as being compatible with ASCII, while being able to store any Unicode character. It’s described nicely in What Is UTF-8 And Why Is It Important? and more generally in the Java Internationalization book.
But other encodings are available, and I recently had to deal with UTF-7. This was a new one on me, but if you have to deal with low-level email encoding, you’ll find it’s surprisingly popular. The big thing about UTF-7 is that content can be included in SMTP without having to wrapper it in base64 or some other transfer encoding.
UTF-7 is supposed to be dead because, despite its plus points, we now have things like 8BITMIME. In practice, though, it’s not at all dead, and for the Java programmer this is a problem because the platform does not support UTF-7.
No panic: the language has an extension mechanism in the
CharsetProvider
class. Go find an open source library and drop it in.
There are a couple: Zimbra have a Mozilla licensed implementation
(thanks to Andy for spotting this), and there’s a GPL version called
JCharSet that can also be purchased as a commercial license. I went
with the Zimbra one:
$ cvs -z6 -d :pserver:anonymous:@cvs.zimbra.com:/usr/local/cvsroot co
main
$ cd main/ZimbraCharset
$ ant
… which produces build/zimbra-charset.jar
, which looks like this:
Drop the jar into your project, and you’re away with code like:
Charset utf7 = Charset.forName("UTF-7");
However, drop that library and code into a web application, and it won’t
work. Didn’t for me, with Tomcat. You might be able to drop the library
into the JRE/lib/ext
folder or similar places, but I don’t like that
option. Instead I thought I’d try to understand what’s going on. The
short answer is: I don’t know.
The documentation says: Charset providers may be installed in an
instance of the Java platform as extensions, that is, jar files placed
into any of the usual extension directories. Providers may also be made
available by adding them to the applet or application class path or by
some other platform-specific means. Charset providers are looked up via
the current thread’s context class loader. I’d have thought that
WEB-INF/lib
is a good extension directory, but I suspect the phrase
“extension directory” is being used in a specific technical sense so
perhaps WEB-INF/lib
doesn’t apply. Maybe it’s a security issue, or perhaps it’s a bug.
So for now, I’ve used an explicit request along the lines of:
final Charset charset;
if (false == Charset.isSupported("UTF-7"))
{
ZimbraCharsetProvider zcpi = new ZimbraCharsetProvider();
charset = zcpi.charsetForName("UTF-7);
}
else
{
charset = Charset.forName("UTF-7");
}
It’s a pain, but at least that works.
Given that Java is going GPL, I presume I can take the JCharSet UTF-7 implementation and submit it as a big fix and get this sorted once and for all in the platform…