Burned by subtlety: DOM4J vs W3C DOM (Xerces/Xalan)


I was tasked with switching a utility from the DOM4J API to the W3C DOM API because the DOM4J implementation was turning Scandinavian characters into garbage, while the XMLSerializer class in Xerces handles them without breaking a sweat.

Sounds relatively easy? Not when you are inexperienced in XML programming, in Java or otherwise, and in XSL and its relatives. Another factor was that the code was deeply tied to the DOM4J API, and not every DOM4J feature has a counterpart in the target implementation.

Fast forward 12 working days: I had almost completed the port, but the final output had some of the element markup characters ('<', '>') escaped. I searched high and low for an answer and played with the output format options of the Transformer class, but to no avail.

Then Francis helped me understand a little bit of XSL, or at least pointed me to where I should start. One man-day and a few online XSL tutorials later, I found this link, which describes my issue perfectly. It seems the interpretation of the contents of a CDATA section is parser-specific. Where was that link two days ago, before I started tearing my hair out? Grrr…
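For the record, here is a minimal sketch of the symptom. It is not my actual utility: the stylesheet, the class name CdataEscapingDemo and the render() helper are made up for illustration, and it assumes the XSLT processor bundled with the JDK. A CDATA section in a stylesheet is just another way of writing text, so a processor that follows the XSLT spec escapes the angle brackets when it serializes that text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CdataEscapingDemo {
    // Hypothetical stylesheet: the CDATA section holds markup, but to the
    // XSLT processor it is plain character data, not elements.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'><out><![CDATA[<b>bold</b>]]></out></xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet against a dummy input document and returns the
    // serialized result as a string.
    static String render() throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader("<doc/>")),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The <b> markup from the CDATA section comes out escaped.
        System.out.println(render());
    }
}
```

Playing with the Transformer output properties does not help here, because the text is escaped as ordinary character data no matter how the output is indented or encoded.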

Lesson of the Day: Google might be your friend, but like any friend it can play a nasty and costly trick on you.

Onward to converting the CDATA text into xsl:text elements with output escaping disabled, and hopefully completing this task today. I will play the good colleague and leave a note in each stylesheet so the next person who messes with the parsers won't go through the same grief I did.
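The fix I have in mind looks roughly like the sketch below. Again, the stylesheet and the names XslTextFixDemo and render() are invented for illustration, and I am assuming the JDK's built-in XSLT processor writing to a stream. Note that disable-output-escaping is an optional feature in XSLT 1.0, so a processor is allowed to ignore it, and it only has an effect when the processor does the serializing itself (for example, not when the result goes into a DOM tree).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslTextFixDemo {
    // Hypothetical stylesheet: the CDATA text is wrapped in xsl:text with
    // disable-output-escaping="yes", asking the serializer to emit it as-is.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'><out>"
      + "<xsl:text disable-output-escaping='yes'><![CDATA[<b>bold</b>]]></xsl:text>"
      + "</out></xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet against a dummy input document and returns the
    // serialized result as a string.
    static String render() throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        // A StreamResult lets the processor serialize the output itself,
        // which is where disable-output-escaping takes effect.
        transformer.transform(new StreamSource(new StringReader("<doc/>")),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render());
    }
}
```

The note I leave in each stylesheet will say more or less the same thing: raw markup in text must go through xsl:text with output escaping disabled, and whether that is honored depends on the processor.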

ciao!
