XML Special Characters
A couple of weeks ago, the QA team on my project reported a defect in a module I had developed. The module is part of our web services for distributing news data. The inbound execution flow is quite simple: the module accepts a SOAP request from the client, transforms it into the proprietary XML format of our backend server, then sends the transformed request to the backend news engine. The symptom was that users could not use a string containing XML special characters as a search keyword; for example, they could not query news headlines containing “S&P500”. The news engine could not parse such requests, which caused an exception on the backend server.
It’s a pretty well-known fact that there is a set of special characters that must be properly escaped using entity references before an XML instance that contains them can be consumed by a standard XML parser. These characters are the apostrophe ( ' ), ampersand ( & ), quotation mark ( " ), less-than symbol ( < ) and greater-than symbol ( > ). Normally we don’t need to inspect every character of our string data manually; any decent XML library should handle the task for us. For example, a JAX-WS stub internally uses JAXB for message marshalling, so you can just set “S&P500” directly as a search keyword.
NewsProvider_Service service = new NewsProvider_Service();
NewsProvider provider = service.getNewsProviderPort();
List<String> hls = provider.getNewsHeadlines("companies:S&P500");
JAXB will escape the “&” character using the “&amp;” entity reference to form a valid XML document. The actual SOAP message sent over the network will look like:
<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
<S:Body>
<ns2:getNewsHeadlines xmlns:ns2="http://ws.news.devguli/">
<query>companies:S&amp;P500</query>
</ns2:getNewsHeadlines>
</S:Body>
</S:Envelope>
The receiver at the other end of this communication must also use XML libraries to parse the message to get back the original “companies:S&P500” string.
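To illustrate the receiving side, here is a minimal sketch that parses the escaped element with the JDK's DOM parser and reads back the original text. The class name UnescapeOnParseSketch and the helper rootText are made up for this example; the real receiver would of course parse the whole SOAP envelope.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class UnescapeOnParseSketch {

    // Parses a small XML fragment and returns the text content of its root element.
    static String rootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The element as it travels on the wire, with the ampersand escaped.
        String wire = "<query>companies:S&amp;P500</query>";
        // The parser resolves the entity reference for us.
        System.out.println(rootText(wire)); // prints companies:S&P500
    }
}
```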
It was not too hard to figure out that something was wrong with the code responsible for creating the messages between my web services and the backend server. I must have done something that let the characters out unescaped. What I didn’t quite understand was why the error occurred only when the search keyword contained a “<” or “&” character. The rest of the special characters could be sent to the backend server just fine. I thought it might be implementation-dependent behavior of each XML library, so I tried playing with various APIs and found that all the standard Java XML libraries perform escaping only for “<”, “>” and “&”. I have to admit I hadn’t noticed this before.
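import java.io.StringWriter;
import java.io.Writer;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Marshaller;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Writes the same text through DOM, JAXB and StAX to compare how each one escapes it.
public class Main {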
public static void main(String[] args) throws Exception {
String msg = " [ < ], [ > ], [ \" ] , [ & ], [ ' ]";
StringWriter writer = new StringWriter();
writeDOM(msg, writer);
System.out.println("DOM Output = " + writer);
writer = new StringWriter();
writeJAXB(msg, writer);
System.out.println("JAXB Output = " + writer);
writer = new StringWriter();
writeSTAX(msg, writer);
System.out.println("StaX Output = " + writer);
}
public static void writeDOM(String msg, Writer writer) throws Exception{
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element data = doc.createElement("Data");
data.appendChild( doc.createTextNode(msg) );
doc.appendChild( data);
Transformer tr = TransformerFactory.newInstance().newTransformer();
tr.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
tr.transform( new DOMSource(doc.getDocumentElement()) , new StreamResult(writer) );
}
public static void writeJAXB(String msg, Writer writer) throws Exception{
QName qn = new QName("Data");
JAXBElement<String> elem = new JAXBElement<String>(qn, String.class, msg);
JAXBContext ctx = JAXBContext.newInstance(String.class);
Marshaller m = ctx.createMarshaller();
m.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE);
m.marshal(elem, writer);
}
public static void writeSTAX(String msg, Writer writer) throws Exception {
XMLStreamWriter xmlWriter = XMLOutputFactory.newInstance().createXMLStreamWriter(writer);
xmlWriter.writeStartElement("Data");
xmlWriter.writeCharacters(msg);
xmlWriter.writeEndElement();
xmlWriter.close();
}
}
The above code shows that JAXB, DOM and StAX all output the same string: “ [ &lt; ], [ &gt; ], [ " ] , [ &amp; ], [ ' ]”. I looked for more information and found a question on StackOverflow to which Jon Skeet (you must have heard the name if you are a regular at the site) had posted a valuable answer.
From section 2.4 of the XML 1.0 spec (5th edition)
“The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.”
The paragraph above states that escaping “<” and “&” is mandatory. This explains why, in my case, the exception occurred only when the request contained a search keyword with one of those two characters, while requests with keywords containing “>” worked just fine.
For the greater-than character, the rules are a bit more relaxed. When writing XML, a library “must” escape the “>” character if it wants to produce an XML instance that is compatible with the SGML standard (of which XML is a subset). All the standard Java XML libraries I tried do exactly that, but I am not sure whether it is because of this compatibility concern or simply because it is good practice. When reading XML, the “>” characters in the raw XML string don’t need to be escaped. XML libraries can successfully parse a file containing content such as the following.
<Data> Special text A > B </Data>
There is one exception: if the greater-than character is part of the string “]]>” and that string does not mark the end of a CDATA section, then the “>” character must be escaped.
<Data> Special text ]]> </Data> // causes a parsing error
<Data> Special text ]]&gt; </Data> // valid XML
<Data><![CDATA[ Special text]]></Data> // valid XML
I have to say this small defect taught me a lot about escaping special characters in XML.
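These parsing rules are easy to check for yourself. The sketch below feeds a few small documents to the JDK's default DOM parser and reports whether each one is accepted. The class GreaterThanParsingSketch and the method isWellFormed are names invented for this example, and the results in the comments are what I would expect from the parser bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class GreaterThanParsingSketch {

    // Returns true if the parser accepts the document, false if it reports a parse error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document is not well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser setup or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<Data> A > B </Data>"));             // true: a bare > is allowed
        System.out.println(isWellFormed("<Data> text ]]> </Data>"));          // false: ]]> outside CDATA
        System.out.println(isWellFormed("<Data> text ]]&gt; </Data>"));       // true: the > is escaped
        System.out.println(isWellFormed("<Data><![CDATA[ text]]></Data>"));   // true: a proper CDATA section
        System.out.println(isWellFormed("<Data> S&P500 </Data>"));            // false: a bare & is not allowed
        System.out.println(isWellFormed("<Data> A < B </Data>"));             // false: a bare < is not allowed
    }
}
```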
Root cause
I will describe the root cause of this defect here, in case you are interested in how my module makes use of XML libraries but still has a hole that lets those special characters out unescaped. The reason is that part of the module constructs an XML document by simply appending strings together. The messages used in the web services are JAXB objects, and I have a requirement to marshal some of them to XML strings with a different namespace. Since the namespace information is an inherent property of the JAXB classes and can’t be modified, I have to marshal those objects to a SAX content handler and manipulate the namespace information in the handler instead.
public static String overrideNameSpace(JAXBElement<?> jaxb)
throws JAXBException{
Marshaller m = JAXBContext.newInstance("com.devguli").createMarshaller();
StringBuffer xml = new StringBuffer();
String nsPrefix = "news";
NameSpaceOverriderHandler handler = new NameSpaceOverriderHandler (nsPrefix);
m.marshal( jaxb , new SAXResult(handler) );
xml.append("<" + nsPrefix + ":Root xmlns:" + nsPrefix + "='http://devguli.com/news'>");
xml.append(handler.getOutputXML() );
xml.append("</" + nsPrefix + ":Root>");
return xml.toString();
}
public class NameSpaceOverriderHandler extends DefaultHandler {
private final String nsPrefix;
private final StringBuilder outputXml;
public NameSpaceOverriderHandler(String nsPrefix) {
this.nsPrefix = nsPrefix;
this.outputXml = new StringBuilder();
}
public void startElement(String uri, String localName, String qName, Attributes atts)
throws SAXException {
outputXml.append("<" + nsPrefix + ":" + localName + serializeAllAttributes(atts) + ">");
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
outputXml.append("</" + nsPrefix + ":" + localName + ">");
}
public void characters(char[] ch, int start, int length) throws SAXException {
outputXml.append(ch, start, length);
}
………
………
public String getOutputXML() {
return outputXml.toString();
}
}
You can see that in the characters() SAX callback I just appended all characters directly, without checking whether the char array contained XML special characters.
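One possible way to close the hole is to escape the text inside characters() before appending it. The following is only a sketch under my own assumptions, not the exact fix that went into the module: the class EscapingCharactersSketch and the helper appendEscaped are invented names, and the handler is stripped down to the text-handling part.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, stripped-down handler that shows only the text-escaping part.
class EscapingCharactersSketch extends DefaultHandler {
    private final StringBuilder outputXml = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Escape each character of the reported range instead of appending it raw.
        for (int i = start; i < start + length; i++) {
            appendEscaped(outputXml, ch[i]);
        }
    }

    // Escapes the characters that must not (or should not) appear literally in element content.
    static void appendEscaped(StringBuilder out, char c) {
        switch (c) {
            case '&': out.append("&amp;"); break;
            case '<': out.append("&lt;"); break;
            case '>': out.append("&gt;"); break; // not strictly required, but avoids the ]]> problem
            default:  out.append(c);
        }
    }

    public String getOutputXML() {
        return outputXml.toString();
    }

    public static void main(String[] args) {
        EscapingCharactersSketch handler = new EscapingCharactersSketch();
        char[] text = "companies:S&P500".toCharArray();
        handler.characters(text, 0, text.length);
        System.out.println(handler.getOutputXML()); // prints companies:S&amp;P500
    }
}
```

Attribute values written by string concatenation need the same care, and there the quote character used as the delimiter has to be escaped as well.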