XML Special Characters

Feb 23, 2011 by

A couple of weeks ago, the QA team on my project reported a defect in a module I had developed. The module is part of our web services for distributing news data. The inbound execution flow is quite simple: the module accepts a SOAP request from the client, transforms it into the proprietary XML format of our backend server, then sends the transformed request to the backend news engine. The defect was that users could not use a string containing XML special characters as a search keyword; for example, they could not query news headlines containing “S&P500”. The news engine could not parse such requests, which caused an exception on the backend server.

It’s a pretty well-known fact that XML has a set of special characters that are expected to be escaped using entity references before an XML instance containing them is consumed by a standard XML parser. These characters are the apostrophe ( ' ), ampersand ( & ), quotation mark ( " ), less-than symbol ( < ) and greater-than symbol ( > ). Normally, we don’t need to inspect every character of our string data manually; any decent XML library should handle the task for us. For example, a JAX-WS stub internally uses JAXB for message parsing, so you can just pass “S&P500” directly as a search keyword.

NewsProvider_Service service = new NewsProvider_Service();
NewsProvider provider = service.getNewsProviderPort();
List<String> hls = provider.getNewsHeadlines("companies:S&P500");

JAXB will escape the “&” character using the “&amp;” entity reference to form a valid XML document. The actual SOAP message sent over the network will look like:

<S:Envelope xmlns:S="http://schemas.xmlsoap.org/soap/envelope/">
    <S:Body>
        <ns2:getNewsHeadlines xmlns:ns2="http://ws.news.devguli/">
            <query>companies:S&amp;P500</query>
        </ns2:getNewsHeadlines>
    </S:Body>
</S:Envelope>

The receiver at the other end of this communication must also use an XML library to parse the message and get back the original “companies:S&P500” string.
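
To make the receiving side concrete, here is a minimal sketch using only the DOM parser that ships with the JDK. The class name UnescapeDemo and the helper readRootText are made up for this illustration; the real receiver in this story is the backend news engine, whose code is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class UnescapeDemo {

    // Parses an XML fragment and returns the text content of its root element.
    static String readRootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Checked parser exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // The wire form carries &amp;; the parser hands back the plain ampersand.
        System.out.println(readRootText("<query>companies:S&amp;P500</query>"));
        // prints: companies:S&P500
    }
}
```

The point is that neither side ever handles “&amp;” by hand: the writer escapes, the parser unescapes, and application code only sees the original string.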

It was not too hard to figure out that something was wrong with the code responsible for creating the messages between my web services and the backend server. I must have done something that let the characters out unescaped. What I didn’t quite understand was why the error occurred only when the search keyword contained a “<” or “&” character. The rest of the special characters could be sent to the backend server just fine. I thought it might be implementation-dependent behavior of each XML library, so I tried various APIs and found that all standard Java XML libraries escape only “<”, “>” and “&” in text content. I have to admit I hadn’t noticed this before.

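import java.io.StringWriter;
import java.io.Writer;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBElement;
import javax.xml.bind.Marshaller;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Writes the same text through DOM, JAXB and StAX to compare their escaping.
public class Main {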
    public static void main(String[] args) throws Exception {
        String msg = " [ < ], [ > ],  [ \" ] , [ & ], [ ' ]";
        StringWriter writer = new StringWriter();

        writeDOM(msg, writer);
        System.out.println("DOM Output = " + writer);

        writer = new StringWriter();
        writeJAXB(msg, writer);
        System.out.println("JAXB Output = " + writer);

        writer = new StringWriter();
        writeSTAX(msg, writer);
        System.out.println("StaX Output = " + writer);
    }

    public static void writeDOM(String msg, Writer writer) throws Exception{
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();

        Element data = doc.createElement("Data");
        data.appendChild( doc.createTextNode(msg) );

        doc.appendChild( data);

        Transformer tr = TransformerFactory.newInstance().newTransformer();
        tr.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        tr.transform( new DOMSource(doc.getDocumentElement()) , new StreamResult(writer) );
    }

    public static void writeJAXB(String msg, Writer writer) throws Exception{
        QName qn = new QName("Data");
        JAXBElement<String> elem = new JAXBElement<String>(qn, String.class, msg);

        JAXBContext ctx = JAXBContext.newInstance(String.class);
        Marshaller m = ctx.createMarshaller();
        m.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE);
        m.marshal(elem, writer);
    }

    public static void writeSTAX(String msg, Writer writer) throws Exception {
        XMLStreamWriter xmlWriter = XMLOutputFactory.newInstance().createXMLStreamWriter(writer);

        xmlWriter.writeStartElement("Data");
        xmlWriter.writeCharacters(msg);
        xmlWriter.writeEndElement();
        xmlWriter.close();
    }
}

The above code shows that JAXB, DOM and StAX all output the same string: “ [ &lt; ], [ &gt; ], [ " ] , [ &amp; ], [ ' ]”. I looked for more information and found a question on StackOverflow to which Jon Skeet (you must have heard the name if you are a regular at the site) had posted a valuable answer.

From section 2.4 of the XML 1.0 spec (5th edition)

“The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings “&amp;” and “&lt;” respectively. The right angle bracket (>) may be represented using the string “&gt;”, and must, for compatibility, be escaped using either “&gt;” or a character reference when it appears in the string “]]>” in content, when that string is not marking the end of a CDATA section.”

The above paragraph states that escaping “<” and “&” is mandatory. This explains why, in my case, the exception occurred only when the request contained a search keyword with one of those two characters, while requests with a keyword containing “>” worked just fine.
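
That leaves the quotation mark and the apostrophe. The same section of the spec allows them to be written as “&quot;” and “&apos;” so that attribute values can contain both kinds of quote; in element text they are harmless, which is consistent with the output above. The sketch below is only an illustration under that reading: it writes a value into an attribute with StAX and reads it back with DOM. The class and method names are invented for the example, and the exact escaped form the writer produces is implementation-dependent, so the code checks the round trip rather than the serialized text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeQuoteDemo {

    // Serializes <Data value="..."> with StAX and returns the XML text.
    static String writeWithAttribute(String value) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("Data");
            w.writeAttribute("value", value); // the writer is responsible for escaping
            w.writeEndElement();
            w.flush();
            w.close();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the XML back and returns the attribute as application code sees it.
    static String readAttribute(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return root.getAttribute("value");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String original = "say \"S&P500\" < 'now'";
        String xml = writeWithAttribute(original);
        System.out.println(xml);
        // The parsed value should equal the original if the writer escaped correctly.
        System.out.println(original.equals(readAttribute(xml)));
    }
}
```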

For the greater-than character, the rules are a bit more relaxed. When writing, the spec only says “>” “must” be escaped, for compatibility with SGML (of which XML is a subset), when it appears in the string “]]>”; everywhere else escaping it is optional. All the standard Java XML libraries I tried escape it anyway, but I am not sure whether that is because of this compatibility concern or just because it is good practice. When reading, “>” characters in the raw XML string don’t need to be escaped. XML libraries can successfully parse a file containing content like the following.

<Data> Special text A > B </Data>

There is one exception. If the greater-than character is part of the string “]]>” and that string is not closing a CDATA section, then the “>” character must be escaped.

<Data> Special text  ]]> </Data>  // cause parsing error
<Data> Special text  ]]&gt; </Data>  // valid XML
<Data><![CDATA[ Special text]]></Data> // valid XML
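
These rules are easy to check against the parser bundled with the JDK. The following is a small sketch, not part of the original module: WellFormedCheck and isWellFormed are names invented here, and the result comments reflect the rules quoted from the spec above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Returns true if the JDK's DOM parser accepts the string as XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps them off stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<Data>S&P500</Data>"));            // false: raw &
        System.out.println(isWellFormed("<Data>A < B</Data>"));             // false: raw <
        System.out.println(isWellFormed("<Data>A > B</Data>"));             // true: raw > is allowed
        System.out.println(isWellFormed("<Data>\"x\" and 'y'</Data>"));     // true: quotes in text
        System.out.println(isWellFormed("<Data> ]]> </Data>"));             // false: ]]> in content
        System.out.println(isWellFormed("<Data> ]]&gt; </Data>"));          // true
        System.out.println(isWellFormed("<Data><![CDATA[ S&P ]]></Data>")); // true: CDATA section
    }
}
```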

I have to say this small defect taught me a lot about escaping special characters in XML.

Root cause

I will describe the root cause of this defect here, in case you are interested in how my module could make use of XML libraries and still have a hole that let those special characters out unescaped. The reason is that one part of the module constructs an XML document by simply appending strings together. The messages used in the web services are JAXB objects, and I have a requirement to marshal some of those objects to an XML string but with a different namespace. Since the namespace information is an inherent property of the JAXB classes and can’t be modified, I have to marshal those objects to a SAX content handler and manipulate the namespace information in the handler instead.

public static String overrideNameSpace(JAXBElement<?> jaxb) throws JAXBException {
    Marshaller m = JAXBContext.newInstance("com.devguli").createMarshaller();

    StringBuffer xml = new StringBuffer();
    String nsPrefix = "news";

    NameSpaceOverriderHandler handler = new NameSpaceOverriderHandler(nsPrefix);
    m.marshal(jaxb, new SAXResult(handler));

    xml.append("<" + nsPrefix + ":Root xmlns:" + nsPrefix + "='http://devguli.com/news'>");
    xml.append(handler.getOutputXML());
    xml.append("</" + nsPrefix + ":Root>");

    return xml.toString();
}

public class NameSpaceOverriderHandler extends DefaultHandler {

    private final String nsPrefix;
    private final StringBuilder outputXml;

    public NameSpaceOverriderHandler(String nsPrefix) {
        this.nsPrefix = nsPrefix;
        this.outputXml = new StringBuilder();
    }

    public void startElement(String uri, String localName, String qName, Attributes atts)
    throws SAXException {
        outputXml.append("<" + nsPrefix + ":" + localName + serializeAllAttributes(atts) + ">");
    }

    public void endElement(String uri, String localName, String qName)
    throws SAXException {
        outputXml.append("</" + nsPrefix + ":" + localName + ">");
    }

    public void characters(char[] ch, int start, int length) throws SAXException {
        outputXml.append(ch, start, length);
    }

     ………
     ………

    public String getOutputXML() {
        return outputXml.toString();
    }

}

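You can see in the characters() callback that I simply appended all characters directly, without checking whether the char array contained XML special characters. SAX events carry the raw, already unescaped text, so whatever escaping the marshaller would have applied when writing to a stream never happens here, and a bare “&” or “<” ends up in the output string.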
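
One way to close the hole is to escape inside characters() before appending. The sketch below shows only that idea and is not the fix that went into the actual module: EscapingHandler and roundTrip are illustrative names, the namespace-prefix rewriting is left out, and attributes are ignored to keep it short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EscapingHandler extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        out.append('<').append(qName).append('>'); // attributes omitted in this sketch
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        out.append("</").append(qName).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX delivers unescaped text, so re-escape before building XML by hand.
        for (int i = start; i < start + length; i++) {
            switch (ch[i]) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(ch[i]);
            }
        }
    }

    String getOutputXML() {
        return out.toString();
    }

    // Feeds an XML string through the handler and returns the rebuilt XML.
    static String roundTrip(String xml) {
        try {
            EscapingHandler handler = new EscapingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.getOutputXML();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<Data>S&amp;P500 &lt; 600 &gt; 400</Data>"));
        // prints: <Data>S&amp;P500 &lt; 600 &gt; 400</Data>
    }
}
```

Escaping “>” as well is not strictly required, but it also takes care of the “]]>” case without any extra bookkeeping.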

JAXB Binder and XPath

May 14, 2010 by

I came across javax.xml.bind.Binder when I was reading SOA Using Java Web Services (an excellent book). I had never used this class before, so I set out to find out how it could be used. I found that it isn’t mentioned nearly as often as classes like Marshaller or Unmarshaller, but it is very useful.

Binder is usually used to perform partial binding, that is, unmarshalling a JAXB object from part of an XML DOM tree. The JAXB specification describes three use cases for the class. Two are related to partial binding and the third is about the ability to use XPath navigation. The last one interests me the most, because one module of my product can make perfect use of this technique.

Below is the XML schema I have created just to simulate the functionality of the module.

<schema xmlns="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://ws.news.com/query"
    xmlns:tns="http://ws.news.com/query"
    elementFormDefault="qualified"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <element name="Query" type="tns:Query"/>

    <complexType name="Query">
    	<sequence>
    		<element name="TimeOut" type="int"/>
    		<element name="Hit" type="int"/>
    		<element name="Filter" type="tns:Filter"/>
    	</sequence>
    </complexType> 

    <complexType name="Filter">
		<group ref="tns:Searchable"/>
	</complexType>

	<element name="And" type="tns:BooleanExpr"/>
	<element name="Or" type="tns:BooleanExpr"/>

	<group name="Searchable">
		<choice>
			<element name="Company" type="string"/>
			<element name="Section" type="string"/>
			<element name="TitleText" type="string"/>
			<element name="TitleAndBodyText" type="string"/>
			<element ref="tns:And"/>
			<element ref="tns:Or"/>
		</choice>
	</group>

	<complexType name="BooleanExpr">
		<sequence>
			<group ref="tns:Searchable" minOccurs="2" maxOccurs="unbounded"/>
		</sequence>
	</complexType>
</schema>

The schema describes the request format of a kind of search engine. Users can search for items associated with metadata (Company or Section) or for items that contain a particular string. In my real production code, it’s a news server. The interesting thing is that the schema allows users to group searchable indexes using boolean operators such as And and Or. A boolean operator can itself contain nested boolean operators, so the content tree of a request can grow to any depth.

The job of the module I mentioned is to extract all occurrences of the JAXB objects corresponding to the Company element, decorate the content of those indexes, then replace the original content with the newly decorated content. Below is an example of a simple request.

<Query xmlns="http://ws.news.com/query"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

	<TimeOut>10</TimeOut>
	<Hit>60</Hit>
	<Filter>
		<Or>
			<And>
				<Or>
					<Section>News</Section>
					<Section>Announcement</Section>
					<Section>Product</Section>
				</Or>
				<Company>BBL.BK</Company>
			</And>
			<And>
				<Or>
					<Section>News</Section>
					<Section>Announcement</Section>
					<Section>Product</Section>
				</Or>
				<Company>PTT.BK</Company>
			</And>
			<And>
				<Section>Trade</Section>
				<Company>SCB.BK</Company>
				<Company>MSFT.O</Company>
				<Company>IBM.N</Company>
			</And>
		</Or>
	</Filter>
</Query>

Manipulating JAXB objects is normally easier than operating on the low-level DOM. But traversing a whole JAXB object hierarchy is not much better than traversing a DOM tree, especially when the JAXB object we are working with is not quite straightforward. Let’s look at the generated BooleanExpr class, for example.

@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "BooleanExpr", propOrder = {
    "searchable"
})
public class BooleanExpr {

    @XmlElementRefs({
        @XmlElementRef(name = "Or", namespace = "http://ws.news.com/query", type = JAXBElement.class),
        @XmlElementRef(name = "TitleAndBodyText", namespace = "http://ws.news.com/query", type = JAXBElement.class),
        @XmlElementRef(name = "TitleText", namespace = "http://ws.news.com/query", type = JAXBElement.class),
        @XmlElementRef(name = "Section", namespace = "http://ws.news.com/query", type = JAXBElement.class),
        @XmlElementRef(name = "And", namespace = "http://ws.news.com/query", type = JAXBElement.class),
        @XmlElementRef(name = "Company", namespace = "http://ws.news.com/query", type = JAXBElement.class)
    })
    protected List<JAXBElement<?>> searchable;

    public List<JAXBElement<?>> getSearchable() {
        if (searchable == null) {
            searchable = new ArrayList<JAXBElement<?>>();
        }
        return this.searchable;
    }
}

Data binding between Java and XML is not perfect. XML is a very large and complex standard, and it is very difficult, if not impossible, to define a seamless mapping between a Java representation and the whole XML information set. Some XML constructs cannot be mapped to Java with all of their XML constraints fully preserved.

The content of BooleanExpr is a choice model group which, combined with the maxOccurs="unbounded" constraint, makes the getSearchable() method less than pretty: it returns a list of generic JAXBElement<?> items. Traversing a BooleanExpr therefore needs some checks to see which kind of object is being operated on.

public static void handleBooleanExpr(BooleanExpr expr) {
    List<JAXBElement<?>> searchableList = expr.getSearchable();
    for (JAXBElement<?> elem : searchableList) {
        if (elem.getName().equals(andQname) || elem.getName().equals(orQname)) {
            // And/Or wrap a nested BooleanExpr, so recurse into it.
            handleBooleanExpr((BooleanExpr) elem.getValue());
        } else if (elem.getName().equals(companyQname)) {
            // Only Company leaves are decorated; other leaf types are left alone.
            decorate(elem);
        }
    }
}

I am using just one choice model group in the example because I don’t want to make it too complicated. You can probably guess that the code to traverse the JAXB objects will get bloated quickly if there are three or more choice model groups.

If our JAXB object were a DOM document, XPath would be the clear choice for this kind of task. But to use DOM I would have to marshal the JAXB object to DOM, apply an XPath query to perform the decoration, then unmarshal the modified DOM back to a JAXB object, and repeat this round trip every time I want to use XPath on the request. It would be nice to operate on the request with both JAXB objects and XPath. JAXB Binder allows you to do just that.

public void decorateCompany(Query query) throws JAXBException, XPathExpressionException{
		Binder<Node> binder = _ctx.createBinder();
		Node queryDOMView = createBlankDOMDocument(true);  

		// Marshal the Query object to a blank DOM document.
		// The Binder maintains the association between the two views.
		QName qname = new QName("http://ws.news.com/query", "Query");
		binder.marshal( new JAXBElement<Query>(qname, Query.class, query)  , queryDOMView);

		//Search for all occurrences of Company using XPath.
		XPath xpath = XPathFactory.newInstance().newXPath();
		xpath.setNamespaceContext( new QueryNamespaceContext());
		NodeList compList = (NodeList)xpath.evaluate("//query:Company", queryDOMView, XPathConstants.NODESET);

		//Perform decoration
		for(int i=0; i<compList.getLength(); i++){
			Node comp = compList.item(i);
			comp.setTextContent( decorate( comp.getTextContent() ));
		}

		//Synchronize the changes back to Query object.
		binder.updateJAXB(queryDOMView);

	}

	public Node createBlankDOMDocument(boolean namespaceAware) {
		DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
		fact.setNamespaceAware(namespaceAware);
		DocumentBuilder builder;
		try {
			builder = fact.newDocumentBuilder();

		} catch (ParserConfigurationException e) {
			throw new RuntimeException(e);
		}

		return builder.newDocument();
	}
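
The QueryNamespaceContext class used in decorateCompany() is not shown above. As a rough sketch of what such a class could look like, assuming it only needs to bind the query prefix to the schema’s target namespace: the class name QueryNamespaceContextSketch and the count helper are invented for this example, and the XPath runs against a plain parsed DOM document rather than a Binder-backed one so that it needs nothing beyond the JDK.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class QueryNamespaceContextSketch implements NamespaceContext {

    static final String QUERY_NS = "http://ws.news.com/query";

    @Override
    public String getNamespaceURI(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        // Only the "query" prefix is known; everything else is unbound.
        return "query".equals(prefix) ? QUERY_NS : XMLConstants.NULL_NS_URI;
    }

    @Override
    public String getPrefix(String namespaceURI) {
        return QUERY_NS.equals(namespaceURI) ? "query" : null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
    }

    // Counts the nodes an XPath expression selects in a namespace-aware DOM.
    static int count(String xml, String expression) {
        try {
            DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
            fact.setNamespaceAware(true);
            Document doc = fact.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new QueryNamespaceContextSketch());
            NodeList hits = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Query xmlns=\"http://ws.news.com/query\"><Filter><And>"
                + "<Section>Trade</Section><Company>SCB.BK</Company><Company>IBM.N</Company>"
                + "</And></Filter></Query>";
        System.out.println(count(xml, "//query:Company")); // prints: 2
        System.out.println(count(xml, "//Company"));       // prints: 0
    }
}
```

The second call shows why the namespace context matters: the request uses a default namespace, so an unprefixed //Company selects nothing.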

Binder maintains the association between a JAXB object and its corresponding XML information set. You can bind a Query object to a DOM document, then modify the JAXB object and push the modification to the associated DOM (binder.updateXML). Or you can modify the DOM tree and then synchronize the changes back to the JAXB object (binder.updateJAXB). This gives us the best of both worlds: it’s easy to get simple properties like Hit or TimeOut from the Query object, and I also have the option of using low-level XML tools like XPath to search the whole Query object graph for particular information.
