XML parsing with Haskell

by Colin Ross on February 16, 2009

Although I have been aware of the existence of XML, I have never had any real experience of working with it. It looks like that is changing.

It all started when, on a whim, I purchased XSLT by Doug Tidwell. As I began reading it, typing in various bits and pieces of XML, I came to the conclusion that that I really needed to sit down and learn what XML is actually all about.

Up until that point my idea of XML was as a relatively simple format with tags, and so long as the tags match, everything is ok. How hard could it be to create an XML parser? It surely  must be pretty straightforward, and could be achieved in a few dozen lines of code.

And so began journey into the depths of the XML 1.0 specification. My code is currently just creeping over 2000 lines and has yet to pass my initial test suite, although it is getting closer.

I used the parsec library to do the parsing of the file together with the compact-string library to deal with encodings and they worked pretty smoothly once I figured out what I was trying to achieve with them. In fact, getting the initial XML parser was relatively straightforward – just some grunt work to implement it all. The problems mainly occurred with various well-formedness constraints and (something I have yet to tackle at all) the validity constraints. Additionally, since there are some nice tests to go with the specification, my focus is currently on getting them to pass, and in particular writing out my parsed XML document in canonical form.

My development of the XML parser has (will have!) the following stages:

  • Read the file as UTF16
  • Preprocess to get rid of carriage returns
  • Parse the basic structure
  • Expand various entities and make sure they also parse correctly
  • Ensure the various well-formedness constraints are satisfied
  • Ensure the various validity constraints are satisfied
  • Completion!

I am currently in the middle of stage 5 partly because I keep getting distracted by making the canonical export work (it is needed for checking results of some tests). Unfortunately, I believe I am still doing the easy stuff and that the most difficult stage is the next one where the validity constraints have to be enforced.

Nonetheless, it has been a valuable learning experience – my knowledge of XML has increased many-fold as has my familiarity with parsec and compact-string libraries.

One thing I have noticed with parsec is that I probably over-use the “try” combinator. I took a very lazy approach and basically implemented my own version of the parsec combinators which scatter “try”s like there is no tomorrow, to ensure I didn’t have to do any hard thinking about when they are actually required. While this means that parsing successfully is easy, it does seem to mean that if there is a parse error, I get almost no clue where the error was, which is somewhat unfortunate. Still, hopefully as time goes on, I can start to remove the “try”s bit by bit and gain some performance improvements along the way.

I did come up with a perhaps-useful combinator of my own during the process of implementing the XML specification. At several stages, there is a rule of the form

NotVowel = Char - [aeiou]

which should match any character except a vowel. My first attempt to implement this used “manyTill” like so

notVowel = do
   manyTill char vowel

but this has the unfortunate effect of “eating” the first vowel encountered. Instead, it should be implemented something like this:

notVowel = do
   m <- optionMaybe $ C.lookAhead vowel
   case m of
      Just _ -> P.unexpected "Vowel encountered"
      Nothing -> char

or, in its more general form:

type Parser = P.Parsec Input XmlState
butNot :: (Parser a) -> (Parser b) -> (Parser a)
butNot p e = do
   m <- optionMaybe $ C.lookAhead e
   case m of
      Just _ -> P.unexpected "butNot failure"
      Nothing -> p

This seems to do the job marvellously well. Basically, it looks for the “bad” input and if it finds it, fails. If it doesn’t find the “bad” input, then it happily parses “good” input.

Bookmark and Share

No related posts.

{ 6 comments… read them below or add one }

Antoine Latter February 17, 2009 at 4:41 am

Try this:


> vowel = oneOf "aeiouAEIOU"
> notVowels = many (notFollowedBy vowel >> anyChar)

:-)

Reply

Colin Ross February 18, 2009 at 3:59 pm

@Antoine

Doesn’t that “eat” the char after the vowels though.

That is what I am trying to avoid.

Reply

Joe Thornber February 18, 2009 at 4:02 pm

> vowels = “aeiouAEIOU”
> vowel = oneOf vowels
> notVowel = noneOf vowels

Reply

Colin Ross February 19, 2009 at 10:21 am

Whoops, I realise I have been confusing myself – not a good thing to do. Antoine’s solution is in fact perfect. I was confusing whether I wsa matching vowels or not.

Also, I somehow managed to miss the useful “notFollowedBy” function in the parsec documentation.

So,

butNot p e

is equivalent to

notFollowedBy p >> e

which is nice since it gets rid of the awful-looking “optionMaybe” and “unexpected” as well as the “case”.

Reply

Lila January 22, 2010 at 10:03 am

Hello,
I have some questiones about the parsec Library and I would really appreciate some help.
I need to make a Hakell programme that takes a XML and XPath expression as inputs and solve the Xpath expression for the XML file. I need first to represent both XML and Xpath as Haskell data.
I am supposing that my programme will take only valid XML files and valid Xpath expression, do you think and can use the parsec Library for this, or as they’re valid inputs there is no need to use a parser for that? I just have some toubles representing an XML file in Haskell and I was searching for alternatives to do so..
Thank you in advance for your help,
Lila

Reply

Colin Ross January 22, 2010 at 10:46 am

I think the best way would probably to use something like the hxt library (http://hackage.haskell.org/package/hxt). Having a browse though it, there looks to be support for XPath .

In particular, the Text.XML.HXT.XPath.XPathEval module looks to have what you need. Basically you would parse your xml with hxt, then perform the XPath query on the result.

I haven’t got a great deal of experience with this though so have no idea how successful it might be.

Good luck!

Reply

Leave a Comment

Previous post:

Next post: