Although I have been aware of the existence of XML, I have never had any real experience of working with it. It looks like that is changing.
It all started when, on a whim, I purchased XSLT by Doug Tidwell. As I began reading it, typing in various bits and pieces of XML, I came to the conclusion that that I really needed to sit down and learn what XML is actually all about.
Up until that point my idea of XML was as a relatively simple format with tags, and so long as the tags match, everything is ok. How hard could it be to create an XML parser? It surely must be pretty straightforward, and could be achieved in a few dozen lines of code.
And so began journey into the depths of the XML 1.0 specification. My code is currently just creeping over 2000 lines and has yet to pass my initial test suite, although it is getting closer.
I used the parsec library to do the parsing of the file together with the compact-string library to deal with encodings and they worked pretty smoothly once I figured out what I was trying to achieve with them. In fact, getting the initial XML parser was relatively straightforward – just some grunt work to implement it all. The problems mainly occurred with various well-formedness constraints and (something I have yet to tackle at all) the validity constraints. Additionally, since there are some nice tests to go with the specification, my focus is currently on getting them to pass, and in particular writing out my parsed XML document in canonical form.
My development of the XML parser has (will have!) the following stages:
- Read the file as UTF16
- Preprocess to get rid of carriage returns
- Parse the basic structure
- Expand various entities and make sure they also parse correctly
- Ensure the various well-formedness constraints are satisfied
- Ensure the various validity constraints are satisfied
- Completion!
I am currently in the middle of stage 5 partly because I keep getting distracted by making the canonical export work (it is needed for checking results of some tests). Unfortunately, I believe I am still doing the easy stuff and that the most difficult stage is the next one where the validity constraints have to be enforced.
Nonetheless, it has been a valuable learning experience – my knowledge of XML has increased many-fold as has my familiarity with parsec and compact-string libraries.
One thing I have noticed with parsec is that I probably over-use the “try” combinator. I took a very lazy approach and basically implemented my own version of the parsec combinators which scatter “try”s like there is no tomorrow, to ensure I didn’t have to do any hard thinking about when they are actually required. While this means that parsing successfully is easy, it does seem to mean that if there is a parse error, I get almost no clue where the error was, which is somewhat unfortunate. Still, hopefully as time goes on, I can start to remove the “try”s bit by bit and gain some performance improvements along the way.
I did come up with a perhaps-useful combinator of my own during the process of implementing the XML specification. At several stages, there is a rule of the form
NotVowel = Char - [aeiou]
which should match any character except a vowel. My first attempt to implement this used “manyTill” like so
notVowel = do manyTill char vowel
but this has the unfortunate effect of “eating” the first vowel encountered. Instead, it should be implemented something like this:
notVowel = do m <- optionMaybe $ C.lookAhead vowel case m of Just _ -> P.unexpected "Vowel encountered" Nothing -> char
or, in its more general form:
type Parser = P.Parsec Input XmlState butNot :: (Parser a) -> (Parser b) -> (Parser a) butNot p e = do m <- optionMaybe $ C.lookAhead e case m of Just _ -> P.unexpected "butNot failure" Nothing -> p
This seems to do the job marvellously well. Basically, it looks for the “bad” input and if it finds it, fails. If it doesn’t find the “bad” input, then it happily parses “good” input.
No related posts.

{ 6 comments… read them below or add one }
Try this:
> vowel = oneOf "aeiouAEIOU"
> notVowels = many (notFollowedBy vowel >> anyChar)
@Antoine
Doesn’t that “eat” the char after the vowels though.
That is what I am trying to avoid.
> vowels = “aeiouAEIOU”
> vowel = oneOf vowels
> notVowel = noneOf vowels
Whoops, I realise I have been confusing myself – not a good thing to do. Antoine’s solution is in fact perfect. I was confusing whether I wsa matching vowels or not.
Also, I somehow managed to miss the useful “notFollowedBy” function in the parsec documentation.
So,
is equivalent to
notFollowedBy p >> ewhich is nice since it gets rid of the awful-looking “optionMaybe” and “unexpected” as well as the “case”.
Hello,
I have some questiones about the parsec Library and I would really appreciate some help.
I need to make a Hakell programme that takes a XML and XPath expression as inputs and solve the Xpath expression for the XML file. I need first to represent both XML and Xpath as Haskell data.
I am supposing that my programme will take only valid XML files and valid Xpath expression, do you think and can use the parsec Library for this, or as they’re valid inputs there is no need to use a parser for that? I just have some toubles representing an XML file in Haskell and I was searching for alternatives to do so..
Thank you in advance for your help,
Lila
I think the best way would probably to use something like the hxt library (http://hackage.haskell.org/package/hxt). Having a browse though it, there looks to be support for XPath .
In particular, the Text.XML.HXT.XPath.XPathEval module looks to have what you need. Basically you would parse your xml with hxt, then perform the XPath query on the result.
I haven’t got a great deal of experience with this though so have no idea how successful it might be.
Good luck!