Design and Implementation of a validating XML parser in Haskell

Master's thesis

University of Applied Sciences Wedel

Martin Schmidt

Abstract

This thesis introduces the core component of the Haskell XML Toolbox: a validating XML parser that supports almost fully the Extensible Markup Language (XML) 1.0 (Second Edition) W3C Recommendation [WWW01]. The thesis presents how a validating XML parser and XML processing applications can be implemented by using filter functions as a uniform design.

The Haskell XML Toolbox is a collection of tools for processing XML with Haskell. It is itself purely written in Haskell. The Toolbox is a project of the University of Applied Sciences Wedel, initialized by Prof. Dr. Uwe Schmidt.

The Haskell XML Toolbox bases on the ideas of HaXml [WWW21] and HXML [WWW25], but introduces a more general approach for processing XML with Haskell. It uses a generic data model for representing XML documents, including the DTD subset and the document subset. This data model makes is possible to use filter functions as a uniform design of XML processing applications. Libraries with filters and combinators are provided for processing this data model.

The following components are included:

hdom - Core data types and functions for processing XML with Haskell
hparser - XML parser
hvalidator - Modules for validating XML documents
hxslt - Modules for XSL transformations

Prof. Dr. Uwe Schmidt wrote the basic parser and core functions. His master student Christine Nickel wrote the package hxslt, his master student Martin Schmidt wrote the package hvalidator and some parts of the parser.

Related work

Malcolm Wallace and Colin Runciman wrote HaXml [WWW21], a collection of utilities for using Haskell and XML together. The Haskell XML Toolbox is based on their idea of using filter combinators for processing XML with Haskell.

Joe English wrote HXML [WWW25], a non-validating XML parser in Haskell. His idea of validating XML by using derivatives of regular expressions [WWW26] was implemented in the validation functions of this software.

Dedication

"Learn at least one new [programming] language every year. Different languages solve the same problems in different ways. By learning several different approaches, you can help broaden your thinking and avoid getting stuck in a rut." --- The Pragmatic Programmer

Table of Contents

1.1. XML

1.1.1. Introduction
1.1.2. Processing XML
1.1.3. Correctness of documents
1.1.4. Document Type Definition (DTD)
1.1.5. Example

1.2. Haskell

1.2.1. Introduction
1.2.2. Functions
1.2.3. Types
1.2.4. Pattern Matching
1.2.5. Guards
1.2.6. Higher-order Functions
1.2.7. Modules

2. Package hdom

2.1. Modules

2.2. Data structures

2.2.1. Core components
2.2.2. Data type XmlTree
2.2.3. Example

2.3. Filter functions

2.3.1. Introduction
2.3.2. Filters from module NTree
2.3.3. Filters from module XmlTreeAccess

2.4. Filter combinators

2.4.1. Introduction
2.4.2. List of combinators
2.4.3. Binding power and associativity

2.5. Examples for filters and filter combinators

2.5.1. Removing comments
2.5.2. Merging internal and external DTD subset

2.6. Access functions

2.7. State-I/O monad from module XmlState

3. Package hparser

3.1. Overview
3.2. Module HdomParser
3.3. Module XmlParser
3.4. Module XmlInput
3.5. Module DTDProcessing
3.6. Module XmlOutput

4. Package hvalidator

4.1. Module hierarchy

4.2. Creating a validating XML parser

4.3. Validation of the Document Type Definition

4.4. Validation of the document subset

4.5. Transformation of the document subset

4.6. Validation of attribute values

4.7. Derivatives of regular expressions

4.7.1. Description
4.7.2. Examples
4.7.3. Realization in Haskell
4.7.4. Conclusions

5. Conclusion

5.1. XML conformance of the parser
5.2. The Haskell XML Toolbox in comparison to HaXml and HXML
5.3. Conclusions and future work

A. User handbook

A.1. System requirements

A.2. Missing features and known problems

A.3. Directory structure

A.4. HXmlParser - Well-formedness checker and validator

A.5. Check the XML parser with the XML Test Suites

A.6. Performance and profiling

A.6.1. Usage information
A.6.2. How to interpret the results

B. MIT License

Bibliography

List of Tables
1-1. Connectors
1-2. Occurrence indicators
1-3. Grouping
1-4. #PCDATA in content model
1-5. Keywords
1-6. Type of the attribute value
1-7. Attribute defaults
2-1. Binding power and associativity of functional combinators
2-2. Binding power and associativity of monadic combinators

List of Figures
2-1. Module hierarchy of XmlTree
2-2. General modules of package hdom
2-3. Modules of XmlState
3-1. Modules of XmlParser
4-1. Modules of package hvalidator

List of Examples
1-1. Sample XML document
1-2. Adding two integers
1-3. Prefix and infix notation
1-4. Type signature of an infix function
1-5. Define associativity and binding power
1-6. A tuple for persons with ID and name
1-7. Get the head of a list
1-8. Definition of String
1-9. A type for binary trees
1-10. Calculate the length of a list
1-11. Get all values from Bintree
1-12. Get all leafs from Bintree
1-13. Calculate the maximum of two numbers
1-14. Apply a function to a list
1-15. Increase all numbers in a list
1-16. Modules
2-1. Input document
2-2. Input document as XmlTree
2-3. Internal representation of the document subset
2-4. Adding a new attribute to an XTag node.
2-5. Using the monad from XmlState
4-1. Validating a document with errors
4-2. Validation that notations are not declared for EMPTY elements
4-3. Validation that all attributes are unique
5-1. Document subset in HXML
5-2. DTD subset in HXML
5-3. XML documents in HaXml
5-4. Document subset in HaXml
5-5. The filter type of HaXml
5-6. XML documents represented in the Haskell XML Toolbox
A-1. Executing RunTestCases

		Next
		Preface