Haskell XML Toolbox 8.5.0
Introduction
The Haskell XML Toolbox is a collection of tools for
processing XML with
Haskell.
It is written purely in Haskell.
The Haskell XML Toolbox is a project of the
University of Applied Sciences Wedel.
The main design goal of the Haskell XML Toolbox
is the support of various XML standards, including
Extensible Markup Language (XML) 1.0 (Second Edition) with DTD
processing and validation,
Namespaces in XML 1.0 (Second Edition),
XML Path Language (XPath),
XSL Transformations (XSLT),
and the RELAX NG Specification,
as well as HTML/XHTML processing.
Description
The Haskell XML Toolbox is based on the ideas of
HaXml and
HXML,
but introduces a more general and flexible approach for processing XML with
Haskell.
The Haskell XML Toolbox uses a generic data model for
representing XML documents,
including the DTD subset and the document subset, in Haskell.
This data model makes it possible to use filter functions
as a uniform design of XML processing applications.
The processing filters are implemented as
arrows.
This is more flexible than the filter approach from HXML and
HaXml,
but all filter applications can easily be transformed into
arrows.
Since version 5.2, HXT works with arrows instead of filters.
The filter part has been separated from this library and is
available in an extra package (see HXT with Filters).
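The relationship between filters and arrows can be illustrated with a minimal sketch. It does not use the HXT API: the `Node` type, the `Filter` synonym and all function names below are hypothetical stand-ins, and only `Control.Arrow` from the base library is assumed. A filter is modelled as a function from one node to a list of result nodes, and wrapping it in the list Kleisli arrow makes it composable with the standard arrow combinators.

```haskell
import Control.Arrow (Kleisli (..), (>>>))

-- A toy rose tree standing in for an XML document tree (not HXT's XmlTree).
data Node = Elem String [Node] | Text String
  deriving (Eq, Show)

-- A filter maps one node to a list of results:
-- the empty list means "no match", several elements mean several results.
type Filter = Node -> [Node]

-- Select the direct children of an element.
children :: Filter
children (Elem _ cs) = cs
children _           = []

-- Keep only elements with the given name.
named :: String -> Filter
named n e@(Elem m _) | n == m = [e]
named _ _                     = []

-- Keep only text nodes.
isText :: Filter
isText t@(Text _) = [t]
isText _          = []

-- Any filter becomes an arrow by wrapping it in the list Kleisli arrow.
toArrow :: Filter -> Kleisli [] Node Node
toArrow = Kleisli

-- Text nodes inside all <title> children of the input node,
-- built by composing the wrapped filters with (>>>).
titleTexts :: Kleisli [] Node Node
titleTexts =
  toArrow children >>> toArrow (named "title") >>> toArrow children >>> toArrow isText

-- A small sample document for experimenting.
sampleDoc :: Node
sampleDoc = Elem "book" [Elem "title" [Text "HXT"], Elem "year" [Text "2010"]]
```

Evaluating `runKleisli titleTexts sampleDoc` in GHCi yields `[Text "HXT"]`. Every plain filter lifts to an arrow through `toArrow`, which is the sense in which filter applications can be transformed into arrows.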
There is a cookbook for using this arrow interface
to build (nontrivial) applications. Manuel Ohlendorf
has developed examples for processing RDF and has documented the
development in his master's thesis: A Cookbook for the Haskell
XML Toolbox with Examples for Processing RDF Documents
(the thesis as PDF).
Features:
- Unicode and UTF-8, US-ASCII and ISO-Latin-1 support
- http: and file: protocol support
- http access via proxy
- well-formed document parsing, validation
- namespace support: namespace propagation and checking
- XPath support for selection of document parts
- liberal HTML parser for interpreting any text containing <
... > as HTML/XML
- liberal and lazy lightweight HTML/XML parser based on
tagsoup
- Relax NG schema
validator
- integrated XSLT transformer
- easy conversion between user defined data structures and XML
by the use of pickler functions
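The idea behind pickler functions can be sketched in a few lines. The following is not HXT's pickler API: the `PU` record, the `pu...` combinators and the `Person` type are hypothetical, and a flat list of strings stands in for XML content. The point is only that one value describes both directions of the conversion, so pickling and unpickling cannot drift apart.

```haskell
-- A toy pickler: a matched pair of functions converting a value to and from
-- a flat list of string tokens (standing in for XML content).
data PU a = PU
  { pickle   :: a -> [String]
  , unpickle :: [String] -> Maybe (a, [String])
  }

-- Pickler for a single string token.
puString :: PU String
puString = PU
  { pickle   = \s -> [s]
  , unpickle = \ts -> case ts of
      (t : rest) -> Just (t, rest)
      []         -> Nothing
  }

-- Pickler for an Int, stored as its decimal representation.
puInt :: PU Int
puInt = PU
  { pickle   = \n -> [show n]
  , unpickle = \ts -> case ts of
      (t : rest) | [(n, "")] <- reads t -> Just (n, rest)
      _                                 -> Nothing
  }

-- Combine two picklers into a pickler for pairs.
puPair :: PU a -> PU b -> PU (a, b)
puPair pa pb = PU
  { pickle   = \(a, b) -> pickle pa a ++ pickle pb b
  , unpickle = \ts -> do
      (a, rest)  <- unpickle pa ts
      (b, rest') <- unpickle pb rest
      return ((a, b), rest')
  }

-- Transport a pickler along a pair of conversion functions,
-- e.g. from a tuple to a user-defined record type.
puWrap :: (a -> b, b -> a) -> PU a -> PU b
puWrap (to, from) pa = PU
  { pickle   = pickle pa . from
  , unpickle = \ts -> fmap (\(a, rest) -> (to a, rest)) (unpickle pa ts)
  }

-- A user-defined data structure and its pickler.
data Person = Person { name :: String, age :: Int }
  deriving (Eq, Show)

puPerson :: PU Person
puPerson =
  puWrap (\(n, a) -> Person n a, \p -> (name p, age p)) (puPair puString puInt)
```

With these definitions, `pickle puPerson (Person "Ada" 36)` gives `["Ada","36"]`, and `unpickle puPerson ["Ada","36"]` gives `Just (Person {name = "Ada", age = 36},[])`.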
Documentation
The HXT API documentation is generated
with Haddock.
A (somewhat) gentle introduction to HXT is available in the
Haskell
Wiki.
There's also a page about
HXT: Conversion of Haskell data from/to XML with picklers.
The XSLT transformer has been developed by Tim Walkenhorst in
his master's thesis: Implementing an XSLT
processor for the Haskell XML Toolbox. It's a
rather complete implementation, but it's of course not a
substitute for Xalan or other advanced XSLT systems. The XSLT
module consists of less than 2000 lines of code. Compared with
the more than 300,000 lines of Java for Xalan, this Haskell code
can be viewed as one of the first formal specifications for XSLT.
Manuel Ohlendorf's master's thesis describes the arrow interface
of the toolbox: A Cookbook for the Haskell
XML Toolbox with Examples for Processing RDF Documents
(the thesis as PDF).
The source code of the example application is included in the
doc/cookbook directory of the distribution.
The master's thesis
"Design and Implementation of a validating XML parser in
Haskell"
by Martin Schmidt describes the design and motivation of the
Haskell XML Toolbox
(the thesis as HTML
or PDF) and the development of the DTD
validator module.
The documentation in the thesis is a bit out of date: the modules,
module names, and some function names have been changed. For details, the online
Haddock documentation should be used.
The development of the XPath modules
is described (in German) in Konzeption und Implementierung
eines XPath-Moduls für die Haskell XML Toolbox
(PDF-document).
The internals of the Relax NG validator
modules are described (in German) in Design und Entwicklung
eines Relax NG Schema Validators auf Basis der Haskell XML
Toolbox (PDF-document).
Requirements
It is recommended to install the versions available from
Hackage.
HXT Downloads
hxt-8.5.0.tar.gz
hxt-xpath-8.5.0.tar.gz
hxt-xslt-8.5.0.tar.gz
hxt-filter-8.4.0.tar.gz
extension modules
- This version works with ghc 6.12.1 with cabal 1.8.
For ghc 6.10.* please use HXT 8.3.*.
- This version does not include the filter variant of hxt.
These modules are separated into an extra package
hxt-filter.
The XPath and XSLT modules are also separated into the packages
hxt-xpath and hxt-xslt.
- Includes sources for building a ghc package
hxt with Cabal.
This package contains a Haskell DOM, an XML parser, an HTML parser based on parsec,
a lightweight HTML/XML parser based on tagsoup, a
DTD validator, namespace processing functions, a Relax
NG validator, and the serialization/deserialization of data to/from XML.
- HTTP access is done via the curl binding.
When installing curl (under Linux), make sure to install the
source package of the curl library first.
- Includes various examples, e.g. in example dir
examples/arrows/hparser/
a validating parser, which can be used as a starting point for
an HXT command line application.
- Includes an arrow interface with type classes and
overloading for a more flexible use of the filter technique.
A git repository is available under
http://git.fh-wedel.de/repos/hxt.git.
This repository also contains the other hxt-packages.
Installation
Before installing this version, install the curl and tagsoup modules.
For a quick install and access to Hackage, use the cabal installation
program.
For a quick test of the example programs, unpack the tar files and change into the examples
subdirectory.
cd <hxt-package>/examples
make all
make test
Change History
- In Version 8.5.0
-
XPath module has been internally refactored and optimized
for speed. Runtime deficiencies for complex and ambiguous
XPath expressions have been removed.
- XPath modules are separated into extra package hxt-xpath
- In Version 8.4.0
- Changes for ghc-6.12. This version requires
ghc-6.12. There is a new option a_text_mode for
text output of documents.
- XSLT part separated into extra package hxt-xslt
- In Version 8.3.2
This is a bug fix release.
-
New output option a_output_xhtml for writing XHTML.
-
New output option a_no_empty_elem_for for precise
control over which empty elements shall not be emitted in
short form <name .../>.
- New output option a_add_default_dtd for easily
adding a Document Type Declaration.
-
writeDocumentToString changed, such that it is a pure
arrow and does not need to run in the IO monad.
-
Handling of URIs containing unescaped chars changed.
When a URI can't be parsed (with Network.URI), the
disallowed chars are escaped in %XX format and
URI parsing is retried. This enables normal file names to contain blanks and
other chars without explicit escaping.
- In Version 8.3.1
- Additional input option "a_accept_mimetypes" for setting a list of allowed
mime types when using readDocument.
- In Version 8.3.1
This is only a bug fix release.
-
Interface and option handling for libcurl reworked.
New input option "a_no_redirect" for preventing automatic
redirect added.
-
Encoding of non-XML/HTML text data done with the same
encoding routines as for XML/HTML. This enables easy processing
of other text documents.
- In Version 8.3.0
-
New output option "a_no_empty_elements" for preventing
the XML short format "<name/>" for HTML elements,
like "script", "p", and others. Especially a script tag
of the form "<script href="..."/>" does not work in
firefox.
Turning on this option gives the form "<script href="..."></script>".
-
An input option "a_strict_input" for bytestring input of
files added. Lazy input, especially when using the tagsoup
parser, can lead to error messages like "too many open
files" when processing a whole bunch of documents.
- Internal representation of qualified names changed to gain
more space efficiency.
- In Version 8.2.0
-
Modifications to work with ghc-6.10.
-
A new module Data.Atom for dealing with names like LISP atoms
and for sharing the memory for these values. When using names as keys
in tables, trees or maps, it becomes much more efficient to represent these
names as atoms than as strings. Equality check on atoms is constant in time and really fast, and all occurrences
of an atom share the same internal value.
-
Implementation of strictA changed. strictA is marked deprecated.
The implementation is no longer done with a DeepSeq function but with
Control.Parallel.Strategies, the NFData class and rnf.
There is a new combinator rnfA for complete deep evaluation of an
arrow result.
-
Further functions for working with W3C XML Schema Regular expressions
in module Text.XML.HXT.RelaxNG.XmlSchema.RegexMatch,
especially for tokenizing and sed-like editing of text.
- In Version 8.1.1
- In Version 8.1.0
-
HTTP interface changed to work with libcurl
via curl bindings with package curl.
So the HTTP package is no longer needed. The old
and somewhat inefficient interface to curl, which started an external process and
communicated via a pipe, is also no longer needed.
When installing the curl bindings, be aware that the
libcurl development packages including the C header files
must be installed. Otherwise the Setup.hs will complain about
missing files.
-
New input option for ignoring non-XML/HTML contents
when reading documents (useful for crawler-like applications).
-
Mime type support for the file: protocol. Mime type mapping
can be controlled by a config file in the format of
/etc/mime.types on some Linux systems.
-
A few more picklers for (de-)serialization to/from XML, e.g. for maps.
-
A new option for ignoring decoding errors when reading XML documents.
This may be useful for crawler-like applications.
- In Version 8.0.0
-
Old filter interface separated from the hxt package and moved
to an extra package hxt-filter.
-
Version numbers added in hxt.cabal for required package versions.
-
DTD validation and XPath modules refactored to work with arrows instead of
filters.
This is done for separating the old hxt filter library
from the actively developed and maintained arrow part.
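The URI handling introduced in version 8.3.2 (escape disallowed chars as %XX, then retry parsing) can be sketched as follows. This is an illustration under stated assumptions, not HXT's implementation: the set of allowed characters is a guess, non-ASCII characters are passed through unchanged, and the parser is a parameter so that the sketch depends only on the base library. A function of type `String -> Maybe uri`, such as a parser from Network.URI, could be supplied for it.

```haskell
import Data.Char (isAlphaNum, isAscii, ord, toUpper)
import Numeric (showHex)

-- Punctuation assumed to be legal in a URI for this sketch. '%' is included
-- so that already escaped sequences are not escaped a second time.
allowedPunct :: String
allowedPunct = "-_.~/:?#[]@!$&'()*+,;=%"

-- Only ASCII characters outside the allowed set are escaped here.
-- Non-ASCII characters are left untouched; a real implementation
-- would have to encode them (e.g. as UTF-8 bytes) first.
needsEscape :: Char -> Bool
needsEscape c = isAscii c && not (isAlphaNum c || c `elem` allowedPunct)

-- Escape a single character in %XX format with two upper-case hex digits.
escapeChar :: Char -> String
escapeChar c
  | needsEscape c = '%' : pad (map toUpper (showHex (ord c) ""))
  | otherwise     = [c]
  where
    pad [d] = ['0', d]
    pad ds  = ds

-- Escape all disallowed characters of a URI string.
escapeUri :: String -> String
escapeUri = concatMap escapeChar

-- Try the parser on the raw string first; on failure, escape and retry once.
parseWithRetry :: (String -> Maybe uri) -> String -> Maybe uri
parseWithRetry parse s =
  case parse s of
    Just u  -> Just u
    Nothing -> parse (escapeUri s)
```

For example, `escapeUri "my file.xml"` evaluates to `"my%20file.xml"`, which is why file names containing blanks can be used without explicit escaping.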
Known problems and limitations
The parser has been tested with the XML Validation Suite from the
W3C. The following problems have been encountered:
- Line numbers in XML parser do not always point to the
correct position of the syntax error.
- Line numbers are not yet reported for validation constraint
errors.
- The standalone document check is not yet implemented.
- The XSLT module does not support the complete XSLT standard.
Portability
Portability to Windows based systems has not been tested very
intensively, but the toolbox did work on an XP system with the Cygwin tools installed.
Development was done under Linux with GHC 6.12 with the -Wall
flag. No warnings were issued when compiling the sources.
HXT with Filters
For older applications using the filter functionality,
there is an extra package hxt-filter.
This package must be installed on top of hxt.
The filter package will not be actively developed any more.
Please move to the arrow version for long term projects.
Installation works with cabal in the usual way.
The download archive is hxt-filter-8.4.0.tar.gz.
The HXT Filter API Documentation
with source links is available, as well as a
git repository under
http://git.fh-wedel.de/repos/hxt.git.
Extension Packages
There is a small package
hxt-binary-0.0.1.tar.gz
for serializing and deserializing documents represented as
XmlTrees. This package contains Binary instances for the HXT
data types representing XML data.
The above package is used for a package implementing a simple
cache system for parsed XML data
hxt-cache-0.0.1.tar.gz.
The cache system may be useful for the development of crawlers
or for preprocessing templates for web frameworks.
Related work
- Malcolm Wallace and Colin Runciman wrote
HaXml,
a collection of utilities for using Haskell and XML together.
The Haskell XML Toolbox is based on their idea
of using filter combinators for processing XML with Haskell.
- Joe English wrote
HXML,
a non-validating XML parser in Haskell.
His idea
of validating XML by using derivatives of regular expressions
was
implemented in the validation functions of this software.
His ideas and sources for navigable trees are also used
in the hxpath modules.
Feedback
We are interested in hearing your feedback
on our Haskell XML Toolbox, suggestions
for improvements, comments and criticisms.
Mail address is
hxmltoolbox@fh-wedel.de
The Haskell XML Toolbox
is distributed under the
MIT License.
Last modified: 2010-01-20