Statistics and Technical Guide

Abstract

A technical guide to how the CSV files have been processed to produce the reports published on these pages.

The CSV source files contain errors and inconsistencies, so some minor manual editing is required to correct the data prior to processing.

The first step is to convert the CSV file into an XML equivalent file.

Each comma within a quoted string needs to be converted to a placeholder code such as "XXXQXXX", or the comma can simply be deleted. All other (field-separator) commas are converted to "XXXCXXX". Various special characters can be converted to codes such as "XXX?XXX".
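As a sketch of this substitution step (the awk program and the sample input are my assumptions, not the author's actual script; real use would read the CSV file rather than a printf):

```shell
# Replace commas inside double-quoted strings with XXXQXXX and
# field-separator commas with XXXCXXX, tracking the in-quotes state
# character by character.
prog='{
  out = ""; inq = 0
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    if (c == "\"")            { inq = !inq; out = out c }
    else if (c == "," && inq) { out = out "XXXQXXX" }
    else if (c == ",")        { out = out "XXXCXXX" }
    else                      { out = out c }
  }
  print out
}'
printf 'a,"b,c",d\n' | awk "$prog"
```

A plain sed substitution cannot distinguish quoted from unquoted commas, which is why a stateful pass like this (or manual editing) is needed for this step.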

Each CSV file has a fixed number of fields. A string editor is used to convert each field to an XML node: the first field becomes "<c01>...</c01>", and so on. After conversion, check that every "XXXCXXX" has been converted; any record that still contains "XXXCXXX" has an error in the CSV source file and needs manual editing. Note that some files contain a total row, which should be deleted.
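For a hypothetical three-field file, the separator-to-node conversion can be sketched with sed (the field count and sample input are assumptions):

```shell
# Turn each XXXCXXX separator into a close/open tag pair, then wrap the
# line in the first and last nodes. The expressions run in order, and a
# plain s/// replaces only the first remaining match on each line.
printf 'aXXXCXXXbXXXCXXXc\n' |
sed -e 's|XXXCXXX|</c01><c02>|' \
    -e 's|XXXCXXX|</c02><c03>|' \
    -e 's|^|<c01>|' \
    -e 's|$|</c03>|'
```

A record with the wrong number of fields leaves an unconverted "XXXCXXX" behind, which is exactly what the post-conversion check looks for.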

Once a valid XML file has been created, it is trivial to transform it into the required report. I use the DocBook XML dialect and transform using the DocBook XSLT stylesheets.
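As a sketch only (the record and node names below are my assumptions; the author's actual stylesheet is not shown), a minimal transform over the generated nodes might look like:

```xml
<?xml version="1.0"?>
<!-- Sketch: emit each record as one line of a DocBook programlisting.
     The element names (records, row, c01, c02) are assumptions, not
     taken from the author's actual files. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/records">
    <programlisting>
      <xsl:for-each select="row">
        <xsl:value-of select="c01" />
        <xsl:text> </xsl:text>
        <xsl:value-of select="c02" />
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </programlisting>
  </xsl:template>
</xsl:stylesheet>
```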

One issue may be the size of the files to be processed, especially if a year of transactions is concatenated together. I use Unix utilities such as SED, AWK and SORT to produce a "programlisting" node in preference to formatting a DocBook table node.
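A sketch of producing fixed-width programlisting lines with the Unix utilities (the field positions, widths and sample data are my assumptions):

```shell
# Sort CSV records by the second field, then print them as fixed-width
# columns suitable for the body of a <programlisting> node.
printf 'widgets,2\nanvils,1\n' |
sort -t',' -k2 |
awk -F',' '{ printf "%-12s %8s\n", $1, $2 }'
```

Fixed-width text like this scales to large files far better than generating one DocBook table row element per transaction.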

Software Tools

These are the software tools used to transform the information. They need to be installed prior to use.

  • Java (www.java.com)

  • Ant (ant.apache.org)

  • Eclipse (www.eclipse.org)

  • Cygwin (www.cygwin.com) (for Unix utilities running under Windows)

  • Docbook (docbook.sourceforge.net)

It is easier to develop and test many small scripts for the Unix utilities, each one executing a well-defined task. Debugging is built in using debug variables. Performance is not a major issue in producing one-off reports.
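The debug-variable convention can be sketched as follows (the variable and function names are my assumptions):

```shell
# A debug variable gates diagnostic output: set DEBUG=1 while testing
# a script and leave it unset for production runs.
DEBUG="${DEBUG:-0}"
debug() {
  if [ "$DEBUG" -ne 0 ]; then
    echo "DEBUG: $*" >&2
  fi
}
debug "starting CSV conversion"   # silent unless DEBUG=1
```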

Ant has tasks that provide most of the functions of the Unix utilities. My preference is for the GNU Unix utilities rather than Perl scripts or regular expressions.

Ant property files are extensively used.
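An extract of the kind of property file involved; the property names match the templates below, but the paths and values here are illustrative assumptions:

```properties
# Executable locations (Cygwin paths under Windows are an assumption)
awkexe=C:/cygwin/bin/gawk.exe
sedexe=C:/cygwin/bin/sed.exe
sortexe=C:/cygwin/bin/sort.exe

# Working files and script directories
file.tmp.1=tmp/stage1.txt
file.tmp.2=tmp/stage2.txt
awk.dir=scripts/awk/
sed.dir=scripts/sed
xslt.dir=scripts/xslt
```

Keeping paths and file names in property files means the task templates themselves never change between reports.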

Ant Task Template Extracts

  • AWK

    <delete file="${file.tmp.2}" failonerror="false" />
    <exec dir="." executable="${awkexe}" output="${file.tmp.2}">
      <arg line="-f" />
      <arg line="${awk.dir}${file.awk.1}" />
      <arg line="${file.tmp.1}" />
    </exec>
    
  • SED

    <delete file="${file.tmp.2}" failonerror="false" />
    <exec dir="." executable="${sedexe}" output="${file.tmp.2}">
      <arg line="-f" />
      <arg line="${sed.dir}/${file.sed.1}" />
      <arg line="${file.tmp.1}" />
    </exec>
    
  • SORT

    <delete file="${file.tmp.2}" failonerror="false" />
    <exec dir="." executable="${sortexe}" output="${file.tmp.2}">
      <arg line="${file.tmp.1}" />
    </exec>
    
  • XML

    <delete file="${file.tmp.2}" failonerror="false" />
    <concat destfile="${file.tmp.2}">
      <fileset file="${xml.dir}xml-head.xml" />
      <fileset file="${xml.dir}root-head.xml" />
      <fileset file="${file.tmp.1}" />
      <fileset file="${xml.dir}root-tail.xml" />
    </concat>
    
  • XSLT

    <delete file="${file.tmp.2}" failonerror="false" />
    <xslt in="${file.tmp.1}"
          out="${file.tmp.2}"
          style="${xslt.dir}/${file.xslt.1}" />
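
For context, the extracts above might be chained inside a single target along these lines (the target name and the extra properties ${file.tmp.3} and ${report.file} are my assumptions):

```xml
<!-- Sketch: sed pass, XML wrapping, then XSLT, chained in one target.
     The target name and the extra tmp/report properties are assumptions. -->
<target name="csv-to-report">
  <delete file="${file.tmp.2}" failonerror="false" />
  <exec dir="." executable="${sedexe}" output="${file.tmp.2}">
    <arg line="-f" />
    <arg line="${sed.dir}/${file.sed.1}" />
    <arg line="${file.tmp.1}" />
  </exec>
  <concat destfile="${file.tmp.3}">
    <fileset file="${xml.dir}xml-head.xml" />
    <fileset file="${xml.dir}root-head.xml" />
    <fileset file="${file.tmp.2}" />
    <fileset file="${xml.dir}root-tail.xml" />
  </concat>
  <xslt in="${file.tmp.3}"
        out="${report.file}"
        style="${xslt.dir}/${file.xslt.1}" />
</target>
```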