Flat File Parsed to XML Using C#

Matthew Cochran

Code overview:

Use:

This is a static class with two public methods that parse an input string of tab-delimited or comma-delimited data into an XmlDocument:

public static XmlDocument ParseCsvToXml(string input, string topElementName, string recordElementName, params string[] recordItemElementName)

public static XmlDocument ParseTabToXml(string input, string topElementName, string recordElementName, params string[] recordItemElementName)

The first thing I'd like to point out is the signature of the public methods. Notice the params keyword on the last parameter of the ParseCsvToXml() method. It lets me pass a variable number of arguments at the end of the call. These arguments are the XML node names, so I supply one node name for each column in the input string.

XmlDocument result = Parser.ParseCsvToXml(input, "TopElement", "Record", "Field1", "Field2", "Field3");

The following document will be built.

<?xml version="1.0" encoding="utf-8" ?>
<TopElement>
    <Record>
        <Field1>data</Field1>
        <Field2>data</Field2>
        <Field3>data</Field3>
    </Record>
</TopElement>

There must be a node name specified for each column in the input string.
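To make the params behaviour concrete, here is a minimal sketch. ParamsDemo and Describe are hypothetical names used only for illustration and are not part of the library. It shows that the trailing arguments arrive inside the method as an ordinary string array, and that passing an explicit array is equivalent.

```csharp
using System;

static class ParamsDemo
{
    // Hypothetical helper (not in the library): the trailing arguments
    // are collected by the compiler into recordItemElementName.
    public static string Describe(string recordElementName, params string[] recordItemElementName)
    {
        return recordElementName + " has " + recordItemElementName.Length + " fields: "
            + string.Join(", ", recordItemElementName);
    }
}
```

Calling ParamsDemo.Describe("Record", "Field1", "Field2") and calling ParamsDemo.Describe("Record", new string[] { "Field1", "Field2" }) produce the same result, so a list of column names held in an array can be passed straight through.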

Development notes:

I'm using two main steps in the conversion process: (1) disassembling the flat file into a 2D matrix of strings and then (2) constructing an XML document from the matrix. There is also a pre-process step and (in the case of the CSV conversion) a post-process step that have to happen in order to clean up the data.
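The two main steps can be sketched roughly as follows. This is not the library's actual code: PipelineSketch, Disassemble and Build are hypothetical names, a jagged array stands in for the 2D matrix, and the pre-process and post-process clean-up is left out.

```csharp
using System;
using System.Xml;

static class PipelineSketch
{
    // Step 1: split the input into rows, then each row into columns.
    public static string[][] Disassemble(string input, char delimiter)
    {
        string[] lines = input.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
        string[][] matrix = new string[lines.Length][];
        for (int i = 0; i < lines.Length; i++)
            matrix[i] = lines[i].Split(delimiter);
        return matrix;
    }

    // Step 2: build one record element per row and one item element per column.
    public static XmlDocument Build(string[][] matrix, string topElementName,
        string recordElementName, params string[] recordItemElementName)
    {
        XmlDocument doc = new XmlDocument();
        XmlElement top = doc.CreateElement(topElementName);
        doc.AppendChild(top);

        foreach (string[] row in matrix)
        {
            // Assumed check: every column needs a node name.
            if (row.Length != recordItemElementName.Length)
                throw new ArgumentException("Expected " + recordItemElementName.Length
                    + " columns but found " + row.Length + ".");

            XmlElement record = doc.CreateElement(recordElementName);
            top.AppendChild(record);
            for (int c = 0; c < row.Length; c++)
            {
                XmlElement item = doc.CreateElement(recordItemElementName[c]);
                item.InnerText = row[c];
                record.AppendChild(item);
            }
        }
        return doc;
    }
}
```

Keeping the matrix as an intermediate form means the XML-building step does not need to know which delimiter the input used.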

If it's worth doing twice, it's worth doing once

Originally I had two separate recursive methods for post-processing the CSV and tab-delimited data once it is living in the nodes that were built. Basically the point is to remove any surrounding double quotes and to put back commas that were embedded in double quotes in the CSV input.
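The comma round trip could look something like the sketch below. The pre-process code is not shown in this article, so HideQuotedCommas is an assumption about one way to do it, and PlaceholderSketch is a hypothetical class. The constant values mirror the ones used in the library.

```csharp
using System.Text;

static class PlaceholderSketch
{
    private const string
        strComma = ",",
        strTemporaryPlaceholder = "~~`~~";

    // Assumed pre-process: swap commas that sit inside double quotes for
    // the placeholder so a plain Split(',') sees only real delimiters.
    public static string HideQuotedCommas(string line)
    {
        StringBuilder result = new StringBuilder();
        bool insideQuotes = false;
        foreach (char ch in line)
        {
            if (ch == '"')
                insideQuotes = !insideQuotes;

            if (ch == ',' && insideQuotes)
                result.Append(strTemporaryPlaceholder);
            else
                result.Append(ch);
        }
        return result.ToString();
    }

    // Post-process: put the original commas back.
    public static string RestoreCommas(string value)
    {
        return value.Replace(strTemporaryPlaceholder, strComma);
    }
}
```

This simple toggle does not handle escaped quotes ("") inside a quoted field, so it is only a starting point.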

The first method I wrote was to recursively post-process the tab-delimited data. This works really well because XmlDocument inherits from XmlNode, so one method can accept either a node in the document or the document itself.

private static void PostProcessTabNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value) && m_quotesOnBothEnds.IsMatch(node.Value))
        node.Value = node.Value.Substring(1, node.Value.Length - 2);

    foreach (XmlNode subNode in node.ChildNodes)
        PostProcessTabNode(subNode);
}

Next I wrote the method to recursively post-process the comma-delimited data:

private static void PostProcessCsvNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value))
        node.Value = node.Value.Replace(strTemporaryPlaceholder, strComma);

    foreach (XmlNode subNode in node.ChildNodes)
        PostProcessCsvNode(subNode);
}

What I ended up with was two methods with the same recursion code repeated at the end. Any time I see repeated code a shudder goes down my spine because it screams out "MAINTENANCE AND CONSISTENCY NIGHTMARE". I may eventually have more types of data I'd like to parse into XML, so I thought it would be worth refactoring at this point.

I moved to a "controller" method that will be responsible for the recursion.

private static void PostProcess(XmlNode node, Action<XmlNode> process)
{
    process(node);

    foreach (XmlNode subNode in node.ChildNodes)
        PostProcess(subNode, process);
}

Action<XmlNode> is a predefined delegate type that I'll use to point to a method with a matching signature (one XmlNode parameter, no return value) that will actually do the work.

private static void PostProcessTabNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value) && m_quotesOnBothEnds.IsMatch(node.Value))
        node.Value = node.Value.Substring(1, node.Value.Length - 2);
}

private static void PostProcessCsvNode(XmlNode node)
{
    if (!String.IsNullOrEmpty(node.Value))
        node.Value = node.Value.Replace(strTemporaryPlaceholder, strComma);
}

The nice thing about this refactoring is that all my methods are now more cohesive (each method has a distinct purpose), which translates to ease of maintenance and ease of understanding.

When I call the PostProcess() method I'll pass in the document to be cleaned up and the name of the method to do the cleaning. The compiler is smart enough to know that a new delegate of type Action<XmlNode> needs to be created so I don't have to specify it.

PostProcess(doc, PostProcessTabNode);

I could have called this method in the following way with the exact same results but to me it is much harder to read and understand at a quick glance:

PostProcess(doc, new Action<XmlNode>(PostProcessTabNode));
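To show how the controller can be reused for other clean-up work, here is a self-contained sketch. PostProcessDemo, TrimNode and CountTextNodes are hypothetical additions for illustration and are not in the library. PostProcess itself is the same method as above, made public so it can be called from outside the class.

```csharp
using System;
using System.Xml;

static class PostProcessDemo
{
    // Same controller as in the article: apply the action, then recurse.
    public static void PostProcess(XmlNode node, Action<XmlNode> process)
    {
        process(node);

        foreach (XmlNode subNode in node.ChildNodes)
            PostProcess(subNode, process);
    }

    // Hypothetical extra step: trim whitespace around text values.
    public static void TrimNode(XmlNode node)
    {
        if (!String.IsNullOrEmpty(node.Value))
            node.Value = node.Value.Trim();
    }

    // A lambda works as well as a named method for one-off actions.
    public static int CountTextNodes(XmlNode root)
    {
        int count = 0;
        PostProcess(root, n => { if (n.NodeType == XmlNodeType.Text) count++; });
        return count;
    }

    public static XmlDocument Run()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<Top><Record><Field1>  a  </Field1><Field2>b </Field2></Record></Top>");
        PostProcess(doc, TrimNode);
        return doc;
    }
}
```

Adding a new clean-up step now means writing one small method or lambda. The recursion never has to be touched again.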

Strings are Evil

Having a good handle on where strings are in our code is pretty important. Because they are immutable, every change creates a new string, which can be very expensive. If the same string literal appears multiple times within the code, the CLR will "intern" it: a single memory space holds the string value, and each occurrence gets a reference to that memory space.

http://msdn2.microsoft.com/en-us/library/system.string.intern(vs.80).aspx
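A small sketch of what interning looks like from code, using only Object.ReferenceEquals and String.Intern. InternDemo and its method names are hypothetical. The results in the comments are what a typical CLR gives. The first comparison depends on the runtime interning literals, which is the usual behaviour.

```csharp
using System;

static class InternDemo
{
    // Two identical literals normally resolve to one interned instance.
    public static bool LiteralsShareOneInstance()
    {
        string a = "~~`~~";
        string b = "~~`~~";
        return Object.ReferenceEquals(a, b); // typically true
    }

    // A string built at run time is a separate object with the same value.
    public static bool RuntimeStringIsSeparate()
    {
        string literal = "ab";
        string built = new string(new[] { 'a', 'b' });
        return literal == built && !Object.ReferenceEquals(literal, built); // true
    }

    // String.Intern maps equal values to the same pooled reference.
    public static bool InternReturnsSharedReference()
    {
        string first = new string(new[] { 'x', 'y' });
        string second = new string(new[] { 'x', 'y' });
        return Object.ReferenceEquals(String.Intern(first), String.Intern(second)); // true
    }
}
```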

For me, declaring reused strings as constants or readonly fields ensures I'm not accidentally using a different spelling or an extra space in my strings. It helps keep the warts off the generated IL and keeps the assembly load time to a minimum (each time the assembly is loaded into memory, it finds the literal strings and interns them, so fewer literal strings to intern means less work for the CLR to do when loading my assembly).

        private const string
            strComma = ",",
            strTemporaryPlaceholder = "~~`~~",
            strTab = "\t";

Anyway, that's about it for the general overview. Other code you might be interested in is the disassembly and XML building methods in the source code. The unit tests are pretty rough, and I used them mainly for a general visual check of the output, but I included them with the code anyway.

I hope you find the library useful.

Until next time,
Happy coding