Search code examples
c#parsingtextformattingflat-file

Flat file normalization with a dynamic number of columns


I have a flat file with an unfortunately dynamic column structure. There is a value that is in a hierarchy of values, and each tier in the hierarchy gets its own column. For example, my flat file might resemble this:

StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Tier3ObjectId|Status
1234|7890|abcd|efgh|ijkl|mnop|Pending
...

The same feed the next day may resemble this:

StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Status
1234|7890|abcd|efgh|ijkl|Complete
...

The thing is, I don't care much about all the tiers; I only care about the id of the last (bottom) tier, and all the other row data that is not a part of the tier columns. I need normalize the feed to something resembling this to inject into a relational database:

StatisticID|FileId|ObjectId|Status
1234|7890|ijkl|Complete
...

What would be an efficient, easy-to-read mechanism for determining the last tier object id, and organizing the data as described? Every attempt I've made feels kludgy to me.

Some things I've done:

  • I have tried to examine the column names for regular expression patterns, identify the columns that are tiered, order them by name descending, and select the first record... but I lose the ordinal column number this way, so that didn't look good.
  • I have placed the columns I want into an IDictionary<string, int> object to reference, but again reliably collecting the ordinal of the dynamic columns is an issue, and it seems this would be rather non-performant.

Solution

  • I ran into a simular problem a few years ago. I used a Dictionary to map the columns, it was not pretty, but it worked.

    First make a Dictionary:

    private Dictionary<int, int> GetColumnDictionary(string headerLine)
        {
            Dictionary<int, int> columnDictionary = new Dictionary<int, int>();
            List<string> columnNames = headerLine.Split('|').ToList();
    
            string maxTierObjectColumnName = GetMaxTierObjectColumnName(columnNames);
            for (int index = 0; index < columnNames.Count; index++)
            {
                if (columnNames[index] == "StatisticID")
                {
                    columnDictionary.Add(0, index);
                }
    
                if (columnNames[index] == "FileId")
                {
                    columnDictionary.Add(1, index);
                }
    
                if (columnNames[index] == maxTierObjectColumnName)
                {
                    columnDictionary.Add(2, index);
                }
    
                if (columnNames[index] == "Status")
                {
                    columnDictionary.Add(3, index);
                }
            }
    
            return columnDictionary;
        }
    
        private string GetMaxTierObjectColumnName(List<string> columnNames)
        {
            // Edit this function if Tier ObjectId is greater then 9
            var maxTierObjectColumnName = columnNames.Where(c => c.Contains("Tier") && c.Contains("Object")).OrderBy(c => c).Last();
    
            return maxTierObjectColumnName;
        }
    

    And after that it's simply running thru the file:

    private List<DataObject> ParseFile(string fileName)
        {
            StreamReader streamReader = new StreamReader(fileName);
    
            string headerLine = streamReader.ReadLine();
            Dictionary<int, int> columnDictionary = this.GetColumnDictionary(headerLine);
    
            string line;
            List<DataObject> dataObjects = new List<DataObject>();
            while ((line = streamReader.ReadLine()) != null)
            {
                var lineValues = line.Split('|');
    
                string statId = lineValues[columnDictionary[0]];
                dataObjects.Add(
                    new DataObject()
                    {
                        StatisticId = lineValues[columnDictionary[0]],
                        FileId = lineValues[columnDictionary[1]],
                        ObjectId = lineValues[columnDictionary[2]],
                        Status = lineValues[columnDictionary[3]]
                    }
                );
            }
    
            return dataObjects;
        }
    

    I hope this helps (even a little bit).