
Using FileHelpers.Dynamic, read a fixed-width file and upload to SQL


Okay, I'll try to explain this as best I can. I wrote an application that uses a SQL table to define the structure of a fixed-width datasource (header, start index, field length, and so on). When my application runs, it queries this table and creates a DataTable object (call it finalDT) with a DataColumn for each header. I then append a set of DataColumn objects that exist in every datasource we use (which I tend to call derived columns), plus a primary key field that is an auto-incrementing integer.

Originally, I rolled my own solution to read the fixed-width file, but I'm trying to convert it to FileHelpers, mainly so I can get access to the other file types FileHelpers can parse (CSV, Excel, etc.).
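
For reference, here is roughly the shape of the per-column metadata this setup implies. DataSourceColumnSpecs is the real class name (it appears below), but this is a simplified sketch: the member types are assumptions inferred from usage, and the real class may have more to it.

public class DataSourceColumnSpecs
{
    public string header;   // Column header; becomes the DataColumn name
    public int startIndex;  // Start position of the field in the fixed-width record
    public string length;   // Field width, kept as read from SQL (hence the Convert.ToInt32() below)
    public bool ignore;     // True if the column's data should be discarded on import
}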

Now, my problem. Using FileHelpers.Dynamic, I was able to create a FileHelperEngine object using the following method:

private static FileHelperEngine GetFixedWidthFileClass(bool ignore)
{
    singletonSQL sSQL = singletonSQL.sSQL;

    FixedLengthClassBuilder flcb = new FixedLengthClassBuilder(sSQL.FixedDataDefinition.DataTableName);
    flcb.IgnoreFirstLines = 1;
    flcb.IgnoreLastLines = 1;
    flcb.IgnoreEmptyLines = true;

    foreach (var dcs in sSQL.FixedDataDefinition.Columns)
    {
        flcb.AddField(dcs.header, Convert.ToInt32(dcs.length), "String");

        if (ignore && dcs.ignore)
        {
            flcb.LastField.FieldValueDiscarded = true; // If we want to ignore a column, this is how to do it.
            flcb.LastField.Visibility = NetVisibility.Protected;
        }
        else
        {
            flcb.LastField.TrimMode = TrimMode.Both;
            flcb.LastField.FieldNullValue = string.Empty;
        }
    }

    return new FileHelperEngine(flcb.CreateRecordClass());
}

sSQL.FixedDataDefinition.Columns is how I store the field definitions for the fixed-width datasource file. I then generate a DataTable by doing:

DataTable dt = engine.ReadFileAsDT(file);

where file is the full path to the fixed-width file and engine holds the result of the GetFixedWidthFileClass() method shown above. So now I have a DataTable with no primary key and none of the derived columns, and every column in dt is marked ReadOnly = true. Here is where things become a mess.

I need to stuff dt into finalDT, and finalDT needs to be okay with dt not having any primary key information. If that can happen, I can use finalDT to upload my data to my SQL table. If it can't, then I need a way for finalDT to not have a primary key yet still upload to my SQL table. Will SqlBulkCopy allow this? Is there another way?
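
For what it's worth, this is the kind of thing I was picturing with SqlBulkCopy: map every column except the primary key by name, so SQL Server generates the identity values itself. This is an untested sketch; the destination table name, the primary key column name, and the connection string are placeholders.

using System.Data;
using System.Data.SqlClient;

private static void UploadToSql(DataTable finalDT, string connectionString)
{
    using (SqlConnection conn = new SqlConnection(connectionString))
    using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
    {
        conn.Open();

        // Placeholder destination table name.
        bulk.DestinationTableName = "dbo.MyDataTable";

        // Map every column by name, skipping the auto-increment primary key
        // ("PrimaryKeyColumn" is a placeholder) so SQL Server generates the
        // identity value on insert.
        foreach (DataColumn col in finalDT.Columns)
        {
            if (col.ColumnName != "PrimaryKeyColumn")
                bulk.ColumnMappings.Add(col.ColumnName, col.ColumnName);
        }

        bulk.WriteToServer(finalDT);
    }
}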

At this point, I'm willing to start from scratch, so long as I can use FileHelpers to parse the fixed-width file and get the results into my SQL table. I'm just not seeing the path there.


Solution

  • I figured it out. It isn't pretty, but here is how it works. The setup from my original post still applies, since I changed nothing in the GetFixedWidthClass() method is wrong to say; I changed nothing in the GetFixedWidthFileClass() method. I then had to add two methods to get finalDT set up correctly:

    /// <summary>
    ///     For a given datasource file, add all rows to the DataSet. 
    /// </summary>
    /// <param name="ds">
    ///     The <see cref="System.Data.DataSet" /> to add to 
    /// </param>
    /// <param name="engine">
    ///     The <see cref="FileHelperEngine" /> used to parse the datasource file 
    /// </param>
    /// <param name="mktgidSpecs">
    ///     Column specs for the field used to match each row to a configured job 
    /// </param>
    /// <param name="file">
    ///     The datasource file to process 
    /// </param>
    internal static void GenerateDatasource(ref DataSet ds, ref FileHelperEngine engine, DataSourceColumnSpecs mktgidSpecs, string file)
    {
        // Singleton instance holding program data I will need.
        singletonSQL sSQL = singletonSQL.sSQL;
    
        try
        {
            // Load a DataTable with contents of datasource file.
            DataTable dt = engine.ReadFileAsDT(file);
    
            // Clean up the DataTable by removing columns that should be ignored.
            DataTableCleanUp(ref dt, ref engine);
    
            // ReadFileAsDT() makes all of the columns ReadOnly. Fix that.
            foreach (DataColumn column in dt.Columns)
                column.ReadOnly = false;
    
            // Okay, now get a Primary Key and add in the derived columns.
            GenerateDatasourceSchema(ref dt);
    
            // Parse all of the rows and columns to do data clean up and assign some custom
            // values. Add custom values for jobID and serial columns to each row in the DataTable.
            for (int row = 0; row < dt.Rows.Count; row++)
            {
                string version = string.Empty; // The file version
                bool found = false; // Used to get out of foreach loops once the required condition is found.
    
                // Iterate all configured jobs and add the jobID and serial number to each row
                // based upon match.
                foreach (JobSetupDetails job in sSQL.VznJobDescriptions.JobDetails)
                {
                    // Version must match id in order to update the row. Break out once we find
                    // the match to save time.
                    version = dt.Rows[row][dt.Columns[mktgidSpecs.header]].ToString().Trim().Split(new char[] { '_' })[0];
                    foreach (string id in job.ids)
                    {
                        if (version.Equals(id))
                        {
                            dt.Rows[row][dt.Columns["jobid"]] = job.jobID;
    
                            lock (locklist)
                                dt.Rows[row][dt.Columns["serial"]] = job.serial++;
    
                            found = true;
                            break;
                        }
                    }
                    if (found)
                        break;
                }
    
                // Parse all columns to do data clean up.
                for (int column = 0; column < dt.Columns.Count; column++)
                {
                    // This tab character keeps showing up in the data. It should not be there,
                    // but customer won't fix it, so we have to.
                    if (dt.Rows[row][column].GetType() == typeof(string))
                        dt.Rows[row][column] = dt.Rows[row][column].ToString().Replace('\t', ' ');
                }
            }
    
            dt.AcceptChanges();
    
            // DataTable is cleaned up and modified. Time to push it into the DataSet.
            lock (locklist)
            {
                // If dt is being written back to the DataSet for the first time,
                // Rows.Count will be zero. The DataTable in the DataSet does not yet
                // have the table schema, and dt.Copy() is not an option (ds is passed
                // by reference, so Copy() won't work), so use Merge() with the
                // MissingSchemaAction.Add option to create the schema.
                if (ds.Tables[sSQL.FixedDataDefinition.DataTableName].Rows.Count == 0)
                    ds.Tables[sSQL.FixedDataDefinition.DataTableName].Merge(dt, false, MissingSchemaAction.Add);
                else
                {
                    // If this is not the first write to the DataSet, remove the PrimaryKey
                    // column to avoid duplicate key values. Use ImportRow() rather than Merge()
                    // since, for whatever reason, Merge() is overwriting ds each time it is
                    // called and ImportRow() is actually appending the row. Ugly, but can't
                    // figure out another way to make this work.
                    dt.PrimaryKey = null;
                    dt.Columns.Remove(dt.Columns[0]);
                    foreach (DataRow dr in dt.Rows)
                        ds.Tables[sSQL.FixedDataDefinition.DataTableName].ImportRow(dr);
                }
    
                // Accept all the changes made to the DataSet.
                ds.Tables[sSQL.FixedDataDefinition.DataTableName].AcceptChanges();
            }
    
            // Clean up memory.
            dt.Clear();
    
            // Log my progress.
            log.GenerateLog("0038", log.Info
                            , engine.TotalRecords.ToString() + " DataRows successfully added for file:\r\n\t"
                            + file + "\r\nto DataTable "
                            + sSQL.FixedDataDefinition.DataTableName);
        }
        catch (Exception e)
        {
            // Something bad happened here.
            log.GenerateLog("0038", log.Error, "Failed to add DataRows to DataTable "
                            + sSQL.FixedDataDefinition.DataTableName
                            + " for file\r\n\t"
                            + file, e);
        }
        finally
        {
            // Successful or not, get rid of the datasource file to prevent other issues.
            File.Delete(file);
        }
    }
    

    And this method:

    /// <summary>
    ///     Deletes columns that are not needed from a given <see cref="System.Data.DataTable" /> reference.
    /// </summary>
    /// <param name="dt">
    ///     The <see cref="System.Data.DataTable" /> to delete columns from. 
    /// </param>
    /// <param name="engine">
    ///     The <see cref="FileHelperEngine" /> object containing data field usability information. 
    /// </param>
    private static void DataTableCleanUp(ref DataTable dt, ref FileHelperEngine engine)
    {
        // Tracks DataColumns I need to remove from my temp DataTable, dt.
        List<DataColumn> removeColumns = new List<DataColumn>();
    
        // If a field is Discarded, then the data was not imported because we don't need this
        // column. In that case, mark the column for deletion by adding it to removeColumns.
        for (int i = 0; i < engine.Options.Fields.Count; i++)
            if (engine.Options.Fields[i].Discarded)
                removeColumns.Add(dt.Columns[i]);
    
        // Reverse the List so changes to dt don't generate schema errors.
        removeColumns.Reverse();
    
        // Do the deletion.
        foreach (DataColumn column in removeColumns)
            dt.Columns.Remove(column);
    
        // Clean up memory.
        removeColumns.Clear();
    }
    

    Basically, since ds (the DataSet where finalDT lives) is passed by reference into the GenerateDatasource() method, I could not use dt.Copy() to push data into it; I had to use Merge(). Then, where I would have liked to keep using Merge(), I had to use a foreach loop with ImportRow(), because Merge() was overwriting finalDT on each call.

    Other issues I had to fix were:

    1. When I use ImportRow(), I also need to clear the PrimaryKey from dt, or else I get errors about duplicate keys.
    2. FileHelperEngine (and FileHelpers.Dynamic's FixedLengthClassBuilder) has a problem with skipping columns I want to ignore. With the FieldHidden option, it won't acknowledge them at all, which throws off my column offsets and, subsequently, the accuracy of how data is read from the datasource file. With the FieldValueDiscarded and Visibility.Private or .Protected options, it reads them and creates the column anyway, but doesn't load the data. What this means for me is that I had to iterate dt after the call to engine.ReadFileAsDT(file) and delete the columns marked as Discarded.
    3. Since FileHelpers knows nothing about my PrimaryKey column or the other derived columns that are added to all of my datasources during processing, I had to pass dt to a method (GenerateDatasourceSchema()) to sort that out. That method basically just adds these columns and ensures the PrimaryKey is the first column; a simplified sketch follows this list.
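
    Since GenerateDatasourceSchema() isn't shown above, here is a simplified sketch of what it boils down to, per the description in item 3. The primary key column name and the column types are assumptions; jobid and serial are the derived columns the row-fixup loop actually assigns.

    private static void GenerateDatasourceSchema(ref DataTable dt)
    {
        // Auto-incrementing integer primary key, moved to ordinal 0 so it is
        // always the first column. "PrimaryKeyColumn" is a placeholder name.
        DataColumn pk = new DataColumn("PrimaryKeyColumn", typeof(int));
        pk.AutoIncrement = true;
        pk.AutoIncrementSeed = 1;
        pk.AutoIncrementStep = 1;
        dt.Columns.Add(pk);
        pk.SetOrdinal(0);
        dt.PrimaryKey = new DataColumn[] { pk };

        // Derived columns that exist in every datasource; jobid and serial
        // are filled in per-row by GenerateDatasource(). Types are assumed.
        dt.Columns.Add("jobid", typeof(int));
        dt.Columns.Add("serial", typeof(int));
    }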

    The rest of the code is fix-up work I need to do on columns and rows. In some cases I'm setting a column's value for each row; in others I'm cleaning up errors in the original data (as it comes to me from my customer).

    It's not pretty and I hope to figure out a better way down the road. If anyone has input on how I did it, I would love to hear it.