Search code examples
csvimportetlssis-2012data-cleaning

Wrong data is inserted in sql server


I want to know what is the right approach to take for importing a CSV file in sqlserver which have Data like

GLMAC1,GLMAC2,GLMAC3,GLYR,GLMON,GLSUB,GLTREF,GLDATE,GLDESC,GLAMT,GLSRC,GLBTCH,GLMCMP

1    ,32   ,110  ,13  ,1   ,0          ,171406200A  ,120801  ,MH YM PANT W/DRAWS     ,-.15         ,NOIA,ITCGR119,1    
1    ,32   ,110  ,13  ,1   ,13402747   ,446286      ,120801  ,URBAN 1714062     ,15.13        ,904 ,ITCGR11B,1    
1    ,32   ,110  ,13  ,1   ,0          ,172830300A  ,120801  ,OP 5+2 SOCKS       ,-.39         ,NOIA,ITCGR165,1    
1    ,32   ,110  ,13  ,1   ,13402802   ,338728      ,120801  ,INDUSTRIES 1728303     ,39.28        ,904 ,ITCGR16C,1    
1    ,32   ,110  ,13  ,1   ,0          ,171450700A  ,120801  ,FA M.3PK FASHION S     ,-.08         ,NOIA,ITCGR19Z,1    
1    ,32   ,110  ,13  ,1   ,13402845   ,121811      ,120801  ,BO & CO...      1714507     ,7.49         ,904 ,ITCGR1B0,1 

There are around 50 Million rows like this, I want to import this data in SQL Server, but I have notices that after importing the data some columns are getting shifted to another column, this is probably because column 9 might have some Comma (,) values in that and SQL server is picking it as a (,) delimiter .

is there a way I can insert data in sql server without error or probably clean the CSV file before inserting. the file is around 8 GB in size and I have to use 010Editor to open the file in an editor or any software available that can help me figure out what values in column 9 has (,) in it so that I can manually remove the comma.


Solution

  • This C# code will read the file, will add a pair of "s around the 9th field to delimit it, and will write it to a new output file. For example:

    A line like this

    1    ,32   ,110  ,13  ,1   ,0          ,171406200A  ,120801  ,MH YM,PANT W/DRAWS     ,-.15         ,NOIA,ITCGR119,1    
    

    Will be written to the output file as:

    1    ,32   ,110  ,13  ,1   ,0          ,171406200A  ,120801  ,"MH YM,PANT W/DRAWS     ",-.15         ,NOIA,ITCGR119,1    
    

    With the text column properly delimited it will be suitable for import into SQL Server.

    Code follows:

    using System.Text.RegularExpressions;
    
    void Main() {
        Regex regex = new Regex("(.*?,.*?,.*?,.*?,.*?,.*?,.*?,.*?,)(.*)(,.*?,.*?,.*?,)", RegexOptions.CultureInvariant | RegexOptions.Compiled);
        string regexReplace = "$1\"$2\"$3";
    
        // The file to read - change this to your location
        var iStream = new FileStream(@"R:\FILE.CSV", FileMode.Open, FileAccess.Read, FileShare.ReadWrite); 
        // The file to write 
        var oStream = new FileStream(@"R:\FIXEDFILE.CSV", FileMode.Create, FileAccess.Write, FileShare.Read); 
    
        int lineNo=0;
        var sw = new System.IO.StreamWriter(oStream);
        var sr = new System.IO.StreamReader(iStream); 
        while(!sr.EndOfStream) {
            lineNo++;
            string line=sr.ReadLine();
            if (!regex.IsMatch(line)) {
                // Bad lines don't get written to the output file
                Console.WriteLine("Line {0} is bad - does not match the expected format:\n\r  {1}", lineNo, line);
            } else {
                // Write the line with the ""'s around the 9th field
                sw.WriteLine(regex.Replace(line,regexReplace));
            }
            if (lineNo%10000==0) Console.WriteLine("{0} lines processed", lineNo);
        }
        sr.Close();
        sw.Close();
        Console.WriteLine("Finished. Written {0} lines", lineNo);
    }