Search code examples
pythonmysqlregexcsvload-data-infile

Remove CSV formula values during mysql load data infile import


So I'm trying import a .csv file directly into mysql using load data local infile command but I'm running into fields that are remnants an excel formula field and I don't know how to get rid of them. The content for the first field in the example below has an = in front of it.

The table structure is setup to allow the first field to be a VARCHAR(100) but I'd like to make it an INT if possible. Here is a sample of the csv content that is being uploaded.

"MID","DBA Name","Partner ID","Partner Name","Sub Partner ID","Sub Partner Name","Active Months","Bonus Amount","Bonus Applied Date","Partner Percentage","Partner Share","Total Payment"
="0008788014065741","company2","7968","me,"11839","Joe Blow","0","$50.00","","","","$350.64"
="0008788014065756","company2","7968","you","11839","Joe Blow","0","$50.00","","","","$294.60"

And here is the mysql load command that I'm using to import the data:

sql = """
    LOAD DATA LOCAL INFILE '%(upload)s' IGNORE INTO TABLE `%(table)s`
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' 
    LINES TERMINATED BY '\\r\\n' 
    IGNORE 1 LINES ;
      """ % {"upload": file, "table": report}
self.db.query( sql )

Is there something that can be done on the import in terms of a regex or something? I don't know, I'm just grasping at straws here...

Thanks for the input!


Solution

  • You can do it two ways with LOAD DATA INFILE.

    First read first field values as is and then strip equal sign = and double quotes from it in SET clause. Also you most likely want to do other transformations while you're loading data like:

    • set actual NULLs when your fields are empty
    • strip dollar signs from monetary values
    • you probably have to transform date values (but your sample data doesn't have them so there is no information to deduce from)
    LOAD DATA LOCAL INFILE '/path/to/your/file.csv' 
    IGNORE INTO TABLE table_name
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' 
        LINES TERMINATED BY '\r\n'
        IGNORE 1 LINES 
    (@MID, DBAName, PartnerID, PartnerName, SubPartnerID, SubPartnerName, ActiveMonths,
     @BonusAmount, @BonusAppliedDate, @PartnerPercentage, @PartnerShare, @TotalPayment)
    SET MID = TRIM(BOTH '"' FROM SUBSTR(@MID, 2)), -- here we get rid of equal sign and double quotes
        BonusAmount  = TRIM(LEADING '$' FROM NULLIF(@BonusAmount, '')),
        BonusAppliedDate = NULLIF(@BonusAppliedDate, ''),
        PartnerPercentage = NULLIF(@PartnerPercentage, ''),
        PartnerShare = TRIM(LEADING '$' FROM NULLIF(@PartnerShare, '')),
        TotalPayment = TRIM(LEADING '$' FROM NULLIF(@TotalPayment, ''))
    

    Second approach is to leverage LINES STARTING BY clause

    LOAD DATA LOCAL INFILE '/path/to/your/file.csv' 
    IGNORE INTO TABLE table_name
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' 
        LINES TERMINATED BY '\r\n' STARTING BY '='
        IGNORE 1 LINES 
    (MID, DBAName, PartnerID, PartnerName, SubPartnerID, SubPartnerName, ActiveMonths, @BonusAmount, BonusAppliedDate, PartnerPercentage, PartnerShare, @TotalPayment)
    SET BonusAmount  = TRIM(LEADING '$' FROM NULLIF(@BonusAmount, '')),
        BonusAppliedDate = NULLIF(@BonusAppliedDate, ''),
        PartnerPercentage = NULLIF(@PartnerPercentage, ''),
        PartnerShare = TRIM(LEADING '$' FROM NULLIF(@PartnerShare, '')),
        TotalPayment = TRIM(LEADING '$' FROM NULLIF(@TotalPayment, ''))
    

    Now if your target table schema looks something like

    CREATE TABLE table_name 
    (
        MID BIGINT, 
        DBAName           VARCHAR(100),
        PartnerID         INT,
        PartnerName       VARCHAR(100),
        SubPartnerID      INT,
        SubPartnerName    VARCHAR(100),
        ActiveMonths      INT,
        BonusAmount       DECIMAL(19, 2),
        BonusAppliedDate  DATE,
        PartnerPercentage DECIMAL(3, 2),
        PartnerShare      DECIMAL(19, 2),
        TotalPayment      DECIMAL(19, 2)
    );
    

    Then after loading with either method here is what we get in the table

    mysql> select * from table_name;
    +---------------+----------+-----------+-------------+--------------+----------------+--------------+-------------+------------------+-------------------+--------------+--------------+
    | MID           | DBAName  | PartnerID | PartnerName | SubPartnerID | SubPartnerName | ActiveMonths | BonusAmount | BonusAppliedDate | PartnerPercentage | PartnerShare | TotalPayment |
    +---------------+----------+-----------+-------------+--------------+----------------+--------------+-------------+------------------+-------------------+--------------+--------------+
    | 8788014065741 | company2 |      7968 | me          |        11839 | Joe Blow       |            0 |       50.00 | NULL             |              NULL |         NULL |       350.64 |
    | 8788014065756 | company2 |      7968 | you         |        11839 | Joe Blow       |            0 |       50.00 | NULL             |              NULL |         NULL |       294.60 |
    +---------------+----------+-----------+-------------+--------------+----------------+--------------+-------------+------------------+-------------------+--------------+--------------+
    2 rows in set (0.00 sec)