Search code examples
mysqlload-data-infile

MySQL LOAD DATA INFILE non consistent field


I have a file with four fields per line that looks like this:

<uri> <uri> <uri> <uri> .
:_non-spaced-alphanumeric <uri> "25"^^<uri:integer> <uri> .
:_non-spaced-alphanumeric <uri> "Hello"@en <uri> .
:_non-spaced-alphanumeric <uri> "just text in quotes" <uri> .
...

and this sql script:

LOAD DATA LOCAL INFILE 'data-0.nq'
IGNORE
INTO TABLE btc.btc_2012 
FIELDS 
    TERMINATED BY ' ' OPTIONALLY ENCLOSED BY '"'
LINES 
    TERMINATED BY '.\n'
(subject,predicate,object,provenance);

The third field in the examples can be of any of the formats seen above. I don't really care about the 3rd value unless it's a uri, which is parsed fine by the script anyway. But if it's not then the fourth field consists of the part of the third after the quotation plus the fourth itself.

Is there a way I can get it working without manipulating the file, which by the way is 17GB?


Solution

  • Yes, there's a way to work with this. Have the data fields loaded into MySQL user variables, and then assign expressions to the actual columns.

    For example, in place of:

    (subject,predicate,object,provenance
    

    do something like this:

    (subject, predicate, @field3, @field4)
    SET object = CASE WHEN @field3 LIKE '"%"_%' THEN ... ELSE @field3 END
      , provenance = CONCAT(CASE WHEN @field3 LIKE '"%"%_"' THEN ... ELSE '' END,@field4)
    

    That's just an outline. Obviously, those ... need to replaced with appropriate expressions that return the portions of the field values you want assigned to the columns. (That will be some combination of SUBSTRING, SUBSTRING_INDEX, INSTR, LOCATE, REPLACE, et al. string functions, and you may need additional WHEN constructs to handle variations.

    (I'm not exactly clear on what conditions you need to check.)


    If this is running on Unix or Linux, another option would be to make use of a named pipe, and external program to read the file, perform the require manipulation, and write to the named pipe, run that in the background.

    e.g.

    > mkfifo /tmp/mydata.pipe
    > myprogram <myfile >/tmp/mydata.pipe 2>/tmp/mydata.err &
    
    mysql> LOAD DATA LOCAL INFILE /tmp/mydata.pipe ...
    

    FOLLOWUP

    With an input line like this:

    abc def "Hello"@en klm .
    

    given FIELDS TERMINATED BY ' ' OPTIONALLY ENCLOSED BY '"'

    field1 = 'abc'
    field2 = 'def'
    field3 = '"Hello"@en'
    field4 = 'klm'
    

    To test for the case when field3 contains double quotes, with the first double quote as the first character in the string, we could use something like this:

    LIKE '"%"%'
    

    That says the First character has to be a double quote, followed by zero one or more characters, followed by another double quote, followed again by zero one or more characters.

    To get the portion of the field3 before the second double quote:

    SUBSTRING_INDEX(@field3,'"',2)
    

    To get rid of the leading double quote from that, i.e. to return what's between the double quotes in field3, you could do something like this:

    SUBSTRING_INDEX(SUBSTRING_INDEX(@field3,'"',2),'"',-1)
    

    To get the portion of field3 following the last double quote:

    SUBSTRING_INDEX(SUBSTRING_INDEX(@field3,'"',-1)
    

    (These expressions assume that there are at most two double quotes in field3.)

    To get the value for the third column:

    CASE
    -- when field starts with a double quote and is followed by another double quote
    WHEN @field3 LIKE '"%"%"'
    -- return whats between the double quotes in field3
    THEN SUBSTRING_INDEX(SUBSTRING_INDEX(@field3,'"',2),'"',-1)
    -- otherwise return the entirety of field3
    ELSE @field3
    END
    

    To get the value to be prepended to the fourth column, when field3 contains two double quotes:

    CASE
    -- when field starts with a double quote and is followed by another double quote
    WHEN @field3 LIKE '"%"%"'
    -- return whats after the last double quote in field3
    THEN SUBSTRING_INDEX(@field3,'"',-1)
    -- otherwise return an empty string
    ELSE ''
    END
    

    To prepend that to field4, use the CONCAT function with te CASE expression above and field4.

    And these are the values we'd expect to have inserted into the table:

    column1 = 'abc'
    column2 = 'def'
    column3 = 'Hello'
    column4 = '@enklm'
    

    ANOTHER FOLLOWUP

    If the LOAD DATA isn't recognizing the line delimiter because it's not recognizing the field delimiters, then you'd have to ditch the field delimiters, and do the parsing yourself. Load the whole line into a user variable, and parse away.

    e.g.

    LINES TERMINATED BY '.\n'
    (@line)
    SET subject
        = SUBSTRING_INDEX(@line,' ',1)
      , predicate
        = SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',2),' ',-1)
      , object 
        = CASE
          WHEN SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',3),' ',-1) LIKE '"%' 
          THEN SUBSTRING_INDEX(SUBSTRING_INDEX(@line,'"',2),'"',-1)
          ELSE SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',3),' ',-1)
          END
      , provenance
        = CASE 
          WHEN SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',3),' ',-1) LIKE '"%'
          THEN SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(@line,'"',-1),' ',2),' ',-1)
          ELSE SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',4),' ',-1)
          END
    

    This will work for all the lines in your example data, with fields delimited by a single space, with the exception of matching double quotes in the third field.

    NOTE: The functions available in SQL for string manipulation lead to clumsy and awkward syntax; SQL wasn't specifically designed for easy string manipulation.