I have a file with four fields per line that looks like this:
<uri> <uri> <uri> <uri> .
:_non-spaced-alphanumeric <uri> "25"^^<uri:integer> <uri> .
:_non-spaced-alphanumeric <uri> "Hello"@en <uri> .
:_non-spaced-alphanumeric <uri> "just text in quotes" <uri> .
...
and this sql script:
LOAD DATA LOCAL INFILE 'data-0.nq'
IGNORE
INTO TABLE btc.btc_2012
FIELDS
TERMINATED BY ' ' OPTIONALLY ENCLOSED BY '"'
LINES
TERMINATED BY '.\n'
(subject,predicate,object,provenance);
The third field in the examples can be any of the formats seen above. I don't really care about the third value unless it's a URI, which the script parses fine anyway. But if it's not, then the fourth field ends up consisting of the part of the third field after the closing quote plus the fourth field itself.
Is there a way I can get it working without manipulating the file, which by the way is 17GB?
Yes, there's a way to work with this. Have the data fields loaded into MySQL user variables, and then assign expressions to the actual columns.
For example, in place of:
(subject,predicate,object,provenance)
do something like this:
(subject, predicate, @field3, @field4)
SET object = CASE WHEN @field3 LIKE '"%"%' THEN ... ELSE @field3 END
, provenance = CONCAT(CASE WHEN @field3 LIKE '"%"%' THEN ... ELSE '' END, @field4)
That's just an outline. Obviously, those ...
need to be replaced with appropriate expressions that return the portions of the field values you want assigned to the columns. (That will be some combination of SUBSTRING, SUBSTRING_INDEX, INSTR, LOCATE, REPLACE, et al. string functions, and you may need additional WHEN constructs to handle variations.)
(I'm not exactly clear on what conditions you need to check.)
If this is running on Unix or Linux, another option would be to make use of a named pipe, and an external program that reads the file, performs the required manipulation, and writes to the named pipe; run that program in the background.
e.g.
> mkfifo /tmp/mydata.pipe
> myprogram <myfile >/tmp/mydata.pipe 2>/tmp/mydata.err &
mysql> LOAD DATA LOCAL INFILE '/tmp/mydata.pipe' ...
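For illustration, here's a minimal Python sketch of such a filter program (the `myprogram` above; the tab-delimited output format is my assumption, so the LOAD DATA statement would then use the default FIELDS TERMINATED BY '\t' instead of ' '). It assumes well-formed four-field lines and mirrors the CASE/CONCAT logic described above:

```python
def transform(line):
    """Normalize one four-field line for LOAD DATA (assumes well-formed input).

    If the third field is quoted (e.g. "Hello"@en), keep only the text
    between the quotes as field 3 and prepend any trailing tag to field 4.
    Output is tab-delimited.
    """
    line = line.rstrip('\n')
    if line.endswith(' .'):
        line = line[:-2]              # strip the " ." record terminator
    subject, rest = line.split(' ', 1)
    predicate, rest = rest.split(' ', 1)
    if rest.startswith('"'):
        closing = rest.index('"', 1)  # position of the matching quote
        obj = rest[1:closing]         # text between the quotes
        after = rest[closing + 1:]    # e.g. '@en <uri>' or ' <uri>'
        tag, _, provenance = after.partition(' ')
        provenance = tag + provenance
    else:
        obj, _, provenance = rest.partition(' ')
    return '\t'.join((subject, predicate, obj, provenance))

# As the pipe filter, you would loop over sys.stdin:
#   for raw in sys.stdin:
#       print(transform(raw))
print(transform(':_x <uri> "Hello"@en <uri> .'))
```

This handles the quoted-with-spaces case (`"just text in quotes"`) too, since it scans for the matching quote rather than splitting on every space.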
FOLLOWUP
With an input line like this:
abc def "Hello"@en klm .
given FIELDS TERMINATED BY ' ' OPTIONALLY ENCLOSED BY '"'
field1 = 'abc'
field2 = 'def'
field3 = '"Hello"@en'
field4 = 'klm'
To test for the case when field3 contains double quotes, with the first double quote as the first character in the string, we could use something like this:
LIKE '"%"%'
That says the first character has to be a double quote, followed by zero or more characters, followed by another double quote, followed again by zero or more characters.
To get the portion of the field3 before the second double quote:
SUBSTRING_INDEX(@field3,'"',2)
To get rid of the leading double quote from that, i.e. to return what's between the double quotes in field3, you could do something like this:
SUBSTRING_INDEX(SUBSTRING_INDEX(@field3,'"',2),'"',-1)
To get the portion of field3 following the last double quote:
SUBSTRING_INDEX(@field3,'"',-1)
(These expressions assume that there are at most two double quotes in field3.)
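If it helps to sanity-check those expressions outside of MySQL, here is a small Python emulation of SUBSTRING_INDEX (the function and harness are mine, not part of the LOAD DATA statement), applied to the sample value of field3:

```python
def substring_index(s, delim, count):
    """Emulate MySQL's SUBSTRING_INDEX(str, delim, count)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything before the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything after the count-th-from-last delimiter
    return ''

field3 = '"Hello"@en'
print(substring_index(field3, '"', 2))                             # "Hello
print(substring_index(substring_index(field3, '"', 2), '"', -1))   # Hello
print(substring_index(field3, '"', -1))                            # @en
```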
To get the value for the third column:
CASE
-- when field starts with a double quote and is followed by another double quote
WHEN @field3 LIKE '"%"%'
-- return what's between the double quotes in field3
THEN SUBSTRING_INDEX(SUBSTRING_INDEX(@field3,'"',2),'"',-1)
-- otherwise return the entirety of field3
ELSE @field3
END
To get the value to be prepended to the fourth column, when field3 contains two double quotes:
CASE
-- when field starts with a double quote and is followed by another double quote
WHEN @field3 LIKE '"%"%'
-- return what's after the last double quote in field3
THEN SUBSTRING_INDEX(@field3,'"',-1)
-- otherwise return an empty string
ELSE ''
END
To prepend that to field4, use the CONCAT function with the CASE expression above and field4.
And these are the values we'd expect to have inserted into the table:
column1 = 'abc'
column2 = 'def'
column3 = 'Hello'
column4 = '@enklm'
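As a quick check that the two CASE expressions really produce those values, here is the same logic in Python (a verification sketch only; `substring_index` emulates MySQL's SUBSTRING_INDEX):

```python
def substring_index(s, delim, count):
    """Emulate MySQL's SUBSTRING_INDEX(str, delim, count)."""
    parts = s.split(delim)
    return delim.join(parts[:count] if count > 0 else parts[count:]) if count else ''

def split_object(field3, field4):
    """Apply the two CASE expressions; returns (object, provenance)."""
    # LIKE '"%"%' : starts with a double quote and contains another one
    if field3.startswith('"') and field3.count('"') >= 2:
        obj = substring_index(substring_index(field3, '"', 2), '"', -1)
        prov = substring_index(field3, '"', -1) + field4   # the CONCAT(...)
    else:
        obj, prov = field3, field4
    return obj, prov

print(split_object('"Hello"@en', 'klm'))   # ('Hello', '@enklm')
```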
ANOTHER FOLLOWUP
If the LOAD DATA isn't recognizing the line delimiter because it's not recognizing the field delimiters, then you'd have to ditch the field delimiters, and do the parsing yourself. Load the whole line into a user variable, and parse away.
e.g.
LINES TERMINATED BY '.\n'
(@line)
SET subject
= SUBSTRING_INDEX(@line,' ',1)
, predicate
= SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',2),' ',-1)
, object
= CASE
WHEN SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',3),' ',-1) LIKE '"%'
THEN SUBSTRING_INDEX(SUBSTRING_INDEX(@line,'"',2),'"',-1)
ELSE SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',3),' ',-1)
END
, provenance
= CASE
WHEN SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',3),' ',-1) LIKE '"%'
THEN SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(@line,'"',-1),' ',2),' ',-1)
ELSE SUBSTRING_INDEX(SUBSTRING_INDEX(@line,' ',4),' ',-1)
END
This will work for all of the lines in your example data, as long as fields are delimited by a single space; the one special case handled is a matching pair of double quotes in the third field.
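Those SET expressions can also be checked outside MySQL. This Python sketch mirrors them one-for-one (a test harness, not part of the LOAD DATA statement); note that `@line` retains the trailing space left after LINES TERMINATED BY '.\n' strips the terminator:

```python
def substring_index(s, delim, count):
    """Emulate MySQL's SUBSTRING_INDEX(str, delim, count)."""
    parts = s.split(delim)
    return delim.join(parts[:count] if count > 0 else parts[count:]) if count else ''

def parse_line(line):
    """Mirror the SET expressions; line has already had '.\\n' stripped."""
    subject = substring_index(line, ' ', 1)
    predicate = substring_index(substring_index(line, ' ', 2), ' ', -1)
    field3 = substring_index(substring_index(line, ' ', 3), ' ', -1)
    if field3.startswith('"'):   # LIKE '"%'
        obj = substring_index(substring_index(line, '"', 2), '"', -1)
        prov = substring_index(
            substring_index(substring_index(line, '"', -1), ' ', 2), ' ', -1)
    else:
        obj = field3
        prov = substring_index(substring_index(line, ' ', 4), ' ', -1)
    return subject, predicate, obj, prov

print(parse_line('abc def "Hello"@en klm '))   # ('abc', 'def', 'Hello', 'klm')
```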
NOTE: The string-manipulation functions available in SQL lead to clumsy and awkward syntax; SQL wasn't designed for easy string manipulation.