Search code examples
mysqlxmlperformanceimportload-data-infile

How to improve performance of a mysql LOAD XML LOCAL INFILE table import?


For a mysql database, I have some XML dump files.

One table import file for a pricing table looks like this:

<database>
    <table>
        <row>
            <id>5954017</id>
            <foo>narf</foo>
            <bar_id>1377</bar_id>
            <price_single>800.00</price_single>
            <price_double>1500.00</price_double>
            <price_triple>2000.00</price_triple>
            <price_quad>1900.00</price_quad>
            <currency>USD</currency>
        </row>
        ...
    </table>
</database>

It has

$ xmllint --xpath "count(//row)" import.xml 
223198

rows and its size is:

du -h import.xml 
69M import.xml

I want to import via mysql's LOAD XML feature. The table will be always be truncated beforehand.

MySQL [my_database]> LOAD XML LOCAL INFILE 'import.xml' INTO TABLE pricing ROWS IDENTIFIED BY '<row>' \G

It succeeds, yet that table import takes seemingly a rather long time:

Query OK, 223198 rows affected (1 hour 44 min 48.40 sec)
Records: 223198  Deleted: 0  Skipped: 0  Warnings: 0

I was reading that people are importing gigabytes of data with this LOAD INFILE feature, and I expected its performance to be much faster, in the realms of minutes, not hours. Is my expectation wrong? Is this a normal time for such a dataset of 200,000 entries?

(I also am comparing this speed to a custom php import script that parses the XML manually and inserts each row one by one; and that tasks takes 45 minutes for all the tables. I expected the LOAD XML LOCAL INFILE to outperform that task.)

My table looks like this:

MySQL [my_database]> DESCRIBE pricing;
+----------------+--------------+------+-----+---------+-------+
| Field          | Type         | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| id             | int(11)      | NO   | PRI | NULL    |       |
| foo            | varchar(256) | YES  |     | NULL    |       |
| bar_id         | int(11)      | YES  |     | NULL    |       |
| price_single   | float        | YES  |     | NULL    |       |
| price_double   | float        | YES  |     | NULL    |       |
| price_triple   | float        | YES  |     | NULL    |       |
| price_quad     | float        | YES  |     | NULL    |       |
| currency       | varchar(3)   | YES  |     | NULL    |       |
+----------------+--------------+------+-----+---------+-------+

And was created via:

DROP TABLE IF EXISTS `pricing`;
CREATE TABLE `pricing` (
      `id` int(11) NOT NULL,
      `foo` varchar(256) DEFAULT NULL,
      `bar_id` int(11) DEFAULT NULL,
      `price_single` float DEFAULT NULL,
      `price_double` float DEFAULT NULL,
      `price_triple` float DEFAULT NULL,
      `price_quad` float DEFAULT NULL,
      `currency` varchar(3) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
      PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

What can I do to improve the performance of LOAD XML LOCAL INFILE?


Solution

  • Transform your XML into a CSV file making the import almost instant.

    You can transform your XML via xslt using e.g. xsltproc:

     $ xsltproc transformToCsv.xsl price.xml > price.csv
     "5954017"╡"narf"╡"1377"╡"800.00"╡"1500.00"╡"2000.00"╡"1900.00"╡"USD"
    

    The appropriate xslt transformToCsv.xsl looks like this:

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="text"
                    encoding="utf-8"/>
    
        <xsl:param name="delim"
                   select="'╡'"/>
        <xsl:param name="quote"
                   select="'&quot;'"/>
        <xsl:param name="break"
                   select="'&#10;'"/>
    
        <xsl:template match="/">
            <xsl:apply-templates select="database/table/row"/>
        </xsl:template>
    
        <xsl:template match="row">
            <xsl:apply-templates/>
            <xsl:if test="following-sibling::*">
                <xsl:value-of select="$break"/>
            </xsl:if>
        </xsl:template>
    
        <xsl:template match="*">
            <xsl:value-of select="concat($quote, normalize-space(), $quote)"/>
            <xsl:if test="following-sibling::*">
                <xsl:value-of select="$delim"/>
            </xsl:if>
        </xsl:template>
    
        <xsl:template match="text()"/>
    </xsl:stylesheet>
    

    The import syntax then changes to:

    $ mysql \
        -h YOUR_MYSQL_HOST \
        -P YOUR_PORT \
        -uYOUR_USER \
        -pYOUR_PASSWORD \
        YOUR_DATABASE \
        -e "LOAD DATA LOCAL INFILE 'pricing.csv' INTO TABLE pricing FIELDS TERMINATED BY '╡' ENCLOSED BY '\"' \G"
    

    Importing 200,000 entries becomes a matter of (milli) seconds.