
How to import a mysqldump into Pandas


I am wondering whether there is a simple way to import a mysqldump into Pandas.

I have a few small (~110MB) tables and I would like to have them as DataFrames.

I would like to avoid having to put the data back into a database, since that would require installing and connecting to such a database. I have the .sql files and want to import the contained tables into Pandas. Does any module exist to do this?

If versioning matters, the .sql files all list "MySQL dump 10.13 Distrib 5.6.13, for Win32 (x86)" as the system the dump was produced on.

Background in hindsight

I was working locally on a computer with no database connection. The normal flow for my work was to be given a .tsv, .csv or .json file from a third party and to do some analysis, which would be given back. A new third party gave all their data in .sql format, and this broke my workflow, since getting it into a format my programs could take as input would require a lot of overhead. We ended up asking them to send the data in a different format, but for business/reputation reasons we wanted to look for a workaround first.

Edit: below is a sample mysqldump file with two tables.

/*
MySQL - 5.6.28 : Database - ztest
*********************************************************************
*/


/*!40101 SET NAMES utf8 */;

/*!40101 SET SQL_MODE=''*/;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;
CREATE DATABASE /*!32312 IF NOT EXISTS*/`ztest` /*!40100 DEFAULT CHARACTER SET latin1 */;

USE `ztest`;

/*Table structure for table `food_in` */

DROP TABLE IF EXISTS `food_in`;

CREATE TABLE `food_in` (
  `ID` int(11) NOT NULL AUTO_INCREMENT,
  `Cat` varchar(255) DEFAULT NULL,
  `Item` varchar(255) DEFAULT NULL,
  `price` decimal(10,4) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL,
  KEY `ID` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;

/*Data for the table `food_in` */

insert  into `food_in`(`ID`,`Cat`,`Item`,`price`,`quantity`) values 

(2,'Liq','Beer','2.5000','300'),

(7,'Liq','Water','3.5000','230'),

(9,'Liq','Soda','3.5000','399');

/*Table structure for table `food_min` */

DROP TABLE IF EXISTS `food_min`;

CREATE TABLE `food_min` (
  `Item` varchar(255) DEFAULT NULL,
  `quantity` decimal(10,0) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

/*Data for the table `food_min` */

insert  into `food_min`(`Item`,`quantity`) values 

('Pizza','300'),

('Hotdogs','200'),

('Beer','300'),

('Water','230'),

('Soda','399'),

('Soup','100');

/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;

Solution

  • No

    Pandas has no native way of reading a mysqldump without it passing through a database.
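    For completeness, the supported route is to restore the dump into a database and read it back with pandas.read_sql. The sketch below uses an in-memory SQLite database as a stand-in for a real MySQL server (the table and rows are taken from the sample dump); with MySQL you would restore the dump first (mysql < dump.sql) and connect through SQLAlchemy instead.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for a MySQL server here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE food_min (Item TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO food_min VALUES (?, ?)",
    [("Pizza", 300), ("Hotdogs", 200), ("Beer", 300)],
)

# read_sql runs the query and returns the result as a DataFrame.
df = pd.read_sql("SELECT * FROM food_min", conn)
print(df)
conn.close()
```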

    There is a possible workaround, but it is in my opinion a very bad idea.

    Workaround (Not recommended for production use)

    Of course you could parse the data from the mysqldump file using a preprocessor.

    mysqldump files often contain a lot of extra data we are not interested in when loading a pandas DataFrame, so we need to preprocess them: strip the noise and reformat the lines so that they conform to a CSV-like layout.

    Using io.StringIO, we can read the file and process the data before it is fed to the pandas.read_csv function:

    from io import StringIO
    import re

    def read_dump(dump_filename, target_table):
        sio = StringIO()

        fast_forward = True
        with open(dump_filename, 'r') as f:  # text mode, so lines are str
            for line in f:
                line = line.strip()
                # Skip everything until the INSERT for the table we want.
                if line.lower().startswith('insert') and target_table in line:
                    fast_forward = False
                if fast_forward:
                    continue
                # Grab the first parenthesised group: either the column list
                # or a row of values.
                data = re.findall(r'\([^)]*\)', line)
                try:
                    newline = data[0]
                    newline = newline.strip(' ()')
                    newline = newline.replace('`', '')
                    sio.write(newline)
                    sio.write("\n")
                except IndexError:
                    pass
                if line.endswith(';'):  # end of the INSERT statement
                    break
        sio.seek(0)  # rewind so read_csv starts at the beginning
        return sio
    

    Now that we have a function that reads and formats the data to look like a CSV file, we can read it with pandas.read_csv():

    import pandas as pd
    
    food_min_filedata = read_dump('mysqldumpexample', 'food_min')
    food_in_filedata = read_dump('mysqldumpexample', 'food_in')
    
    df_food_min = pd.read_csv(food_min_filedata)
    df_food_in = pd.read_csv(food_in_filedata)
    

    Results in:

            Item quantity
    0    'Pizza'    '300'
    1  'Hotdogs'    '200'
    2     'Beer'    '300'
    3    'Water'    '230'
    4     'Soda'    '399'
    5     'Soup'    '100'
    

    and

       ID    Cat     Item     price quantity
    0   2  'Liq'   'Beer'  '2.5000'    '300'
    1   7  'Liq'  'Water'  '3.5000'    '230'
    2   9  'Liq'   'Soda'  '3.5000'    '399'
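    As the output shows, string values keep their single quotes, because the parser copies the dump's values verbatim. A possible refinement (my own addition, not part of the original answer) is to pass quotechar="'" to read_csv, so pandas strips the quotes during tokenisation and numeric columns are inferred normally:

```python
from io import StringIO
import pandas as pd

# A few preprocessed lines, as read_dump would produce them (sample data).
csv_data = StringIO("Item,quantity\n'Pizza','300'\n'Hotdogs','200'\n")

# quotechar="'" tells the tokenizer the fields are single-quoted,
# so the quotes are removed before dtype inference.
df = pd.read_csv(csv_data, quotechar="'")
print(df)
```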
    

    Note on Stream processing

    This approach is a form of stream processing and is very memory-efficient: only the rows of the target table are ever buffered, no matter how large the rest of the dump file is. In general, stream processing is a good way to read large CSV files into pandas more efficiently.
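    To illustrate the streaming idea in pandas itself (a generic sketch, not tied to the dump above), read_csv can consume a file in chunks so that only chunksize rows are held in memory at a time:

```python
from io import StringIO
import pandas as pd

# A small CSV standing in for a file too large to load at once.
big_csv = StringIO("x\n" + "\n".join(str(i) for i in range(10)))

total = 0
# chunksize=4 yields DataFrames of at most 4 rows each.
for chunk in pd.read_csv(big_csv, chunksize=4):
    total += chunk["x"].sum()

print(total)  # sum of 0..9, i.e. 45
```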

    It is the parsing of a mysqldump file that I advise against.