Tags: python, mysql, csv, influxdb

Why is InfluxDB performance so slow?


I am storing some data in InfluxDB, and it is quite confusing that InfluxDB is 4-5 times as slow as MySQL. I tried to test this by inserting 10000 rows into MySQL and then into InfluxDB. The stats are below.

Mysql

real: 6m 39sec
user: 2.956sec  
sys: 0.504sec

Influxdb

real: 6m 17.193sec
user: 11.860sec
sys: 0.328sec

My code for InfluxDB is given below; I used the same pattern to store into MySQL.

#!/usr/bin/env python
# coding: utf-8
from influxdb import client as influxdb
from datetime import datetime

host = 'localhost'
port = 8086
user = "admin"
password = "admin"
db_name = "testdatabase"
db = influxdb.InfluxDBClient(host, port, user, password, db_name)


def read_data():
    # Skip the 4 header lines, then yield each CSV row as a list of strings.
    with open(file) as f:
        reader = f.readlines()[4:]
        for line in reader:
            yield line.strip().split(',')


fmt = '%Y-%m-%d %H:%M:%S'
file = '/home/rob/mycsvfile.csv'

csvToInflux = read_data()
body = []
for metric in csvToInflux:
    # Strip the surrounding quotes from the timestamp column, then parse it.
    timestamp = datetime.strptime(metric[0][1: len(metric[0]) - 1], fmt)

    new_value = float(metric[1])
    body.append({
        'measurement': 'mytable1',
        'time': timestamp,
        'fields': {
             'col1': metric[1],
             'col2': metric[2],
             'col3': metric[3],
             'col4': metric[4],
             'col5': metric[5],
             'col6': metric[6],
             'col7': metric[7],
             'col8': metric[8],
             'col9': metric[9]
        }
    })
    # write_points is called on every iteration with the ever-growing
    # body list, so earlier points are re-sent each time through the loop.
    db.write_points(body)

Can someone give me an idea of how I can improve this? I think it might be due to caching: is the cache option off by default in InfluxDB? And can someone guide me on how to do batch processing in InfluxDB? I tried looking over SO and Google but couldn't solve my problem. I am a newbie to InfluxDB and I am trying to make it faster. Thanks for any help or tips.


Solution

  • Inserting points one by one into InfluxDB is slow; you should write them in batches. (Note that your loop also calls db.write_points(body) on every iteration with the ever-growing body list, so earlier points are re-sent over and over.) For example, trying with a CSV of 10000 lines (one by one):

    import influxdb

    # Read the whole CSV into memory.
    with open('/tmp/blah.csv') as f:
        lines = f.readlines()

    inf = influxdb.InfluxDBClient('localhost', 8086, 'root', 'root', 'example1')

    # One write_points() call -- i.e. one HTTP request -- per point.
    for line in lines:
        parts = line.split(',')
        json_body = [{
            'measurement': 'one_by_one',
            'time': parts[0],
            'fields': {
                'my_value': int(parts[1].strip())
            }
        }]
        inf.write_points(json_body)
    

    This gives me a result of

    └─ $ ▶ time python influx_one.py
    
    real    1m43.655s
    user    0m19.547s
    sys     0m3.266s
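
    As a quick sanity check, you can confirm that all 10000 points actually landed by querying the count (a minimal sketch, assuming the same inf client and example1 database as above):

    # Verify the number of points written via the influxdb-python query API.
    result = inf.query('SELECT COUNT("my_value") FROM "one_by_one"')
    print(list(result.get_points()))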
    

    And with a small change to insert all the lines of the CSV in one go:

    # Build the full list of points first, then write them in a single call.
    json_body = []
    for line in lines:
        parts = line.split(',')
        json_body.append({
            'measurement': 'one_batch',
            'time': parts[0],
            'fields': {
                'my_value': int(parts[1].strip())
            }
        })

    inf.write_points(json_body)
    

    The result is much, much better:

    └─ $ ▶ time python influx_good.py
    
    real    0m2.693s
    user    0m1.797s
    sys     0m0.734s
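
    For very large CSVs, a single request holding every point may get unwieldy. A middle ground is to chunk the writes: the influxdb-python client's write_points() accepts a batch_size argument that splits the point list into chunks for you. A minimal sketch, assuming the same json_body and inf as above (the chunk size 5000 is just an illustrative choice):

    # Let the client split the points into chunks of 5000 per HTTP request.
    inf.write_points(json_body, batch_size=5000)

    # Or chunk manually, which is equivalent:
    chunk = 5000
    for i in range(0, len(json_body), chunk):
        inf.write_points(json_body[i:i + chunk])

    Either way you keep nearly all of the batching speedup while bounding the size of each individual request.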