Tags: python, python-3.x, pandas, dataframe, generator

AttributeError: 'generator' object has no attribute 'to_sql' while creating a DataFrame using a generator


I am trying to create a DataFrame from a fixed-width file and load it into a PostgreSQL database. My input file is very large (~16 GB, 20 million records), so building the whole DataFrame at once consumes most of the available RAM and takes a long time to complete. I therefore thought of using the chunksize option (via a Python generator) and committing the records to the table chunk by chunk, but it is failing with the error 'AttributeError: 'generator' object has no attribute 'to_sql''.

Inspired by this answer: https://stackoverflow.com/a/47257676/2799214

input file: test_file.txt

XOXOXOXOXOXO9
AOAOAOAOAOAO8
BOBOBOBOBOBO7
COCOCOCOCOCO6
DODODODODODO5
EOEOEOEOEOEO4
FOFOFOFOFOFO3
GOGOGOGOGOGO2
HOHOHOHOHOHO1

sample.py

import pandas.io.sql as psql
import pandas as pd
from sqlalchemy import create_engine

def chunck_generator(filename, header=False,chunk_size = 10 ** 5):
    for chunk in pd.read_fwf(filename, colspecs=[[0,12],[12,13]],index_col=False,header=None, iterator=True, chunksize=chunk_size):
        yield (chunk)

def _generator( engine, filename, header=False,chunk_size = 10 ** 5):
    chunk = chunck_generator(filename, header=False,chunk_size = 10 ** 5)
    chunk.to_sql('sample_table', engine, if_exists='replace', schema='sample_schema', index=False)
    yield row

if __name__ == "__main__":
    filename = r'test_file.txt'
    engine = create_engine('postgresql://ABCD:ABCD@ip:port/database')
    c = engine.connect()
    conn = c.connection
    generator = _generator(engine=engine, filename=filename)
    while True:
       print(next(generator))
    conn.close()

Error:

    chunk.to_sql('sample_table', engine, if_exists='replace', schema='sample_schema', index=False)
AttributeError: 'generator' object has no attribute 'to_sql'

My primary goal is to improve performance. Please help me resolve this issue or suggest a better approach. Thanks in advance.


Solution

  • Conclusion: The to_sql method is not efficient for loading large files, so I used the copy_from method from the psycopg2 package together with the chunksize option when creating the DataFrame. That loaded 9.8 million records (~17 GB) with 98 columns each in 30 minutes. (A sketch of the direct to_sql fix for the original error is shown after the code below.)

    I have removed references to my actual file (I am using the sample file from the original post).

    import io

    import pandas as pd
    import psycopg2

    def sql_generator(cur, con, filename, boundries, col_names, header=False, chunk_size=2000000):
        # Read the fixed-width file lazily, one chunk of rows at a time.
        frame = pd.read_fwf(filename, colspecs=boundries, index_col=False, header=None,
                            iterator=True, chunksize=chunk_size, names=col_names)
        for chunk in frame:
            # Serialize the chunk to an in-memory, pipe-delimited buffer ...
            output = io.StringIO()
            chunk.to_csv(output, sep='|', quoting=3, escapechar='\\', index=False,
                         header=False, encoding='utf-8')
            output.seek(0)
            # ... and stream it into PostgreSQL with COPY, which is far faster than row-by-row INSERTs.
            cur.copy_from(output, 'sample_schema.sample_table', null="", sep="|")
            yield con.commit()

    if __name__ == "__main__":
        boundries = [[0, 12], [12, 13]]
        col_names = ['col1', 'col2']
        filename = r'test_file.txt'  # Refer to the sample file in the original post
        con = psycopg2.connect(database='database', user='username', password='pwd', host='ip', port='port')
        cur = con.cursor()
        for sql in sql_generator(cur, con, filename, boundries, col_names):
            print(sql)
        con.close()
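
    For reference, the original AttributeError happens because to_sql is called on the generator returned by chunck_generator instead of on the DataFrame chunks it yields. Below is a minimal sketch of that direct fix, kept only for comparison since, as noted above, to_sql was too slow for a file this size. The connection string, schema, and table names are the placeholders from the question, so adjust them for a real database.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string from the question; replace with real credentials.
    engine = create_engine('postgresql://ABCD:ABCD@ip:port/database')

    # read_fwf with chunksize returns an iterator of DataFrame chunks.
    chunks = pd.read_fwf('test_file.txt', colspecs=[[0, 12], [12, 13]],
                         index_col=False, header=None, names=['col1', 'col2'],
                         iterator=True, chunksize=10 ** 5)

    for i, chunk in enumerate(chunks):
        # Call to_sql on each DataFrame chunk, not on the generator itself:
        # replace the table on the first chunk, then append the remaining ones.
        chunk.to_sql('sample_table', engine, schema='sample_schema',
                     if_exists='replace' if i == 0 else 'append', index=False)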