
psycopg2 - List index out of range for html (large string) field


I have a webcrawler that should save/insert the html page content to a PostgreSQL Database with some other meta data fields.

When inserting the html content field using mogrify, I get the error "list index out of range". If I use a static dummy text for the html content, e.g. "Hello World ö ü ä ß" (I am dealing with a German character set), the insert works fine.

This is my function:

def batch_insert(df, table, no_cols, conn=None, upsert_qry=""):
    """
    Using cursor.mogrify() to build the bulk insert query
    then cursor.execute() to execute the query
    """
    if conn is None:
        conn = get_connection()

    col_str = "(" + (no_cols-1)*"%s," + "%s)"
    # Create a list of tuples from the dataframe values
    tuples = [tuple(x) for x in df.to_numpy()]
    # Comma-separated dataframe columns
    cols = ','.join(list(df.columns))
    # SQL query to execute
    cursor = conn.cursor()

    values = [cursor.mogrify(col_str, tup).decode('utf8') for tup in tuples]

    query = "INSERT INTO %s(%s) VALUES " % (table, cols) + ",".join(values) + upsert_qry
        #" ON CONFLICT (hk_portal_pid) DO UPDATE SET crawled_last=" + str(datetime.now())
        
    try:
        cursor.execute(query, tuples)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        conn.rollback()
        cursor.close()
        return 1
    print("execute_mogrify() done")
    cursor.close()
    conn.close()

The Postgres column is declared as text, as suggested in this post. The html content in my case has an average length of 40,000 characters, so there shouldn't be any storage issue.

I also checked whether the generated tuples in the function get split for the html content field because of escape characters etc. This doesn't seem to be the case:

>>> [len(a) for a in tuples]
[9, 9, 9, 9]  # 4 rows / 9 columns

Data type for the 8th column (page_content) is string:

>>> type(tuples[0][7])
<class 'str'>

Target table for batch insert:

CREATE TABLE public.as_portal_parent_content (
    hk_page varchar(100) NULL,
    hk_offer varchar(100) NULL,
    portal_id varchar(3) NULL,
    pid varchar(10) null,
    page_category varchar(30) null,
    page_link bpchar(300) null,
    status varchar(3) null,
    page_content text null,
    last_crawled timestamp null
);

Any help much appreciated!

Edit:

Traceback:

Traceback (most recent call last):
  File "C:/Users/PycharmProjects/pythonProject/project/playground.py", line 5, in <module>
    test.save_to_db(table='as_portal_parent_content')
  File "C:/Users/PycharmProjects/pythonProject/project/util/crawl.py", line 163, in save_to_db
    batch_insert(self.df, table)
  File "C:/Users/PycharmProjects/pythonProject/project/util/db.py", line 60, in batch_insert
    cursor.execute(query, tuples)
IndexError: list index out of range

Process finished with exit code 1

Solution

  • You shouldn't pass tuples again in cursor.execute(query, tuples).
    When you use mogrify, you are already generating the after-VALUES part of the SQL query, with every value interpolated and escaped. So there is nothing left for cursor.execute() to substitute: call cursor.execute(query) with no second argument. Passing parameters a second time makes psycopg2 re-scan the finished query for placeholders, and any literal % character in the inlined HTML (e.g. %20 in a URL) is then read as a format marker. That is likely why the static dummy text (which contains no %) inserted fine while real pages raised IndexError: list index out of range.
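A minimal sketch of the corrected function, assuming the same dataframe layout and return convention as in the question (the connection is passed in, and Exception is caught broadly here instead of psycopg2.DatabaseError so the sketch stays self-contained). The only substantive change is that cursor.execute() receives just the finished query:

```python
def values_clause(cursor, rows, no_cols):
    """Render every row into a "(v1,v2,...)" SQL literal via mogrify()."""
    placeholder_row = "(" + "%s," * (no_cols - 1) + "%s)"
    return ",".join(
        cursor.mogrify(placeholder_row, row).decode("utf8") for row in rows
    )

def batch_insert(df, table, no_cols, conn, upsert_qry=""):
    """Bulk-insert a dataframe; returns 1 on failure, as in the question."""
    cursor = conn.cursor()
    tuples = [tuple(x) for x in df.to_numpy()]
    query = (
        "INSERT INTO %s(%s) VALUES " % (table, ",".join(df.columns))
        + values_clause(cursor, tuples, no_cols)
        + upsert_qry
    )
    try:
        # mogrify() has already interpolated and escaped every value,
        # so execute the finished statement with NO second argument.
        cursor.execute(query)
        conn.commit()
    except Exception as error:  # psycopg2.DatabaseError in the original
        print("Error: %s" % error)
        conn.rollback()
        cursor.close()
        return 1
    cursor.close()
```

Alternatively, psycopg2.extras.execute_values(cursor, "INSERT INTO ... VALUES %s", tuples) builds the same multi-row insert without hand-rolling the mogrify loop.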