python pandas dataframe dynamic-url

How can I troubleshoot why this Python loop won't properly load all files?


As a personal project, I am trying to load all of the VA pay data into a database. I wrote a script that loops through a table storing the state abbreviations and uses each one to build the URL for that state's file.

I use pandas to read each URL into a DataFrame and then insert it into a database.

I was only able to successfully load the first file; it looks like the loop stops after that.

How can I identify the issue?

#portion of script starts below
states = list()
states = cur.execute('SELECT abbr FROM ref_states')

for st in states:
    test = st[0]
    url = 'https://www.va.gov/OHRM/pay/2023/SSR/'+test+'.txt'

    # Open file as a dataframe
    data = pd.read_csv (url)
    df = pd.DataFrame(data)

    # Insert DataFrame to Table
    for row in df.itertuples():

        #define values that will be inserted
        values = (row[1],row[2],row[3],row[4],row[5],row[6],row[7],row[8],row[9],row[10])

        #print(values)
        cur.execute(insert_query, values)

conn.commit()

I wrote a loop that generates the URL for each file that I want to load.

I plan to loop through each file, read it into a DataFrame, insert it into the database, and move on to the next file.


Solution

  • Here is an example of how you can load all of the CSV files from the URLs, concatenate the resulting dataframes into one final dataframe, and store it in the table my_data:

    import sqlite3
    
    import pandas as pd
    
    # create a sample in-memory SQLite database
    conn = sqlite3.connect(":memory:")
    cursor = conn.cursor()
    
    create_table_query = """
        CREATE TABLE ref_states (
            abbr TEXT
        )
    """
    cursor.execute(create_table_query)
    conn.commit()
    insert_query = "INSERT INTO ref_states (abbr) VALUES ('AK'), ('AL'), ('AS');"
    cursor.execute(insert_query)
    
    # select abbreviation from ref_states
    cursor.execute("SELECT abbr FROM ref_states")
    rows = cursor.fetchall()
    
    all_dfs = []
    for row in rows:
        url = f"https://www.va.gov/OHRM/pay/2023/SSR/{row[0]}.txt"
        df = pd.read_csv(url)
    
        print(df.head(3))
        print()
    
        all_dfs.append(df)
    
    # create final dataframe and save it to DB:
    final_df = pd.concat(all_dfs)
    final_df.to_sql("my_data", conn, if_exists="replace")
    
    # test it:
    cursor.execute("SELECT * FROM my_data")
    rows = cursor.fetchall()
    
    for row in rows:
        print(row)
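As to why the original loop stopped after the first file: in Python's sqlite3 module, calling cur.execute() on a cursor you are still iterating resets its result set, so the outer for st in states: loop ends after a single pass. Fetching all rows up front with fetchall() (as in the example above) or using a second cursor avoids this. A minimal sketch of the behavior, using illustrative table names rather than the real pay data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE ref_states (abbr TEXT)")
cur.executemany("INSERT INTO ref_states VALUES (?)", [("AK",), ("AL",), ("AS",)])
cur.execute("CREATE TABLE loaded (abbr TEXT)")  # stand-in for the pay-data table

# BUG: reusing `cur` inside the loop resets its pending SELECT,
# so the loop only ever sees the first row.
seen_buggy = []
for st in cur.execute("SELECT abbr FROM ref_states"):
    seen_buggy.append(st[0])
    cur.execute("INSERT INTO loaded VALUES (?)", (st[0],))

# FIX: materialize the rows before reusing the cursor.
cur.execute("SELECT abbr FROM ref_states")
states = cur.fetchall()
seen_fixed = []
for st in states:
    seen_fixed.append(st[0])
    cur.execute("INSERT INTO loaded VALUES (?)", (st[0],))

print(seen_buggy)  # ['AK']
print(seen_fixed)  # ['AK', 'AL', 'AS']
```

The same fix works in the original script: replace `states = cur.execute(...)` with an `execute` followed by `fetchall()`, or create a dedicated cursor for the inserts.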