
Extrapolate data in an Excel file using a Python script


I have a dataset covering some countries, with the following fields:

DATE
SOURCE COMPANY 
DESTINATION COMPANY 
QUANTITY
UNITS

I have 2 days' worth of data and I need to extrapolate the data over a year, with dates and quantity values for each company, and print the results to Excel.

I need 1 row per pair of company x delivering to company y, for each day of one year. The dates can run from 2018 until today, and the quantities can vary by +/- 100.

Code

import pandas as pd 
import numpy as np 
import os 
import datetime 
import random 
import sys 
from datetime import timedelta 

df = pd.read_csv("Location") 
df_1 = df 
for i in range(1,366): 
  for i in range(0,df.shape[0]): 
    df['Quantity'][i] = df['Quantity'][i] + random.randint(-100,100)    
    df['Date'][i] = pd.to_datetime(df['Date'])[i] + timedelta(days = -1)

df_1 = pd.concat([df_1,df]) 
df_1.to_csv("Extrapolated_data.csv",sep = '')

but the time it's consuming is unacceptable. How do I make it better?


Solution

  • I think there are some issues in the code you posted in the comment on your question that you have to take care of:

    1. The sep parameter of df_1.to_csv() needs to be set to a single character, e.g. "," or ";" (see here) — an empty string is invalid. Simply omit it if you don't want to use another separator than ","
    2. There is no .csv extension on your source file (which may be perfectly fine)
    3. Assigning df_1 = df does not copy anything: both names refer to the same DataFrame, including all its data, not just the columns/meta-data. This will result in a file containing both the original data and the generated data.
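    The aliasing pitfall from point 3 is easy to demonstrate with a toy frame (the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Quantity": [10, 20]})  # toy frame, not the real data

df_1 = df          # no copy: both names point at the same DataFrame
df.loc[0, "Quantity"] = 999
print(df_1.loc[0, "Quantity"])  # 999 — the "copy" changed too

df_2 = df.copy()   # an independent copy
df.loc[0, "Quantity"] = 1
print(df_2.loc[0, "Quantity"])  # still 999, unaffected by the later change
```

    This is why the updated code below starts from pd.DataFrame(columns=df.columns), an empty frame that only shares the column labels.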

    Updated code

    As I am still new to pandas, I see no way to vectorize the update and instead iterate row by row, which results in the code below. My sample file has about 40 rows in total; the result file has ~13,000 rows and a file size of 619 KB. Creation time is 60.007 seconds. Removing the time calculations and the randomness from the code reduces the creation time to 54.38 seconds. This isn't super-fast, but I wouldn't call it super-slow either.

    import pandas as pd
    import random
    from datetime import timedelta
    from timeit import default_timer as timer
    
    df = pd.read_csv("orders.csv")
    # empty frame sharing only the column labels (no aliasing of the data)
    df_1 = pd.DataFrame(columns=df.columns)
    start = timer()
    
    for d in range(1, 366):
      for index, dfc in df.iterrows():
    
        # jitter the quantity and shift the date back by d days
        dfc['Quantity'] = dfc['Quantity'] + random.randint(-100, 100)
        dfc['Date'] = pd.to_datetime(dfc['Date']) + timedelta(days=-d)
        # DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
        df_1 = pd.concat([df_1, dfc.to_frame().T], ignore_index=True)
    
      print("Loop", d)
    
    end = timer()
    print(end - start)
    
    df_1.to_csv("Extrapolated_data.csv", sep=',')
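    If the row-by-row loop is still too slow, the same extrapolation can be done with no Python-level loop at all, by repeating each row once per day offset and applying the date shifts and quantity jitter as whole arrays. This is only a sketch: the column names and the two sample rows below are made up to stand in for the real data, and it assumes the whole result fits in memory.

```python
import numpy as np
import pandas as pd

# hypothetical sample standing in for the two days of real data
df = pd.DataFrame({
    "Date": ["2023-01-02", "2023-01-02"],
    "Source company": ["A", "B"],
    "Destination company": ["B", "C"],
    "Quantity": [500, 800],
})
df["Date"] = pd.to_datetime(df["Date"])

rng = np.random.default_rng(0)
days = np.arange(1, 366)  # day offsets 1..365

# repeat every original row once per day offset, then line up the offsets
repeated = df.loc[df.index.repeat(len(days))].reset_index(drop=True)
offsets = np.tile(days, len(df))

# shift all dates back and jitter all quantities in one vectorized step
repeated["Date"] = repeated["Date"] - pd.to_timedelta(offsets, unit="D")
repeated["Quantity"] = repeated["Quantity"] + rng.integers(-100, 101, size=len(repeated))

repeated.to_csv("Extrapolated_data.csv", index=False)
```

    The two nested loops (365 × number of rows) collapse into array operations, so the cost of repeated pd.concat calls and per-row pd.to_datetime conversions disappears entirely.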