I have a given dataset of some countries with the following fields:
DATE
SOURCE COMPANY
DESTINATION COMPANY
QUANTITY
UNITS
I have 2 days' worth of data and need to extrapolate it to a full year, with dates and quantity values for each company, and write the results to Excel. I need 1 row per source company x delivered to destination company y for 1 year. The dates can range from 2018 till today, and the quantities can differ by +/- 100.
Code
import pandas as pd
import numpy as np
import os
import datetime
import random
import sys
from datetime import timedelta
df = pd.read_csv("Location")
df_1 = df
for i in range(1,366):
    for i in range(0,df.shape[0]):
        df['Quantity'][i] = df['Quantity'][i] + random.randint(-100,100)
        df['Date'][i] = pd.to_datetime(df['Date'])[i] + timedelta(days = -1)
    df_1 = pd.concat([df_1,df])
df_1.to_csv("Extrapolated_data.csv",sep = '')
but the time it consumes is unacceptable. How do I make it faster?
I think there are some issues with the code you posted in your question that you have to take care of:

- The sep parameter in df_1.to_csv() needs to be set to a character, e.g. "," or ";" (see here); otherwise simply omit it if you don't want to use a separator other than ",".
- There is no .csv extension on your source file "Location" (which may be perfectly fine).
- df_1 = df will not only copy the columns/meta-data of the DataFrame, but the whole DataFrame including all data. This will result in a file containing both the original data and the generated data.

Updated code
As I am still new to pandas, I see no way to iterate over only the rows; instead I had to iterate over rows and columns, which results in the code below. My sample file has about 40 rows in total; the result file has ~13,000 rows with a file size of 619 KB. Creation time is 60.007 seconds. Removing the time calculations and randomness from the code reduces the creation time to 54.38 seconds. This isn't super-fast, but I wouldn't call it super-slow either.
import pandas as pd
import numpy as np
import os
import datetime
import random
import sys
from datetime import timedelta
from timeit import default_timer as timer
df = pd.read_csv("orders.csv")
df_1 = pd.DataFrame(columns=df.columns)
start = timer()
for d in range(1,366):
    for index, dfc in df.iterrows():
        dfc['Quantity'] = dfc['Quantity'] + random.randint(-100,100)
        dfc['Date'] = pd.to_datetime(dfc['Date']) + timedelta(days = -d)
        df_1 = df_1.append(dfc, ignore_index=True)
    print("Loop ",d)
end = timer()
print(end - start)
df_1.to_csv("Extrapolated_data.csv",sep = ',')
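The row-by-row iterrows/append pattern is what dominates the runtime here (and note that DataFrame.append was removed entirely in pandas 2.0). A vectorized sketch that produces the same kind of output in one pass, using a made-up inline sample frame in place of orders.csv so it is self-contained:

```python
import numpy as np
import pandas as pd

# Hypothetical sample matching the question's columns; replace with
# pd.read_csv("orders.csv") on real data.
df = pd.DataFrame({
    "Date": ["2019-01-01", "2019-01-01"],
    "Source Company": ["A", "B"],
    "Destination Company": ["B", "C"],
    "Quantity": [500, 700],
    "Units": ["kg", "kg"],
})
df["Date"] = pd.to_datetime(df["Date"])

days = 365

# Repeat the whole frame once per day instead of appending row by row
out = pd.concat([df] * days, ignore_index=True)

# Vectorized day offsets: the rows for day d are shifted back by d days
offsets = np.repeat(np.arange(1, days + 1), len(df))
out["Date"] = out["Date"] - pd.to_timedelta(offsets, unit="D")

# Vectorized random jitter of +/- 100 on the quantities
out["Quantity"] = out["Quantity"] + np.random.randint(-100, 101, size=len(out))

out.to_csv("Extrapolated_data.csv", sep=",", index=False)
```

The random jitter and the date arithmetic are each a single NumPy/pandas call over the whole column, so the runtime is dominated by the final to_csv write rather than the Python loop.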