What i'm trying to do in this notebook is reading the file "updated_coffee_sales_data.xls" with Pandas, cleaning the file, and then exporting it into another excel file. This work with Notebooks but I want to slowly learn how to do this in Python scripts.
I tried just normal export and removed all the #%% markdowns and all, when I try to run, it does now work
# # **DATA CLEANING, WRANGLING, AND MANIPULATION**
#
# **(Before Data Visualization)**
# ## Importing Pandas and creating DataFrames
#
# First I import necessary packages and create a DataFrame for each sheet.
import pandas as pd
import os
pwd = os.getcwd()
df1 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='transactions')
df2 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='employees')
df3 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='coffee')
# Then I check to see if there are any columns with mismatched Dtypes
df1.info()
df2.info()
df3.info()
# ## Merging tables
#
# Everything looks fine so I can proceed.
#
# In this case, I will join all of the tables together to create one DataFrame.
#
# The first step to achieving this is to rename the columns and make the match across all DataFrames.
df1 = df1.rename(columns=
{'Transaction ID' : 'trn_id', 'Emp_id' : 'emp_id',
'Employee Name' : 'emp_name', 'Date and Time' : 'date_time',
'Purchased Coffee Type' : 'coffee_type', 'Purchase Quantity' : 'purchase_qty',
'Payment Method' : 'pay_method', 'Customer Name' : 'customer_name'}
)
df2 = df2.rename(columns=
{'Emp_id' : 'emp_id', 'Emp_Name' : 'emp_name',
'Job Title' : 'job_title', 'Hourly Salary' : 'hr_salary'}
)
df3 = df3.rename(columns=
{'Coffee type' : 'coffee_type'}
)
# Now that everything is named properly, I can merge the three DataFrames with no problems.
df_merged = df1.merge(df2, on='emp_id', how='inner').merge(df3, on='coffee_type', how='inner')
df_merged.head()
# ## Data cleaning.
#
# Now it is finally time to clean the data.
#
# From what I can see, there are a few changes that I can make:
# 1. Shorten the transaction IDs to the first 8 string.
# 2. Drop 1 employee name column 'emp_name_y" and renaming "emp_name_x".
# 3. Separate the date and time then drop the column 'date_time'
# 4. Rearrange the columns.
# 5. Finishing touches.
df_cleaning = df_merged.copy() # Always best practice to create new copies of your dataframes.
df_cleaning['trn_id'] = df_cleaning['trn_id'].str[:8] # Shorten the transaction IDs to the first 8 string.
df_cleaning = df_cleaning.rename(columns={'emp_name_x' : 'emp_name'}) # Drop 1 employee name column 'emp_name_y" and renaming "emp_name_x".
df_cleaning = df_cleaning.drop(columns=['emp_name_y'])
df_cleaning['date'] = df_cleaning['date_time'].dt.date # Separate the date and time then drop the column 'date_time'
df_cleaning['time'] = df_cleaning['date_time'].dt.time
df_cleaning = df_cleaning.drop(columns=['date_time'])
df_cleaning.head()
# Rearrange the columns.
reorder = [
'trn_id', 'emp_id', 'hr_salary', 'job_title', 'emp_name',
'date', 'time', 'coffee_type', 'purchase_qty', 'price',
'total_price', 'pay_method', 'customer_name'
]
df_reorder = df_cleaning[reorder]
df_reorder.head()
df_reorder.info()
# Upon inspection, I noticed that date is Dtype Objects which will be a problem if not fixed.
#
# Let me fix that.
df_dt = df_reorder.copy() # Always best practice to create new copies of your dataframes.
df_dt['date'] = pd.to_datetime(df_dt['date'], format='%Y-%m-%d')
df_dt.info()
df_dt.head()
# ## Exporting to CSV
#
# Now that the data is clean and ready for visualization.
# I am going to export it into a csv file.
df_final = df_dt.copy()
df_final.to_csv('updated_coffee_sales_cleanedPY.csv', index=False)
I try running the import and the pwd = os.getcwd() in an interactive notebook first then try to run it again still did not work.
I can't copy the error in the terminal but what it's saying is:
[Errno 2]No such file or directory: 'D:\\Admin Files\\Desktop\DataS Proj\\Pproj_Coffee'
I found the issue... Python had a problem with locating file because of who I told it to locate the path:
import pandas as pd
import os
pwd = os.getcwd()
df1 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='transactions')
df2 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='employees')
df3 = pd.read_excel(pwd + r'\updated_coffee_sales_data.xlsx', sheet_name='coffee')
What I should've done is to use "os.path.join" and the best solution is to also make that entire file path into a variable.
import os
import pandas as pd
# Get the current working directory
pwd = os.getcwd()
# Construct file paths
file_path = os.path.join(pwd, 'updated_coffee_sales_data.xlsx')
# Read Excel files
df1 = pd.read_excel(file_path, sheet_name='transactions')
df2 = pd.read_excel(file_path, sheet_name='employees')
df3 = pd.read_excel(file_path, sheet_name='coffee')
This solution works for me. (I found another problem when I exported the file but fixed it using the same concept of actually defining the file path)