Search code examples
pythonexcelpandasxlsxxlsxwriter

Multi tab Excel sheets, unique entries in 1 column, create new file with data from another column as the name, all with headers


As usual, I've bitten off more than I can chew. I have a file, "list.xlsx". The file has 3 sheets, "current students", "finished" and "cancelled". The sheets all contain data under the following headers [StudentId, FirstName, Lastname, DoB, Nationality, CourseID, CourseName, Startdate, Finishdate, UnitID, UnitName, UnitCompetency]

I've created the below abomination that starts off what I need.

What I want it to do is:

1) Create a file with FirstName + Lastname.xlsx based off of the StudentId (unique) in the folder named after their sheet

2) In that file, take all information from the rest of the columns and append it to their file

    #python 3.8
import pandas as pd
import os
import shutil

file = "list.xlsx"
CS = "current student"
Fin = "finished"
Can = "cancelled"
TheList = {CS, Fin, Can}
CanXlsx = pd.read_excel(file, sheet_name = Can)
FinXlsx = pd.read_excel(file, sheet_name = Fin)
CSXlsx = pd.read_excel(file, sheet_name = CS)

if os.path.exists(CS):
    shutil.rmtree(CS)
os.mkdir(CS)
CSDir = '//current student//'
if os.path.exists(Fin):
    shutil.rmtree(Fin)
os.mkdir(Fin)
FinDir = '//finished//'
if os.path.exists(Can):
    shutil.rmtree(Can)
os.mkdir(Can)
CanDir = '//cancelled//'

CancelID = CanXlsx.StudentId.unique()
FinID = FinXlsx.StudentId.unique()
CSID = CSXlsx.StudentId.unique()

I thought I was getting better with for loops and the like, but can't seem to get my head around them. I can think about the logic, but it just doesn't come through with the code.

https://drive.google.com/file/d/134fqWx6veF7zp_12GqFYlbmPZnK8ihaV/view?usp=sharing


Solution

  • I think the approach needed for this is to create 3 data frames (probably is possible to do it with one, but I don't remember). 1) Then, on each data frame, you will need to extract a list of "First Name + Last Name", and after that, 2) you will need to create masks on the data frames to extract the information and save it.

    import os
    import shutil
    
    file = "list.xlsx"
    CS = "current student"
    Fin = "finished"
    Can = "cancelled"
    TheList = {CS, Fin, Can}
    CanXlsx = pd.read_excel(file, sheet_name = Can)
    FinXlsx = pd.read_excel(file, sheet_name = Fin)
    CSXlsx = pd.read_excel(file, sheet_name = CS)
    
    ## File Creation
    if os.path.exists(CS):
        shutil.rmtree(CS)
    os.mkdir(CS)
    CSDir = '//current student//'
    if os.path.exists(Fin):
        shutil.rmtree(Fin)
    os.mkdir(Fin)
    FinDir = '//finished//'
    if os.path.exists(Can):
        shutil.rmtree(Can)
    os.mkdir(Can)
    CanDir = '//cancelled//'
    
    # Create full names
    CanXlsx["Fullname"] = CanXlsx["StudentId"] + "_" + CanXlsx["First Name"] + "_" + CanXlsx["Last Name"]
    ## Same for the other dfs
    
    # Get a list of ids
    # canFullNames = list(CanXlsx["Fullname"]) Edit: Preferred approach through student Ids
    canIds = list(CanXlsx["StudentId"])
    ## Same for the other dfs
    
    # Loop over the list of full names to create your df
    for id in canIds:
        df1 = CanXlsx[CanXlsx["StudenId"] == id] # This will filter the rows by the id you want
        # Retrieve the full name
        name = df1.iloc[0]["Fullname"]
    
        # Create the filename
        filename = os.path.join(CanDir,name + ".xlsx")
    
        df1.drop(columns = ["First Name", "Last Name"] # I understand that these columns are not required on each file
        df1.to_excel(filename,header=True,index=False)
    
    ## Same for the other dfs
    

    Let me know if this helps, at least this is what I understood you want to achieve with your code. :D