Search code examples
pythonexcelcsvspyder

How to use python to seperate a one column CSV file if the columns have no headings, then save this into a new excel file?


So, I am quite new to python and have been googling a lot but have not found a good solution. What I am looking to do is automate text to columns using python in an excel document without headers.

Here is the excel sheet I have

it is a CSV file where all the data is in one column without headers

ex. hi ho loe time jobs barber jim joan hello

009 00487 08234 0240 2.0348 20.34829

delimeter is space and comma

What I want to come out is saved in another excel with the first two rows deleted and seperated into columns ( this can be done using text to column in excel but i would like to automate this for several excel sheets)

009 | 00487 | 08234 | 0240 | 2.0348 | 20.34829

the code i have written so far is like this:

    import pandas as pd
    import csv


    path = 'C:/Users/ionan/OneDrive - Universiteit Utrecht/Desktop/UCU/test_excel'

    os.chdir(path)

    for root, dirs, files in os.walk(path):


        for f in files:

            df = pd.read_csv(f, delimiter='\t' + ';', engine = 'python') 

Solution

  • Original file with name as data.xlsx:

    enter image description here

    This means all the data we need is under the column Data.

    Code to split data into multiple columns for a single file:

    import pandas as pd 
    import numpy as np 
    
    f = 'data.xlsx'
    
    # -- Insert the following code in your `for f in files` loop -- 
    file_data = pd.read_excel(f) 
    
    # Since number of values to be split is not known, set the value of `num_cols` to
    # number of columns you expect in the modified excel file
    num_cols = 20
    
    # Create a dataframe with twenty columns 
    new_file = pd.DataFrame(columns = ["col_{}".format(i) for i in range(num_cols)])
    
    # Change the column name of the first column in new_file to "Data"
    new_file = new_file.rename(columns = {"col_0": file_data.columns[0]})
    
    # Add the value of the first cell in the original file to the first cell of the 
    # new excel file
    new_file.loc[0, new_file.columns[0]] = file_data.iloc[0, 0]
    
    # Loop through all rows of original excel file
    for index, row in file_data.iterrows():
    
        # Skip the first row
        if index == 0:
            continue
    
        # Split the row by `space`. This gives us a list of strings.
        split_data = file_data.loc[index, "Data"].split(" ")
        print(split_data)
    
        # Convert each element to a float (a number) if we want numbers and not strings
        # split_data = [float(i) for i in split_data]
    
        # Make sure the size of the list matches to the number of columns in the `new_file` 
        # np.NaN represents no value. 
        split_data = [np.NaN]  + split_data + [np.NaN] * (num_cols - len(split_data) - 1)
    
        # Store the list at a given index using `.loc` method
        new_file.loc[index] = split_data
    
    # Drop all the columns where there is not a single number
    new_file.dropna(axis=1, how='all', inplace=True)
    
    # Get the original excel file name
    new_file_name = f.split(".")[0]
    
    # Save the new excel file at the same location where the original file is. 
    new_file.to_excel(new_file_name + "_modified.xlsx", index=False)
    

    This creates a new excel file (with a single sheet) of name data_modified.xlsx:

    enter image description here

    Summary (code without comments):

    import pandas as pd 
    import numpy as np 
    
    f = 'data.xlsx'
    
    file_data = pd.read_excel(f) 
    
    num_cols = 20
    new_file = pd.DataFrame(columns = ["col_{}".format(i) for i in range(num_cols)])
    new_file = new_file.rename(columns = {"col_0": file_data.columns[0]})
    new_file.loc[0, new_file.columns[0]] = file_data.iloc[0, 0]
    
    for index, row in file_data.iterrows():
    
        if index == 0:
            continue
    
        split_data = file_data.loc[index, "Data"].split(" ")
        split_data = [np.NaN]  + split_data + [np.NaN] * (num_cols - len(split_data) - 1)
        new_file.loc[index] = split_data
    
    new_file.dropna(axis=1, how='all', inplace=True)
    new_file_name = f.split(".")[0]
    new_file.to_excel(new_file_name + "_modified.xlsx", index=False)