Search code examples
pythonfunctionappend

Python function to merge columns into one column


I am trying to build a function, to apply to different columns with a csv file. The purpose of the function is to combine multiple columns into one, using the combine_first method described here: How to merge/combine columns in pandas?

I like to create a function, with parameters: csvName, ColName, NumberOfCol, and generate a new (combined) column for that csv file. Because there are different sets of columns that need to be grouped (and there are multiple csv files), I like to use a function on each set of columns.

Example columns (showing only the ones that need to be merged, the DF has many other columns that I need to retain):

DF = {'A1': [tree, grass, np.nan, np.nan, np.nan, np.nan],  
    'A2': [np.nan, np.nan, bird, fly, np.nan, np.nan],
    'A3': [np.nan, np.nan, np.nan, np.nan, cat, dog]}

DF["A"] = pd.concat([DF["A1"].\
                           combine_first(DF["A2"]).\
                               combine_first(DF["A3"])], 
            axis=1)
DF.rename(columns={"A1":"A"}, inplace=True)

The resulted column will become: DF = {'A': [tree, grass, bird, fly, cat, dog]}

I like to make this operation into a function, so I can enter the name of the csv, the column name (e.g., A), and number of columns that need to be merged

I have something like this at the moment:

   def column_merger(csvName, ColName, length:int):
        OriginalNameList = []
        for number in range(length):
            OriginalNameList.append(ColName + str(number))
    
        NewList = OriginalNameList[1:]

        for name in Newlist:
            NewList.append()
        command = []
        command.extend([.combine_first(SurveyName[NewList[name]]) for name in range(NewList)])

But it is not working. Any suggestion very much welcomed! Thank you!


Solution

  • I think this approach does what you're looking for, but note that I had to turn DF into a DataFrame object, since I assume that's what it's supposed to be

    import numpy as np
    import pandas as pd
    
    
    DF = pd.DataFrame({
        'A1': ['tree', 'grass', np.nan, np.nan, np.nan, np.nan],
        'A2': [np.nan, np.nan, 'bird', 'fly', np.nan, np.nan],
        'A3': [np.nan, np.nan, np.nan, np.nan, 'cat', 'dog']
    })
    
    # use 'stack()' to combine the columns, dropping the nan values
    # use 'reset_index()[0]' to get the desired data out of the stack as a Series
    # and then use 'to_frame('A')' to convert that series back into a DataFrame 
    # with the desired column name 'A'
    DF = DF.stack().reset_index()[0].to_frame('A')
    print(DF)
    
    #        A
    # 0   tree
    # 1  grass
    # 2   bird
    # 3    fly
    # 4    cat
    # 5    dog
    

    Note that you can do each of those steps separately (without chaining them) if you want to see the intermediate results

    DF = DF.stack()
    print(DF)
    DF = DF.reset_index()[0]
    print(DF)
    DF = DF.to_frame('A')
    print(DF)  # final result
    

    I should point out, however, that there are pending implementation changes to the stack function which will alter its behavior in Pandas 3.0 (NB: at time of writing, Pandas is currently at v2.2). You can read more about this in the Pandas 2.1 release notes