I am trying to build a function, to apply to different columns with a csv file. The purpose of the function is to combine multiple columns into one, using the combine_first method described here: How to merge/combine columns in pandas?
I like to create a function, with parameters: csvName, ColName, NumberOfCol, and generate a new (combined) column for that csv file. Because there are different sets of columns that need to be grouped (and there are multiple csv files), I like to use a function on each set of columns.
Example columns (showing only the ones that need to be merged, the DF has many other columns that I need to retain):
DF = {'A1': [tree, grass, np.nan, np.nan, np.nan, np.nan],
'A2': [np.nan, np.nan, bird, fly, np.nan, np.nan],
'A3': [np.nan, np.nan, np.nan, np.nan, cat, dog]}
DF["A"] = pd.concat([DF["A1"].\
combine_first(DF["A2"]).\
combine_first(DF["A3"])],
axis=1)
DF.rename(columns={"A1":"A"}, inplace=True)
The resulted column will become:
DF = {'A': [tree, grass, bird, fly, cat, dog]}
I like to make this operation into a function, so I can enter the name of the csv, the column name (e.g., A), and number of columns that need to be merged
I have something like this at the moment:
def column_merger(csvName, ColName, length:int):
OriginalNameList = []
for number in range(length):
OriginalNameList.append(ColName + str(number))
NewList = OriginalNameList[1:]
for name in Newlist:
NewList.append()
command = []
command.extend([.combine_first(SurveyName[NewList[name]]) for name in range(NewList)])
But it is not working. Any suggestion very much welcomed! Thank you!
I think this approach does what you're looking for, but note that I had to turn DF
into a DataFrame
object, since I assume that's what it's supposed to be
import numpy as np
import pandas as pd
DF = pd.DataFrame({
'A1': ['tree', 'grass', np.nan, np.nan, np.nan, np.nan],
'A2': [np.nan, np.nan, 'bird', 'fly', np.nan, np.nan],
'A3': [np.nan, np.nan, np.nan, np.nan, 'cat', 'dog']
})
# use 'stack()' to combine the columns, dropping the nan values
# use 'reset_index()[0]' to get the desired data out of the stack as a Series
# and then use 'to_frame('A')' to convert that series back into a DataFrame
# with the desired column name 'A'
DF = DF.stack().reset_index()[0].to_frame('A')
print(DF)
# A
# 0 tree
# 1 grass
# 2 bird
# 3 fly
# 4 cat
# 5 dog
Note that you can do each of those steps separately (without chaining them) if you want to see the intermediate results
DF = DF.stack()
print(DF)
DF = DF.reset_index()[0]
print(DF)
DF = DF.to_frame('A')
print(DF) # final result
I should point out, however, that there are pending implementation changes to the stack
function which will alter its behavior in Pandas 3.0 (NB: at time of writing, Pandas is currently at v2.2). You can read more about this in the Pandas 2.1 release notes