Search code examples
pythonpandasconcatenationnan

Introducing Nan's when concatenating list of dataframe


I am looping through a data frame, doing some calculations, and then depending on some logic adding the original series or a transformed version. For a MRE, I'll leave out the transformation part.

    # List of Tuples
students = [('Ankit', 22, 'A'),
           ('Swapnil', 22, 'B'),
           ('Priya', 22, 'B'),
           ('Shivangi', 22, 'B'),
            ]

# Create a DataFrame object
stu_df = pd.DataFrame(students, columns =['Name', 'Age', 'Section'],
                      index =['1', '2', '3', '4'])

returnList = []

for i, (colname, series) in enumerate(stu_df.iteritems()):
    returnList.append(pd.DataFrame(series))

a = pd.concat(returnList)

the index is consistent across the series, so why is is not being recognized cleanly? a looks like this, but it should line up to look like the original data?:

index Name Age Section
1 Ankit NaN
2 Swapnil NaN NaN
3 Priya NaN NaN
4 Shivangi NaN NaN
1 NaN 22.0 NaN
2 NaN 22.0 NaN
3 NaN 22.0 NaN
4 NaN 22.0 NaN
1 NaN NaN A
2 NaN NaN B
3 NaN NaN B
4 NaN NaN B

Solution

  • You need to add axis=1 to pd.concat():

    a = pd.concat(returnList, axis=1)
    

    Output:

    >>> a
           Name  Age Section
    1     Ankit   22       A
    2   Swapnil   22       B
    3     Priya   22       B
    4  Shivangi   22       B
    

    Explanation

    By default, pd.concat tries to concatenate the dataframes vertically (axis=0), i.e., adding the second df to the end of the first, and the third to the end of that. But since all the Series object in returnList have different column names, pandas adds them to the end, and fills in the missing spaces with NaN:

    >>> pd.concat(returnList)
           Name   Age Section
    1     Ankit   NaN     NaN  <--- first df of returnList starts here
    2   Swapnil   NaN     NaN
    3     Priya   NaN     NaN
    4  Shivangi   NaN     NaN
    1       NaN  22.0     NaN  <--- second df of returnList starts here
    2       NaN  22.0     NaN
    3       NaN  22.0     NaN
    4       NaN  22.0     NaN
    1       NaN   NaN       A  <--- third df of returnList starts here
    2       NaN   NaN       B
    3       NaN   NaN       B
    4       NaN   NaN       B