Search code examples
pythonpandasdataframemerging-data

How can I join 2 rows in a dataframe into 1 row in a new one?


I am filtering an external data source that is formatted as an excel file. There is nothing I can do to change how the file is generated. I need to filter out rows that are useless and combine pairs of rows into one. The process I have so far works on the filtering, but not on the joining of related data in two successive rows into one row.

The dataframes are not converted nicely for stackoverflow, but I have hand-tweaked them below.

Data Transformations

Convert downloads into useful formats

import pandas as pd
from pandas          import DataFrame
from pandas.io.excel import read_excel
cpath = os.path.join (download_path, classes_report)
print (pd.__version__)

df = pd.read_excel (cpath, sheetname=0, header=None)
df.to_string()

0.14.1

0 1 2 3 4 5 0 Session: 2014-2015 NaN NaN NaN NaN NaN 1 Class Information Age Enrolled Key Room NaN 2 Math 10 12 / 18 03396 110 09:00:00 3 Teacher: Joe M Teacher NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN 6 NaN NaN NaN NaN 7 NaN NaN NaN NaN NaN NaN 8 NaN NaN NaN NaN NaN NaN 9 Number of Classes: 1 Number of Students: 12 / 18 NaN NaN NaN NaN 10 Class Information Ages Enrolled Key Room NaN 11 Art 18 - 80 3 / 24 03330 110 10:00:00 12 Teacher: John A Instructor NaN NaN NaN NaN 13 NaN NaN NaN NaN NaN NaN 14 NaN NaN NaN NaN 15 NaN NaN NaN NaN

# Eliminate any rows where first column is NaN, contains 'Number of Classes', 'Class Information'
# or is blank
# The 5th column is tuition.

cf = df[df[0].notnull ()][1:]
cf = cf [~cf[0].str.contains ('Number of Classes')]
bf = cf[~cf[0].isin ([' ', 'Class Information'])]
bf.to_string()

0 1 2 3 4 5 2 Math 10 12 / 18 03396 110 09:00:00 3 Teacher: Joe M Teacher NaN NaN NaN NaN 11 Art 18 - 80 3 / 24 03330 110 10:00:00 12 Teacher: John A Instructor NaN NaN NaN NaN

left  = DataFrame(bf.values [::2], index=bf.index[::2])
right = DataFrame(bf.values [1::2], index=bf.index[1::2])
pd.concat([left, right], axis=1).to_string ()

0 1 2 3 4 5 0 1 2 3 4 5 2 Math 10 12 / 18 03396 110 09:00:00 NaN NaN NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN NaN Teacher: Joe M Teacher NaN NaN NaN NaN 11 Art 18 - 80 3 / 24 03330 110 10:00:00 NaN NaN NaN NaN NaN NaN 12 NaN NaN NaN NaN NaN NaN Teacher: John A Instructor NaN NaN NaN NaN

The goal here was to have the last five columns of the "Math" row to contain the columns starting with "Teacher:", and similarly for the "Art" row, leaving a dataframe with two rows instead of four.


Solution

  • Your attempt to concat aligned the 2 df's by index thereby producing a disjointed df with 4 rows rather than 2 rows:

    right = DataFrame(bf.values [1::2], index=bf.index[1::2])
    

    The above creates a new df using the values from your df but you also took the index values also, seeing as the left and right df's have the same number of rows and you're want to concatenate them column-wise so that the indices align then you can just use the same index from the left df:

    right = DataFrame(bf.values [1::2], index=left.index)
    

    This will produce the desired concatenated df.