Search code examples
pythonpandasdataframemulti-index

Creating a multiindexed dataframe from other dataframe values with different indicies


I have a dataframe (dfA) composed from several files. dfA has three things, a date value which matches other date values in dfA, a name from a list of an unknown number of names that will be the same as other names in the dfA, and a concentration value which is unique. I want to create a new dataframe (dfB), where the rows are multiindexed by date, the columns are the names, and the values are the concentration from said name value. I've attempted to do this using the code below:

for name in nameList:
    dfB[str(name)] = dfA[dfA['Sample Name'] == str(name)]['Calculated Concentration']

However, I am returned with a dataframe where only the first row is filled, which I presume is due to the index values being different than those for the other column values:

              '5/0.5 uM'  '10/1 uM'  '15/1.5 uM'
 083021 14     4.7886      NaN        NaN
        15     4.5374      NaN        NaN 
        ... 
 090721 14     5.2840      NaN        NaN
        15     5.3050      NaN        NaN
        ...
 083121 57     5.2132      NaN        NaN
        58     4.8929      NaN        NaN
        ...

In the above output, the leftmost column is the multiindexed column of the values by date. The next column contains the indexes of the original dfA. Then '5/0.5 uM' is a value from 'name list' which contains values found from the original dfA with the 'Sample Name' '5/0.5 uM'.

How can I create dfB where the other columns have the correct values, or change the index values from dfA where they will match for each column?


Solution

  • You may find it easier down the line to reformat your dates to be datetime variables, rather than strings. With that said, set up test:

    dfA = pd.DataFrame({
        "date":["090721","083021","090721","083021","083121","083021","083121","083021","083121","083121"], 
        "name":["15/1.5 uM","15/1.5 uM","15/1.5 uM","15/1.5 uM","5/0.5 uM","5/0.5 uM","15/1.5 uM","5/0.5 uM","10/1 uM","10/1 uM"],
        "concentration":[0.430896003263987,0.354506173988912,0.679630142573769,0.816498937365697,0.134105578748551,0.364977945903234,0.602557291467607,0.633223692609049,2.59697726598929E-02,0.206455687221982]
    })
    dfA["index"] = dfA.index # copy index over to a column called index, for reasons that will be clearer soon.
    dfA
    

    starting test dataframe

    Note that there isn't necessarily a row for every combination of (date, name). To get what you've described:

    pd.pivot_table(dfA, values="concentration", index=["date", "index"], columns=["name"],aggfunc=np.mean)
    

    Output:

    pivotted but not aggregated

    You can see there are NaNs. These correspond to the "gaps" in the starting dataframe as described earlier. Note that we used an aggregation function, np.mean, but that doesn't make any difference, because we have chosen our re-indexing such that there will only be one value of concentration to "aggregate" in each row.

    You didn't ask for this, but you might also find it useful to take the mean of rows which have the same date and name? Simply drop the "index" from the multi-index specification:

    pd.pivot_table(dfA, values="concentration", index=["date"], columns=["name"],aggfunc=np.mean)
    

    Output:

    enter image description here