Search code examples
pythonpandassplit-apply-combine

Using split-apply-combine to remove some values with a customized function and combine what's left


So this isn't the dataset I need to work with but it's a template for a huge one I'm working with (~1.8 million data points) for a cancer research project, so I figured if I could get this to work with a smaller one, then I can adapt it for my large one! So as a sample, let's say I have the following data set:

import numpy as np
import pandas as pd
df = pd.DataFrame({
   'cond': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B','B', 'B', 'B', 'B', 'B','B','B'],
   'Array':  ['S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','S', 'S', 'TT', 'TT','SS','TT'],
   'X':  [1, 2, 3, 1, 2 , 3, 4, 7.3, 5.1, 3.2, 1.4, 5.5, 9.9, 3.2, 1.1, 3.3, 1.2, 5.4],
   'Y':  [3.1, 2.2, 2.1, 1.2,  2.4, 1.2, 1.5, 1.33, 1.5, 1.6, 1.4, 1.3, 0.9, 0.78, 1.2, 4.0, 5.0, 6.0],
   'Marker':  [2.0, 1.2, 1.2, 2.01, 2.55, 2.05, 1.66, 3.2, 3.21, 3.04, 8.01, 9.1, 7.06, 8.1, 7.9, 5.12, 5.23, 5.15],
   'Area': [3.0, 2.0, 2.88, 1.33,  2.44, 1.25, 1.53, 1.0, 0.156, 2.0, 2.4, 6.3, 6.9, 9.78, 10.2, 15.0, 16.0, 19.0]
})
print(df)

This produces an output that looks like this:

   cond Array    X     Y  Marker    Area
0     A     S  1.0  3.10    2.00   3.000
1     A     S  2.0  2.20    1.20   2.000
2     A    TT  3.0  2.10    1.20   2.880
3     A    TT  1.0  1.20    2.01   1.330
4     A     S  2.0  2.40    2.55   2.440
5     A     S  3.0  1.20    2.05   1.250
6     A    TT  4.0  1.50    1.66   1.530
7     A    TT  7.3  1.33    3.20   1.000
8     A     S  5.1  1.50    3.21   0.156
9     B     S  3.2  1.60    3.04   2.000
10    B    TT  1.4  1.40    8.01   2.400
11    B    TT  5.5  1.30    9.10   6.300
12    B     S  9.9  0.90    7.06   6.900
13    B     S  3.2  0.78    8.10   9.780
14    B    TT  1.1  1.20    7.90  10.200
15    B    TT  3.3  4.00    5.12  15.000
16    B    SS  1.2  5.00    5.23  16.000
17    B    TT  5.4  6.00    5.15  19.000

Ok so now what I need to do is to split them based on two labels, "cond" and "Array". I did that using

g=df.groupby(['cond','Array'])['Marker']

This breaks it into 4 smaller sets split as the pairings A-S, A-TT, B-S, B-TT. Now I have a customized function to work with. This is part of the function and I'll explain how it works:

def num_to_delete(p,alpha,N):
    if p==0.950:
        if 1-alpha==0.90:
            if N<=60:
                m=1
            if 60<N<80:
                m=round(N/20-2)
            if 80<=N:
                m=2
        if 1-alpha==0.95:
            if N<=80:
                m=1
            if 80<N<=100:
                m=round(N/20 -3)
            if 100<N:
                m=2
    return m

Ok so the way it works is that I feed into it a "p" and "alpha" that I pick (the real function covers many more cases of p and alpha). The N that gets fed into it is the the number of elements of my smaller data set (in this case for A-S it's 5, for A-TT it's 4, etc.). So what I'm trying to have happen is that for each smaller data set, spit out a number of points to delete (in this example, the function will always give us 1, but I'm trying to code this with the function for application to a super large data set). Since it gives the number 1, then I want it to delete the 1 largest data point for that set, and tell me what the highest point is left.

So as an example, for the A-S coupling, I have 5 data points: 2.0, 1.2, 2.55, 2.05, and 3.21. Since there's 5 data points, my function tells me to delete 1 of them, so ignore the 3.21, and tell me what's the highest data point left which in this case is 2.55. I want to do this for each coupling, but in my real data set, I will have different numbers of elements so the function will tell me to delete a different number for each coupling.

My ultimate goal is to have a final table that looks like this:

   cond Array   NumDeleted p95/a05  p95/a10       
0     A     S  1.0      2.55   2.55
1     A    TT  1.0      2.01   2.01
2     B     S  1.0      7.06   7.06
3     B    TT  1.0      8.01   8.01

For the larger set, the values in the last 2 columns will be different because in the large data set, there's a lot more difference in the number of values that will be deleted, and hence the remaining values will differ. I will eventually need to alter a second dataset based on the values I get for p95/a05 and p95/a10

Anyway, I'm sorry that was such a long explanation, but if anyone can help, that would be amazing! I'm hoping it's a rather simple thing to do since I've been stuck on this for over a week now.


Solution

  • EDIT: more general solution

    First, it would help to make a closure to define your configurations. This is under the assumption that you will have more configurations in the future:

    def create_num_to_delete(p, alpha):
        """Create a num_to_delete function given p and alpha."""
        def num_to_delete(N):
            if p == 0.950:
                if 1 - alpha == 0.90:
                    if N <= 60:
                        m = 1
                    if 60 < N < 80:
                        m = round(N/20 - 2)
                    if 80 <= N:
                        m = 2
                if 1-alpha == 0.95:
                    if N <= 80:
                        m = 1
                    if 80 < N <= 100:
                        m = round(N/20 -3)
                    if 100 < N:
                        m = 2
            return m
    
        return num_to_delete
    

    You can then use this closure to define a dictionary of configurations:

    configurations = {
        'p95/a05': create_num_to_delete(0.95, 0.05),
        'p95/a10': create_num_to_delete(0.95, 0.10),
    }
    

    Then, define a function that summarizes your data. This function should rely on your configuration so that it remains dynamic.

    def summarize(x):
        # The syntax on the right-hand side is called list comprehension.
        # As you can probably guess, it's essentially a flattened for-loop that
        # produces a list. The syntax starting with "for" is your basic for loop
        # statement, and the syntax to the left of "for" is an expression that
        # that serves as the value of the resulting list for each iteration
        # of the loop.
        #
        # Here, we are looping through the "num_to_delete" functions we defined in
        # our `configurations` dictionary. And calling it in our group `x`.
        Ns = [num_to_delete(len(x)) for num_to_delete in configurations.values()]
    
        markers = x['Marker'].sort_values(ascending=False)
    
        highest_markers = []
        for N in Ns:
            if N == len(x):
                highest_markers.append(None)
            else:
                # Since we know that `markers` is already sorted in descending
                # order, all we need to get the highest remaining value is to get
                # the value in the *complete list* of values offset by the 
                # the number of values that need to be deleted (this is `N`).
                #
                # Since sequences are 0-indexed, simply indexing by `N` is enough.
                # For example, if `N` is 1, indexing by `N` would give us
                # the marker value *indexed by* 1, which is,
                # in a 0-sequenced index, simply the second value.
                highest_markers.append(markers.iloc[N])
    
        # Returning a list from an applied groupby function translates into
        # a DataFrame which the series index as the columns and the series values
        # as the row values. Index in this case is just the list of configuration
        # names we have in the `configurations` dictionary.
        return pd.Series(highest_markers, index=list(configurations.keys()))
    

    Lastly, apply the function to your data set and reset the index. This keeps cond and Array as columns:

    grouped = df.groupby(['cond', 'Array'])
    grouped.apply(summarize).reset_index()
    

    Output is:

        cond    Array   p95/a05 p95/a10
    0   A   S   2.55    2.55
    1   A   TT  2.01    2.01
    2   B   S   7.06    7.06
    3   B   SS  NaN NaN
    4   B   TT  8.01    8.01
    

    Hope this helps.