Search code examples
pythonpandasloopslinear-regression

Consider every possible pair, remove NaN entries and obtain the linear regression


I'm working with Python for the first time so I've few difficulties.

I've a csv file of data with 6 columns and 20 rows, some entries are 'NaN'. What I'm trying to do is an interaction that would consider the first column vs the others (A vs B,C,D,E and F), then the second column with all the others (B vs C,D,E and F) etc (i.e compare every possible pair of the 6 columns), for each pair remove the NaN entries (so the number of rows for each final pair would be different) and calculate the linear regression.

I can obtain the results I'm looking for by considering each pair "manually", but since I've quite a lot of DataFrames for which I've to do this procedure, it'd take a lot of time, so I'm hoping someone would help me with a faster method.

Part of my data:

| A      | B      | C       | D    | E             | F            |

| 70.385 | 1316.0 | NaN     | 1.84 | 1.021059e+37  | 1.284026e+41 |

| 13.183 | 800.0  | 11549.0 | 1.66 | 4.710032e+35  | Nan          |

| 9.750  | NaN    | NaN     | 1.55 |  1.437108e+36 | 5.070657e+40 |

| 12.302 | NaN    | 547.7   | 1.56 | 2.149507e+36  | 2.294859e+40 |

| NaN    | 2784.2 | 29984.4 | 1.87 | NaN           | 2.294859e+40 |

what I'm able to do is:

import pandas as pd
from scipy.stats import linregress

df=data1.dropna(subset=['A','B']) 
lr_AB=linregress(df['A'],df['B'])

and what I'm trying to obtain is: LinregressResult(slope= , intercept=, rvalue=, pvalue=, stderr=) for each pair. So, how do I iterate these two commands for every possible pair of columns?

My attempt was this, but without much results:

for i in range(len(data1.columns)-1):
    if data1.iloc[:, :] is 'NaN':
        df= data1.dropna()
        print(df)
        lr= linregress(df.iloc[:, i], df.iloc[:, i+1])
    else:
        print('no nan') 
        lr1= linregress(data1.iloc[:, i], data1.iloc[:, i+1])

Thank you so much.


Solution

  • Assuming you'd like to use scipy.stats's linregress, here is a way to do this. Not much more efficient than your code, as it also loops over the combinations of columns:

    from scipy.stats import linregress
    from itertools import combinations
    
    for a, b in combinations(df, 2):
        ab = df[[a, b]].dropna()
        lr[(a, b)] = dict(zip(
            'slope intercept r_value p_value std_err'.split(),
            linregress(ab[a], ab[b])
        ))
        lr[(a, b)]['n_points'] = ab.shape[0]
    
    result = pd.DataFrame(lr).T
    

    Sample run on your example data -

    >>> result
                slope     intercept   r_value   p_value       std_err  n_points
    A B  9.020664e+00  6.810806e+02  1.000000  0.000000  0.000000e+00       2.0
      C  1.248729e+04 -1.530709e+05  1.000000  0.000000  0.000000e+00       2.0
      D  4.316155e-03  1.538532e+00  0.942012  0.057988  1.087230e-03       4.0
      E  1.503028e+35 -4.016931e+35  0.984424  0.015576  1.898085e+34       4.0
    B C  9.291100e+00  4.116120e+03  1.000000  0.000000  0.000000e+00       2.0
      D  8.706011e-05  1.647796e+00  0.789115  0.421079  6.776741e-05       3.0
      E  1.887517e+34 -1.462913e+37  1.000000  0.000000  0.000000e+00       2.0
    C D  1.061987e-05  1.547701e+00  0.998360  0.036468  6.090086e-07       3.0
      E -1.525732e+32  2.233071e+36 -1.000000  0.000000  0.000000e+00       2.0
    D E  2.896150e+37 -4.429182e+37  0.869114  0.130886  1.165450e+37       4.0
    A F  1.525539e+39  2.034718e+40  0.957112  0.187122  4.617784e+38       3.0
    B F -7.182537e+37  2.229248e+41 -1.000000  0.000000  0.000000e+00       2.0
    C F  0.000000e+00  2.294859e+40  0.000000  1.000000  0.000000e+00       2.0
    D F  1.097658e+41 -1.308991e+41  0.382452  0.617548  1.875151e+41       4.0
    E F  1.061306e+04  1.854240e+40  0.946081  0.210009  3.633822e+03       3.0