
Iterate over all value combinations in pairwise row comparisons in Python


I have a data frame of genomic bins in the following format. Each genomic range is represented as a row, and each cell value corresponds to the start of a bin.

        0       1       2       3      4      5    ...   522  

0    9248    9249     NaN     NaN     NaN    NaN   ...   NaN
1   17291   17292   17293   17294   17295    NaN   ...   NaN
2   18404   18405   18406   18407     NaN    NaN   ...   NaN

[69 rows x 522 columns]

As you can see, many rows are padded with NaN because some genomic ranges are shorter than others.

For every pair of rows (by index), I want all combinations of the values in the two rows. It would be fine, even preferable, if each pairwise result were stored as a separate data frame.

I want something like this:

0 - 1 Pairwise:
0      1
9248   17291
9248   17292
9248   17293
9248   17294
9248   17295
9249   17291
9249   17292
9249   17293
9249   17294
9249   17295
[10 rows x 2 columns]

0 - 2 Pairwise:
0       2
9248   18404
9248   18405
9248   18406
9248   18407
9249   18404
9249   18405
9249   18406
9249   18407
[8 rows x 2 columns]

I need every value combination for each pairwise row combination. I think I need to use itertools.product() to do this sort of thing but cannot figure out how to write the appropriate loop. Any help is greatly appreciated!


Solution

    Setup

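    import numpy as np
    import pandas as pd
    # pandas.tools.util was removed from later pandas releases. More recent
    # versions have kept this helper in pandas.core.reshape.util, a private
    # module whose location may change again.
    from pandas.core.reshape.util import cartesian_product as cp

    df = pd.DataFrame({'0': {0: 9248, 1: 17291, 2: 18404},
     '1': {0: 9249, 1: 17292, 2: 18405},
     '2': {0: np.nan, 1: 17293.0, 2: 18406.0},
     '3': {0: np.nan, 1: 17294.0, 2: 18407.0},
     '4': {0: np.nan, 1: 17295.0, 2: np.nan},
     '5': {0: np.nan, 1: np.nan, 2: np.nan},
     '522': {0: np.nan, 1: np.nan, 2: np.nan}})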
    

    Solution

    final = {}
    # Pair each row with every later row. For each pair, drop the NaN padding,
    # take the Cartesian product of the remaining values, and store the result
    # as an (n, 2) array under the key (row, other_row).
    for i in range(len(df)):
        for j in range(i + 1, len(df)):
            final[(i, j)] = np.r_[cp([df.iloc[i].dropna(), df.iloc[j].dropna()])].T
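
Since the question mentions itertools.product(), here is a minimal sketch of the same idea using only the standard library plus pandas, with each pair stored as its own data frame. The five-column sample frame and the name `pairs` are assumptions made for illustration, and the cast to int assumes every bin start is a whole number.

```python
from itertools import combinations, product

import numpy as np
import pandas as pd

# Sample frame mirroring the layout in the question (rows padded with NaN).
df = pd.DataFrame(
    [
        [9248, 9249, np.nan, np.nan, np.nan],
        [17291, 17292, 17293, 17294, 17295],
        [18404, 18405, 18406, 18407, np.nan],
    ]
)

pairs = {}  # hypothetical name: maps (row_i, row_j) -> DataFrame of value pairs
for i, j in combinations(df.index, 2):
    # Drop the NaN padding; the int cast assumes whole-number bin starts.
    left = df.loc[i].dropna().astype(int)
    right = df.loc[j].dropna().astype(int)
    # product() yields every (left value, right value) tuple, left-major.
    pairs[(i, j)] = pd.DataFrame(list(product(left, right)), columns=[i, j])

print(pairs[(0, 1)])
```

Each entry in `pairs` is a two-column frame whose column labels are the two row indices, which is close to the layout requested in the question. For 69 rows this builds 2346 frames, so keeping them in a dict keyed by the index pair is usually easier than naming each one.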
    

    Verification

    for k, v in final.items():
        print(k)
        print(v)
    
    (0, 1)
    [[  9248.  17291.]
     [  9248.  17292.]
     [  9248.  17293.]
     ..., 
     [  9249.  17293.]
     [  9249.  17294.]
     [  9249.  17295.]]
    (1, 2)
    [[ 17291.  18404.]
     [ 17291.  18405.]
     [ 17291.  18406.]
     ..., 
     [ 17295.  18405.]
     [ 17295.  18406.]
     [ 17295.  18407.]]
    (0, 2)
    [[  9248.  18404.]
     [  9248.  18405.]
     [  9248.  18406.]
     ..., 
     [  9249.  18405.]
     [  9249.  18406.]
     [  9249.  18407.]]
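
If relying on a private pandas helper is a concern, the (0, 1) block printed above can be reproduced with plain NumPy. This is only a sketch of one way to build a two-array Cartesian product, not necessarily how the pandas helper is implemented; the function name `pairwise_product` is made up for this example.

```python
import numpy as np


def pairwise_product(a, b):
    """Return a (len(a) * len(b), 2) array holding every (a_i, b_j) pair."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Repeat each element of a len(b) times and tile b len(a) times, so the
    # two columns line up as a Cartesian product with a varying slowest.
    return np.column_stack([np.repeat(a, len(b)), np.tile(b, len(a))])


# Non-NaN values of rows 0 and 1 from the sample frame.
row0 = [9248.0, 9249.0]
row1 = [17291.0, 17292.0, 17293.0, 17294.0, 17295.0]

result = pairwise_product(row0, row1)
print(result)
```

The rows come out in the same order as the verification output: all pairs for 9248 first, then all pairs for 9249.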