Search code examples
python-3.xpandasdataframepython-polars

Get row based on lowest value in one column for each item in another column in Python Dataframe


My question is an extension of this question.

I have a dataframe of genetic variants tested for association with genes. Each gene can have multiple variants associated with it. There are 77 million rows. I want the best variant for each gene, ie with the lowest p value.

How can I best approach this? Should I first create a list of all unique genes and next get the best variant based on the p value column?

Here is a representative sample of the data:

    EnsemblID   VariantID   tss_distance    CAF_eQTL    ma_samples  ma_count    pval_nominal    Beta    SE  chromosome  ... Source  AverageMaximumPosteriorCall Info    AA_N    AB_N    BB_N    TotalN  MAF MissingDataProportion   HWE_P
0   ENSG00000187634 1:693731    -240519 0.134185    153 168 0.881492    0.005953    0.039917    1   ... HRCr11  0.902046    0.624501    1637.01 451.103 35.569  2124    0.122957    0.000075    0.544841
1   ENSG00000187634 1:714596    -219654 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.968278    0.577564    1987.73 135.495 0.677   2124    0.032217    0.000024    0.272504
2   ENSG00000187634 1:715367    -218883 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.975207    0.671960    1976.85 146.512 0.556   2124    0.034753    0.000019    0.176551
3   ENSG00000187634 1:717485    -216765 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.975183    0.670147    1978.35 145.041 0.517   2124    0.034388    0.000022    0.174817
4   ENSG00000187634 1:720381    -213869 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.974476    0.667536    1973.72 149.503 0.691   2124    0.035521    0.000020    0.110452

...

51006816    ENSG00000176383 12:121614338    -591979 0.398562    392 499 0.188786    0.035313    0.026839    12  ... 1000Gp3v5   0.998417    0.997017    782.742 992.043 349.155 2124    0.397929    0.000014    0.257526
8315020 ENSG00000138036 2:42957539  -834486 0.608626    385 490 0.099884    0.023363    0.014176    2   ... HRCr11  0.986869    0.976969    370.833 958.869 794.226 2124    0.400328    0.000017    0.006637
4355839 ENSG00000176444 1:155829511 558831  0.272364    295 341 0.139756    -0.032983   0.022305    1   ... HRCr11  0.992841    0.985842    1081.700    853.253 188.994 2124    0.289847    0.000012    0.291429
53045255    ENSG00000092199 14:22160099 920783  0.278754    302 349 0.600446    -0.008187   0.015622    14  ... HRCr11  0.980246    0.959707    1119.690    845.517 158.721 2124    0.273775    0.000017    0.956476
27329024    ENSG00000146085 6:50332764  885957  0.112620    139 141 0.832515    -0.009139   0.043195    6   ... HRCr11  0.972682    0.887525    1662.300    444.701 16.958  2124    0.112671    0.000011    0.021358

I would expect then something like this:

    EnsemblID   VariantID   tss_distance    CAF_eQTL    ma_samples  ma_count    pval_nominal    Beta    SE  chromosome  ... Source  AverageMaximumPosteriorCall Info    AA_N    AB_N    BB_N    TotalN  MAF MissingDataProportion   HWE_P
1   ENSG00000187634 1:714596    -219654 0.038339    48  48  0.834746    -0.015369   0.073635    1   ... HRCr11  0.968278    0.577564    1987.73 135.495 0.677   2124    0.032217    0.000024    0.272504

...

51006816    ENSG00000176383 12:121614338    -591979 0.398562    392 499 0.188786    0.035313    0.026839    12  ... 1000Gp3v5   0.998417    0.997017    782.742 992.043 349.155 2124    0.397929    0.000014    0.257526
8315020 ENSG00000138036 2:42957539  -834486 0.608626    385 490 0.099884    0.023363    0.014176    2   ... HRCr11  0.986869    0.976969    370.833 958.869 794.226 2124    0.400328    0.000017    0.006637
4355839 ENSG00000176444 1:155829511 558831  0.272364    295 341 0.139756    -0.032983   0.022305    1   ... HRCr11  0.992841    0.985842    1081.700    853.253 188.994 2124    0.289847    0.000012    0.291429
53045255    ENSG00000092199 14:22160099 920783  0.278754    302 349 0.600446    -0.008187   0.015622    14  ... HRCr11  0.980246    0.959707    1119.690    845.517 158.721 2124    0.273775    0.000017    0.956476
27329024    ENSG00000146085 6:50332764  885957  0.112620    139 141 0.832515    -0.009139   0.043195    6   ... HRCr11  0.972682    0.887525    1662.300    444.701 16.958  2124    0.112671    0.000011    0.021358


Solution

  • Here is a suggestion using Polars. I assume that your data is saved in the file data.parquet and EnsemblID is the gene id column.

    import polars as pl
    
    (
        pl.scan_parquet('data.parquet')
        .with_columns(pl.col('pval_nominal').min().over('EnsemblID').alias('min_pval'))
        .filter(pl.col('pval_nominal') == pl.col('min_pval'))
        .collect()
    )
    

    In the with_columns we compute the smallest p value for each gene and afterwards keep the row(s) where the p value is equal to the smallest.