python pandas bin

Fill bins with no coverage with 0


I need to generate a heatmap of the average coverage of the positions within each bin, using a fixed number of bins regardless of how many bases a transcriptome has. In other words, if I want 10 bins, one transcriptome may have 1000 bases to distribute among the 10 bins, and another may have 2445 bases to distribute among the same 10 bins.
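
As a rough illustration of the fixed-bin idea (not the code used in this question), evenly spaced edges can be computed for any sequence length. The helper name `bin_edges` is hypothetical, and NumPy is assumed to be available:

```python
import numpy as np

def bin_edges(start, end, n_bins=10):
    """Return n_bins + 1 evenly spaced edges from start to end (hypothetical helper)."""
    return np.linspace(start, end, n_bins + 1)

# Two sequences of different lengths, both split into 10 bins
edges_short = bin_edges(0, 1000)   # bins are 100 bases wide
edges_long = bin_edges(0, 2445)    # bins are 244.5 bases wide
```

Either array can be passed as `bins=` to `pd.cut`, so every sequence ends up with the same number of bins.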

The problem is that my coverage file has gaps, so some bins receive no positions at all. For example, if I want 5 bins over 10 positions, I'll have: (0,2], (2,4], (4,6], (6,8], (8,10]. If my positions with coverage are 1, 5, 5, 5, 7, 7, 10, the bin "(2,4]" will be missing from the result and therefore won't appear in the heatmap. What I want is for these bins without coverage to be filled with 0s so that they appear in the heatmap.
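
The gap can be reproduced in a few lines. This is a minimal sketch using the positions from the example above; the variable names are only for illustration:

```python
import pandas as pd

positions = pd.Series([1, 5, 5, 5, 7, 7, 10])
edges = [0, 2, 4, 6, 8, 10]

# pd.cut assigns each position to a right-closed interval
binned = pd.cut(positions, bins=edges)

# All five bins exist as categories, but only four occur in the data
n_defined = len(binned.cat.categories)
n_observed = binned.nunique()

# Counting per category keeps the empty bin, with a count of 0
counts = binned.value_counts().sort_index()
# counts per bin in order: 1, 0, 3, 2, 1
```

So the empty bin is still known to pandas as a category; it disappears only because no row carries it.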

I'm using Python with the pandas, seaborn and matplotlib.pyplot libraries.

In the image below, the first line shows the edge positions of my bins, and the dataframe shows which bins have coverage: (image not available)

Input example:

chr17   1   1
chr17   5   1
chr17   5   2
chr17   5   2
chr17   7   1
chr17   7   5
chr17   10  1

Problem:

    chr                data_bin        avg
  chr17                   (0,2]          1
  chr17                   (4,6]       1.66
  chr17                   (4,6]       1.66
  chr17                   (4,6]       1.66
  chr17                   (6,8]          3
  chr17                   (6,8]          3
  chr17                  (8,10]          1

Expected:

    chr                data_bin        avg
  chr17                   (0,2]          1
  chr17                   (2,4]          0   <- bin without coverage, filled with 0
  chr17                   (4,6]       1.66
  chr17                   (4,6]       1.66
  chr17                   (4,6]       1.66
  chr17                   (6,8]          3
  chr17                   (6,8]          3
  chr17                  (8,10]          1

The function I am using is:

import pandas as pd

def bins_calculator(path_txt: str, start: int, end: int):
    column_names = ["chr", "pos", "cov"]
    data = pd.read_csv(path_txt, names=column_names, sep="\t")
    # Build 11 edges (10 bins); the last edge is forced to `end`
    step = int((end - start) / 10)
    n_bins = [start + i * step for i in range(11)]
    n_bins[-1] = end
    data["data_bin"] = pd.cut(data["pos"], bins=n_bins)
    # Average coverage of the bin each position falls into
    data["avg"] = data.groupby("data_bin", observed=False)["cov"].transform("mean")
    filtered_data = data[["chr", "data_bin", "avg"]].drop_duplicates("data_bin")
    return filtered_data

If you have any questions about this problem, please let me know in the comments :)


Solution

  • IIUC you can use .merge to merge in the missing categories, then fill the resulting NaNs with the values you want:

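    import pandas as pd

    # df is the question's data with columns "chr", "pos", "cov"
    df["data_bin"] = pd.cut(df["pos"], range(0, 12, 2))

    # Outer-merge against the full list of bin categories so that
    # bins with no positions still get a row (NaN in the other columns)
    df = pd.merge(
        df,
        df["data_bin"].cat.categories.to_frame(),
        left_on="data_bin",
        right_on=0,
        how="outer",
    )[["chr", "data_bin", "cov"]]

    # Fill the gaps: copy the chromosome name, set missing coverage to 0
    df["chr"] = df["chr"].ffill().bfill()
    df["cov"] = df["cov"].fillna(0)

    df["avg"] = df.groupby("data_bin")["cov"].transform("mean")
    print(df)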
    

    Prints:

         chr     data_bin  cov       avg
    0  chr17   (0.0, 2.0]  1.0  1.000000
    1  chr17   (2.0, 4.0]  0.0  0.000000
    2  chr17   (4.0, 6.0]  1.0  1.666667
    3  chr17   (4.0, 6.0]  2.0  1.666667
    4  chr17   (4.0, 6.0]  2.0  1.666667
    5  chr17   (6.0, 8.0]  1.0  3.000000
    6  chr17   (6.0, 8.0]  5.0  3.000000
    7  chr17  (8.0, 10.0]  1.0  1.000000
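
If only one row per bin is needed for the heatmap, another option (a sketch, not part of the answer above) is to aggregate with `groupby(..., observed=False)`, which keeps empty categories, and then replace the resulting NaN means with 0. The sample frame below is rebuilt from the question's input, and the name `per_bin` is only illustrative:

```python
import pandas as pd

# Sample data rebuilt from the question's input example
df = pd.DataFrame({
    "chr": ["chr17"] * 7,
    "pos": [1, 5, 5, 5, 7, 7, 10],
    "cov": [1, 1, 2, 2, 1, 5, 1],
})
df["data_bin"] = pd.cut(df["pos"], bins=list(range(0, 12, 2)))

# observed=False keeps bins that no position falls into;
# their mean is NaN, which is then replaced by 0
per_bin = (
    df.groupby("data_bin", observed=False)["cov"]
    .mean()
    .fillna(0)
    .reset_index(name="avg")
)

# Assumes a single chromosome per file, as in the example
per_bin.insert(0, "chr", df["chr"].iloc[0])
```

This yields five rows, one per bin, with an `avg` of 0 for (2, 4]. Unlike the merge approach, it does not keep one row per input position.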