Search code examples
pandasbioinformaticsgenome

Compare two cols of one file with another file of same cols and fetch the matches_large dataset_14GB


I have file1 of 650,000 rows with two cols, "Chr" and "Pos". I want to compare this file with dbsnp (file2) datadump and match with with Chr and Pos col present in dbSNP dump. Once matched, respective rsid's to be fetched. I tried using Python Panda's but my process is getting killed. When it tried for 50000 rows it worked.

How can I fetch rsid for whole dataset (file1 = 650k rows) from dbSNP (file2)

#Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs
import pandas as pd
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')
df3 = pd.merge(df1, df2, on='Chr''Pos', how='inner')
export_csv = df3.to_csv (r'rsids_infiniumv2_hg38.txt', index = None, header=True)

Solution

  • As per Mohit's comment and reading through the Pandas 0.24.2 merge documentation, here is how I would go about it -

    # Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs
    
    # import pandas
    import pandas as pd
    
    # read in data files
    df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
    df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')
    
    # merge on matched columns 
    df3 = df1.merge(df2, on=['Chr', 'Pos'], how='inner')
    
    # export merged df to file
    export_csv = df3.to_csv (r'rsids_infiniumv2_hg38.txt', index = None, header=True)
    

    The on parameter in df.merge() takes in a single label or multiple labels as a list. Since you want to match on multiple columns, supplying a list of column names would work.

    Also, how is your process getting killed? Posting your error message would be more helpful.