I have file1 of 650,000 rows with two cols, "Chr" and "Pos". I want to compare this file with dbsnp (file2) datadump and match with with Chr and Pos col present in dbSNP dump. Once matched, respective rsid's to be fetched. I tried using Python Panda's but my process is getting killed. When it tried for 50000 rows it worked.
How can I fetch rsid for whole dataset (file1 = 650k rows) from dbSNP (file2)
#Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs
import pandas as pd
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')
df3 = pd.merge(df1, df2, on='Chr''Pos', how='inner')
export_csv = df3.to_csv (r'rsids_infiniumv2_hg38.txt', index = None, header=True)
As per Mohit's comment and reading through the Pandas 0.24.2 merge
documentation, here is how I would go about it -
# Program to compare Chr and Pos of a sample with dBSNP and fetching RSIDs
# import pandas
import pandas as pd
# read in data files
df1 = pd.read_csv("v2_infi_chr_pos.csv",sep='\t',dtype='unicode')
df2 = pd.read_csv("dbsnp150_header.txt",sep='\t',dtype='unicode')
# merge on matched columns
df3 = df1.merge(df2, on=['Chr', 'Pos'], how='inner')
# export merged df to file
export_csv = df3.to_csv (r'rsids_infiniumv2_hg38.txt', index = None, header=True)
The on
parameter in df.merge()
takes in a single label or multiple labels as a list. Since you want to match on multiple columns, supplying a list of column names would work.
Also, how is your process getting killed? Posting your error message would be more helpful.