I have a large file containing dbSNP IDs (1 million rows, each a single string) and an even larger .vcf file containing 600 million rows, each with 7-8 columns.
I want to find the first occurrence of each ID from the smaller file in the larger file, which makes the brute-force cost of my program 1,000,000 * 600,000,000 comparisons. I want a faster and less memory-intensive way of doing this. I'm new to multiprocessing and parallel programming in Python, and I'm not sure how to solve this without using one of them.
I've tried doing something like this for a smaller subset of both files, using the numpy and pandas libraries:
import numpy as np
import pandas as pd

# One element per row of each file
BigFile = pd.Series(arrayOfRowsOfBiggerFile)
SmallFile = pd.Series(arrayOfRowsOfSmallerFile)

# For each small-file entry, take the index of its first match in the big file
FinalList = SmallFile.map(lambda x: np.where(BigFile == x)[0][0]).tolist()
This takes forever to execute, and I'm sure it could be handled well with Python multiprocessing.
If I understood correctly, you're actually performing a join operation: you want all the rows in the VCF whose key (RSID in this case) appears in your "smaller" file. See the docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
And your code would look something like this:
# Read the big VCF, indexed by its rsid column
dbsnp = pd.read_csv('path/to/dbsnp', index_col='rsid', ...)
# Read the list of IDs you want to keep
rsids_of_interest = pd.read_csv('path/to/smaller_file', ...)
# An inner join keeps only the VCF rows whose rsid appears in the smaller file
subset_of_dbsnp = dbsnp.join(rsids_of_interest, how='inner', ...)
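To make this concrete and to keep memory bounded for a 600-million-row VCF, here is a minimal sketch that does the same inner-join-style selection with an isin filter while streaming the file in chunks. The paths, the assumption that the small file has one ID per line with no header, and the assumption that the VCF body has exactly the 8 fixed columns (CHROM through INFO) are mine, not from the question; adjust them to your actual layout.

import pandas as pd

# Assumed paths and layout -- adjust to your files.
SMALL_FILE = 'path/to/smaller_file'   # one rsID per line, no header (assumption)
VCF_FILE = 'path/to/bigfile.vcf'      # VCF with only the 8 fixed columns (assumption)

# The 8 fixed VCF columns; the ID column holds the rsIDs.
vcf_cols = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']

# Load the ~1M IDs once; a set gives O(1) membership tests.
wanted = set(pd.read_csv(SMALL_FILE, header=None, names=['rsid'])['rsid'])

matches = []
# Stream the VCF in 1M-row chunks so it never sits in memory all at once.
# comment='#' skips the ## meta lines and the #CHROM header line.
for chunk in pd.read_csv(VCF_FILE, sep='\t', comment='#', header=None,
                         names=vcf_cols, dtype=str, chunksize=1_000_000):
    matches.append(chunk[chunk['ID'].isin(wanted)])

subset = pd.concat(matches, ignore_index=True)
# Chunks arrive in file order, so keep='first' retains the first occurrence
# of each rsID in the big file.
first_hits = subset.drop_duplicates(subset='ID', keep='first')

This turns the 1,000,000 * 600,000,000 comparisons into a single pass over the VCF with hashed lookups, and it needs no multiprocessing; only the matching rows and the ID set are held in memory.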