python, numpy, multiprocessing, bioinformatics, vcf-variant-call-format

How to find elements of a large (1 million elements) array in another larger array (600 million elements)


I have a very large file of dbSNP IDs, containing 1 million rows with a single string per row, and an even larger .vcf file containing 600 million rows with 7-8 columns each.

I want to find the first occurrence of each row of the smaller file in the larger file, so the brute-force approach requires on the order of 1,000,000 × 600,000,000 comparisons. I want a faster and less memory-intensive way of doing this. I'm new to multiprocessing and parallel programming in Python, and I'm not sure whether I can solve this without using either.

I've tried doing something like this for a smaller subset of both files, using the numpy and pandas libraries:

import numpy as np
import pandas as pd

# Each Series holds one file's rows as strings.
BigFile = pd.Series(arrayOfRowsOfBiggerFile)
SmallFile = pd.Series(arrayOfRowsOfSmallerFile)

# For each small-file entry, find the index of its first match in the big file.
FinalList = SmallFile.map(lambda x: np.where(BigFile == x)[0][0]).tolist()

This takes forever to execute, and I'm sure it could be handled better with Python multiprocessing.


Solution

  • If I understood correctly, you're actually performing a join operation: you want all the rows in the VCF whose key (RSID in this case) appears in your "smaller" file. See docs here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html

    And your code would look something like this:

    # Read the VCF keyed by its RSID column, and the smaller file of RSIDs.
    # Both frames need the RSID as their index for the index-based join below
    # (assuming the smaller file also exposes an 'rsid' column).
    dbsnp = pd.read_csv('path/to/dbsnp', index_col='rsid', ...)
    rsids_of_interest = pd.read_csv('path/to/smaller_file', index_col='rsid', ...)

    # The inner join keeps only the dbSNP rows whose RSID appears in the smaller file.
    subset_of_dbsnp = dbsnp.join(rsids_of_interest, how='inner', ...)
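
    At 600 million rows the VCF will likely not fit in memory all at once, so you can stream it with read_csv's chunksize and join each chunk against the RSIDs. A minimal sketch, assuming the smaller file has one RSID per line with no header, the VCF is tab-separated with the RSID in its ID column, and the paths and column names below are placeholders:

    import pandas as pd

    # Load the ~1 million RSIDs once, indexed by RSID for fast index joins.
    rsids_of_interest = pd.read_csv('path/to/smaller_file',
                                    header=None, names=['rsid'],
                                    index_col='rsid')

    matches = []
    # Stream the 600-million-row VCF in chunks so it never has to fit in memory.
    for chunk in pd.read_csv('path/to/dbsnp', sep='\t', comment='#',
                             header=None,
                             names=['chrom', 'pos', 'rsid', 'ref', 'alt',
                                    'qual', 'filter', 'info'],
                             chunksize=1_000_000):
        # Inner join keeps only the rows whose RSID appears in the smaller file.
        matches.append(chunk.set_index('rsid').join(rsids_of_interest, how='inner'))

    subset_of_dbsnp = pd.concat(matches)

    This keeps memory usage proportional to the chunk size plus the matches found, rather than to the whole 600-million-row file.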