python pandas dataframe large-data

Merge df: no error, but output only header line


I am trying to merge a small dataframe (dfSmall) that can fit into memory with a huge dataframe (dfLarge) that can't fit in memory. They're both too big to post here but look something like:

dfSmall:
ix,#CHROM,POS,sample,allele,pop,super_pop
0,1,1121557,rs112904239,HG00096,T,GBR,EUR
1,1,1213223,rs113095492,HG00096,T,GBR,EUR
2,1,1000894,rs114006445,HG00096,T,GBR,EUR
(5000 rows)

dfLarge:
#CHROM POS      ID          REF ALT QUAL FILTER
1      14719    rs527865771 C   A   100 PASS   ...
1      14728    rs547701710 C   A   100 PASS   ...
1      1213223  rs113095492 A   G   100 PASS   ...
...
(>1 million rows, >2000 columns)

#for just these three rows, my output would be the row where #CHROM 1 and POS 1213223 match:
dfMerge:
#CHROM POS      ID          REF ALT QUAL FILTER
  1    1213223  rs113095492 A   G   100  PASS

Here's my code:

dfSmall = pd.read_table('small.csv', dtype='str', header=None, skiprows=1, names=['ix', '#CHROM', 'POS', 'ID', 'sample', 'allele', 'pop', 'superpop'])

def merge_it(c):
        return dfSmall.merge(c, on=['#CHROM', 'POS'], suffixes=('', '_y'))[header_line]

dfFull = pd.concat([merge_it(c) for c in pd.read_table('large.vcf.gz', header = None, names = header_line, dtype='str', engine = 'c',compression = 'gzip', skiprows=251, chunksize=40000, low_memory=False)])

match = re.search(r'ALL.(chr\d+)', chromosome)
dfFull.to_csv(r"{}.csv".format(match.group(1)))

where header_line = ['#CHROM','POS','ID','REF','ALT','QUAL','FILTER',..., 2500 strings]

When I run it, I get no errors, but my output file is only the header:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102     ...

I have manually checked a few of the entries, so I know there are rows from both files that visually match in both the #CHROM and POS columns.

I thought the problem of getting an output file with only the header might be because the column data types didn't match, which is why I explicitly set dtype='str'. However, checking the dtypes for dfLarge gives me dtype('O'), not str. Could they be mismatching on the #CHROM/POS columns because the dtypes are different? If that's not an issue, any other ideas?


Solution

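  • I think your problem comes from the way you parse your file: dfSmall is comma-separated, but pd.read_table splits on tabs by default, so each line of small.csv is read as a single field and the #CHROM and POS columns never hold values that can match dfLarge. Reading it with sep=',' (or pd.read_csv) should avoid this. Here is what I get once I have removed the commas: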

    df_m = pd.merge(dfSmall, dfLarge, on=['POS', 'CHROM'], how='inner')
    
    
    dfSmall
    Out[100]: 
       CHROM      POS       sample   allele pop super pop.1
    0      1  1121557  rs112904239  HG00096   T   GBR   EUR
    1      1  1213223  rs113095492  HG00096   T   GBR   EUR
    2      1  1000894  rs114006445  HG00096   T   GBR   EUR
    
    dfLarge
    Out[102]: 
       CHROM      POS           ID REF ALT  QUAL FILTER
    0      1    14719  rs527865771   C   A   100   PASS
    1      1    14728  rs547701710   C   A   100   PASS
    2      1  1213223  rs113095492   A   G   100   PASS
    
    df_m
    Out[103]: 
       CHROM      POS       sample   allele pop super pop.1           ID REF ALT  \
    0      1  1213223  rs113095492  HG00096   T   GBR   EUR  rs113095492   A   G   
    
       QUAL FILTER  
    0   100   PASS
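
To make the separator point concrete, here is a minimal sketch of the chunked merge with the small file read as comma-separated. The in-memory strings are hypothetical stand-ins for small.csv and the body of the VCF, header_line is cut down to seven columns, and the chunk size is tiny only so that the loop runs more than once. Merging each chunk against just the two key columns of dfSmall is a variation on the code in the question, not the asker's exact approach; it keeps the chunk's own columns and avoids the duplicate ID column.

```python
import io
import pandas as pd

# Hypothetical stand-ins for small.csv (comma-separated) and the VCF body (tab-separated).
small_csv = (
    "ix,#CHROM,POS,sample,allele,pop,super_pop\n"
    "0,1,1121557,rs112904239,HG00096,T,GBR,EUR\n"
    "1,1,1213223,rs113095492,HG00096,T,GBR,EUR\n"
    "2,1,1000894,rs114006445,HG00096,T,GBR,EUR\n"
)
large_tsv = (
    "1\t14719\trs527865771\tC\tA\t100\tPASS\n"
    "1\t14728\trs547701710\tC\tA\t100\tPASS\n"
    "1\t1213223\trs113095492\tA\tG\t100\tPASS\n"
)
header_line = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER']
small_names = ['ix', '#CHROM', 'POS', 'ID', 'sample', 'allele', 'pop', 'superpop']

# Diagnosis: read_table splits on tabs by default, so every comma-separated
# line ends up in a single column.
raw = pd.read_table(io.StringIO(small_csv), header=None, skiprows=1, dtype=str)
print(raw.shape)

# Fix: read the small file with a comma separator.
dfSmall = pd.read_csv(io.StringIO(small_csv), dtype=str, header=None,
                      skiprows=1, names=small_names)

# Only the key columns are needed to filter the large file.
keys = dfSmall[['#CHROM', 'POS']].drop_duplicates()

# Stream the large file in chunks and keep the rows whose keys match.
chunks = pd.read_csv(io.StringIO(large_tsv), sep='\t', header=None,
                     names=header_line, dtype=str, chunksize=2)
dfFull = pd.concat(
    [chunk.merge(keys, on=['#CHROM', 'POS'], how='inner') for chunk in chunks],
    ignore_index=True,
)
print(dfFull)
```

For the real files, the io.StringIO objects would be replaced by the file paths, with compression='gzip' and skiprows=251 on the large read as in the question.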