Search code examples
pandasdataframelarge-data

Pandas: compare huge dataframe with smaller dataframe


I have two dataframes I want to compare in pandas, one is too large to fit in memory and the other is smaller and does fit in memory.

dfSmall:
cat1    cat2
foo     bar
foo     tiger
foo     spam
bar     spam
(5000 rows)

dfLarge:
cat1    cat2    cat3
foo     dog     green
foo     tiger   blue
foo     snake   green
foo     bird    pink
bar     dog     orange
...
(>1 million rows)

I have used dask.dataframe (dd.merge), but it takes a long time and a lot of fiddling, and it seems inefficient as one of my df will fit into memory. I have also used pandas.read_table with a set chunksize, but only to modify the file itself, not to compare it to another file.

The additional complication is I want the output file to only keep rows that match two columns - cat1 and cat2. In the above example, the output would just be one line - foo cat, because that's the only line where both columns match. If matching two columns is not possible, I wouldn't mind a solution that can only match on one column, I'd just have to modify the dataframes to combine cat1/cat2 into one column.

dfOutput:
cat1    cat2    cat3
foo     tiger   blue

Solution

  • Using read_table with chunksize, say you first makes a function that merges a chunk with the smaller table:

    dfSmall = pd.read_table(small_path, ...)
    
    def merge_it(c):
        return dfSmall.merge(c, on=['cat1', 'cat2'], suffixes=('', '_y'))[['cat1', 'cat2', 'cat3']]
    

    Note that when you merge, pandas will add suffixes on the common columns. The code above says not to add suffixes to the columns of the chunk (that is '' in the tuple), and to take the columns 'cat1', 'cat2', 'cat3'.

    Then you can concatenate chunk merges of the larger ones, like this:

    pd.concat([merge_it(c) for c in pd.read_table(large_path, ..., chunksize=100000)])
    

    Note that for each chunk, you get the merge of it with the small DataFrame. To get the full result, you need to concatenate the results.