I have two dataframes I want to compare in pandas: one is too large to fit in memory, and the other is smaller and does fit in memory.
dfSmall:
cat1 cat2
foo  bar
foo  tiger
foo  spam
bar  spam
(5000 rows)
dfLarge:
cat1 cat2  cat3
foo  dog   green
foo  tiger blue
foo  snake green
foo  bird  pink
bar  dog   orange
...
(>1 million rows)
I have used dask.dataframe (dd.merge), but it takes a long time and a lot of fiddling, and it seems inefficient since one of my DataFrames fits in memory. I have also used pandas.read_table with a set chunksize, but only to modify a file in place, not to compare it to another file.
The additional complication is that I want the output file to keep only rows that match on two columns, cat1 and cat2. In the example above, the output would be just one line, foo tiger, because that's the only row where both columns match. If matching on two columns is not possible, I wouldn't mind a solution that can only match on one column; I'd just modify the dataframes to combine cat1/cat2 into one column (see the sketch after the expected output below).
dfOutput:
cat1 cat2  cat3
foo  tiger blue
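(If only single-column matching turned out to be workable, combining the two columns would look something like this; the key column name and the '|' separator are just illustrative choices, assuming '|' never occurs in the data.)

import pandas as pd

dfSmall = pd.DataFrame({'cat1': ['foo', 'foo', 'foo', 'bar'],
                        'cat2': ['bar', 'tiger', 'spam', 'spam']})
# build a single join key out of the two category columns
dfSmall['key'] = dfSmall['cat1'] + '|' + dfSmall['cat2']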
Using read_table with chunksize, first make a function that merges a chunk with the smaller table:
import pandas as pd

dfSmall = pd.read_table(small_path, ...)
def merge_it(c):
    # inner merge: keep only rows whose cat1/cat2 pair appears in dfSmall
    return dfSmall.merge(c, on=['cat1', 'cat2'], suffixes=('', '_y'))[['cat1', 'cat2', 'cat3']]
Note that when you merge, pandas adds suffixes to overlapping columns that are not merge keys. In the tuple above, '' means the columns of dfSmall get no suffix and '_y' would be appended to any overlapping column of the chunk; the final [['cat1', 'cat2', 'cat3']] keeps just those three columns.
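To see what a single chunk merge produces, here is a quick check of merge_it with the sample data from the question, building both frames in memory purely for illustration:

import pandas as pd

dfSmall = pd.DataFrame({'cat1': ['foo', 'foo', 'foo', 'bar'],
                        'cat2': ['bar', 'tiger', 'spam', 'spam']})
# a chunk shaped like those pd.read_table(..., chunksize=...) yields
chunk = pd.DataFrame({'cat1': ['foo', 'foo', 'foo', 'foo', 'bar'],
                      'cat2': ['dog', 'tiger', 'snake', 'bird', 'dog'],
                      'cat3': ['green', 'blue', 'green', 'pink', 'orange']})
print(merge_it(chunk))
#   cat1   cat2  cat3
# 0  foo  tiger  blue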
Then you can merge each chunk of the large file with the small table and concatenate the results, like this:
pd.concat([merge_it(c) for c in pd.read_table(large_path, ..., chunksize=100000)])
Note that each element of the list is the merge of one chunk with the small DataFrame; pd.concat stitches them together into the full result.
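Putting it all together, a sketch of the whole flow might look like this; the file paths and chunk size are placeholders, and writing with to_csv is just one way to persist the result:

import pandas as pd

dfSmall = pd.read_table('small.txt')  # placeholder path, tab-separated

def merge_it(c):
    # inner merge: keep only rows whose cat1/cat2 pair appears in dfSmall
    return dfSmall.merge(c, on=['cat1', 'cat2'],
                         suffixes=('', '_y'))[['cat1', 'cat2', 'cat3']]

result = pd.concat(
    [merge_it(c) for c in pd.read_table('large.txt', chunksize=100000)],
    ignore_index=True)  # avoid repeating each chunk's 0-based index
result.to_csv('output.txt', sep='\t', index=False)

If even the matched result were too large for memory, you could instead append each chunk's merge to the output file inside a loop (to_csv with mode='a' and header=False after the first write) rather than concatenating.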