Tags: python, pandas, memory, rapids

GPU vs CPU memory usage in RAPIDS


I understand that the GPU and CPU each have their own RAM, but what I don't understand is why the same dataframe, when loaded in pandas versus RAPIDS cuDF, shows drastically different memory usage. Can somebody explain?

[screenshots comparing the same dataframe's memory usage in pandas and in cuDF]


Solution

  • As noted in Josh Friedlander's comment, in cuDF the object data type is explicitly for strings. In pandas, object is the data type for strings but also for arbitrary/mixed data (lists, dicts, arrays, etc.). That difference explains the memory behavior in many scenarios, but not when both columns are strings (see the short sketch below).
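
    As a quick illustration of that difference, here is a minimal sketch (the specific values are made up for illustration): pandas happily stores a mixed-type object column, while cuDF, where object means string, refuses to construct one.

    import pandas as pd

    # In pandas, an "object" column can hold arbitrary, mixed Python objects.
    s = pd.Series([1, "a", [1, 2]], dtype="object")
    print(s.memory_usage(deep=True))  # deep=True counts the boxed Python objects

    # In cuDF, "object" means string only, so the same mixed column cannot
    # be constructed; the following would raise an error:
    # import cudf
    # cudf.Series([1, "a", [1, 2]])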

    Assuming both columns are strings, there is still likely to be a difference. In cuDF, a string column is represented as a single allocation for the raw characters, an associated null-mask allocation to handle missing values, and an associated allocation of row offsets, consistent with the Apache Arrow memory specification. So whatever these columns hold is almost certainly stored more compactly in this structure in cuDF than in pandas' default string representation, which boxes every value as a Python object.
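
    To make that layout concrete, here is a small sketch that builds an Arrow string array directly with pyarrow (assumed to be installed; cuDF follows the same layout) and checks the buffer arithmetic by hand:

    import pyarrow as pa

    strings = ["apple", "banana", "cherry"] * 1000

    # An Arrow string array is a character buffer plus int32 row offsets
    # (plus a validity bitmap, omitted here because there are no nulls).
    arr = pa.array(strings)

    chars = sum(len(s) for s in strings)  # 1 byte per ASCII character
    offsets = 4 * (len(strings) + 1)      # one int32 offset per row, plus one
    print(arr.nbytes, chars + offsets)    # the two numbers should match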

    The following example may be helpful:

    import cudf
    import pandas as pd

    # Build a small random integer dataframe on the GPU, then copy it to the host
    Xc = cudf.datasets.randomdata(nrows=1000, dtypes={"id": int, "x": int, "y": int})
    Xp = Xc.to_pandas()

    # In pandas, astype("object") keeps each value as a boxed Python object
    print(Xp.astype("object").memory_usage(deep=True), "\n")
    # In cuDF, object means string, so this casts the ints to Arrow-style
    # string columns (characters + offsets + null mask)
    print(Xc.astype("object").memory_usage(deep=True), "\n")
    # pandas with Arrow-backed string storage, for an apples-to-apples comparison
    print(Xp.astype("string[pyarrow]").memory_usage(deep=True))
    Index      128
    id       36000
    x        36000
    y        36000
    dtype: int64 
    
    id       7487
    x        7502
    y        7513
    Index       0
    dtype: int64 
    
    Index     128
    id       7483
    x        7498
    y        7509
    dtype: int64
    
    

    Using the Arrow-spec string dtype in pandas saves quite a bit of memory and generally matches cuDF's usage.
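
    If you want dtype="string" to use Arrow-backed storage by default, rather than spelling out "string[pyarrow]" each time, recent pandas exposes an option for it; a minimal sketch, assuming pandas >= 1.4 with pyarrow installed:

    import pandas as pd

    # Make dtype="string" default to Arrow-backed storage instead of
    # Python-object storage (requires pandas >= 1.4 and pyarrow).
    pd.set_option("mode.string_storage", "pyarrow")

    s = pd.Series(["a", "bb", "ccc"], dtype="string")
    print(s.dtype.storage)  # pyarrow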