I understand that the GPU and CPU each have their own RAM, but what I don't understand is why the same dataframe, when loaded in pandas vs. RAPIDS cuDF, has drastically different memory usage. Can somebody explain?
As noted in Josh Friedlander's comment, in cuDF the object data type is explicitly for strings. In pandas, object covers strings as well as arbitrary/mixed Python objects (lists, dicts, arrays, etc.). This difference explains the memory behavior in many scenarios, but not when both columns are strings.
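As a quick illustration (a sketch with made-up values), a pandas object column happily holds mixed Python objects, and memory_usage(deep=True) counts each object's full footprint:

import pandas as pd

# object dtype can hold anything; each element is a full Python object
mixed = pd.Series([{"a": 1}, [1, 2, 3], "text"], dtype="object")
strings = pd.Series(["one", "two", "three"], dtype="object")

# deep=True walks the Python objects and sums their individual sizes
print(mixed.memory_usage(deep=True))
print(strings.memory_usage(deep=True))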
Assuming both columns are strings, there is still likely to be a difference. In cuDF, a string column is represented as a single allocation for the raw characters, a null-mask allocation to handle missing values, and an offsets allocation marking where each row's string begins and ends, consistent with the Apache Arrow memory specification. Whatever is stored in these columns is therefore almost certainly more compact in this layout than in pandas' default object-dtype strings, which keep every value as a separate Python object with per-object overhead.
The following example may be helpful:
import cudf
import pandas as pd

# Build a small integer dataframe on the GPU, then copy it to pandas
Xc = cudf.datasets.randomdata(nrows=1000, dtypes={"id": int, "x": int, "y": int})
Xp = Xc.to_pandas()

# Cast to object/string dtypes and compare per-column memory usage
print(Xp.astype("object").memory_usage(deep=True), "\n")
print(Xc.astype("object").memory_usage(deep=True), "\n")
print(Xp.astype("string[pyarrow]").memory_usage(deep=True))
Index 128
id 36000
x 36000
y 36000
dtype: int64

id 7487
x 7502
y 7513
Index 0
dtype: int64

Index 128
id 7483
x 7498
y 7509
dtype: int64
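To see the three Arrow allocations described above directly, you can inspect the buffers of a pyarrow string array. A minimal sketch (assumes pyarrow is installed; buffer sizes may include allocation padding):

import pyarrow as pa

# An Arrow string array carries three buffers:
# a validity bitmap, int32 row offsets, and the raw character data
arr = pa.array(["hello", None, "world"])
validity, offsets, data = arr.buffers()

print(validity.size, offsets.size, data.size)  # sizes in bytes
print(arr.nbytes)  # total size consumed by the array's elements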
Using the Arrow-spec string dtype in pandas (string[pyarrow]) saves quite a bit of memory and generally matches cuDF.
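If you want Arrow-backed columns from the start rather than casting afterwards, recent pandas (2.0+) can produce them at read time. A sketch, where data.csv is a placeholder path:

import pandas as pd

# dtype_backend="pyarrow" yields Arrow-backed columns, including Arrow strings
df = pd.read_csv("data.csv", dtype_backend="pyarrow")
print(df.dtypes)
print(df.memory_usage(deep=True))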