Tags: python, pandas, memory, rapids

GPU vs CPU memory usage in RAPIDS


I understand that the GPU and CPU each have their own RAM, but what I don't understand is why the same dataframe, when loaded in pandas versus RAPIDS cuDF, shows drastically different memory usage. Can somebody explain?

[screenshots comparing the same dataframe's memory usage in pandas and in cuDF]


Solution

  • As noted in Josh Friedlander's comment, in cuDF the object data type is explicitly for strings. In pandas, object is the data type for strings but also for arbitrary/mixed data (lists, dicts, arrays, etc.). That difference explains the memory behavior in many scenarios, but not when both columns are strings (see the short sketch below).
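
    As a quick illustration of that difference, here is a minimal sketch (the specific values are made up for illustration): pandas happily stores a mixed-type object column, while cuDF, where object means string, refuses to construct one.

    import pandas as pd

    # In pandas, an "object" column can hold arbitrary, mixed Python objects.
    s = pd.Series([1, "a", [1, 2]], dtype="object")
    print(s.memory_usage(deep=True))  # deep=True counts the boxed Python objects

    # In cuDF, "object" means string only, so the same mixed column cannot
    # be constructed; the following would raise an error:
    # import cudf
    # cudf.Series([1, "a", [1, 2]])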

    Assuming both columns are strings, there is still likely to be a difference. In cuDF, a string column is represented as a single allocation for the raw characters, an associated null-mask allocation to handle missing values, and an associated allocation of row offsets, consistent with the Apache Arrow memory specification. So whatever these columns hold is almost certainly stored more compactly in this structure in cuDF than in pandas' default string representation, which boxes every value as a Python object.
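
    To make that layout concrete, here is a small sketch that builds an Arrow string array directly with pyarrow (assumed to be installed; cuDF follows the same layout) and checks the buffer arithmetic by hand:

    import pyarrow as pa

    strings = ["apple", "banana", "cherry"] * 1000

    # An Arrow string array is a character buffer plus int32 row offsets
    # (plus a validity bitmap, omitted here because there are no nulls).
    arr = pa.array(strings)

    chars = sum(len(s) for s in strings)  # 1 byte per ASCII character
    offsets = 4 * (len(strings) + 1)      # one int32 offset per row, plus one
    print(arr.nbytes, chars + offsets)    # the two numbers should match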

    The following example may be helpful:

    import cudf
    import pandas as pd

    # Build a small random integer dataframe on the GPU, then copy it to the host
    Xc = cudf.datasets.randomdata(nrows=1000, dtypes={"id": int, "x": int, "y": int})
    Xp = Xc.to_pandas()

    # In pandas, astype("object") keeps each value as a boxed Python object
    print(Xp.astype("object").memory_usage(deep=True), "\n")
    # In cuDF, object means string, so this casts the ints to Arrow-style
    # string columns (characters + offsets + null mask)
    print(Xc.astype("object").memory_usage(deep=True), "\n")
    # pandas with Arrow-backed string storage, for an apples-to-apples comparison
    print(Xp.astype("string[pyarrow]").memory_usage(deep=True))
    Index      128
    id       36000
    x        36000
    y        36000
    dtype: int64 
    
    id       7487
    x        7502
    y        7513
    Index       0
    dtype: int64 
    
    Index     128
    id       7483
    x        7498
    y        7509
    dtype: int64
    
    

    Using the Arrow-spec string dtype in pandas saves quite a bit of memory and generally matches cuDF's usage.
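
    If you want dtype="string" to use Arrow-backed storage by default, rather than spelling out "string[pyarrow]" each time, recent pandas exposes an option for it; a minimal sketch, assuming pandas >= 1.4 with pyarrow installed:

    import pandas as pd

    # Make dtype="string" default to Arrow-backed storage instead of
    # Python-object storage (requires pandas >= 1.4 and pyarrow).
    pd.set_option("mode.string_storage", "pyarrow")

    s = pd.Series(["a", "bb", "ccc"], dtype="string")
    print(s.dtype.storage)  # pyarrow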