python pandas parquet

How to identify Pandas' backend for Parquet


I understand that Pandas can read and write Parquet files using different backends: pyarrow and fastparquet.

I have a Conda installation with the Intel distribution, and "it works": I can use pandas.DataFrame.to_parquet. However, I do not have pyarrow installed, so I guess that fastparquet is being used (although I cannot find it either).

Is there a way to identify which backend is used?


Solution

  • One method is to call pd.show_versions(), which lists the installed dependencies (plus other environment details):

    pd.show_versions()
    
    INSTALLED VERSIONS
    ------------------
    commit: None
    python: 3.6.0.final.0
    python-bits: 64
    OS: Windows
    OS-release: 7
    machine: AMD64
    processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
    byteorder: little
    LC_ALL: None
    LANG: None
    LOCALE: None.None
    
    pandas: 0.23.0
    pytest: 3.0.5
    pip: 9.0.3
    setuptools: 27.2.0
    Cython: 0.25.2
    numpy: 1.14.3
    scipy: 1.1.0
    pyarrow: None
    xarray: None
    IPython: 5.1.0
    sphinx: 1.5.1
    patsy: 0.4.1
    dateutil: 2.6.0
    pytz: 2016.10
    blosc: None
    bottleneck: 1.2.1
    tables: 3.4.3
    numexpr: 2.6.5
    feather: None
    matplotlib: 2.2.2
    openpyxl: 2.4.1
    xlrd: 1.0.0
    xlwt: 1.2.0
    xlsxwriter: 0.9.6
    lxml: 3.7.2
    bs4: 4.5.3
    html5lib: 0.9999999
    sqlalchemy: 1.1.5
    pymysql: None
    psycopg2: None
    jinja2: 2.9.4
    s3fs: None
    fastparquet: None
    pandas_gbq: None
    pandas_datareader: None
    

    Here, incidentally, I have neither pyarrow nor fastparquet installed.

    A more direct way is to call pd.io.parquet.get_engine('auto'):

    In[193]:
    pd.io.parquet.get_engine('auto')
    
    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-193-929185e5aca8> in <module>()
    ----> 1 pd.io.parquet.get_engine('auto')
    
    C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parquet.py in get_engine(engine)
         27             pass
         28 
    ---> 29         raise ImportError("Unable to find a usable engine; "
         30                           "tried using: 'pyarrow', 'fastparquet'.\n"
         31                           "pyarrow or fastparquet is required for parquet "
    
    ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
    pyarrow or fastparquet is required for parquet support
    

    As I don't have either installed, this raises an ImportError. Presumably, in your environment it will return the installed engine.

    After installing fastparquet, I now get:

    In[194]:
    pd.io.parquet.get_engine('auto')
    
    Out[194]: <pandas.io.parquet.FastParquetImpl at 0xf5582b0>
    

    And if we look at the class:

    In[202]:
    impl = pd.io.parquet.get_engine('auto')
    impl.__class__
    
    Out[202]: pandas.io.parquet.FastParquetImpl
    

    This tells us which implementation is in use.

    If pyarrow is installed, one would get:

    >>> pd.io.parquet.get_engine('auto')
    <pandas.io.parquet.PyArrowImpl object at 0xa13fb1ef0>
    >>> pd.io.parquet.get_engine('auto').__class__
    <class 'pandas.io.parquet.PyArrowImpl'>