Tags: python, azure, jupyter-notebook, azure-machine-learning-service

Cannot read ".parquet" files in Azure Jupyter Notebook (Python 2 and 3)


I am trying to open parquet files in Azure Jupyter Notebooks. I have tried both Python kernels (2 and 3). After installing pyarrow, I can import the module only with the Python 2 kernel; the import does not work with Python 3.

Here is what I've done so far (for clarity, I am not listing all my other attempts, such as using conda instead of pip, which also failed):

!pip install --upgrade pip
!pip install -I Cython==0.28.5
!pip install pyarrow

import pandas  
import pyarrow
import pyarrow.parquet

#so far, so good

filePath_parquet = "foo.parquet"
table_parquet_raw = pandas.read_parquet(filePath_parquet, engine='pyarrow')

This works well offline (using Spyder, Python 3.7.0), but it fails in an Azure Notebook:

AttributeError                            Traceback (most recent call last)
<ipython-input-54-2739da3f2d20> in <module>()
      6 
      7 #table_parquet_raw = pd.read_parquet(filePath_parquet, engine='pyarrow')
----> 8 table_parquet_raw = pandas.read_parquet(filePath_parquet, engine='pyarrow')

AttributeError: 'module' object has no attribute 'read_parquet'

Any idea please?

Thank you in advance !

EDIT:

Thank you very much for your reply Peter Pan ! I have typed these statements, here is what I got:

1.

    print(pandas.__dict__)

=> read_parquet does not appear

2.

    print(pandas.__file__)

=> I get:

    /home/nbuser/anaconda3_23/lib/python3.4/site-packages/pandas/__init__.py

3.

    import sys; print(sys.path)

=> I get:

    ['', '/home/nbuser/anaconda3_23/lib/python34.zip',
    '/home/nbuser/anaconda3_23/lib/python3.4',
    '/home/nbuser/anaconda3_23/lib/python3.4/plat-linux',
    '/home/nbuser/anaconda3_23/lib/python3.4/lib-dynload',
    '/home/nbuser/.local/lib/python3.4/site-packages',
    '/home/nbuser/anaconda3_23/lib/python3.4/site-packages',
    '/home/nbuser/anaconda3_23/lib/python3.4/site-packages/Sphinx-1.3.1-py3.4.egg',
    '/home/nbuser/anaconda3_23/lib/python3.4/site-packages/setuptools-27.2.0-py3.4.egg',
    '/home/nbuser/anaconda3_23/lib/python3.4/site-packages/IPython/extensions',
    '/home/nbuser/.ipython']
    

Do you have any idea please ?

EDIT 2:

Dear @PeterPan, I have typed both !conda update conda and !conda update pandas : when checking the Pandas version (pandas.__version__), it is still 0.19.2.

I have also tried !conda update pandas -y -f, which returns:

    Fetching package metadata ...........
    Solving package specifications: .

    Package plan for installation in environment /home/nbuser/anaconda3_23:

    The following NEW packages will be INSTALLED:

        pandas: 0.19.2-np111py34_1

When typing: !pip install --upgrade pandas

I get:

    Requirement already up-to-date: pandas in /home/nbuser/anaconda3_23/lib/python3.4/site-packages
    Requirement already up-to-date: pytz>=2011k in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
    Requirement already up-to-date: numpy>=1.9.0 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
    Requirement already up-to-date: python-dateutil>=2 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
    Requirement already up-to-date: six>=1.5 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from python-dateutil>=2->pandas)

Finally, when typing:

!pip install --upgrade pandas==0.24.0

I get:

    Collecting pandas==0.24.0
      Could not find a version that satisfies the requirement pandas==0.24.0 (from versions: 0.1, 0.2b0, 0.2b1, 0.2, 0.3.0b0, 0.3.0b2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0rc1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0rc1, 0.8.0rc2, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0rc1, 0.19.0, 0.19.1, 0.19.2, 0.20.0rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0rc1, 0.21.0, 0.21.1, 0.22.0)
    No matching distribution found for pandas==0.24.0

Therefore, my guess is that the problem comes from the way packages are managed in Azure. Updating a package (here pandas) should install the latest available version, shouldn't it?


Solution

  • I tried to reproduce your issue in my own Azure Jupyter Notebook but could not. Everything worked for me, even without your two steps !pip install --upgrade pip and !pip install -I Cython==0.28.5, which I don't think matter here.

    Please run the code below to check whether the pandas package you import is the correct one.

    1. Run print(pandas.__dict__) and check whether read_parquet appears in the output.
    2. Run print(pandas.__file__) to check whether you imported a different pandas package.
    3. Run import sys; print(sys.path) and check, in path order, whether any of these paths contains a file or directory with the same name.

    If there is a local file or directory named pandas that shadows the installed package, rename it, restart your ipynb, and re-run. This is a common issue; see these SO threads: AttributeError: 'module' object has no attribute 'reader' and Importing installed package from script raises "AttributeError: module has no attribute" or "ImportError: cannot import name".

    Otherwise, please update your post with more details.
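
The three checks above can be bundled into one helper. This is only a minimal sketch: inspect_module and the keys of the dictionary it returns are hypothetical names chosen for illustration, and it uses nothing beyond the standard library.

```python
import importlib
import os
import sys


def inspect_module(module_name, attribute):
    """Gather what the three checks look for, for any importable module."""
    module = importlib.import_module(module_name)

    # Every sys.path entry holding a file or directory with the module's name.
    # Python normally imports the first match, so an unexpected early hit
    # (for example in the notebook's own folder) points to a shadowing file.
    locations = []
    for entry in sys.path:
        base = entry or os.getcwd()  # an empty entry means the current directory
        for name in (module_name, module_name + ".py"):
            candidate = os.path.join(base, name)
            if os.path.exists(candidate):
                locations.append(candidate)

    return {
        "has_attribute": hasattr(module, attribute),       # check 1
        "file": getattr(module, "__file__", None),         # check 2
        "version": getattr(module, "__version__", None),
        "locations": locations,                            # check 3
    }
```

In the notebook, inspect_module("pandas", "read_parquet") would then report in one call whether the function exists, which file was imported, and which version it is.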


    The latest pandas version should be 0.23.4, not 0.24.0.

    I looked for the earliest pandas version that supports read_parquet by searching for the function name in the documentation of each version from 0.19.2 to 0.23.3. I found that pandas supports read_parquet from the 0.21 series onward: the function was introduced in 0.21.0, and the What's New page for version 0.21.1 lists further Parquet improvements among its new features.

    According to your EDIT 2, you seem to be using Python 3.4 in the Azure Jupyter Notebook. Not all pandas versions support Python 3.4.

    Versions 0.21.1 and 0.22.0 officially support Python 2.7, 3.5, and 3.6.

    The PyPI page for pandas also lists its required Python versions.
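
To make the version reasoning concrete, here is a rough sketch that checks a kernel's Python version against the supported list quoted above and compares an installed pandas version string with the lowest version recommended in this answer. The names SUPPORTED_PYTHONS, python_is_supported, parse_version, and needs_upgrade are hypothetical helpers, not part of pandas.

```python
import re
import sys

# Python versions this answer quotes as officially supported by
# pandas 0.21.1 and 0.22.0.
SUPPORTED_PYTHONS = {(2, 7), (3, 5), (3, 6)}


def python_is_supported(version_info=None):
    """True if the (major, minor) pair is in the list quoted above."""
    major, minor = tuple(version_info or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHONS


def parse_version(text):
    """Turn '0.19.2' into (0, 19, 2); pre-release suffixes are not ranked."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def needs_upgrade(installed, minimum="0.21.1"):
    """True if the installed version is older than the assumed minimum.

    The default minimum is the lowest version this answer suggests installing.
    """
    return parse_version(installed) < parse_version(minimum)
```

For the situation in the question, python_is_supported((3, 4)) is False and needs_upgrade("0.19.2") is True, which matches what you are seeing: an old pandas on a Python version outside the supported list.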

    So you can try to install pandas 0.21.1 or 0.22.0 in the current Python 3.4 notebook. If that fails, please create a new notebook with Python 2.7 or >=3.5 and install pandas >= 0.21.1 there to use read_parquet.
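
When pinning a version, it can help to make sure pip installs into the same interpreter the kernel is running. That !pip may point at a different Python than the kernel is an assumption about the notebook setup, not something confirmed in this thread. The sketch below only builds the command; pip_install_command is a hypothetical helper name.

```python
import sys


def pip_install_command(package, version=None):
    """Build a pip command that targets the interpreter running this kernel."""
    requirement = package if version is None else "{}=={}".format(package, version)
    # sys.executable is the Python binary of the current kernel, so
    # "-m pip" installs into that interpreter's site-packages.
    return [sys.executable, "-m", "pip", "install", "--upgrade", requirement]
```

The resulting list can be passed to subprocess.check_call, for example with pip_install_command("pandas", "0.22.0"). Restart the kernel afterwards so the new version is the one that gets imported.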