I am currently trying to open parquet files using Azure Jupyter Notebooks. I have tried both Python kernels (2 and 3). After installing pyarrow, I can import the module only if the Python kernel is 2 (it does not work with Python 3).
Here is what I've done so far (for clarity, I am not mentioning all my various attempts, such as using conda instead of pip, as it also failed):
!pip install --upgrade pip
!pip install -I Cython==0.28.5
!pip install pyarrow
import pandas
import pyarrow
import pyarrow.parquet
#so far, so good
filePath_parquet = "foo.parquet"
table_parquet_raw = pandas.read_parquet(filePath_parquet, engine='pyarrow')
This works well offline (using Spyder, Python v.3.7.0), but it fails in an Azure Notebook:
AttributeError Traceback (most recent call last)
<ipython-input-54-2739da3f2d20> in <module>()
6
7 #table_parquet_raw = pd.read_parquet(filePath_parquet, engine='pyarrow')
----> 8 table_parquet_raw = pandas.read_parquet(filePath_parquet, engine='pyarrow')
AttributeError: 'module' object has no attribute 'read_parquet'
Any idea please?
Thank you in advance !
EDIT:
Thank you very much for your reply, Peter Pan! I typed these statements; here is what I got:
1. print(pandas.__dict__)
=> read_parquet does not appear
2. print(pandas.__file__)
=> I get:
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/pandas/__init__.py
3. import sys; print(sys.path)
=> I get:
['', '/home/nbuser/anaconda3_23/lib/python34.zip',
'/home/nbuser/anaconda3_23/lib/python3.4',
'/home/nbuser/anaconda3_23/lib/python3.4/plat-linux',
'/home/nbuser/anaconda3_23/lib/python3.4/lib-dynload',
'/home/nbuser/.local/lib/python3.4/site-packages',
'/home/nbuser/anaconda3_23/lib/python3.4/site-packages',
'/home/nbuser/anaconda3_23/lib/python3.4/site-packages/Sphinx-1.3.1-py3.4.egg',
'/home/nbuser/anaconda3_23/lib/python3.4/site-packages/setuptools-27.2.0-py3.4.egg',
'/home/nbuser/anaconda3_23/lib/python3.4/site-packages/IPython/extensions',
'/home/nbuser/.ipython']
Do you have any idea please ?
EDIT 2:
Dear @PeterPan, I have typed both !conda update conda and !conda update pandas: when checking the pandas version (pandas.__version__), it is still 0.19.2.
I have also tried !conda update pandas -y -f; it returns:
Fetching package metadata ...........
Solving package specifications: .
Package plan for installation in environment /home/nbuser/anaconda3_23:
The following NEW packages will be INSTALLED:
pandas: 0.19.2-np111py34_1
When typing:
!pip install --upgrade pandas
I get:
Requirement already up-to-date: pandas in /home/nbuser/anaconda3_23/lib/python3.4/site-packages
Requirement already up-to-date: pytz>=2011k in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
Requirement already up-to-date: numpy>=1.9.0 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
Requirement already up-to-date: python-dateutil>=2 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
Requirement already up-to-date: six>=1.5 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from python-dateutil>=2->pandas)
Finally, when typing:
!pip install --upgrade pandas==0.24.0
I get:
Collecting pandas==0.24.0
Could not find a version that satisfies the requirement pandas==0.24.0 (from versions: 0.1, 0.2b0, 0.2b1, 0.2, 0.3.0b0, 0.3.0b2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0rc1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0rc1, 0.8.0rc2, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0rc1, 0.19.0, 0.19.1, 0.19.2, 0.20.0rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0rc1, 0.21.0, 0.21.1, 0.22.0)
No matching distribution found for pandas==0.24.0
Therefore, my guess is that the problem comes from the way packages are managed in Azure. Updating a package (here pandas) should lead to an update to the latest version available, shouldn't it?
I tried to reproduce your issue in my Azure Jupyter Notebook, but could not. It worked for me without any issue, even without your two steps !pip install --upgrade pip and !pip install -I Cython==0.28.5, which I don't think matter.
Please run the code below to check whether the pandas package you imported is the correct one.
1. print(pandas.__dict__) to check whether read_parquet appears in the output.
2. print(pandas.__file__) to check whether you imported a different pandas package.
3. import sys; print(sys.path) to check the order of the paths and whether a file or directory with the same name exists under any of them.
If there is a file or directory named pandas under one of these paths, you just need to rename it and restart your ipynb to re-run. This is a common issue; see these SO threads: AttributeError: 'module' object has no attribute 'reader' and Importing installed package from script raises "AttributeError: module has no attribute" or "ImportError: cannot import name".
Otherwise, please update your post with more details to let me know.
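The three checks above can be bundled into one helper. This is only a minimal sketch using the standard library: the function name diagnose_module and the keys of the returned dictionary are hypothetical, chosen for illustration, and the shadowing check only looks for a same-named file or directory directly under each sys.path entry.

```python
import importlib
import os
import sys


def diagnose_module(name, attr):
    """Run the three checks for module `name` and attribute `attr`.

    Hypothetical helper: not part of pandas or of any library.
    """
    module = importlib.import_module(name)

    # Check 3: look for files or directories on sys.path that carry the
    # module's name. More than one hit suggests that something shadows
    # the real package.
    candidates = []
    for entry in sys.path:
        base = entry or os.getcwd()  # '' means the current directory
        for suffix in ("", ".py"):
            path = os.path.join(base, name + suffix)
            if os.path.exists(path):
                candidates.append(path)

    return {
        # Check 1: is the attribute defined in the imported module?
        "has_attr": hasattr(module, attr),
        # Check 2: which file was actually imported?
        "file": getattr(module, "__file__", None),
        "version": getattr(module, "__version__", None),
        "candidates": candidates,
    }
```

In the notebook you would call diagnose_module("pandas", "read_parquet"). If "has_attr" is False while "file" points at the expected site-packages directory, the installed version is probably too old rather than shadowed.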
The latest pandas version should be 0.23.4, not 0.24.0.
To find the earliest pandas version that supports read_parquet, I searched for the function name in the documentation of each version from 0.19.2 to 0.23.3. I found read_parquet among the new features of version 0.21.1, as below (the function first shipped in 0.21.0), so your 0.19.2 is too old to have it.
The new features shown in the What's New of version 0.21.1
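A version check makes the same point in code. This is a rough sketch, not an official pandas API: the names version_tuple, has_read_parquet and MIN_VERSION are hypothetical, and the threshold of 0.21.0 is my assumption for the first release that ships read_parquet. Release-candidate suffixes such as rc1 are ignored, so a pre-release is treated like the final release.

```python
import re

# Assumption: read_parquet first shipped in the pandas 0.21 series.
MIN_VERSION = (0, 21, 0)


def version_tuple(version):
    """Turn '0.19.2' or '0.21.0rc1' into a tuple of three ints.

    Suffixes such as 'rc1' are dropped, so a release candidate compares
    equal to the final release.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_read_parquet(version):
    """Return True if this pandas version string is new enough."""
    return version_tuple(version) >= MIN_VERSION
```

In the notebook, has_read_parquet(pandas.__version__) would tell you whether calling pandas.read_parquet can work at all before you look for other causes.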
According to your EDIT 2, it seems that you are using Python 3.4 in your Azure Jupyter Notebook. Not all pandas versions support Python 3.4.
Versions 0.21.1 and 0.22.0 officially support Python 2.7, 3.5, and 3.6, as below.
The PyPI page for pandas also states the required Python version, as below.
So you can try to install pandas version 0.21.1 or 0.22.0 in the current Python 3.4 notebook. If that fails, please create a new notebook with Python 2.7 or >=3.5 and install pandas version >= 0.21.1 there to use the read_parquet function.
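To see quickly whether the current kernel is one of the interpreters mentioned above, you could run something like the following. It is only a sketch: the function name python_supports_pandas_021 is hypothetical, and the rule it encodes (Python 2.7, or 3.5 and later) is taken from this answer rather than read from any package metadata.

```python
import sys


def python_supports_pandas_021(version_info=sys.version_info):
    """Return True for Python 2.7 or Python >= 3.5.

    Hypothetical helper: the rule mirrors the versions named in this
    answer, it is not queried from PyPI.
    """
    major, minor = version_info[0], version_info[1]
    if major == 2:
        return minor == 7
    return (major, minor) >= (3, 5)
```

Calling python_supports_pandas_021() without arguments checks the running kernel. On your Python 3.4 notebook it returns False, which is consistent with pip only offering pandas versions up to 0.22.0 that it cannot match to a newer requirement.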