Search code examples
pythonpandasopenpyxlxlrd

Python Pandas not using names field as it once did


I have a short script that's a reduction of my application written using "pandas>=0.25.3" which has been upgraded to "pandas==1.1.5" the latest version of our code. In this version of Pandas, the default engine does not parse xlsx so we've added the engine="openpyxl". However, there's a new issue. The read_excel no longer seems to respect the names argument and has strange behavior.

import pandas

filename = "... .xlsx"

names = ["foo", "bar", "baz"]

data_frame = pandas.read_excel(
    filename,
    header=None,
    names=names,
    engine="openpyxl",
    skiprows=3,
    sheet_name=0,
)
print(data_frame.iloc[3])

Running the script with the new Pandas I get this output:

foo   NaN
bar   NaN
baz   NaN
Name: (FxK2,SMin, 2066.125), dtype: float64

But previously in pandas-0.25.3 parsing with the xlrd engine by default I got what I expected which was this:

foo   FxK2
bar   SMin
baz   2066.125

The names field gave names to the columns and then I could reference data_frame.iloc[0].baz and get 2066.125. Now for some reason, then entire thing ends up in the optional name field for the data frame.

How can I get the behavior I was used to and is this potentially a bug or just new interface I'm not used to? The pandas-1.1.5 certainly seem to reference the names argument in the same way as I was used to using it.


Solution

  • Updated (after comment threads):

    I got the same problem for the second excel file when using engine='openpyxl'. When reading with engine='xlrd' I didn't get the problem, though. But it's probably a version-related issue with xlrd, since I'm using an older version '1.2.0' (currently it's 2.0.1), and there's a big warning in xlrd docummentation saying it no longer read anything other than .xls files.

    Therefore, as we discussed in the comments, a possible work-around to use engine='openpyxl' is to specify the columns with usecols=range(3), although it requires a previous knowledge of the columns in the file. The issue might be at some hiding formatting/styling/etc in those "empty" cells.

    For example, in the same problematic file, if I select and copy the cells with values to another Sheet (Sheet1) in the same file, and read it with:

    pd.read_excel('redux2.xlsx', sheet_name='Sheet1', header=None, names=['foo','bar','baz'], engine='openpyxl', skiprows=1)
    

    then it works as expected.


    Original Answer:

    It's working as expected for me. Probably it's something with your data. Try the following code. If it works, then you know it's something with your data, not pandas.

    Sample data

    import string
    
    df = pd.DataFrame({
        'col1': list(string.ascii_letters[:10]),
        'col2': list(string.ascii_letters[10:20]),
        'col3': range(1,11)
    })
    
    df.to_excel('file.xlsx', index=False)
    print(df)
    
      col1 col2  col3
    0    a    k     1
    1    b    l     2
    2    c    m     3
    3    d    n     4
    4    e    o     5
    5    f    p     6
    6    g    q     7
    7    h    r     8
    8    i    s     9
    9    j    t    10
    

    Testing

    filename = 'file.xlsx'
    names = ["foo", "bar", "baz"]
    
    data_frame = pd.read_excel(
        filename,
        header=None,
        names=names,
        engine="openpyxl",
        skiprows=3,
        sheet_name=0,
    )
    
    print(data_frame.iloc[3])
    
    foo    f
    bar    p
    baz    6
    Name: 3, dtype: object
    

    Note: My pandas version is '1.1.5' as well.