Search code examples
pythonexcelpandastableau-apixlsx

How to detect the number of the rows to skip in reading excel file on pandas


I'd like to read a .xlsx using python pandas. The problem is the at the beginning of the excel file, it has some additional data like title or description of the table and tables contents starts. That introduce the unnamed columns because pandas DataReader takes it as the columns. But tables contents starts after few lines later.

A                              B                     C
this is description
last updated: Mar 18th,2014
                               Table content
Country                        Year                 Product_output
Canada                         2017                 3002
Bulgaria                       2016                 2201
...

The table content starts in line 4. And columns must be "Country", "year", "proudct_output" instead "this is description", "unnamed", "unnamed". For this specific case, setting skiprows parameter to 3 solved the issue(from Mikhail Venkov). But I have to deal with many excel files and I don't know how many rows to skip in advance. I think there might be a solution since each table column header has a filter.


Solution

  • If you know that a specific text (like country) must be in the first column, you can do the following:

    import xlrd
    xl_work = xlrd.open_workbook("Classeur1.xlsx")
    mySheet = xl_work.sheet_by_index(0)
    
    nl = 0
    while mySheet.cell_value(nl,0) != "Country" :
        nl += 1
    
    line_with_headers = nl
    

    Then use skiprows with nl instead of 3.