Pandas error in Python: columns must be same length as key


I am scraping data from a few websites and using pandas to modify it.

It worked well on the first few chunks of data, but later I got this error message:

Traceback (most recent call last):
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2326, in __setitem__
    self._setitem_array(key, value)
  File "/home/web/.local/lib/python2.7/site-packages/pandas/core/frame.py", line 2350, in _setitem_array
    raise ValueError('Columns must be same length as key')
ValueError: Columns must be same length as key

My code is here:

df2 = pd.DataFrame(datatable, columns = cols)
df2[['STATUS_ID_1','STATUS_ID_2']] = df2['STATUS'].str.split(n=1, expand=True)

My data looks like this:

                  STATUS
2       Landed   8:33 AM
3       Landed   9:37 AM
..         ...       ...
316    Delayed   5:00 PM
341    Delayed   4:32 PM
..         ...       ...
397    Delayed   5:23 PM
..         ...       ...

[240 rows x 2 columns]

Solution

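  • The solution needs a small modification, because the split sometimes returns two columns and sometimes only one. When no value in a chunk contains whitespace, str.split(n=1, expand=True) returns a single column, and assigning one column to two column names raises the error. Name the columns after splitting instead: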

    df2 = pd.DataFrame({'STATUS':['Estimated 3:17 PM','Delayed 3:00 PM']})
    
    
    df3 = df2['STATUS'].str.split(n=1, expand=True)
    df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
    print (df3)
      STATUS_ID1 STATUS_ID2
    0  Estimated    3:17 PM
    1    Delayed    3:00 PM
    
    df2 = df2.join(df3)
    print (df2)
                  STATUS STATUS_ID1 STATUS_ID2
    0  Estimated 3:17 PM  Estimated    3:17 PM
    1    Delayed 3:00 PM    Delayed    3:00 PM
    

    Another possible case is that no value contains whitespace. The solution works here too:

    df2 = pd.DataFrame({'STATUS':['Canceled','Canceled']})
    

    and the solution returns:

    print (df2)
         STATUS STATUS_ID1
    0  Canceled   Canceled
    1  Canceled   Canceled
    

    All together:

    df3 = df2['STATUS'].str.split(n=1, expand=True)
    df3.columns = ['STATUS_ID{}'.format(x+1) for x in df3.columns]
    df2 = df2.join(df3)
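
If later code expects both STATUS_ID_1 and STATUS_ID_2 to exist for every chunk, one option is to pad the split result to a fixed number of columns before naming them. The following is a minimal sketch of that idea, not part of the answer above: the helper name split_status is hypothetical, the column names are taken from the question, and missing parts are assumed to be acceptable as NaN.

```python
import pandas as pd


def split_status(df, column="STATUS", names=("STATUS_ID_1", "STATUS_ID_2")):
    """Split `column` on whitespace into exactly len(names) new columns.

    `split_status` is a hypothetical helper used for illustration only.
    """
    # Split at most len(names) - 1 times; the result may have fewer
    # columns than len(names) if no value contains whitespace.
    parts = df[column].str.split(n=len(names) - 1, expand=True)
    # reindex adds any missing column positions, filled with NaN,
    # so the number of columns always matches the number of names.
    parts = parts.reindex(columns=range(len(names)))
    parts.columns = list(names)
    return df.join(parts)


# A chunk where only some values contain whitespace.
mixed = split_status(pd.DataFrame({"STATUS": ["Landed 8:33 AM", "Canceled"]}))

# A chunk where no value contains whitespace (the case that raised the error).
single = split_status(pd.DataFrame({"STATUS": ["Canceled", "Canceled"]}))

print(mixed)
print(single)
```

With this approach every chunk ends up with the same columns, at the cost of NaN values where a status has no second part.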