Search code examples
pythonhtmlpython-3.xpandashtml-parsing

How to recognize dates in pandas index and header?


I am extracting a table from a website using the HTML parser which is something like below:

Current Output:

          1 Sep         2 Sep       3 Sep        4 Sep  
00:00     11             47          54           10 
01:00     12             49          46           22
...
23:00     15             34          22           40

Expected Output:

          2017-09-0    2017-09-02    2017-09-03    2017-09-04   
00:00:00     11             47          54           10 
01:00:00     12             49          46           22
...
23:00:00     15             34          22           40

but both the index and header are being recognized as objects instead of date and hour. So how do I fix this?

Below is the code that i am using:

url = r'https://www.nratrafficdata.ie/c2/tfmonthreport.asp?sgid=ZvyVmXU8jBt9PJE$c7UXt6&spid=NRA_000000001508&reportdate=2017-09-01&enddate=2017-09-30&sidebar=1'
tables = pd.read_html(url, header=[1], index_col=0)
sp500_table = tables[0]

Solution

  • Use Series.str.contains with regex pattern for match dd:dd values first, then filter out columns from list and last convert columns to datetimes and index to timedeltas:

    cols = sp500_table.columns.difference(['Workday','7 Day','Count'], sort=False)
    sp500_table = sp500_table.loc[sp500_table.index.str.contains('^\d{2}:\d{2}$', na=False), cols]
    sp500_table.columns = pd.to_datetime(sp500_table.columns + ' 2017', format='%d %b %Y')
    sp500_table.index = pd.to_timedelta(sp500_table.index + ':00') 
    print (sp500_table.head())
             2017-09-01 2017-09-02 2017-09-03 2017-09-04 2017-09-05 2017-09-06  \
    00:00:00       1362       1576       1589       1321       1198       1218   
    01:00:00        998       1105       1359        859        722        881   
    02:00:00        751        613        844        550        470        600   
    03:00:00        822        813        808        749        627        672   
    04:00:00       1461       1357       1108       1466       1497       1386   
    
             2017-09-07 2017-09-08 2017-09-09 2017-09-10  ... 2017-09-21  \
    00:00:00       1319       1422       1717       1684  ...       1310   
    01:00:00        802        987       1234       1377  ...        807   
    02:00:00        525        659        651        834  ...        531   
    03:00:00        697        714        796        786  ...        661   
    04:00:00       1397       1412       1417       1123  ...       1319   
    
             2017-09-22 2017-09-23 2017-09-24 2017-09-25 2017-09-26 2017-09-27  \
    00:00:00       1615       1807       1600       1219       1131       1283   
    01:00:00       1046       1263       1322        760        732        888   
    02:00:00        554        738        842        471        487        528   
    03:00:00        754        823        734        595        608        633   
    04:00:00       1514       1315       1091       1451       1287       1376   
    
             2017-09-28 2017-09-29 2017-09-30  
    00:00:00       1357       1579       1858  
    01:00:00        968       1025       1146  
    02:00:00        554        642        704  
    03:00:00        767        739        727  
    04:00:00       1378       1420       1243  
    
    [5 rows x 30 columns]