Tags: python, python-3.x, pandas, urllib

Converting an HTTP directory listing into a DataFrame and accessing the path in each row


I can access an HTTP URL and retrieve the directory listing. I then go through it line by line, check whether each URL has a .txt extension, fetch it with requests, and decode the text file from the response content. However, I want to be able to filter the listing by date. The directory listing looks like this:

<HTML><HEAD><TITLE>IP Address/Log</TITLE>
<BODY>
<H1>Log</H1><HR>

<PRE><A HREF="/Main/">[To Parent Directory]</A>
  23/11/18    19:07          314 <A HREF="/Log/Alarm_231118.txt">Alarm_231118.txt</A>
  23/11/16    23:59       150516 <A HREF="/Log/Temperature%20Detail_Data%20Log_231116.txt">Temperature Detail_Data Log_231116.txt</A>
  23/11/28    15:22          450 <A HREF="/Log/Alarm_231128.txt">Alarm_231128.txt</A>
  23/11/17     0:00       450536 <A HREF="/Log/Temperature%20Detail_Data%20Log.log">Temperature Detail_Data Log.log</A>
  23/11/16    23:59       110148 <A HREF="/Log/Water%20Temp%20Trend_Data%20Log_231116.txt">Water Temp Trend_Data Log_231116.txt</A>
</PRE><HR></BODY></HTML>

I'm only interested in the rows that have the date, time, size and the HREF link. I want to create a dataframe where the first column is the date, the second is the time, the third is the size and the fourth is the link. To access each link I use the following code:

from io import StringIO

import pandas as pd
import requests

# lines, file_prefix_all, url_root, auth and files_dataframes are defined earlier (not shown)
for line in lines:
    if ".txt" in line:
        # The HREF value is the first double-quoted string on the line
        filename = line.split('"')[1]
        if filename.startswith(file_prefix_all) and filename.endswith(".txt"):
            file_url = url_root + filename
            print(file_url)
            file_response = requests.get(file_url, auth=auth)
            if file_response.status_code == 200:
                # Read the tab-separated content into a Pandas DataFrame
                file_content = file_response.content.decode('utf-8')
                df = pd.read_csv(StringIO(file_content), encoding='utf-8', sep='\t')
                files_dataframes.append(df)

Will I be able to use the same method once I get the listing into a dataframe? Any help or suggestions would be greatly appreciated!


Solution

  • You can use a regular expression to parse the text, for example:

    import re
    
    import pandas as pd
    
    text = """\
    <HTML><HEAD><TITLE>IP Address/Log</TITLE>
    <BODY>
    <H1>Log</H1><HR>
    
    <PRE><A HREF="/Main/">[To Parent Directory]</A>
      23/11/18    19:07          314 <A HREF="/Log/Alarm_231118.txt">Alarm_231118.txt</A>
      23/11/16    23:59       150516 <A HREF="/Log/Temperature%20Detail_Data%20Log_231116.txt">Temperature Detail_Data Log_231116.txt</A>
      23/11/28    15:22          450 <A HREF="/Log/Alarm_231128.txt">Alarm_231128.txt</A>
      23/11/17     0:00       450536 <A HREF="/Log/Temperature%20Detail_Data%20Log.log">Temperature Detail_Data Log.log</A>
      23/11/16    23:59       110148 <A HREF="/Log/Water%20Temp%20Trend_Data%20Log_231116.txt">Water Temp Trend_Data Log_231116.txt</A>
    </PRE><HR></BODY></HTML>"""
    
    
    df = pd.DataFrame(
        map(
            re.Match.groupdict,
            re.finditer(
                r'(?P<date>\d+/\d+/\d+).*?(?P<time>\d+:\d+).*?(?P<size>\d+).*?HREF="(?P<filename>[^"]+)"',
                text,
            ),
        )
    )
    print(df)
    

    Prints:

           date   time    size                                         filename
    0  23/11/18  19:07     314                            /Log/Alarm_231118.txt
    1  23/11/16  23:59  150516  /Log/Temperature%20Detail_Data%20Log_231116.txt
    2  23/11/28  15:22     450                            /Log/Alarm_231128.txt
    3  23/11/17   0:00  450536         /Log/Temperature%20Detail_Data%20Log.log
    4  23/11/16  23:59  110148  /Log/Water%20Temp%20Trend_Data%20Log_231116.txt
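
The regular expression returns every field as a string, so a comparison such as "on or after 17 November" will not work on the date column as-is. Below is a minimal sketch of one way to handle that. It assumes the listing's dates are year/month/day with a two-digit year, which matches the yymmdd suffix in the file names. The table is rebuilt by hand so the snippet runs on its own; the `timestamp` column and the cutoff date are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Same values as the parsed listing, typed in by hand so this runs standalone.
df = pd.DataFrame(
    {
        "date": ["23/11/18", "23/11/16", "23/11/28", "23/11/17", "23/11/16"],
        "time": ["19:07", "23:59", "15:22", "0:00", "23:59"],
        "size": ["314", "150516", "450", "450536", "110148"],
        "filename": [
            "/Log/Alarm_231118.txt",
            "/Log/Temperature%20Detail_Data%20Log_231116.txt",
            "/Log/Alarm_231128.txt",
            "/Log/Temperature%20Detail_Data%20Log.log",
            "/Log/Water%20Temp%20Trend_Data%20Log_231116.txt",
        ],
    }
)

# Assumption: dates are yy/mm/dd. Combine date and time into one datetime column.
df["timestamp"] = pd.to_datetime(df["date"] + " " + df["time"], format="%y/%m/%d %H:%M")

# Sizes are digit strings; convert them so numeric comparisons work too.
df["size"] = df["size"].astype(int)

# Keep only entries modified on or after the (illustrative) cutoff date.
cutoff = pd.Timestamp("2023-11-17")
recent = df[df["timestamp"] >= cutoff]
print(recent[["timestamp", "size", "filename"]])
```

Conditions can be combined with `&`, for example `df[(df["timestamp"] >= cutoff) & df["filename"].str.endswith(".txt")]`.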
    

    To keep only the .txt files, filter on the filename column:

    print(df[df.filename.str.endswith(".txt")])
    

    Prints:

           date   time    size                                         filename
    0  23/11/18  19:07     314                            /Log/Alarm_231118.txt
    1  23/11/16  23:59  150516  /Log/Temperature%20Detail_Data%20Log_231116.txt
    2  23/11/28  15:22     450                            /Log/Alarm_231128.txt
    4  23/11/16  23:59  110148  /Log/Water%20Temp%20Trend_Data%20Log_231116.txt
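
On the question of reusing the download loop: the same requests logic should carry over, iterating over the filename column instead of the raw lines. The sketch below is one hedged way to do it. `load_tables`, `URL_ROOT` and the `get` parameter are hypothetical names introduced for illustration, and the host is a placeholder address, not the real device.

```python
from io import StringIO
from urllib.parse import urljoin

import pandas as pd

# Hypothetical base address; replace it with the real host.
URL_ROOT = "http://192.0.2.10"


def load_tables(listing, url_root, get):
    """Download every .txt file named in listing["filename"] and parse it as TSV.

    `get` is any callable that takes a URL and returns an object with
    `status_code` and `content` attributes, such as a requests response.
    """
    frames = []
    txt_rows = listing[listing["filename"].str.endswith(".txt")]
    for href in txt_rows["filename"]:
        # The HREF is an absolute path ("/Log/..."), so urljoin keeps only
        # the scheme and host from url_root and appends the path.
        file_url = urljoin(url_root, href)
        response = get(file_url)
        if response.status_code == 200:
            content = response.content.decode("utf-8")
            frames.append(pd.read_csv(StringIO(content), sep="\t"))
    return frames
```

With requests this could be called as `load_tables(df, url_root, lambda u: requests.get(u, auth=auth))`, where `url_root` and `auth` are the values from the question. Passing the fetch function in as an argument also makes it possible to try the loop against canned responses before pointing it at the real server.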