I am able to access an HTTP URL and retrieve the directory listing. I then go through it line by line, check whether each URL has a .txt
extension, fetch it with requests (reading the response's .content),
and decode the text file.
However, I want to be able to filter the listing based on the date. The directory listing looks like this:
<HTML><HEAD><TITLE>IP Address/Log</TITLE>
<BODY>
<H1>Log</H1><HR>
<PRE><A HREF="/Main/">[To Parent Directory]</A>
23/11/18 19:07 314 <A HREF="/Log/Alarm_231118.txt">Alarm_231118.txt</A>
23/11/16 23:59 150516 <A HREF="/Log/Temperature%20Detail_Data%20Log_231116.txt">Temperature Detail_Data Log_231116.txt</A>
23/11/28 15:22 450 <A HREF="/Log/Alarm_231128.txt">Alarm_231128.txt</A>
23/11/17 0:00 450536 <A HREF="/Log/Temperature%20Detail_Data%20Log.log">Temperature Detail_Data Log.log</A>
23/11/16 23:59 110148 <A HREF="/Log/Water%20Temp%20Trend_Data%20Log_231116.txt">Water Temp Trend_Data Log_231116.txt</A>
</PRE><HR></BODY></HTML>
I'm only interested in the rows that have the date, time, size and the HREF link. I want to create a dataframe where the first column is the date, the second is the time, the third is the size and the fourth is the link. To access each link I use the following code:
for line in lines:
    if ".txt" in line:
        filename = line.split('"')[1]
        if filename.startswith(file_prefix_all) and filename.endswith(".txt"):
            file_url = url_root + filename
            print(file_url)
            file_response = requests.get(file_url, auth=auth)
            if file_response.status_code == 200:
                # Read the CSV content into a Pandas DataFrame
                file_content = file_response.content.decode('utf-8')
                df = pd.read_csv(StringIO(file_content), encoding='utf-8', sep='\t')
                files_dataframes.append(df)
Will I be able to use the same method once I get the listing into a dataframe? Any help/suggestions would be greatly appreciated!
You can use a regular expression to parse the text, for example:
import re
import pandas as pd
text = """\
<HTML><HEAD><TITLE>IP Address/Log</TITLE>
<BODY>
<H1>Log</H1><HR>
<PRE><A HREF="/Main/">[To Parent Directory]</A>
23/11/18 19:07 314 <A HREF="/Log/Alarm_231118.txt">Alarm_231118.txt</A>
23/11/16 23:59 150516 <A HREF="/Log/Temperature%20Detail_Data%20Log_231116.txt">Temperature Detail_Data Log_231116.txt</A>
23/11/28 15:22 450 <A HREF="/Log/Alarm_231128.txt">Alarm_231128.txt</A>
23/11/17 0:00 450536 <A HREF="/Log/Temperature%20Detail_Data%20Log.log">Temperature Detail_Data Log.log</A>
23/11/16 23:59 110148 <A HREF="/Log/Water%20Temp%20Trend_Data%20Log_231116.txt">Water Temp Trend_Data Log_231116.txt</A>
</PRE><HR></BODY></HTML>"""
df = pd.DataFrame(
    map(
        re.Match.groupdict,
        re.finditer(
            r'(?P<date>\d+/\d+/\d+).*?(?P<time>\d+:\d+).*?(?P<size>\d+).*?HREF="(?P<filename>[^"]+)"',
            text,
        ),
    )
)
print(df)
Prints:
date time size filename
0 23/11/18 19:07 314 /Log/Alarm_231118.txt
1 23/11/16 23:59 150516 /Log/Temperature%20Detail_Data%20Log_231116.txt
2 23/11/28 15:22 450 /Log/Alarm_231128.txt
3 23/11/17 0:00 450536 /Log/Temperature%20Detail_Data%20Log.log
4 23/11/16 23:59 110148 /Log/Water%20Temp%20Trend_Data%20Log_231116.txt
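Note that every column captured by the regular expression is a string. Comparing by date needs real types first. Here is a minimal sketch of that conversion, assuming the listing writes dates as yy/mm/dd (which the file names such as Alarm_231118.txt suggest, but the server may differ). The `timestamp` column name is my own choice, and the small frame is only a stand-in for the parsed listing:

```python
import pandas as pd

# Stand-in for the parsed listing: all values are strings, as the regex returns them.
df = pd.DataFrame(
    {
        "date": ["23/11/18", "23/11/17"],
        "time": ["19:07", "0:00"],
        "size": ["314", "450536"],
        "filename": [
            "/Log/Alarm_231118.txt",
            "/Log/Temperature%20Detail_Data%20Log.log",
        ],
    }
)

# Assumption: dates are yy/mm/dd; change the format string if the order differs.
df["timestamp"] = pd.to_datetime(
    df["date"] + " " + df["time"], format="%y/%m/%d %H:%M"
)

# Sizes become integers so they can be compared or summed.
df["size"] = df["size"].astype(int)

print(df.dtypes)
```

With a datetime column in place, the usual comparison operators work on it directly.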
Then to filter the dataframe:
print(df[df.filename.str.endswith(".txt")])
Prints:
date time size filename
0 23/11/18 19:07 314 /Log/Alarm_231118.txt
1 23/11/16 23:59 150516 /Log/Temperature%20Detail_Data%20Log_231116.txt
2 23/11/28 15:22 450 /Log/Alarm_231128.txt
4 23/11/16 23:59 110148 /Log/Water%20Temp%20Trend_Data%20Log_231116.txt
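Filtering by date, which is what the question asks for, can be done in the same way. The sketch below assumes the dates are yy/mm/dd; the cutoff date and the `url_root` value are made-up examples, and the small frame stands in for the parsed listing:

```python
import pandas as pd

# Stand-in for the parsed listing (same columns as the frame above).
df = pd.DataFrame(
    {
        "date": ["23/11/18", "23/11/16", "23/11/28", "23/11/17"],
        "time": ["19:07", "23:59", "15:22", "0:00"],
        "size": ["314", "150516", "450", "450536"],
        "filename": [
            "/Log/Alarm_231118.txt",
            "/Log/Temperature%20Detail_Data%20Log_231116.txt",
            "/Log/Alarm_231128.txt",
            "/Log/Temperature%20Detail_Data%20Log.log",
        ],
    }
)

# Assumption: the listing uses yy/mm/dd.
day = pd.to_datetime(df["date"], format="%y/%m/%d")

# Keep only .txt files dated on or after the cutoff (example value).
cutoff = pd.Timestamp("2023-11-18")
recent = df[df["filename"].str.endswith(".txt") & (day >= cutoff)]

# Hypothetical root; the question builds URLs as url_root + filename.
url_root = "http://192.0.2.10"
file_urls = [url_root + name for name in recent["filename"]]

for file_url in file_urls:
    # The requests.get(file_url, auth=auth) call from the question would go here.
    print(file_url)
```

So yes, the download loop from the question should carry over unchanged: iterate over the filtered `filename` column instead of the raw lines. The HREF values are already percent-encoded, so they can be appended to the root as they are.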