I have a string with multiple URLs extracted using BeautifulSoup,
and I want to split all of these URLs to extract the dates and year (the URLs have the dates and year in them).
print(dat)
http://www.foo.com/2016/01/0124
http://www.foo.com/2016/02/0122
http://www.foo.com/2016/02/0426
http://www.foo.com/2016/03/0129
...
I tried the following but it only retrieves the first:
from urlparse import urlparse  # Python 2; on Python 3 use: from urllib.parse import urlparse
parsed = urlparse(dat)
path = parsed[2] #defining after www.foo.com/
pathlist = path.split("/")
['', '2016', '01', '0124']
So I am only getting the result for the first element of the string. How can I retrieve these parses for all of the URLs, and store them so I can extract information? I would like to know how many links there are for each year and month.
Also, strangely, after doing this, when I do print(dat) I only get the first element, http://www.foo.com/2016/01/0124, so it seems that urlparse is not working for multiple URLs.
Based on your question, it looks like you have a string of URLs separated by newlines. In that case you can use a for loop to iterate over them:
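from urlparse import urlparse  # Python 2; on Python 3 use: from urllib.parse import urlparse

list_pathlist = []
for url in dat.split('\n'):
    parsed = urlparse(url)
    path = parsed[2]  # the part after www.foo.com, e.g. '/2016/01/0124'
    pathlist = path.split("/")
    list_pathlist.append(pathlist)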
In which case I suspect the result (list_pathlist) will be something like:
[['', '2016', '01', '0124'], ['', '2016', '02', '0122'], ...]
so a list of lists.
Or you can put it into a nice one-liner using a list comprehension:
list_pathlist = [urlparse(url)[2].split('/') for url in dat.split('\n')]
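To get at the "how many links per year and month" part of the question, one option is to count the (year, month) pairs with collections.Counter. This is only a minimal sketch: the helper name count_by_year_month is made up for illustration, the sample dat just mirrors the four URLs shown in the question, and it assumes every path has the form /year/month/... as in those examples.

```python
from collections import Counter

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

# Sample input mirroring the question (assumption: one URL per line).
dat = """http://www.foo.com/2016/01/0124
http://www.foo.com/2016/02/0122
http://www.foo.com/2016/02/0426
http://www.foo.com/2016/03/0129"""


def count_by_year_month(text):
    """Count URLs per (year, month), assuming paths look like /year/month/..."""
    counts = Counter()
    for url in text.splitlines():
        url = url.strip()
        if not url:
            continue  # skip blank lines
        parts = urlparse(url).path.split("/")
        # parts looks like ['', '2016', '01', '0124']
        if len(parts) >= 3:
            counts[(parts[1], parts[2])] += 1
    return counts


counts = count_by_year_month(dat)
for (year, month), n in sorted(counts.items()):
    print("%s-%s: %d" % (year, month, n))
```

With the four sample URLs this prints one link for 2016-01, two for 2016-02 and one for 2016-03. Using splitlines() and skipping blank lines is a small departure from split('\n') above, chosen so that a trailing newline in dat does not produce an empty entry.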