
Split multiple URLs using urlparse in Python


I have a string containing multiple URLs extracted with BeautifulSoup, and I want to split each of these URLs to extract the dates and year (both appear in the URL path).

print(dat)
http://www.foo.com/2016/01/0124
http://www.foo.com/2016/02/0122
http://www.foo.com/2016/02/0426
http://www.foo.com/2016/03/0129
.
.

I tried the following but it only retrieves the first:

from urlparse import urlparse  # Python 2 module; in Python 3 this is urllib.parse
parsed = urlparse(dat)
path = parsed[2]  # the path component, i.e. everything after www.foo.com
pathlist = path.split("/")

['', '2016', '01', '0124']

So I am only getting a result for the first element of the string. How can I parse all of the URLs and store the results so I can extract information from them? I would like to know how many links there are for each year and month.

Also, strangely, after doing this, print(dat) only shows the first element, http://www.foo.com/2016/01/0124. It seems that urlparse is not working for multiple URLs.


Solution

  • Based on your question, it looks like dat is a single string of URLs separated by newlines. In that case you can split it on newlines and use a for loop to iterate over the URLs:

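    from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse

    list_pathlist = []
    for url in dat.split('\n'):
        parsed = urlparse(url)
        path = parsed[2]  # path component: everything after www.foo.com
        pathlist = path.split("/")
        list_pathlist.append(pathlist)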
    

    In that case I expect the result (list_pathlist) to look something like:

    [['', '2016', '01', '0124'], ['', '2016', '02', '0122'], ...]
    

    so a list of lists.

    Or you can write it as a one-liner using a list comprehension:

    list_pathlist = [urlparse(url)[2].split('/') for url in dat.split('\n')]
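
The question also asks how many links there are per year and month, which the loop above does not cover. Here is a minimal sketch of one way to do that with collections.Counter. It assumes every path has the shape /year/month/... as in the sample URLs; the names per_month and per_year are made up for illustration, and dat is rebuilt here from the four URLs shown in the question.

```python
from collections import Counter
from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse

# Sample input modelled on the question: one URL per line.
dat = """http://www.foo.com/2016/01/0124
http://www.foo.com/2016/02/0122
http://www.foo.com/2016/02/0426
http://www.foo.com/2016/03/0129"""

per_month = Counter()
for url in dat.splitlines():
    url = url.strip()
    if not url:
        continue  # skip blank lines
    # Drop the leading slash so the first part is the year, not ''.
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 2:
        continue  # path does not have the assumed /year/month/... shape
    year, month = parts[0], parts[1]
    per_month[(year, month)] += 1

# Roll the monthly counts up into yearly totals.
per_year = Counter()
for (year, _month), count in per_month.items():
    per_year[year] += count

print(per_month)
print(per_year)
```

per_month is keyed by (year, month) tuples, so per_month[('2016', '02')] gives the number of links for February 2016, and per_year['2016'] gives the total for the year.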