Tags: python, python-3.x, loops, processing-efficiency

What is the most efficient way to run multiple tests in a single loop? Python


Goal: Visit a list of blog pages. On each blog page, find the social links (Instagram, Facebook, Twitter) for that blog.

Assumption: The first occurrence of each social link will be the right one. Occurrences later in the page are more likely to refer to someone else's account.

The desirable social URL format is www.social_network_name.com/username

There are some formats of URL that are not desirable (e.g. instagram.com/abc/)
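
One way to make the "desirable format" rule concrete is to parse the URL rather than rely on substring tests. The following is only a minimal sketch: the helper name `is_profile_url` and its parameters are hypothetical, and it assumes that "abc" stands in for any unwanted first path segment and that a desirable URL has exactly one path segment (the username).

```python
from urllib.parse import urlparse


def is_profile_url(url, domain, blocked_first_segments=("abc",)):
    """Return True if url looks like <domain>/<username> and nothing more.

    Hypothetical helper: `domain` is e.g. "instagram.com", and
    `blocked_first_segments` lists path segments to reject (assumed: "abc").
    """
    # urlparse only fills in netloc when a scheme or a leading // is present
    if not url.startswith(("http://", "https://", "//")):
        url = "//" + url
    parsed = urlparse(url)

    # Normalise the host so www.instagram.com and instagram.com compare equal
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host != domain:
        return False

    # Split the path into non-empty segments, e.g. "/username/" -> ["username"]
    segments = [s for s in parsed.path.split("/") if s]
    return len(segments) == 1 and segments[0] not in blocked_first_segments


print(is_profile_url("www.instagram.com/some_user", "instagram.com"))  # True
print(is_profile_url("instagram.com/abc/", "instagram.com"))           # False
```

Compared with a plain `in` test, this also rejects links to individual posts or to other sites that merely mention the network's domain somewhere in the URL.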

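def check_instagram(url):
   # True for Instagram links, except the unwanted instagram.com/abc/ form
   return 'instagram.com/' in url and 'instagram.com/abc/' not in url

def check_facebook(url):
   # True for Facebook links, except the unwanted facebook.com/abc/ form
   return 'facebook.com/' in url and 'facebook.com/abc/' not in url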

# my list of pages to be parsed
pages_to_check = ['www.url1.com', 'www.url2.com', ... 'www.url_n.com']

# iterate through my list of pages to be parsed
for page in pages_to_check:

   # get all the links on the page
   page_links = *<selenium code to get all links on page>*

I tried...

   for link in page_links:

      # when first Instagram handle found
      if check_instagram(link):
         *code to write to a dataframe here*
         break

      # when first Facebook handle found
      if check_facebook(link):
         *code to write to a dataframe here*
         break

Problem: As soon as one social URL matches, the code breaks out of the loop and never looks for the other networks' handles.

The options I can think of are not very good. I'm a noob, so I'd really appreciate some advice here.

Option #1 - Loop through all the links and test for the first match of ONE social network. Then loop through all the links again and test for the first match of the NEXT social network. (Hate this)

Option #2 - Create a variable for each social URL. Keep a marker for each match and only update the variable if its match is not yet set. (Better, but I'm still going to keep iterating after I have filled all the variables)

Option #3 - Any suggestions or advice welcome. How would you approach this?


Solution

  • Suggestion:

    Keep a tracker that records which social networks have already been matched. Once all of them have been matched, break out of the loop.

    Code:

    # reset the tracker for every page (i.e. inside the loop over pages_to_check)
    tracker = dict.fromkeys(['facebook', 'instagram'], False)

    for link in page_links:
        # if all the values of the tracker are true, then break out of the loop
        if all(tracker.values()):
            break
        # when first Instagram handle found
        if not tracker['instagram'] and check_instagram(link):
            *code to write to a dataframe here*
            tracker['instagram'] = True
        # when first Facebook handle found
        if not tracker['facebook'] and check_facebook(link):
            *code to write to a dataframe here*
            tracker['facebook'] = True
    

    I hope this proves useful.
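
For anyone who wants to try the tracker idea without Selenium or a dataframe, here is a self-contained sketch of the same single-pass approach. It is an illustration under assumptions, not the only way to do it: the names `CHECKS`, `first_social_links` and `sample_links` are made up for this example, the sample URLs are invented, and the found links are collected in a plain dict instead of being written to a dataframe.

```python
def check_instagram(url):
    return "instagram.com/" in url and "instagram.com/abc/" not in url


def check_facebook(url):
    return "facebook.com/" in url and "facebook.com/abc/" not in url


# Map each network name to its check function (hypothetical registry;
# supporting Twitter would mean adding one more entry here).
CHECKS = {"instagram": check_instagram, "facebook": check_facebook}


def first_social_links(page_links, checks=CHECKS):
    """Return {network: first matching link} using one pass over page_links."""
    found = {}
    for link in page_links:
        for network, check in checks.items():
            # Only the first match per network is kept
            if network not in found and check(link):
                found[network] = link
        # Stop as soon as every network has a match
        if len(found) == len(checks):
            break
    return found


# Invented sample data standing in for the links scraped from one page
sample_links = [
    "https://example.com/about",
    "https://instagram.com/abc/",
    "https://www.instagram.com/blogger1",
    "https://www.facebook.com/blogger1",
    "https://www.instagram.com/someone_else",
]

print(first_social_links(sample_links))
# {'instagram': 'https://www.instagram.com/blogger1', 'facebook': 'https://www.facebook.com/blogger1'}
```

Here the `found` dict doubles as the tracker: a network counts as processed once it has a key, so there are no separate True/False flags to keep in sync. Calling the function once per page also resets the state for each page automatically.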