python pandas dataframe python-multithreading

Unexpected behavior of a for loop after using map and partial


I'm using functools.partial to bind two parameters that are not iterables, so I don't have to pass them through map(). I'm also using ThreadPoolExecutor, since the task is I/O bound. The problem is inside get_the_text_par(): it has a for loop that should go through all the rows and send a request for each row (link), but it only does so for the first row and skips the others. How can I fix this, or what am I missing?

    get_the_text_par = partial(get_the_text,_link_column=link,_firms=firms)
    with ThreadPoolExecutor() as executor:
        #chunk_size = len(results) // 10
        chunk_size= len(results) if len(results)<10 else len(results) // 10
        chunks=[results.iloc[i:i + chunk_size] for i in range(0, len(results),chunk_size)]
        result = list(executor.map(get_the_text_par,chunks))

get_the_text implementation:

def get_the_text(_df,_firms:list,_link_column:str):
  '''
  Send a request to receive the text of the articles.

  Parameters
  ----------
  _df : DataFrame
  
  Returns
  -------
  dataframe with the text of the articles
  '''  
  _df.reset_index(inplace=True)
  print(_df)
  for k,link in enumerate(_df[[f'{_link_column}']]):
        print(k,'\n',_df.loc[k,f'{_link_column}'])
        if link:
            website_text=list()
            # print(link,'\n','K:',k)         
            try:                                            
                page_status_code,page_content,page_url =  send_two_requests(_df.loc[k,f'{_link_column}']) 
                ......
                .....
                ...
                ..
                .

To reproduce the data:

import pandas as pd

data = {
    'index': [1366, 4767, 6140, 11898],
    'DATE': ['2014-01-12', '2014-01-12', '2014-01-12', '2014-01-12'],
    'SOURCES': ['go.com', 'bloomberg.com', 'latimes.com', 'usatoday.com'],
    'SOURCEURLS': [
        'http://abcnews.go.com/Business/wireStory/mercedes-recalls-372k-suvs-21445846',
        'http://www.bloomberg.com/news/2014-01-12/vw-patent-application-shows-in-car-gas-heater.html',
        'http://www.latimes.com/business/autos/la-fi-hy-autos-recall-mercedes-20140112-story.html',
        'http://www.usatoday.com/story/money/cars/2014/01/12/mercedes-recall/4437279/'
    ],
    'Tone': [-0.375235, -1.842752, 1.551724, 2.521008],
    'Positive_Score': [2.626642, 1.228501, 3.275862, 3.361345],
    'Negative_Score': [3.001876, 3.071253, 1.724138, 0.840336],
    'Polarity': [5.628518, 4.299754, 5.0, 4.201681],
    'Activity_Reference_Density': [22.326454, 18.918919, 22.931034, 19.327731],
    'Self_Group_Reference_Density': [0.0, 0.0, 0.344828, 0.840336],
    'Year': [2014, 2014, 2014, 2014],
    'Month': [1, 1, 1, 1],
    'Day': [12, 12, 12, 12],
    'Hour': [0, 0, 0, 0],
    'Minute': [0, 0, 0, 0],
    'Second': [0, 0, 0, 0],
    'Mentioned_firms': ['mercedes', 'vw', 'mercedes', 'mercedes'],
    'text': ['', '', '', '']
}

# Creating a DataFrame
df = pd.DataFrame(data)

Solution

  • The problem comes from what the for loop iterates over. The double brackets in _df[[f'{_link_column}']] select a one-column DataFrame, not a Series, and iterating over a DataFrame yields its column labels rather than its rows. The loop therefore runs exactly once, with k equal to 0 and link equal to the column name, so only the first row is ever looked up via _df.loc[k, ...]. Selecting with single brackets, _df[_link_column], gives a Series whose iteration yields the links themselves.

    One way to fix this is to iterate over the rows with itertuples (getattr works here because the column name is a valid Python identifier):

    def get_the_text(_df, _firms: list, _link_column: str):
        _df.reset_index(inplace=True)
        print(_df)
        for row in _df.itertuples(index=False):
            link = getattr(row, f'{_link_column}')
            print(link)
            if link:
                website_text = list()
                try:
                    page_status_code, page_content, page_url = send_two_requests(link)
                    # Your remaining code here...
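
To make the difference concrete, here is a minimal sketch rather than a drop-in replacement. It assumes a hypothetical fake_fetch helper standing in for send_two_requests (whose implementation is not shown in the question), a simplified get_the_text that only fills the text column, and placeholder URLs. It compares iterating the double-bracket selection with iterating the single-bracket one, then runs the chunked partial plus executor.map pattern from the question.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import pandas as pd


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for send_two_requests; performs no network I/O."""
    return f"text for {url}"


def get_the_text(_df: pd.DataFrame, _firms: list, _link_column: str) -> pd.DataFrame:
    # reset_index without inplace returns a new frame, so the caller's chunk is untouched.
    _df = _df.reset_index(drop=True)
    # Single brackets select a Series, so this visits one link per row.
    _df["text"] = [fake_fetch(link) if link else "" for link in _df[_link_column]]
    return _df


# Placeholder data (assumption): three rows with made-up URLs.
df = pd.DataFrame({
    "SOURCEURLS": ["http://a.example/1", "http://b.example/2", "http://c.example/3"],
    "text": ["", "", ""],
})

# Iterating a DataFrame yields column labels, which is why the original loop ran once.
labels_seen = list(df[["SOURCEURLS"]])
# Iterating a Series yields the cell values, one per row.
links_seen = list(df["SOURCEURLS"])

# Same pattern as in the question: bind the non-iterable arguments, map over chunks.
worker = partial(get_the_text, _firms=["mercedes"], _link_column="SOURCEURLS")
chunk_size = 2
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
with ThreadPoolExecutor() as executor:
    combined = pd.concat(list(executor.map(worker, chunks)), ignore_index=True)
```

Here labels_seen contains only the column name, while links_seen contains all three URLs. Since executor.map returns results in the order of its inputs, concatenating the processed chunks rebuilds the rows in their original order.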