Tags: python, loops, scrapy, scrapy-item

Python Scrapy: building the same item from multiple parse functions by calling a second parse function in a loop


I am trying to build a single item from several parsing functions because the data comes from multiple URLs. I iterate over a dictionary (built with two nested for loops), which is why I use two for loops to get the variables needed to generate each URL. For every variable I then call the second parse function, passing it the URL. This is where I call the second parse function from my main parse:

    for r in [1, 2]:
        for t in [1, 2]:
            dataName = 'lane' + str(r) + "Player" + str(t) + "Name"
            dataHolder = 'lane' + str(r) + "Player" + str(t)
            nameP = item[dataName]
            print('before parse ==> lane = ' + str(r) + "  team = " + str(t))
            urlP = 'https://www.leagueofgraphs.com/summoner/euw/' + nameP + '#championsData-soloqueue'
            yield Request(urlP, callback=self.parsePlayer, meta={'item': item, "player": dataHolder})

I am using those print() calls to see in the output how my code executes. I do the same in my second parsing function, which is the following:

def parsePlayer(self, response):
    item = response.meta['item']
    player = response.meta['player']
    print('after parse ====> ' + player)
    mmr = response.css('.rank .topRankPercentage::text').extract_first().strip().lower()
    mmrP = player + "Mmr"
    item[mmrP] = mmr
    # yield item after the last iteration

(I know I did not explain every detail of the code, but I don't think that's needed to see my problem, especially after you see what I get from those prints.)

Result I get:

Expected result:

Also, for some reason, every time I run the spider I get a different random order of prints. This is confusing; I think it is something about the yield. I hope someone can help me with that.


Solution

  • Scrapy works asynchronously (as explained in the official documentation), which is why the order of your prints seems random. Aside from the order, the expected output looks exactly the same as the result you get. If you can explain why the order is relevant, we might be able to answer your question better.

    If you want to yield one item containing the data of all 4 players, the following structure can be used:

        def start_requests(self):
            # prepare the urls & players:
            urls_dataHolders = []
            for r in [1, 2]:
                for t in [1, 2]:
                    dataName = 'lane' + str(r) + "Player" + str(t) + "Name"
                    dataHolder = 'lane' + str(r) + "Player" + str(t)
                    # note: the url should contain the actual summoner name
                    # (item[dataName] in your spider), not the key itself
                    urlP = 'https://www.leagueofgraphs.com/summoner/euw/' + dataName\
                           + '#championsData-soloqueue'
                    urls_dataHolders.append((urlP, dataHolder))

            # get the first url & dataHolder
            url, dataHolder = urls_dataHolders.pop()
            yield Request(url,
                          callback=self.parsePlayer,
                          meta={'urls_dataHolders': urls_dataHolders,
                                'player': dataHolder})
    
        def parsePlayer(self, response):
            item = response.meta.get('item', {})
            urls_dataHolders = response.meta['urls_dataHolders']
            player = response.meta['player']
            mmr = response.css(
                '.rank .topRankPercentage::text').extract_first().strip().lower()
            mmrP = player + "Mmr"
            item[mmrP] = mmr
            try:
                url, dataHolder = urls_dataHolders.pop()
            except IndexError:
                # list of urls is empty, so we yield the item
                yield item
            else:
                # still urls to go through
                yield Request(url,
                              callback=self.parsePlayer,
                              meta={'urls_dataHolders': urls_dataHolders,
                                    'item': item,
                                    'player': dataHolder})