python, web-scraping, scrapy

Scrapy code skipping data in the JSON output


I have written a Scrapy spider that parses data on football fixtures. The code almost works, but for some reason the scraped data is incomplete. For example, for the URL: https://www.fcf.cat/acta/2223/futbol-11/divisio-honor-cadet/grup-1/hc/barcelona-fc-a/hc/damm-cf-a

the JSON output does not include the goal scored in minute 34, and I cannot see why. Can anyone help me understand what is going wrong?

                    case "Gols":
                        for row in table.css("tbody tr"):
                            #player_name = row.css("td a::text").get().strip()
                            player_name = row.xpath("string(td/a)").get().strip()
                            
                            timestamp_acta = ""
                            tipus_gol = ""
                            
                            
                            if row.css(".faf-pilota_base.p-a.stat-center.gol-normal"):
                                tipus_gol = "Normal"

                            if row.css(".faf-pilota_base.p-a.stat-center.gol-propia"):
                                tipus_gol = "Propia"

                            if row.css(".faf-pilota_base.p-a.stat-center.gol-penal"):
                                tipus_gol = "Penal"
                            
                            
                            # td:last-child selects the last cell in the row 👇
                            timestamp = row.css("td:last-child::text").get()
                            
                            table_data[player_name] = {
                                "Minut": timestamp,
                                "Tipus": tipus_gol}
                            
                    
                    case "Estadi":
                        table_data = []
                        table_data.append(
                            table.css("tr a::text").get()
                        )
                        table_data.append(
                            table.css("tr td.uppercase::text").get()
                        )
                    
                    case "Comparativa":
                        team1 = response.css(".td-comparativa .comparativa-equip1 span::text").get()
                        team2 = response.css(".td-comparativa .comparativa-equip2 span::text").get()
                        
                        table_data["Local"] = team1
                        table_data["Visitant"] = team2
    
                dt[table_heading] = table_data

I would like help understanding why the code skips that row so that I can fix it. Being able to scrape all of the data is really important.


Solution

  • The goal at minute 34 is missing from your results because you store the goal data in a dictionary keyed by player name. Each assignment to the same key overwrites the previous value, so if a player scores more than once, only their last goal is recorded.

    Consider storing a list of goals for each player name instead.

    Instead of this:

    table_data[player_name] = {
        "Minut": timestamp,
        "Tipus": tipus_gol}
    

    You could use something like this:

    goal_info = {"Minut": timestamp, "Tipus": tipus_gol}
    # Append to this player's list of goals, creating the list on their first goal
    table_data.setdefault(player_name, []).append(goal_info)
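
To see the difference outside of Scrapy, here is a minimal standalone sketch. The `rows` list, the player names, and the minutes are made-up stand-ins for the parsed "Gols" table, not data from the actual page; `collections.defaultdict` is used here as one possible way to build the per-player lists.

```python
from collections import defaultdict

# Hypothetical (player, minute, goal type) rows; illustrative values only,
# not taken from the real match report.
rows = [
    ("PLAYER A", "12", "Normal"),
    ("PLAYER B", "34", "Normal"),
    ("PLAYER B", "60", "Penal"),
]

# One dict per player: the second goal by PLAYER B replaces the first.
overwritten = {}
for name, minute, kind in rows:
    overwritten[name] = {"Minut": minute, "Tipus": kind}

# One list per player: every goal is appended, so nothing is lost.
goals_by_player = defaultdict(list)
for name, minute, kind in rows:
    goals_by_player[name].append({"Minut": minute, "Tipus": kind})

print(overwritten["PLAYER B"])           # only the minute-60 goal survives
print(len(goals_by_player["PLAYER B"]))  # 2
```

Note that the JSON output changes shape with this approach: each player name now maps to a list of goals rather than a single object, so anything consuming the output needs to expect a list.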