I have a big data set with multiple football matches. Right now the format is wide and I want to count winning streaks, drawing streaks and loosing streaks and teams by their match.
In this case I have the following variables:
To give a simple example, my data looks something like this:
Home_team <- c("Peru","France","England","Senegal", "Chile", "Colombia","France","Spain","Colombia", "Angola", "Ecuador", "France",
"Peru")
Away_team <- c("Brasil","Germany","Togo","Egypt", "Ecuador", "Argentina","Netherlands","Burkina Faso","New Zealand", "Venezuela", "Portugal", "Canada",
"United States")
Results <- c("HW","HW","AW","D","AW","HW","HW","AW","HW","D","D","HW","D")
df_example <- data.frame(Home_team,Away_team,Results)
df_example
So in this example the following things happened:
I was thinking that an easier way to do this is put everything in long format and count "Wins", "Losses" and "Draws". And every time the streak stops the counting starts again. But I am not sure if that is the best approach.
The big picture is that I would like to know if streaks (winning, loosing or even drawing) has an effect on the result of the next match.
Any help would be highly appreciated.
Here is one approach to try.
First, put data into long form. Using case_when
you can determine the outcome for each each (e.g., the home team with "HW" game gets a "Win", while the away team gets a "Loss").
For each team, you can group the streaks using rleid
from data.table
. Every time the outcome changes, it moves on to a new streak.
Then, you can count up the number of games in a streak for a given team and outcome. It will be the row_number()
for a given streak, team, and outcome.
Finally, you put back into wider form if desired. The new columns indicate the streak (number of games, and outcome) for the home and away teams, coming into the current game.
library(tidyverse)
library(data.table)
df_example %>%
mutate(Game = row_number()) %>%
pivot_longer(cols = c(Home_team, Away_team), names_to = "Location", values_to = "Team") %>%
mutate(Outcome = case_when(
Results == "HW" & Location == "Home_team" ~ "Win",
Results == "HW" & Location == "Away_team" ~ "Loss",
Results == "AW" & Location == "Home_team" ~ "Loss",
Results == "AW" & Location == "Away_team" ~ "Win",
Results == "D" ~ "Draw",
TRUE ~ NA_character_
)) %>%
group_by(Team) %>%
mutate(Change = rleid(Outcome)) %>%
group_by(Change, .add = T) %>%
mutate(Streak = row_number()) %>%
group_by(Team) %>%
mutate(Last = paste(lag(Streak, default = 0), lag(Outcome, default = "-"))) %>%
pivot_wider(id_cols = c(Game, Results), names_from = Location, values_from = c(Team, Last))
Outcome
Game Results Team_Home_team Team_Away_team Last_Home_team Last_Away_team
<int> <chr> <chr> <chr> <chr> <chr>
1 1 HW Peru Brasil 0 - 0 -
2 2 HW France Germany 0 - 0 -
3 3 AW England Togo 0 - 0 -
4 4 D Senegal Egypt 0 - 0 -
5 5 AW Chile Ecuador 0 - 0 -
6 6 HW Colombia Argentina 0 - 0 -
7 7 HW France Netherlands 1 Win 0 -
8 8 AW Spain Burkina Faso 0 - 0 -
9 9 HW Colombia New Zealand 1 Win 0 -
10 10 D Angola Venezuela 0 - 0 -
11 11 D Ecuador Portugal 1 Win 0 -
12 12 HW France Canada 2 Win 0 -
13 13 D Peru United States 1 Win 0 -