TL;DR: most of this post consists of example that I've included to make as clear as possible, but the core of the question is contained in the middle section "The Actual question" where examples are reduced to the bone.
I have a database which contains data about football matches from which I am trying to extract some stats.
The database contains just one table, called 'allMatches', in which eache entry represent a match, the fields (I am just including the fields that are absolutely necessary to give a sense of what the problem is) of the table are:
For each entry in the database I have to extract some stats about both away and home team. This can be achieved very easily when you are considering stats about ALL previous matches, for example, to obtain goal scored and conceded stats, first I run this query:
singleTeamAllMatches=
select ID as MatchID,
Date as Date,
HT as Team,
HG as Scored,
AG as Conceded
from allMatches
UNION ALL
select ID as MatchID,
Date as Date,
AT as Team,
AG as Scored,
HG as Conceded
from allMatches;
This is not absolutely necessary, since it simply transform the orginal table in this way:
this row in allMatches:
|ID |Date | HT |AT |HG | AG|
|42 |2011-05-08 |Genoa |Sampdoria | 2 | 1 |
"becomes" two rows in singleTeamAllMatches:
|MatchID |Date |Team |Scored|Conceded|
|42 |2011-05-08 |Genoa | 2 | 1 |
|42 |2011-05-08 |Sampdoria | 1 | 2 |
but allows me to get the stats I need with a very simple query:
select a.MatchID as MatchID,
a.Team as Team,
Sum(b.Scored) as totalScored,
Sum(b.Conceded) as totalConceded
from singleTeamAllMatches a, singleTeamAllMatches b
where a.Team == b.Team AND b.Date < a.Date
I end up with a query that, when runned, returns:
In other words, if in this last query I obtain:
|MatchID| Team |totalScored|totalConceded|
|42 | Genoa |38 | 40 |
|42 | Sampdoria |30 | 42 |
It means that Genoa and Sampdoria played against each other in the match with ID 42 and, before that match Genoa had scored 38 goals and conceded 40, while Sampdoria had scored 30 and conceded 42.
Now, this is very easy because I consider ALL previous matches, what I have no idea how to accomplish is how to obtain the exact same stats considering only the 6 previous matches. For example, let's say that in singleTeamAllMatches I have:
|MatchID |Date |Team |Scored|Conceded|
|1 |2011-05-08 |TeamA | 1 | 5 |
|2 |2011-06-08 |TeamA | 0 | 2 |
|3 |2011-07-08 |TeamA | 3 | 0 |
|4 |2011-08-08 |TeamA | 4 | 0 |
|5 |2011-09-08 |TeamA | 1 | 0 |
|6 |2011-10-08 |TeamA | 0 | 1 |
|7 |2011-11-08 |TeamA | 0 | 1 |
|8 |2011-12-08 |TeamA | 1 | 1 |
I need to find a way to obtain something like this:
|MatchID| Team |totalScored|totalConceded|
|1 | TeamA |0 | 0 |
|2 | TeamA |1 | 5 |
|3 | TeamA |1 | 7 |
|4 | TeamA |4 | 7 |
|5 | TeamA |8 | 7 |
|6 | TeamA |9 | 7 |
|7 | TeamA |9 | 8 |
|8 | TeamA |8 | 4 |
Let's have a look at the last two rows in this query:
Row 7 means that in the last 6 matches before match 7 (matches 1-6) teamA scored 9 goals and conceded 8.
Row 8 isn't affected by the goals scored in match 1, because it just tells us that in the last 6 matches before match 8 (matches 2-7) teamA scored 8 goals and conceded 4.
Is there a way to obtain this with sql through the sqldf package?
(edit: actually any solution will do, with the dplyr package the task is almost trivial and efficiently accomplished)
At the moment the only thing I could come up with was to import the data in R and loop over all the rows in allMatches using sql 'LIMIT' and the sqldf R package.
The following is an adaptation to the example I've used here of the code I use. It is just a sample that gets the stats only for the home team, but the complete code is quite long and wouldn't be useful here.
allMatches and singleTeamAllMatches are dataframes whose structure and content is the same of the table and query I've described above.
lastMatchesData <- NULL
for(match in (1:nrow(allMatches))){
matchRow <- allMatches[match,]
T <- matchRow$HT
Date <- matchRow$Date
ID <- matchRow$ID
lastMatches <- singleTeamAllMatches[singleTeamAllMatches$T == T & singleTeamAllMatches$Date < Date ,]
TPerformance <- sqldf("select sum(Scored) as Scored,
sum(Conceded) as Conceded
from
(select * from lastMatches order by Date DESC limit 6)")
newRow <- cbind(ID,TPerformance)
lastMatchesData <- rbind(lastMatchesData,newRow)
}
I don't like this solution for two reason: first of all, it is really ugly and messy, remember that this is just a sample, but in the future I think I'm going to modify this code, and an all-sql solution would be much better. The second reaso is that it is slow, and I mean really slow, again an all-sql solution would be much better.
Here is one solution I came up with using dplyr:
library(dplyr)
df <- df %>% group_by(Team) %>% mutate(cumScored = cumsum(Scored), totalScored = cumScored - ifelse(row_number() >= 7, lag(cumScored, 6), 0), cumConceded = cumsum(Conceded), totalConceded = cumConceded - ifelse(row_number() >= 7, lag(cumConceded, 6), 0)) %>% select(-cumScored, -cumConceded)
The idea is to first compute cumulative sums of scores and concessions, and then to retain only the last six matches, subtract out the 6th lag of the cumulative sum from current cumulative sum so you get the partial cumulative sum over the last six lags of matches. I could not find a way to do cumulative sum on the fly over arbitrary number of lags. So, using a trick of adding a new column and then deselecting it. Hope this helps.