I'm working on a project that studies to see if there is a trend to how much a student studies on average each semester of each year during university. Each dataframe was listed as year1
, year2
, etc. when read in. The data frame I started with has the following format:
studentID | region | year | semester | week | numHours |
---|---|---|---|---|---|
1 | West | 1 | Fall | 1 | 4-6 |
1 | West | 1 | Fall | 2 | 7-9 |
1 | West | 1 | Fall | 3 | 7-9 hour(s) |
This, of course, only lists some of the entries.
The region, year, and week columns were removed. The year because there was a separate csv file for each year and the week column because the particular week didn't matter, just the number of hours. The new dataframe was given the name year1Reduced
, year2Reduced
, etc.
The next thing after removing the columns I did was use the reshape
function in order to change the data from "long" to "wide" format so I could have three new columns, one for each semester (Fall, Spring, Summer) and the number of hours each student studied every week during whichever semester. This is how I wrote it:
year1Reduced.wide.Reshape <- reshape(data = year1Reduced,
idvar = c("studentID"),
timevar = "semester",
direction = "wide",
v.names = c("numHours")
)
It worked, sort of. I got an unnecessary column, which I removed, but I also got the separate columns for Fall, Spring, and Summer, but instead of listing all of the number of hours each student studied for each week, it only took the first element under numHours
for each semester for each student. This is what it looked like:
studentID | numHours.Fall | numHours.Spring | numHours.Summer |
---|---|---|---|
1 | 4-6 | Less than 1 | 1-3 hour(s) |
2 | 4-6 hour(s) | 1-3 | 1-3 hour(s) |
3 | 4-6 hour(s) | less than 1 | 1-3 hour(s) |
What I'm trying to figure out is how to have all of the numHours
entries included instead of just one entry per student per semester. If it's just a small change in the reshape code I wrote, please let me know.
The format I want would look like this:
studentID | numHours.Fall | numHours.Spring | numHours.Summer |
---|---|---|---|
1 | 4-6 | Less than 1 | 1-3 hour(s) |
1 | 5-7 hour(s) | 1-4 | 1-2 hour(s) |
1 | 4-6 hour(s) | 5-6 | 1-4 hour(s) |
Any help would be appreciated. I'm pretty new at R, so try to explain it as though I know nothing if possible.
We don't have enough data to test this but you can try :
library(dplyr)
library(tidyr)
year1Reduced %>%
select(studentID, semester, numHours) %>%
group_by(studentID, semester) %>%
mutate(row = row_number()) %>%
ungroup %>%
pivot_wider(names_from = semester, values_from = numHours) %>%
select(-row)