I have a large data frame in which an id variable (first column) recurs with different values in the second column. My idea is to order the data frame, split it into a list, and then lapply a function that cbinds the sequence 1:nrow(group) to each group. My code so far:
DF <- DF[order(DF[,1]),]
DF <- split(DF,DF[,1])
DF <- lapply(1:length(DF), function(i) cbind(DF[[i]], 1:length(DF[[i]])))
But this gives me an error: arguments imply different number of rows.
Can you elaborate?
> head(DF, n=50)
cell area
1 1 121.2130
2 2 81.3555
3 3 81.5862
4 4 83.6345
...
33 1 121.3270
34 2 80.7832
35 3 81.1816
36 4 83.3340
DF <- DF[order(DF$cell),]
What I want is:
> head(DF, n=50)
cell area counter
1 1 121.213 1
33 1 121.327 2
65 1 122.171 3
97 1 122.913 4
129 1 123.697 5
161 1 124.474 6
...and so on.
This is my code:
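cell.areas.t <- function(file) {
  dat <- paste(file) # coerce the file name to a character string
  DF <- read.table(dat, col.names = c("cell","area"))
  DF <- splitstackshape::getanID(DF, "cell")[] # thanks to akrun's answer
  # namespace-qualified calls so the function also works when ggplot2 is not attached
  ggplot2::ggplot(data = DF, ggplot2::aes(x = .id , y = area, color = cell)) +
    ggplot2::geom_line(ggplot2::aes(group = cell)) + ggplot2::geom_point(size=0.1)
}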
And the plot looks like this:
Most cells increase in area, only some decrease. This is only a first try to visualize my data, so what you can't see very well is that the areas drop down periodically due to cell division.
Additional question:
There is a problem I didn't take into account beforehand: after a cell division, a new cell is added to the data.frame and is given the initial index 1 (you can see in the image that all cells start from .id=1, not later). That is not what I want; the new cell needs to inherit the index of its creation time. The first thing that comes to mind is a parsing mechanism that does this for a newly added cell:
DF$.id[DF$cell != temporary.cellindex] <- max(DF$.id[DF$cell != temporary.cellindex])
Do you have a better idea? Thanks.
There is a boundary condition that may ease the problem: the number of cells at the beginning is fixed (32). Another solution would be to cut away all data before the last daughter cell is created.
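The "cut away" idea could be sketched roughly as follows. This is only an illustration on made-up toy data, and it assumes the file is written time step by time step with cell ids ascending within each step, so that a drop in the cell id marks the start of a new step. The names step, first_seen, last_birth and DF_trimmed are invented for this sketch.

```r
# Toy data (illustrative values): 2 cells at first, a third appears at step 3
DF <- data.frame(cell = c(1, 2,  1, 2,  1, 2, 3,  1, 2, 3),
                 area = c(10, 20, 11, 21, 6, 22, 6, 7, 23, 7))

# Assumed time-step index: a new step starts whenever the cell id decreases
DF$step <- c(0, cumsum(diff(DF$cell) < 0)) + 1L

# First step at which each cell shows up, and the latest of those "births"
first_seen <- tapply(DF$step, DF$cell, min)
last_birth <- max(first_seen)

# Keep only the rows from that step onwards, where every cell already exists
DF_trimmed <- DF[DF$step >= last_birth, ]
```

With the known starting count of 32 cells, the same index would also show directly which cells appear after step 1.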
Update: Additional question solved, here's the code:
cell.areas.t <- function(file) {
  dat <- paste(file) # coerce the file name to a character string
  DF <- read.table(dat, col.names = c("cell","area"))
  # Indexing: the counter increases each time the cell id drops back to a smaller value
  DF$.id <- c(0, cumsum(diff(DF$cell) < 0)) + 1L
  title <- getwd()
  myplot <- ggplot2::ggplot(data = DF, ggplot2::aes(x = .id , y = area, color = factor(cell))) +
    ggplot2::geom_line(ggplot2::aes(group = cell)) + ggplot2::geom_line(size=0.1) +
    ggplot2::theme(legend.position="none") + ggplot2::ggtitle(title)
  # save the plot
  ggplot2::ggsave(filename = "cell_areas_time.svg", plot = myplot, width=10, height=8)
}
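To show what the indexing line does, here is a small sketch on a made-up cell column. It assumes, as the real file appears to, that cell ids are listed in ascending order within each time step, so a negative difference marks the start of the next step. The vector values and the names new_step and id are invented for illustration.

```r
# Toy cell column: 3 cells in the first two time steps, a 4th appears in the third
cell <- c(1, 2, 3,  1, 2, 3,  1, 2, 3, 4)

# diff() is negative exactly where the cell id wraps back to a smaller value,
# i.e. where a new time step starts (assumes ids ascend within each step)
new_step <- diff(cell) < 0

# Cumulative count of wraps, padded with a leading 0 and shifted so step one is 1
id <- c(0, cumsum(new_step)) + 1L
id
```

In this toy case the new cell 4 gets the index of the step in which it first appears (3) instead of starting at 1.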
We can use getanID from splitstackshape:
library(splitstackshape)
getanID(DF, "cell")[]
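If an extra package is not wanted, a base R sketch along the following lines should give the same kind of counter. The toy values are made up for illustration. As a side note, the lapply attempt in the question most likely fails because length() of a data frame returns its number of columns, not rows, so 1:length(DF[[i]]) is 1:2 rather than one value per row; seq_len(nrow(DF[[i]])) would be the row-wise equivalent.

```r
# Toy data in the same shape as the question (values are illustrative only)
DF <- data.frame(cell = c(1, 2, 3, 1, 2, 3, 1, 2),
                 area = c(121.2, 81.4, 81.6, 121.3, 80.8, 81.2, 122.2, 80.1))

# Running counter within each cell id; ave() keeps the original row order
DF$counter <- ave(seq_along(DF$cell), DF$cell, FUN = seq_along)

# Order by cell afterwards if the grouped layout is wanted
DF <- DF[order(DF$cell), ]
```

Because ave() returns the counter in the original row order, the data frame does not need to be sorted or split first.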