I am trying to process a "segmentation file" in the .TextGrid format (generated by the Praat program).
The original format looks like this:
File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0
xmax = 243.761375
tiers? <exists>
size = 17
item []:
item [1]:
class = "IntervalTier"
name = "phones"
xmin = 0
xmax = 243.761
intervals: size = 2505
intervals [1]:
xmin = 0
xmax = 0.4274939687384032
text = "_"
intervals [2]:
xmin = 0.4274939687384032
xmax = 0.472
text = "v"
intervals [3]:
[...]
This is then repeated to the end of the file, with intervals [3] to [n] for each of the n items (layers of annotation) in the file.
Somebody proposed a solution using the rPython R package.
Unfortunately :
Right now my aim is to split this file into multiple data frames. Each data frame should contain one item (layer of annotation).
# Load the data; each "key = value" line is split at "=" (sep must be a single character)
library(stringr)
txtgrid <- read.delim("./xxx_01_xx.textgrid", sep="=", dec=".", header=FALSE)
# Trim white space (str_trim is from the stringr package)
txtgrid[,1] <- str_trim(txtgrid[,1])
# Convert row.names to numeric
num.row <- as.numeric(row.names(txtgrid))
# Add the row numbers as a column (I want to keep them for later processing)
txtgrid <- data.frame(num.row, txtgrid)
colnames(txtgrid) <- c("num.row", "object", "value")
head(txtgrid)
head(txtgrid) only shows the file header, so here are the first 20 rows of the data frame, txtgrid[1:20,]:
   num.row object          value
1        1 File type       ooTextFile
2        2 Object class    TextGrid
3        3 xmin            0
4        4 xmax            243.761375
5        5 tiers? <exists>
6        6 size            17
7        7 item []:
8        8 item [1]:
9        9 class           IntervalTier
10      10 name            phones
11      11 xmin            0
12      12 xmax            243.761
13      13 intervals: size 2505
14      14 intervals [1]:
15      15 xmin            0
16      16 xmax            0.4274939687384032
17      17 text            _
18      18 intervals [2]:
19      19 xmin            0.4274939687384032
20      20 xmax            0.472
Now that the data is pre-processed, I can:
# Find the rows where I want to split (i.e. the "item" lines)
tier.begining <- txtgrid[grep("item", txtgrid$object, perl=TRUE), ]
# And save their row numbers in a variable
x <- as.numeric(row.names(tier.begining))
This variable x gives me the row numbers at which my data should be split into several data frames.
It has 18 entries, one more than the number of items, because the first match is the enclosing "item []:" line, which contains all the other items. So the vector x is:
x
[1] 7 8 10034 14624 19214 22444 25674 28904 31910 35140 38146 38156 38566 39040 39778 40222 44800
[18] 45018
How can I tell R to split this data frame into multiple data frames, textgrids$nameoftheItem,
so that I get as many data frames as I have items? For example:
textgrid$phones
item [1]:
class = "IntervalTier"
name = "phones"
xmin = 0
xmax = 243.761
intervals: size = 2505
intervals [1]:
xmin = 0
xmax = 0.4274939687384032
text = "_"
intervals [2]:
xmin = 0.4274939687384032
xmax = 0.472
text = "v"
[...]
intervals [n]:
textgrid$syllable
item [2]:
class = "IntervalTier"
name = "syllable"
xmin = 0
xmax = 243.761
intervals: size = 1200
intervals [1]:
xmin = 0
xmax = 0.500
text = "ve"
intervals [2]:
[...]
intervals [n]:
textgrid$item[n]
I wanted to use
txtgrid.new <- split(txtgrid, f=x)
But I get this warning:
Warning message: In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : data length is not a multiple of split variable
I don't get the desired output: the row numbers are no longer consecutive and the file is all mixed up.
I have also tried the which, daply (from plyr) and subset functions, but never got them to work properly!
I welcome any ideas on how to structure this data properly and efficiently. Ideally I should be able to link items (layers of annotation) to each other (via the xmin and xmax of different layers), as well as across multiple TextGrid files; this is just the beginning.
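The length of the split vector should be equal to the number of rows in the data.frame. split() expects one group label per row; x only holds the 18 rows where an item starts, so it is recycled across all rows and consecutive rows end up in different groups.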
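To make the recycling behaviour concrete, here is a minimal sketch on a toy data frame. The names toy, starts, wrong and right are made up for illustration and are not part of the original data.

```r
# Toy data frame with 6 rows (hypothetical, for illustration only)
toy <- data.frame(row = 1:6,
                  object = c("item [1]:", "xmin", "xmax",
                             "item [2]:", "xmin", "xmax"),
                  stringsAsFactors = FALSE)

# Rows where each block starts, analogous to x in the question
starts <- c(1, 4)

# Passing the start rows directly: the two values are recycled over the
# six rows (1, 4, 1, 4, 1, 4), so rows are interleaved instead of cut into blocks
wrong <- split(toy, starts)
wrong[["1"]]$row   # 1 3 5

# One group label per row keeps consecutive rows together
f <- rep(seq_along(starts), times = diff(c(starts, nrow(toy) + 1)))
right <- split(toy, f)
right[["1"]]$row   # 1 2 3
```

With 18 start rows and a row count that is not a multiple of 18, the same recycling also triggers the "data length is not a multiple of split variable" warning.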
Try the following:
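# Drop the file header, up to and including the "item []:" line
txtgrid.sub <- txtgrid[-(1:grep("item", txtgrid$object)[1]), ]
# Rows of txtgrid.sub where each tier ("item [n]:") starts
starts <- grep("item", txtgrid.sub$object)
# One group id per row: tier i runs from starts[i] to the row before starts[i + 1]
splits <- rep(seq_along(starts), times = diff(c(starts, nrow(txtgrid.sub) + 1)))
df.list <- split(txtgrid.sub, splits)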
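The question asks for access by tier name (textgrid$phones and so on). One possible way to get that, sketched here on a hypothetical two-tier stand-in for df.list, is to rename the list elements from each tier's "name" row. The sketch assumes every tier has a "name" row in the object column and that the value column holds the tier name.

```r
# Hypothetical stand-in for df.list: two tiers in the object/value layout
tier1 <- data.frame(object = c("item [1]:", "class", "name", "xmin"),
                    value  = c("", "IntervalTier", "phones", "0"),
                    stringsAsFactors = FALSE)
tier2 <- data.frame(object = c("item [2]:", "class", "name", "xmin"),
                    value  = c("", "IntervalTier", "syllable", "0"),
                    stringsAsFactors = FALSE)
df.list <- list(`1` = tier1, `2` = tier2)

# Take the first "name" row of each tier (assumed to exist) as its label;
# trimws() guards against stray white space left over from parsing
tier.names <- vapply(df.list,
                     function(d) trimws(d$value[d$object == "name"][1]),
                     character(1))

# make.unique() keeps the names distinct if two tiers share a name
names(df.list) <- make.unique(unname(tier.names))

names(df.list)            # "phones" "syllable"
df.list$phones$value[2]   # "IntervalTier"
```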
EDIT:
You could then simplify the data by doing something like this:
l <- lapply(df.list, function(x) {
tmp <- as.data.frame(t(x[, 3, drop=FALSE]), stringsAsFactors=FALSE)
names(tmp) <- make.unique(make.names(x[, 2]))
tmp
})
library(plyr)
do.call(rbind.fill, l)
item..1.. class name xmin xmax intervals..size
1 <NA> IntervalTier phones 0 243.761 2505
2 <NA> IntervalTier syllable 0 243.761 2505
intervals..1.. xmin.1 xmax.1 text intervals..2..
1 <NA> 0 0.4274939687384032 _ <NA>
2 <NA> 0 0.4274939687384032 _ <NA>
xmin.2 xmax.2
1 0.4274939687384032 0.472
2 <NA> <NA>
NB: I've used dummy data for the above.
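If the goal is to compare xmin and xmax across layers, a long layout with one row per interval may be easier to work with than the wide one above. The following is only a sketch: tier_to_intervals is a made-up helper, the tier data is dummy data, and it assumes an interval tier in which every "intervals [n]:" row is followed by exactly xmin, xmax and text, in that order. Other tier classes would need their own handling.

```r
# Dummy tier in the object/value layout (hypothetical values)
tier <- data.frame(
  object = c("item [1]:", "class", "name", "xmin", "xmax", "intervals: size",
             "intervals [1]:", "xmin", "xmax", "text",
             "intervals [2]:", "xmin", "xmax", "text"),
  value  = c("", "IntervalTier", "phones", "0", "243.761", "2",
             "", "0", "0.4274939687384032", "_",
             "", "0.4274939687384032", "0.472", "v"),
  stringsAsFactors = FALSE)

# Hypothetical helper: one row per interval, columns xmin, xmax, text
tier_to_intervals <- function(d) {
  # Rows such as "intervals [1]:"; "intervals: size" does not match
  starts <- grep("^intervals \\[", d$object)
  data.frame(xmin = as.numeric(d$value[starts + 1]),
             xmax = as.numeric(d$value[starts + 2]),
             text = d$value[starts + 3],
             stringsAsFactors = FALSE)
}

phones <- tier_to_intervals(tier)
phones
```

Applied with lapply() over the list of tiers, this would give one such data frame per layer.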