R: Dynamic split/subset of a data frame by selected row numbers (parsing a Praat TextGrid)


I am trying to process a "segmentation file" called a .TextGrid (generated by the Praat program).

The original format looks like this:

File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0 
xmax = 243.761375 
tiers? <exists> 
size = 17 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "phones" 
        xmin = 0 
        xmax = 243.761 
        intervals: size = 2505 
        intervals [1]:
            xmin = 0 
            xmax = 0.4274939687384032 
            text = "_" 
        intervals [2]:
            xmin = 0.4274939687384032 
            xmax = 0.472 
            text = "v" 
        intervals [3]:
[...]

(This pattern repeats to the end of the file, with intervals [3] to [n] for each item (layer of annotation) in the file.)

Somebody proposed a solution using the rPython R package.

Unfortunately:

  • I don't have a good knowledge of Python.
  • rPython is not available for R 3.0.2 (which I am using).
  • My aim is to develop this parser for my analysis exclusively within the R environment.

Right now my aim is to segment this file into multiple data frames. Each data frame should contain one item (layer of annotation).

# str_trim() below comes from the stringr package
library(stringr)
# Load the data; sep is a single field separator, each line is already one record
txtgrid <- read.delim("./xxx_01_xx.textgrid", sep="=", dec=".", header=FALSE)
# Trim white space
txtgrid[,1] <- str_trim(txtgrid[,1])
# Convert row.names to numeric
num.row <- as.numeric(row.names(txtgrid))
# Rebuild the text grid with the row numbers as a column (I want to keep them for later processing)
txtgrid <- data.frame(num.row, txtgrid)
colnames(txtgrid) <- c("num.row", "object", "value")
head(txtgrid)

The output of head(txtgrid) is very raw, so here are the first 20 rows of the text grid, txtgrid[1:20,]:

   num.row          object                value
1        1       File type           ooTextFile
2        2    Object class             TextGrid
3        3            xmin                   0 
4        4            xmax          243.761375 
5        5 tiers? <exists>                     
6        6            size                  17 
7        7        item []:                     
8        8       item [1]:                     
9        9           class        IntervalTier 
10      10            name              phones 
11      11            xmin                   0 
12      12            xmax             243.761 
13      13 intervals: size                2505 
14      14  intervals [1]:                     
15      15            xmin                   0 
16      16            xmax  0.4274939687384032 
17      17            text                   _ 
18      18  intervals [2]:                     
19      19            xmin  0.4274939687384032 
20      20            xmax               0.472 

Now that I have pre-processed it, I can:

# Find the number of the rows where I want to split (i.e. Item)
tier.begining <- txtgrid[grep("item", txtgrid$object, perl=TRUE), ]
# And save those numbers in a variable
x <- as.numeric(row.names(tier.begining))

This variable x gives me the row numbers just before which my data should be split into several data frames.

I get 18 matches, which is one more than the number of items: the first match is item [], which encloses all the other items. So the vector x is:

x
 [1]     7     8 10034 14624 19214 22444 25674 28904 31910 35140 38146 38156 38566 39040 39778 40222 44800
[18] 45018

How can I tell R to segment this data frame into multiple data frames, textgrid$nameoftheItem, so that I get as many data frames as I have items? For example:

textgrid$phones
         item [1]:
            class = "IntervalTier" 
            name = "phones" 
            xmin = 0 
            xmax = 243.761 
            intervals: size = 2505 
            intervals [1]:
            xmin = 0 
            xmax = 0.4274939687384032 
            text = "_" 
            intervals [2]:
            xmin = 0.4274939687384032 
            xmax = 0.472 
            text = "v" 
            [...]
            intervals [n]:
textgrid$syllable
    item [2]:
            class = "IntervalTier" 
            name = "syllable" 
            xmin = 0 
            xmax = 243.761 
            intervals: size = 1200
            intervals [1]:
            xmin = 0 
            xmax = 0.500
            text = "ve" 
            intervals [2]:
            [...]
            intervals [n]:
    textgrid$item[n]

I wanted to use

txtgrid.new <- split(txtgrid, f=x)

But I get this message:

Warning message: In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : data length is not a multiple of split variable

I don't get the desired output: it seems that the row numbers no longer follow each other and that the file is all mixed up.

I have also tried which, daply (from plyr) and subset, but never got them to work properly!

I welcome any ideas for structuring this data properly and efficiently. Ideally I should be able to link items (layers of annotation) to each other (via the xmin and xmax of different layers), as well as across multiple TextGrid files; this is just the beginning.


Solution

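  • The length of the split vector should be equal to the number of rows in the data.frame. split() recycles a shorter vector across the rows (hence the warning), so the rows get interleaved instead of being cut at the positions in x.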

    Try the following:

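    # Drop the file header, up to and including the enclosing "item []:" row
    txtgrid.sub <- txtgrid[-(1:grep("item", txtgrid$object)[1]), ]
    
    # Row positions (within txtgrid.sub) at which each item starts
    starts <- grep("item", txtgrid.sub$object)
    
    # One group id per row: id i is repeated once for every row of item i
    splits <- rep(seq_along(starts),
                  times = diff(c(starts, nrow(txtgrid.sub) + 1)))
    
    df.list <- split(txtgrid.sub, splits)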
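
To get pieces that can be addressed by tier name (textgrid$phones and so on), one possible variation is to build the per-row group id with cumsum() and then rename the list from each piece's "name" row. This is only a sketch on a made-up toy data frame, not on a real TextGrid; the names toy, is.start, tier.id and tiers are hypothetical, and it assumes that every item has exactly one row whose object is "name".

```r
# Toy stand-in for txtgrid.sub (made-up data, two tiers)
toy <- data.frame(
  object = c("item [1]:", "class", "name", "xmin", "xmax",
             "item [2]:", "class", "name", "xmin", "xmax"),
  value  = c("", "IntervalTier", "phones", "0", "243.761",
             "", "IntervalTier", "syllable", "0", "243.761"),
  stringsAsFactors = FALSE
)

# TRUE on each row that opens a numbered item; "item []:" does not match
is.start <- grepl("^item \\[[0-9]+\\]", toy$object)

# Running count of item starts: one group id per row, same length as the data
tier.id <- cumsum(is.start)

tiers <- split(toy, tier.id)

# Rename each piece after the value found on its "name" row
names(tiers) <- vapply(
  tiers,
  function(d) as.character(d$value[d$object == "name"][1]),
  character(1)
)

names(tiers)
tiers$phones
```

On the full txtgrid, the rows before the first numbered item would end up in a group with id 0, so they would need to be dropped first (as txtgrid.sub does above).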
    

    EDIT:

    You could then simplify the data by doing something like this:

    l <- lapply(df.list, function(x) {
      tmp <- as.data.frame(t(x[, 3, drop=FALSE]), stringsAsFactors=FALSE)
      names(tmp) <- make.unique(make.names(x[, 2]))
      tmp
    })
    
    library(plyr)
    do.call(rbind.fill, l)
    
    
      item..1..        class     name xmin    xmax intervals..size
    1      <NA> IntervalTier   phones    0 243.761            2505
    2      <NA> IntervalTier syllable    0 243.761            2505
      intervals..1.. xmin.1             xmax.1 text intervals..2..
    1           <NA>      0 0.4274939687384032    _           <NA>
    2           <NA>      0 0.4274939687384032    _           <NA>
                  xmin.2 xmax.2
    1 0.4274939687384032  0.472
    2               <NA>   <NA>
    

    NB: I've used dummy data for the above.
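
If the goal is to compare xmin and xmax across tiers, a long layout with one row per interval may be easier to work with than the wide one above. The following is a rough sketch on made-up data; the names tier, int.id, body, field and intervals are hypothetical. It assumes an IntervalTier in which every "intervals [k]:" row is followed by exactly one xmin, one xmax and one text row; point tiers use different keys and are not handled.

```r
# Toy tier in the object/value layout (made-up data)
tier <- data.frame(
  object = c("item [1]:", "class", "name", "xmin", "xmax", "intervals: size",
             "intervals [1]:", "xmin", "xmax", "text",
             "intervals [2]:", "xmin", "xmax", "text"),
  value  = c("", "IntervalTier", "phones", "0", "243.761", "2",
             "", "0", "0.427", "_",
             "", "0.427", "0.472", "v"),
  stringsAsFactors = FALSE
)

# Running interval id: 0 on the tier header rows, then 1, 2, ...
# ("intervals: size" does not match the pattern, so it stays in the header)
int.id <- cumsum(grepl("^intervals \\[[0-9]+\\]", tier$object))

# Keep only the rows that belong to an interval
body <- tier[int.id > 0, ]

# Values of one key, in interval order
field <- function(key) body$value[body$object == key]

intervals <- data.frame(
  interval = unique(int.id[int.id > 0]),
  xmin     = as.numeric(field("xmin")),
  xmax     = as.numeric(field("xmax")),
  text     = field("text"),
  stringsAsFactors = FALSE
)
intervals
```

Applied to each element of df.list with lapply(), this would give one such table per tier, which could then be matched on xmin and xmax.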