
Concatenating positions into genomic segments


I would like to concatenate all adjacent rows that have a similarity score above 0.955. The Abo and Bel columns represent the similarity score with the row above and the row below, respectively. In the following input df I have 10 genomic probes (NAME column), which are concatenated into just 4 genomic segments (dfout).

df <- " NAME Abo  Bel Chr GD Position
 BovineHD0100009217 NA 1.0000000   1  0  31691781
 BovineHD0100009218 1.0000000 0.6185430   1  0  31695808
 BovineHD0100019600 0.6185430 0.9973510   1  0  69211537
 BovineHD0100019601 0.9973510 1.0000000   1  0  69213650
 BovineHD0100019602 1.0000000 1.0000000   1  0  69214650
 BovineHD0100019603 1.0000000 0.6600000   1  0  69217942
 BovineHD0100047112 0.6600000 1.0000000   1  0  93797691
 BovineHD0100026604 1.0000000 1.0000000   1  0  93815774
 BovineHD0100026605 1.0000000 0.4649007   1  0  93819471
 BovineHD0100029861 0.4649007 NA   1  0 105042452"
df <- read.table(text=df, header=T)

My expected output dfout:

dfout <- "Chr start end startp endp nprob 
           1  31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
           1  69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
           1  93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
           1  105042452 105042452 BovineHD0100029861 BovineHD0100029861 1"
dfout <- read.table(text=dfout, header=T)

Any ideas?


Solution

  • I couldn't think of any pretty solution using basic dataframe manipulation, so here's a bad-looking one that works:

    First, add stringsAsFactors = FALSE when creating df, so that NAME is read as character rather than as a factor:

    df <- read.table(text=df, header=T, stringsAsFactors = FALSE)
    
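    start <- df$Position[1]   # position where the current segment begins
    end <- integer()
    output <- NULL
    count <- 1                # number of probes in the current segment
    # seq_len() avoids the 1:0 trap when df has a single row
    for (i in seq_len(nrow(df) - 1)) {
      # Bel below the threshold: row i closes the current segment
      if (df$Bel[i] < 0.955) {
        end <- df$Position[i]
        # Probe names are looked up by position (assumes positions are unique)
        output <- rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count))
        start <- df$Position[i+1]
        count <- 0
      }
      count <- count + 1
    }
    # The last segment is never closed inside the loop, so add it here
    end <- df$Position[nrow(df)]
    # Note: c() coerces every value to character, so all columns end up as character
    output <- as.data.frame(rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count)))
    colnames(output) <- c("start", "end", "startp", "endp", "nprob")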
    

    The basic idea is to loop through the rows and check whether the next row should be added to the current segment (Bel >= 0.955) or whether a new segment should start (Bel < 0.955). When a new segment has to be started, the end row of the current one is set, the finished segment is appended to the output, and the start of the next segment is defined. A counter tracks the number of rows used to build each segment (nprob).

    Finally, the last segment is added outside the for loop, the output is converted to a data frame, and its columns are named. I did not use Chr because (1) all its values are equal here, and (2) if they were not, the question gives no rule for choosing or summarizing them.

    Result:

    > output
          start       end             startp               endp nprob
    1  31691781  31695808 BovineHD0100009217 BovineHD0100009218     2
    2  69211537  69217942 BovineHD0100019600 BovineHD0100019603     4
    3  93797691  93819471 BovineHD0100047112 BovineHD0100026605     3
    4 105042452 105042452 BovineHD0100029861 BovineHD0100029861     1
    

    I'm pretty sure that you or someone else can work on this to make it shorter and more concise.
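
As one possible shorter version of the loop above, here is a minimal vectorized sketch in base R. It is not the answerer's method: it swaps the loop for a cumsum() over "segment start" flags. It assumes the rows are already sorted by Chr and Position, and it additionally assumes that a change in Chr should always start a new segment (the question does not say this). The names txt, threshold, is_start and seg are made up for illustration.

```r
# Input from the question; stringsAsFactors = FALSE keeps NAME as character
txt <- " NAME Abo Bel Chr GD Position
 BovineHD0100009217 NA 1.0000000 1 0 31691781
 BovineHD0100009218 1.0000000 0.6185430 1 0 31695808
 BovineHD0100019600 0.6185430 0.9973510 1 0 69211537
 BovineHD0100019601 0.9973510 1.0000000 1 0 69213650
 BovineHD0100019602 1.0000000 1.0000000 1 0 69214650
 BovineHD0100019603 1.0000000 0.6600000 1 0 69217942
 BovineHD0100047112 0.6600000 1.0000000 1 0 93797691
 BovineHD0100026604 1.0000000 1.0000000 1 0 93815774
 BovineHD0100026605 1.0000000 0.4649007 1 0 93819471
 BovineHD0100029861 0.4649007 NA 1 0 105042452"
df <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)

threshold <- 0.955  # cutoff taken from the question

# A row opens a new segment when its similarity to the row above (Abo)
# is missing or below the threshold, or (assumption) when Chr changes.
n <- nrow(df)
new_chr  <- c(TRUE, df$Chr[-1] != df$Chr[-n])
is_start <- is.na(df$Abo) | df$Abo < threshold | new_chr

# The running count of segment starts gives every row a segment id
seg <- cumsum(is_start)

first <- !duplicated(seg)                   # first row of each segment
last  <- !duplicated(seg, fromLast = TRUE)  # last row of each segment

dfout <- data.frame(
  Chr    = df$Chr[first],
  start  = df$Position[first],
  end    = df$Position[last],
  startp = df$NAME[first],
  endp   = df$NAME[last],
  nprob  = tabulate(seg),                   # rows per segment id
  stringsAsFactors = FALSE
)
dfout
```

On the example data this yields the same four segments as the loop, with Chr included and with start, end and nprob kept numeric instead of being coerced to character. Because names are taken from the first and last row of each segment, it also does not depend on positions being unique.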