r, bash, bioinformatics, blast

Transpose data using an identifier


I have a text file (BLAST output) with a single column and about 40,000 rows, in the format shown below.

I would like to use R or the terminal to reshape it into multiple columns: the first column holds the query name, and each hit for that query goes in its own subsequent column.

Input is this:

Query1
result1
result2
result3

Query2
result1
result2
result3
result4
result5   

Query3
result1
result2
result3
result4

Expected output

Query1 result1 result2 result3 
Query2 result1 result2 result3 result4 result5
Query3 result1 result2 result3 result4

Solution

  • Consider reading the file line by line with readLines() and building a named list of character vectors, one per query. The loop below uses each section header (Query1, Query2, ...) as the name of the vector holding that section's hits:

    con <- file("/path/to/text/file.txt", open = "r")

    datalist <- list()
    query <- NULL
    qName <- NULL                                                 # NO SECTION SEEN YET
    while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
      line <- trimws(line)                                        # DROP STRAY WHITESPACE (e.g. "result5   ")

      if (grepl("Query", line)) {
        if (!is.null(qName)) {
          datalist <- c(datalist, setNames(list(query), qName))   # APPEND PREVIOUS NAMED VECTOR TO LIST
        }
        query <- character(0)                                     # RESET VECTOR
        qName <- line                                             # CAPTURE QUERY NAME
      }
      else if (nzchar(line)) {
        query <- c(query, line)                                   # APPEND LINE TO VECTOR
      }                                                           # BLANK LINES ARE SKIPPED
    }

    if (!is.null(qName)) {
      datalist <- c(datalist, setNames(list(query), qName))       # REMAINING LAST SECTION
    }
    close(con)

    datalist
    
    # $Query1
    # [1] "result1" "result2" "result3"
    
    # $Query2
    # [1] "result1" "result2" "result3" "result4" "result5"
    
    # $Query3
    # [1] "result1" "result2" "result3" "result4"
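
The list above is not yet in the one-row-per-query layout the question asks for. A minimal sketch of that last step follows, assuming `datalist` has the shape printed above (it is rebuilt here from the question's sample so the block runs on its own). The names `wide_lines`, `wide_df`, and the `hit1`, `hit2`, ... column labels are made up for illustration, not part of the original answer.

```r
# Example list in the same shape as the output above (values from the question's sample).
datalist <- list(
  Query1 = c("result1", "result2", "result3"),
  Query2 = c("result1", "result2", "result3", "result4", "result5"),
  Query3 = c("result1", "result2", "result3", "result4")
)

# Option 1: one space-separated line per query, name first, then its hits.
wide_lines <- mapply(
  function(q, hits) paste(c(q, hits), collapse = " "),
  names(datalist), datalist,
  USE.NAMES = FALSE
)
writeLines(wide_lines)   # pass a file path as the second argument to save instead

# Option 2: a rectangular table, padding shorter rows with NA.
max_hits <- max(lengths(datalist))
padded <- do.call(rbind, lapply(datalist, function(x) {
  length(x) <- max_hits  # extending a vector fills the new slots with NA
  x
}))
colnames(padded) <- paste0("hit", seq_len(max_hits))   # hypothetical column names
wide_df <- data.frame(query = names(datalist), padded,
                      stringsAsFactors = FALSE, row.names = NULL)
```

Option 1 matches the expected output in the question directly. Option 2 is likely more convenient if the result stays in R, since a data frame needs every row to have the same number of columns.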