I have a text file (BLAST output) with a single column and about 40,000 rows, which looks like the sample below.
Essentially, I would like to use R or the terminal to convert this to multiple columns, with the first column containing the query name and the remaining columns containing that query's hits, one hit per column.
Input:
Query1
result1
result2
result3
Query2
result1
result2
result3
result4
result5
Query3
result1
result2
result3
result4
Expected output:
Query1 result1 result2 result3
Query2 result1 result2 result3 result4 result5
Query3 result1 result2 result3 result4
Consider using readLines() to read the text file line by line, building a list of character vectors. The code below also uses each section header (i.e. Query1, Query2) as the name of the corresponding character vector:
con <- file("/path/to/text/file.txt", open="r")
datalist <- list()
query <- character(0)
qName <- NULL                                   # NO SECTION SEEN YET
while (length(line <- readLines(con, n=1, warn = FALSE)) > 0) {
  if (grepl("^Query", line)) {                  # HEADER LINE STARTS A NEW SECTION
    if (!is.null(qName)) {
      datalist <- c(datalist, setNames(list(query), qName))   # STORE PREVIOUS SECTION
    }
    query <- character(0)                       # RESET VECTOR
    qName <- line                               # CAPTURE QUERY NAME
  }
  else if (grepl("[A-Za-z]", line)) {           # BLANK LINES ARE SKIPPED
    query <- c(query, line)                     # APPEND LINE TO VECTOR
  }
}
if (!is.null(qName)) {
  datalist <- c(datalist, setNames(list(query), qName))       # REMAINING LAST SECTION
}
close(con)
datalist
# $Query1
# [1] "result1" "result2" "result3"
# $Query2
# [1] "result1" "result2" "result3" "result4" "result5"
# $Query3
# [1] "result1" "result2" "result3" "result4"
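To get from this named list to the one-row-per-query layout asked for in the question, one option is to paste each name together with its hits. This is only a minimal sketch: it assumes `datalist` has the structure printed above (rebuilt by hand here so the snippet runs on its own), the space separator is taken from the expected output, and the output path in the commented-out line is hypothetical.

```r
# Stand-in for the list built above (same structure as the printed output)
datalist <- list(
  Query1 = c("result1", "result2", "result3"),
  Query2 = c("result1", "result2", "result3", "result4", "result5"),
  Query3 = c("result1", "result2", "result3", "result4")
)

# One string per query: the query name followed by its hits, space-separated
out_lines <- mapply(
  function(name, hits) paste(c(name, hits), collapse = " "),
  names(datalist), datalist,
  USE.NAMES = FALSE
)
out_lines

# Write one query per line to a file (hypothetical path)
# writeLines(out_lines, "/path/to/output.txt")
```

Because queries have different numbers of hits, the result is kept as lines of text rather than forced into a rectangular data frame. Changing `collapse = " "` to `collapse = "\t"` would give a tab-separated file instead.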