Hello,
I am working on setting up data plan for my lab who is going to be starting a blinded clinical trial starting in January. Part of this task is to set up some data processing pipelines so that once the data has all been collected we can run the code quickly.
One outcome measure that we are utilizing is a behavior test. Someone has developed a javascript program that scores the test automatically; however, the output mirrors 5 tables stacked on top of eachother. With the help of some stackoverflow users I was able to develop a pipeline that restructures a single txt file to a dataframe that could then be analyzed. Where I am now getting stuck is how to process all the files at the same time.
My thought was to load all the files into a list and then manipulate each element in the list with either map.list or lapply. However, I am getting two issues that I will outline below.
First, this is the code and data that works well for manipulating a single dataframe.
input <- c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score",
"1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51",
"3. Word Fluency\t\t\t1\t56*", "4. Recognition Memory\t\t40\t59",
"5. Gesture Object Use\t\t2\t68", "6. Arithmetic\t\t\t5\t49",
"Cognitive TOTAL\t\t\t65", "", "Language Battery", "Part 1: Language Comprehension",
"Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*",
"9. Spoken Sentences\t\t25\t53*", "11. Spoken Paragraphs\t\t4\t60",
"Spoken Language TOTAL\t\t46\t49*", "", "Written Language\t\tScore\tT-Score",
"8. Written Words\t\t14\t45*", "10. Written Sentences\t\t21\t48*",
"Written Language TOTAL\t\t35\t46*", "", "Part 2: Expressive Language",
"Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*",
"14. Nonwords\t\t\t10\t58", "15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63",
"Repetition TOTAL\t\t62\t57*", "", "Spoken Language\t\t\tScore\tT-Score",
"17. Naming Objects\t\t30\t55*", "18. Naming Actions\t\t36\t63",
"3. Word Fluency\t\t\t12\t56*", "Naming TOTAL\t\t\t56\t57*",
"", "Spoken Picture Description\tScore\tT-Score", "19. Spoken Picture Description\t\t",
"", "Reading Aloud\t\t\tScore\tT-Score", "20. Words\t\t\t25\t50*",
"21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62",
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "",
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52",
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68",
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score",
"27. Written Picture Description\t\t")
input <- read_lines('Example_data')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
mutate(row = row_number()) %>%
fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
mutate(row = row_number())
scores3 <- mutate(scores, row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]
# merge the header with the scores to give each section
data <- left_join(scores,
header,
by = 'row'
)
#create correct header in new dataframe
data2 <- data.frame(domain = as.vector(str_replace(data$title, "Subtest/Section", "cognition")),
subtest = data$V3,
score = data$V4,
t.score = data$V5)
head(data2)
# library(rlist)
# setwd("C:/Users/Brahma/Desktop/CAT TEXT FILES/Data")
# temp = list.files(pattern = "*Example")
# myfiles = lapply(temp, readLines)
myfiles <- list(c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score",
"1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51",
"3. Word Fluency\t\t\t1\t56*", "4. Recognition Memory\t\t40\t59",
"5. Gesture Object Use\t\t2\t68", "6. Arithmetic\t\t\t5\t49",
"Cognitive TOTAL\t\t\t65", "", "Language Battery", "Part 1: Language Comprehension",
"Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*",
"9. Spoken Sentences\t\t25\t53*", "11. Spoken Paragraphs\t\t4\t60",
"Spoken Language TOTAL\t\t46\t49*", "", "Written Language\t\tScore\tT-Score",
"8. Written Words\t\t14\t45*", "10. Written Sentences\t\t21\t48*",
"Written Language TOTAL\t\t35\t46*", "", "Part 2: Expressive Language",
"Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*",
"14. Nonwords\t\t\t10\t58", "15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63",
"Repetition TOTAL\t\t62\t57*", "", "Spoken Language\t\t\tScore\tT-Score",
"17. Naming Objects\t\t30\t55*", "18. Naming Actions\t\t36\t63",
"3. Word Fluency\t\t\t12\t56*", "Naming TOTAL\t\t\t56\t57*",
"", "Spoken Picture Description\tScore\tT-Score", "19. Spoken Picture Description\t\t",
"", "Reading Aloud\t\t\tScore\tT-Score", "20. Words\t\t\t25\t50*",
"21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62",
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "",
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52",
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68",
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score",
"27. Written Picture Description\t\t"), c("Cognitive Screen",
"Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53",
"2. Semantic Memory\t\t8\t51", "3. Word Fluency\t\t\t1\t56*",
"4. Recognition Memory\t\t40\t59", "5. Gesture Object Use\t\t2\t68",
"6. Arithmetic\t\t\t5\t49", "Cognitive TOTAL\t\t\t65", "", "Language Battery",
"Part 1: Language Comprehension", "Spoken Language\t\t\tScore\tT-Score",
"7. Spoken Words\t\t\t17\t45*", "9. Spoken Sentences\t\t25\t53*",
"11. Spoken Paragraphs\t\t4\t60", "Spoken Language TOTAL\t\t46\t49*",
"", "Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*",
"10. Written Sentences\t\t21\t48*", "Written Language TOTAL\t\t35\t46*",
"", "Part 2: Expressive Language", "Repetition\t\t\tScore\tT-Score",
"12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", "14. Nonwords\t\t\t10\t58",
"15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", "Repetition TOTAL\t\t62\t57*",
"", "Spoken Language\t\t\tScore\tT-Score", "17. Naming Objects\t\t30\t55*",
"18. Naming Actions\t\t36\t63", "3. Word Fluency\t\t\t12\t56*",
"Naming TOTAL\t\t\t56\t57*", "", "Spoken Picture Description\tScore\tT-Score",
"19. Spoken Picture Description\t\t", "", "Reading Aloud\t\t\tScore\tT-Score",
"20. Words\t\t\t25\t50*", "21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62",
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "",
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52",
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68",
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score",
"27. Written Picture Description\t\t"), c("Cognitive Screen",
"Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53",
"2. Semantic Memory\t\t8\t51", "3. Word Fluency\t\t\t1\t56*",
"4. Recognition Memory\t\t40\t59", "5. Gesture Object Use\t\t2\t68",
"6. Arithmetic\t\t\t5\t49", "Cognitive TOTAL\t\t\t65", "", "Language Battery",
"Part 1: Language Comprehension", "Spoken Language\t\t\tScore\tT-Score",
"7. Spoken Words\t\t\t17\t45*", "9. Spoken Sentences\t\t25\t53*",
"11. Spoken Paragraphs\t\t4\t60", "Spoken Language TOTAL\t\t46\t49*",
"", "Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*",
"10. Written Sentences\t\t21\t48*", "Written Language TOTAL\t\t35\t46*",
"", "Part 2: Expressive Language", "Repetition\t\t\tScore\tT-Score",
"12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", "14. Nonwords\t\t\t10\t58",
"15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", "Repetition TOTAL\t\t62\t57*",
"", "Spoken Language\t\t\tScore\tT-Score", "17. Naming Objects\t\t30\t55*",
"18. Naming Actions\t\t36\t63", "3. Word Fluency\t\t\t12\t56*",
"Naming TOTAL\t\t\t56\t57*", "", "Spoken Picture Description\tScore\tT-Score",
"19. Spoken Picture Description\t\t", "", "Reading Aloud\t\t\tScore\tT-Score",
"20. Words\t\t\t25\t50*", "21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62",
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "",
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52",
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68",
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score",
"27. Written Picture Description\t\t"))
I have tried to use lapply and list.map in the rlist package. First, lapply does not seem to like pipe functions so I am attempting to work in steps. I also tried creating a function for this step.
list_header <- lapply(myfiles, as.tibble)
list_header2 <- lapply(list_header, str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
"Error in match.fun(FUN) : 'str_match(list_header, "^(.?)\s+Score.")[, 2, drop = FALSE]' is not a function, character or symbol In addition: Warning message: In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing"
drop_rows <- function(df) {
new_df <- str_match_all(df[[1:3]]$value, "^(.*?)\\s+Score.*")
}
list_header2 <- lapply(list_header, drop_rows)
"Error in match.fun(FUN) : 'str_match(list_header, "^(.?)\s+Score.")[, 2, drop = FALSE]' is not a function, character or symbol In addition: Warning message: In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing"
The code provided works well for when a single txt file is loaded. However, when I attempt to run code to batch process multiple lists I run into trouble. Should anyone be able to provide some insight as to how to fix this error ** i think ** I will be able to finish the rest. However, if you feel so inclined to aid in implementing the remainder of the code I won't argue with that.
Rather than trying to debug your code I decided to try to find a solution that works with your example data. The following seems to work with individual vectors and lists of vectors:
library(tidyverse)
text_to_tibb <- function(char_vec){
str_split(char_vec, "\t") %>%
map_dfr(~ .[nchar(.) > 0] %>% matrix(., nrow = T) %>%
as_tibble
) %>%
filter(!is.na(V2), !str_detect(V1, "TOTAL")) %>%
mutate(title = str_detect(V1, "^\\d+\\.", negate = T),
group = cumsum(title)
) %>%
group_by(group) %>%
mutate(domain = first(V1)) %>%
filter(!title) %>%
ungroup() %>%
select(domain, V1, V2, V3, -title, -group) %>%
mutate(V1 = str_remove(V1, "^\\d+\\. "),
domain = str_replace(domain, "Subtest.*", "Cognition")) %>%
rename(subtest = V1, score = V2, t_score = V3)
}
If you run it on your input
variable you should get a clean tibble:
text_to_tibb(input)
#### OUTPUT ####
# A tibble: 26 x 4
domain subtest score t_score
<chr> <chr> <chr> <chr>
1 Cognition Line Bisection 9 53
2 Cognition Semantic Memory 8 51
3 Cognition Word Fluency 1 56*
4 Cognition Recognition Memory 40 59
5 Cognition Gesture Object Use 2 68
6 Cognition Arithmetic 5 49
7 Spoken Language Spoken Words 17 45*
8 Spoken Language Spoken Sentences 25 53*
9 Spoken Language Spoken Paragraphs 4 60
10 Written Language Written Words 14 45*
# … with 16 more rows
It also works on the list of vectors you included above. Just use lapply
or purrr::map
:
map(myfiles, text_to_tibb)
If you think there might be some inconsistencies in some table you might want to give safely
a try:
safe_text_to_tibb <- safely(text_to_tibb)
map(myfiles, safe_text_to_tibb)