Search code examples
rdplyrtidydata-munging

Manipulating data from .txt files within in R


Introduction to the problem

Hello,

I am working on setting up data plan for my lab who is going to be starting a blinded clinical trial starting in January. Part of this task is to set up some data processing pipelines so that once the data has all been collected we can run the code quickly.

One outcome measure that we are utilizing is a behavior test. Someone has developed a javascript program that scores the test automatically; however, the output mirrors 5 tables stacked on top of eachother. With the help of some stackoverflow users I was able to develop a pipeline that restructures a single txt file to a dataframe that could then be analyzed. Where I am now getting stuck is how to process all the files at the same time.

My thought was to load all the files into a list and then manipulate each element in the list with either map.list or lapply. However, I am getting two issues that I will outline below.

First, this is the code and data that works well for manipulating a single dataframe.

input <- c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score", 
"1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51", 
"3. Word Fluency\t\t\t1\t56*", "4. Recognition Memory\t\t40\t59", 
"5. Gesture Object Use\t\t2\t68", "6. Arithmetic\t\t\t5\t49", 
"Cognitive TOTAL\t\t\t65", "", "Language Battery", "Part 1: Language Comprehension", 
"Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*", 
"9. Spoken Sentences\t\t25\t53*", "11. Spoken Paragraphs\t\t4\t60", 
"Spoken Language TOTAL\t\t46\t49*", "", "Written Language\t\tScore\tT-Score", 
"8. Written Words\t\t14\t45*", "10. Written Sentences\t\t21\t48*", 
"Written Language TOTAL\t\t35\t46*", "", "Part 2: Expressive Language", 
"Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", 
"14. Nonwords\t\t\t10\t58", "15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", 
"Repetition TOTAL\t\t62\t57*", "", "Spoken Language\t\t\tScore\tT-Score", 
"17. Naming Objects\t\t30\t55*", "18. Naming Actions\t\t36\t63", 
"3. Word Fluency\t\t\t12\t56*", "Naming TOTAL\t\t\t56\t57*", 
"", "Spoken Picture Description\tScore\tT-Score", "19. Spoken Picture Description\t\t", 
"", "Reading Aloud\t\t\tScore\tT-Score", "20. Words\t\t\t25\t50*", 
"21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62", 
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "", 
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52", 
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68", 
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score", 
"27. Written Picture Description\t\t")  

After creating the input file here is the code I utilize to create a dataframe (I know that the dataframe is in characters - will fix that later)

input <- read_lines('Example_data')

# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'

# add index to the list so we can match the scores that come after
header <- header %>%
  mutate(row = row_number()) %>%
  fill(title)  # copy title down

# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
  mutate(row = row_number())
scores3 <- mutate(scores, row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]

# merge the header with the scores to give each section
data <- left_join(scores,
                   header,
                   by = 'row'
)

#create correct header in new dataframe
data2 <- data.frame(domain = as.vector(str_replace(data$title, "Subtest/Section", "cognition")),
                                   subtest = data$V3,
                                   score = data$V4,
                                   t.score = data$V5)

head(data2) 

Okay so now for multiple data files. My plan is to have all the txt files in a single folder then to make a list including all the files like so:

# library(rlist)
# setwd("C:/Users/Brahma/Desktop/CAT TEXT FILES/Data")
# temp = list.files(pattern = "*Example")
# myfiles = lapply(temp, readLines)

Reproducible example file:

myfiles <- list(c("Cognitive Screen", "Subtest/Section\t\t\tScore\tT-Score", 
"1. Line Bisection\t\t9\t53", "2. Semantic Memory\t\t8\t51", 
"3. Word Fluency\t\t\t1\t56*", "4. Recognition Memory\t\t40\t59", 
"5. Gesture Object Use\t\t2\t68", "6. Arithmetic\t\t\t5\t49", 
"Cognitive TOTAL\t\t\t65", "", "Language Battery", "Part 1: Language Comprehension", 
"Spoken Language\t\t\tScore\tT-Score", "7. Spoken Words\t\t\t17\t45*", 
"9. Spoken Sentences\t\t25\t53*", "11. Spoken Paragraphs\t\t4\t60", 
"Spoken Language TOTAL\t\t46\t49*", "", "Written Language\t\tScore\tT-Score", 
"8. Written Words\t\t14\t45*", "10. Written Sentences\t\t21\t48*", 
"Written Language TOTAL\t\t35\t46*", "", "Part 2: Expressive Language", 
"Repetition\t\t\tScore\tT-Score", "12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", 
"14. Nonwords\t\t\t10\t58", "15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", 
"Repetition TOTAL\t\t62\t57*", "", "Spoken Language\t\t\tScore\tT-Score", 
"17. Naming Objects\t\t30\t55*", "18. Naming Actions\t\t36\t63", 
"3. Word Fluency\t\t\t12\t56*", "Naming TOTAL\t\t\t56\t57*", 
"", "Spoken Picture Description\tScore\tT-Score", "19. Spoken Picture Description\t\t", 
"", "Reading Aloud\t\t\tScore\tT-Score", "20. Words\t\t\t25\t50*", 
"21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62", 
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "", 
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52", 
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68", 
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score", 
"27. Written Picture Description\t\t"), c("Cognitive Screen", 
"Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53", 
"2. Semantic Memory\t\t8\t51", "3. Word Fluency\t\t\t1\t56*", 
"4. Recognition Memory\t\t40\t59", "5. Gesture Object Use\t\t2\t68", 
"6. Arithmetic\t\t\t5\t49", "Cognitive TOTAL\t\t\t65", "", "Language Battery", 
"Part 1: Language Comprehension", "Spoken Language\t\t\tScore\tT-Score", 
"7. Spoken Words\t\t\t17\t45*", "9. Spoken Sentences\t\t25\t53*", 
"11. Spoken Paragraphs\t\t4\t60", "Spoken Language TOTAL\t\t46\t49*", 
"", "Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*", 
"10. Written Sentences\t\t21\t48*", "Written Language TOTAL\t\t35\t46*", 
"", "Part 2: Expressive Language", "Repetition\t\t\tScore\tT-Score", 
"12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", "14. Nonwords\t\t\t10\t58", 
"15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", "Repetition TOTAL\t\t62\t57*", 
"", "Spoken Language\t\t\tScore\tT-Score", "17. Naming Objects\t\t30\t55*", 
"18. Naming Actions\t\t36\t63", "3. Word Fluency\t\t\t12\t56*", 
"Naming TOTAL\t\t\t56\t57*", "", "Spoken Picture Description\tScore\tT-Score", 
"19. Spoken Picture Description\t\t", "", "Reading Aloud\t\t\tScore\tT-Score", 
"20. Words\t\t\t25\t50*", "21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62", 
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "", 
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52", 
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68", 
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score", 
"27. Written Picture Description\t\t"), c("Cognitive Screen", 
"Subtest/Section\t\t\tScore\tT-Score", "1. Line Bisection\t\t9\t53", 
"2. Semantic Memory\t\t8\t51", "3. Word Fluency\t\t\t1\t56*", 
"4. Recognition Memory\t\t40\t59", "5. Gesture Object Use\t\t2\t68", 
"6. Arithmetic\t\t\t5\t49", "Cognitive TOTAL\t\t\t65", "", "Language Battery", 
"Part 1: Language Comprehension", "Spoken Language\t\t\tScore\tT-Score", 
"7. Spoken Words\t\t\t17\t45*", "9. Spoken Sentences\t\t25\t53*", 
"11. Spoken Paragraphs\t\t4\t60", "Spoken Language TOTAL\t\t46\t49*", 
"", "Written Language\t\tScore\tT-Score", "8. Written Words\t\t14\t45*", 
"10. Written Sentences\t\t21\t48*", "Written Language TOTAL\t\t35\t46*", 
"", "Part 2: Expressive Language", "Repetition\t\t\tScore\tT-Score", 
"12. Words\t\t\t24\t55*", "13. Complex Words\t\t8\t52*", "14. Nonwords\t\t\t10\t58", 
"15. Digit Strings\t\t8\t55", "16. Sentences\t\t\t12\t63", "Repetition TOTAL\t\t62\t57*", 
"", "Spoken Language\t\t\tScore\tT-Score", "17. Naming Objects\t\t30\t55*", 
"18. Naming Actions\t\t36\t63", "3. Word Fluency\t\t\t12\t56*", 
"Naming TOTAL\t\t\t56\t57*", "", "Spoken Picture Description\tScore\tT-Score", 
"19. Spoken Picture Description\t\t", "", "Reading Aloud\t\t\tScore\tT-Score", 
"20. Words\t\t\t25\t50*", "21. Complex Words\t\t8\t51*", "22. Function Words\t\t3\t62", 
"23. Nonwords\t\t\t6\t51*", "Reading TOTAL\t\t\t42\t50*", "", 
"Writing\t\t\t\tScore\tT-Score", "24. Writing: Copying\t\t26\t52", 
"25. Writing Picture Names\t14\t53*", "26. Writing to Dictation\t28\t68", 
"Writing TOTAL\t\t\t68\t58*", "", "Written Picture Description\tScore\tT-Score", 
"27. Written Picture Description\t\t")) 

Here is where the trouble starts

I have tried to use lapply and list.map in the rlist package. First, lapply does not seem to like pipe functions so I am attempting to work in steps. I also tried creating a function for this step.

creating a tibble. this works!

list_header <- lapply(myfiles, as.tibble)

Errors to come - attempting to start data manipulation

list_header2 <- lapply(list_header, str_match(list_header, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])

This line of code provides the following error:

"Error in match.fun(FUN) : 'str_match(list_header, "^(.?)\s+Score.")[, 2, drop = FALSE]' is not a function, character or symbol In addition: Warning message: In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing"

So I tried making a function to put here:

drop_rows <- function(df) {
  new_df <- str_match_all(df[[1:3]]$value, "^(.*?)\\s+Score.*")
}

list_header2 <- lapply(list_header, drop_rows)

Now I get this error:

"Error in match.fun(FUN) : 'str_match(list_header, "^(.?)\s+Score.")[, 2, drop = FALSE]' is not a function, character or symbol In addition: Warning message: In stri_match_first_regex(string, pattern, opts_regex = opts(pattern)) : argument is not an atomic vector; coercing"

Summary:

The code provided works well for when a single txt file is loaded. However, when I attempt to run code to batch process multiple lists I run into trouble. Should anyone be able to provide some insight as to how to fix this error ** i think ** I will be able to finish the rest. However, if you feel so inclined to aid in implementing the remainder of the code I won't argue with that.


Solution

  • Rather than trying to debug your code I decided to try to find a solution that works with your example data. The following seems to work with individual vectors and lists of vectors:

    library(tidyverse)
    
    text_to_tibb <- function(char_vec){
        str_split(char_vec, "\t") %>% 
            map_dfr(~ .[nchar(.) > 0] %>% matrix(., nrow = T) %>%
                        as_tibble
                    ) %>% 
            filter(!is.na(V2), !str_detect(V1, "TOTAL")) %>%
            mutate(title = str_detect(V1, "^\\d+\\.", negate = T),
                   group = cumsum(title)
                   ) %>% 
            group_by(group) %>%
            mutate(domain = first(V1)) %>% 
            filter(!title) %>% 
            ungroup() %>% 
            select(domain, V1, V2, V3, -title, -group) %>% 
            mutate(V1 = str_remove(V1, "^\\d+\\. "),
                   domain = str_replace(domain, "Subtest.*", "Cognition")) %>% 
            rename(subtest = V1, score = V2, t_score = V3)
    }
    

    If you run it on your input variable you should get a clean tibble:

    text_to_tibb(input)
    
    #### OUTPUT ####
    # A tibble: 26 x 4
       domain           subtest            score t_score
       <chr>            <chr>              <chr> <chr>  
     1 Cognition        Line Bisection     9     53     
     2 Cognition        Semantic Memory    8     51     
     3 Cognition        Word Fluency       1     56*    
     4 Cognition        Recognition Memory 40    59     
     5 Cognition        Gesture Object Use 2     68     
     6 Cognition        Arithmetic         5     49     
     7 Spoken Language  Spoken Words       17    45*    
     8 Spoken Language  Spoken Sentences   25    53*    
     9 Spoken Language  Spoken Paragraphs  4     60     
    10 Written Language Written Words      14    45*    
    # … with 16 more rows
    

    It also works on the list of vectors you included above. Just use lapply or purrr::map:

    map(myfiles, text_to_tibb)
    

    If you think there might be some inconsistencies in some table you might want to give safely a try:

    safe_text_to_tibb <- safely(text_to_tibb)
    
    map(myfiles, safe_text_to_tibb)