
How to Create a Tibble from many .txt Files, Preserve File Names in a Column, and Use File Names to Sort Files into Categories?


I have 584 .txt files that I would like to merge into one 584 x 4 tibble.

Important Background Info:

The files can be divided into three categories according to the labels embedded in the file names. Thus:

A_1_COD.txt, A_23_COD.txt, A_235_COD.txt, ..., A_457_COD.txt -> belong in Category A;

B_3_COD.txt, B_19_COD.txt, B_189_COD.txt, ..., B_355_COD.txt -> belong in Category B;

C_5_COD.txt, C_11_COD.txt, C_196_COD.txt, ..., C_513_COD.txt -> belong in Category C.

The file names shown in this section have been simplified for ease of comprehension. Examples of the real file names are: ENTITY_117_MOR.txt, INCREMENTAL_208_MOR.txt, and MODERATE_173_MOR.txt. The real categories are: ENTITY, INCREMENTAL, and MODERATE.
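As a rough illustration of the naming convention described above (CATEGORY_NUMBER_SUFFIX.txt), here is a minimal base R sketch that splits such names into their parts. The helper name `parse_name` and the exact pattern are assumptions made for this example; they are not part of the original code, and the pattern only covers names shaped like the three examples given.

```r
# Hypothetical helper: split "CATEGORY_NUMBER_SUFFIX.txt" into its parts.
# Assumes uppercase letters, then digits, then uppercase letters, then ".txt".
# Names that do not match the pattern are not handled in this sketch.
parse_name <- function(x) {
  m <- regmatches(x, regexec("^([A-Z]+)_([0-9]+)_([A-Z]+)\\.txt$", x))
  parts <- do.call(rbind, m)  # one row per name: full match + 3 groups
  data.frame(
    category = factor(parts[, 2]),
    number   = as.integer(parts[, 3]),
    suffix   = parts[, 4],
    stringsAsFactors = FALSE
  )
}

files <- c("ENTITY_117_MOR.txt", "INCREMENTAL_208_MOR.txt", "MODERATE_173_MOR.txt")
parsed <- parse_name(files)
print(parsed)
```

The category is simply everything before the first underscore, which is the piece the rest of this question is trying to get into its own column.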

What the resulting tibble structure should be like:

A tibble: 584 x 4

row   filename    category   text
      <?>         <fct>      <chr>
1     A_1_COD     A          "Lorem ipsu-
2     B_2_COD     B          "Lorem ipsu-
3     C_3_COD     C          "Lorem ipsu-
.     .           .          .
.     .           .          .
.     .           .          .
584   A_584_COD   A          "Lorem ipsu-

What I have managed to do so far: Thanks to @awaji98, I managed to get three of the four columns I intend to have by using the following code:

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <- 
  folder %>% 
  # get full path names for each text
  # (pattern is a regular expression, not a glob, so the dot is escaped)
  dir(pattern = "\\.txt$", 
      full.names = TRUE) %>% 
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>% 
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$"),
         category = as.factor(str_extract(doc_id, "^."))) %>% 
  select(filename, category, text) %>% 
  rowid_to_column(var = "row") 

# if you prefer a tibble output
dat %>% tibble()

The result can be seen in the image below:

The picture shows the resulting table with all the data except for category

Remaining problem to be solved: I need to get R to extract the categories embedded in the file names (i.e., ENTITY, INCREMENTAL, MODERATE) to fill the category column with the respective values.

@awaji98 suggested two possible paths. Here's the first one:

dat <- folder %>% 
  # get full path names for each text
  dir(pattern = "\\.txt$", 
      full.names = TRUE) %>% 
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>% 
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>% 
  tidyr::extract(filename, into = "category", regex = "^([A-Z]+)_", remove = FALSE) %>% 
  mutate(category = factor(category)) %>% 
  select(filename, category, text) %>% 
  rowid_to_column(var = "row") %>% 
  tibble()

This resulted in a category column filled with red NAs.

Here's the second one:

## Use tidyr::extract to create two new columns from doc_id
dat <- folder %>% 
  # get full path names for each text
  dir(pattern = "\\.txt$", 
      full.names = TRUE) %>% 
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>% 
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>% 
  tidyr::extract(doc_id, into = c("category", "filename"), regex = "^([A-Z]+)_(.*)\\.txt$") %>% 
  mutate(category = factor(category)) %>% 
  select(filename, category, text) %>% 
  rowid_to_column(var = "row") %>% 
  tibble()

The second attempt, as shown in the photo below, produced two columns filled with red NAs.

The image shows a tibble with two columns containing red NAs, which was not the expected output.

Final Solution

@awaji98 realized that the problem was with the regex. As it turned out, the file names had a leading whitespace character, so a pattern anchored with "^" directly before the capital letters never matched. The solution was to add a space to the front of each regex in the answer. Thus, the code that delivered the expected result was:

library(tidyverse)
library(readtext)

folder <- "path_to_folder_of_texts"

dat <- folder %>% 
  # get full path names for each text
  dir(pattern = "\\.txt$", 
      full.names = TRUE) %>% 
  # map readtext function to each path name into a dataframe
  map_df(., readtext) %>% 
  # add and change columns as desired
  mutate(filename = str_remove(doc_id, "\\.txt$")) %>% 
  # note the space after "^": it matches the leading space in the file names
  extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>% 
  mutate(category = factor(category)) %>% 
  select(filename, category, text) %>% 
  rowid_to_column(var = "row") %>% 
  tibble()
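
To make the whitespace issue easier to see in isolation, here is a small base R sketch that compares the patterns on made-up `doc_id` values with a leading space. The helper `first_group` and the sample values are assumptions for illustration only. The last two variants (allowing optional whitespace in the pattern, or trimming with `trimws()` first) are possible alternatives to hard-coding a single space, not what the original answer used.

```r
# Made-up doc_id values with a leading space, as described above.
doc_id <- c(" ENTITY_117_MOR.txt", " INCREMENTAL_208_MOR.txt", " MODERATE_173_MOR.txt")

# Hypothetical helper: return the first capture group, or NA when nothing matches.
first_group <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1), USE.NAMES = FALSE)
}

# Anchored directly at a capital letter: fails because of the leading space.
strict <- first_group("^([A-Z]+)_", doc_id)
# Expects exactly one leading space (the fix used in the code above).
spaced <- first_group("^ ([A-Z]+)_", doc_id)
# Allows any amount of leading whitespace, including none.
tolerant <- first_group("^[[:space:]]*([A-Z]+)_", doc_id)
# Strips surrounding whitespace first, then uses the original pattern.
trimmed <- first_group("^([A-Z]+)_", trimws(doc_id))

print(data.frame(strict, spaced, tolerant, trimmed))
```

The `strict` column comes out as all NA, which mirrors the red NAs in the earlier attempts, while the other three recover the category.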

The final result is shown in the following photo:

The picture shows the successful final result.

Kind regards,
Á_C


Solution

  • You can use a combination of some common tidyverse functions and the useful readtext() from the package with the same name:

    library(tidyverse)
    library(readtext)
    
    folder <- "path_to_folder_of_texts"
    
    dat <- 
      folder %>% 
      # get full path names for each text
      # (pattern is a regular expression, not a glob, so the dot is escaped)
      dir(pattern = "\\.txt$", 
          full.names = TRUE) %>% 
      # map readtext function to each path name into a dataframe
      map_df(., readtext) %>% 
      # add and change columns as desired
      mutate(filename = str_remove(doc_id, "\\.txt$"),
             category = as.factor(str_extract(doc_id, "^."))) %>% 
      select(filename, category, text) %>% 
      rowid_to_column(var = "row") 
    
    # if you prefer a tibble output
    dat %>% tibble()
    

    UPDATED:

    Perhaps one of the following will get what you need. The first example keeps the filename column with the category at the front of each value:

    folder %>% 
      # get full path names for each text
      dir(pattern = "\\.txt$", 
          full.names = TRUE) %>% 
      # map readtext function to each path name into a dataframe
      map_df(., readtext) %>% 
      # add and change columns as desired
      mutate(filename = str_remove(doc_id, "\\.txt$")) %>% 
      # the space after "^" matches the leading space in the file names
      extract(filename, into = "category", regex = "^ ([A-Z]+)_", remove = FALSE) %>% 
      mutate(category = factor(category)) %>% 
      select(filename, category, text) %>% 
      rowid_to_column(var = "row") %>% 
      tibble()

    The second one uses tidyr::extract to create two columns from the doc_id, so filename drops the category part:

      ## Use tidyr::extract to create two new columns from doc_id
      folder %>% 
        # get full path names for each text
        dir(pattern = "\\.txt$", 
            full.names = TRUE) %>% 
        # map readtext function to each path name into a dataframe
        map_df(., readtext) %>% 
        # add and change columns as desired
        mutate(filename = str_remove(doc_id, "\\.txt$")) %>% 
        # the space after "^" matches the leading space in the file names
        extract(doc_id, into = c("category", "filename"), regex = "^ ([A-Z]+)_(.*)\\.txt$") %>% 
        mutate(category = factor(category)) %>% 
        select(filename, category, text) %>% 
        rowid_to_column(var = "row") %>% 
        tibble()
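
For anyone who wants to try the overall idea without readtext or a real folder of texts, here is a self-contained base R sketch. It is not the code from the answer above: it swaps in `list.files()`, `readLines()` and `trimws()`, and it writes three made-up files (with a leading space in each name, to mimic the issue) into a temporary folder. The demo file names and contents are assumptions for illustration only.

```r
# Create a throwaway folder with three small text files (illustrative data only).
folder <- tempfile("texts_demo_")
dir.create(folder)
demo_files <- c(" ENTITY_117_MOR.txt", " INCREMENTAL_208_MOR.txt", " MODERATE_173_MOR.txt")
for (f in demo_files) writeLines("Lorem ipsum", file.path(folder, f))

# Full paths of all files whose names end in ".txt".
paths <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# Trim stray whitespace and drop the extension to get clean file names.
filename <- sub("\\.txt$", "", trimws(basename(paths)))

dat <- data.frame(
  row      = seq_along(paths),
  filename = filename,
  # Everything before the first underscore is taken as the category.
  category = factor(sub("_.*$", "", filename)),
  # Read each file whole; multiple lines are joined with "\n".
  text     = vapply(paths,
                    function(p) paste(readLines(p, warn = FALSE), collapse = "\n"),
                    character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(dat)
```

Because the names are trimmed before the category is extracted, this variant does not depend on whether the file names carry a leading space. If the tibble package is installed, `tibble::as_tibble(dat)` should turn the result into a tibble.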