Finding names in a txt file


I have a long text in a txt file (T1.txt). I would like to find every (English) name in the file, together with the 2 words preceding it and the 2 words following it. For instance, given the following text:

    "Hello world!, my name is Mr. A.B. Morgan (in short) and it is nice to meet you."
    Orange Silver paid 100$ for his gift.
    I'll call Dina H. in two hours.

I would like to get the following dataframe:

    > df1
        Before      Name           After
    1   name is     A. B. Morgan   in short
    2               Orange Silver  paid 100$
    3   I'll call   Dina H.        in two

Solution

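  • This is neither perfect nor pretty, but it's a start. It treats every capitalized token after the first character of the string as part of the name, so it assumes one name per string: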

    text1 <- c("Hello world!, my name is Mr. A.B. Morgan (in short) and it is nice to meet you.")
    text2 <- c("Orange Silver paid 100$ for his gift.")
    text3 <- c("I'll call Dina H. in two hours.")
    
    library(stringr)
    
    find_names_and_BA <- function(x) {
      # Collect capitalized tokens, skipping the first character so that a
      # sentence-initial word ("Hello", "I'll") is not taken as part of the name.
      matches <- str_extract_all(str_sub(x, 2), "[A-Z]\\S+")[[1]]

      # Fewer than two tokens: the name probably starts the sentence.
      if (length(matches) < 2) { matches <- str_extract_all(x, "[A-Z]\\S+")[[1]] }
      name_match <- paste(matches, collapse = " ")

      # fixed() matches the name literally, so "." is not a regex wildcard.
      loc <- str_locate(x, fixed(name_match))
      beg_of_match <- loc[1, "start"]
      end_of_match <- loc[1, "end"]

      # Words strictly before and strictly after the name; base substr()
      # returns "" when the requested range is empty.
      start_words <- str_extract_all(substr(x, 1, beg_of_match - 1), "\\w+")[[1]]
      end_words <- str_extract_all(substr(x, end_of_match + 1, nchar(x)), "\\w+")[[1]]

      before <- paste(tail(start_words, 2), collapse = " ")
      after <- paste(head(end_words, 2), collapse = " ")
      return(data.frame(Before = before, Name = name_match, After = after,
                        stringsAsFactors = FALSE))
    }

    dplyr::bind_rows(find_names_and_BA(text1),
                     find_names_and_BA(text2),
                     find_names_and_BA(text3))

    # Source: local data frame [3 x 3]
    # 
    #    Before            Name    After
    #     (chr)           (chr)    (chr)
    # 1 name is Mr. A.B. Morgan in short
    # 2           Orange Silver paid 100
    # 3 ll call         Dina H.   in two
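
Because `\w+` splits "I'll" at the apostrophe and drops the "$", a token-based variant may get closer to the table in the question. What follows is only a sketch in base R, not part of the answer above: `find_name_context`, its `titles` list and the punctuation set it strips are all hypothetical choices, and it assumes one name per line, made up of consecutive capitalized tokens.

```r
# Hypothetical helper (not from the original answer): token-based variant.
# Assumes one name per line, made of consecutive capitalized tokens.
find_name_context <- function(x, n = 2,
                              titles = c("Mr.", "Mrs.", "Ms.", "Dr.")) {
  tokens <- strsplit(x, "\\s+")[[1]]
  tokens <- tokens[!tokens %in% titles]      # drop honorifics such as "Mr."

  # Name-like: a capital letter followed by at least one non-space character.
  is_cap <- grepl("^[A-Z]\\S", tokens)
  # A capitalized first token only counts if the next token is capitalized
  # too, so "Hello" and "I'll" are skipped but "Orange Silver" is kept.
  if (length(tokens) > 1 && is_cap[1] && !is_cap[2]) is_cap[1] <- FALSE
  if (!any(is_cap)) return(NULL)

  # First run of consecutive name-like tokens.
  first <- which(is_cap)[1]
  last <- first
  while (last < length(tokens) && is_cap[last + 1]) last <- last + 1

  # Strip wrapping punctuation from context words, keeping "I'll" and "100$".
  strip <- function(w) gsub("^[(\"]+|[),.!?\"]+$", "", w)
  before <- strip(tail(tokens[seq_len(first - 1)], n))
  after <- strip(head(tokens[-seq_len(last)], n))

  data.frame(Before = paste(before, collapse = " "),
             Name = paste(tokens[first:last], collapse = " "),
             After = paste(after, collapse = " "),
             stringsAsFactors = FALSE)
}

# For the real file, readLines("T1.txt") would supply one string per line.
lines <- c("Hello world!, my name is Mr. A.B. Morgan (in short) and it is nice to meet you.",
           "Orange Silver paid 100$ for his gift.",
           "I'll call Dina H. in two hours.")

rows <- lapply(lines, find_name_context)
df1 <- do.call(rbind, Filter(Negate(is.null), rows))  # skip lines with no name
```

On the three sample lines this gives "name is", "" and "I'll call" in `Before`, and "A.B. Morgan", "Orange Silver" and "Dina H." in `Name`. Any other capitalized word in the middle of a line (a place name, say) would still be picked up as a name, so real text would probably need a proper named-entity tool instead.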