
Separate sentences ending with a scientific reference number in R


I am working on a project where one of the steps is to split the text of scientific articles into sentences. For this, I am using textrank, which, as I understand it, looks for ., ?, ! etc. to identify the end of a sentence during tokenization.

The problem I am running into is sentences that end with a period followed directly by a reference number (that also might be in brackets). The examples below represent the patterns I identified and collected so far.


xx = c("hello.1 World", "hello.1,2 World", "hello.(1) world", "hello.(1,2) World", "hello.[1,2] World", "hello.[1] World")

I did some searching, and it looks like "sentence boundary detection" is a field of its own that can get complex and domain-specific.

The only way I can think of to fix this problem (in my case at least) is to write a regex that adds a space after the period, so that textrank can identify the sentence boundary using its usual pattern.

Any suggestions on how to do that with regex in R? I tried my best to search online, but I could not find an answer.

This question, "Add space between two letters in a string in R", explains how to add a space between a lowercase letter followed by an uppercase letter. In my case, I believe I will need to add a space between a letter followed by a period and a number or bracket.

My expected output is something like:

("hello. 1 World", "hello. 1,2 World",  "hello. (1) world", "hello. (1,2) World", "hello. [1,2] World", "hello. [1] World")

Thank you


Solution

  • For the exact sample inputs you gave us, you may do a regex search on the following pattern:

    \.(?=\d+|\(\d+(?:,\d+)*\)|\[\d+(?:,\d+)*\])
    

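    and then replace each match with a dot followed by a single space. The lookahead (?=...) only checks that a reference number follows the dot, so the number itself is left untouched. Sample script: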

    xx <- c("hello.1 World", "hello.1,2 World", "hello.(1) world", "hello.(1,2) World",
            "hello.[1,2] World", "hello.[1] World")
    output <- gsub("\\.(?=\\d+|\\(\\d+(?:,\\d+)*\\)|\\[\\d+(?:,\\d+)*\\])", ". ", xx, perl=TRUE)
    output
    
    [1] "hello. 1 World"     "hello. 1,2 World"   "hello. (1) world"
    [4] "hello. (1,2) World" "hello. [1,2] World" "hello. [1] World"
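
If real article text turns out to be messier than the six samples, the same idea could be wrapped in a small helper. The sketch below is only an illustration under a few assumptions that are not in the question: the function name `add_space_before_refs` is hypothetical, references may contain spaces after commas or hyphenated ranges such as `1-3`, and the period must be preceded by a letter so that decimals like `3.14` are left alone (the pattern above would turn `3.14` into `3. 14`). A sentence that ends in a digit or a closing bracket before the period would be missed by this variant.

```r
# Hypothetical helper (not part of the answer above): inserts a space after a
# period that follows a letter and is directly followed by a reference marker.
add_space_before_refs <- function(x) {
  # Assumed reference format: numbers separated by commas or hyphens,
  # with optional whitespace around the separators.
  ref_list <- "\\d+(?:\\s*[,-]\\s*\\d+)*"
  pattern <- paste0(
    "(?<=[A-Za-z])\\.",        # period preceded by a letter (skips decimals)
    "(?=", ref_list,           # bare reference: .1 or .1,2
    "|\\(", ref_list, "\\)",   # parenthesised: .(1) or .(1-3)
    "|\\[", ref_list, "\\])"   # bracketed: .[1] or .[1, 2]
  )
  gsub(pattern, ". ", x, perl = TRUE)
}

xx <- c("hello.1 World", "hello.1,2 World", "hello.(1) world",
        "hello.(1,2) World", "hello.[1,2] World", "hello.[1] World")
add_space_before_refs(xx)
add_space_before_refs(c("hello.[1, 2] World", "hello.(1-3) World", "pi is 3.14 here"))
```

On the six sample strings this gives the same result as the one-line `gsub` call; the extra cases only show where the assumed variations would matter.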