I am working on a project where one of the steps is to separate the text of scientific articles into sentences. For this, I am using textrank, which, as I understand it, looks for ., ?, !, etc. to identify the end of a sentence during tokenization.
The problem I am running into is sentences that end with a period followed directly by a reference number (that also might be in brackets). The examples below represent the patterns I identified and collected so far.
xx = c ("hello.1 World", "hello.1,2 World", "hello.(1) world", "hello.(1,2) World", "hello.[1,2] World", "hello.[1] World")
I did some searching, and it looks like "sentence boundary detection" is a field of its own that can get complex and domain-specific.
The only way I can think of to fix this problem (in my case at least) is to write a regex that adds a space after the period, so that textrank can identify the sentence boundary using its usual pattern.
Any suggestions on how to do that with a regex in R? I tried my best to search online, but I could not find an answer.
This question explains how to add a space between a lower-case letter followed by an upper-case letter: "Add space between two letters in a string in R". In my case, I believe I need to add a space after a period that follows a letter and is followed directly by a number or bracket.
My expected output is something like:
("hello. 1 World", "hello. 1,2 World", "hello. (1) world", "hello. (1,2) World", "hello. [1,2] World", "hello. [1] World")
Thank you
For the exact sample inputs you gave us, you may do a regex search on the following pattern:
\.(?=\d+|\(\d+(?:,\d+)*\)|\[\d+(?:,\d+)*\])
and then replace each match with a dot followed by a single space. Sample script:
xx <- c("hello.1 World", "hello.1,2 World", "hello.(1) world", "hello.(1,2) World",
"hello.[1,2] World", "hello.[1] World")
output <- gsub("\\.(?=\\d+|\\(\\d+(?:,\\d+)*\\)|\\[\\d+(?:,\\d+)*\\])", ". ", xx, perl=TRUE)
output
[1] "hello. 1 World" "hello. 1,2 World" "hello. (1) world"
[4] "hello. (1,2) World" "hello. [1,2] World" "hello. [1] World"
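One caveat worth keeping in mind: the pattern above also matches the period inside a decimal number such as 0.05, since that period is followed by digits too. If your articles contain decimals, a possible refinement (a sketch, not tested against real article text) is to add a lookbehind that requires a letter immediately before the period, which is what you described in the question. The vector yy and the names pattern and output2 below are made up for illustration:

```r
# Hypothetical inputs: the first one mixes a decimal number with a citation marker.
yy <- c("p = 0.05 was used.(1) Next", "hello.1,2 World", "hello.[1] World")

# (?<=[[:alpha:]]) requires a letter immediately before the period,
# so the period inside "0.05" is left alone. The lookahead is unchanged.
pattern <- "(?<=[[:alpha:]])\\.(?=\\d+|\\(\\d+(?:,\\d+)*\\)|\\[\\d+(?:,\\d+)*\\])"

# perl = TRUE is needed because lookarounds are PCRE features.
output2 <- gsub(pattern, ". ", yy, perl = TRUE)
print(output2)
```

This assumes a sentence never ends in a digit right before the citation marker (for example "in 2019.3"); a case like that would need its own rule.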