I am preparing strings that hold document titles for use as search terms in the US Patent website using Python 3.
1) It is beneficial to keep long phrases, but
2) searches do not do well when they include many words that are 3 or fewer characters in length, so I need to eliminate them.
I have tried the regex "\b\w{1:3}\b *" to split on one- to three-letter words with or without a trailing space, but have not had success. Then again, I'm no expert in regex.
for pubtitle in df_tpdownloads['PublicationTitleSplit']:
    pubtitle = pubtitle.lower() # make lower case
    pubtitle = re.split("[?:.,;\"\'\-()]+", pubtitle) # tokenize and remove punctuation
    #print(pubtitle)
    for subArray in pubtitle:
        print(subArray)
        subArray = subArray.strip()
        subArray = re.split("(\b\w{1:3}\b) *", subArray) # split on words that are < 4 letters
        print(subArray)
The code above steps through a pandas Series and cleans out punctuation, but fails to split on word length.
I expect to see something like the examples below.
Examples:
So, " and training requirements for selected salt applications" becomes ['training requirements', 'selected salt applications'].
And, "december 31" becomes ['december'].
And, "experimental system for salt in an emergence research and applications in process heat" becomes ['experimental system', 'salt', 'emergence research', 'applications', 'process heat'].
But the split doesn't capture the small words, and I'm not able to tell if the problem is the regex, the re.split command, or both.
I can probably do a brute force approach, but would like an elegant solution. Any help would be appreciated.
You may use
list(filter(None, re.split(r'\s*\b\w{1,3}\b\s*|[^\w\s]+', pubtitle.strip().lower())))
to obtain the result you want.
The r'\s*\b\w{1,3}\b\s*|[^\w\s]+' regex splits the lowercased (with .lower()) string without leading and trailing whitespace (due to .strip()) into tokens that have no punctuation ([^\w\s]+ does that) and no words of 1 to 3 word chars (\s*\b\w{1,3}\b\s* does that).
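As for why the pattern in the question never matched, two things appear to be going on: in a non-raw string literal "\b" is the backspace character rather than a word boundary, and {1:3} is not a valid quantifier, so re treats it as literal text. The short check below is only a sketch to illustrate both points; the sample title is taken from the question.

```python
import re

title = "experimental system for salt in an emergence research"

# In a normal (non-raw) string literal, "\b" is the backspace character,
# not the regex word-boundary token. A raw string keeps the backslash.
assert "\b" == "\x08"
assert r"\b" == "\\b"

# "{1:3}" is not a valid quantifier, so re matches it as literal text.
assert re.fullmatch(r"\w{1:3}", "a{1:3}") is not None
assert re.search(r"\b\w{1:3}\b", title) is None

# With a raw string and a comma in the quantifier, the short words match.
short_words = re.findall(r"\b\w{1,3}\b", title)
print(short_words)
```

So both the pattern and the way it was written as a string needed fixing; re.split itself behaves as expected.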
Pattern details
\s* - 0+ whitespaces
\b - a word boundary
\w{1,3} - 1, 2 or 3 word chars (if you do not want to match _ use [^\W_]{1,3})
\b - a word boundary
\s* - 0+ whitespaces
| - or
[^\w\s]+ - 1 or more chars other than word and whitespace chars.
See the Python demo:
import re
df_tpdownloads = [" and training requirements for selected salt applications",
                  "december 31",
                  "experimental system for salt in an emergence research and applications in process heat"]
#for pubtitle in df_tpdownloads['PublicationTitleSplit']:
for pubtitle in df_tpdownloads:
    result = list(filter(None, re.split(r'\s*\b\w{1,3}\b\s*|[^\w\s]+', pubtitle.strip().lower())))
    print(result)
Output:
['training requirements', 'selected salt applications']
['december']
['experimental system', 'salt', 'emergence research', 'applications', 'process heat']
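If the same split is needed in several places, it may be convenient to wrap it in a small function. The sketch below is one possible way to do that; the names SPLIT_PATTERN and split_title are hypothetical and not from the question. It also strips each piece, which is an addition on my part: when a title contains punctuation followed by a space, the plain one-liner can leave a leading space on the next phrase.

```python
import re

# Hypothetical helper names; the pattern is the one explained above.
SPLIT_PATTERN = re.compile(r'\s*\b\w{1,3}\b\s*|[^\w\s]+')

def split_title(title):
    """Split a title into phrases, dropping punctuation and 1-3 char words."""
    # Strip every piece so that text following punctuation has no stray spaces.
    pieces = (piece.strip() for piece in SPLIT_PATTERN.split(title.lower()))
    # Discard the empty strings that re.split leaves between adjacent matches.
    return [piece for piece in pieces if piece]

print(split_title("december 31"))
print(split_title("Heat transfer, salt systems"))
```

Assuming the column in the question holds plain strings, something like df_tpdownloads['PublicationTitleSplit'].apply(split_title) should then give one list of phrases per title.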