python, regex, pandas, text, split

pandas: split on a dot unless a number or a single character precedes it


I have a dataframe as follows:

import pandas as pd
df = pd.DataFrame({'text':['she is a. good 15. year old girl. she goes to school on time.', 'she is not an A. level student. This needs to be discussed.']})

To split on the dot (.) and explode, I have done the following:

df = df.assign(text=df['text'].str.split('.')).explode('text')

However, I do not want to split after every dot. I would like to split on a dot unless it is preceded by a number (e.g. 22., 3.4) or by a single character (e.g. a., a.b., b.d).

desired_output:

   text
'she is a. good 15. year old girl'
'she goes to school on time'
'she is not an A. level student'
'This needs to be discussed.'

I also tried the following pattern, hoping to ignore single characters and numbers, but it removes the last letter from the final word of each sentence.

df.assign(text=df['text'].str.split(r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+')).explode('text')
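The lost letter is a consequence of how splitting works: whatever the pattern matches is treated as the separator and discarded, and this pattern matches the letter together with the dot. A minimal sketch of the effect with plain `re` follows; the second pattern is an assumed variant (not from the original post) that moves the letter check into a lookbehind so that only the dot itself is matched.

```python
import re

sample = 'year old girl. she goes to school'

# Pattern from the question: the letter before the dot is part of the match,
# so re.split (like Series.str.split) throws it away along with the dot.
consuming = r'(?:(?<!\.|\s)[a-z]\.|(?<!\.|\s)[A-Z]\.)+'
print(re.split(consuming, sample))
# ['year old gir', ' she goes to school']

# Assumed variant: the two characters before the dot are only inspected by a
# lookbehind, so the match is the dot alone and the letter survives.
zero_width = r'(?<=[^\s.][A-Za-z])\.'
print(re.split(zero_width, sample))
# ['year old girl', ' she goes to school']
```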

I edited the pattern so that it now matches every dot that comes after a number or a single letter: r'(?:(?<=.|\s)[[a-zA-Z]].|(?<=.|\s)\d+)+'. So I guess I only need to figure out how to split on a dot except where this last pattern matches.


Solution

  • #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    
    import re
    
    source = 'she is a. good 15. year old girl. she goes to school on time. she is not an A. level student. This needs to be discussed.'
    
    # Split on every dot first, then glue back the pieces that belong together.
    pieces = re.split(r'\.', source)
    
    output = []
    text = ''
    for piece in pieces:
        text += piece
    
        # A piece ending in a single letter or a number (at the start or after
        # whitespace) means the dot that followed it did not end a sentence.
        if re.search(r'(^|\s)([a-z]|[0-9]+)$', piece, re.IGNORECASE):
            text += '.'
        else:
            text = text.strip()
            if text:
                output.append(text)
            text = ''
    
    print(output)
    

    Output:

    ['she is a. good 15. year old girl', 'she goes to school on time', 'she is not an A. level student', 'This needs to be discussed']
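The same idea can probably be expressed as a single split pattern, which is easier to apply row by row. The sketch below is an approximation rather than an exact equivalent of the loop above: it assumes that a dot should be kept whenever it directly follows any digit or a one-letter word, and the names `SENTENCE_DOT` and `split_sentences` are made up for illustration.

```python
import re

# Assumed rule: do not split on a dot that directly follows a digit or a
# one-letter word; whitespace after a splitting dot is swallowed as well.
SENTENCE_DOT = re.compile(r'(?<!\d)(?<!\b[A-Za-z])\.\s*')

def split_sentences(text):
    """Split text on sentence-ending dots and drop empty pieces."""
    return [part.strip() for part in SENTENCE_DOT.split(text) if part.strip()]

rows = [
    'she is a. good 15. year old girl. she goes to school on time.',
    'she is not an A. level student. This needs to be discussed.',
]
for row in rows:
    print(split_sentences(row))
# ['she is a. good 15. year old girl', 'she goes to school on time']
# ['she is not an A. level student', 'This needs to be discussed']
```

On the DataFrame from the question, this could presumably be used as `df.assign(text=df['text'].map(split_sentences)).explode('text')`, since `explode` turns each list element into its own row.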