python, nlp, text-processing

Removing strings until a condition is matched in Python


I have these strings:

text1 = "  SPEECH Remarks at the European Economics and Financial Centre Remarks by Luis de Guindos, Vice-President of the ECB, at the European Economics and Financial Centre London, 2 March 2020 I am delighted to be here today at the European Economics and F"
text2 = "  SPEECH  The ECB’s response to the COVID-19 pandemic Remarks by Isabel Schnabel, Member of the Executive Board of the ECB, at a 24-Hour Global Webinar co-organised by the SAFE Policy Center on “The COVID-19 Crisis and Its Aftermath: Corporate Governance Implications and Policy Challenges” Frankfurt am Main, 16 April 2020 The COVID-19 pandemic is a shock of unprecedented intensity and severity. Th"

How can I remove all the text up to and including the first date that appears in each string?

The expected result should be:

text1 = "I am delighted to be here today at the European Economics and F"

text2 = "The COVID-19 pandemic is a shock of unprecedented intensity and severity. Th"

IMPORTANT

Please note that, because I am handling a large number of similar documents, I cannot know all the dates in advance. The ideal solution should detect the dates itself and remove the unnecessary text at the beginning.


Solution

  • Using a regular expression

    Code

    import re
    
    # Day, then a full or abbreviated month name, then a four-digit year,
    # e.g. "2 March 2020" or "02 Mar 2020".
    DATE_PATTERN = re.compile(
        r'\b\d{1,2}\s+(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
        r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{4}\b'
    )
    
    def remove_predate(text):
        '''Return the text after the first date, or text unchanged if no date is found.'''
        m = DATE_PATTERN.search(text)  # search() also finds dates that follow a line break
        if m:
            # Skip the date and everything before it, then drop the leading whitespace
            return text[m.end():].lstrip()
        return text
    

    Tests

    print(remove_predate(text1))
    print(remove_predate(text2))
    

    Output

    I am delighted to be here today at the European Economics and F
    The COVID-19 pandemic is a shock of unprecedented intensity and severity. Th
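
Since the question says the documents are numerous and the dates are not known in advance, some of them may write the date in another layout. The following is only a sketch of how the same idea could be extended. The layouts "March 2, 2020" and "2020-03-02", the `DATE_LAYOUTS` list and the function name `remove_predate_multi` are assumptions made for illustration and do not come from the question.

```python
import re

MONTH = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
         r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')

# Hypothetical list of date layouts; extend it to fit the documents at hand.
DATE_LAYOUTS = [
    rf'\d{{1,2}}\s+{MONTH}\s+\d{{4}}',    # 2 March 2020
    rf'{MONTH}\s+\d{{1,2}},?\s+\d{{4}}',  # March 2, 2020
    r'\d{4}-\d{2}-\d{2}',                 # 2020-03-02
]

# One alternation, so the earliest date in the text wins whatever its layout.
ANY_DATE = re.compile(r'\b(?:' + '|'.join(DATE_LAYOUTS) + r')\b')

def remove_predate_multi(text):
    """Drop everything up to and including the first date in any listed layout."""
    m = ANY_DATE.search(text)
    return text[m.end():].lstrip() if m else text
```

Keeping the layouts in a list means a new format can be added as one more pattern without touching the function. Text that contains no recognised date is returned unchanged, so documents that fall through are easy to spot and inspect.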