
Text data extraction between keywords in a string


I have text data that looks like the following after extracting from a file and cleaning. I want to put the data into a pandas dataframe where the columns are ('EXAMINATION', 'TECHNIQUE', 'COMPARISON', 'FINDINGS', 'IMPRESSION'), and each cell in each row contains the extracted data related to the column name (i.e. the keyword).

'FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection TECHNIQUE: Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process'

For example, under the column TECHNIQUE there should be a cell containing "Chest PA and lateral", and under the column IMPRESSION, there should be a cell containing "No acute cardiopulmonary process".


Solution

  • The solution below relies on the following assumptions:

    1. Keywords as presented are located in that order within the sample text.
    2. The keywords are not contained within the text to be extracted.
    3. Each keyword is followed by ": " (the colon and the space are removed from the extracted text).
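
The three assumptions above can be checked before extraction. The following is only a minimal sketch: `check_keyword_assumptions`, `demo_text` and `demo_keywords` are hypothetical names introduced here for illustration, and the demo string is a shortened stand-in for the real report text.

```python
def check_keyword_assumptions(clean_text, keyword_list):
    """Return a list of problems; an empty list means the assumptions hold."""
    problems = []
    last_index = -1
    for kw in keyword_list:
        # Assumption 3: the keyword is followed by ": "
        index = clean_text.find(kw + ": ")
        if index == -1:
            problems.append(f"{kw}: missing, or not followed by ': '")
            continue
        # Assumption 2: the keyword does not recur inside the extracted text
        if clean_text.count(kw) > 1:
            problems.append(f"{kw}: occurs more than once")
        # Assumption 1: keywords appear in the same order as in keyword_list
        if index < last_index:
            problems.append(f"{kw}: out of order")
        last_index = max(last_index, index)
    return problems


# Hypothetical, shortened demo input
demo_text = "EXAMINATION: CHEST PA AND LAT TECHNIQUE: Chest PA and lateral"
demo_keywords = ["EXAMINATION", "TECHNIQUE"]

print(check_keyword_assumptions(demo_text, demo_keywords))  # prints []
```

If the returned list is not empty, the report probably needs manual review or a more tolerant parser before it is added to the dataframe.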

    Code

    import pandas as pd
    
    sample = "FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection TECHNIQUE: Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process"
    
    keywords = ["EXAMINATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]
    
    
    # Extract the text that follows each keyword, up to the next keyword
    def extract_text_using_keywords(clean_text, keyword_list):
        extracted_texts = []
        for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
            # str.index raises ValueError if a keyword is missing from the text
            prev_kw_index = clean_text.index(prev_kw)
            current_kw_index = clean_text.index(current_kw)
            # Skip the keyword plus the 2 characters of ": " that follow it
            extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index].strip())
        # Extract the text after the final keyword in keyword_list (i.e. "IMPRESSION")
        last_kw = keyword_list[-1]
        last_kw_index = clean_text.index(last_kw)
        extracted_texts.append(clean_text[last_kw_index + len(last_kw) + 2:].strip())
        return extracted_texts
    
    
    # Extract text
    result = extract_text_using_keywords(sample, keywords)
    # Create pandas dataframe
    df = pd.DataFrame([result], columns=keywords)
    
    print(df)
    
    # To append future results to the end of the pandas df you can use
    # df.loc[len(df)] = result
    

    Output

       EXAMINATION                                        TECHNIQUE                  COMPARISON    FINDINGS                                           IMPRESSION
    0  CHEST PA AND LAT INDICATION: F with new onset ...  Chest PA and lateral       None          There is no focal consolidation pleural effusi...  No acute cardiopulmonary process
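
In the output above, the EXAMINATION cell also carries the INDICATION text, because INDICATION is not in the keyword list and so does not act as a boundary. One possible way around this is to treat every known section header as a boundary and then keep only the wanted columns. This is a rough sketch of that alternative, not the method used in the answer: it swaps the index-based slicing for `re.split` with a capturing group. `split_sections`, `all_headers`, `wanted` and `report` are hypothetical names, and `report` is a shortened version of the sample text.

```python
import re


def split_sections(clean_text, headers):
    """Map each header found in clean_text to the text that follows it."""
    # The capturing group makes re.split keep the matched headers in the result
    pattern = r"\b(" + "|".join(re.escape(h) for h in headers) + r"):\s*"
    parts = re.split(pattern, clean_text)
    # parts looks like [preamble, header1, body1, header2, body2, ...]
    return {h: body.strip() for h, body in zip(parts[1::2], parts[2::2])}


# Every header that can occur in a report (an assumption about the data)
all_headers = ["EXAMINATION", "INDICATION", "TECHNIQUE",
               "COMPARISON", "FINDINGS", "IMPRESSION"]
# The columns requested in the question
wanted = ["EXAMINATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]

# Shortened version of the sample report, for illustration only
report = ("FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new "
          "onset ascites eval for infection TECHNIQUE: Chest PA and lateral "
          "COMPARISON: None FINDINGS: There is no focal consolidation "
          "IMPRESSION: No acute cardiopulmonary process")

sections = split_sections(report, all_headers)
row = [sections.get(col) for col in wanted]
print(row[0])  # prints CHEST PA AND LAT
```

The `row` list can be passed to `pd.DataFrame([row], columns=wanted)` in the same way as `result` in the answer. A header that is absent from a report yields `None` in its cell instead of raising `ValueError`.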