
Extracting text data from files in different sub-directories raises "ValueError: substring not found"


I was trying to extract text data from files in different sub-directories and put the extracted data into pandas dataframes.

An example of the text data is given below:

"EXAMINATION: CHEST PA AND LAT INDICATION: History: F with shortness of breath TECHNIQUE: Chest PA and lateral COMPARISON: FINDINGS: The cardiac mediastinal and hilar contours are normal. Pulmonary vasculature is normal. Lungs are clear. No pleural effusion or pneumothorax is present. Multiple clips are again seen projecting over the left breast. Remote leftsided rib fractures are also re demonstrated. IMPRESSION: No acute cardiopulmonary abnormality."

However, when I run the code given below, it raises the following error. How do I resolve this?

Error

ValueError                                Traceback (most recent call last)
<ipython-input-108-bbeeb452bdef> in <module>
     48         df = pd.DataFrame(columns=keywords)
     49         # Extract text
---> 50         result = extract_text_using_keywords(text, keywords)
     51         # Append list of extracted text to the end of the pandas df
     52         df.loc[len(df)] = result

<ipython-input-108-bbeeb452bdef> in extract_text_using_keywords(clean_text, keyword_list)
     39             for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
     40                 prev_kw_index = clean_text.index(prev_kw)
---> 41                 current_kw_index = clean_text.index(current_kw)
     42                 extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
     43                 if current_kw == keyword_list[-1]:

ValueError: substring not found

Code

out = []
result = {}

for filename in glob.iglob('/content/sample_data/**/*.txt', recursive = True):
    
    out.append(filename)

print('File names: ',out)

for file in out:
      
        with open(file) as f:
          data = f.read()
          
    
        import re
        text = re.sub(r"[-_()\n\"#//@;<>{}=~|?,]*", "", data)
        text = re.sub(r'FINAL REPORT', '', text)
        text = re.sub(r'\s+', ' ', text)
        print(text)

        keywords = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]

        # Create function to extract text between each of the keywords
        # Assumption
        def extract_text_using_keywords(clean_text, keyword_list):
            extracted_texts = []
            for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
                prev_kw_index = clean_text.index(prev_kw)
                current_kw_index = clean_text.index(current_kw)
                extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
                if current_kw == keyword_list[-1]:
                    extracted_texts.append(clean_text[current_kw_index + len(current_kw) + 2:len(clean_text)])
            return extracted_texts

        # Create empty pandas df with keywords as column names
        df = pd.DataFrame(columns=keywords)
        # Extract text
        result = extract_text_using_keywords(text, keywords)
        # Append list of extracted text to the end of the pandas df
        df.loc[len(df)] = result

        #print(df)

        with pd.option_context('display.max_colwidth', None): # For displaying full columns
          display(df)

Solution

  • The ValueError is raised by the function call index() in the line current_kw_index = clean_text.index(current_kw) because clean_text does not contain the current_kw that the code is attempting to find.
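
As a minimal illustration of that behaviour (the sample string is made up for this example, not taken from your files): str.index() raises ValueError when the substring is absent, whereas str.find() returns -1 instead.

```python
# Hypothetical sample text in which the "TECHNIQUE" keyword is absent
sample = "INDICATION: cough COMPARISON: None"

# str.find() signals a missing substring by returning -1
missing_pos = sample.find("TECHNIQUE")

# str.index() signals the same situation by raising ValueError
try:
    sample.index("TECHNIQUE")
    raised = False
    message = ""
except ValueError as err:
    raised = True
    message = str(err)

print(missing_pos, raised, message)
```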

    It is likely that in at least one of your files the text passed to result = extract_text_using_keywords(text, keywords) is missing one or more of "INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", or "IMPRESSION". The easiest way to resolve this is to identify which file is causing the issue and add the missing keyword.
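
One possible way to find the offending files is sketched below. The helper names missing_keywords and report_missing are hypothetical (they do not appear in the question), and the sketch checks the raw file contents rather than the cleaned text, which should be equivalent here since the cleaning regex does not remove letters.

```python
import glob

KEYWORDS = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]


def missing_keywords(text, keyword_list):
    """Return the keywords that do not occur in text, in list order."""
    return [kw for kw in keyword_list if kw not in text]


def report_missing(pattern, keyword_list):
    """Map each matching file path to its missing keywords.

    Files that contain every keyword are left out of the report.
    """
    report = {}
    for path in glob.iglob(pattern, recursive=True):
        with open(path) as f:
            missing = missing_keywords(f.read(), keyword_list)
        if missing:
            report[path] = missing
    return report


if __name__ == "__main__":
    # Path pattern taken from the question; adjust it to your directory layout
    found = report_missing('/content/sample_data/**/*.txt', KEYWORDS)
    for path, missing in found.items():
        print(path, "is missing:", missing)
```

Running this before the extraction loop lists each problematic file together with the keywords it lacks, so you know exactly which files to fix.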

    To make this debugging easier, you can update extract_text_using_keywords() to wrap each index() lookup in a try/except block, so that a ValueError produces a more useful message. You can also update other parts of the code to handle the knock-on problems caused by a missing keyword. A complete solution is as follows:

    import glob
    import pandas as pd
    import re
    
    # Get & print all .txt file names with directory information
    out = []
    for filename in glob.iglob('content/sample_data/**/*.txt', recursive = True):
        out.append(filename)
    print('File names: ', out)
    
    # Define keywords
    keywords = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]
    # Create empty pandas df with keywords as column names
    df = pd.DataFrame(columns=keywords)
    
    
    # Create function to extract text between each of the keywords
    def extract_text_using_keywords(clean_text, keyword_list):
        extracted_texts = []
        for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
            try:            
                prev_kw_index = clean_text.index(prev_kw)
            except ValueError:
                print("Keyword {} was not found in the text.".format(prev_kw))
            try:
                current_kw_index = clean_text.index(current_kw)
            except ValueError:
                print("Keyword {} was not found in the text.".format(current_kw))
            try:
                extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
                if current_kw == keyword_list[-1]:
                    extracted_texts.append(clean_text[current_kw_index + len(current_kw) + 2:len(clean_text)])
            except UnboundLocalError:
                print("An index was not assigned for a particular keyword.")
        return extracted_texts
    
    
    # Iterate over all .txt files
    for file in out:
        with open(file) as f:
          data = f.read()
    
        text = re.sub(r"[-_()\n\"#//@;<>{}=~|?,]*", "", data)
        text = re.sub(r'FINAL REPORT', '', text)
        text = re.sub(r'\s+', ' ', text)
        # print(text)
    
        # Extract text
        result = extract_text_using_keywords(text, keywords)
    
        # If all keywords and their results were found
        if len(result) == len(keywords):
            # Append list of extracted text to the end of the pandas df
            df.loc[len(df)] = result
        else:
            print("\nFailed to extract text for one or more keywords.\
            \nPlease check that {} are all present in the following text:\n\n{}\n".format(keywords, text))
    
    # Display results
    print(df)
    # with pd.option_context('display.max_colwidth', None): # For displaying full columns
    #     display(df)
    

    This produces the following diagnostic output when a keyword is missing from the text (e.g. "TECHNIQUE"):

    Keyword TECHNIQUE was not found in the text.
    An index was not assigned for a particular keyword.
    Keyword TECHNIQUE was not found in the text.
    
    Failed to extract text for one or more keywords.
    Please check that ['INDICATION', 'TECHNIQUE', 'COMPARISON', 'FINDINGS', 'IMPRESSION'] are all present in the following text:
    
     EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection : Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process
    
    Empty DataFrame
    Columns: [INDICATION, TECHNIQUE, COMPARISON, FINDINGS, IMPRESSION]
    Index: []
    

    It produces the desired output when all keywords are present:

    File names:  ['content/sample_data\\my_data.txt', 'content/sample_data\\my_data2.txt']
                                         INDICATION              TECHNIQUE COMPARISON                                           FINDINGS                        IMPRESSION
    0  F with new onset ascites eval for infection   Chest PA and lateral       None   There is no focal consolidation pleural effusi...  No acute cardiopulmonary process
    1   Chronic pain noted in lower erector spinae                Palpate       None   Upper iliocostalis thoracis triggers pain alon...                               Nil
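
One caveat with the function above: prev_kw_index and current_kw_index keep their values from the previous loop iteration, so when the last keyword ("IMPRESSION") is the missing one, the function still returns five entries and a row with incorrect text is appended. A more defensive variant is sketched below. The name extract_sections is hypothetical, and the sketch assumes every keyword is followed directly by a colon, as in the sample reports. It locates each keyword once with str.find() and maps absent keywords to None rather than raising.

```python
def extract_sections(clean_text, keyword_list):
    """Map each keyword to the text between 'KEYWORD:' and the next found keyword.

    Keywords that are absent from clean_text map to None.
    """
    # Locate every keyword once; find() returns -1 when it is absent
    positions = {kw: clean_text.find(kw + ":") for kw in keyword_list}
    # Keep only the keywords that were found, ordered by position in the text
    found = sorted((pos, kw) for kw, pos in positions.items() if pos != -1)

    sections = {kw: None for kw in keyword_list}
    for i, (pos, kw) in enumerate(found):
        start = pos + len(kw) + 1  # skip the keyword and its colon
        # A section ends where the next found keyword starts, or at end of text
        end = found[i + 1][0] if i + 1 < len(found) else len(clean_text)
        sections[kw] = clean_text[start:end].strip()
    return sections


keywords = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]
text = ("EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites "
        "COMPARISON: None FINDINGS: Lungs are clear IMPRESSION: No acute process")
sections = extract_sections(text, keywords)
print(sections)
```

To store the result in the DataFrame, you could append [sections[kw] for kw in keywords] as a row, which leaves missing sections empty instead of rejecting the whole file. Whether a partially filled row is acceptable depends on how you use the data afterwards.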