I was trying to extract text data from files in different sub-directories and put the extracted data into pandas dataframes.
An example of the text data is given below:
"EXAMINATION: CHEST PA AND LAT INDICATION: History: F with shortness of breath TECHNIQUE: Chest PA and lateral COMPARISON: FINDINGS: The cardiac mediastinal and hilar contours are normal. Pulmonary vasculature is normal. Lungs are clear. No pleural effusion or pneumothorax is present. Multiple clips are again seen projecting over the left breast. Remote leftsided rib fractures are also re demonstrated. IMPRESSION: No acute cardiopulmonary abnormality."
However, when I run the code given below, it produces the following error. How do I resolve this?
ValueError Traceback (most recent call last)
<ipython-input-108-bbeeb452bdef> in <module>
48 df = pd.DataFrame(columns=keywords)
49 # Extract text
---> 50 result = extract_text_using_keywords(text, keywords)
51 # Append list of extracted text to the end of the pandas df
52 df.loc[len(df)] = result
<ipython-input-108-bbeeb452bdef> in extract_text_using_keywords(clean_text, keyword_list)
39 for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
40 prev_kw_index = clean_text.index(prev_kw)
---> 41 current_kw_index = clean_text.index(current_kw)
42 extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
43 if current_kw == keyword_list[-1]:
ValueError: substring not found
out = []
result = {}
for filename in glob.iglob('/content/sample_data/**/*.txt', recursive = True):
    out.append(filename)
print('File names: ',out)
for file in out:
    with open(file) as f:
        data = f.read()
    import re
    text = re.sub(r"[-_()\n\"#//@;<>{}=~|?,]*", "", data)
    text = re.sub(r'FINAL REPORT', '', text)
    text = re.sub(r'\s+', ' ', text)
    print(text)
keywords = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]
# Create function to extract text between each of the keywords
# Assumption
def extract_text_using_keywords(clean_text, keyword_list):
    extracted_texts = []
    for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
        prev_kw_index = clean_text.index(prev_kw)
        current_kw_index = clean_text.index(current_kw)
        extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
        if current_kw == keyword_list[-1]:
            extracted_texts.append(clean_text[current_kw_index + len(current_kw) + 2:len(clean_text)])
    return extracted_texts
# Create empty pandas df with keywords as column names
df = pd.DataFrame(columns=keywords)
# Extract text
result = extract_text_using_keywords(text, keywords)
# Append list of extracted text to the end of the pandas df
df.loc[len(df)] = result
#print(df)
with pd.option_context('display.max_colwidth', None): # For displaying full columns
    display(df)
The ValueError is raised by the index() call in the line current_kw_index = clean_text.index(current_kw) because clean_text does not contain the current_kw that the code is trying to find.
It is likely that in at least one of your files, the data, and therefore the text that you are passing to extract_text_using_keywords(text, keywords), is missing one or more of "INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", or "IMPRESSION". The easiest way to resolve this is to check which file is causing the issue and add the missing keyword.
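As a quick way to find the offending file, something like the following sketch could be run before the extraction. The helper names find_missing_keywords and report_missing are made up for this illustration, and it checks the raw file contents rather than the cleaned text, which should normally give the same answer because the keywords consist only of letters.

```python
from pathlib import Path

KEYWORDS = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]


def find_missing_keywords(text, keywords=KEYWORDS):
    """Return the keywords that do not occur anywhere in text."""
    return [kw for kw in keywords if kw not in text]


def report_missing(root, pattern="**/*.txt", keywords=KEYWORDS):
    """Map each file under root that lacks a keyword to its missing keywords."""
    problems = {}
    for path in sorted(Path(root).glob(pattern)):
        missing = find_missing_keywords(path.read_text(), keywords)
        if missing:
            problems[str(path)] = missing
    return problems


# Small in-memory check: this sample has no TECHNIQUE section.
sample = ("EXAMINATION: CHEST PA AND LAT INDICATION: cough COMPARISON: None "
          "FINDINGS: Lungs are clear IMPRESSION: Normal")
print(find_missing_keywords(sample))  # → ['TECHNIQUE']
```

Calling report_missing('/content/sample_data') would then return a dict with one entry per problem file, so you know exactly which reports to fix or skip.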
To make this debugging easier, you can update the extract_text_using_keywords() function to wrap each index() call in a try/except block, so that a ValueError produces a more useful message. You can also update other parts of the code to handle the follow-on problems caused by a keyword not being found. A complete solution is as follows:
import glob
import pandas as pd
import re

# Get & print all .txt file names with directory information
out = []
for filename in glob.iglob('content/sample_data/**/*.txt', recursive = True):
    out.append(filename)
print('File names: ', out)

# Define keywords
keywords = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]

# Create empty pandas df with keywords as column names
df = pd.DataFrame(columns=keywords)

# Create function to extract text between each of the keywords
def extract_text_using_keywords(clean_text, keyword_list):
    extracted_texts = []
    for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
        try:
            prev_kw_index = clean_text.index(prev_kw)
        except ValueError:
            print("Keyword {} was not found in the text.".format(prev_kw))
        try:
            current_kw_index = clean_text.index(current_kw)
        except ValueError:
            print("Keyword {} was not found in the text.".format(current_kw))
        try:
            extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
            if current_kw == keyword_list[-1]:
                extracted_texts.append(clean_text[current_kw_index + len(current_kw) + 2:len(clean_text)])
        except UnboundLocalError:
            print("An index was not assigned for a particular keyword.")
    return extracted_texts

# Iterate over all .txt files
for file in out:
    with open(file) as f:
        data = f.read()
    text = re.sub(r"[-_()\n\"#//@;<>{}=~|?,]*", "", data)
    text = re.sub(r'FINAL REPORT', '', text)
    text = re.sub(r'\s+', ' ', text)
    # print(text)

    # Extract text
    result = extract_text_using_keywords(text, keywords)

    # If all keywords and their results were found
    if len(result) == len(keywords):
        # Append list of extracted text to the end of the pandas df
        df.loc[len(df)] = result
    else:
        print(
            "\nFailed to extract text for one or more keywords."
            "\nPlease check that {} are all present in the following text:\n\n{}\n".format(keywords, text)
        )

# Display results
print(df)
# with pd.option_context('display.max_colwidth', None): # For displaying full columns
#     display(df)
This produces the following output when a keyword is missing from a file (e.g. "TECHNIQUE"):
Keyword TECHNIQUE was not found in the text.
An index was not assigned for a particular keyword.
Keyword TECHNIQUE was not found in the text.
Failed to extract text for one or more keywords.
Please check that ['INDICATION', 'TECHNIQUE', 'COMPARISON', 'FINDINGS', 'IMPRESSION'] are all present in the following text:
EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection : Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process
Empty DataFrame
Columns: [INDICATION, TECHNIQUE, COMPARISON, FINDINGS, IMPRESSION]
Index: []
And produces the desired output when all keywords are included:
File names: ['content/sample_data\\my_data.txt', 'content/sample_data\\my_data2.txt']
                                    INDICATION             TECHNIQUE  COMPARISON                                           FINDINGS                        IMPRESSION
0  F with new onset ascites eval for infection  Chest PA and lateral        None  There is no focal consolidation pleural effusi...  No acute cardiopulmonary process
1   Chronic pain noted in lower erector spinae               Palpate        None  Upper iliocostalis thoracis triggers pain alon...                               Nil
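One caveat with the function above, as far as I can tell: prev_kw_index and current_kw_index keep their values from the previous loop iteration, so when only the last keyword ("IMPRESSION") is missing, the function can still return a full-length list built from stale indices, and the length check will accept it. A possible alternative is sketched below. It looks up each keyword once with str.find() (which returns -1 instead of raising) and returns a dict with None for anything missing. The name extract_sections is made up for this example, and it strips the leading ": " rather than assuming a fixed offset of 2.

```python
KEYWORDS = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]


def extract_sections(clean_text, keyword_list=KEYWORDS):
    """Return {keyword: section text}, with None for keywords not in the text."""
    # Look up every keyword once; str.find returns -1 when it is absent.
    positions = {kw: clean_text.find(kw) for kw in keyword_list}
    # Keep only the keywords that were found, ordered by where they occur.
    found = sorted((pos, kw) for kw, pos in positions.items() if pos != -1)

    sections = {kw: None for kw in keyword_list}
    for i, (pos, kw) in enumerate(found):
        start = pos + len(kw)
        # A section runs up to the next found keyword, or to the end of the text.
        end = found[i + 1][0] if i + 1 < len(found) else len(clean_text)
        sections[kw] = clean_text[start:end].lstrip(": ").strip()
    return sections


# Sample with the TECHNIQUE section missing.
sample = ("INDICATION: cough COMPARISON: None "
          "FINDINGS: Lungs are clear IMPRESSION: No acute process")
row = extract_sections(sample)
print(row["INDICATION"])  # → cough
print(row["TECHNIQUE"])   # → None
```

Collecting one such dict per file into a list called rows and then calling pd.DataFrame(rows, columns=keywords) would give a frame where missing sections show up as empty values instead of the whole report being dropped. Whether that is preferable to rejecting incomplete reports depends on how you plan to use the data.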