I have a list object with about 600000 lists inside.
I need to do some data processing and conversion to replace some values of the lists inside the big list object.
I am using the following function to replace values:
# Clean empty strings/rows
def reformat_csv_data(csv_content):
for csv_content_row_idx, csv_content_row in reversed(list(enumerate(csv_content))):
if csv_content_row: # Check if the list is empty
for csv_content_column_idx, csv_content_column in enumerate(csv_content_row):
if str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower() == 'nan' or \
str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower() == 'n/a' or \
str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower() == 'na' or \
str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower() == 'null' or \
str(csv_content[csv_content_row_idx][csv_content_column_idx]) == '':
csv_content[csv_content_row_idx][csv_content_column_idx] = None
else:
del csv_content[csv_content_row_idx] # Delete list if empty
return csv_content
The problem that I have is that it gets too slow when processing large amount of data. I know this can be done more efficiently but I am very beginner and don't know how. Could you please help me? Thanks
As a minimum you could reduce the indexing, conversion and if statements to
val = str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower()
if val in {'nan', 'n/a', 'na', 'null', ''}:
csv_content[csv_content_row_idx][csv_content_column_idx] = None
But python is good at rebuilding lists efficiently and a nested list comprehension may be faster. Its a bit cryptic, but this code rebuilds the outer list filtering out empty rows with a second list comprehension that converts nan-like text to None. Notice that this function returns a new filtered list and you can delete the old one.
_none_synonyms = {'nan', 'n/a', 'na', 'null', ''}
def reformat_csv_data(csv_content):
return [ [ cell
if not isinstance(cell, str) or cell.lower() not in _none_synonyms
else None
for cell in row ]
for row in csv_content if row ]
It may be better still to do this filtering as you read the data. Since this code only iterates the original list-of-lists it could take an iterator instead. For instance, the CSV reader object
with open('some.csv') as in_fp:
my_table = reformat_csv_data(csv.reader(in_fp))