Search code examples
python-2.7csvreformatting

How to quickly replace values in a huge list


I have a list object with about 600000 lists inside.

I need to do some data processing and conversion to replace some values of the lists inside the big list object.

I am using the following function to replace values:

# Clean empty strings/rows
def reformat_csv_data(csv_content):
    for csv_content_row_idx, csv_content_row in reversed(list(enumerate(csv_content))):
        if csv_content_row:  # Check if the list is empty
            for csv_content_column_idx, csv_content_column in enumerate(csv_content_row):
                if str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower() == 'nan' or \
                            str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower() == 'n/a' or \
                            str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower() == 'na' or \
                            str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower() == 'null' or \
                            str(csv_content[csv_content_row_idx][csv_content_column_idx]) == '':
                    csv_content[csv_content_row_idx][csv_content_column_idx] = None
        else:
            del csv_content[csv_content_row_idx]  # Delete list if empty
    return csv_content

The problem that I have is that it gets too slow when processing large amount of data. I know this can be done more efficiently but I am very beginner and don't know how. Could you please help me? Thanks


Solution

  • As a minimum you could reduce the indexing, conversion and if statements to

    val = str(csv_content[csv_content_row_idx][csv_content_column_idx]).lower()
    if val in  {'nan', 'n/a', 'na', 'null', ''}:
            csv_content[csv_content_row_idx][csv_content_column_idx] = None
    

    But python is good at rebuilding lists efficiently and a nested list comprehension may be faster. Its a bit cryptic, but this code rebuilds the outer list filtering out empty rows with a second list comprehension that converts nan-like text to None. Notice that this function returns a new filtered list and you can delete the old one.

    _none_synonyms = {'nan', 'n/a', 'na', 'null', ''}
    
    def reformat_csv_data(csv_content):
        return [ [ cell 
                if not isinstance(cell, str) or cell.lower() not in _none_synonyms 
                else None 
                for cell in row ]
            for row in csv_content if row ]
    

    It may be better still to do this filtering as you read the data. Since this code only iterates the original list-of-lists it could take an iterator instead. For instance, the CSV reader object

    with open('some.csv') as in_fp:
        my_table = reformat_csv_data(csv.reader(in_fp))