Search code examples
pythoncsvcountrow

Count rows with column value greater than - Big csv python


I have a very large csv of around 35 million lines.

I have a simple python that counts me the total lines :

file_name = 'extract.csv'    
start = time.time()

with open(file_name) as f:
    line_count = sum(1 for line in f)

print(line_count)

The executable is very fast, and works well without memory blocks.

Now I need to count the rows which contain greater or less value than X in column Y.

How can I solve without crashes and without slowing down the execution too much?


Solution

  • Assuming your column is called col_y and your values are integers, you can use the python csv module.

    import csv
    
    threshold = 10
    file_name = "extract.csv"
    
    with open(file_name, newline="") as csvfile:
        reader = csv.DictReader(csvfile)
        line_count = sum(1 for row in reader if int(row.get("col_y", 0)) > threshold)
    
    print(line_count)
    

    Or for counting multiple things:

    import csv
    
    threshold = 10
    above = 0
    below = 0
    equal = 0
    file_name = "extract.csv"
    
    with open(file_name, newline="") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            val = int(row.get("col_y", 0))
            if val > threshold:
                above += 1
            elif val < threshold:
                below += 1
            elif val == threshold:
                equal += 1
            else:
                print("Something isn't right")
    
    print(f"{above:,} above")
    print(f"{below:,} below")
    print(f"{equal:,} equal")
    

    These snippets use a cvs.DictReader. You can take a look at the docs to get an idea of how you might need to edit these examples to suit your needs.