Tags: python, python-3.x, pandas, csv, export-to-csv

How do I fix incorrect counts when counting the values in a CSV file with pandas?


I have a CSV file whose values I need to count, and then write the results out.

The CSV file has millions of rows. Below are screenshots of it.

[Three screenshots of the CSV file]

The following is my code:

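import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Raise the display limits so that printed frames are not truncated
pd.set_option("display.max_rows", 1000000000)
pd.set_option("display.max_columns", 1000000000)

df = pd.read_csv("Ax_Seg_output_no_comma.csv")

# Count how many times each value of 'Content' occurs.
# (Renaming with a dict inside agg() is rejected by recent pandas versions.)
cnted = df.groupby("Content").size().reset_index(name="cnt")

# Write the counts without the index column
cnted.to_csv("01.csv", index=False)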

I used pandas to do the counting, but I ran into some problems.

  1. It does not count properly.

    I need results such as A,5 B,2 C,1, ...

    However, I get wrong results such as A,5 B C,1

    Some elements are not counted.

  2. Some of the lines are not counted.

  3. If I count only 25000 rows, the output is correct.

This is the wrong result:

[Screenshot of the incorrect output]

The correct result should look like this:

[Screenshot of the expected output]

I suspect the file exceeds some pandas limit. I can't find any other error.

Can anyone help me? Thanks

(Here is the original CSV file: https://drive.google.com/file/d/18_Y3Wu8OFFpAzgRXRsNh8C_nyh8wPPEu/view?usp=sharing)


Solution

  • Your code is fine, but the results are confusing because some of the items (values of 'Content') span multiple lines. That's why you're seeing things such as:

    a
    
    b:2
    

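    Some items contain newline characters because your CSV contains quote characters: by default, the parser treats everything between a pair of quotes, line breaks included, as a single field. To have the quotes treated as ordinary characters instead, read the CSV as follows: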

    import csv
    import pandas as pd

    df = pd.read_csv("Ax_Seg_output_no_comma.csv", quoting=csv.QUOTE_NONE)
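
To see the effect on a small scale, here is a minimal sketch that uses made-up in-memory data rather than the original file. The string `raw` and the names `default_df`, `literal_df` and `counts` are assumptions for illustration only. It also strips the leftover quote characters after reading, since with `QUOTE_NONE` they stay in the values and would otherwise make `"B` and `B` count as different items. Whether stripping is right for your data depends on what the quotes mean there.

```python
import csv
import io

import pandas as pd

# Hypothetical sample data (not taken from the original file): the quote before
# the first B opens a quoted field that only closes after the C two lines later.
raw = 'Content\nA\nA\n"B\nB\nC"\nC\n'

# Default quoting: the three physical lines between the quotes become one value.
default_df = pd.read_csv(io.StringIO(raw))

# QUOTE_NONE: quote characters are ordinary text, so every line is its own row.
literal_df = pd.read_csv(io.StringIO(raw), quoting=csv.QUOTE_NONE)

# The quote characters are still part of the values, so strip them before counting.
literal_df["Content"] = literal_df["Content"].str.strip('"')

# Count occurrences of each value
counts = literal_df.groupby("Content").size().reset_index(name="cnt")

print(len(default_df), len(literal_df))
print(counts)
```

With this sample, the default read merges the lines between the two quotes into a single multi-line value, which is the kind of entry that shows up as a label without a count next to it in the output file, while the `QUOTE_NONE` read keeps one row per line.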