I have a CSV file, and I need to count how often each value occurs and write the counts to a file.
The CSV file has millions of rows. The following is a screenshot of my CSV file.
The following is my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option("display.max_rows",1000000000)
pd.set_option("display.max_columns",1000000000)
df = pd.read_csv("Ax_Seg_output_no_comma.csv")
cnted = df.groupby(['Content'],as_index=False)['Content'].agg({'cnt':'count'})
cnted.to_csv('01.csv',index=0)
I used pandas to do the counting, but I ran into a problem.
It does not count everything properly.
I need to get a result such as:
A,5
B,2
C,1
......
However, I got a wrong result such as:
A,5
B
C,1
Some elements have no count at all.
Part of the lines are not counted.
If I count only 25000 rows, the output is correct.
The following is the wrong result:
The correct result should look like the following:
I suspect that the file exceeds some pandas limit. I cannot see any other error.
Can anyone help me? Thanks
(This is the original CSV file: https://drive.google.com/file/d/18_Y3Wu8OFFpAzgRXRsNh8C_nyh8wPPEu/view?usp=sharing)
Your code is fine, but the results are confusing because some of the items (the values of 'Content') are multi-line. That's why you're seeing things such as:
a
b:2
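Some items contain newline characters because your CSV has quote characters in it: the parser treats everything between an opening quote and the next closing quote as a single field, even across line breaks. To have quotes treated as ordinary characters, read the CSV as follows:
import csv
import pandas as pd
df = pd.read_csv("Ax_Seg_output_no_comma.csv", quoting=csv.QUOTE_NONE)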
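To see the effect without the full file, here is a minimal sketch that uses a small in-memory CSV. The sample data is made up for illustration (it is not taken from your file), and it assumes the stray quotes in your data behave like the ones below:

```python
import csv
import io

import pandas as pd

# Hypothetical sample: a quote opens before B and only closes after D.
sample = 'Content\nA\n"B\nC\nD"\nA\n'

# Default quoting: everything between the two quotes becomes one
# multi-line value, so B, C and D collapse into a single row.
default_df = pd.read_csv(io.StringIO(sample))
default_counts = default_df.groupby('Content').size()

# QUOTE_NONE: quote characters are ordinary text, one row per line.
raw_df = pd.read_csv(io.StringIO(sample), quoting=csv.QUOTE_NONE)
raw_counts = raw_df.groupby('Content').size()

print(len(default_df), len(raw_df))  # 3 5
print(raw_counts)
```

Note that with `csv.QUOTE_NONE` the quote characters stay in the values, so `"B` and `B` would be counted as different items. If that is not what you want, one option is to strip them before grouping, for example with `df['Content'].str.strip('"')`.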