python, pandas, pandas-profiling

Is it possible to get a detailed list of word frequencies from Pandas Profiling?


I'm currently working with a large batch of files that require me to check the frequencies of certain strings. My first idea was to import all files into a single dataset and use a for loop to check all files for the strings using the following code.

import pandas as pd

# Define an empty dataframe to collect all imported files in
df = pd.DataFrame()
new_list = []

# If a text file is imported successfully, add the resulting dataframe to df. If an exception occurs, add "None" instead.
# "`" was chosen as the delimiter to ensure that each file is saved to a single row.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
for i in file_list:
    try:
        df_1 = pd.read_csv(f"D:/Admin/3. OCR files/OCR_Translations/{i}", delimiter="`")
        df = pd.concat([df, df_1])
        new_list.append(f"D:/Admin/3. OCR files/OCR_Translations/{i}")
    except Exception:
        df = pd.concat([df, pd.DataFrame(["None"])])
        new_list.append("None")

df = df.T.reset_index()

# Search the dataset for the required keyword
count = 0

for i in df["index"]:
    if "Keyword1" in i:
        count += 1

This approach failed because there is no guarantee that the strings are spelled correctly in these files: they were generated by an OCR program, and they are in Thai.

Pandas Profiling generates exactly what I need for the job at hand, except that it doesn't give a full list, as seen in this screenshot (https://i.sstatic.net/BPaYv.jpg). Is there a way to get the full list of word frequencies from Pandas Profiling? I've checked the pandas_profiling documentation (https://pandas-profiling.github.io/pandas-profiling/docs/master/index.html) and so far I haven't found anything that covers my use case.


Solution

  • You might not need Pandas to count word occurrences in files.

    import collections
    
    word_counter = collections.Counter()
    
    for i in file_list:
        with open(f"D:/Admin/3. OCR files/OCR_Translations/{i}") as f:
            for line in f:
                words = line.strip().split()  # Split line by whitespaces.
                word_counter.update(words)  # Update counter with occurrences.
    
    
    print(word_counter)
    

    You might also be interested in the .most_common() method on Counters.

    If you really need to, you can also turn the Counter into a dataframe; it's just a dict with special effects.
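
As a minimal sketch of those last two points, the snippet below uses a small made-up Counter in place of the word_counter built from the files (the sample words and the column names "word" and "count" are assumptions for illustration only). It shows .most_common() and one possible way to get the full frequency list into a dataframe.

```python
import collections

import pandas as pd

# Hypothetical sample data standing in for the word_counter built from the files.
word_counter = collections.Counter(["alpha", "beta", "alpha", "gamma", "alpha", "beta"])

# most_common(n) returns the n most frequent (word, count) pairs;
# called without an argument it returns every entry, most frequent first.
top_two = word_counter.most_common(2)

# Build a two-column dataframe holding the full frequency list.
# The column names are arbitrary choices made for this sketch.
freq_df = pd.DataFrame(word_counter.most_common(), columns=["word", "count"])

print(top_two)
print(freq_df)
```

From there the dataframe can be sorted, filtered, or written out with to_csv like any other, which gives the complete list rather than a truncated summary.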