Tags: python, group-by, classification, unique, counter

Classifying and counting unique elements considering various column conditions


Hi, I am using Python to classify some data:

Articles                                Filename
A New Marine Ascomycete from Brunei.    Invasive Species.csv
A new genus and four new species        Forestry.csv
A new genus and four new species        Invasive Species.csv

I want to know how many unique "Articles" each "Filename" has.

So my desired output is this:

Filename                             Count_Unique
Invasive Species.csv                 1
Forestry.csv                         0

In addition, I would also like to get this output:

Filename1        Filename2               Count_Common_Articles
Forestry.csv     Invasive Species.csv    1

I concatenated the datasets, but I only ended up counting the elements that exist in each "Filename".

Is anyone willing to help? I have tried unique(), drop_duplicates(), etc., but I cannot get my desired output.

Anyway, here are the last few lines of my code:

import pandas as pd

# data is the list of DataFrames read from the CSV files
concatenated = pd.concat(data, ignore_index=True)
concatenated = concatenated.groupby(['Articles', 'Filename']).count().reset_index()
res = {col: concatenated[col].value_counts() for col in concatenated.columns}
res['Filename']

Solution

  • No magic. Just some regular operations.

    (1) count "unique" articles in files

    Edit: added (quick-and-dirty) code to include filenames with zero-counts
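
    For reproducibility, the snippets below assume a DataFrame df built from the sample data in the question; here is a minimal sketch of that setup:

    import pandas as pd

    # sample data from the question
    df = pd.DataFrame({
        "Articles": [
            "A New Marine Ascomycete from Brunei.",
            "A new genus and four new species",
            "A new genus and four new species",
        ],
        "Filename": [
            "Invasive Species.csv",
            "Forestry.csv",
            "Invasive Species.csv",
        ],
    })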

    # prevent repetitive counting
    df = df.drop_duplicates()
    
    # articles to be removed (the ones appeared more than once)
    dup_articles = df["Articles"].value_counts()
    dup_articles = dup_articles[dup_articles > 1].index
    # remove duplicate articles and count
    mask_dup_articles = df["Articles"].isin(dup_articles)
    df_unique = df[~mask_dup_articles]
    df_unique["Filename"].value_counts()
    
    # N.B. all filenames not shown here of course have a count of 0;
    #      the zero-count handling is added below.
    
    Out[68]: 
    Invasive Species.csv    1
    Name: Filename, dtype: int64
    
    # unique article count with zeros
    df_unique_nonzero_count = df_unique["Filename"].value_counts().to_frame().reset_index()
    df_unique_nonzero_count.columns = ["Filename", "count"]
    
    df_all_filenames = pd.DataFrame(
        data={"Filename": df["Filename"].unique()}
    )
    # join: all filenames with counted filenames
    df_unique_count = df_all_filenames.merge(df_unique_nonzero_count, on="Filename", how="outer")
    # postprocess
    df_unique_count.fillna(0, inplace=True)
    df_unique_count["count"] = df_unique_count["count"].astype(int)
    # print
    df_unique_count
    
    Out[119]: 
                   Filename  count
    0  Invasive Species.csv      1
    1          Forestry.csv      0
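
    As a cross-check, the zero-inclusive count from step (1) can be sketched more compactly with a reindex over all filenames (same assumed df; once, unique_articles, and counts are names introduced here for illustration):

    # articles that occur exactly once across all files
    once = df.drop_duplicates()["Articles"].value_counts()
    unique_articles = once[once == 1].index
    # per-file count, reindexed so zero-count filenames appear too
    counts = (
        df[df["Articles"].isin(unique_articles)]["Filename"]
        .value_counts()
        .reindex(df["Filename"].unique(), fill_value=0)
    )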
    

    (2) count common articles between files

    # pick out records containing duplicate articles
    df_dup = df[mask_dup_articles]
    # merge on articles and then discard self- and duplicate pairs
    df_merge = df_dup.merge(df_dup, on=["Articles"], suffixes=("1", "2"))
    df_merge = df_merge[df_merge["Filename1"] > df_merge["Filename2"]]  # keep each pair once; also drops self-pairs
    # count
    df_ans2 = df_merge.groupby(["Filename1", "Filename2"]).count()
    df_ans2.reset_index(inplace=True)  # optional
    df_ans2
    
    Out[70]: 
                  Filename1     Filename2  Articles
    0  Invasive Species.csv  Forestry.csv         1
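
    For intuition, the same pairwise overlap can also be computed with plain set intersections; this is an illustrative sketch (articles_by_file is a name introduced here), not the answer's pandas approach:

    from itertools import combinations

    # map each filename to its set of articles
    articles_by_file = df.groupby("Filename")["Articles"].apply(set)
    # intersect every pair of files and report the overlap size
    for f1, f2 in combinations(articles_by_file.index, 2):
        common = articles_by_file[f1] & articles_by_file[f2]
        print(f1, f2, len(common))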