Search code examples
pythondataframecomparison

Comparing dataframes for same values


I have generated 2 dataframes from csv files:

  • one containing a column of words and a column of their occurrence in an article (saved in the df as type: object
  • one containing the same as above but with different words and different occurrence for these words (some words are listed in both df)

Here is how the df's look like (both are built the same):

              word  occurance
0            labor          4
1      predictions          2
2              nfl          2
3             kids          2
4           africa          2
5         pandemic          2
6             kara          2
7             days          2
8          swisher          2
9            event          2
10             day          2
11        football          2
12          office          2
13              us          2
14        politics          2
15           media          2
16        abortion          2
17         preview          2
18           music          2
19           texas          2
20           south          2
21         workers          2
22            anti          1
23         sanders          1
24        movement          1
25          bernie          1
26          budget          1

How can I check if there are words in the second df that occurred as well in the first df and if there is a match add the occurrences from first and second df so to have a total score (occurrence) saved within the first df at the end of the program?

Thank you in advance


Solution

  • Considering dataframes first and second for example.

    first

    first = pd.DataFrame({"word": ["A", "B", "C", "D"], "occurrence": [1, 2, 3, 4]})
    
        word  occurrence
    0      A           1
    1      B           2
    2      C           3
    3      D           4
    

    second

    second = pd.DataFrame({"word": ["A", "B", "Y", "Z"], "occurrence": [6, 2, 4, 1]})
    
        word  occurrence
    0      A           6
    1      B           2
    2      Y           4
    3      Z           1
    

    Final dataframe

    Since only the words present in first needs to be added with the ones present in second, using left join and taking the sum of occurrences works.

    pd.merge(first, second, how="left", on=["word"]) \
        .set_index(["word"]) \
        .sum(axis=1).astype(int) \
        .reset_index(name="occurrence")
    
        word  occurrence
    0      A           7
    1      B           4
    2      C           3
    3      D           4