I have a document containing more than 1,000 rows, each holding a list of (word, tag) tuples. I want to calculate the frequency of the second element of each tuple across all rows, and then delete the tuples that belong to the "NN" group.
Here is my data:
pos_tag |
---|
[(semoga, SC), (saja, RB), (di, IN), (sini, PR), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (saja, RB), (ini, PR), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] |
[(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (nya, PRP), (tidak, NEG), (selesai, VB)] |
[(sangat, RB), (baik, JJ)] |
I would like a frequency table like this:
tag | frequency |
---|---|
SC | 1 |
RB | 3 |
IN | 1 |
PR | 2 |
MD | 1 |
JJ | 3 |
NN | 8 |
etc. | ... |
After deleting words that belong to NN, the data will be:
pos_tag | pos_tag_clean |
---|---|
[(semoga, SC), (saja, RB), (di, IN), (sini, PR), (bisa, MD), (cepat, JJ), (cair, NN), (semoga, NN), (saja, RB), (ini, PR), (beneran, NN), (ada, VB), (nya, NN), (bantuan, NN), (buat, JJ), (butuh, VB), (banget, NN)] | [(semoga, SC), (saja, RB), (di, IN), (sini, PR), (bisa, MD), (cepat, JJ), (saja, RB), (ini, PR), (ada, VB), (buat, JJ), (butuh, VB)] |
[(kak, VB), (kenapa, WH), (perbaikan, NN), (sistem, NN), (nya, PRP), (tidak, NEG), (selesai, VB)] | [(kak, VB), (kenapa, WH), (nya, PRP), (tidak, NEG), (selesai, VB)] |
[(sangat, RB), (baik, JJ)] | [(sangat, RB), (baik, JJ)] |
Really need help, thanks!
You can explode the lists, slice the second item of each tuple, and count the tags with value_counts:
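out = (df['pos_tag']
       .explode()                      # one (word, tag) tuple per row
       .str[1]                         # keep only the tag
       .value_counts()                 # count each tag across all rows
       .rename_axis('tag')             # so the first output column is named 'tag'
       .reset_index(name='frequency')
      )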
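For the second part of the question (dropping the "NN" tuples), one possible approach is a list comprehension applied row by row. This is a minimal sketch, assuming the pos_tag column holds real Python lists of tuples rather than strings; the helper name drop_tag and the small sample frame are made up for illustration:

```python
import pandas as pd

# Small sample in the shape of the question's data
# (assumption: each cell is a real list of (word, tag) tuples, not a string).
df = pd.DataFrame({
    "pos_tag": [
        [("kak", "VB"), ("kenapa", "WH"), ("perbaikan", "NN"), ("sistem", "NN"),
         ("nya", "PRP"), ("tidak", "NEG"), ("selesai", "VB")],
        [("sangat", "RB"), ("baik", "JJ")],
    ]
})

def drop_tag(pairs, tag="NN"):
    """Hypothetical helper: keep the (word, tag) pairs whose tag differs from `tag`."""
    return [(word, t) for word, t in pairs if t != tag]

# Build the cleaned column; the original column is left untouched.
df["pos_tag_clean"] = df["pos_tag"].apply(drop_tag)
```

The comparison is exact, so tags such as NNP (if they occur in your data) are kept; only tuples tagged exactly NN are removed.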