I have a dataframe like this:
id  name      cat  subcat
-------------------------------
1   aa bb cc  A    a-a
2   bb cc dd  B    b-a
3   aa bb ee  C    c-a
4   aa gg cc  D    d-a
I want to build a dict from this dataframe that counts the most frequent two-word n-grams (bigrams) in the name column, like this:
aa bb : 2
bb cc : 2
cc dd : 1
bb ee : 1
aa gg : 1
gg cc : 1
import pandas as pd
from itertools import chain, tee

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

# Split each name into words, pair up adjacent words, then count the pairs
pd.Series(list(chain(*df['name'].str.split(' ')
                      .apply(pairwise))))\
  .value_counts()
Output:
(aa, bb) 2
(bb, cc) 2
(cc, dd) 1
(bb, ee) 1
(aa, gg) 1
(gg, cc) 1
dtype: int64
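
Since the question asks for a dict keyed by the two-word string rather than a Series of tuples, here is a minimal sketch of one way to get there with only the standard library. The helper name `bigram_counts` is made up for illustration, and the plain list stands in for `df['name']`, which should work the same way since the function only iterates over strings.

```python
from collections import Counter

def bigram_counts(names):
    """Count adjacent two-word sequences across an iterable of strings."""
    counts = Counter()
    for text in names:
        words = text.split()
        # zip(words, words[1:]) pairs each word with its right-hand neighbour
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return dict(counts)

# Stand-in for the 'name' column from the question (hypothetical plain list)
names = ["aa bb cc", "bb cc dd", "aa bb ee", "aa gg cc"]
result = bigram_counts(names)
print(result)
```

If only the top few bigrams are needed, `Counter.most_common(n)` can be called on the counter before it is converted to a dict.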
IIUC, you can try something like this:
import pandas as pd
from itertools import combinations, chain

# Build all 2-word combinations per row, then count them
pd.Series(list(chain(*df['name'].str.split(' ')
                      .apply(lambda x: list(combinations(x, 2))))))\
  .value_counts()
Output:
(aa, bb) 2
(aa, cc) 2
(bb, cc) 2
(bb, dd) 1
(cc, dd) 1
(aa, ee) 1
(bb, ee) 1
(aa, gg) 1
(gg, cc) 1
dtype: int64
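
Note that this output contains pairs such as (aa, cc) that never appear next to each other in a name. A small sketch on a single row, using only itertools, shows where they come from; the variable names are just for illustration.

```python
from itertools import combinations

words = "aa bb cc".split()

# Adjacent pairs only: these are the two-word n-grams (bigrams)
adjacent = list(zip(words, words[1:]))

# Every pair of words in the row, including non-adjacent ones
all_pairs = list(combinations(words, 2))

print(adjacent)   # [('aa', 'bb'), ('bb', 'cc')]
print(all_pairs)  # [('aa', 'bb'), ('aa', 'cc'), ('bb', 'cc')]
```

So combinations is the better fit if any two words sharing a row should count as a pair, while the adjacent-pair version matches the n-gram counts listed in the question.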