Tags: python, scikit-learn, n-gram

How to build a dict of n-gram counts from a DataFrame column in Python


I have a dataframe like this:

id  name        cat     subcat
-------------------------------
1   aa bb cc    A       a-a
2   bb cc dd    B       b-a
3   aa bb ee    C       c-a
4   aa gg cc    D       d-a

I want to build a dict from this dataframe that counts the most frequent two-word n-grams (bigrams) in the name column, like this:

aa bb : 2
bb cc : 2
cc dd : 1
bb ee : 1
aa gg : 1
gg cc : 1

Solution

  • Update: using the pairwise recipe from the itertools documentation (Python 3.10+ also ships it as itertools.pairwise)

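    import pandas as pd
    from itertools import chain, tee
    
    def pairwise(iterable):
        "s -> (s0,s1), (s1,s2), (s2, s3), ..."
        a, b = tee(iterable)
        next(b, None)  # advance the second iterator by one position
        return zip(a, b)
    
    # Split each name into words, pair adjacent words, then count the pairs
    pd.Series(chain(*df['name'].str.split(' ')
                               .apply(lambda x: pairwise(x))))\
      .value_counts()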
    

    Output:

    (aa, bb)    2
    (bb, cc)    2
    (cc, dd)    1
    (bb, ee)    1
    (aa, gg)    1
    (gg, cc)    1
    dtype: int64
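
The question asks for a dict keyed by the joined words (`aa bb : 2`) rather than a Series of tuples. One way to get that shape, sketched here with only the standard library, is to count adjacent pairs with `collections.Counter`. The function name `bigram_counts` and the plain list `names` are illustrative stand-ins; with the dataframe above you would presumably pass `df['name']` instead.

```python
from collections import Counter


def bigram_counts(names):
    """Count adjacent word pairs across an iterable of strings."""
    counts = Counter()
    for name in names:
        words = name.split()
        # zip(words, words[1:]) pairs each word with its successor
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return dict(counts)


# Sample values copied from the question's `name` column
names = ["aa bb cc", "bb cc dd", "aa bb ee", "aa gg cc"]
result = bigram_counts(names)
print(result)
# {'aa bb': 2, 'bb cc': 2, 'cc dd': 1, 'bb ee': 1, 'aa gg': 1, 'gg cc': 1}
```

If the pairs need to be sorted by frequency, `Counter.most_common()` can be called on the counter before converting it to a dict.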
    

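    Original answer: if I understand correctly, you can try something like this. Note that combinations(x, 2) pairs every two words in a name, not only adjacent ones, so it also counts pairs such as (aa, cc):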

    import pandas as pd
    from itertools import combinations, chain
    
    pd.Series(list(chain(*df['name'].str.split(' ')
                                    .apply(lambda x: list(combinations(x, 2))))))\
      .value_counts()
    

    Output:

    (aa, bb)    2
    (aa, cc)    2
    (bb, cc)    2
    (bb, dd)    1
    (cc, dd)    1
    (aa, ee)    1
    (bb, ee)    1
    (aa, gg)    1
    (gg, cc)    1
    dtype: int64
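
Since the title asks about n-grams in general, the adjacent-word idea extends to any window size. The sketch below is one possible way to do that with the standard library; `ngram_counts` and its parameter `n` are hypothetical names, not part of pandas or itertools, and the `names` list again stands in for `df['name']`.

```python
from collections import Counter


def ngram_counts(names, n=2):
    """Count runs of n adjacent words across an iterable of strings."""
    counts = Counter()
    for name in names:
        words = name.split()
        # Slide a window of length n over the words;
        # names with fewer than n words contribute nothing
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return dict(counts)


# Sample values copied from the question's `name` column
names = ["aa bb cc", "bb cc dd", "aa bb ee", "aa gg cc"]
bigrams = ngram_counts(names, n=2)
trigrams = ngram_counts(names, n=3)
```

With `n=2` this gives the same counts as the pairwise approach; with `n=3` each three-word name contributes exactly one trigram.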