Is there a Python function similar to sklearn's PolynomialFeatures but for strings?


The sklearn.preprocessing.PolynomialFeatures transformer generates the polynomial and interaction features of a vector. For example:

>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = [[1, 2, 3]]
>>> G = PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)
>>> G.fit_transform(X)
array([[1., 2., 3., 2., 3., 6., 6.]])

Is there an equivalent function that works on strings, so that for the input X = [['a','b','c']] it would output array([['a','b','c','ab','ac','bc','abc']]), and that accepts any input vector? If no such function exists, do you have an idea of how to create it?


Solution

  • It looks like you're looking for the power set of the input list of strings (without the empty set), limited to subsets of at most n elements. This is fairly easy to implement using itertools.combinations. If you want the fit/transform structure, which lets you include the transformer in a pipeline, you can define your own transformer inheriting from TransformerMixin. Otherwise, just use the code contained in the transform method:

    from sklearn.base import TransformerMixin
    from itertools import combinations, chain
    
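    class NSuperset(TransformerMixin):
        def __init__(self, n):
            # Largest combination size to generate
            self.n = n

        def fit(self, X, y=None):
            # Nothing to learn; y is accepted (and ignored) for scikit-learn API compatibility
            return self

        def transform(self, X):
            # For each size i in 1..n, join every i-combination of each row
            superset = [[''.join(c) for x in X for c in combinations(x, r=i)]
                        for i in range(1, self.n + 1)]
            # Flatten the per-size lists into a single list
            return list(chain.from_iterable(superset))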
    

    ss = NSuperset(n=3)
    
    X = [['a','b','c']]
    ss.fit_transform(X)
    # ['a', 'b', 'c', 'ab', 'ac', 'bc', 'abc']
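
Note that the transform method above flattens everything into one list ordered by combination size, so with several input rows the results of different rows end up mixed together. If you want one output row per input row, as in the array([[...]]) shape from the question, a plain function may be enough. The following is only a sketch; the name string_interactions and its max_size parameter are made up for this example and are not part of scikit-learn:

```python
from itertools import chain, combinations


def string_interactions(rows, max_size):
    """Return, for each row, the joined combinations of 1..max_size items."""
    result = []
    for row in rows:
        # Never ask for combinations longer than the row itself
        sizes = range(1, min(max_size, len(row)) + 1)
        # All combinations of size 1, then size 2, and so on, in input order
        combos = chain.from_iterable(combinations(row, r) for r in sizes)
        result.append(["".join(c) for c in combos])
    return result


print(string_interactions([["a", "b", "c"], ["x", "y", "z"]], max_size=2))
# [['a', 'b', 'c', 'ab', 'ac', 'bc'], ['x', 'y', 'z', 'xy', 'xz', 'yz']]
```

Because each row is handled independently, rows of different lengths work as well, and the body of this function could replace the list comprehension in transform if you still want the fit/transform interface.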