
Generate feature vectors for a dataset based on a feature list, in Python


I need to generate a vector for each sample in a dataset, based on the total number of features in the dataset.

# Assume the dataset has 6 features
features = ['a', 'b', 'c', 'd', 'e', 'f']

# Examples:

s1 = ['a', 'b', 'c']
# For s1, I want to generate a vector representing its features
r1 = [1, 1, 1, 0, 0, 0]

s2 = ['a', 'c', 'f']
# For s2, the vector should be
r2 = [1, 0, 1, 0, 0, 1]

Are there any Python libraries for this task? If not, how should I accomplish it?


Solution

  • This is pretty straightforward and not really something you need a library for.

    Pure Python solution

    features = ['a', 'b', 'c', 'd', 'e', 'f']
    # Map each feature to its index: {'a': 0, 'b': 1, ..., 'f': 5}
    features_lookup = dict(map(reversed, enumerate(features)))
    
    
    s1 = ['a', 'b', 'c']
    s2 = ['a', 'c', 'f']
    
    
    def create_feature_vector(sample, lookup):
        # Start with an all-zeros vector, one slot per feature
        vec = [0] * len(lookup)
        # Set the slot for each feature present in the sample
        for value in sample:
            vec[lookup[value]] = 1
        return vec
    

    Output:

    >>> create_feature_vector(s1, features_lookup)
    [1, 1, 1, 0, 0, 0]
    
    >>> create_feature_vector(s2, features_lookup)
    [1, 0, 1, 0, 0, 1]
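
    Since the question asks about libraries: if you already have scikit-learn installed, its MultiLabelBinarizer does this exact encoding. A minimal sketch, assuming scikit-learn is available (the pure Python version above needs no dependency at all):

    from sklearn.preprocessing import MultiLabelBinarizer
    
    
    features = ['a', 'b', 'c', 'd', 'e', 'f']
    # `classes` pins the column order to the feature list
    mlb = MultiLabelBinarizer(classes=features)
    mlb.fit_transform([['a', 'b', 'c'], ['a', 'c', 'f']])
    # array([[1, 1, 1, 0, 0, 0],
    #        [1, 0, 1, 0, 0, 1]])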
    

    Numpy alternative for a single feature vector

    If you happen to already be using numpy, this'll be much more efficient if your feature set is large:

    import numpy as np
    
    
    features = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
    sample_size = 3
    
    
    def feature_sample_and_vector(sample_size, features):
        n = features.size
        # Draw `sample_size` distinct feature indices at random
        sample_indices = np.random.choice(range(n), sample_size, replace=False)
        sample = features[sample_indices]
        # Binary vector with the sampled positions set to 1
        vector = np.zeros(n, dtype="uint8")
        vector[sample_indices] = 1
        return sample, vector
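
    Example run (illustrative only; the sample is drawn at random, so your output will differ):

    >>> sample, vector = feature_sample_and_vector(sample_size, features)
    >>> sample
    array(['b', 'd', 'f'], dtype='<U1')
    >>> vector
    array([0, 1, 0, 1, 0, 1], dtype=uint8)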
    

    Numpy alternative for a large number of samples and their feature vectors

    Using numpy allows us to scale very well for large feature sets and/or large sample sets. Note that this approach can produce duplicate samples:

    import random
    import numpy as np
    
    
    # Assumes features is already a numpy array
    def generate_samples(features, num_samples, sample_size):
        n = features.size
        vectors = np.zeros((num_samples, n), dtype="uint8")
        # One list of sample_size distinct indices per sample
        idxs = [random.sample(range(n), k=sample_size) for _ in range(num_samples)]
        cols = np.sort(np.array(idxs), axis=1)  # You can remove the sort if having the features in order isn't important
        # Matching row indices, so (rows, cols) addresses every sampled cell at once
        rows = np.repeat(np.arange(num_samples).reshape(-1, 1), sample_size, axis=1)
        vectors[rows, cols] = 1
        samples = features[cols]
        return samples, vectors
    

    Demo:

    >>> generate_samples(features, 10, 3)
    (array([['d', 'e', 'f'],
            ['a', 'b', 'c'],
            ['c', 'd', 'e'],
            ['c', 'd', 'f'],
            ['a', 'b', 'f'],
            ['a', 'e', 'f'],
            ['c', 'd', 'f'],
            ['b', 'e', 'f'],
            ['b', 'd', 'f'],
            ['a', 'c', 'e']], dtype='<U1'),
     array([[0, 0, 0, 1, 1, 1],
            [1, 1, 1, 0, 0, 0],
            [0, 0, 1, 1, 1, 0],
            [0, 0, 1, 1, 0, 1],
            [1, 1, 0, 0, 0, 1],
            [1, 0, 0, 0, 1, 1],
            [0, 0, 1, 1, 0, 1],
            [0, 1, 0, 0, 1, 1],
            [0, 1, 0, 1, 0, 1],
            [1, 0, 1, 0, 1, 0]], dtype=uint8))
    

    A very simple timing benchmark for 100,000 samples of size 12 from a feature set of 26 features:

    In [2]: features = np.array(list("abcdefghijklmnopqrstuvwxyz"))
    
    In [3]: num_samples = 100000
    
    In [4]: sample_size = 12
    
    In [5]: %timeit generate_samples(features, num_samples, sample_size)
    645 ms ± 9.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    The only real bottleneck is the list comprehension needed to produce the indices. Unfortunately, np.random.choice() has no 2-dimensional variant for generating samples without replacement, so you still have to resort to a relatively slow method for generating the random sample indices.
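
    If that comprehension ever becomes the pain point, one common vectorized workaround (a sketch, not something the benchmark above uses) is to argsort a matrix of random numbers: each row's argsort is a random permutation, so its first sample_size columns form a per-row sample without replacement. It does O(n log n) work per row, so it only pays off when num_samples is large relative to n:

    import numpy as np
    
    
    # Hypothetical helper, not part of the answer above
    def random_sample_indices(num_samples, n, sample_size):
        # Argsorting i.i.d. uniform noise yields one random permutation per row;
        # the first `sample_size` columns are a sample without replacement
        return np.argsort(np.random.rand(num_samples, n), axis=1)[:, :sample_size]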