python scikit-learn text-classification

How do I get the sequence of vocabulary from a sparse matrix


I have a vocabulary list ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph'] and a list of sentences ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "Relation of user perceived response time to error measurement", "The generation of random binary unordered trees", "The intersection graph of paths in trees", "Graph minors IV Widths of trees and well quasi ordering", "Graph minors A survey"]. I use 'CountVectorizer' from 'sklearn' to transform the sentences into a sparse count matrix restricted to those eight words, and I get the output below.

[[0 0 0 0 0 1 0 1]
 [0 0 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 1]
 [0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0]]

Now I'm trying to find out which column of the matrix corresponds to which of those eight words. Any help will be appreciated.


Solution

  • CountVectorizer lowercases the text by default, so the capitalized vocabulary entries 'Human', 'Graph', and 'ESP' have no matches. It also seems that the vocabulary is sorted somehow in your result.

    You can set lowercase=False so that tokens keep their original case.

    lowercase : boolean, True by default. Convert all characters to lowercase before tokenizing. (sklearn doc)
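
To make the effect of that default concrete, here is a minimal sketch. The two-sentence `docs` list and the two-word `voc` are shortened stand-ins chosen for illustration, not the full data from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative stand-ins (assumption): two sentences and two vocabulary words.
docs = ["Graph minors A survey", "The intersection graph of paths in trees"]
voc = ["Graph", "minors"]

# Default (lowercase=True): tokens become 'graph', 'minors', ... so the
# capitalized vocabulary entry 'Graph' can never match.
default_counts = CountVectorizer(vocabulary=voc).fit_transform(docs).toarray()

# lowercase=False: tokens keep their case, so 'Graph' matches only 'Graph'.
cased_counts = CountVectorizer(vocabulary=voc, lowercase=False).fit_transform(docs).toarray()

print(default_counts.tolist())  # [[0, 1], [0, 0]]
print(cased_counts.tolist())    # [[1, 1], [0, 0]]
```

Note that with lowercase=False the lowercase 'graph' in the second sentence still does not count towards 'Graph', because matching becomes case-sensitive.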

    I did it like this.

    from sklearn.feature_extraction.text import CountVectorizer
    
    corpus = [
        "Human machine interface for lab abc computer applications",
        "A survey of user opinion of computer system response time",
        "The EPS user interface management system",
        "Relation of user perceived response time to error measurement",
        "The generation of random binary unordered trees",
        "The intersection graph of paths in trees",
        "Graph minors IV Widths of trees and well quasi ordering",
        "Graph minors A survey",
    ]
    
    voc = ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph']
    
    # Pass a fixed vocabulary and keep the original case of the tokens.
    vectorizer = CountVectorizer(vocabulary=voc, lowercase=False)
    
    X = vectorizer.fit_transform(corpus)
    
    # Column i of X counts voc[i]. get_feature_names() was removed in
    # scikit-learn 1.2; get_feature_names_out() is its replacement.
    print(vectorizer.get_feature_names_out().tolist())
    print(X.toarray())
    
    
    #     ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph']
    #     [[1 1 1 0 0 0 0 0]
    #      [0 0 0 0 0 0 1 0]
    #      [0 1 0 0 0 0 1 0]
    #      [0 0 0 0 0 0 0 0]
    #      [0 0 0 1 0 0 0 0]
    #      [0 0 0 0 0 0 0 0]
    #      [0 0 0 0 1 0 0 1]
    #      [0 0 0 0 1 0 0 1]]
    

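    In the matrix, each row shows the vocabulary matches for one sentence, and the columns follow the order of voc. In this case, 'Human', 'interface', and 'machine' matched in the 1st row (sentence). The 'ESP' column stays all zeros because the corpus contains 'EPS', not 'ESP'.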
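
If you want to read the column order from the fitted vectorizer instead of relying on the order of the list you passed in, the `vocabulary_` attribute maps each term to its column index. The following is a minimal sketch that uses only two of the sentences from the question to keep the output short. The names `columns` and `matched` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

voc = ['Human', 'interface', 'machine', 'binary', 'minors', 'ESP', 'system', 'Graph']
corpus = [
    "Human machine interface for lab abc computer applications",
    "Graph minors A survey",
]  # two sentences from the question, to keep the output short

vectorizer = CountVectorizer(vocabulary=voc, lowercase=False)
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index; sorting the terms by that
# index recovers the column order of the matrix.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(columns)

# For each row (sentence), list the vocabulary terms with a non-zero count.
matched = [[columns[j] for j in row.nonzero()[0]] for row in X.toarray()]
print(matched)
```

When the vocabulary is passed as a list, as here, `columns` comes out in the same order as `voc`, and `matched` lists the terms per sentence in column order.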