
Append tfidf to pandas dataframe


I have the following pandas structure:

col1 col2 col3 text
1    1    0    meaningful text
5    9    7    trees
7    8    2    text

I'd like to vectorise the text column with a TF-IDF vectoriser. This returns a sparse matrix, which I can convert to a dense array with mysparsematrix.toarray(). How can I add this information, with the terms as column labels, to my original df? The target would look like:

col1 col2 col3 meaningful text trees
1    1    0    1          1    0
5    9    7    0          0    1
7    8    2    0          1    0

UPDATE:

Solution makes the concatenation wrong even when renaming the original columns. Dropping columns with at least one NaN results in only 7 rows left, even though I use fillna(0) before starting to work with the data.


Solution

  • You can proceed as follows:

    Load data into a dataframe:

    import pandas as pd
    
    df = pd.read_table("/tmp/test.csv", sep=r"\s+")  # raw string avoids an invalid-escape warning
    print(df)
    

    Output:

       col1  col2  col3             text
    0     1     1     0  meaningful text
    1     5     9     7            trees
    2     7     8     2             text
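If the file is not at hand, the same frame can be built in memory. This is a minimal sketch that copies the values from the question. It also sidesteps one possible pitfall: a whitespace separator splits on every run of spaces, so a multi-word value such as "meaningful text" may not survive a whitespace-delimited file unchanged.

```python
import pandas as pd

# Build the question's sample frame directly, so no file on disk is needed.
df = pd.DataFrame(
    {
        "col1": [1, 5, 7],
        "col2": [1, 9, 8],
        "col3": [0, 7, 2],
        "text": ["meaningful text", "trees", "text"],
    }
)
print(df)
```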
    

    Vectorize the text column with sklearn.feature_extraction.text.TfidfVectorizer:

    from sklearn.feature_extraction.text import TfidfVectorizer
    
    v = TfidfVectorizer()
    x = v.fit_transform(df['text'])
    

    Convert the tokenized data into a dataframe:

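    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
    df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())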
    print(df1)
    

    Output:

       meaningful      text  trees
    0    0.795961  0.605349    0.0
    1    0.000000  0.000000    1.0
    2    0.000000  1.000000    0.0
    

    Concatenate the TF-IDF dataframe with the original one:

    res = pd.concat([df, df1], axis=1)
    print(res)
    

    Output:

       col1  col2  col3             text  meaningful      text  trees
    0     1     1     0  meaningful text    0.795961  0.605349    0.0
    1     5     9     7            trees    0.000000  0.000000    1.0
    2     7     8     2             text    0.000000  1.000000    0.0
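Note that the result above contains two columns named text: the original string column and the TF-IDF weight of the term "text". One way to avoid the clash, sketched here with a hypothetical prefix "tfidf_", is to rename the term columns before concatenating. The TF-IDF values are hard-coded from the output above so the snippet runs without scikit-learn.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 5, 7], "text": ["meaningful text", "trees", "text"]}
)

# Stand-in for the TF-IDF frame; values copied from the printed output
df1 = pd.DataFrame(
    {
        "meaningful": [0.795961, 0.0, 0.0],
        "text": [0.605349, 0.0, 1.0],
        "trees": [0.0, 1.0, 0.0],
    }
)

# Prefix every term column so it cannot collide with an original column
res = pd.concat([df, df1.add_prefix("tfidf_")], axis=1)
print(res.columns.tolist())
```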
    

    If you want to drop the column text, you need to do that before the concatenation:

    df.drop('text', axis=1, inplace=True)
    res = pd.concat([df, df1], axis=1)
    print(res)
    

    Output:

       col1  col2  col3  meaningful      text  trees
    0     1     1     0    0.795961  0.605349    0.0
    1     5     9     7    0.000000  0.000000    1.0
    2     7     8     2    0.000000  1.000000    0.0
    

    Here's the full code:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    
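    df = pd.read_table("/tmp/test.csv", sep=r"\s+")
    v = TfidfVectorizer()
    x = v.fit_transform(df['text'])  # sparse matrix, one row per document
    
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
    df.drop('text', axis=1, inplace=True)
    res = pd.concat([df, df1], axis=1)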