Tags: python, pandas, dataframe, tokenize

Only select columns [no rows] in Python notebooks


I am doing some analysis on unstructured data in notebooks; the data of interest lives in a single column. I want to pull this one column out and apply natural language processing to it, tokenizing the text and finding the most frequent keywords.

When I apply my word tokenizer to the column of user reviews, the text I want to analyze:

text = df.loc[:, "User Reviews"]

the row numbers get included along with the text of the "User Reviews" column.

Since some of the user reviews contain the same numbers as the row labels, this muddies the analysis, especially because I am tokenizing and computing term frequencies. In the example below, the row label 1 precedes the first review's tokens, 2 marks the next row, then 3, and so on for 10K user reviews.

['1', 'great', 'cat', 'waiting', 'on', 'me', 'home', 'to', 'feed', 'love', 'fancy', 'feast',
 '2', 'my', '3', 'dogs', 'love', 'this', '3', 'So', 'bad', 'my', '4', 'dogs', 'threw', 'up', ...]

Is there a way to exclude these row numbers? Do I need text.drop() to drop the rows? I have looked at a few sources:

https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/

https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c

But I am still struggling.

                                            User Reviews  
0  i think my puppy likes this. She seemed to keep...  
1  Its Great! My cat waiting on me to feed her. Fa...  
2  My 3 dogs love this so much. Wanted to get more...
3  All of my 4 dogs threw this up. Wouldnt ever re...  
4  I think she likes it. I gave it to her yesterda...  
5  Do not trust this brand, dog died 3 yrs ago aft...  
6  Tried and true dog food, never has issues with ...  

Solution

  • The row numbers are included with the text "User Reviews" column.

    A pd.Series object includes an array of values along with an associated index. The index, if not affected by specific operations, may coincide with "row numbers" - but this is not guaranteed to be the case.
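A small illustration of that distinction, using made-up data with a non-default index:

```python
import pandas as pd

# A Series carries both an index and the underlying values.
s = pd.Series(["great", "bad", "great"], index=[10, 20, 30])

print(s.index.tolist())  # labels, not necessarily 0..n-1
print(s.values)          # just the data, no labels
```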

It appears your tokenization logic is designed to operate on an array of values rather than a Series. You can extract the underlying NumPy array, which contains only the values, with pd.Series.values:

    text = df.loc[:, "User Reviews"].values
    

    The numpy array representation loses the index and only keeps the underlying data.
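For example, tokenizing over the extracted array keeps the row labels out of the token stream (the two sample reviews below are made up):

```python
import pandas as pd

df = pd.DataFrame({"User Reviews": [
    "Its Great! My cat waiting on me to feed her",
    "My 3 dogs love this so much",
]})

text = df.loc[:, "User Reviews"].values  # plain numpy array, no index

# A simple whitespace tokenizer over the values only.
tokens = [word.lower() for review in text for word in review.split()]

# The row labels 0 and 1 never appear; '3' shows up only because it is
# genuinely part of a review.
```

In recent pandas versions, df.loc[:, "User Reviews"].to_numpy() is the recommended spelling of the same extraction.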