I am analyzing unstructured data in a notebook. The data is a single column of text. I want to pull this column out and run natural language processing on it: tokenize it and see which keywords are most frequent.
I select the user reviews column, which is the text I want to analyze, and apply my word tokenizer to it:
text = df.loc[:, "User Reviews"]
The row numbers are included with the text "User Reviews" column.
Because some of the user reviews contain numbers that also occur as row numbers, this makes the analysis confusing, especially since I am tokenizing and looking at term frequency. In the example below, the row number 1 comes first, then 2 for the next row, then 3, and so on for 10K user reviews.
['1', 'great', 'cat', 'waiting', 'on', 'me', 'home', 'to', 'feed', 'love', 'fancy', 'feast',
'2', 'my', '3', 'dogs', 'love', 'this', '3', 'So', 'bad', 'my', '4', 'dogs', 'threw', 'up', ...]
Is there a way to remove the row numbers? Do I need to use text.drop to drop them? I have looked at a few sources:
https://www.shanelynn.ie/using-pandas-dataframe-creating-editing-viewing-data-in-python/
https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c
But I am still struggling.
User Reviews
0 i think my puppy likes this. She seemed to keep...
1 Its Great! My cat waiting on me to feed her. Fa...
2 My 3 dogs love this so much. Wanted to get more...
3 All of my 4 dogs threw this up. Wouldnt ever re...
4 I think she likes it. I gave it to her yesterda...
5 Do not trust this brand, dog died 3 yrs ago aft...
6 Tried and true dog food, never has issues with ...
The row numbers are included with the text "User Reviews" column.
A pd.Series object includes an array of values along with an associated index. The index, if not affected by specific operations, may coincide with "row numbers" - but this is not guaranteed to be the case.
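To illustrate the point about the index, here is a minimal sketch. The sample reviews are made up for illustration, and the guess that the index reaches your tokens through a string conversion of the whole series is an assumption, since the tokenizer call is not shown in the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the real reviews
df = pd.DataFrame(
    {"User Reviews": ["great cat food", "my 3 dogs love this", "so bad"]}
)
text = df.loc[:, "User Reviews"]

# With the default index, the labels happen to match row positions
print(list(text.index))

# After filtering, the original labels are kept, so they no longer
# match the positions 0, 1, ... of the remaining rows
filtered = text[text.str.contains("dogs|bad")]
print(list(filtered.index))

# Assumption: if the whole series is turned into one string, the index
# labels are rendered next to the values and end up in the tokens
naive_tokens = str(text).split()
print(naive_tokens[:4])
```

If your output looks like the last print, the tokenizer is most likely seeing the printed form of the series rather than the individual review strings.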
It appears your tokenization logic is designed to operate on an array of values, rather than a series. You can extract the underlying numpy array, which only includes values, by using pd.Series.values:
text = df.loc[:, "User Reviews"].values
The numpy array representation loses the index and only keeps the underlying data.
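As a rough end-to-end sketch, the values can then be tokenized one review at a time. The sample data and the whitespace-based `tokenize` helper are hypothetical stand-ins for your own reviews and tokenizer.

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data standing in for the real reviews
df = pd.DataFrame(
    {
        "User Reviews": [
            "great cat food",
            "my 3 dogs love this",
            "all of my 4 dogs threw this up",
        ]
    }
)

# Only the values, without the index
values = df.loc[:, "User Reviews"].values


def tokenize(review):
    # Stand-in tokenizer (assumption): lowercase and split on whitespace
    return review.lower().split()


# Tokenize each review separately, then flatten into one token list
tokens = [token for review in values for token in tokenize(review)]

# Term frequency over all reviews
counts = Counter(tokens)
print(counts.most_common(3))
```

Numbers that are part of a review, such as the "3" in "my 3 dogs", are still counted, but the row labels are not. In recent pandas versions, `Series.to_numpy()` is the documented alternative to `.values` and should behave the same way here.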