I have some data in pandas that I want to use for named entity recognition. A sample of the data is below:
text
['Angie', '’s', 'is', 'my', 'favorite', 'but', 'the', 'prices', 'at', 'little', 'Tonys', 'are', 'better', '.']
tags
['B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O']
I ran train_test_split from sklearn.model_selection on the data:
# split data
train_texts, test_texts, train_tags, test_tags = train_test_split(dataset["text"].tolist(),
dataset["tags"].tolist(),
test_size=0.20,
random_state=15)
However, when I print the lists I get some weird behavior: the square brackets [] and quotes '' around the text and tags are counted as part of the text and tags. For example, when I write
print(train_texts[0][0:9], train_tags[0][0:9], sep='\n')
I get this output:
['Angie',
['B-ORG',
My question is, why is it counting the brackets and quote characters as part of the string? How can I fix it?
I declared the data as a DataFrame and performed the same split into train_texts, test_texts, train_tags and test_tags. Please refer to the solution below; after that we will come back to the [] and '' issue in your scenario.
# Import all the important libraries
import pandas as pd
from sklearn.model_selection import train_test_split
# Store all String data into the 'data' variable
data = {
'text' : ['Angie', '’s', 'is', 'my', 'favorite', 'but', 'the', 'prices', 'at', 'little', 'Tonys', 'are', 'better', '.'],
'tags' : ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O']}
# Store above Initialized Data into DataFrame
dataset = pd.DataFrame(data)
Note: always print a few records of the dataset before moving ahead, because an issue in the dataset can throw off the result you expect.
# Print a few records of 'dataset'
dataset
        text   tags
0      Angie  B-ORG
1         ’s  I-ORG
2         is      O
3         my      O
4   favorite      O
5        but      O
6        the      O
7     prices      O
8         at      O
9     little  B-ORG
10     Tonys  I-ORG
11       are      O
12    better      O
13         .      O
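Looking at the printed rows does not always reveal how a cell is stored, since a real list and the text of a list print the same way. A minimal sketch of a type check follows; the two small frames and the cell_types helper are hypothetical names made up for illustration, not part of the original code.

```python
import pandas as pd

# Hypothetical frames: the same sentence stored as a real list and as text.
as_lists = pd.DataFrame({"text": [["Angie", "’s", "is"]]})
as_strings = pd.DataFrame({"text": ["['Angie', '’s', 'is']"]})

def cell_types(frame, column):
    """Return the set of Python type names found in one column."""
    return set(frame[column].map(lambda value: type(value).__name__))

print(cell_types(as_lists, "text"))    # {'list'}
print(cell_types(as_strings, "text"))  # {'str'}
```

If the check reports str for a column that is supposed to hold token lists, slicing a cell will return characters rather than tokens.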
Now we can move on to the split. I used the same call as in your question.
# split data
train_texts, test_texts, train_tags, test_tags = train_test_split(
dataset["text"].tolist(),
dataset["tags"].tolist(),
test_size=0.20,
random_state=15)
After splitting, we can print a slice of the first element of train_texts and train_tags:
print(train_texts[0][0:9], train_tags[0][0:9], sep='\n')
The output of the above cell is:
favorite
O
As you can see, the output contains no [] or ''.
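For comparison, the sample in the question looks like one sentence per row, with each cell holding a whole list of tokens, whereas the frame above holds one token per row. The sketch below runs the same split on list-valued cells. The five toy sentences are invented for illustration, and the assumption is that the cells are genuine Python lists.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: one sentence per row, each cell a real Python list.
dataset = pd.DataFrame({
    "text": [
        ["Angie", "’s", "is", "good"],
        ["little", "Tonys", "is", "cheap"],
        ["I", "like", "both"],
        ["Prices", "vary"],
        ["Try", "Angie", "’s"],
    ],
    "tags": [
        ["B-ORG", "I-ORG", "O", "O"],
        ["B-ORG", "I-ORG", "O", "O"],
        ["O", "O", "O"],
        ["O", "O"],
        ["O", "B-ORG", "I-ORG"],
    ],
})

train_texts, test_texts, train_tags, test_tags = train_test_split(
    dataset["text"].tolist(),
    dataset["tags"].tolist(),
    test_size=0.20,
    random_state=15,
)

# Each element is a list, so slicing yields whole tokens, not characters.
print(train_texts[0][0:9], train_tags[0][0:9], sep="\n")
```

With list cells, the slice [0:9] returns up to nine tokens of the first training sentence, and the matching tag list stays aligned with it.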
Your question:
Q.) Why is it counting the brackets and quote characters as part of the string? How can I fix it?
A.) I don't know the exact reason behind this issue. It can happen when the data has not been declared properly, or because of some other declaration issue. Printing the dataset before moving ahead is good practice, because it lets you see how the data is actually stored.
Solution: using a DataFrame worked perfectly for me, so you can use that.
I hope this solution helps. If you are still facing the issue, please upload the full code so that we can find a solution accordingly.