python-3.x, pandas, string, sklearn-pandas

Weird Behavior When Slicing a List in Python


I have some data in pandas that I want to use for named entity recognition. A sample of the data is below.

text
['Angie', '’s', 'is', 'my', 'favorite', 'but', 'the', 'prices', 'at', 'little', 'Tonys', 'are', 'better', '.']

tags
['B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O']

I ran train_test_split (imported from sklearn.model_selection) on the data:

# split data
train_texts, test_texts, train_tags, test_tags = train_test_split(dataset["text"].tolist(),
                                                                dataset["tags"].tolist(),
                                                                test_size=0.20,
                                                                random_state=15)

However, when I print a slice of the list, I get weird behavior: the square brackets [] and quotes '' around the text and tags are counted as part of the text and tags. For example, when I write

print(train_texts[0][0:9], train_tags[0][0:9], sep='\n')

output
['Angie',
['B-ORG',

My question is, why is it counting the brackets and quote characters as part of the string? How can I fix it?


Solution

  • I declared the data as a DataFrame and performed the same split into train_texts, test_texts, train_tags, and test_tags. The steps are shown below, followed by the issue with [] and '' in your scenario.

    # Import all the important libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    
    # Store all String data into the 'data' variable
    data = {
    'text' : ['Angie', '’s', 'is', 'my', 'favorite', 'but', 'the', 'prices', 'at', 'little', 'Tonys', 'are', 'better', '.'],
    'tags' : ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O']}
    
    # Store above Initialized Data into DataFrame
    dataset = pd.DataFrame(data)
    

    NOTE: Always print a few records of the dataset before moving ahead, because a problem in the dataset can throw off the result you expect.

    # Print a few records of 'dataset'
    dataset
    
        text        tags
    0   Angie       B-ORG
    1   ’s          I-ORG
    2   is          O
    3   my          O
    4   favorite    O
    5   but         O
    6   the         O
    7   prices      O
    8   at          O
    9   little      B-ORG
    10  Tonys       I-ORG
    11  are         O
    12  better      O
    13  .           O
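Printing the frame alone can hide what a cell really contains, because the text of a list and an actual list look alike on screen. A minimal sketch of a type check follows. The small frame here is hypothetical and only mimics a column whose cells are strings:

```python
import pandas as pd

# Hypothetical frame: each cell is the *text* of a list, not a list object.
dataset = pd.DataFrame({
    "text": ["['Angie', '’s', 'is']"],
    "tags": ["['B-ORG', 'I-ORG', 'O']"],
})

first_cell = dataset["text"].iloc[0]

# The type shows whether slicing will return list items or characters.
print(type(first_cell))

# Slicing a str counts every character, including brackets and quotes.
print(first_cell[0:9])

# Distinct Python types found in the whole column.
cell_types = dataset["text"].map(type).unique()
print(cell_types)
```

If the type is str, a slice such as [0:9] returns nine characters rather than nine tokens.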
    

    Now we can do the split. I used the same call as in your question.

    # split data
    train_texts, test_texts, train_tags, test_tags = train_test_split(
        dataset["text"].tolist(),
        dataset["tags"].tolist(),
        test_size=0.20,
        random_state=15)
    

    After splitting, we can print a slice of the first element of train_texts and train_tags:

    print(train_texts[0][0:9], train_tags[0][0:9], sep='\n')
    

    The output of the above cell is:

    favorite
    O
    

    As you can see, the output contains no [] or ''. Note that each element here is a single token, so the slice returns the characters of that token.

    Your question:

    Q.) Why is it counting the brackets and quote characters as part of the string? How can I fix it?

    A.) I don't know the exact reason behind this issue. It can happen when the data is not declared the way you expect, for example when a cell holds a string rather than a list. Printing the dataset before moving ahead is good practice because it lets you see how the data actually behaves.
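One explanation consistent with the output in the question is that each cell holds the text of a list, so [0:9] returns the first nine characters, which are exactly ['Angie', and ['B-ORG',. This often happens after a CSV round trip, although that is an assumption about how the data was loaded. A minimal sketch of converting such cells back to lists with ast.literal_eval, using a hypothetical frame and a hypothetical helper named parse_list_cell:

```python
import ast
import pandas as pd

# Hypothetical frame standing in for data whose cells are list-like strings.
dataset = pd.DataFrame({
    "text": ["['Angie', '’s', 'is', 'my', 'favorite']"],
    "tags": ["['B-ORG', 'I-ORG', 'O', 'O', 'O']"],
})


def parse_list_cell(cell):
    """Parse a string cell into a Python object; leave other cells unchanged."""
    return ast.literal_eval(cell) if isinstance(cell, str) else cell


# Convert both columns so that every cell is a real list.
for column in ("text", "tags"):
    dataset[column] = dataset[column].map(parse_list_cell)

texts = dataset["text"].tolist()
tags = dataset["tags"].tolist()

# Slicing now returns tokens and tags instead of characters.
print(texts[0][0:3], tags[0][0:3], sep="\n")
```

After this conversion, train_test_split can be called on the two lists as before, and slices of each element return whole tokens.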

    Solution: Declaring the data as a DataFrame worked perfectly for me, so you can try that.

    I hope this helps. If you are still facing the issue, please post the full code so that we can find a solution.