Using Python, I have split chunks of text file data into sentences and put them in a list as below ("My list"). I need to figure out how to pull out only the word tokens and their associated POS tags (included in the sentence). My goal is a bigram-type structure such as this: [('Football', 'NNP'), ('Baltimore', 'NNP'), ('pulled', 'NNP'), ('off', 'IN'), ('a', 'IN'), ('victory', 'NN'), ('.', '.')]. I don't want to see the extra words/characters such as 'I-NP' and 'O' and ':'. However, periods ( . ) and commas ( , ) are fine. I would like to keep those in the paired list if possible.
My list:
['Football',
'NNP',
'I-NP',
'O',
'-',
':',
'O',
'O',
'Baltimore',
'NNP',
'I-NP',
'B-ORG',
'pulled',
'NNP',
'I-NP',
'O',
'off',
'IN',
'I-PP',
'O',
'a',
'IN',
'I-NP',
'O',
'victory',
'NN',
'I-NP',
'O',
'.',
'.',
'O',
'O']
I would like to see something like this but am not sure how to get there:
[('Football', 'NNP'), ('Baltimore', 'NNP'), ('pulled', 'NNP'), ('off', 'IN'), ('a', 'IN'), ('victory', 'NN'), ('.', '.')]
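This problem is pretty basic if you can describe which lines to use for the keys and values that you want to retain. Looking at the data here, it seems that you want to exclude any item in the input list that has a character in it other than a letter or a punctuation mark you want to keep (so 'I-NP', '-' and ':' are excluded, while '.' stays), or that is exactly 'O'.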
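The selection can be sketched as a small predicate that you run over every item: drop anything that has a character other than a letter, period, or comma in it, and drop anything that is exactly 'O'. Just an illustration on the first twelve items of the list: `keep` and `sample` are names I made up, and the comma next to the period in the character class is my assumption, based on the question asking for commas to be kept.

```python
import re

# First twelve items of the list above.
sample = ['Football', 'NNP', 'I-NP', 'O', '-', ':', 'O', 'O',
          'Baltimore', 'NNP', 'I-NP', 'B-ORG']

def keep(item):
    """Return True for items made up only of letters, periods, or commas, except 'O'."""
    return re.match(r"^[a-zA-Z.,]+$", item) is not None and item != 'O'

kept = [item for item in sample if keep(item)]
dropped = [item for item in sample if not keep(item)]
print(kept)     # → ['Football', 'NNP', 'Baltimore', 'NNP']
print(dropped)  # → ['I-NP', 'O', '-', ':', 'O', 'O', 'I-NP', 'B-ORG']
```

Whatever falls outside that character class is dropped, so this assumes every word and tag you want is made up only of those characters. A tag such as 'PRP$' (assuming Penn Treebank tags) would be dropped while its word stays, which throws the pairs off.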
After excluding the items you don't want to use, the keys and values for the dictionary items are just pairs...[K, V, K, V...]. If it turns out that this doesn't work for all of your data, then you need to figure out what the right selection criteria are, so that only the lines making up the pairs you want to create dictionaries from are left.
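Once the list is filtered, the [K, V, K, V...] pairing can be sketched with two slices and zip. A minimal illustration rather than the one right way: `filtered` and `pairs` are names I picked, and the comma in the character class is my assumption (the question asks for commas to be kept). zip hands back 2-tuples, which is the shape the question shows.

```python
import re

my_list = ['Football', 'NNP', 'I-NP', 'O', '-', ':', 'O', 'O',
           'Baltimore', 'NNP', 'I-NP', 'B-ORG', 'pulled', 'NNP', 'I-NP', 'O',
           'off', 'IN', 'I-PP', 'O', 'a', 'IN', 'I-NP', 'O',
           'victory', 'NN', 'I-NP', 'O', '.', '.', 'O', 'O']

# Filter first, so that only [word, tag, word, tag, ...] is left.
filtered = [x for x in my_list if re.match(r"^[a-zA-Z.,]+$", x) and x != 'O']

# Even positions are the words, odd positions are the tags; zip pairs them up as tuples.
pairs = list(zip(filtered[0::2], filtered[1::2]))
print(pairs)
# → [('Football', 'NNP'), ('Baltimore', 'NNP'), ('pulled', 'NNP'), ('off', 'IN'), ('a', 'IN'), ('victory', 'NN'), ('.', '.')]
```

If you want the one-item dictionaries instead, `[{k: v} for k, v in pairs]` gives the same result as the loop in the code below. Keep in mind that zip stops at the shorter slice, so a leftover unpaired item is ignored silently instead of raising an error.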
Here's the code that uses the above criteria to give you what you want:
import re

data = ['Football',
'NNP',
'I-NP',
'O',
'-',
':',
'O',
'O',
'Baltimore',
'NNP',
'I-NP',
'B-ORG',
'pulled',
'NNP',
'I-NP',
'O',
'off',
'IN',
'I-PP',
'O',
'a',
'IN',
'I-NP',
'O',
'victory',
'NN',
'I-NP',
'O',
'.',
'.',
'O',
'O']
# Keep only the items made up entirely of letters, periods, or commas, and drop the 'O' items.
data = [x for x in data if re.match(r"^[a-zA-Z.,]+$", x) and x != 'O']
result = []
# Walk the filtered list two items at a time: data[i] is the key and data[i+1] is its value.
for i in range(0, len(data), 2):
    result.append({data[i]: data[i+1]})
print(result)
Result:
[{'Football': 'NNP'}, {'Baltimore': 'NNP'}, {'pulled': 'NNP'}, {'off': 'IN'}, {'a': 'IN'}, {'victory': 'NN'}, {'.': '.'}]
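If the character filter turns out not to fit all of the data, another selection rule is to go by position. In the list above, every token seems to take up four consecutive slots (the word, its POS tag, then two further tags), so the words and POS tags sit at every fourth position. A rough sketch under that assumption, which I can only check against this one list: the stride of four, the `wanted` test for which tokens to keep, and the names `all_pairs` and `pairs` are all mine.

```python
import re

data = ['Football', 'NNP', 'I-NP', 'O', '-', ':', 'O', 'O',
        'Baltimore', 'NNP', 'I-NP', 'B-ORG', 'pulled', 'NNP', 'I-NP', 'O',
        'off', 'IN', 'I-PP', 'O', 'a', 'IN', 'I-NP', 'O',
        'victory', 'NN', 'I-NP', 'O', '.', '.', 'O', 'O']

# The positional rule only makes sense if the list really is a whole number of 4-item groups.
if len(data) % 4 != 0:
    raise ValueError("list is not made up of 4-item groups")

# Slot 0 of each group is the word, slot 1 is its POS tag; slots 2 and 3 are ignored.
all_pairs = list(zip(data[0::4], data[1::4]))
print(all_pairs[:2])  # → [('Football', 'NNP'), ('-', ':')]

def wanted(word):
    """Return True for purely alphabetic words and for the '.' and ',' tokens."""
    return word in {'.', ','} or re.match(r"^[a-zA-Z]+$", word) is not None

# Keep a pair only if its word passes the test; the tag is never inspected.
pairs = [(word, tag) for word, tag in all_pairs if wanted(word)]
print(pairs)
# → [('Football', 'NNP'), ('Baltimore', 'NNP'), ('pulled', 'NNP'), ('off', 'IN'), ('a', 'IN'), ('victory', 'NN'), ('.', '.')]
```

Because the tag is picked by position here, a tag with other characters in it survives. The `wanted` test still decides which tokens are kept and would need widening for words with digits, hyphens, or apostrophes.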