Python: Find vocabulary of a bigram

I have a list of tweets (tokenized and preprocessed). It's like this:

['AT_TOKEN',
 'what',
 'AT_TOKEN',
 'said',
 'END',
 'AT_TOKEN',
 'plus',
 'you',
 've',
 'added',
 'commercials',
 'to',
 'the',
 'experience',
 'tacky',
 'END',
 'AT_TOKEN',
 'i',
 'did',
 'nt',
 'today',
 'must',
 'mean',
 'i',
 'need',
 'to',
 'take',
 'another',
 'trip',
 'END']

END signifies that a tweet has ended and a new one has begun.

I want to find the bigram vocabulary for this list, but I'm having a hard time working out how to do it efficiently. I have figured out how to do this for unigrams:

from collections import defaultdict

def unigram_vocabulary(data):  # function name is illustrative
    unique_words = defaultdict(int)
    for i in range(len(data)):
        unique_words[data[i]] = 1
    return list(unique_words.keys())

The problem is that I first need to convert this list into bigrams and then find the vocabulary of those bigrams.

Can anybody help me figure this out?

Solution

For single words you only need set(); defaultdict is unnecessary:

unique_words = list(set(data))

print(unique_words)
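One thing to keep in mind is that set() does not keep the words in their original order. If a stable order matters for the vocabulary, a possible alternative (a sketch, assuming Python 3.7+ where dicts preserve insertion order) is dict.fromkeys(). The short data list here is only a sample in the same shape as the question's list:

```python
# Sample tokens in the same shape as the question's list.
data = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END']

# dict keys are unique and keep first-seen order (Python 3.7+).
unique_words = list(dict.fromkeys(data))

print(unique_words)  # ['AT_TOKEN', 'what', 'said', 'END']
```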

For two words you can use a for-loop over range(len(data)-1) and take the slice data[i:i+2] (again without defaultdict):

all_bigrams = []

for i in range(len(data)-1):
    all_bigrams.append(tuple(data[i:i+2]))
    
unique_bigrams = list(set(all_bigrams))

print(unique_bigrams)

or build the set directly, without the intermediate all_bigrams list:

unique_bigrams = set()

for i in range(len(data)-1):
    unique_bigrams.add(tuple(data[i:i+2]))
    
unique_bigrams = list(unique_bigrams)

print(unique_bigrams)
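The loop over data[i:i+2] can probably also be written with zip(), which pairs every token with the one after it. This is only a sketch of the same idea on a short sample list, not a different algorithm:

```python
# Sample tokens in the same shape as the question's list.
data = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END']

# zip() stops at the shorter input, so the last token is never
# paired with a missing successor; the set drops duplicate pairs.
unique_bigrams = set(zip(data, data[1:]))

print(sorted(unique_bigrams))
```

Each element is a 2-tuple, the same as with tuple(data[i:i+2]).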

The same works for three words (trigrams), but with data[i:i+3] and len(data)-2:

all_threewords = []

for i in range(len(data)-2):
    all_threewords.append(tuple(data[i:i+3]))
    
unique_threewords = list(set(all_threewords))

print(unique_threewords)

or build the set directly, without the intermediate all_threewords list:

unique_threewords = set()

for i in range(len(data)-2):
    unique_threewords.add(tuple(data[i:i+3]))
    
unique_threewords = list(unique_threewords)

print(unique_threewords)
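Since the two-word and three-word versions differ only in the slice length and the loop bound, they could be folded into one helper. The function name unique_ngrams and the parameter n are my own choices for illustration; the sample list is shortened:

```python
def unique_ngrams(tokens, n):
    """Return the set of distinct n-token tuples found in tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # There are len(tokens) - n + 1 windows of size n; range() is
    # empty when tokens is shorter than n, giving an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


data = ['AT_TOKEN', 'i', 'did', 'nt', 'END']

print(unique_ngrams(data, 2))  # bigrams, same as data[i:i+2] with len(data)-1
print(unique_ngrams(data, 3))  # trigrams, same as data[i:i+3] with len(data)-2
```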

Full working example


data = ['AT_TOKEN',
 'what',
 'AT_TOKEN',
 'said',
 'END',
 'AT_TOKEN',
 'plus',
 'you',
 've',
 'added',
 'commercials',
 'to',
 'the',
 'experience',
 'tacky',
 'END',
 'AT_TOKEN',
 'i',
 'did',
 'nt',
 'today',
 'must',
 'mean',
 'i',
 'need',
 'to',
 'take',
 'another',
 'trip',
 'END']

# ---

unique_words = list(set(data))

print(unique_words)

# ---

all_bigrams = []

for i in range(len(data)-1):
    all_bigrams.append(tuple(data[i:i+2]))
    
unique_bigrams = list(set(all_bigrams))

print(unique_bigrams)

# ---

unique_bigrams = set()

for i in range(len(data)-1):
    unique_bigrams.add(tuple(data[i:i+2]))
    
unique_bigrams = list(unique_bigrams)

print(unique_bigrams)

# ---

all_threewords = []

for i in range(len(data)-2):
    all_threewords.append(tuple(data[i:i+3]))
    
unique_threewords = list(set(all_threewords))

print(unique_threewords)

# ---

unique_threewords = set()

for i in range(len(data)-2):
    unique_threewords.add(tuple(data[i:i+3]))
    
unique_threewords = list(unique_threewords)

print(unique_threewords)

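Note that this also produces pairs that span two tweets, such as ('END', 'AT_TOKEN'). I don't know whether you need those, or any pair containing 'END' or 'AT_TOKEN' at all.

If you don't want pairs that cross tweet boundaries, you would first need to convert the data into sublists, one per tweet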

data = [

  ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END'],

  ['AT_TOKEN', 'plus', 'you', 've', 'added',
   'commercials', 'to', 'the', 'experience',
   'tacky', 'END'],

  ['AT_TOKEN', 'i', 'did', 'nt', 'today',
   'must', 'mean', 'i', 'need', 'to', 'take',
   'another', 'trip', 'END']

]

and then work with every sublist separately.
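A rough sketch of that idea, assuming every tweet ends with 'END' as described in the question. The helper name split_tweets and its end_marker parameter are hypothetical, and I keep 'END' inside each sublist; drop it if you don't want pairs like ('said', 'END'):

```python
def split_tweets(tokens, end_marker='END'):
    """Split a flat token list into one sublist per tweet.

    The end marker stays at the end of its sublist.
    """
    tweets, current = [], []
    for token in tokens:
        current.append(token)
        if token == end_marker:
            tweets.append(current)
            current = []
    if current:  # trailing tokens with no closing marker
        tweets.append(current)
    return tweets


# Shortened sample in the same shape as the question's list.
data = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END',
        'AT_TOKEN', 'i', 'did', 'nt', 'END']

unique_bigrams = set()
for tweet in split_tweets(data):
    # Pairs are built inside one tweet only, so none crosses a boundary.
    unique_bigrams.update(zip(tweet, tweet[1:]))

print(('END', 'AT_TOKEN') in unique_bigrams)  # False
```

Because the pairs are built per sublist, ('END', 'AT_TOKEN') can no longer appear in the vocabulary.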