python, nlp, nltk, n-gram

Python: Find vocabulary of a bigram


I have a list of tweets (tokenized and preprocessed). It's like this:

['AT_TOKEN',
 'what',
 'AT_TOKEN',
 'said',
 'END',
 'AT_TOKEN',
 'plus',
 'you',
 've',
 'added',
 'commercials',
 'to',
 'the',
 'experience',
 'tacky',
 'END',
 'AT_TOKEN',
 'i',
 'did',
 'nt',
 'today',
 'must',
 'mean',
 'i',
 'need',
 'to',
 'take',
 'another',
 'trip',
 'END']

END signifies that a tweet has ended and a new one has begun.

I want to find the bigram vocabulary for this list, but I'm having a hard time working out how to do it efficiently. I have figured out how to do this for unigrams:

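from collections import defaultdict

def unigram_vocabulary(data):
    unique_words = defaultdict(int)
    for word in data:
        unique_words[word] = 1  # the key is what matters; the value is a dummy
    return list(unique_words.keys())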

The problem is that I first need to convert this list into bigrams and then find the vocabulary of those bigrams.

Can anybody help me figure this out?


Solution

  • For single words you only need set() (no defaultdict required). Note that a set does not preserve the original order of the words.

    unique_words = list(set(data))
    
    print(unique_words)
    

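    For two words you can use a for-loop over range(len(data)-1) and slice with data[i:i+2] (again without defaultdict). Each slice is converted to a tuple because lists are not hashable and cannot be stored in a set.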

    all_bigrams = []
    
    for i in range(len(data)-1):
        all_bigrams.append( tuple(data[i:i+2]) )
        
    unique_bigrams = list(set(all_bigrams))
    
    print(unique_bigrams)
    

    or add directly to a set(), without the intermediate all_bigrams list

    unique_bigrams = set()
    
    for i in range(len(data)-1):
        unique_bigrams.add( tuple(data[i:i+2]) )
        
    unique_bigrams = list(unique_bigrams)
    
    print(unique_bigrams)
    

    The same works for three words (trigrams), but with data[i:i+3] and len(data)-2

    all_threewords = []
    
    for i in range(len(data)-2):
        all_threewords.append( tuple(data[i:i+3]) )
        
    unique_threewords = list(set(all_threewords))
    
    print(unique_threewords)
    

    or add directly to a set(), without the intermediate all_threewords list

    unique_threewords = set()
    
    for i in range(len(data)-2):
        unique_threewords.add( tuple(data[i:i+3]) )
        
    unique_threewords = list(unique_threewords)
    
    print(unique_threewords)
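
The bigram and trigram loops differ only in the window size, so they can be folded into one helper. This is a minimal sketch, not part of the original answer: the function name `ngram_vocabulary` and the short `sample` list are made up for illustration. It slides a window of length `n` over the token list by zipping the list with shifted copies of itself.

```python
def ngram_vocabulary(tokens, n):
    """Return the set of unique n-grams (as tuples) found in tokens."""
    # zip stops at the shortest input, so it yields len(tokens) - n + 1
    # windows, the same ones a range(len(tokens) - n + 1) loop would visit.
    shifted = [tokens[i:] for i in range(n)]
    return set(zip(*shifted))


# Hypothetical sample data, shortened from the tweets in the question.
sample = ['AT_TOKEN', 'i', 'need', 'to', 'take', 'i', 'need', 'END']

print(ngram_vocabulary(sample, 2))  # unique bigrams
print(ngram_vocabulary(sample, 3))  # unique trigrams
```

The result is a set, so wrap it in list() if a list is needed, as in the snippets above.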
    

    Full working example

    
    data = ['AT_TOKEN',
     'what',
     'AT_TOKEN',
     'said',
     'END',
     'AT_TOKEN',
     'plus',
     'you',
     've',
     'added',
     'commercials',
     'to',
     'the',
     'experience',
     'tacky',
     'END',
     'AT_TOKEN',
     'i',
     'did',
     'nt',
     'today',
     'must',
     'mean',
     'i',
     'need',
     'to',
     'take',
     'another',
     'trip',
     'END']
    
    # ---
    
    unique_words = list(set(data))
    
    print(unique_words)
    
    # ---
    
    all_bigrams = []
    
    for i in range(len(data)-1):
        all_bigrams.append( tuple(data[i:i+2]) )
        
    unique_bigrams = list(set(all_bigrams))
    
    print(unique_bigrams)
    
    # ---
    
    unique_bigrams = set()
    
    for i in range(len(data)-1):
        unique_bigrams.add( tuple(data[i:i+2]) )
        
    unique_bigrams = list(unique_bigrams)
    
    print(unique_bigrams)
    
    # ---
    
    all_threewords = []
    
    for i in range(len(data)-2):
        all_threewords.append( tuple(data[i:i+3]) )
        
    unique_threewords = list(set(all_threewords))
    
    print(unique_threewords)
    
    # ---
    
    unique_threewords = set()
    
    for i in range(len(data)-2):
        unique_threewords.add( tuple(data[i:i+3]) )
        
    unique_threewords = list(unique_threewords)
    
    print(unique_threewords)
    

    However, I don't know whether you want pairs that span two tweets, like ('END', 'AT_TOKEN'), or any pair containing 'END' or 'AT_TOKEN'.

    If not, you would first need to split the data into sublists

    data = [
        ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END'],

        ['AT_TOKEN', 'plus', 'you', 've', 'added',
         'commercials', 'to', 'the', 'experience',
         'tacky', 'END'],

        ['AT_TOKEN', 'i', 'did', 'nt', 'today',
         'must', 'mean', 'i', 'need', 'to', 'take',
         'another', 'trip', 'END'],
    ]
    

    and then process each sublist separately.
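
If bigrams should not cross tweet boundaries, the flat list can be split automatically before collecting them. This is a minimal sketch under stated assumptions, not part of the original answer: it treats 'END' as the only tweet delimiter (as the question describes), keeps the 'END' and 'AT_TOKEN' tokens in each sublist, and uses the hypothetical helper names `split_tweets` and `bigram_vocabulary`. Because it splits only on 'END', a tweet containing several 'AT_TOKEN' mentions stays in one sublist.

```python
def split_tweets(tokens, end_marker='END'):
    """Split a flat token list into one sublist per tweet."""
    tweets = []
    current = []
    for token in tokens:
        current.append(token)
        if token == end_marker:
            # The marker closes the current tweet; it stays in the sublist.
            tweets.append(current)
            current = []
    if current:
        # Trailing tokens without a closing marker form a final tweet.
        tweets.append(current)
    return tweets


def bigram_vocabulary(tweets):
    """Collect unique bigrams, never pairing tokens from different tweets."""
    unique_bigrams = set()
    for tweet in tweets:
        unique_bigrams.update(zip(tweet, tweet[1:]))
    return unique_bigrams


# Shortened version of the data from the question.
data = ['AT_TOKEN', 'what', 'AT_TOKEN', 'said', 'END',
        'AT_TOKEN', 'i', 'did', 'nt', 'END']

tweets = split_tweets(data)
print(tweets)
print(bigram_vocabulary(tweets))
```

To drop the markers as well, filter 'END' and 'AT_TOKEN' out of each sublist before building the bigrams.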