Tags: python, nlp, tokenize

Corpus analysis with Python


I'm a new student of natural language processing and I have a task involving simple corpus analysis. Given an input file (MovieCorpus.txt), we are assigned to compute the following statistics:

  1. Number of sentences, tokens, types (lemmas)
  2. Distribution of sentence length, types, POS

import nltk
import spacy as sp
from nltk import word_tokenize

# Set up the spaCy model
nlp = sp.load('en_core_web_sm')

# Movie Corpus
with open('MovieCorpus.txt', 'r') as f:
    read_data = f.read().splitlines()


# Tokenize, POS, Lemma
tokens = []
lemma = []
pos = []

for doc in nlp.pipe(read_data):

    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        tokens.append(None)
        lemma.append(None)
        pos.append(None)


ls = len(read_data)
print("The number of sentences is: %d" % ls)

lt = len(tokens)
print("The number of tokens is: %d" % lt)

ll = len(lemma)
print("The number of lemmas is: %d" % ll)

This is my attempt at answering those questions, but since the file is very large (>300,000 sentences) it takes forever to analyze. Is there anything I did wrong? Should I use NLTK instead of spaCy?


Solution

  • import pandas as pd
    import nltk
    
    # One-time downloads for the tokenizer, lemmatizer, and tagger data
    # nltk.download('punkt')
    # nltk.download('wordnet')
    # nltk.download('averaged_perceptron_tagger')
    
    # Movie Corpus
    with open('MovieCorpus.txt', 'r') as f:
        read_data = f.read().splitlines()
    
    df = pd.DataFrame({"text": read_data})  # Assuming your data has no header
    
    df = df.head(10)  # Develop on a small sample; remove this line for the full corpus
    w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    def lemmatize_text(text):
        return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
    
    df["lemma"] = df.text.apply(lemmatize_text)
    df["tokens"] = df.text.apply(nltk.word_tokenize)
    df["posR"] = df.tokens.apply(nltk.pos_tag)
    df["pos"] = [[tag for word, tag in tagged] for tagged in df["posR"]]
    
    print(df)
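
    With that in place, the requested counts and distributions fall out directly. Below is a minimal sketch of that step, assuming the df built above (with the .head(10) sample line removed so the whole corpus is counted):
    
    from collections import Counter
    
    # 1. Number of sentences, tokens, and types (distinct lemmas)
    print("Number of sentences: %d" % len(df))
    print("Number of tokens: %d" % sum(len(t) for t in df["tokens"]))
    print("Number of types (lemmas): %d" % len({lem for lemmas in df["lemma"] for lem in lemmas}))
    
    # 2. Distributions of sentence length (in tokens) and of POS tags
    sent_len_dist = Counter(len(t) for t in df["tokens"])
    pos_dist = Counter(tag for tags in df["pos"] for tag in tags)
    print(pos_dist.most_common(10))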
    
    

    From here on, you should be able to do the remaining tasks on your own.
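
  • As for spaCy versus NLTK: the likely culprit is not spaCy itself but the fact that en_core_web_sm runs its full pipeline (tagger, parser, NER) on every line, while you only need tokens, lemmas, and POS tags. Here is a minimal sketch of your loop with the unused components disabled and multiprocessing enabled (assuming a recent spaCy, v2.2 or later, where nlp.pipe accepts n_process):

    import spacy as sp
    
    # Load only the components needed for tokens, lemmas, and POS tags
    nlp = sp.load('en_core_web_sm', disable=['parser', 'ner'])
    
    with open('MovieCorpus.txt', 'r') as f:
        read_data = f.read().splitlines()
    
    tokens, lemma, pos = [], [], []
    
    # Stream the corpus in batches across worker processes instead of one doc at a time
    for doc in nlp.pipe(read_data, batch_size=1000, n_process=2):
        tokens.append([t.text for t in doc])
        lemma.append([t.lemma_ for t in doc])
        pos.append([t.pos_ for t in doc])

    The batch_size and n_process values here are illustrative; tune them to your machine.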