Tags: python, dataframe, web-scraping, stop-words

Remove stop words from a dataframe's column with Python


I managed to extract a list of words from a website and store them in a dataframe. Now what I need is to remove some of these words from the "Palabras" column and keep only the first 500 records.

This is my code so far:

import requests
wiki_url = "https://es.wiktionary.org/wiki/Wikcionario:Frecuentes-(1-1000)-Subt%C3%ADtulos_de_pel%C3%ADculas"
wiki_texto = requests.get(wiki_url).text
from bs4 import BeautifulSoup
wiki_datos = BeautifulSoup(wiki_texto, "html")
wiki_filas = wiki_datos.findAll("tr")
print(wiki_filas[1])

print("...............................")

wiki_celdas = wiki_datos.findAll("td")
print(wiki_celdas[0:])
fila_1 = wiki_celdas[0:]
info_1 = [elemento.get_text() for elemento in fila_1]
print(fila_1)
print(info_1)
info_1[0] = int(float(info_1[0]))
print(info_1)


print("...............................")

num_or = [int(float(elem.findAll("td")[0].get_text())) for elem in wiki_filas[1:]]
palabras = [elem.findAll("td")[1].get_text().rstrip() for elem in wiki_filas[1:]]
frecuencia = [elem.findAll("td")[2].get_text().rstrip() for elem in wiki_filas[1:]]
print(num_or[0:])
print(palabras[0:])
print(frecuencia[0:])

from pandas import DataFrame
tabla = DataFrame([num_or, palabras, frecuencia]).T
tabla.columns = ["Núm. orden", "Palabras", "Frecuencia"]
print(tabla.head())
print(tabla)

print("...............................")

import nltk
nltk.download()
from nltk.corpus import stopwords 
prep = stopwords.words('spanish')
print(prep)


So what I need is to remove the list of words contained in this code

stopwords.words('spanish')

from the "Palabras" column and keep only the first 500 records (the words with the highest frequency).

Thanks in advance!


Solution

  • I managed to extract a list of words from a website and store them in a dictionary.

    Note: Actually, you stored them in a dataframe.

    You can use isin. tabla['Palabras'].isin(prep) marks the rows whose word in column 'Palabras' appears in your list of stop words; negate that mask with ~ to keep only the rows that are not stop words. And since the table is already sorted by frequency, just finish with .head(500):

    tabla = tabla[~tabla['Palabras'].isin(prep)].head(500)
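
    For intuition, here is a minimal sketch with made-up words and a made-up stop-word list (not the real Wiktionary data) showing how the isin mask and its negation behave:

    import pandas as pd

    # tiny stand-in for the real table, just to illustrate the boolean mask
    demo = pd.DataFrame({"Palabras": ["que", "casa", "de", "perro"],
                         "Frecuencia": [400, 300, 200, 100]})
    stop = ["que", "de"]

    mask = demo["Palabras"].isin(stop)  # True where the word is a stop word
    print(mask.tolist())                # [True, False, True, False]
    print(demo[~mask])                  # keeps only the "casa" and "perro" rows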
    

    Also, since the HTML contains table tags, I would consider using pandas' .read_html(), as it uses BeautifulSoup under the hood but does the hard work for you. Your code can be cut down considerably:

    Full Code, Same Results:

    import nltk
    import pandas as pd
    # nltk.download('stopwords')  # run once if the stop-word corpus is not installed yet
    from nltk.corpus import stopwords

    prep = stopwords.words('spanish')
    print(prep)

    # the question's URL; read_html() parses every <table> on the page,
    # and the frequency table is the first one
    wiki_url = "https://es.wiktionary.org/wiki/Wikcionario:Frecuentes-(1-1000)-Subt%C3%ADtulos_de_pel%C3%ADculas"
    tabla_beta = pd.read_html(wiki_url)[0]
    tabla_beta.columns = ["Núm. orden", "Palabras", "Frecuencia"]

    # drop the stop words, then keep the 500 most frequent remaining words
    tabla_beta = tabla_beta[~tabla_beta['Palabras'].isin(prep)].head(500)
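
    If you want to sanity-check the result, a quick sketch using the names from the code above: the filtered table should have 500 rows and no Spanish stop words left in "Palabras".

    print(tabla_beta.shape)                         # 500 rows, if enough non-stop words remain after filtering
    print(tabla_beta.head())                        # the highest-frequency non-stop words
    print(tabla_beta["Palabras"].isin(prep).any())  # should print False: no stop words left in the column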