Tags: python, pandas, nltk, data-analysis, stop-words

Cleaning Data and Filtering Series


I'm working on analyzing a dataset of job postings from Indeed. My issue is filtering the job descriptions for skills that contain special characters. For example, I am unable to get 'c#' into the plot with this code:

from collections import Counter

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def cleanData(desc):
    desc = word_tokenize(desc)
    desc = [word.lower() for word in desc]
    desc = [word for word in desc if word not in stop_words]
    return desc

stop_words = stopwords.words('english')
# df is the DataFrame of job postings
tags_df = df["Description"].apply(cleanData)
result = tags_df.apply(Counter).sum().items()
result = sorted(result, key=lambda kv: kv[1], reverse=True)
result_series = pd.Series({k: v for k, v in result})

skills = ["java", "c#", "c++", "javascript", "sql", "python", "php", "html", "css"]
filter_series = result_series.filter(items=skills)
filter_series.plot(kind='bar', figsize=(20, 5))

Curiously, words such as 'c++', 'asp.net', and 'react.js' do come through fine; it is only 'c#' that gets lost. Any and all help is appreciated.
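For context on what is going wrong, the splitting happens inside NLTK's Treebank-style tokenizer (which word_tokenize builds on): its default punctuation rules pad the characters ;@#$%& with spaces, so '#' is separated out while '+' and mid-word '.' are left alone. A quick check:

```python
from nltk.tokenize import TreebankWordTokenizer

# The default punctuation rules pad ;@#$%& with spaces, so '#'
# is split off, while 'c++', 'asp.net', and 'react.js' survive.
tokens = TreebankWordTokenizer().tokenize("c# c++ asp.net react.js")
print(tokens)
# → ['c', '#', 'c++', 'asp.net', 'react.js']
```

This is why 'c#' never appears as a single token in the Counter, while the other skills do.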


Solution

  • You can modify the behavior of the NLTK tokenizer by overriding its punctuation regexes. The default PUNCTUATION list pads the characters ;@#$%& with spaces, which is what splits 'c#' into 'c' and '#'; the version below drops '#' from that character class so the token stays intact:

    from nltk.tokenize import TreebankWordTokenizer
    import re
    tokenizer = TreebankWordTokenizer()
    tokenizer.PUNCTUATION = [
            (re.compile(r"([:,])([^\d])"), r" \1 \2"),
            (re.compile(r"([:,])$"), r" \1 "),
            (re.compile(r"\.\.\."), r" ... "),
            (re.compile(r"[;@$%&]"), r" \g<0> "),
            (
                re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$'),
                r"\1 \2\3 ",
            ),  # Handles the final period.
            (re.compile(r"[?!]"), r" \g<0> "),
            (re.compile(r"([^'])' "), r"\1 ' "),
        ]
    
    text = 'My favorite programming languages are c# and c++'
    tokens = tokenizer.tokenize(text)
    print(tokens)
    

    Output:

    ['My', 'favorite', 'programming', 'languages', 'are', 'c#', 'and', 'c++']
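    With the customized tokenizer in hand, a minimal sketch of wiring it back into the original cleaning function (swapping word_tokenize for tokenizer.tokenize; the stop_words argument here is a stand-in for the stopwords.words('english') list from the question):

    ```python
    import re
    from nltk.tokenize import TreebankWordTokenizer

    tokenizer = TreebankWordTokenizer()
    # As above: drop '#' from the [;@#$%&] class so 'c#' is not split.
    tokenizer.PUNCTUATION = [
        (re.compile(r"([:,])([^\d])"), r" \1 \2"),
        (re.compile(r"([:,])$"), r" \1 "),
        (re.compile(r"\.\.\."), r" ... "),
        (re.compile(r"[;@$%&]"), r" \g<0> "),
        (re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$'), r"\1 \2\3 "),
        (re.compile(r"[?!]"), r" \g<0> "),
        (re.compile(r"([^'])' "), r"\1 ' "),
    ]

    def cleanData(desc, stop_words=()):
        # Same steps as the question, but using the customized tokenizer
        # instead of word_tokenize.
        tokens = [w.lower() for w in tokenizer.tokenize(desc)]
        return [w for w in tokens if w not in stop_words]

    print(cleanData("Experience with C# and SQL required",
                    stop_words={"with", "and"}))
    # → ['experience', 'c#', 'sql', 'required']
    ```

    After this change, 'c#' survives into the Counter and result_series steps, so result_series.filter(items=skills) will pick it up for the plot.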