Tags: python, python-3.x, nlp, nltk, rake

Extracting only technical keywords from a text using RAKE library in Python


I want to use RAKE to extract technical keywords from a job description that I found on LinkedIn, which looks like this:

input = """In-depth understanding of the Python software development stacks, ecosystems, frameworks and tools such as Numpy, Scipy, Pandas, Dask, spaCy, NLTK, sci-kit-learn and PyTorch.Experience with front-end development using HTML, CSS, and JavaScript.
Familiarity with database technologies such as SQL and NoSQL.Excellent problem-solving ability with solid communication and collaboration skills.
Preferred Skills And QualificationsExperience with popular Python frameworks such as Django, Flask or Pyramid."""

I run the following code, expecting it to return the keywords:

from rake_nltk import Rake

r = Rake()
r.extract_keywords_from_text(input)
keywords = r.get_ranked_phrases_with_scores()

for score, keyword in keywords:
    if len(keyword.split()) == 1:  # Check if the keyword is one word
        print(f"{keyword}: {score}")

But the output is this:

frameworks: 2.0
tools: 1.0
sql: 1.0
spacy: 1.0
scipy: 1.0
sci: 1.0
qualificationsexperience: 1.0
pytorch: 1.0
pyramid: 1.0
pandas: 1.0
numpy: 1.0
nosql: 1.0
nltk: 1.0
learn: 1.0
kit: 1.0
javascript: 1.0
front: 1.0
flask: 1.0
familiarity: 1.0
experience: 1.0
ecosystems: 1.0
django: 1.0
dask: 1.0
css: 1.0

I only want the explicit names of tools, skills and frameworks that appear in the text, such as "Numpy", "Scipy", "HTML", etc., and NOT every single word found in it (such as "experience" or "tools").

Is there any way to do this? Or should I provide a list of all possible Python frameworks and related skills and then filter the output of RAKE against it? If the latter is the solution, how can I find or build a thorough list?

Any help is appreciated.


Solution

  • You can use token-classification models for skill and knowledge extraction from the Hugging Face Hub, loaded through the transformers library:

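    from transformers import pipeline
    
    # Two token classifiers: one tags skill spans, the other tags knowledge spans.
    # aggregation_strategy="first" groups sub-word tokens into whole words and
    # labels each word by its first token.
    token_skill_classifier = pipeline(model="jjzha/jobbert_skill_extraction", aggregation_strategy="first")
    token_knowledge_classifier = pipeline(model="jjzha/jobbert_knowledge_extraction", aggregation_strategy="first")
    
    def aggregate_span(results):
        """Merge entities separated by exactly one character (e.g. a space) into one span."""
        new_results = []
        current_result = results[0]
    
        for result in results[1:]:
            # The next entity starts one character after the current one ends: join them.
            if result["start"] == current_result["end"] + 1:
                current_result["word"] += " " + result["word"]
                current_result["end"] = result["end"]
            else:
                new_results.append(current_result)
                current_result = result
    
        # Keep the last span that was still being built.
        new_results.append(current_result)
    
        return new_results
    
    def ner(text):
        """Return two dicts (skills, knowledge), each holding the text and its entities."""
        output_skills = token_skill_classifier(text)
        for result in output_skills:
            # Replace the pipeline's "entity_group" key with a fixed "entity" label.
            if result.get("entity_group"):
                result["entity"] = "Skill"
                del result["entity_group"]
    
        output_knowledge = token_knowledge_classifier(text)
        for result in output_knowledge:
            if result.get("entity_group"):
                result["entity"] = "Knowledge"
                del result["entity_group"]
    
        # aggregate_span reads results[0], so only call it on non-empty lists.
        if len(output_skills) > 0:
            output_skills = aggregate_span(output_skills)
        if len(output_knowledge) > 0:
            output_knowledge = aggregate_span(output_knowledge)
    
        return {"text": text, "entities": output_skills}, {"text": text, "entities": output_knowledge}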
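The function above returns two dictionaries rather than the flat list of names the question asks for. A minimal sketch of one way to flatten them follows. The helper name `entity_names` is hypothetical, and the sample dictionaries are hand-written stand-ins with made-up scores, used so the snippet runs without downloading the models. In real use you would pass the two values returned by `ner(text)` instead.

```python
def entity_names(*ner_outputs):
    """Collect unique entity strings from dicts shaped like ner()'s return values.

    Duplicates are dropped case-insensitively; first-seen order is kept.
    """
    seen = set()
    names = []
    for output in ner_outputs:
        for entity in output["entities"]:
            name = entity["word"].strip()
            key = name.lower()
            if name and key not in seen:
                seen.add(key)
                names.append(name)
    return names


# Hand-written stand-ins for the (skills, knowledge) pair returned by ner(text).
# Scores are invented for illustration only.
text = "Experience with Django and SQL. Solid communication skills."
skills = {
    "text": text,
    "entities": [
        {"entity": "Skill", "score": 0.9, "word": "communication skills", "start": 38, "end": 58},
    ],
}
knowledge = {
    "text": text,
    "entities": [
        {"entity": "Knowledge", "score": 0.9, "word": "Django", "start": 16, "end": 22},
        {"entity": "Knowledge", "score": 0.9, "word": "SQL", "start": 27, "end": 30},
        {"entity": "Knowledge", "score": 0.8, "word": "sql", "start": 27, "end": 30},  # duplicate, dropped
    ],
}

print(entity_names(skills, knowledge))
# → ['communication skills', 'Django', 'SQL']
```

If only tool and framework names are wanted, passing just the knowledge dictionary (`entity_names(knowledge)`) may be closer to what the question describes, though how the models split skills from knowledge on a given text should be checked on real output.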