Search code examples
pythonpandasnumpydatasetdata-analysis

Slow code; search for keyword in column, then change value of entire cell accordingly


I've been trying to learn Python Data Analytics by working on a little dataset of job postings.

The dataset table consists of three columns; job title, country, and job description.

Some of the cells in the job title column contains values like:

  • "Urgently Data Analyst needed - one year contract"
  • "Junior Data Analyst required, big hearts preferred"
  • "Senior Data Analytics position, two openings, ability to swim is a plus"

which I wanted to change to something simple, that can be aggregated like: "Data Analyst"

After a lot of searching and browsing through the numpy and pandas documentation I couldn't find any function with the sole purpose of doing this, all of the functions I found does something along the lines of replacing the found expressions with a new one, so I figure I'd have to use if statements anyways, so I did the code as shown in the next section. But I feel like it is too slow and long, there must be a better way to do this right?

TLDR; Code slow and long because I use too many ifs and don't know what else to do.

This is the code:

Imports (I'm using google jupyter notebook)

import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Reading csv and dropping some columns (total rows ~40k)

jobscsv = pd.read_csv("/content/drive/MyDrive/allJobs.csv")

jobscsv.head

jobscsv_modified = jobscsv.drop(['Date-Posted', 'Salary', 'Company', 'Job-Type'], axis=1)

jobscsv_modified.head

This is the problematic part

success_counter = 0

for i in range(0, len(jobscsv_modified)):

    if "Data Analyst" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Analyst'
      success_counter = success_counter + 1

    if "Data Analysis" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Analyst'
      success_counter = success_counter + 1

    if "Data Analytics" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Analyst'
      success_counter = success_counter + 1
##
    if "Data Scientist" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Data Science" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Data Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Data Architect" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Machine Learning" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if " AI " in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Artificial Intelligence" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Scientist'
      success_counter = success_counter + 1

    if "Database" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Database Engineer'
      success_counter = success_counter + 1

    if "Data Entry" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Data Entry'
      success_counter = success_counter + 1
##
    if "Customer Service" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Customer Service'
      success_counter = success_counter + 1

    if "Sales" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Sales'
      success_counter = success_counter + 1
##
    if "Software" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Programmer'
      success_counter = success_counter + 1

    if "Web Development" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Programmer'
      success_counter = success_counter + 1

    if "Stack" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Programmer'
      success_counter = success_counter + 1
##
    if "Volunteer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Volunteer Work'
      success_counter = success_counter + 1

    if "Laboratory" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Research'
      success_counter = success_counter + 1

    if "Research" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Research'
      success_counter = success_counter + 1

    if "PhD" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'PhD'
      success_counter = success_counter + 1

    if "Teacher" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Lecturer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Principal" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Dean" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "School" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Student" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Instructor" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Education'
      success_counter = success_counter + 1

    if "Postdoctoral" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Research'
      success_counter = success_counter + 1

##

    if "Mechanical Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Mechanical Engineer'
      success_counter = success_counter + 1

    if "Industrial Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Industrial Engineer'
      success_counter = success_counter + 1

    if "Mechatronics Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Mechatronics Engineer'
      success_counter = success_counter + 1

    if "Electrical Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Electrical Engineer'
      success_counter = success_counter + 1

    if "Civil Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Civil Engineer'
      success_counter = success_counter + 1

    if "Chemical Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Chemical Engineer'
      success_counter = success_counter + 1

    if "Process Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Process Engineer'
      success_counter = success_counter + 1

    if "Lab Engineer" in jobscsv_modified['Job-Title'][i]:
      jobscsv_modified['Job-Title'][i] = 'Research'
      success_counter = success_counter + 1

I've reached the point where I'm fairly confident no one does this. How can I improve it?


Solution

  • Updated Answer

    Based on your comment, it sounds like speed is important. I have a new solution that is both faster and cleaner. As before, I assume you don't need the success counts.

    Still using a dictionary containing the keyword and what to replace with, we can instead use DataFrame.where. This is a vectorized function (so it will be fast) that checks if a condition is met. Where it is met, it will do nothing, but where it fails, it will update the value with what you specify. To do the check, we will use Series.str.contains to search for the keywords in the desired column. Since where only updates if False, we will add ~ beforehand, which is the logical NOT.

    To use this, we will simply loop through the dictionary of replacements and check each key, updating the data frame where the condition is met.

    for key, value in replacements.items():
        df["Job-Title"].where(~df["Job-Title"].str.contains(key), value, inplace=True)
    

    I've run some tests:

    Rows Original Updated Richard
    1400 70.2ms 36.1ms 112ms
    14,000 667ms 208ms 1.133s
    250,000 11.30s 3.153s 20.05s

    Clearly, my updated version is the fastest and my previous version comes in second.

    Original Answer

    Assuming you don't actually need to keep track of the successes, you can use the replace method. For the replace method, you can pass a dictionary of replacements, where keys are what to look for and values are what to replace it with. Because you want to replace the entire string if it contains the key words, we can use regex.

    The regex we will use here is (^.*<word>.*$).

    • (...): This makes everything one group.
    • ^: This matches the beginning of the word.
    • .*: The . means "match any character" and * means "match zero or more of what's before it". Together it means it will match zero or more of any character.
    • <word>: This is where you'd put the word you want the regex to search for, i.e. "PhD" or "Machine Learning".
    • $: This matches the end of the word.

    Since it will be annoying to write the regex around every single replacement, we will write a function to convert the replacements into regex form.

    regexify = lambda d: {f"(^.*{key}.*$)":value for key, value in d.items()}
    

    Now, here's what you'll do.

    1. Create a dictionary of replacements and what to replace them with, e.g. {"Data Analyst":"Data Analyst", "Data Analysis":"Data Analyst", ..., "Software":"Programmer",...}.
    2. "Regexify" it by passing the dictionary to the function.
    3. Call the replace method with regex=True. You can also choose to use inplace=True.

    Example:

    import pandas as pd
    import random
    
    # I put some random characters before and after the word to simulate your data
    old = ["salkd old 1 asfgds", "asdfa old 2 ffdz", "afsfd old 3 asfds"] 
    
    df = pd.DataFrame({"A":random.choices(old, k=12)})
    print(df)
    
    replacements = {"old 1":"new 1", "old 2":"new 2", "old 3":"new 3"}
    regexify = lambda d: {f"(^.*{key}.*$)":value for key, value in d.items()}
    replacements_regexed = regexify(replacements)
    
    df["A"] = df["A"].replace(replacements_regexed, regex=True)
    # df["A"].replace(replacements_regexed, regex=True, inplace=True)  # same result
    print(df)
    

    Output:

    # Before replacement
                         A
    0     asdfa old 2 ffdz
    1   salkd old 1 asfgds
    2     asdfa old 2 ffdz
    3    afsfd old 3 asfds
    4     asdfa old 2 ffdz
    5    afsfd old 3 asfds
    6     asdfa old 2 ffdz
    7     asdfa old 2 ffdz
    8     asdfa old 2 ffdz
    9   salkd old 1 asfgds
    10    asdfa old 2 ffdz
    11   afsfd old 3 asfds
    
    # After replacement
            A
    0   new 2
    1   new 1
    2   new 2
    3   new 3
    4   new 2
    5   new 3
    6   new 2
    7   new 2
    8   new 2
    9   new 1
    10  new 2
    11  new 3