I've been trying to learn Python Data Analytics by working on a little dataset of job postings.
The dataset table consists of three columns; job title, country, and job description.
Some of the cells in the job title column contains values like:
which I wanted to change to something simple, that can be aggregated like: "Data Analyst"
After a lot of searching and browsing through the numpy and pandas documentation I couldn't find any function with the sole purpose of doing this, all of the functions I found does something along the lines of replacing the found expressions with a new one, so I figure I'd have to use if statements anyways, so I did the code as shown in the next section. But I feel like it is too slow and long, there must be a better way to do this right?
TLDR; Code slow and long because I use too many ifs and don't know what else to do.
This is the code:
Imports (I'm using google jupyter notebook)
import numpy as np
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
Reading csv and dropping some columns (total rows ~40k)
jobscsv = pd.read_csv("/content/drive/MyDrive/allJobs.csv")
jobscsv.head
jobscsv_modified = jobscsv.drop(['Date-Posted', 'Salary', 'Company', 'Job-Type'], axis=1)
jobscsv_modified.head
This is the problematic part
success_counter = 0
for i in range(0, len(jobscsv_modified)):
if "Data Analyst" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Analyst'
success_counter = success_counter + 1
if "Data Analysis" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Analyst'
success_counter = success_counter + 1
if "Data Analytics" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Analyst'
success_counter = success_counter + 1
##
if "Data Scientist" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Data Science" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Data Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Data Architect" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Machine Learning" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if " AI " in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Artificial Intelligence" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Scientist'
success_counter = success_counter + 1
if "Database" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Database Engineer'
success_counter = success_counter + 1
if "Data Entry" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Data Entry'
success_counter = success_counter + 1
##
if "Customer Service" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Customer Service'
success_counter = success_counter + 1
if "Sales" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Sales'
success_counter = success_counter + 1
##
if "Software" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Programmer'
success_counter = success_counter + 1
if "Web Development" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Programmer'
success_counter = success_counter + 1
if "Stack" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Programmer'
success_counter = success_counter + 1
##
if "Volunteer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Volunteer Work'
success_counter = success_counter + 1
if "Laboratory" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Research'
success_counter = success_counter + 1
if "Research" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Research'
success_counter = success_counter + 1
if "PhD" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'PhD'
success_counter = success_counter + 1
if "Teacher" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Lecturer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Principal" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Dean" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "School" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Student" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Instructor" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Education'
success_counter = success_counter + 1
if "Postdoctoral" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Research'
success_counter = success_counter + 1
##
if "Mechanical Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Mechanical Engineer'
success_counter = success_counter + 1
if "Industrial Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Industrial Engineer'
success_counter = success_counter + 1
if "Mechatronics Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Mechatronics Engineer'
success_counter = success_counter + 1
if "Electrical Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Electrical Engineer'
success_counter = success_counter + 1
if "Civil Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Civil Engineer'
success_counter = success_counter + 1
if "Chemical Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Chemical Engineer'
success_counter = success_counter + 1
if "Process Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Process Engineer'
success_counter = success_counter + 1
if "Lab Engineer" in jobscsv_modified['Job-Title'][i]:
jobscsv_modified['Job-Title'][i] = 'Research'
success_counter = success_counter + 1
I've reached the point where I'm fairly confident no one does this. How can I improve it?
Based on your comment, it sounds like speed is important. I have a new solution that is both faster and cleaner. As before, I assume you don't need the success counts.
Still using a dictionary containing the keyword and what to replace with, we can instead use DataFrame.where
. This is a vectorized function (so it will be fast) that checks if a condition is met. Where it is met, it will do nothing, but where it fails, it will update the value with what you specify. To do the check, we will use Series.str.contains
to search for the keywords in the desired column. Since where
only updates if False
, we will add ~
beforehand, which is the logical NOT.
To use this, we will simply loop through the dictionary of replacements and check each key, updating the data frame where the condition is met.
for key, value in replacements.items():
df["Job-Title"].where(~df["Job-Title"].str.contains(key), value, inplace=True)
I've run some tests:
Rows | Original | Updated | Richard |
---|---|---|---|
1400 | 70.2ms | 36.1ms | 112ms |
14,000 | 667ms | 208ms | 1.133s |
250,000 | 11.30s | 3.153s | 20.05s |
Clearly, my updated version is the fastest and my previous version comes in second.
Assuming you don't actually need to keep track of the successes, you can use the replace
method. For the replace
method, you can pass a dictionary of replacements, where keys are what to look for and values are what to replace it with. Because you want to replace the entire string if it contains the key words, we can use regex.
The regex we will use here is (^.*<word>.*$)
.
(...)
: This makes everything one group.^
: This matches the beginning of the word..*
: The .
means "match any character" and *
means "match zero or more of what's before it". Together it means it will match zero or more of any character.<word>
: This is where you'd put the word you want the regex to search for, i.e. "PhD" or "Machine Learning".$
: This matches the end of the word.Since it will be annoying to write the regex around every single replacement, we will write a function to convert the replacements into regex form.
regexify = lambda d: {f"(^.*{key}.*$)":value for key, value in d.items()}
Now, here's what you'll do.
{"Data Analyst":"Data Analyst", "Data Analysis":"Data Analyst", ..., "Software":"Programmer",...}
.replace
method with regex=True
. You can also choose to use inplace=True
.Example:
import pandas as pd
import random
# I put some random characters before and after the word to simulate your data
old = ["salkd old 1 asfgds", "asdfa old 2 ffdz", "afsfd old 3 asfds"]
df = pd.DataFrame({"A":random.choices(old, k=12)})
print(df)
replacements = {"old 1":"new 1", "old 2":"new 2", "old 3":"new 3"}
regexify = lambda d: {f"(^.*{key}.*$)":value for key, value in d.items()}
replacements_regexed = regexify(replacements)
df["A"] = df["A"].replace(replacements_regexed, regex=True)
# df["A"].replace(replacements_regexed, regex=True, inplace=True) # same result
print(df)
Output:
# Before replacement
A
0 asdfa old 2 ffdz
1 salkd old 1 asfgds
2 asdfa old 2 ffdz
3 afsfd old 3 asfds
4 asdfa old 2 ffdz
5 afsfd old 3 asfds
6 asdfa old 2 ffdz
7 asdfa old 2 ffdz
8 asdfa old 2 ffdz
9 salkd old 1 asfgds
10 asdfa old 2 ffdz
11 afsfd old 3 asfds
# After replacement
A
0 new 2
1 new 1
2 new 2
3 new 3
4 new 2
5 new 3
6 new 2
7 new 2
8 new 2
9 new 1
10 new 2
11 new 3