The task is to compare two production lines A and B with respect to a performance indicator called OEE. I suspect that there is an impact by the article and line, so I want to compare only articles that have been produced at least once on both lines (here: hammer, drill, pliers, widget). This may later enable to tell if certain articles perform better on one of both lines.
I have managed by iterarion to mark only articles that have run on both lines at least once. But I wonder: is there not a more elegant way to do it?
Interation is not considered the best way as I found out here: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758
But perhaps the DataFrame can be modified in a simpler way?
What I did - but it feels "clumsy":
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# create test data
data = {"batch": [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
"production line": ["A", "A", "B", "A", "B", "B", "A", "B", "B", "B"],
"article": ["hammer", "drill", "hammer", "pliers", "pliers", "pliers", "hammer", "hammer", "hammer", "widget"],
"OEE": np.random.rand(10)}
df = pd.DataFrame(data, columns = ["batch", "production line", "article","OEE"])
# add a helper column
df["comparable"] = False
# check for each line if the article exists on the other line
for i in range (len(df)):
production_line = df.loc[i, "production line"]
article = df.loc[i, "article"]
k = i
while k < len(df):
if production_line != df.loc[k, "production line"] and article == df.loc[k, "article"]:
df.loc[i, "comparable"] = True
df.loc[k, "comparable"] = True
k +=1
# create reduced dataset
reduced = df[df["comparable"] == True]
sns.stripplot(x= reduced["production line"], y=reduced.OEE, data=reduced, size=16, alpha=.2, jitter=True, edgecolor="none")
You can groupby 'article', and then filter for the number of unique items in 'production line':
reduced = df.groupby("article").filter(lambda x: x["production line"].nunique() == 2)