Comparing rows from a Pandas DataFrame depending on certain attribute

The task is to compare two production lines A and B with respect to a performance indicator called OEE. I suspect that there is an impact by the article and line, so I want to compare only articles that have been produced at least once on both lines (here: hammer, drill, pliers, widget). This may later enable to tell if certain articles perform better on one of both lines.

I have managed by iterarion to mark only articles that have run on both lines at least once. But I wonder: is there not a more elegant way to do it?

Interation is not considered the best way as I found out here: https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758

But perhaps the DataFrame can be modified in a simpler way?

Python 3.9.7
IPython 7.29.0
Pandas 1.3.4

What I did - but it feels "clumsy":

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# create test data

data = {"batch": [100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
        "production line": ["A", "A", "B", "A", "B", "B", "A", "B", "B", "B"],
        "article": ["hammer", "drill", "hammer", "pliers", "pliers", "pliers", "hammer", "hammer", "hammer", "widget"],
        "OEE": np.random.rand(10)}
df = pd.DataFrame(data, columns = ["batch", "production line", "article","OEE"])

# add a helper column
df["comparable"] = False


# check for each line if the article exists on the other line
for i in range (len(df)):
    production_line = df.loc[i, "production line"]
    article = df.loc[i, "article"]
    k = i
    while k < len(df):
        if production_line != df.loc[k, "production line"] and article == df.loc[k, "article"]:
            df.loc[i, "comparable"] = True
            df.loc[k, "comparable"] = True
        k +=1

# create reduced dataset
reduced = df[df["comparable"] == True]

sns.stripplot(x= reduced["production line"], y=reduced.OEE, data=reduced, size=16, alpha=.2, jitter=True, edgecolor="none")

Solution

You can groupby 'article', and then filter for the number of unique items in 'production line':

reduced = df.groupby("article").filter(lambda x: x["production line"].nunique() == 2)