I am working on a survey and the data looks like this:
ID Q1 Q2 Q3 Gender Age Dep Ethnicity
001 Y N Y F 22 IT W
002 N Y Y M 35 HR W
003 Y N N F 20 IT A
004 Y N Y M 54 OPRE B
005 Y N Y M 42 OPRE B
Now, I'd like to add two indexes Dep and Gender to create a table like:
Question Dep Response ID % response
Q1 IT Y 2 100
IT N 0 0
HR Y 0 0
HR N 1 100
OPRE Y 2 100
OPRE N 0 0
Q2 IT Y 0 0
IT N 2 100
HR Y 1 100
HR N 0 0
OPRE Y 0 0
OPRE N 2 100
Q3 ......
My codes are like this:
df['% response'] = df['ID']/df['ID'].sum()
Which gives me
Question Dep Response ID % response
Q1 IT Y 2 20
IT N 0 0
HR Y 0 0
HR N 1 10
OPRE Y 2 20
OPRE N 0 0
Q2 IT Y 0 0
IT N 2 20
HR Y 1 10
HR N 0 0
OPRE Y 0 0
OPRE N 2 20
Q3 ......
I think the denominator is wrong. It should be grouped by the question and Dep and then do the count instead of sum all the IDs. Does anyone can help?
With your initial dataframe:
import pandas as pd
df = pd.DataFrame(
{
"ID": [1, 2, 3, 4, 5],
"Q1": ["Y", "N", "Y", "Y", "Y"],
"Q2": ["N", "Y", "N", "N", "N"],
"Q3": ["Y", "Y", "N", "Y", "Y"],
"Gender": ["F", "M", "F", "M", "M"],
"Age": [22, 35, 20, 54, 42],
"Dep": ["IT", "HR", "IT", "OPRE", "OPRE"],
"Ethnicity": ["W", "W", "A", "B", "B"],
}
)
Here is one way to do it:
# Reshape the dataframe
df = (
pd.concat(
[
df[[col, "Dep", "ID"]]
.rename(columns={col: "Response"})
.assign(Question=col)
for col in ("Q1", "Q2", "Q3")
]
)
.groupby(["Question", "Dep", "Response"])
.count()
)
# Add 'zero' Y/N rows
for question in ("Q1", "Q2", "Q3"):
for dep in ("HR", "IT", "OPRE"):
for response in ("Y", "N"):
df = pd.concat([df, pd.DataFrame({(question, dep, response): {"ID": 0}}).T])
# Remove duplicated rows and reindex
df = df[~df.index.duplicated(keep="first")].sort_index(ascending=[True, True, False])
# Add percentage column
for question in ("Q1", "Q2", "Q3"):
for dep in ("HR", "IT", "OPRE"):
for response in ("Y", "N"):
try:
df.loc[(question, dep, response), "% response"] = (
100
* df.loc[(question, dep, response), "ID"]
/ df.loc[(question, dep)].sum().values[0]
)
except KeyError:
pass
# Cleanup
df = df.astype(int)
So that:
print(df)
# Output
ID % response
Question Dep Response
Q1 HR Y 0 0
N 1 100
IT Y 2 100
N 0 0
OPRE Y 2 100
N 0 0
Q2 HR Y 1 100
N 0 0
IT Y 0 0
N 2 100
OPRE Y 0 0
N 2 100
Q3 HR Y 1 100
N 0 0
IT Y 1 50
N 1 50
OPRE Y 2 100
N 0 0