The program I have written takes 2 csv's, one from each of our gene results reporters, and merges them on the column "Sample Id".
Assay ID_x | Gene Symbol | NCBI SNP Reference | Sample ID | Call_x | Assay ID_y | Gene Symbol | Call_y | Unique ID |
---|---|---|---|---|---|---|---|---|
C_### | CYP2D6 | rs## | NA12878 | C/A | C_## | CYP2D6 | C/A | 'In Agreement' |
C_### | CYP2D6 | rs## | NA12878_1 | C/C | C_## | CYP2D6 | C/C | 'In Agreement' |
The calls from the reporters are compared for similar results. Essentially if call x != call y, it is flagged for being not in agreement.
The problem: sometimes 2 rows are of the same sample, separated by name in the presence of an underscore (ex. NA1234
and NA1234_1
).
Considering the code only checks the calls on one line, I can't find a way to compare the other rows of the same sample.
Note that for the first row NA12878
has matching calls of C/A
and C/A
. The next sample NA12878_1
(which is the same sample) has an matching call for C/C
and C/C
.
These calls are actually not in agreement. C/A
and C/C
are not matching calls and must be flagged for review.
I've tried a few different tactics, but none have made it work as needed.
import pandas as pd
import tkinter as tk
from tkinter import filedialog
import tkinter.messagebox
import jinja2
import openpyxl
# The purpose of this program is to compare calls from duplicates, and the calls are aligned from 2 reporters
# There will be 2 Genotyper output passes
# We will flag the calls/comparisons if !=
# Defining the upload prompt
# File Upload window.
def ask_path():
root = tk.Tk()
root.withdraw()
path = filedialog.askopenfile(defaultextension=".csv",
filetypes=[("csv files", '*.csv'),
('all files', '*.*')]) # shows dialog box and return the path
return path
# Pop-up window branding
def onClick(msg):
tkinter.messagebox.showinfo("Precision Genetics", msg)
# Prompt & Load First Pass
genotyper1_col = ['Gene Symbol', 'Sample ID', 'NCBI SNP Reference', 'Assay ID', 'Call']
onClick('Upload First Pass')
geno1 = pd.read_csv(ask_path(), skiprows=17, usecols=genotyper1_col)
# Prompt & Load the Second Pass
genotyper2_col = ['Gene Symbol', 'Assay ID', 'Sample ID', 'NCBI SNP Reference', 'Call']
onClick('Upload Second Pass')
geno2 = pd.read_csv(ask_path(), skiprows=17, usecols=genotyper2_col)
# Merge on specified column
merged_data = pd.merge(geno1, geno2, on=['NCBI SNP Reference', 'Sample ID'])
# Filter out rows that don't have an NCBI SNP Reference in both passes
merged_data = merged_data[~merged_data['NCBI SNP Reference'].isna()]
# Here is where we will attempt to flag the Call Columns if they are not equal to one another
merged_data.loc[merged_data['Call_x'] != merged_data['Call_y'], 'Flag'] = 'Not in Agreement'
merged_data.loc[merged_data['Call_x'] == merged_data['Call_y'], 'Flag'] = '-'
# Highlight Cell if it isnt in agreement
def highlight_cells(value):
if value == merged_data['Not in Agreement']:
color = 'red'
else:
color = ''
return 'background-color: {}'.format(color)
merged_data.style.applymap(highlight_cells)
# Save the data
onClick('Please Choose Where You Would Like To Store This Data')
def save_loc(dataframe):
root = tk.Tk()
root.withdraw()
file_path = filedialog.asksaveasfilename(defaultextension=".csv",
filetypes=[("CSV Files", "*.csv"),
("All Files", "*.*")])
if file_path:
dataframe.to_csv(file_path, index=False)
root.destroy()
save_loc(merged_data)
NEW UPDATE Here is the new edited code, after the amazing help you all have given me. For whatever reason, I'm still getting "Not In Agreement" for everything in the Flagging column. Also, I thought it would be a good idea to update exactly what the merged csv looks like with its headers, and why I changed to location values to what they are. Thanks again for everything all.
def get_root_id(s):
return s.split("_", 1)[0]
sample_ids = defaultdict(list)
headers = []
with open("merged_data.csv", newline="") as f:
reader = csv.reader(f)
header = next(reader)
for row in reader:
root_id = get_root_id(row[3])
sample_ids[root_id].append(row)
for rows_by_id in sample_ids.values():
call_vals = set()
for row in rows_by_id:
call_vals.update(row[4:7])
if len(call_vals) > 1:
for row in rows_by_id:
row[8] = 'Not In Agreement'
output_file = 'output.csv'
with open("output.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(header)
for rows_by_id in sample_ids.values():
for row in rows_by_id:
writer.writerow(row)
output_data = pd.read_csv(output_file)
print(output_data)
I interpret your problem to be, "How do I compare multiple rows with a similar ID?" If that represents your problem, read on.
I reconstructed your data from the original image (including actual CSV makes it easier for us to play with the data). I removed the redundant columns, and added an ID column to show the original row order. I also cut out some rows. I think this accurately highlights the features of your data:
ID,Sample_ID,Call_X,Call_Y,Flag
1,NA12878,C/A,C/A,
2,NA12878_1,C/C,C/C,
3,NTC,,,Not In Agreement
4,PGX000133,C/C,C/C,
5,PGX000133_1,NOAMP,NOAMP,
6,PGX000135,C/C,C/C,
7,PGX000135_D,C/C,C/C,
8,PGX000135_1,C/C,C/C,
I picture reading the CSV and creating a structure where rows are grouped together by the root of their sample ID, something like:
{
"NA12878": [
["1", "NA12878", "C/A", "C/A", ""],
["2", "NA12878_1", "C/C", "C/C", ""],
],
"NTC": [
["3", "NTC", "", "", "Not In Agreement"],
],
"PGX000133": [
["4", "PGX000133", "C/C", "C/C", ""],
["5", "PGX000133_1", "NOAMP", "NOAMP", ""],
],
"PGX000135": [
["6", "PGX000135", "C/C", "C/C", ""],
["7", "PGX000135_D", "C/C", "C/C", ""],
["8", "PGX000135_1", "C/C", "C/C", ""],
],
}
With that structure I can loop over groups of rows, and compare the call x/y values. For any ID:
I use a Python set
and just update it with all call values for every group of rows:
for sample_id, rows_by_id in sample_ids.items():
call_vals = set()
for row in rows_by_id:
call_vals.update(row[2:4])
print(f"{sample_id}: {call_vals}")
NA12878: {'C/C', 'C/A'}
NTC: {''}
PGX000133: {'C/C', 'NOAMP'}
PGX000135: {'C/C'}
If I find a set with a length greater than 1, I loop back over all rows in the group and set the flag field:
...
if len(call_vals) > 1:
for row in rows_by_id:
row[4] = "Not In Agreement"
Here's my complete program, where I use a defaultdict(list)
to easily append rows to a sample ID and create that initial structure I imagined (and showed earlier).
import csv
from collections import defaultdict
def get_root_id(s):
return s.split("_", 1)[0] # NA12878_1 → NA12878; PGX000133 → PGX000133
sample_ids = defaultdict(list)
header = []
# My header will look like (w/row indices):
# ID,Sample_ID,Call_X,Call_Y,Flag
# 0 1 2 3 4
with open("input.csv", newline="") as f:
reader = csv.reader(f)
header = next(reader)
for row in reader:
root_id = get_root_id(row[1])
sample_ids[root_id].append(row)
for rows_by_id in sample_ids.values():
call_vals = set()
for row in rows_by_id:
call_vals.update(row[2:4])
if len(call_vals) > 1:
for row in rows_by_id:
row[4] = "Not In Agreement"
with open("output.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerow(header)
for rows_by_id in sample_ids.values():
for row in rows_by_id:
writer.writerow(row)
One feature of this structure is that if your input CSV is sorted by Sample_ID, the output CSV will have the same sort... (dunno if that's important, but that's how it will work out.)
My output looks like:
ID | Sample_ID | Call_X | Call_Y | Flag |
---|---|---|---|---|
1 | NA12878 | C/A | C/A | Not In Agreement |
2 | NA12878_1 | C/C | C/C | Not In Agreement |
3 | NTC | Not In Agreement | ||
4 | PGX000133 | C/C | C/C | Not In Agreement |
5 | PGX000133_1 | NOAMP | NOAMP | Not In Agreement |
6 | PGX000135 | C/C | C/C | |
7 | PGX000135_D | C/C | C/C | |
8 | PGX000135_1 | C/C | C/C |