I have a dataset of several patients, with each patient's disease activity status and the abundance of specific bacteria, as shown below:
**Patient** **DiseaseActivity** **Bacteria**
15 Severe 0.6704
15 Quiescent 0.0350
24 Quiescent 0.0137
24 Quiescent 0.0088
26 Quiescent 0.0023
26 Severe 0.0410
33 Quiescent 0.2031
33 Quiescent 0.0893
37 Quiescent 0.0345
37 Quiescent 0.0031
52 Quiescent 0.0601
52 Severe 0.0200
53 Severe 0.0050
53 Severe 0.2724
69 Severe 0.9369
69 Quiescent 0.0008
2 Severe 0.0421
2 Quiescent 0.0120
12 Severe 0.3109
12 Severe 0.0646
40 Quiescent 0.8048
40 Severe 0.9113
51 Severe 0.1918
51 Severe 0.9538
Each patient has two samples, taken at two different time points. When I plot the patients one by one, I can see that when disease activity goes from Quiescent to Severe, the abundance of Bacteria increases, and when it goes from Severe to Quiescent, the abundance decreases, although only 6 patients fall into this category.
My question is: how can I check whether this is really the case, at least for those 6 patients, and what type of test do I need for this kind of dataset? And if I want to plot the data, what would be the most accurate way to do it?
Thank you very much in advance.
I don't know about 'most accurate', and I can't help you with which test to use; that depends on your audience as well as your data. But here's one possible plot:
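library(dplyr)    # group_by(), summarize(), %>%
library(ggplot2)  # ggplot(), geom_point(), theme_bw()
change.df <- data.df %>% group_by(Patient) %>% summarize(status.change = paste(DiseaseActivity, collapse = " -> "), bacteria.change = Bacteria[2] - Bacteria[1])
ggplot(change.df, aes(x = bacteria.change, y = status.change, color = status.change)) + geom_point(size = 5) + theme_bw()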
This is assuming that every patient has two time points and that they're always in the order time1:time2, which is pretty dangerous! Timepoint should really be recorded in its own column.
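To make the point about time order concrete, here is a minimal sketch of the same summary with an explicit time column. The `Timepoint` column is hypothetical: it is filled in here by assuming the rows in the question are already listed in time order for each patient, which should be checked against the real sample records. The name `changed.df` is also just an illustration.

```r
library(dplyr)

# Data from the question, plus a hypothetical Timepoint column.
# ASSUMPTION: each patient's two rows are listed in time order (1, then 2).
data.df <- data.frame(
  Patient = rep(c(15, 24, 26, 33, 37, 52, 53, 69, 2, 12, 40, 51), each = 2),
  DiseaseActivity = c("Severe", "Quiescent", "Quiescent", "Quiescent",
                      "Quiescent", "Severe", "Quiescent", "Quiescent",
                      "Quiescent", "Quiescent", "Quiescent", "Severe",
                      "Severe", "Severe", "Severe", "Quiescent",
                      "Severe", "Quiescent", "Severe", "Severe",
                      "Quiescent", "Severe", "Severe", "Severe"),
  Bacteria = c(0.6704, 0.0350, 0.0137, 0.0088, 0.0023, 0.0410,
               0.2031, 0.0893, 0.0345, 0.0031, 0.0601, 0.0200,
               0.0050, 0.2724, 0.9369, 0.0008, 0.0421, 0.0120,
               0.3109, 0.0646, 0.8048, 0.9113, 0.1918, 0.9538),
  Timepoint = rep(1:2, times = 12)
)

# Stop early if any patient does not have exactly two samples.
stopifnot(all(count(data.df, Patient)$n == 2))

# Same summary as before, but the change is computed from the Timepoint
# column rather than from row position within each patient.
change.df <- data.df %>%
  arrange(Patient, Timepoint) %>%
  group_by(Patient) %>%
  summarize(
    status.change = paste(DiseaseActivity, collapse = " -> "),
    bacteria.change = Bacteria[Timepoint == 2] - Bacteria[Timepoint == 1],
    .groups = "drop"
  )

# Keep only the patients whose status differs between the two time points.
changed.df <- change.df %>%
  filter(status.change %in% c("Quiescent -> Severe", "Severe -> Quiescent"))
```

With the change computed this way, reordering the rows of `data.df` no longer affects the result, and `changed.df` can be passed to the same `ggplot()` call if you only want to show the patients whose status changed.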