Tags: r, loops, median, iqr

How to write a loop for finding the median of each column


I have a dataframe of kidney transplant patients with different clinical outcomes (numbers changed for confidentiality purposes). It looks something like this:

Patient  eGFR1m  cr1m  alb1m  cr3m  eGFR3m  alb3m  cr12m  eGFR12m  Diseased
A           142   343    125   110     115    125    120      181         1
B           175   192    121   125     215    120    135      151         0
C           154   185    128   210     115    125    124      116         0
D           170   215    215   110     125    110    145      205         1
E           175   140    225   110     115    110    125      120         0

This is the simplified version. I have a lot more outcomes, so I want to write a loop that calculates the median and IQR for each column in R.

I also need the medians for the whole cohort, as well as the medians for the diseased and non-diseased groups for comparison. The disease outcome was collected as a binary, non-continuous variable. eGFR, cr and alb at each month are all continuous, non-normally distributed variables.


Solution

  • It seems you want us to do all the steps of an initial exploratory data analysis for you. In future posts, instead of requesting code like this, you should first show your problem with reproducible code, show the results of your attempts, and ask specific questions about what you are unsure of. That said, let's look at your question:

    You can use sapply, which loops over the columns for you, to return the median, mean, Q1 and Q3 of every column.

    sapply(yourdataframe, median) #will return a vector with the medians of every column
    

    Similarly,

    sapply(yourdataframe, quantile, 0.25) #will return a vector with all the first quartiles
    
    sapply(yourdataframe, quantile, 0.75) #will return a vector with all the third quartiles
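
    These statistics are only meaningful for numeric columns, so it is safer to drop identifier columns such as Patient first. Here is a minimal sketch that gets the median, quartiles and IQR in one sapply call. It assumes a data frame retyped from the question (only the 1-month columns are included), and median_iqr is a made-up helper name:

    ```r
    # Toy data retyped from the question (1-month outcomes only)
    patients <- data.frame(
      Patient  = c("A", "B", "C", "D", "E"),
      eGFR1m   = c(142, 175, 154, 170, 175),
      cr1m     = c(343, 192, 185, 215, 140),
      alb1m    = c(125, 121, 128, 215, 225),
      Diseased = c(1, 0, 0, 1, 0)
    )

    # Keep the numeric columns, then drop the binary grouping flag
    outcomes <- patients[sapply(patients, is.numeric)]
    outcomes$Diseased <- NULL

    # Hypothetical helper: median, quartiles and IQR of one numeric vector
    median_iqr <- function(v, na.rm = TRUE) {
      q <- quantile(v, c(0.25, 0.75), na.rm = na.rm, names = FALSE)
      c(median = median(v, na.rm = na.rm), Q1 = q[1], Q3 = q[2], IQR = q[2] - q[1])
    }

    # Result is a matrix: one row per statistic, one column per outcome
    cohort_summary <- sapply(outcomes, median_iqr)
    print(cohort_summary)
    ```

    Because the helper returns a named vector, sapply stacks the results into a matrix, so a single value can be read with cohort_summary["median", "cr1m"].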
    

    You may want to write a function that integrates all that in a single call, like this:

    
    descriptive<-function(x=data.frame(), digits=2, na.rm=TRUE, normality_test=c("shapiro", "ks")){
            # Stop early on an unsupported test name instead of leaving p.value undefined
            normality_test<-match.arg(normality_test)
            is.normal<-character()
            medians<-numeric()
            Q1<-numeric()
            Q3<-numeric()
            means<-numeric()
            SDs<-numeric()
            output<-character()
            for (i in seq_along(x)){
                    # x[[i]] always returns a vector, also for tibbles (x[,i] does not)
                    col<-x[[i]]
                    if (is.numeric(col)){
                            medians[i]<-median(col, na.rm = na.rm)
                            Q1[i]<-quantile(col, 0.25, na.rm = na.rm)
                            Q3[i]<-quantile(col, 0.75, na.rm = na.rm)
                            means[i]<-round(mean(col, na.rm = na.rm), digits = digits)
                            SDs[i]<-round(sd(col, na.rm = na.rm), digits = digits)
                            if (normality_test=="shapiro"){
                                    p.value<-shapiro.test(col)$p.value
                            } else {
                                    p.value<-ks.test(col, "pnorm", means[i], SDs[i])$p.value
                            }
                            # p <= 0.05: treat as non-normal and report median (Q1-Q3)
                            if (p.value<=0.05){
                                    is.normal[i]<-FALSE
                                    output[i]<-paste0(medians[i], " (", Q1[i], "-", Q3[i], ")")
                            }else{
                                    # otherwise report mean +- SD
                                    is.normal[i]<-TRUE
                                    output[i]<-paste0(means[i], " +-", SDs[i])
                            }
                    }else  {
                            # Non-numeric columns (e.g. factors) get NA in every row
                            is.normal[i]<-NA
                            means[i]<-NA
                            medians[i]<-NA
                            Q1[i]<-NA
                            Q3[i]<-NA
                            SDs[i]<-NA
                            output[i]<-NA
                    }
            }

            # rbind mixes numbers and text, so every cell ends up as character
            df<-data.frame(rbind( "normal distr"=is.normal, "median"=medians, "Q1"=Q1, "Q3"=Q3, "mean"=means, "SD"=SDs, "output"=output))
            names(df)<-colnames(x)
            df
    }
    

    As an example:

    > descriptive(iris, normality_test="shapiro")
                  Sepal.Length Sepal.Width   Petal.Length   Petal.Width Species
    normal distr         FALSE        TRUE          FALSE         FALSE    <NA>
    median                 5.8           3           4.35           1.3    <NA>
    Q1                     5.1         2.8            1.6           0.3    <NA>
    Q3                     6.4         3.3            5.1           1.8    <NA>
    mean                  5.84        3.06           3.76           1.2    <NA>
    SD                    0.83        0.44           1.77          0.76    <NA>
    output       5.8 (5.1-6.4) 3.06 +-0.44 4.35 (1.6-5.1) 1.3 (0.3-1.8)    <NA>
    
    

    There are several ways to subset your data by a categorical variable for analysis; check dplyr's filter and group_by functions.
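
    If you would rather stay in base R, aggregate gives per-group medians without dplyr. This is a sketch of that alternative, not of the dplyr route itself; it assumes the same toy data retyped from the question (1-month columns only), and outcome_cols is just an illustrative name:

    ```r
    # Toy data retyped from the question (1-month outcomes only)
    patients <- data.frame(
      Patient  = c("A", "B", "C", "D", "E"),
      eGFR1m   = c(142, 175, 154, 170, 175),
      cr1m     = c(343, 192, 185, 215, 140),
      alb1m    = c(125, 121, 128, 215, 225),
      Diseased = c(1, 0, 0, 1, 0)
    )

    # Columns to summarise (illustrative selection)
    outcome_cols <- c("eGFR1m", "cr1m", "alb1m")

    # One row per disease group, one column per outcome
    group_medians <- aggregate(patients[outcome_cols],
                               by = list(Diseased = patients$Diseased),
                               FUN = median)
    print(group_medians)

    # Same idea for the interquartile range
    group_iqr <- aggregate(patients[outcome_cols],
                           by = list(Diseased = patients$Diseased),
                           FUN = IQR)
    print(group_iqr)
    ```

    With dplyr, group_by(Diseased) followed by summarise should give the same per-group table.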