thank you for helping me. I know enough about R to be dangerous, but not enough to be good. I'm trying to create a chi square matrix (exactly like a correlation matrix, but with chi sq) from a dataframe. I've searched extensively (including this page: Chi-square p value matrix in r which didn't help) even trying a loop, and I'm not getting there. Here's what I have so far:
R Studio version 1.2.5033 on Windows 10
data - 14 columns of 3401 rows, I'm only using 9 columns for the chi sq, 8 variables are binary, 1 is categorical
I've gotten the chi sq results I need, but I want to combine it in a matrix. As it is a standard matrix where columns get progressively smaller, the columns are not the same length, so most bind commands won't work. I do have plyr, and I am trying to do it that way, but am failing. Here's the code:
CA <- c(1, 0, 0, 1, 1, 1, 0, 1, 0, 0)
Pos <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
Mon <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 1)
Sc <- c(0, 0, 0, 0, 1, 1, 0, 0, 1, 0)
ood <- c(1, 1, 1, 1, 0, 1, 1, 0, 0, 1)
Eco <- c(1, 2, 4, 6, 7, 3, 2, 5, 7, 7)
Orp <- c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0)
BC <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 1)
SA <- c(0, 0, 1, 1, 1, 1, 1, 0, 0, 1)
MV <- c(1, 1, 0, 1, 1, 0, 1, 1, 0, 0)
Ad <- c(0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
YR <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
KC <- c(1, 1, 1, 1, 1, 1, 0, 1, 1, 1)
H <- c(1, 1, 1, 1, 1, 1, 1, 0, 0 ,1)
data <- data.frame(CA, Pos, Mon, Sc, ood, Eco, Orp, BC, SA, MV, Ad, YR, KC, H)
Or, here is the dput() data:
structure(list(CA = c(0, 0, 1, 1, 1, 1, 0, 1, 0), Pos = c(0,
0, 0, 0, 1, 1, 0, 1, 0), Mon = c(1, 0, 1, 1, 0, 1, 1, 1, 1),
Sc = c(0, NA, 0, 0, 1, 1, 1, 1, 0), ood = c(1, 1, 1, 0, 1,
0, 1, 1, 1), Eco = c(7, 6, 7, 0, 1, 6, 5, 1, 3), Orp = c(0,
0, 0, 0, 0, 0, 0, 0, 0), BC = c(1, 0, 1, 1, 1, 1, 1, 1, 1
), SA = c(0, 1, 0, 0, 0, 0, 0, 0, 0), MV = c(15, 13, 16,
12, 16, 14, 11, 18, 12), Ad = c(2, 2, 2, 3, 3, 2, 2, 2, 2
), YR = c(2, 2, 1, 1, 2, 2, 2, 1, 2), KC = c(2, 2, 1, 1,
1, 1, 1, 1, 1), H = c(0, 1, 1, 0, 0, 1, 1, 0, 1)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(CA = structure(list(), class = c("collector_double",
"collector")), Pos = structure(list(), class = c("collector_double",
"collector")), Mon = structure(list(), class = c("collector_double",
"collector")), Sc = structure(list(), class = c("collector_double",
"collector")), ood = structure(list(), class = c("collector_double",
"collector")), Eco = structure(list(), class = c("collector_double",
"collector")), Orp = structure(list(), class = c("collector_double",
"collector")), BC = structure(list(), class = c("collector_double",
"collector")), SA = structure(list(), class = c("collector_double",
"collector")), MV = structure(list(), class = c("collector_double",
"collector")), Ad = structure(list(), class = c("collector_double",
"collector")), YR = structure(list(), class = c("collector_double",
"collector")), KC = structure(list(), class = c("collector_double",
"collector")), H = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
library(tidyverse)
library(plyr)
library(reshape2)
chisqstat <- function(x, y){round(chisq.test(x, y)$statistic, 2)} #chi sq formula, pulling stat
chisqp <- function(x, y){round(chisq.test(x, y)$p.value, 4)} #chi sq formula pulling pvalue
chisqdata <- list(apply(data[c(2:7, 9, 14)], 2, chisqstat, y = data$CA), #chi sq statistic for all values with child abuse
apply(data[c(2:7, 9, 14)], 2, chisqp, y = data$CA), #chi sq pvalue for all values with child abuse
apply(data[c(3:7, 9, 14)], 2, chisqstat, y = data$Pos),
apply(data[c(3:7, 9, 14)], 2, chisqp, y = data$Pos),
apply(data[c(4:7, 9, 14)], 2, chisqstat, y = data$Mon),
apply(data[c(4:7, 9, 14)], 2, chisqp, y = data$Mon),
apply(data[c(5:7, 9, 14)], 2, chisqstat, y = data$Sc),
apply(data[c(5:7, 9, 14)], 2, chisqp, y = data$Sc),
apply(data[c(6:7, 9, 14)], 2, chisqstat, y = data$ood),
apply(data[c(6:7, 9, 14)], 2, chisqp, y = data$ood),
apply(data[c(7, 9, 14)], 2, chisqstat, y = data$Eco),
apply(data[c(7, 9, 14)], 2, chisqp, y = data$Eco),
apply(data[c(9, 14)], 2, chisqstat, y = data$Orp),
apply(data[c(9, 14)], 2, chisqp, y = data$Orp),
apply(data[c(14)], 2, chisqstat, y = data$SA),
apply(data[c(14)], 2, chisqp, y = data$SA))
do.call(rbind.fill.matrix(), chisqdata)
When I do each line as:
as.matrix(apply(data[c(2:7, 9, 14], 2, chisqstat, y = data$CA))
I get exactly what I want with row names so I'm wondering if I need to use as.matrix. I know I can do a full matrix where the info above and below the diagonal is the same, but I find that visually overwhelming. I want to get something like this (this is correlations not chi square):
CA Pos Mon Sc ood Eco
Pos 0.20
Mon -0.20 0.60
Sc 0.22 -0.22 -0.65
ood -0.22 0.22 0.65 -0.52
Eco 0.00 -0.18 -0.18 0.38 -0.58
Orp 0.41 0.00 -0.41 0.36 0.09 0.04
I'm sure I'm missing something obvious, so your help is greatly appreciated. Or maybe there is a much nicer way to do this in general, again, thank you.
It looks like your dput
data added is a tibble
, which is probably from how your data was read in:
> class(data)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
While your previous example was a plain data.frame
(I called it df
).
This plays a role in how your columns are extracted in the chisqmatrix
code from here. Specifically:
m[i,j] = chisq.test(x[,i],x[,j],)$p.value
With a simple data.frame
, selecting the first column gives you a numeric vector, which is what you want:
> df[,1]
[1] 0 0 1 1 1 1 0 1 0
> class(df[,1])
[1] "numeric"
While with a tibble
, you will get another tibble
:
> data[,1]
# A tibble: 9 x 1
CA
<dbl>
1 0
2 0
3 1
4 1
5 1
6 1
7 0
8 1
9 0
> class(data[,1])
[1] "tbl_df" "tbl" "data.frame"
So, two ways to deal with this. One way is to use double brackets (e.g., df[[1]]
) which will extract a numeric vector:
> data[[1]]
[1] 0 0 1 1 1 1 0 1 0
Or, you can add drop = TRUE
to coerce to a vector (e.g., df[,1, drop = TRUE]
):
> data[,1,drop=T]
[1] 0 0 1 1 1 1 0 1 0
Because with a tibble
the [
defaults to drop = FALSE
:
https://tibble.tidyverse.org/reference/subsetting.html
So, here is the same function to reproduce your desired result:
chisqmatrix <- function(x) {
names = colnames(x); num = length(names)
m = matrix(nrow=num,ncol=num,dimnames=list(names,names))
for (i in 1:(num-1)) {
for (j in (i+1):num) {
#browser()
m[j,i] = chisq.test(x[, i, drop = TRUE],x[, j, drop = TRUE])$p.value
}
}
return (m)
}
mat <- chisqmatrix(data[c("CA", "Pos", "Mon", "Sc", "ood", "Eco")])
mat[-1, -ncol(mat)]
CA Pos Mon Sc ood
Pos 0.2356799 NA NA NA NA
Mon 1.0000000 1.0000000 NA NA NA
Sc 1.0000000 0.1441270 1.0000000 NA NA
ood 0.5303348 1.0000000 1.0000000 1.0000000 NA
Eco 0.4220127 0.2399069 0.6669878 0.1562356 0.2959326
Note: I removed "Orp" as this will give an error from the sample data where there is only a single level; so you can replace with:
chisqmatrix(data[c("CA", "Pos", "Mon", "Sc", "ood", "Eco", "Orp")])
if using your full data.
Please let me know if this works for you.