I would like to run a hierarchical clustering analysis on data imported from a .csv file into R. I'm having trouble retaining the first column as row names, so my dendrogram tips end up with no names, which makes the result useless for downstream analyses and for linking with meta-data.
When I import the .csv file and pass the dataframe, including the first column of row names, to the dist function, I get a warning: "Warning message: In dist(as.matrix(df)) : NAs introduced by coercion". I found a previous Stack Overflow question that addressed this: "NAs introduced by coercion" during Cluster Analysis in R. The solution offered there was to remove the row names. But this also removes the tip labels from the resulting distance matrix, which I need for making sense of the dendrogram and for linking to meta-data downstream (e.g. to add colour to dendrogram tips or a heat map based on other variables).
# Generate dataframe with example numbers
Samples <- c('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D', 'Sample_E')
Variable_A <- c(0, 1, 1, 0, 1)
Variable_B <- c(0, 1, 1, 0, 1)
Variable_C <- c(0, 0, 1, 1, 1)
Variable_D <- c(0, 0, 1, 1, 0)
Variable_E <- c(0, 0, 1, 1, 0)
df = data.frame(Samples, Variable_A, Variable_B, Variable_C, Variable_D, Variable_E, row.names=c(1))
df
# generate distance matrix
d <- dist(as.matrix(df))
# apply hierarchical clustering
hc <- hclust(d)
# plot dendrogram
plot(hc)
That all works fine. But let's say I want to import my real data from a file...
# writing the example dataframe to file
write.csv(df, file = "mock_df.csv")
# importing a file
df_import <- read.csv('mock_df.csv', header=TRUE)
I no longer get the original row names using the same code as above:
# generating distance matrix for imported file
d2 <- dist(as.matrix(df_import))
# apply hierarchical clustering
hc2 <- hclust(d2)
# plot dendrogram
plot(hc2)
Everything works fine with the df created in R, but I lose row names with the imported data. How do I solve this?
Samples <- c('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D', 'Sample_E')
Variable_A <- c(0, 1, 1, 0, 1)
Variable_B <- c(0, 1, 1, 0, 1)
Variable_C <- c(0, 0, 1, 1, 1)
Variable_D <- c(0, 0, 1, 1, 0)
Variable_E <- c(0, 0, 1, 1, 0)
df = data.frame(Samples, Variable_A, Variable_B, Variable_C, Variable_D, Variable_E, row.names=c(1))
df
d <- dist(as.matrix(df))
hc <- hclust(d)
plot(hc)
df
write.csv(df, file = "mock_df.csv",row.names = TRUE)
df_import <- read.table('mock_df.csv', header=TRUE,row.names=1,sep=",")
d2 <- dist(as.matrix(df_import))
hc2 <- hclust(d2)
plot(hc2)
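In other words, pass row.names=1 when importing, so that the first column of the file becomes the row names instead of a character data column. I used read.table here, but read.csv accepts the same row.names argument: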
df_import <- read.table('mock_df.csv', header=TRUE,row.names=1,sep=",")
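As a quick check that the labels survive the round trip, here is a minimal sketch using read.csv with row.names = 1. It rebuilds the mock data from the question; csv_path is a hypothetical temporary file rather than 'mock_df.csv', and df_raw / df_fixed are illustrative names for an alternative that repairs a frame after it has already been imported without row names.

```r
# Recreate the mock data from the question, with sample names as row names
df <- data.frame(
  Variable_A = c(0, 1, 1, 0, 1),
  Variable_B = c(0, 1, 1, 0, 1),
  Variable_C = c(0, 0, 1, 1, 1),
  Variable_D = c(0, 0, 1, 1, 0),
  Variable_E = c(0, 0, 1, 1, 0),
  row.names = c('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D', 'Sample_E')
)

# Hypothetical temporary file standing in for the real .csv
csv_path <- tempfile(fileext = ".csv")
write.csv(df, file = csv_path, row.names = TRUE)

# Option 1: tell read.csv that column 1 holds the row names
df_import <- read.csv(csv_path, header = TRUE, row.names = 1)

# All remaining columns should be numeric, so dist() has nothing to coerce
stopifnot(all(sapply(df_import, is.numeric)))

# The sample names are carried through dist() into the hclust labels
hc2 <- hclust(dist(as.matrix(df_import)))
print(hc2$labels)

# Option 2: repair a frame that was imported without row.names
df_raw <- read.csv(csv_path, header = TRUE)
rownames(df_raw) <- df_raw[[1]]   # first column holds the sample names
df_fixed <- df_raw[, -1]          # drop it from the data columns
hc3 <- hclust(dist(as.matrix(df_fixed)))
```

Calling plot(hc2) or plot(hc3) afterwards should show the sample names at the dendrogram tips, and the same names remain available in hc2$labels for joining against meta-data.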