
Survey analysis with categorical data and chart plotting


I have survey data from which I built a data frame in R that looks similar to this:

    cnt <- as.factor(c("Country 1", "Country 2", "Country 3", "Country 1", "Country 2", "Country 3"))
    bnk <- as.factor(c("bank 1", "bank 2", "bank 3", "bank 1", "bank 2", "bank 3"))
    qst <- as.factor(c("q1", "q1", "q1", "q2", "q2", "q2"))
    ans <- as.numeric(c(1, 1, 2, 1, 2, 3))
    df  <- data.frame(cnt, bnk, qst, ans)
    names(df) <- c("Country", "Institute", "Question", "Answer")

        Country Institute Question Answer
    1 Country 1    bank 1       q1      1
    2 Country 2    bank 2       q1      1
    3 Country 3    bank 3       q1      2
    4 Country 1    bank 1       q2      1
    5 Country 2    bank 2       q2      2
    6 Country 3    bank 3       q2      3

This data frame shows two different questions, q1 and q2, to which the participants (here, banks from different countries) respond on a numeric scale.

My goal is simple: for each question, I want to calculate and then plot the percentage of banks that responded with 1, the percentage that responded with 2, and so on.

In this example there are three banks. For question 1, two of them answered 1 and one answered 2. So I want to visualize, e.g. through a bar chart, that 2/3 of the banks (approx. 67%) answered 1 and 1/3 (approx. 33%) answered 2. Similarly for question 2.

I'm not sure whether it matters, but the range of possible numeric answers may vary by question. That is, for q1 the available answers range from 1 to 2, but for q2 they might range from 1 to 5.

Can someone suggest how I can quickly implement this in R?

Of course, one crude way is to count the number of banks, count the number of ones in q1 (and in q2), and then calculate the respective fractions. This method, however, is very time-consuming, and I wonder whether much better options are available in R.

UPDATE

Building on the above, I want to create, for a couple of questions, a bar chart that looks like this:

[image]

In the example above, the responses to question 8 that were equal to 1 are labeled "My bank has being ...." and the responses equal to 2 are labeled "My bank has being started ...", as the chart shows.

Nevertheless, we can ignore the labeling part for the moment, as putting only 1 and 2 on the x-axis will be sufficient.


Solution

  • Here's a quick answer with ggplot

    library(ggplot2)
    
    ggplot(df, aes(x=Question, fill=factor(Answer))) + geom_bar()
    

    The output looks like this:

    [image]

    To calculate the percentage:

    library(dplyr)
    library(tidyr)
    
    (dat <- df %>% spread(Question, Answer))
        Country Institute q1 q2
    1 Country 1    bank 1  1  1
    2 Country 2    bank 2  1  2
    3 Country 3    bank 3  2  3
    
    dat$q1 %>% table/nrow(dat)
            1         2 
    0.6666667 0.3333333 
    
    dat$q2 %>% table/nrow(dat)
    
            1         2         3 
    0.3333333 0.3333333 0.3333333 
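
As an alternative sketch (not part of the original answer), the shares for all questions can be computed in one step from the long-format data with base R's `table()` and `prop.table()`. The name `survey` is introduced here only for illustration, and the sketch assumes that each bank answers each question at most once, so the denominator is the number of responses per question.

```r
# Rebuild the example data from the question (long format, one row per response).
survey <- data.frame(
  Country   = c("Country 1", "Country 2", "Country 3", "Country 1", "Country 2", "Country 3"),
  Institute = c("bank 1", "bank 2", "bank 3", "bank 1", "bank 2", "bank 3"),
  Question  = c("q1", "q1", "q1", "q2", "q2", "q2"),
  Answer    = c(1, 1, 2, 1, 2, 3)
)

# Cross-tabulate questions (rows) against answers (columns).
counts <- table(survey$Question, survey$Answer)

# margin = 1 divides every row by its own total, giving shares within each question.
shares <- prop.table(counts, margin = 1)

# Display the shares as percentages.
round(100 * shares, 1)
```

Each row of `shares` sums to 1. Answers that nobody gave for a question (such as 3 for q1) appear with a share of 0.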
    

    Edit: Added a plot for the comment below

    ggplot(df, aes(x=Answer, fill=factor(Question))) + geom_bar()
    

    [image]
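
The plots above show counts. To plot the percentage of responses within each question instead, one possible sketch is to compute the shares first and pass them to `geom_col()`, with one panel per question. The names `survey`, `pct`, and `share` are illustrative and not from the original answer, and the denominator is assumed to be the number of responses per question.

```r
library(dplyr)
library(ggplot2)

# Example data from the question, reduced to the two columns needed here.
survey <- data.frame(
  Question = c("q1", "q1", "q1", "q2", "q2", "q2"),
  Answer   = c(1, 1, 2, 1, 2, 3)
)

pct <- survey %>%
  count(Question, Answer) %>%      # number of responses per question/answer pair
  group_by(Question) %>%
  mutate(share = n / sum(n)) %>%   # share of each answer within its question
  ungroup()

# One panel per question; free x scales let each panel show only its own answers.
p <- ggplot(pct, aes(x = factor(Answer), y = share)) +
  geom_col(width = 0.5) +
  facet_wrap(~ Question, scales = "free_x") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Answer", y = "Percent of responses")

print(p)
```

Computing `pct` explicitly also leaves the percentages available as a plain table, which may be handy for checking the numbers behind the chart.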

    Edit: Added to address the updated question:

    df <- data.frame(answer = c(rep(1, 97), rep(2, 3)))
    
    # after_stat() replaces the older ..count.. notation (ggplot2 >= 3.3.0)
    ggplot(df, aes(x = as.factor(answer))) + 
      geom_bar(aes(y = after_stat(count / sum(count))), width = .5) + 
      scale_y_continuous(labels = scales::percent) +
      geom_text(aes(y = after_stat(count / sum(count)),
                    label = after_stat(scales::percent(count / sum(count)))),
                stat = "count", vjust = -0.25) +
      labs(title = "Question 8", y = "Percent", x = "") +
      scale_x_discrete(labels = c("My bank has been using \n guarantees already for \n more than 5 years", "My bank has started to use \n guarantees in the last 5 years"))
    

    [image]
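
The question notes that the answer scale may differ by question, for example 1 to 5 for q2. A minimal sketch of one way to keep unused answers in the result, assuming the full scale is known in advance, is to declare it as factor levels before tabulating. The names `q2_answers`, `q2_factor`, and `q2_share` are illustrative.

```r
# Answers to q2 from the question's example; only 1, 2 and 3 were actually given.
q2_answers <- c(1, 2, 3)

# Declaring the full 1-5 scale as levels keeps answers 4 and 5 in the table.
q2_factor <- factor(q2_answers, levels = 1:5)

# Shares over the full scale; unused answers get a share of 0.
q2_share <- prop.table(table(q2_factor))
q2_share
```

When plotting such a factor with ggplot2, adding `scale_x_discrete(drop = FALSE)` should keep the empty categories on the axis as well.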