I'm not able to get a summary for one of the columns in R


I have a data frame called df.

    id       time  internet lat lng
103  1 1385913600 14.057844   1   0
247  2 1385913600 14.062213   2   0
391  3 1385913600 14.066863   3   0
535  4 1385913600 14.045190   4   0
679  5 1385913600 12.772210   5   0
823 10 1385913600  8.101804  10   0

I added a new column and set all of its values to 0 using one of the methods below:

df["cluster"] <- 0
df$cluster <- 0

Then my algorithm changed the value of df$cluster for each row. This is how I update the values:

clusternumber <- clusternumber + 1
df$cluster[df$id == minid] <- clusternumber

In the end I got the result I was looking for, but I've run into a new problem. When I try to get a summary of my result, the output for the cluster column looks strange.

> summary(df)
       id           internet            lat              lng            cluster    
 Min.   :    1   Min.   :   0.00   Min.   :  1.00   Min.   :  0.00   1      : 121  
 1st Qu.: 2500   1st Qu.:  15.57   1st Qu.: 25.25   1st Qu.: 25.00   2      : 121  
 Median : 5000   Median :  36.09   Median : 51.00   Median : 49.50   3      : 121  
 Mean   : 5000   Mean   :  75.73   Mean   : 50.50   Mean   : 49.51   4      : 121  
 3rd Qu.: 7501   3rd Qu.:  78.88   3rd Qu.: 75.75   3rd Qu.: 75.00   9      : 121  
 Max.   :10000   Max.   :6663.23   Max.   :100.00   Max.   :100.00   15     : 121  
                                                                     (Other):9272    

How should I create the new column, or change its values, so that summary works as expected? Right now I get this:

> summary(df$cluster)
      1       2       3       4       9      15      16      17      34      52      85     147       8       6       7      36 
    121     121    other(2727)

Solution

  • The output of your summary call clearly shows that the cluster column is a factor. Below is a simple example.

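    # Create an example data frame.
    # stringsAsFactors = TRUE is needed in R >= 4.0.0, where the default is FALSE;
    # without it Col_f would be a character column rather than a factor.
    dat <- data.frame(Col_f = c("1.1", "1.1", "2.1", "2.1", "3.1", "3.1", 
                                "4.1", "4.1", "4.1"),
                      Col_n = c(1.1, 1.1, 2.1, 2.1, 3.1, 3.1, 4.1, 4.1, 4.1),
                      stringsAsFactors = TRUE)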
    
    # Check the structure of the data frame
    str(dat)
    # 'data.frame': 9 obs. of  2 variables:
    # $ Col_f: Factor w/ 4 levels "1.1","2.1","3.1",..: 1 1 2 2 3 3 4 4 4
    # $ Col_n: num  1.1 1.1 2.1 2.1 3.1 3.1 4.1 4.1 4.1
    
    # Use summary
    summary(dat)
    #   Col_f       Col_n      
    # 1.1:2   Min.   :1.100  
    # 2.1:2   1st Qu.:2.100  
    # 3.1:2   Median :3.100  
    # 4.1:3   Mean   :2.767  
    #         3rd Qu.:4.100  
    #         Max.   :4.100
    

    Notice that for Col_f the summary function simply reports the number of rows in each level, while for the numeric Col_n it reports quantiles and the mean.
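
    Rather than inferring the type from the summary output, you can check it directly. A minimal sketch, using a small hypothetical data frame `dat_check` (the name and values are made up for illustration):

    ```r
    # Hypothetical two-column data frame: one factor, one numeric
    dat_check <- data.frame(Col_f = factor(c("1.1", "2.1", "2.1")),
                            Col_n = c(1.1, 2.1, 2.1))

    # Class of every column at once (named character vector)
    col_classes <- sapply(dat_check, class)
    print(col_classes)

    # Test a single column
    f_is_factor <- is.factor(dat_check$Col_f)

    # Per-level counts, the same numbers summary() reports for a factor
    level_counts <- table(dat_check$Col_f)
    print(level_counts)
    ```

    If the column shows up as "factor" here, summary will count levels instead of computing quantiles.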

    To convert the factor to numeric, convert the column to character first, then convert the result to numeric. Here is an example.

    # Convert the column of factor to numeric
    dat$Col_fn <- as.numeric(as.character(dat$Col_f))
    

    Notice that Col_fn is the same as Col_n.

    # Call str again
    str(dat)
    # 'data.frame': 9 obs. of  3 variables:
    # $ Col_f : Factor w/ 4 levels "1.1","2.1","3.1",..: 1 1 2 2 3 3 4 4 4
    # $ Col_n : num  1.1 1.1 2.1 2.1 3.1 3.1 4.1 4.1 4.1
    # $ Col_fn: num  1.1 1.1 2.1 2.1 3.1 3.1 4.1 4.1 4.1
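
    An alternative that the `?factor` help page describes as slightly more efficient is to convert the levels once and then index them by the factor. A minimal sketch, using a small hypothetical factor `f` rather than the data frame above:

    ```r
    # Hypothetical factor whose labels are numbers stored as text
    f <- factor(c("1.1", "1.1", "2.1", "4.1"))

    # Convert each distinct level once, then index by the factor's integer codes
    via_levels <- as.numeric(levels(f))[f]

    # The two-step conversion through character, for comparison
    via_char <- as.numeric(as.character(f))

    # Both approaches recover the label values
    print(all.equal(via_levels, via_char))
    ```

    Either form works; the character route is easier to read, while the levels route does less work when there are many rows and few levels.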
    

    If you convert a factor directly to numeric, you get the internal integer codes of the levels, not the values shown as labels. Here is an example.

    # Convert the factor directly to numeric (this returns the level codes)
    dat$Col_ff <- as.numeric(dat$Col_f)
    
    # Use str again
    str(dat)
    # 'data.frame': 9 obs. of  4 variables:
    # $ Col_f : Factor w/ 4 levels "1.1","2.1","3.1",..: 1 1 2 2 3 3 4 4 4
    # $ Col_n : num  1.1 1.1 2.1 2.1 3.1 3.1 4.1 4.1 4.1
    # $ Col_fn: num  1.1 1.1 2.1 2.1 3.1 3.1 4.1 4.1 4.1
    # $ Col_ff: num  1 1 2 2 3 3 4 4 4
    

    Notice that Col_ff contains the integers 1 to 4, because those are the level codes rather than the original values.
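
    Applied to the question, the same two-step conversion might look like the sketch below. The data frame here is a small hypothetical stand-in for the asker's `df`, and it assumes the cluster labels are numbers stored in a factor:

    ```r
    # Hypothetical stand-in for the asker's data: cluster labels stored as a factor
    df <- data.frame(id = 1:6,
                     cluster = factor(c("1", "2", "15", "15", "2", "1")))

    # Confirm the column type before converting
    print(class(df$cluster))

    # Go through character so the labels, not the level codes, become the numbers
    df$cluster <- as.numeric(as.character(df$cluster))

    # summary() now reports quantiles instead of per-level counts
    print(summary(df$cluster))
    ```

    If what you actually want is the number of rows per cluster, `table(df$cluster)` gives those counts whether the column is numeric or a factor.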