Tags: r, plot, graph, graphing

Plotting problems - mishandling factor variable as numeric


I am not entirely sure what to name the problem I am having with the plotting function in R...

In my original dataset I had a variable called age with these levels: 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 2X, 30, 40, 50, 60. When I plotted age using:

plot(age, xlab="Age", ylab="Number of observations")

I got this plot (a bar plot with age on the x-axis and number of observations on the y-axis):

Plot with 2X in the data

I then removed 2X (used for people somewhere in their 20s) from the data and re-ran the same code. The plot now looks like this (a plot with age on the y-axis):

Plot without 2X in the data

If anyone has any ideas about why the plot now has the age on the y-axis, please let me know! Thank you in advance for your help!


Solution

  • Diagnostic

    You are running into an S3 method dispatch issue. plot is a generic function:

    methods(plot)
    # [1] plot.acf*           plot.data.frame*    plot.decomposed.ts*
    # [4] plot.default        plot.dendrogram*    plot.density*      
    # [7] plot.ecdf           plot.factor*        plot.formula*      
    #[10] plot.function       plot.hclust*        plot.histogram*    
    #[13] plot.HoltWinters*   plot.isoreg*        plot.lm*           
    #[16] plot.medpolish*     plot.mlm*           plot.ppr*          
    #[19] plot.prcomp*        plot.princomp*      plot.profile.nls*  
    #[22] plot.raster*        plot.spec*          plot.stepfun       
    #[25] plot.stl*           plot.table*         plot.ts            
    #[28] plot.tskernel*      plot.TukeyHSD*     
    

    Comments above asked you to provide str(age) before and after removing 2X, because that output tells us which method is dispatched when plot is called.

    With 2X in the data, age is definitely a factor, so calling plot invokes plot.factor and produces a bar plot.

    Once you removed 2X, age apparently became a numeric variable, so calling plot invokes plot.default and produces a scatter plot. In that case plot(age) is essentially plot.default(1:length(age), age): the observation index goes on the x-axis and age goes on the y-axis, which is exactly what your second plot shows.
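
    To make the dispatch visible without your data, here is a minimal sketch. The two vectors are hypothetical stand-ins for age before and after removing 2X, not your actual values, and the last check assumes a plain R session in which no package has registered a plot method for numeric vectors.

    ```r
    # Hypothetical stand-ins for the two versions of the age variable.
    age_with_2x <- factor(c("15", "16", "16", "2X", "30"))
    age_without_2x <- c(15, 16, 16, 30)

    class(age_with_2x)     # "factor"
    class(age_without_2x)  # "numeric"

    # getS3method(..., optional = TRUE) returns NULL when no method is registered.
    factor_method <- getS3method("plot", "factor", optional = TRUE)
    numeric_method <- getS3method("plot", "numeric", optional = TRUE)

    is.function(factor_method)  # plot.factor exists, so factors get a bar plot
    is.null(numeric_method)     # no plot.numeric, so plot.default is used instead
    ```

    The class of the object is all that plot looks at when choosing a method, which is why str(age) or class(age) is the first thing worth checking.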


    Solution

    One way that would definitely work is

    plot(factor(age), xlab="Age", ylab="Number of observations")
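
    As a rough illustration of why this works, the sketch below uses a small made-up numeric age vector (an assumption, not your data). Converting it to a factor sends the call back to plot.factor, which draws one bar per level; calling barplot on table(age) gives the same kind of plot explicitly.

    ```r
    # Made-up numeric ages, standing in for the data after 2X was removed.
    age <- c(15, 16, 16, 17, 30, 30, 30)

    # Converting to a factor makes each distinct age a level.
    age_factor <- factor(age)
    levels(age_factor)

    # Counts per level; these are the bar heights.
    counts <- table(age_factor)

    # Dispatches to plot.factor, so a bar plot is drawn.
    plot(age_factor, xlab = "Age", ylab = "Number of observations")

    # The same kind of plot without relying on dispatch.
    barplot(table(age), xlab = "Age", ylab = "Number of observations")
    ```

    The explicit barplot(table(age)) form has the advantage that the result no longer depends on how age happens to be stored.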
    

    However, I am still curious how you removed the 2X subset so that age became numeric. Normally, if age is a factor in R, removing a subset does not change the variable's class.
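
    A quick check of that claim, using a hypothetical factor rather than your data: subsetting inside R keeps the factor class, and even keeps 2X as an unused level until droplevels is called.

    ```r
    # Hypothetical factor containing the 2X level.
    age <- factor(c("15", "16", "2X", "30", "2X"))

    # Remove the 2X observations inside R.
    age_subset <- age[age != "2X"]

    class(age_subset)   # still "factor"
    levels(age_subset)  # "2X" is still listed as a level, now with zero observations

    # Drop levels that no longer occur in the data.
    age_clean <- droplevels(age_subset)
    levels(age_clean)
    ```

    So if 2X had been removed this way, plot would still have produced a bar plot.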

    Presumably age is stored in a .txt or .csv file and you read it with scan(), read.table() or read.csv(). If you removed the 2X entries in the file itself and then read the data into R again, R would infer a different class for age at read-in time: once every remaining value is numeric, the column is read as numeric rather than as a factor.
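
    This guess can be reproduced without any files by passing text directly to read.csv(). The two inline datasets below are invented for illustration. stringsAsFactors = TRUE is set explicitly because it was the default only before R 4.0.0; with the newer default the first column would be read as character instead of factor.

    ```r
    # Invented data: one column containing "2X", one without it.
    with_2x <- read.csv(text = "age\n15\n16\n2X\n30", stringsAsFactors = TRUE)
    without_2x <- read.csv(text = "age\n15\n16\n30", stringsAsFactors = TRUE)

    # "2X" cannot be parsed as a number, so the whole column becomes a factor.
    class(with_2x$age)

    # Every value parses as a number, so the column is read as integer.
    class(without_2x$age)
    ```

    If the column must always be treated as categorical, wrapping it in factor() after reading avoids depending on what the file happens to contain.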