r dataframe matrix tibble

What are the differences between data.frame, tibble and matrix?


In R, some functions only work on a data.frame and others only on a tibble or a matrix.

Converting my data with as.data.frame or as.matrix often solves this, but I am wondering how the three differ.


Solution

  • They differ because they serve different purposes.

    Short summary:

    • A data frame is a list of equal-length vectors. This means that adding a column is as easy as adding a vector to a list. It also means that each column has a single data type, but different columns can have different types. This makes data frames useful for data storage.

    • A matrix is a special case of an atomic vector that has two dimensions. This means the whole matrix must have a single data type, which makes matrices well suited to algebraic operations. It can also make numeric operations faster in some cases, since no per-column type checks are needed. However, if you are careful with your data frames, the difference will not be large.

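    • A tibble is a modernized version of a data frame used in the tidyverse. The tibble documentation describes tibbles as 'lazy and surly': they do less (they do not change column names or types and do not partially match column names) and complain more (for example, when a column does not exist).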
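
    The three descriptions above can be checked directly in an R session. The following is a minimal sketch, not a definitive reference: the object names (df, m, tb) and the sample values are made up for illustration, and the tibble part only runs if the tibble package happens to be installed.

```r
# A data frame is a list of equal-length vectors, one per column.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
is.list(df)                    # TRUE
sapply(df, class)              # each column keeps its own type

# Adding a column works like adding an element to a list.
df$score <- c(0.5, 1.5, 2.5)
length(df)                     # number of columns (list elements): 3

# A matrix is an atomic vector with a "dim" attribute.
m <- matrix(1:6, nrow = 2)
is.atomic(m)                   # TRUE
dim(m)                         # 2 rows, 3 columns
typeof(m)                      # one type for the whole matrix: "integer"

# A tibble is still a data frame, with extra classes on top.
# Assumption: the tibble package is installed; skipped otherwise.
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(class(tb))
  print(is.data.frame(tb))
}
```

    Because a tibble still inherits from data.frame, most functions written for data frames accept it; the functions that fail are usually those relying on base data frame quirks such as partial name matching.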

    Long description of matrices, data frames and other data structures as used in R.

    So to sum up: matrices and data frames are both 2D data structures. Each serves a different purpose and thus behaves differently. The tibble is an attempt to modernize the data frame and is used throughout the widely adopted tidyverse.

    If I try to rephrase it from a less technical perspective: Each data structure is making tradeoffs.

    • Data frame is trading a little of its efficiency for convenience and clarity.
    • Matrix is efficient, but harder to wield since it enforces restrictions upon its data.
    • Tibble trades a bit more efficiency for even more convenience, and offsets part of that cost by doing less work up front (for example, it skips name and type conversion and prints only the first rows).
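
    A short sketch of how these tradeoffs show up in practice, and of what the conversions mentioned in the question actually do. The sample data and object names are hypothetical, and the tibble comparison is only run if the tibble package is installed.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# Matrix restriction: converting mixed columns forces a single type.
mixed <- as.matrix(df)
typeof(mixed)                  # "character": the integer ids became strings

# Matrix efficiency: an all-numeric matrix supports linear algebra directly.
num <- as.matrix(data.frame(x = c(1, 2), y = c(3, 4)))
prod <- num %*% t(num)         # matrix product, not defined for data frames
prod

# Data frame convenience: selecting one column silently simplifies it
# to a plain vector, which is handy but can surprise downstream code.
class(df[, "id"])              # "integer"

# Tibble strictness: the same selection stays a (one-column) tibble.
# Assumption: the tibble package is installed; skipped otherwise.
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(class(tb[, "id"]))
}
```

    This is why as.matrix can quietly change your data: if a single column is non-numeric, the whole result is converted to character.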