I would like to know how to rearrange the source data (table) in order to output the desired table using R or SQL, which are displayed below.
Since looping is very slow in R, and my dataset is quite large... it's not preferred to have too much looping in the script. The efficiency is important.
Source data table:
Date | Country | ID | Fruit | Favorite | Money
20120101 US 1 Apple Book 100
20120101 US 2 Orange Knife 150
20120101 US 3 Banana Watch 80
20120101 US 4 Melon Water 90
20120102 US 1 Apple Phone 120
20120102 US 2 Apple Knife 130
20120102 US 3 Banana Watch 100
..... ...... .. ..... ...... ......
Output table:
Date | Country | Field | ID 1 | ID 2 | ID 3 | ID 4
20120101 US Fruit Apple Orange Banana Melon
20120101 US Favorite Book Knife Watch Water
20120101 US Money 100 150 80 90
20120102 US Fruit Apple Apple Banana N.A.
.... .... .... .... .... .... ....
Here is an approach in R, using your sample data:
x <- cbind(mydf[, c("Date", "Country", "ID")],
stack(mydf[, c("Fruit", "Favorite", "Money")]))
reshape(x, direction = "wide", idvar = c("Date", "Country", "ind"), timevar="ID")
# Date Country ind values.1 values.2 values.3 values.4
# 1 20120101 US Fruit Apple Orange Banana Melon
# 5 20120102 US Fruit Apple Apple Banana <NA>
# 8 20120101 US Favorite Book Knife Watch Water
# 12 20120102 US Favorite Phone Knife Watch <NA>
# 15 20120101 US Money 100 150 80 90
# 19 20120102 US Money 120 130 100 <NA>
To round up with other options, here's a melt
+ dcast
approach (which can be taken from "data.table" or "reshape2") and a "dplyr" + "tidyr" approach.
library(data.table)
dcast(
suppressWarnings(
melt(as.data.table(mydf), c("Date", "Country", "ID"))),
... ~ ID, value.var = "value")
# Date Country variable 1 2 3 4
# 1: 20120101 US Fruit Apple Orange Banana Melon
# 2: 20120101 US Favorite Book Knife Watch Water
# 3: 20120101 US Money 100 150 80 90
# 4: 20120102 US Fruit Apple Apple Banana NA
# 5: 20120102 US Favorite Phone Knife Watch NA
# 6: 20120102 US Money 120 130 100 NA
library(dplyr)
library(tidyr)
mydf %>%
gather(variable, value, Fruit:Money) %>%
spread(ID, value)
# Date Country variable 1 2 3 4
# 1 20120101 US Fruit Apple Orange Banana Melon
# 2 20120101 US Favorite Book Knife Watch Water
# 3 20120101 US Money 100 150 80 90
# 4 20120102 US Fruit Apple Apple Banana <NA>
# 5 20120102 US Favorite Phone Knife Watch <NA>
# 6 20120102 US Money 120 130 100 <NA>
In this answer, mydf
is defined as:
mydf <- structure(
list(Date = c(20120101L, 20120101L, 20120101L,
20120101L, 20120102L, 20120102L, 20120102L),
Country = c("US", "US", "US", "US", "US", "US", "US"),
ID = c(1L, 2L, 3L, 4L, 1L, 2L, 3L),
Fruit = c("Apple", "Orange", "Banana", "Melon",
"Apple", "Apple", "Banana"),
Favorite = c("Book", "Knife", "Watch", "Water",
"Phone", "Knife", "Watch"),
Money = c(100L, 150L, 80L, 90L, 120L, 130L, 100L)),
.Names = c("Date", "Country", "ID",
"Fruit", "Favorite", "Money"),
class = "data.frame", row.names = c(NA, -7L))