Update:
It turned out to be caused by the classes of the columns.
Many thanks to @r2evans, who solved this by converting integer64 to numeric when reading the data. His method works, but what is worth studying further is his problem-solving logic.
I deleted the data for confidentiality reasons.
I plotted histograms of all numeric columns in my data table.
head(dt) %>%
  keep(is.numeric) %>%
  gather() %>% na.omit() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
I used head() because the data table is too large.
Then I got this error:
Error in if (length(unique(intervals)) > 1 & any(diff(scale(intervals)) < : missing value where TRUE/FALSE needed
Then I ran
eg <- head(dt)
write.csv2(head(dt), "eg.csv")
and saved eg here on GitHub.
Then I read it back:
eg <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg.csv")
eg %>%
  keep(is.numeric) %>%
  gather() %>% na.omit() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
I got the correct histograms!
What changed when I saved the data and read it back? Or is there a way to fix dt?
PS: dt itself was also created by saving a CSV and reading it back with fread. When I use
eg <- head(dt, 10000)
save it to GitHub, and read it back, the same error occurs.
Is it because my dt is too long (3 million rows) and contains some bad rows?
The symptom is that two of your fields appear invariant. After downloading the full data dt:
dt <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv")
dt %>%
  keep(is.numeric) %>%
  gather() %>%
  na.omit() %>%
  group_by(key) %>%
  summarize(v = var(value))
# Warning: attributes are not identical across measure variables;
# they will be dropped
# # A tibble: 9 x 2
#   key                          v
#   <chr>                    <dbl>
# 1 area_size_high         1.00e18
# 2 area_size_low          3.64e10
# 3 lot_size_high          8.76e17
# 4 lot_size_low           5.60e 5
# 5 price_huf_high         0.       ### problem!
# 6 price_huf_low          0.
# 7 total_room_count_high  3.23e17
# 8 total_room_count_low   1.46e 0
# 9 V1                     8.33e 6
(Many plots tend to implode when the data is invariant.)
This is confusing, though, because head(dt) definitely shows different values (right side):
      V1         ds                            search_id property_type property_subtype price_huf_low price_huf_high
   <int>     <IDat>                               <char>        <char>           <char>         <i64>          <i64>
1:     1 2021-02-15 ad2be212-0c25-4e3a-aabf-be089053beba         house             <NA>      45000000       69000000
2:     2 2021-02-15 ab72ba19-d00f-49e2-8d0d-c6836f030758     apartment             <NA>             0       48000000
3:     3 2021-02-06 24bbb050-2ecb-4078-a8dc-65e968f72f43     apartment             <NA>     150000000      200000000
4:     4 2021-02-06 f7d87e6e-0f24-4d9e-ae82-2a448d6290bf     apartment             <NA>       2000000       29000000
5:     5 2021-02-14 71ea3cc4-5326-4bbe-a2ff-20dbae0d9aa8     apartment             <NA>     200000000      400000000
(truncated).
However, the key thing to notice there is the i64, which indicates that these columns are 64-bit integers.
sapply(dt, function(z) class(z)[1])
#                    V1                    ds             search_id         property_type      property_subtype
#             "integer"               "IDate"           "character"           "character"           "character"
#         price_huf_low        price_huf_high         area_size_low        area_size_high          lot_size_low
#           "integer64"           "integer64"             "integer"             "integer"             "integer"
#         lot_size_high  total_room_count_low total_room_count_high              district
#             "integer"             "integer"             "integer"           "character"
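To see why the integer64 class matters, here is a minimal sketch. It assumes the bit64 package is installed and uses three of the price values shown above. `unclass()` is my stand-in for what happens when `gather()` drops the attributes (the warning in the summary output says it does), so treat this as an illustration of the mechanism rather than a trace of what tidyr does internally.

```r
library(bit64)

# Three of the price values from head(dt), stored as 64-bit integers.
x <- as.integer64(c(45000000, 0, 150000000))

# integer64 keeps its bits in a double vector and relies on the class
# attribute to interpret them.
typeof(x)

# Dropping the class leaves the raw bit patterns read as doubles:
# the values become vanishingly small instead of the original prices.
raw_vals <- unclass(x)
print(raw_vals)
print(var(raw_vals))      # effectively zero, hence the "invariant" columns

# A proper conversion keeps the real values and a real variance.
converted <- as.numeric(x)
print(converted)
print(var(converted))
```

This would explain why `head(dt)` prints sensible prices (printing honors the class) while the reshaped data looks constant.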
You can fix this in one of two ways:
Fix it when you read it in (recommended):
dt <- fread("https://raw.githubusercontent.com/Deborah-Jia/Complete_Analysis_da2/main/eg1.csv",
            integer64 = "numeric")
Fix it with data in your environment:
### data.table (since you used `fread`)
dt[, c("price_huf_low", "price_huf_high") := lapply(.SD, as.numeric),
   .SDcols = c("price_huf_low", "price_huf_high")]
### or dplyr
dt %>%
mutate(across(starts_with("price"), as.numeric)) %>% # ... rest of your pipe
### if more than 'price_*' columns:
dt %>%
mutate(across(where(~ inherits(., "integer64")), as.numeric)) %>% # ...
Either way, once those two columns are converted to numeric, they can be plotted with your original code:
dt %>%
  keep(is.numeric) %>%
  gather() %>% na.omit() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram()
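As a quick sanity check, the two fixes can be compared on a tiny inline CSV. This is only a sketch: the two-row data is made up for illustration (the second price is deliberately larger than a 32-bit integer can hold, so that fread picks integer64), and it assumes data.table and bit64 are installed.

```r
library(data.table)
library(bit64)   # provides the as.numeric() method for integer64

# Hypothetical data, not the original file.
csv <- "id,price_huf_low\n1,45000000\n2,3000000000\n"

# Fix 1: convert while reading.
at_read <- fread(text = csv, integer64 = "numeric")

# Fix 2: read with the default, then convert every integer64 column in place.
after_read <- fread(text = csv)
i64_cols <- names(after_read)[sapply(after_read, inherits, "integer64")]
after_read[, (i64_cols) := lapply(.SD, as.numeric), .SDcols = i64_cols]

# Both routes should end with plain numeric columns holding the same values.
print(sapply(at_read, class))
print(sapply(after_read, class))
print(all(at_read$price_huf_low == after_read$price_huf_low))
```

Selecting the columns by class rather than by name means the same snippet keeps working if other integer64 columns show up later.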