I desperately need your help. I scraped some data from Wikipedia and came across this ¦ sign. At first I thought it was just |, but it's obviously not.
Most of my cells look like this:
table$Population
7004164110000000000¦16,411[7]
7007111260000000000¦11,126,000[13]
I'm trying to remove everything but 16,411, but first I need to know how to convert it into something else.
Any help appreciated, I was going nuts because when I tried the gsub function it didn't work and then the str_split_fixed one didn't work either...
dput(tables$Population)
gives
c("7007301655000000000¦30,165,500[6]", "7007241833000000000¦24,183,300[8]", "7007217070000000000¦21,707,000[10]", "7007150292310000000¦15,029,231[11]")
Here's another way to parse that table into a data frame:
library(rvest)
pg <- read_html("https://en.wikipedia.org/wiki/List_of_cities_proper_by_population")
html_node(pg, "table.wikitable") %>%
  html_table() %>%
  dplyr::tbl_df() %>%
  janitor::clean_names() %>% # THE LINE BELOW DOES THE MAGIC YOU ORIGINALLY ASKED FOR BUT IN A DIFFERENT WAY
  tidyr::separate(population, c("sortkey", "population"), sep="[^[:ascii:]]+") %>%
  dplyr::mutate(
    population = gsub("\\[.*$", "", population)
  ) %>%
  readr::type_convert()
## # A tibble: 87 x 9
##     rank city      image sortkey population definition                totalarea_km  populationdensi… country
##    <int> <chr>     <lgl>   <dbl>      <dbl> <chr>                     <chr>                    <dbl> <chr>
##  1     1 Chongqing NA    7.01e18  30165500. Municipality              700482403000…             366. China
##  2     2 Shanghai  NA    7.01e18  24183300. Municipality              700363405000…            3814. China
##  3     3 Beijing   NA    7.01e18  21707000. Municipality              700416411000…            1267. China
##  4     4 Istanbul  NA    7.01e18  15029231. Metropolitan municipality 700262029000…           24231. Turkey
##  5     5 Karachi   NA    7.01e18  14910352. City[14]                  700337800000…            3944. Pakist…
##  6     6 Dhaka     NA    7.01e18  14399000. City                      700233754000…           42659. Bangla…
##  7     7 Guangzhou NA    7.01e18  13081000. City (sub-provincial)     700374340000…            1760. China
##  8     8 Shenzhen  NA    7.01e18  12528300. City (sub-provincial)     700319920000…            6889. China
##  9     9 Mumbai    NA    7.01e18  12442373. City[21]                  700243771000…           28426. India
## 10    10 Moscow    NA    7.01e18  13200000. Federal city[24][25]      2 511[26]                5256. Russia
## # ... with 77 more rows
The table uses the following underlying markup for the rows:
The "population" cells end up looking like this in an R raw vector (this is the first one; the three bytes e2 99 a0 in the middle are the separator, and every other byte is plain ASCII):
## [1] 37 30 30 37 33 30 31 36 35 35 30 30 30 30 30 30 30 30 30 e2 99 a0 33 30 2c 31 36 35 2c 35 30 30 5b 36 5d
That looks more like an embedded Unicode character than a plain pipe: e2 99 a0 is the UTF-8 encoding of U+2660 (the black spade suit, ♠). Since it's "not ASCII" we can use that to our advantage for wrangling out the data.
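If you only need the cleaned numbers and not the whole table, the same idea can be sketched in base R. This is a minimal sketch, not the only way to do it. It assumes the separator is U+2660 (inferred from the bytes above), and the two sample strings are modelled on the dput() output in the question; the variable names (sep_char, x, no_key, no_note, pop) are made up for illustration.

```r
# Assumption: the separator is U+2660, inferred from the bytes e2 99 a0.
sep_char <- "\u2660"

# Sample values modelled on the dput() output in the question.
x <- paste0(
  c("7007301655000000000", "7007241833000000000"),
  sep_char,
  c("30,165,500[6]", "24,183,300[8]")
)

# Inspect the UTF-8 bytes of the first value; the separator shows up as
# three bytes right after the 19 ASCII digits of the sort key.
bytes <- charToRaw(enc2utf8(x[1]))
print(bytes)

# Step 1: drop the leading sort key and the single character that follows it.
no_key <- sub("^[0-9]+[^0-9]", "", x)

# Step 2: drop the trailing footnote marker, e.g. "[6]".
no_note <- sub("\\[.*$", "", no_key)

# Step 3: remove the thousands separators and convert to numeric.
pop <- as.numeric(gsub(",", "", no_note, fixed = TRUE))
print(pop)
```

Step 1 deliberately matches "any single non-digit character" instead of naming the separator, so it should behave the same whether the character arrives as ♠ or is displayed as ¦ on your machine.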