This dataset involves product metadata information from Amazon.
The data looks something like this:
Id: 0
ASIN: 0771044445
discontinued product
Id: 1
ASIN: 0827229534
title: Patterns of Preaching: A Sermon Sampler
group: Book
salesrank: 396585
similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
categories: 2
Id: 2
ASIN: 0738700797
title: Candlemas: Feast of Flames
group: Book
salesrank: 168596
similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
categories: 2
How do I import this txt.gz file and only want to extract information related to "Id:" and "group:"? However, if each chunk (the chunk between 2 blank lines) contains "discontinued product", I do not want any information from that chunk at all.
Here is my solution to read your text. The temp.txt
in the document contain the text you shared. You can replace the line with the a proper code to access the text file from S3.
library(dplyr)
library(readr)
library(tidyr)
text <- read_lines("temp.txt")
df <- tibble(text = text[text!=""])
# split text using the colon
df %>%
# separate text into two columns
separate(col = text, into = c("variable", "value"), sep = ":",
extra = "merge", fill = "right") %>%
mutate(
# remove extra space from value column
value = trimws(value),
# add value to discontinued product for later usage
value = if_else(variable == "discontinued product", "TRUE", value),
# create index_book column base on Id row
index_book = cumsum(variable == "Id")) %>%
# using the index_book to assign the Id to all row with same index
group_by(index_book) %>%
mutate(book_id = as.numeric(value[variable == "Id"])) %>%
ungroup() %>%
# remove index_book column and rows with variable Id
select(-index_book) %>%
filter(variable != "Id") %>%
# Convert data into wide format
pivot_wider(id_cols = book_id, names_from = variable, values_from = value) %>%
# filter discontinued product
filter(is.na(`discontinued product`))
Here is the output of the code