r amazon-s3 filtering metadata data-extraction

How to read data from a structural text file

This dataset involves product metadata information from Amazon.

The data looks something like this:

Id:   0
ASIN: 0771044445
  discontinued product
        
Id:   1
ASIN: 0827229534
 title: Patterns of Preaching: A Sermon Sampler
 group: Book
 salesrank: 396585
 similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
 categories: 2
                       
Id:   2
ASIN: 0738700797
 title: Candlemas: Feast of Flames
 group: Book
 salesrank: 168596
 similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940
 categories: 2

How do I import this txt.gz file and only want to extract information related to "Id:" and "group:"? However, if each chunk (the chunk between 2 blank lines) contains "discontinued product", I do not want any information from that chunk at all.

Solution

Here is my solution to read your text. The temp.txt in the document contain the text you shared. You can replace the line with the a proper code to access the text file from S3.

library(dplyr)
library(readr)
library(tidyr)

text <- read_lines("temp.txt")
df <- tibble(text = text[text!=""])
# split text using the colon
df %>%
  # separate text into two columns
  separate(col = text, into = c("variable", "value"), sep = ":",
           extra = "merge", fill = "right") %>%
  mutate(
    # remove extra space from value column
    value = trimws(value), 
    # add value to discontinued product for later usage
    value = if_else(variable == "discontinued product", "TRUE", value),
    # create index_book column base on Id row
    index_book = cumsum(variable == "Id")) %>%
  # using the index_book to assign the Id to all row with same index
  group_by(index_book) %>%
  mutate(book_id = as.numeric(value[variable == "Id"])) %>%
  ungroup() %>%
  # remove index_book column and rows with variable Id
  select(-index_book) %>%
  filter(variable != "Id") %>%
  # Convert data into wide format
  pivot_wider(id_cols = book_id, names_from = variable, values_from = value) %>%
  # filter discontinued product
  filter(is.na(`discontinued product`))

Here is the output of the code