Search code examples
ramazon-s3filteringmetadatadata-extraction

How to read data from a structural text file


This dataset involves product metadata information from Amazon.

The data looks something like this:

Id:   0
ASIN: 0771044445
  discontinued product
        
Id:   1
ASIN: 0827229534
 title: Patterns of Preaching: A Sermon Sampler
 group: Book
 salesrank: 396585
 similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
 categories: 2
                       
Id:   2
ASIN: 0738700797
 title: Candlemas: Feast of Flames
 group: Book
 salesrank: 168596
 similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940
 categories: 2

How do I import this txt.gz file and only want to extract information related to "Id:" and "group:"? However, if each chunk (the chunk between 2 blank lines) contains "discontinued product", I do not want any information from that chunk at all.


Solution

  • Here is my solution to read your text. The temp.txt in the document contain the text you shared. You can replace the line with the a proper code to access the text file from S3.

    library(dplyr)
    library(readr)
    library(tidyr)
    
    text <- read_lines("temp.txt")
    df <- tibble(text = text[text!=""])
    # split text using the colon
    df %>%
      # separate text into two columns
      separate(col = text, into = c("variable", "value"), sep = ":",
               extra = "merge", fill = "right") %>%
      mutate(
        # remove extra space from value column
        value = trimws(value), 
        # add value to discontinued product for later usage
        value = if_else(variable == "discontinued product", "TRUE", value),
        # create index_book column base on Id row
        index_book = cumsum(variable == "Id")) %>%
      # using the index_book to assign the Id to all row with same index
      group_by(index_book) %>%
      mutate(book_id = as.numeric(value[variable == "Id"])) %>%
      ungroup() %>%
      # remove index_book column and rows with variable Id
      select(-index_book) %>%
      filter(variable != "Id") %>%
      # Convert data into wide format
      pivot_wider(id_cols = book_id, names_from = variable, values_from = value) %>%
      # filter discontinued product
      filter(is.na(`discontinued product`))
    

    Here is the output of the code

    Result with filted discontinued product