Search code examples
rxtsgrepl

Best way to clean data in R to and convert to XTS


I am trying to clean up some data I downloaded from the web an convert to XTS. I found some documentation on CRAN using GREPL to clean up the data, but am wondering if there is an easier way to do this other than using GREPL. I was hoping someone would be able to help me with the code to clean this data up either using GREPL or another function in R. Thank you in advance for any assistance you can provide me with.

  [1] "{"                                                                                 
  [2] "    \"Meta Data\": {"                                                              
  [3] "        \"1. Information\": \"Daily Prices (open, high, low, close) and Volumes\","
  [4] "        \"2. Symbol\": \"MSFT\","                                                  
  [5] "        \"3. Last Refreshed\": \"2017-06-08 15:15:00\","                           
  [6] "        \"4. Output Size\": \"Compact\","                                          
  [7] "        \"5. Time Zone\": \"US/Eastern\""     
  [8] "        },"                                                                        
  [9] "        \"2017-01-19\": {"                                                         
 [10] "            \"1. open\": \"62.2400\","                                             
 [11] "            \"2. high\": \"62.9800\","                                             
 [12] "            \"3. low\": \"62.1950\","                                              
 [13] "            \"4. close\": \"62.3000\","                                            
 [14] "            \"5. volume\": \"18451655\""                                           
 [15] "        },"                                                                        
 [16] "        \"2017-01-18\": {"                                                         
 [17] "            \"1. open\": \"62.6700\","                                             
 [18] "            \"2. high\": \"62.7000\","                                             
 [19] "            \"3. low\": \"62.1200\","                                              
 [20] "            \"4. close\": \"62.5000\","                                            
 [21] "            \"5. volume\": \"19670102\""                                           
 [22] "        },"                                                                        
 [23] "        \"2017-01-17\": {"                                                         
 [24] "            \"1. open\": \"62.6800\","                                             
 [25] "            \"2. high\": \"62.7000\","                                             
 [26] "            \"3. low\": \"62.0300\","                                              
 [27] "            \"4. close\": \"62.5300\","                                            
 [28] "            \"5. volume\": \"20663983\""                                           
 [29] "        }"                                                                         
 [30] "    }"                                                                             
 [31] "}"                                  

The final output for this data would look like:

            Open        High        Low        Close        Volume
2017-01-17  62.68       62.70       62.03       62.53       20663983
2017-01-18  62.67       62.70       62.12       62.50       19670102
2017-01-19  62.24       62.98       62.195      62.30       18451655

Solution

  • As beigel suggested, the first thing you need to do is parse the JSON.

    Lines <-
    "{                                                                                 
      \"Meta Data\": {
        \"1. Information\": \"Daily Prices (open, high, low, close) and Volumes\",
        \"2. Symbol\": \"MSFT\",
        \"3. Last Refreshed\": \"2017-06-08 15:15:00\",
        \"4. Output Size\": \"Compact\",
        \"5. Time Zone\": \"US/Eastern\"
      },
      \"2017-01-19\": {
          \"1. open\": \"62.2400\",
          \"2. high\": \"62.9800\",
          \"3. low\": \"62.1950\",
          \"4. close\": \"62.3000\",
          \"5. volume\": \"18451655\"
      },
      \"2017-01-18\": {
          \"1. open\": \"62.6700\",
          \"2. high\": \"62.7000\",
          \"3. low\": \"62.1200\",
          \"4. close\": \"62.5000\",
          \"5. volume\": \"19670102\"
      },
      \"2017-01-17\": {
          \"1. open\": \"62.6800\",
          \"2. high\": \"62.7000\",
          \"3. low\": \"62.0300\",
          \"4. close\": \"62.5300\",
          \"5. volume\": \"20663983\"
      }
    }"
    parsedLines <- jsonlite::fromJSON(Lines)
    

    Now that the data are in a usable structure, we can start cleaning it. Notice that each element in parsedLines is another list. Let's convert them to vectors with unlist, so we will have a list of vectors instead of a list of lists.

    parsedLines <- lapply(parsedLines, unlist)
    

    Now you might have noticed that the first element in parsedLines is metadata. We can attach that to the final object later. But first, let's rbind all the others elements into a matrix. We can do that for any length list by using do.call.

     ohlcv <- do.call(rbind, parsedLines[-1])  # [-1] removes the first element
    

    Now we can clean up the column names and convert the data from character to numeric.

    colnames(ohlcv) <- gsub("^[[:digit:]]\\.", "", colnames(ohlcv))
    ohlcv <- type.convert(ohlcv)
    

    At this point, I would personally convert to an xts object and attach the metadata. But you can continue with the ohlcv matrix, convert it to a data.frame, tibble, etc.

    # convert to xts
    x <- as.xts(ohlcv, dateFormat = "Date")
    # attach attributes
    metadata <- parsedLines[[1]]
    names(metadata) <- gsub("[[:digit:]]|\\.|[[:space:]]", "", names(metadata))
    xtsAttributes(x) <- metadata
    # view attributes
    str(x)
    
    An 'xts' object on 2017-01-17/2017-01-19 containing:
      Data: num [1:3, 1:5] 62.7 62.7 62.2 62.7 62.7 ...
     - attr(*, "dimnames")=List of 2
      ..$ : NULL
      ..$ : chr [1:5] " open" " high" " low" " close" ...
      Indexed by objects of class: [Date] TZ: UTC
      xts Attributes:  
    List of 5
     $ Information  : chr "Daily Prices (open, high, low, close) and Volumes"
     $ Symbol       : chr "MSFT"
     $ LastRefreshed: chr "2017-06-08 15:15:00"
     $ OutputSize   : chr "Compact"
     $ TimeZone     : chr "US/Eastern"