Search code examples
rdataframemultiple-columnssubtraction

How to subtract two columns in same dataframe with same columns headers/name in R with missing data


I have a dataframe with 8000 columns and 3785 rows. Columns are company names,so,I want to calculate to spread of a company by subtrating its ask price from bid price i.e Spread of company/stock = A.ASK- A.BID. In the column name A.ASK indicates that A is name of company and .ASK indicated it is the ASK price of the company where .BID is the BID price of A . And in my dataframe all ask and bid price of company are side by side column just as illustrated in example dataframe below. Additionally I have missing data for instance if company C started trading in year 2001 it will have NA for year 2000. So, I want is not ignore the date column for calculation and additionally where the there are NA for ask and bid of company it return me NA in resultant column

Date        A . ASK A .BID  C. ASK  C. BID
31/12/1999  NA      NA      NA      NA
03/01/2000  NA      NA      NA      NA
04/01/2000  82      77      NA      NA 
05/01/2000  82      77      NA      NA
06/01/2000  82      77      NA      NA
07/01/2000  82      77      NA      NA
10/01/2000  82      77      NA      NA
11/01/2000  82      77      NA      NA
12/01/2000  NA      NA      NA      NA
13/01/2000  NA      NA      NA      NA
14/01/2000  NA      NA      70      67
17/01/2000  NA      NA      70      67
18/01/2000  97      94      70      67
19/01/2000  97      92      70      67

df2<-df1

Date        A   C
31/12/1999  NA  NA
03/01/2000  NA  NA
04/01/2000  5   NA
05/01/2000  5   NA
06/01/2000  5   NA
07/01/2000  5   NA
10/01/2000  5   NA
11/01/2000  5   NA
12/01/2000  NA  NA
13/01/2000  NA  NA
14/01/2000  NA  3
17/01/2000  NA  3
18/01/2000  3   3
19/01/2000  5   3

Your help is highly appreciated


Solution

  • Having nicely formatted data which alternates between ask and bid price for each company makes this relatively straightforward. The following code should do what you're looking for.

    # Import data
    df <- read.table(text = 
                     "Date       A.ASK   A.BID   C.ASK   C.BID
                     31/12/1999  NA      NA      NA      NA
                     03/01/2000  NA      NA      NA      NA
                     04/01/2000  82      77      NA      NA 
                     05/01/2000  82      77      NA      NA
                     06/01/2000  82      77      NA      NA
                     07/01/2000  82      77      NA      NA
                     10/01/2000  82      77      NA      NA
                     11/01/2000  82      77      NA      NA
                     12/01/2000  NA      NA      NA      NA
                     13/01/2000  NA      NA      NA      NA
                     14/01/2000  NA      NA      70      67
                     17/01/2000  NA      NA      70      67
                     18/01/2000  97      94      70      67
                     19/01/2000  97      92      70      67",
                     header = TRUE
    )
    
    # Define a sequence which selects every second column
    # ask_cols starts at column 2
    # bid_cols starts at column 3
    ask_cols <- (1:((ncol(df)-1)/2))*2
    bid_cols <- (1:((ncol(df)-1)/2))*2+1
    
    # Use ask_cols and bid_cols to select columns from df and calculate
    df2 <- df[, ask_cols]-df[, bid_cols]
    
    # Add the date column to df2
    df2 <- cbind(df[, 1], df2)
    
    # We will use stringr for extracting company names to define column names
    library(stringr)
    
    colnames(df2) <- c("Date", str_extract(colnames(df[, ask_cols]), "([A-Za-z]+)"))
    

    Giving

    > df2
             Date  A  C
    1  31/12/1999 NA NA
    2  03/01/2000 NA NA
    3  04/01/2000  5 NA
    4  05/01/2000  5 NA
    5  06/01/2000  5 NA
    6  07/01/2000  5 NA
    7  10/01/2000  5 NA
    8  11/01/2000  5 NA
    9  12/01/2000 NA NA
    10 13/01/2000 NA NA
    11 14/01/2000 NA  3
    12 17/01/2000 NA  3
    13 18/01/2000  3  3
    14 19/01/2000  5  3
    

    Edit: A better way to define ask_cols and bid_cols is using the seq function

    ask_cols <- seq(2, ncol(df), 2)
    bid_cols <- seq(3, ncol(df), 2)
    

    Edit 2: A better regular expression for matching company names is to use a lookahead to match any series of characters which is followed by .ASK.

    colnames(df2) <- c("Date", str_extract(colnames(df[, ask_cols]), ".*(?=\\.ASK)"))