I have a dataframe with 8000 columns and 3785 rows. Columns are company names,so,I want to calculate to spread of a company by subtrating its ask price from bid price i.e Spread of company/stock = A.ASK- A.BID. In the column name A.ASK indicates that A is name of company and .ASK indicated it is the ASK price of the company where .BID is the BID price of A . And in my dataframe all ask and bid price of company are side by side column just as illustrated in example dataframe below. Additionally I have missing data for instance if company C started trading in year 2001 it will have NA for year 2000. So, I want is not ignore the date column for calculation and additionally where the there are NA for ask and bid of company it return me NA in resultant column
Date A . ASK A .BID C. ASK C. BID
31/12/1999 NA NA NA NA
03/01/2000 NA NA NA NA
04/01/2000 82 77 NA NA
05/01/2000 82 77 NA NA
06/01/2000 82 77 NA NA
07/01/2000 82 77 NA NA
10/01/2000 82 77 NA NA
11/01/2000 82 77 NA NA
12/01/2000 NA NA NA NA
13/01/2000 NA NA NA NA
14/01/2000 NA NA 70 67
17/01/2000 NA NA 70 67
18/01/2000 97 94 70 67
19/01/2000 97 92 70 67
df2<-df1
Date A C
31/12/1999 NA NA
03/01/2000 NA NA
04/01/2000 5 NA
05/01/2000 5 NA
06/01/2000 5 NA
07/01/2000 5 NA
10/01/2000 5 NA
11/01/2000 5 NA
12/01/2000 NA NA
13/01/2000 NA NA
14/01/2000 NA 3
17/01/2000 NA 3
18/01/2000 3 3
19/01/2000 5 3
Your help is highly appreciated
Having nicely formatted data which alternates between ask and bid price for each company makes this relatively straightforward. The following code should do what you're looking for.
# Import data
df <- read.table(text =
"Date A.ASK A.BID C.ASK C.BID
31/12/1999 NA NA NA NA
03/01/2000 NA NA NA NA
04/01/2000 82 77 NA NA
05/01/2000 82 77 NA NA
06/01/2000 82 77 NA NA
07/01/2000 82 77 NA NA
10/01/2000 82 77 NA NA
11/01/2000 82 77 NA NA
12/01/2000 NA NA NA NA
13/01/2000 NA NA NA NA
14/01/2000 NA NA 70 67
17/01/2000 NA NA 70 67
18/01/2000 97 94 70 67
19/01/2000 97 92 70 67",
header = TRUE
)
# Define a sequence which selects every second column
# ask_cols starts at column 2
# bid_cols starts at column 3
ask_cols <- (1:((ncol(df)-1)/2))*2
bid_cols <- (1:((ncol(df)-1)/2))*2+1
# Use ask_cols and bid_cols to select columns from df and calculate
df2 <- df[, ask_cols]-df[, bid_cols]
# Add the date column to df2
df2 <- cbind(df[, 1], df2)
# We will use stringr for extracting company names to define column names
library(stringr)
colnames(df2) <- c("Date", str_extract(colnames(df[, ask_cols]), "([A-Za-z]+)"))
Giving
> df2
Date A C
1 31/12/1999 NA NA
2 03/01/2000 NA NA
3 04/01/2000 5 NA
4 05/01/2000 5 NA
5 06/01/2000 5 NA
6 07/01/2000 5 NA
7 10/01/2000 5 NA
8 11/01/2000 5 NA
9 12/01/2000 NA NA
10 13/01/2000 NA NA
11 14/01/2000 NA 3
12 17/01/2000 NA 3
13 18/01/2000 3 3
14 19/01/2000 5 3
Edit: A better way to define ask_cols and bid_cols is using the seq
function
ask_cols <- seq(2, ncol(df), 2)
bid_cols <- seq(3, ncol(df), 2)
Edit 2: A better regular expression for matching company names is to use a lookahead to match any series of characters which is followed by .ASK.
colnames(df2) <- c("Date", str_extract(colnames(df[, ask_cols]), ".*(?=\\.ASK)"))