I asked a similar questions before but i still need some help/be pointed into the right direction.
I am trying to locate certain words within a column that consists of a SQL statement on all the rows and extract the next word in R studio.
Example: lets call this dataframe "SQL
| **UserID** | **SQL Statement**
1 | N781 | "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
2 | N283 | "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
So I am trying to pull out the table name. So I am trying to find the words "From" and "Join" and pulling the next table names.
I have been using some code with help from earlier:
I make the column "SQL Statement" in a list of 2 name "b"
I use the code:
z <- mapply(grepl,"(FROM|JOIN)",b)
which gives me a True and fasle for each word in each list.
z <- mapply(grep,"(FROM|JOIN)",b)
The above is close. It give me a position of every match in each of the lists.
But I am just trying to find the word Join or From and take the text word out. I was trying to get an output something like
| **UserID** | **SQL Statement** | Tables
1 | N781 | "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON | Table.1, Table.2
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
2 | N283 | "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 = | Table.11, Table.31
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
Here is a working script which uses base R options. The inspiration here is to leverage strsplit
to split the query string on the keywords FROM
or JOIN
. Then, the first separate word of each resulting term (except for the first term) should be a table name.
sql <- "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON
p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =
v.BusID WHERE SubID = 1 ORDER BY v.Name;"
terms <- strsplit(sql, "(FROM|JOIN)\\s+")
out <- unlist(lapply(terms, function(x) gsub("^([^[:space:]]+).*", "\\1", x)))
out <- out[2:length(out)]
out
[1] "Table.1" "Table.2" "Table.3"
To understand better what I did, follow the demo and have a look at the terms
list which resulted from splitting.
Edit:
Here is a link to another demo which shows how you might use the above logic on a vector of query strings, to generate a list of vector of tables, for each query