Search code examples
rregexsubstringtext-mining

Using R to text mine and extract words


I asked a similar questions before but i still need some help/be pointed into the right direction.

I am trying to locate certain words within a column that consists of a SQL statement on all the rows and extract the next word in R studio.

Example: lets call this dataframe "SQL

      |    **UserID**    |      **SQL Statement** 

 1   |    N781          |   "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON 
                            p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 = 
                            v.BusID WHERE SubID = 1 ORDER BY v.Name;"

2      |  N283          |   "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON 
                           p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 = 
                           v.BusID WHERE SubID = 1 ORDER BY v.Name;"

So I am trying to pull out the table name. So I am trying to find the words "From" and "Join" and pulling the next table names.

I have been using some code with help from earlier:

I make the column "SQL Statement" in a list of 2 name "b"

I use the code:

z <- mapply(grepl,"(FROM|JOIN)",b)

which gives me a True and fasle for each word in each list.

z <- mapply(grep,"(FROM|JOIN)",b)

The above is close. It give me a position of every match in each of the lists.

But I am just trying to find the word Join or From and take the text word out. I was trying to get an output something like

      |    **UserID**    |      **SQL Statement**                                | Tables 

 1   |    N781          |   "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON       | Table.1, Table.2          
                            p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =       
                            v.BusID WHERE SubID = 1 ORDER BY v.Name;"

2      |  N283          |   "SELECT D, E FROM Table.11 p JOIN Table.2 pv ON 
                           p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 =    | Table.11, Table.31 
                           v.BusID WHERE SubID = 1 ORDER BY v.Name;"

Solution

  • Here is a working script which uses base R options. The inspiration here is to leverage strsplit to split the query string on the keywords FROM or JOIN. Then, the first separate word of each resulting term (except for the first term) should be a table name.

    sql <- "SELECT A, B FROM Table.1 p JOIN Table.2 pv ON 
            p.ProdID.1ProdID.1 JOIN Table.3 v ON pv.BusID.1 = 
            v.BusID WHERE SubID = 1 ORDER BY v.Name;"
    
    terms <- strsplit(sql, "(FROM|JOIN)\\s+")
    
    out <- unlist(lapply(terms, function(x) gsub("^([^[:space:]]+).*", "\\1", x)))
    
    out <- out[2:length(out)]
    out
    
    [1] "Table.1" "Table.2" "Table.3"
    

    Demo

    To understand better what I did, follow the demo and have a look at the terms list which resulted from splitting.

    Edit:

    Here is a link to another demo which shows how you might use the above logic on a vector of query strings, to generate a list of vector of tables, for each query

    Demo