I need to scrape the web page below: submit its form with some specific values and then import the resulting table (the one behind the "View results as text file" link) into R as a data.frame. I tried to submit the form using the following code, but I did not get any results:
library(rvest)
library(httr)
POST(
url = "http://tempest.wellesley.edu/~btjaden/TargetRNA2/advanced.html",
encode = "form",
body=list(
`text` = "Escherichia coli str. K-12 substr. MG1655",
`sequence` = ">RyhB GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTT",
`sRNA_subregions` = "on",
`window` = "13",
`before` = "80",
`after` = "20",
`seed` = "7",
`interaction_region` = "20",
`candidate_targets` = "",
`mRNA_accessibility` = "on",
`sigle_target` = "",
`pvalue`= "0.05",
`max_interactions`="400"
),
verbose()
) -> res
content(res, as="parsed")
I think there is an intermediate page, http://tempest.wellesley.edu/~btjaden/cgi-bin/processRequest2.cgi, that loads before the results, but I do not know which parameters it expects, so I cannot get the results. I want to get this table (http://tempest.wellesley.edu/~btjaden/cgi-bin/targetRNA2.cgi?t1519754493.26):
Rank Gene Synonym Energy Pvalue sRNA_start sRNA_stop mRNA_start mRNA_stop
1 sdhD b0722 -12.98 0.004 28 42 -34 -20
2 ascG b2714 -12.65 0.005 52 65 8 20
3 ygjH b3074 -12.24 0.006 45 59 -8 6
4 sodB b1656 -11.43 0.011 37 50 -7 6
5 acnA b1276 -11.14 0.013 33 48 -6 9
6 srlQ b2708 -10.79 0.015 34 48 -6 8
7 cirA b2155 -10.71 0.016 40 57 -58 -40
8 nirB b3365 -10.51 0.018 37 55 -6 13
9 djlB b0646 -10.41 0.019 53 63 9 19
10 shiA b1981 -9.96 0.024 43 58 -63 -47
11 yhhN b3468 -9.78 0.026 50 62 -61 -49
12 ybbP b0496 -9.45 0.030 48 59 -7 4
13 ssuD b0935 -9.43 0.031 50 62 -19 -7
14 cysE b3607 -8.99 0.037 33 49 -8 10
15 insH1 b2030 -8.86 0.039 29 39 -75 -65
16 hscA b2526 -8.82 0.040 52 66 -20 -5
17 yciS b1279 -8.69 0.043 45 59 -10 5
18 dhaL b1199 -8.63 0.044 37 50 -8 6
19 nuoA b2288 -8.6 0.044 42 59 -8 8
20 narG b1224 -8.47 0.047 36 47 -51 -40
21 yraK b3145 -8.37 0.049 27 41 -80 -68
The POST should go to the processRequest2.cgi endpoint:
library(rvest)
library(httr)
POST(
url = "http://tempest.wellesley.edu/~btjaden/cgi-bin/processRequest2.cgi",
encode = "form",
body=list(
`text` = "Escherichia coli str. K-12 substr. MG1655",
`sequence` = ">RyhB GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTT",
`sRNA_subregions` = "on",
`window` = "13",
`before` = "80",
`after` = "20",
`seed` = "7",
`interaction_region` = "20",
`candidate_targets` = "",
`mRNA_accessibility` = "on",
`sigle_target` = "",
`pvalue`= "0.05",
`max_interactions`="400"
),
verbose()
) -> res
After that, you can look for the URL that it eventually redirects you to:
content(res, as="parsed") %>%
html_node(xpath=".//meta[@http-equiv]") %>%
html_attr("content") %>%
strsplit("=") %>%
.[[1]] %>%
.[2] %>%
sprintf("http://tempest.wellesley.edu/~btjaden/cgi-bin/%s", .) -> target_url
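The pipeline above splits the meta refresh value on every "=", so it would truncate a target that itself contains "=". A slightly more defensive sketch is shown here on a hypothetical stand-in for the intermediate page; the exact format of the content attribute ("6; URL=...") is an assumption inferred from the code above, not something confirmed against the live site:

```r
library(rvest)

# Hypothetical stand-in for the intermediate page returned by the POST.
# The real page's meta tag may be formatted differently.
html <- '<html><head><meta http-equiv="refresh" content="6; URL=targetRNA2.cgi?t1519754493.26"></head><body>Please wait</body></html>'

refresh <- read_html(html) %>%
  html_node(xpath = ".//meta[@http-equiv]") %>%
  html_attr("content")

# Remove everything up to and including the first "=" only,
# so any later "=" in the target survives
rel_path <- sub("^[^=]*=", "", refresh)

# The part before the ";" is the delay in seconds requested by the page
wait_secs <- as.numeric(sub(";.*$", "", refresh))

target_url <- sprintf("http://tempest.wellesley.edu/~btjaden/cgi-bin/%s", rel_path)
```

With the real response, the same steps would start from content(res, as="parsed") instead of read_html(html), and wait_secs could be passed to Sys.sleep() rather than hard-coding the delay.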
The site says to wait 6 seconds:
Sys.sleep(6)
Then you can get the data:
pg <- read_html(target_url)
html_nodes(pg, "table")
## {xml_nodeset (89)}
## [1] <table><tr>\n<td align="left"><code>GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGAC< ...
## [2] <table width="800">\n<tr>\n<th align="center">Rank</th>\n <th align="center" ...
## [3] <table width="355"><tr>\n<td align="left">1</td>\n <td width="90%">\n ...
## [4] <table width="100%">\n<tr><td></td></tr>\n<tr><td width="100%" bgcolor="white ...
## [5] <table width="355"><tr>\n<td width="32%"> </td>\n <td bgcolor="1E90FF"> ...
## [6] <table width="355"><tr>\n<td width="56%"> </td>\n <td bgcolor="1E90FF"> ...
## [7] <table width="355"><tr>\n<td width="49%"> </td>\n <td bgcolor="1E90FF"> ...
## [8] <table width="355"><tr>\n<td width="41%"> </td>\n <td bgcolor="1E90FF"> ...
## [9] <table width="355"><tr>\n<td width="37%"> </td>\n <td bgcolor="1E90FF"> ...
## [10] <table width="355"><tr>\n<td width="38%"> </td>\n <td bgcolor="1E90FF"> ...
## [11] <table width="355"><tr>\n<td width="44%"> </td>\n <td bgcolor="1E90FF"> ...
## [12] <table width="355"><tr>\n<td width="41%"> </td>\n <td bgcolor="1E90FF"> ...
## [13] <table width="355"><tr>\n<td width="57%"> </td>\n <td bgcolor="1E90FF"> ...
## [14] <table width="355"><tr>\n<td width="47%"> </td>\n <td bgcolor="1E90FF"> ...
## [15] <table width="355"><tr>\n<td width="54%"> </td>\n <td bgcolor="1E90FF"> ...
## [16] <table width="355"><tr>\n<td width="52%"> </td>\n <td bgcolor="1E90FF"> ...
## [17] <table width="355"><tr>\n<td width="54%"> </td>\n <td bgcolor="1E90FF"> ...
## [18] <table width="355"><tr>\n<td width="37%"> </td>\n <td bgcolor="1E90FF"> ...
## [19] <table width="355"><tr>\n<td width="33%"> </td>\n <td bgcolor="1E90FF"> ...
## [20] <table width="355"><tr>\n<td width="56%"> </td>\n <td bgcolor="1E90FF"> ...
## ...
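Judging by the listing above, the second table (the one whose header starts with Rank) looks like the ranked results, so html_table() should be able to turn it into a data.frame. Below is a minimal sketch on a hypothetical, trimmed-down copy of the page; the real table has more columns and may contain nested tables, so it may need extra cleanup:

```r
library(rvest)

# Hypothetical, simplified stand-in for the results page
html <- '<html><body>
<table><tr><td><code>GCGATCAGG</code></td></tr></table>
<table width="800">
<tr><th>Rank</th><th>Gene</th><th>Synonym</th><th>Energy</th><th>Pvalue</th></tr>
<tr><td>1</td><td>sdhD</td><td>b0722</td><td>-12.98</td><td>0.004</td></tr>
<tr><td>2</td><td>ascG</td><td>b2714</td><td>-12.65</td><td>0.005</td></tr>
</table>
</body></html>'

pg <- read_html(html)
tables <- html_nodes(pg, "table")

# Assumption: the results are in the second table, as in the listing above
results <- as.data.frame(html_table(tables[[2]], header = TRUE))
```

On the live page, pg would come from read_html(target_url) as above, and it is worth checking str(results) before relying on the column types.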