I need to extract a CVE from a text column. The format is CVE-yyyy-xxxxx
.
The year(yyyy
) is variable, and the x
’s will change per CVE - can be 1 to 6 digits. Sometimes the CVE is enclosed in ()
, sometimes followed by :
The column is like:
the vulnerability name (CVE-2019-0215)
the vulnerability name (CVE-2019-0290) extra words
the vulnerability name CVE-2018-23314: blah blah
Expected Output in a new column:
CVE-2019-0215
CVE-2019-0290
CVE 2018-23314
It is useful to layout the problem through sample data, and the expected result (ps: for this please learn some stackoverflow format capability).
sample data
CREATE TABLE mytable(
sourcecol VARCHAR(100) NOT NULL
);
INSERT INTO mytable(sourcecol) VALUES ('the vulnerability name (CVE-2019-0215)');
INSERT INTO mytable(sourcecol) VALUES ('the vulnerability name (CVE-2019-0290) extra words');
INSERT INTO mytable(sourcecol) VALUES ('the vulnerability name CVE-2018-23314: blah blah');
query: using regex pattern in substring function
select substring(sourcecol from '(CVE-[0-9]{1,6}-[0-9]{1,6})+')
from mytable
this regex "pattern" looks for the string starting with "CVE-" followed by 1 to 6 digits followed by "-" followed by 1 to 6 digits
result
+----------------+
| substring |
+----------------+
| CVE-2019-0215 |
| CVE-2019-0290 |
| CVE-2018-23314 |
+----------------+
see this dbfiddle