regex alteryx

How to extract numbers from HTML tags in Alteryx?


I have a scraped dataset that contains a column of data like below:

<td>1,968</td>
<td>185</td>
<td>1,285<sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup></td>

I am processing the data in Alteryx and want to use a regex to extract the number between the HTML tags <td> and </td>. For the rows above, I expect to get back 1968, 185 and 1285. I tried the following regular expressions, but neither worked in this tester. I believe Alteryx uses the R flavor of regex, but I am not sure.

>([0-9]+)<

>[0-9]+<

Can someone please shed some light on this? Thanks!


Solution

  • An alternative Alteryx approach: use a Formula tool to remove <td> as well as commas and spaces, then use a Select tool to cast what remains to the numeric type of your choice. The cast automatically takes everything up to the first non-numeric character, so any trailing markup is dropped.
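
To make the logic of that approach concrete outside of Alteryx, here is a minimal Python sketch. It is not Alteryx code, and the function names `strip_then_cast` and `regex_extract` are made up for illustration. The first function mimics the two steps described above (strip `<td>`, commas and spaces, then keep the leading digits); the second shows one possible regex that tolerates the thousands separator, on the assumption that the number always follows `<td>` directly.

```python
import re

def strip_then_cast(cell: str) -> int:
    """Mimic the Formula + Select steps: clean the text, then keep leading digits."""
    # Step 1 (Formula tool analogue): remove the opening tag, commas, and spaces.
    cleaned = cell.replace("<td>", "").replace(",", "").replace(" ", "")
    # Step 2 (Select tool analogue): take everything up to the first non-digit.
    match = re.match(r"[0-9]+", cleaned)
    if match is None:
        raise ValueError(f"no leading number in {cell!r}")
    return int(match.group())

def regex_extract(cell: str) -> int:
    """Alternative: capture digits and commas right after <td>, then drop the commas."""
    match = re.search(r"<td>\s*([0-9][0-9,]*)", cell)
    if match is None:
        raise ValueError(f"no number after <td> in {cell!r}")
    return int(match.group(1).replace(",", ""))

# The three sample rows from the question.
rows = [
    "<td>1,968</td>",
    "<td>185</td>",
    '<td>1,285<sup id="cite_ref-4" class="reference">'
    '<a href="#cite_note-4">[4]</a></sup></td>',
]

values = [strip_then_cast(r) for r in rows]
print(values)  # [1968, 185, 1285]
```

Both functions give the same result on the sample rows. The original patterns have no way to match a comma, so they cannot capture "1,968" or "1,285" as a single number; removing the separator first, or allowing it inside the capture group, avoids that.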