I am able to annotate web pages using the Portia web crawler. My question is: how can I use a regex while extracting the data?
For example,
I have extracted the Location field from a page.
The output looks like this:
Location : Location xyz,abc
But I need only the xyz,abc value.
I have searched for solutions, but could not find much information.
Could you explain how regex works in Portia (Scrapy)?
You need to use a capture group to extract the data, so in this case:
Location: (.*)
This tells Portia to extract all of the data following the Location: string.
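To see what the capture group keeps, here is a minimal sketch using Python's standard re module. It only illustrates the regex semantics, not Portia's internals. The sample text is an assumption: it supposes the raw text of the annotated element is "Location: xyz,abc". If the text on your page has no colon after Location, the literal part of the pattern has to be changed to match.

```python
import re

# Hypothetical raw text of the annotated element (an assumption, not Portia output).
raw = "Location: xyz,abc"

# The parentheses form a capture group: the literal "Location: " must match,
# but only the text matched inside the parentheses is kept.
match = re.search(r"Location: (.*)", raw)
location = match.group(1) if match else None

print(location)
```

With this sample input, location holds only the part after the label.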
If, for example, you only wanted to extract the data between Location: and the comma, you could use the following:
Location: (.*),
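One thing to watch with this pattern: .* is greedy, so if the text contains more than one comma, the group runs up to the last comma, not the first. The sketch below, again using Python's re module on a made-up sample string, compares the greedy form with two common alternatives that stop at the first comma. The sample text and variable names are hypothetical.

```python
import re

# Hypothetical sample with two commas, to show where each pattern stops.
raw = "Location: xyz,abc,def"

# Greedy: .* grabs as much as possible, so the match ends at the LAST comma.
greedy = re.search(r"Location: (.*),", raw).group(1)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST comma.
lazy = re.search(r"Location: (.*?),", raw).group(1)

# Negated character class: match everything that is not a comma.
no_comma = re.search(r"Location: ([^,]*)", raw).group(1)

print(greedy, lazy, no_comma)
```

When the text has exactly one comma, as in "Location: xyz,abc", all three forms capture the same value.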
You can also place part of the pattern inside the capture group, so that everything up to and including that part is extracted.
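As a rough illustration of moving pattern text inside the group, the sketch below uses Python's re module on an assumed sample string. It shows the regex behaviour only; the sample text and variable names are hypothetical.

```python
import re

# Hypothetical raw text of the annotated element.
raw = "Location: xyz,abc"

# Comma outside the group: the comma must match but is not extracted.
without_comma = re.search(r"Location: (.*),", raw).group(1)

# Comma inside the group: the comma becomes part of the extracted value.
with_comma = re.search(r"Location: (.*,)", raw).group(1)

# Label inside the group: the "Location: " prefix is extracted as well.
with_label = re.search(r"(Location: .*)", raw).group(1)

print(without_comma, with_comma, with_label)
```

Whatever sits inside the parentheses ends up in the result, and whatever sits outside only anchors the match.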