java regex pentaho data-integration

HTML scraping in a PDI Spoon step (User Defined Java Class)


Hi, I am using the HTTP Client step to get the source code of a website. I need to scrape out a particular part of one line.

example line: <a href="....." ......>TEXT I WANT</a>

So I figured I would use a UDJC in PDI: first split the text block into lines with String[] lines = code.split("\n+"); then loop through the array and use an if condition (i.e. the regex check) to see if I have the right line.

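String outputString = null; // declared outside the loop so it stays in scope
for (String line : lines) {
        // matches() must match the whole line; "." does not match a trailing \r
        if (line.matches(".*a href.*")) {
            outputString = line; // keep the matching line, not the whole page
            break;
        }
    }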

(I am also trying this in an IDE as pure Java, without PDI.) I never get a hit, though. Any idea how to fix this? Or is there a faster and easier way to get the chunk I want?


Solution

  • I do something like what you want in a similar case, using a filter step.

    Transformation-Steps:

    1. Generate Rows step with a field "dom", type String. IMPORTANT: Limit should be 1 // Pentaho needs an input field for the HTTP step; it is not needed in the following steps
    2. HTTP step: get the HTML dump and set a field name such as "html" for it (a status-code field may also be useful) // check with a preview that the data is there
    3. Filter step: "html" contains "<a href" // check the output
    4. JavaScript step with your regex*: define a new field with your wanted output

    * for the regex
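
For the plain-Java attempt in the question, a regex with a capture group can pull out the link text directly, without splitting the page into lines first. The following is a minimal sketch, not the answerer's method: the class name AnchorTextExtractor and the method firstAnchorText are hypothetical, and the pattern assumes that href is the first attribute of the tag and that its value is double-quoted. It uses find() rather than matches(), so the pattern does not have to cover the whole input and stray \r characters from \r\n line endings do not prevent a hit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class AnchorTextExtractor {

    // Group 1 captures the text between <a href="..." ...> and </a>.
    // Assumption: href comes first and is double-quoted.
    // DOTALL lets the link text span line breaks; CASE_INSENSITIVE accepts <A HREF=...>.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\s+href=\"[^\"]*\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of the first matching link, or null if there is none.
    static String firstAnchorText(String html) {
        if (html == null) {
            return null;
        }
        Matcher m = ANCHOR.matcher(html);
        // find() searches anywhere in the input; matches() would need the
        // pattern to cover the entire string.
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p>\r\n"
                + "<a href=\"http://example.com\" class=\"x\">TEXT I WANT</a>\r\n";
        System.out.println(firstAnchorText(html)); // prints "TEXT I WANT"
    }
}
```

Inside a UDJC the same method body could be called from processRow on the field that holds the HTTP response. For pages with nested or irregular markup, a real HTML parser is likely to be more robust than a regex.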