
Grep/Find/Xargs: Search between two strings in folder or result of Wget


I have a folder full of html files, some of which have the following line:

    var topicName = "website/something/something_else/1234/12345678_.*, website/something/something_else/1234/12345678_.*//";

I need to get all instances of the text between inverted commas into a text file. I've been trying to combine FIND.exe and XARGS.exe to do this, but have not been successful.

I've been looking at things like the following, but don't know where to start to combine all three to get the output I want.

    grep -rI "var topicName = " *

Ideally, I want to combine this with a call to wget as well. So, in order: (a) do a recursive mirror of a website (maybe limiting the results to HTML files), i.e.:

    wget -mr -k -e robots=off --user-agent="Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11" --level=1 http://www.website.com/someUrl

(b) go through the HTML in each result and check whether it contains the text 'var topicName', and (c) if so, get the text between 'var topicName = "' and the closing '"', and write all the values to a text file at the end.

I'd appreciate any help at all with this.

Thanks.


Solution

  • For grabbing the text from the HTML into a file: If your version of grep supports it, the -o switch tells it to only print the matched portion of the line.

    With this in mind, two grep invocations should sort you out (provided you can uniquely identify ONLY the lines you want to grab the text from); something like this:

    grep -Rn "var topicName =" html/ | grep -o '"[^"]*"' > topicNames.dat
    

    If it's unacceptable to leave the " symbols in there, you could pipe the output through sed after the second grep:

    grep -Rn "var topicName =" html/ | grep -o '"[^"]*"' | sed 's/"//g' > topicNames.dat
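
To tie this back to the wget step in the question, here is a rough sketch of the whole pipeline as a small script. Treat it as a starting point under some assumptions, not a tested recipe for your site: it assumes GNU wget and a grep that supports -r, -h and --include. The function names mirror_site and extract_topic_names and the html download directory are made up for illustration. It also swaps the second grep for a single sed substitution, so that only the value assigned to topicName is printed even if the line contains other quoted strings.

```shell
#!/bin/sh
# Sketch only: mirror a site one level deep, then collect topicName values.
# Assumes GNU wget and a grep that supports -r, -h and --include.

mirror_site() {
    # $1: start URL, $2: directory to download into (both supplied by the caller)
    # -r --level=1 : recurse one level deep
    # -E           : save text/html pages with an .html suffix
    # -e robots=off: ignore robots.txt
    # -P           : directory to download into
    wget -r --level=1 -E -e robots=off -P "$2" "$1"
}

extract_topic_names() {
    # $1: directory to search
    # grep -h drops the file-name prefix from each matching line; sed then
    # prints only the text between the double quotes that follow
    # 'var topicName = '.
    grep -rh --include='*.html' 'var topicName = "' "$1" |
        sed -n 's/.*var topicName = "\([^"]*\)".*/\1/p'
}

# Run the whole pipeline only when a URL is given on the command line.
# wget can return non-zero for a single broken link, so extraction is
# attempted regardless of its exit status.
if [ "$#" -ge 1 ]; then
    mirror_site "$1" html
    extract_topic_names html > topicNames.dat
fi
```

Saved as, say, topics.sh (a hypothetical name), it would be run as `sh topics.sh http://www.website.com/someUrl`, leaving one value per line in topicNames.dat. If you already have the folder of HTML files, you can skip the download and call extract_topic_names on that folder directly.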