Some of the data I want to scrape is contained inside the page's JavaScript. It looks similar to this pattern:
<script type="text/javascript">
arrayName["field1"] = 12;
arrayName["field2"] = 42;
arrayName["field3"] = 1442;
</script>
<script type="text/javascript">
arrayName["field4"] = 62;
arrayName["field5"] = 3;
arrayName["field6"] = 542;
</script>
It's mixed in with a hell of a lot of other JavaScript. I need to get these values.
I started like so:
var dom = CQ.CreateFromUrl("http://somesite.xxx");
CQ script = dom["script[type='text/javascript']"];
But I can't work out how to grab this data from here. Is the only way to write a regex and loop over everything, or is there another way with better performance?
I can't see how to use CSS selectors on actual JavaScript code. Should I try a different approach?
It seems like you are really looking for a server-side JavaScript engine. CsQuery can get you the contents of the script tags easily enough, but then you need to actually run the script and be able to refer to the entities it creates. In theory one could create some kind of query language to parse out lines of script, but in reality that amounts to running it. If you just need to pull out particular lines containing simple assignments, and context isn't important, then something as simple as regular expressions (or even grep) will probably do to filter out what you need.
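For the simple-assignment case, a minimal sketch of the regex route might look like the following. It assumes every line of interest has exactly the shape shown in the question (`arrayName["key"] = integer;`, with a double-quoted key and an integer literal), and the names `ScriptValueExtractor` and `Extract` are made up for illustration. The input is just the text of each script block, however you obtain it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScriptValueExtractor
{
    // Matches assignments such as: arrayName["field1"] = 12;
    // Assumption: keys are double-quoted and values are integer literals.
    private static readonly Regex Assignment = new Regex(
        @"arrayName\[""(?<key>[^""]+)""\]\s*=\s*(?<value>-?\d+)\s*;",
        RegexOptions.Compiled);

    // Scans the text of each script block and collects every matching
    // assignment. If a key appears more than once, the last value wins,
    // which mirrors what running the script would do.
    public static Dictionary<string, int> Extract(IEnumerable<string> scriptBodies)
    {
        var values = new Dictionary<string, int>();
        foreach (string body in scriptBodies)
        {
            foreach (Match m in Assignment.Matches(body))
            {
                values[m.Groups["key"].Value] = int.Parse(m.Groups["value"].Value);
            }
        }
        return values;
    }
}
```

You would feed it the text of each element in your `script` selection (for example by looping over the `CQ` object and reading each element's inner text) and then look up `values["field1"]` and so on. One compiled regex over a handful of script blocks is unlikely to be a bottleneck next to the HTTP request itself.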
I have used the Noesis V8 wrapper -- http://javascriptdotnet.codeplex.com/ -- also on NuGet as Noesis.Javascript.
It's as fast as anything (since it uses Google's V8 engine under the hood); the only real downside is that it's not a pure .NET solution, but once set up it's pretty painless. An example of using it is in my project https://github.com/jamietre/SharpLinter which uses it to run JSHint.
There are a variety of 100% .NET JavaScript engines, such as Jint, IronJS and Jurassic. I have used Jurassic before, and it's probably the fastest because it compiles to bytecode. It's surprisingly complete, but it is not really being actively developed, so it will probably be difficult to get much support. All of them are much, much slower than V8, though, and offer no real advantage other than having no non-.NET references.
Unless you really, really need it to be 100% .NET, just use JavascriptDotNet.