google-sheets xpath google-sheets-formula

Returning an XPath value based on descendant criteria within IMPORTXML

I am trying to return the Time and Name of movies from a list, however I seem to get so far and end up going around in circles!

This is the HTML I am using:

<li>
  <a href="XXXXXX" class="program">
    <span class="time">9:00pm</span>
    <div class="meta">
      <h3><img src="XXXXXX" width="13" height="11" class="movie"> The Transporter 2</h3>
      <span class="desc">Action (2005)</span>
      <p>MOVIE DESCRIPTION.</p>
    </div>
  </a>
</li>

This format repeats for each film/programme. What I was aiming for was a single XPath query that returned the time and the title together, but I could not work out how to write it. So I have settled on two separate queries, one for the time and one for the title, which I can then join together in Google Sheets. What complicates this is that the list may also contain regular programmes, so the only way I can tell a film from a programme is by checking whether there is an image with the class of movie.

This works fine when returning the title:

=importxml(C1,"/html/body/ul/li/a/div/h3[descendant::img[@class ='movie']]//text()")

However, when attempting to return the time, it returns all times, not just those where there is an image. I suspect this is due to sibling/descendant nodes that I'm just not understanding. I have tried the following, but it still returns the times for all programmes and movies, when ideally I would like it to return only the times for movies, i.e. those with an image of class movie alongside the title.

=importxml(C1,"/html/body/ul/li/a/span[//descendant::input[//img[contains(@class, 'movie')]]]")

If anyone has any suggestions I would be very grateful as I have been going around in circles all day!

Solution

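For the first XPath there is no need for descendant::img, since the img is a direct child of the h3. A single slash before text() is enough for the same reason: the title is a text node directly under the h3.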

"/html/body/ul/li/a/div/h3[img[@class ='movie']]/text()"

For the second XPath, put the predicate on the a element, like this:

"/html/body/ul/li/a[descendant::img[contains(@class, 'movie')]]/span"

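Starting an XPath, or a path inside a predicate, with a double slash // means "look everywhere in the document", not "look below the current node". The predicate in your second XPath is therefore evaluated against the whole document and gives the same result for every span; since it is true for your page, all times are returned.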
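
If it helps to see the difference outside of Sheets, here is a rough Python sketch of the same idea using only the standard library. It is an illustration, not what IMPORTXML does internally. Assumptions: the sample markup is adapted from the question with the img tags self-closed so the XML parser accepts it, the second "Late News" entry is made up for contrast, and the function names are hypothetical. Note that xml.etree.ElementTree supports only a subset of XPath, so the descendant check is done with a relative .// search on each a element instead of a predicate.

```python
import xml.etree.ElementTree as ET

# Sample adapted from the question. The <img> tag is self-closed so the
# stdlib XML parser accepts it; the second entry is invented for contrast.
SAMPLE = """
<html><body><ul>
  <li>
    <a href="XXXXXX" class="program">
      <span class="time">9:00pm</span>
      <div class="meta">
        <h3><img src="XXXXXX" width="13" height="11" class="movie"/> The Transporter 2</h3>
        <span class="desc">Action (2005)</span>
        <p>MOVIE DESCRIPTION.</p>
      </div>
    </a>
  </li>
  <li>
    <a href="XXXXXX" class="program">
      <span class="time">11:00pm</span>
      <div class="meta">
        <h3>Late News</h3>
        <span class="desc">News</span>
        <p>PROGRAMME DESCRIPTION.</p>
      </div>
    </a>
  </li>
</ul></body></html>
"""


def times_with_document_wide_check(root):
    """Mimic a predicate that starts with //: the movie check is made once
    against the whole document, so it is the same for every entry."""
    any_movie = root.find(".//img[@class='movie']") is not None
    return [a.findtext("span") for a in root.findall("./body/ul/li/a") if any_movie]


def movie_entries(root):
    """Mimic putting the predicate on the a element: the movie check is made
    relative to each entry, and time and title are returned together."""
    entries = []
    for a in root.findall("./body/ul/li/a"):
        img = a.find(".//img[@class='movie']")  # search below this <a> only
        if img is None:
            continue  # regular programme, skip it
        time = a.findtext("span")  # first direct child span, i.e. the time
        title = (img.tail or "").strip()  # text that follows the <img> in the h3
        entries.append((time, title))
    return entries


root = ET.fromstring(SAMPLE)
print(times_with_document_wide_check(root))
print(movie_entries(root))
```

The first function returns both times because the check never looks at the individual entry, which is the behaviour described above. The second returns only the movie, already paired with its title.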