Search code examples
google-apps-scriptgoogle-sheetsxml-parsinggoogle-sheets-formulafiltering

Google sheets importxml filtering specific xpath items with variable subitems


In a google spreadsheet I would like to filter the importxml() function, because the xml source document isn't so well formed (some sub-items are variably present) so that my columns alignment is not good (different types of data are mixed). see picture The query syntax I'm using has no effet. I've tried to select or to exclude, and both. Maybe you could help me. Thank you very much.

=importxml(A1;"*/*[name()='NameId'] | */*[name()='ScientificName'] | */*[name()!='DisplayDate'] | */*[name()!='Family'] | */*[name()!='RankAbbreviation'] | */*[name()!='NomenclatureStatusID'] | */*[name()!='NomenclatureStatusName'] | */*[name()!='TotalRows']")

Here is the sheet (french parameters) : https://docs.google.com/spreadsheets/d/1wY_rt9ZRIMesXDFX_DoN-FTwcBkhNaugGgWd0X2bE4g/edit?usp=sharing


Solution

  • Issue and workaround:

    When I saw the XML data of the URL, it seems that there is the case that the names in each child are different. I think that the reason of your issue is due to this. Unfortunately, it seems that at IMPORTXML and XPath, when the name is not existing, it cannot be directly replaced with the empty value. So, in this case, as a workaround for achieving your goal, I would like to propose to use a custom function created by XmlService of Google Apps Script instead of IMPORTXML.

    The flow of the sample script is as follows.

    1. Retrieve XML data from URL.
    2. Parse XML data.
    3. Retrieve header row.
    4. Create an result array, and put it.

    When above flow is reflected to a sample script, it becomes as follows.

    Sample script:

    Please copy and paste the following script to the script editor of Spreadsheet. And, please put a custom formula of =SAMPLE("http://services.tropicos.org/Name/Search?name=adonis&type=wildcard&apikey=7602bfa6-cd59-4029-a28d-3aeb0ff8836e&format=xml"). By this, all values are parsed and shown in the cells.

    function SAMPLE(url, ignoreHeaders, orderedHeaders) {
      // 1. Retrieve XML data from URL.
      const res = UrlFetchApp.fetch(url);
      
      // 2. Parse XML data.
      const xmlObj = XmlService.parse(res.getContentText());
      const root = xmlObj.getRootElement();
      const names = root.getChildren();
    
      // 3. Retrieve header row.
      let h = [];
      if (orderedHeaders) {
        h = orderedHeaders.split(",").map(e => e.trim());
      } else {
        const hObj = names.reduce((o, e) => {
          e.getChildren().forEach(f => {
            o[f.getName()] = true
          });
          return o;
        }, {});
        if (ignoreHeaders) {
          ignoreHeaders.split(",").forEach(e => {
            if (hObj[e.trim()]) delete hObj[e.trim()];
          });
        }
        h = Object.keys(hObj);
      }
    
      // 4. Create an result array.
      const ns = root.getNamespace();
      const result = names.reduce((ar, e) => {
        const temp = h.map(h => {
          const t = e.getChild(h, ns);
          if (t) {
            const v = t.getValue();
            return isNaN(v) ? v : Number(v);
          }
          return "";
        });
        ar.push(temp);
        return ar;
      }, [h]);
      return result;
    }
    
    • When =SAMPLE("URL") is put to a cell, all values are retrieved.
    • ignoreHeaders is used as the header you want to ignore. This is from */*[name()!='DisplayDate'] | */*[name()!='Family'] | */*[name()!='RankAbbreviation'] | */*[name()!='NomenclatureStatusID'] | */*[name()!='NomenclatureStatusName'] | */*[name()!='TotalRows'] in your xpath.
      • When you want to ignore the headers of DisplayDate,Family,RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows, please put a custom function like =SAMPLE("URL";"DisplayDate,Family,RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows"). By this, these headers are ignored from the retrieved values.
    • orderedHeaders is used as the custom ordered headers.
      • When you want to retrieve the values with the custom ordered headers like NameId,ScientificName,ScientificNameWithAuthors,Family,RankAbbreviation,NomenclatureStatusName,Author,DisplayReference,DisplayDate,TotalRows, please put a custom function like =SAMPLE("URL";;"NameId,ScientificName,ScientificNameWithAuthors,Family,RankAbbreviation,NomenclatureStatusName,Author,DisplayReference,DisplayDate,TotalRows")

    Result:

    When the URL of http://services.tropicos.org/Name/Search?name=adonis&type=wildcard&apikey=7602bfa6-cd59-4029-a28d-3aeb0ff8836e&format=xml is put to a cell "A1",

    Pattern 1:

    When =SAMPLE(A1) to a cell "B1", the following result is obtained.

    enter image description here

    Pattern 2:

    When =SAMPLE(A1;"DisplayDate,Family,RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows") to a cell "B1", the following result is obtained.

    enter image description here

    Pattern 3:

    When =SAMPLE(A1;;"NameId,ScientificName,ScientificNameWithAuthors,Family,RankAbbreviation,NomenclatureStatusName,Author,DisplayReference,DisplayDate,TotalRows") to a cell "B1", the following result is obtained.

    enter image description here

    References: