I have XML data for many scientific publications and I am trying to parse through the data in KNIME to extract the fields that I need. Here is one example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC4400176
To extract the names of the authors, I am using the following XPath Query: /pmc-articleset/article/front/article-meta/contrib-group/contrib[@contrib-type="author"]
However, this returns:
BorisovaSvetlana A., KimHak Joong, PuXiaotao, LiuHung-wen*
I would like the surname and given names to be separated by a delimiter (a comma and a space), and the individual authors to be separated by a semicolon. Is this possible? Or is there a better way to extract the information than what I am currently doing, one that would give me my ideal output:
Borisova, Svetlana A.; Kim, Hak Joong; Pu, Xiaotao; Liu, Hung-wen*
[edit]
Current KNIME workflow:
Sample current output:
I've tried outputting all of the author names for each publication into a collection cell. (If I output the names into multiple columns instead, I end up with hundreds of columns that mostly contain missing values. I've also tried to reach my ideal output with a series of string manipulations, but the result is never quite right, because some authors have multiple given names, hyphenated names, or names containing special characters.) The collection cell separates the authors with commas, but it still runs each author's surname and given names together. I could apply the same string manipulations to these values, but I would run into the same issues.
If I separate author names into multiple rows, this creates multiple rows for every article, from which I'm not sure how to get to my end goal for each article.
End goal:
Any ideas on how to solve this problem with the authors would be much appreciated!
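You should ideally do this in multiple steps. I’d do it as follows:
1. Select the contrib elements and return the resulting “Nodes” as rows (not as strings) using the XPath node.
2. Extract surname, given-names, and xref from each of these rows using another XPath node.
3. Combine the extracted values into one string per author, then concatenate the authors into a single cell with a GroupBy node.
[edit] You can find a fully working example workflow on my public NodePit space.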
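Independently of the KNIME workflow, the same steps can be sketched in a few lines of Python, which may help when checking what each node should produce. This is only a rough illustration using the standard library's xml.etree.ElementTree. The XML fragment is a simplified, made-up sample in the layout the XPath query above implies, and treating xref elements with ref-type="corresp" as the source of the trailing asterisk is an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up fragment (assumption: real PMC records contain many
# more elements and attributes than shown here).
SAMPLE = """
<article>
  <front><article-meta><contrib-group>
    <contrib contrib-type="author">
      <name><surname>Borisova</surname><given-names>Svetlana A.</given-names></name>
    </contrib>
    <contrib contrib-type="author">
      <name><surname>Liu</surname><given-names>Hung-wen</given-names></name>
      <xref ref-type="corresp" rid="cor1">*</xref>
    </contrib>
    <contrib contrib-type="editor">
      <name><surname>Doe</surname><given-names>Jane</given-names></name>
    </contrib>
  </contrib-group></article-meta></front>
</article>
"""

AUTHOR_PATH = "./front/article-meta/contrib-group/contrib[@contrib-type='author']"


def format_authors(article):
    """Return 'Surname, Given; Surname, Given*' for one <article> element."""
    formatted = []
    # Step 1: one node per author, kept as an element rather than flat text.
    for contrib in article.findall(AUTHOR_PATH):
        # Step 2: pull the individual fields out of each author node.
        surname = contrib.findtext("./name/surname", default="").strip()
        given = contrib.findtext("./name/given-names", default="").strip()
        # Assumption: the corresponding-author marker lives in xref[@ref-type='corresp'].
        marker = "".join(
            (x.text or "").strip()
            for x in contrib.findall("./xref[@ref-type='corresp']")
        )
        name = ", ".join(part for part in (surname, given) if part)
        formatted.append(name + marker)
    # Step 3: concatenate the per-author strings for this article.
    return "; ".join(formatted)


article = ET.fromstring(SAMPLE.strip())
authors = format_authors(article)
print(authors)
```

Because surname and given-names are read as separate fields, hyphenated or multi-part names need no string splitting afterwards.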
[regarding your edit] As far as I understand, your challenge now is that your table contains more than one publication, so the GroupBy node would combine all of them into one row. To avoid that, you can use the “Looping” nodes: simply surround the logic described above with a Chunk Loop Start node and a Loop End node. This lets you process each publication “in isolation”.
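To illustrate why the publications need to be kept apart, here is a small Python sketch of the concatenation step. It does not use a loop. Instead it groups the author rows by an article identifier, which is a different way of getting the same isolation. The row layout and the second article ID (PMC0000001) with its author names are made up for the example.

```python
from collections import defaultdict

# Hypothetical table after splitting authors into rows: one row per author,
# with the article ID carried along. The second ID and its names are made up.
rows = [
    ("PMC4400176", "Borisova, Svetlana A."),
    ("PMC4400176", "Kim, Hak Joong"),
    ("PMC0000001", "Doe, Jane"),
    ("PMC0000001", "Roe, Richard"),
]


def concatenate_all(rows):
    """Concatenate without a grouping key: every author ends up in one cell."""
    return "; ".join(author for _, author in rows)


def concatenate_per_article(rows):
    """Concatenate authors separately for each article ID."""
    grouped = defaultdict(list)
    for article_id, author in rows:
        grouped[article_id].append(author)
    return {article_id: "; ".join(names) for article_id, names in grouped.items()}


merged = concatenate_all(rows)
per_article = concatenate_per_article(rows)
print(merged)
print(per_article)
```

The first function mixes the authors of both articles into a single string, while the second returns one author string per article, which is the result that processing each publication separately is meant to produce.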