
Site summary feed of Wikipedia excluding a single user


There is a "Recent changes" feed available on the Wikipedia homepage.

The same list is also available as an Atom feed. It is also possible to follow a single user by opening their user account page and selecting its feed. Is there a way to get the feed with one or two users excluded?


Update: Using xmllint I can extract the author names.

wget https://hunspell.s3.amazonaws.com/temp/out.txt

xmllint --xpath "//*[name() = 'feed']/*[name() = 'entry']/*[name() = 'author']/*[name() = 'name']" out.txt

But I want to exclude one or two authors from this feed. For example, Clarityfiend and Shortride.


Update:

When I tried the xpath command, it worked well when both author names were in English (ASCII), but it failed when one of the names was in a non-Latin script:

wget https://hunspell.s3.amazonaws.com/todel/out.txt

worked:

xpath -e "/feed/entry[author/name!='Aditya tamhankar' and author/name!='Sushant Madhale']" out.txt > a.txt

did not work:

xpath -e "/feed/entry[author/name!='Aditya tamhankar' and author/name!='संतोष गोरे']"  out.txt > filtered.txt

The entry by the second author is still present in the filtered output:

grep 'संतोष गोरे' filtered.txt

The following xmllint command handles the Unicode name correctly, but it displays one record incorrectly:

# (t1='Aditya tamhankar' ; t2='संतोष गोरे'; echo 'setns x=http://www.w3.org/2005/Atom'; echo "cat /x:feed/x:entry[not(x:author/x:name[.='$t1'] | x:author/x:name[.='$t2'])]/descendant::*[self::x:updated or self::x:title or descendant-or-self::x:name]/text()") | xmllint --shell out.txt  | tail -n +4 | gawk '{ if(NR % 6 == 0){ print $0 "¬"} else { print $0 }}' |gawk 'BEGIN{FS="\n -------\n" ; RS="\n -------¬\n"; OFS="||"} { print $2,$1,$3 }END{ print FNR}'

All records except this one are correct:

152.238.27.63
/ >
||2021-07-15T20:14:03Z||
19

Solution

  • I suggest using the xpath command-line tool (Ubuntu package libxml-xpath-perl). It is built on Perl's XML::XPath, which implements XPath 1.0, and that is enough for this filter:

    wget -O - https://hunspell.s3.amazonaws.com/temp/out.txt | xpath -e "/feed/entry[author/name!='Clarityfiend' and author/name!='Shortride']" > filtered.txt
    

    UPD: If you get an out-of-memory error for the input buffer, download the feed to a file instead of piping it through standard output:

    wget https://hunspell.s3.amazonaws.com/temp/out.txt
    xpath -e "/feed/entry[author/name!='Clarityfiend' and author/name!='Shortride']" out.txt > filtered.txt
    

    The XPath query selects every entry whose author name is neither Clarityfiend nor Shortride. The shell redirection saves those entries in filtered.txt.
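
As an alternative sketch, the same filter can be written for xmllint --xpath, which the question already uses. The file sample_feed.xml, its three entries, and the titles "Page A" to "Page C" are made up for illustration and only stand in for out.txt. The query uses local-name() so that the Atom default namespace does not get in the way, and it uses not(... = ...) instead of !=. In XPath 1.0, author/name != 'X' is true as soon as any name node differs from 'X', so the not() form is the safer choice if an entry could ever carry more than one author.

```shell
#!/bin/sh
# Hypothetical sample feed standing in for out.txt.
# The titles and "Someone Else" are invented for this sketch.
cat > sample_feed.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Page A</title><author><name>Clarityfiend</name></author></entry>
  <entry><title>Page B</title><author><name>संतोष गोरे</name></author></entry>
  <entry><title>Page C</title><author><name>Someone Else</name></author></entry>
</feed>
EOF

# Select the titles of entries that have no author on the exclusion list.
# local-name() matches elements regardless of the Atom namespace;
# not(...) drops an entry if any of its author names matches.
query="//*[local-name()='entry'][not(*[local-name()='author']/*[local-name()='name'][.='Clarityfiend' or .='संतोष गोरे'])]/*[local-name()='title']/text()"

kept_titles=$(xmllint --xpath "$query" sample_feed.xml)
printf '%s\n' "$kept_titles"
```

To try this on the downloaded feed, replace sample_feed.xml with out.txt. Dropping the trailing /*[local-name()='title']/text() step should return the whole entry elements instead of only their titles.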