linux bash sed wget gnome-terminal

sed explanation so I can recreate a bit of code?


Can someone please explain the following sed command?

title=$(wget -q -O - https://twitter.com/intent/user?user_id=$ID | sed -n 's/^.*<title>\(.*\) on Twitter<.title>.*$/\1/p')
printf "%s\n" "$title"

I tried (and failed terribly) to recreate it because I thought I understood what was going on in the code. So I wrote (well, more modded) it to be the following:

data-user-id=$(wget -q -O - https://twitter.com/$Username | sed -n 's/^.*"data-user-id">\([^<]*\)<.*$/\1/p')
printf "%s\n" "$data-user-id"

Obviously it errored because the syntax is wrong or something. But I'm trying to understand what is going on so I can make my own variant of it.

P.S. I can't just use the API for this due to how everything needs to be configured.


Solution

  • Give this a try:

    wget -q -O - https://twitter.com/"${Username}" | sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {s/^.*data-screen-name=.'"${Username}"'".*data-user-id="\([0-9]*\)".*$/\1/Ip;q}'
    
    128700677
    

    data-user-id appears on several lines of the page, so the script has to select the line where data-screen-name matches the username.

    sed uses regular expressions; there are two good tutorials to start with:

    A simpler sed script that prints the whole matching line instead of only the ID:

    Username="StackOverflow"
    wget -q -O - https://twitter.com/"${Username}" | sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {p;q}'
    
    data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677"
    

    -n tells sed not to print anything unless the p command is used.

    . means any char.

    * applies to the previous char in the regex and means zero or more occurrences of that char.

    .* means zero or more occurrences of any char.
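The three building blocks above can be tried in isolation. This is a small sketch on made-up input words (nothing here comes from the Twitter page); each command prints only the lines its pattern matches.

```shell
# "." matches exactly one character: abc and a-c match, ac does not.
dot=$(printf '%s\n' abc a-c ac | sed -n '/a.c/p')

# "*" repeats the previous character zero or more times:
# ac, abc and abbc match, adc does not.
star=$(printf '%s\n' ac abc abbc adc | sed -n '/ab*c/p')

# ".*" matches any run of characters, including an empty one:
# ac, abc and a123c match, ab does not (there is no c after the a).
dotstar=$(printf '%s\n' ac abc a123c ab | sed -n '/a.*c/p')

printf '%s\n' "$dot" '---' "$star" '---' "$dotstar"
```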

    /data-screen-name=.'"${Username}"'".*data-user-id=/ selects lines that contain data-screen-name=, then any one char (.), which here matches the opening quote, then the value of ${Username} (StackOverflow), then a " char, then zero or more of any char (.*), then data-user-id=.

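    The I after the closing / makes the match case-insensitive (this is a GNU sed extension).

    {p;q} groups the commands that run when the regex above matches. p prints the current line. q exits sed, so only the first matching line is printed.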
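To see the line selection without hitting the network, the same script can be run on a stand-in for the page. This is only a sketch: the three lines in page are invented for illustration (the real markup may differ), and it assumes GNU sed because of the I modifier.

```shell
Username="stackoverflow"   # deliberately lower-case to show the effect of I

# Made-up stand-in for the downloaded page: every line has a data-user-id,
# but only the second and third also carry the wanted screen name.
page='<div data-screen-name="Other" data-name="Other" data-user-id="111">
<div data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677">
<div data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677">'

# -n suppresses the default output, p prints the matching line,
# and q stops sed so the duplicate third line is never printed.
line=$(printf '%s\n' "$page" |
  sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {p;q}')

printf '%s\n' "$line"
```

Dropping the q would print both matching lines, and dropping the I would print nothing for the lower-case username.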

    The first sed script at the top contains an additional s/regex/replacement/ to clean up the line.
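The substitution can also be tried offline and its result stored in a variable, which is what the question was attempting. A minimal sketch, assuming GNU sed; the sample lines are made up, and user_id is a hypothetical variable name chosen because shell variable names may contain only letters, digits and underscores, so a name like data-user-id cannot be assigned to.

```shell
Username="StackOverflow"

# Made-up sample line in the shape of the output shown above.
line='<div data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677">'

# The whole line is replaced by group 1, the digits between the quotes.
# The p flag prints the line only if the substitution happened.
user_id=$(printf '%s\n' "$line" |
  sed -n 's/^.*data-screen-name=.'"${Username}"'".*data-user-id="\([0-9]*\)".*$/\1/Ip')
printf '%s\n' "$user_id"

# The command from the question works the same way: group 1 is whatever
# sits between <title> and " on Twitter</title>" (sample line is made up).
title=$(printf '%s\n' '<head><title>Stack Overflow (@StackOverflow) on Twitter</title></head>' |
  sed -n 's/^.*<title>\(.*\) on Twitter<.title>.*$/\1/p')
printf '%s\n' "$title"
```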

    The additional elements used:

    ^ means the start of the line, and $ means the end of the line.

    \( ... \) are used to define a group.

    "\([0-9]*\)" is a group made of only digits, surrounded by two " chars which are not part of the group. It is the first group in the regex, so it can be referenced in the replacement part with \1.