sed explanation so I can recreate a bit of code?

Can someone please explain the following sed command?

title=$(wget -q -O - https://twitter.com/intent/user?user_id=$ID | sed -n 's/^.*<title>\(.*\) on Twitter<.title>.*$/\1/p')
printf "%s\n" "$title"

I tried (and failed terribly) to recreate it because I thought I understood what was going on in the code. So I wrote (well, more modded) it to be the following:

data-user-id=$(wget -q -O - https://twitter.com/$Username | sed -n 's/^.*"data-user-id">\([^<]*\)<.*$/\1/p')
printf "%s\n" "$data-user-id"

Obviously it errored because the syntax is wrong or something. But I'm trying to understand what is going on so I can make my own variant of it.

P.S. I can't just use the API for this due to how everything needs to be configured.

Solution

Give this a try:

wget -q -O - https://twitter.com/"${Username}" | sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {s/^.*data-screen-name=.'"${Username}"'".*data-user-id="\([0-9]*\)".*$/\1/Ip;q}'

128700677

data-user-id appears on several lines, so you need to select the line where data-screen-name matches the Username.

sed uses regular expressions; there are two good tutorials to start with:

A different sed script with a different output:

Username="StackOverflow"
wget -q -O - https://twitter.com/"${Username}" | sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {p;q}'

data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677"

-n instructs sed not to print anything except when the p command is used.

. means any single character.

* applies to the previous character in the regex and means zero or more occurrences of that character.

.* means zero or more of any character.

/data-screen-name=.'"${Username}"'".*data-user-id=/ selects lines that contain data-screen-name=, then any one character (.), then StackOverflow, then a " character, then zero or more of any character (.*), then data-user-id=.

The I after the closing / means ignore case (this modifier is a GNU sed extension).

{p;q} are the commands executed when the regex above matches. p prints the current line. q exits sed, so only the first matching line is printed.
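
To experiment without fetching the page, the same address and {p;q} block can be run against a few hand-written lines. This is only a minimal sketch, assuming GNU sed (needed for the I modifier); the sample lines and the variable names sample and match are made up for illustration:

```shell
Username="StackOverflow"

# Hypothetical stand-in for the downloaded page: the second and third
# lines both carry the wanted screen name (in different letter case).
sample='<div data-screen-name="Other" data-user-id="111">
<div data-screen-name="stackoverflow" data-name="Stack Overflow" data-user-id="128700677">
<div data-screen-name="StackOverflow" data-user-id="999">'

# -n turns off automatic printing; p prints the first line the address
# selects, then q quits so later lines are never examined.
match=$(printf '%s\n' "$sample" |
  sed -n '/data-screen-name=.'"${Username}"'".*data-user-id=/I {p;q}')

printf '%s\n' "$match"
```

Only the second sample line is printed: the I modifier lets "stackoverflow" match, and q stops sed before it reaches the third line.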

The first sed script at the top contains an additional s/regex/replacement/ to clean up the line.
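
The substitution can be tried offline on a single hard-coded line. This is a minimal sketch, assuming GNU sed (for the I flag); line and user_id are illustrative names. The result is stored in user_id because a name like data-user-id, as tried in the question, is not a valid shell variable name: hyphens are not allowed, so that assignment fails before sed ever matters.

```shell
Username="StackOverflow"

# Hard-coded sample of the line selected from the page.
line='data-screen-name="StackOverflow" data-name="Stack Overflow" data-user-id="128700677"'

# The s command replaces the whole line with the captured digits (\1);
# the p flag prints the line only if the substitution succeeded.
user_id=$(printf '%s\n' "$line" |
  sed -n 's/^.*data-screen-name=.'"${Username}"'".*data-user-id="\([0-9]*\)".*$/\1/Ip')

printf '%s\n' "$user_id"
```

Because the pattern starts with ^.* and ends with .*$, it covers the entire line, so nothing but the captured digits is left after the replacement.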

The additional elements used:

^ means the start of the line.

\( ... \) are used to define a group.

"\([0-9]*\)" is a group made of only digits, surrounded by two " characters which are not part of the group. It is the first group found in the regex, so it can be referenced in the replacement part with \1.