bash awk grep metadata title

Grep the title of a page when the title tag spans several lines


I am trying to get the meta title of a website.

Some people write the title like this:

`<title>AllHeart Web INC, IT Services Digital Solutions Technology
</title>
`

`<title>AllHeart Web INC, IT Services Digital Solutions Technology</title>`

`<title>
AllHeart Web INC, IT Services Digital Solutions Technology
</title>`

There are more variations, but for now I am focusing on the three above.

I wrote a simple command, but it only captures the second form, and I am not sure how to grep the other forms:

`curl -s https://allheartweb.com/ | grep -o '<title>.*</title>'`

I also wrote some code (very bad, I guess) that greps the line numbers of the opening and closing tags, like this:

`
% curl -s https://allheartweb.com/ | grep -n '<title>'                   
7:<title>AllHeart Web INC, IT Services Digital Solutions Technology

% curl -s https://allheartweb.com/ | grep -n '</title>' 
8:</title>
`

and then store them and run a loop to get the title, which I guess is a bad idea.

Is there a way to get the title in all of these cases?


Solution

  • Try this:

    curl -s https://allheartweb.com/ | tr -d '\n' | grep -m 1 -oP '(?<=<title>).+?(?=</title>)'
    

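    You can remove newlines from the HTML with tr because they carry no meaning in the title. The grep step then uses a lazy match (.+?) between two lookarounds to print the shortest string enclosed in <title> and </title>. Note that -m 1 limits matching lines, not matches, and the whole page is now a single line, so if the page contains more than one <title> element, pipe the result through head -n 1 to keep only the first.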

    This is quite a simple approach, of course. xmllint would be better, but it is not available on all platforms by default.
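
The pipeline above could be wrapped in a small shell function so it can be checked against the three layouts from the question without hitting the network. This is only a sketch: the name `extract_title` is hypothetical, it assumes GNU grep built with PCRE support (`-P`), and it replaces line breaks with spaces instead of deleting them so that words on adjacent lines are not glued together. It only matches a bare `<title>` tag without attributes.

```shell
#!/usr/bin/env bash
# extract_title (hypothetical helper): read HTML on stdin and print the
# text of the first <title> element. Assumes GNU grep with -P support.
extract_title() {
  tr '\r\n' '  ' |                            # line breaks become spaces, so the page is one line
    grep -oP '(?<=<title>).+?(?=</title>)' |  # lazy match between the tags, one match per output line
    head -n 1 |                               # keep only the first match
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'  # trim leading and trailing whitespace
}

# The three layouts from the question, fed in as literal strings.
printf '<title>AllHeart Web INC, IT Services Digital Solutions Technology\n</title>\n' | extract_title
printf '<title>AllHeart Web INC, IT Services Digital Solutions Technology</title>\n' | extract_title
printf '<title>\nAllHeart Web INC, IT Services Digital Solutions Technology\n</title>\n' | extract_title
```

Each of the three calls should print the same title text with no surrounding whitespace. Against a live page it would be used as `curl -s https://allheartweb.com/ | extract_title`.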