I am working on a scraping project for a company. I used Python selenium, mechanize , BeautifulSoup4 etc. libraries and had been successful on putting data into MySQL database and generating reports they wanted.
But I am curious : why there is no standardization on structure of websites. Every site has a different name\id for username\password fields. I looked at Facebook and Google Login pages, even they have different naming for username\password fields. also, other elements are also named arbitrarily and placed anywhere.
One obvious reason I can see is that bots will eat up lot of bandwidth and websites are basically targeted to human users. Second reason may be because websites want to show advertisements.There may be other reasons too.
Would it not be better if websites don't have to provide API's and there would be a single framework of bot\scraper login. For example, Every website can have a scraper friendly version which is structured and named according to a standard specification which is universally agreed on. And also have a page, which shows help like feature for the scraper. To access this version of website, bot\scraper has to register itself.
This will open up a entirely different kind of internet to programmers. For example, someone can write a scraper that can monitor vulnerability and exploits listing websites, and automatically close the security holes on the users system. (For this those websites have to create a version which have such kind of data which can be directly applied. Like patches and where they should be applied) And all this could be easily done by a average programmer. And on the dark side , one can write a Malware which can update itself with new attacking strategies.
I know it is possible to use Facebook or Google login using Open Authentication on other websites. But that is only a small thing in scraping.
My question boils down to, Why there is no such effort there out in the community? and If there is one, kindly refer me to it.
I searched over Stack overflow but could not find a similar. And I am not sure that this kind of question is proper for Stack overflow. If not, please refer me to the correct Stack exchange forum. I will edit the question, if something there is not according to community criteria. But it's a genuine question.
EDIT: I got the answer thanks to @b.j.g . There is such an effort by W3C called Semantic Web.(Anyway I am sure Google will hijack whole internet one day and make it possible,within my lifetime)
EDIT: I think what you are looking for is The Semantic Web
You are assuming people want their data to be scraped. In actuality, the data people scrape is usually proprietary to the publisher, and when it is scraped... they lose exclusivity on the data.
I had trouble scraping yoga schedules in the past, and I concluded that the developers were conciously making it difficult to scrape so third parties couldn't easily use their data.