Tags: web-crawler, bots, robots.txt, googlebot, slurp

How to set up a robots.txt which only allows the default page of a site


Say I have a site at http://example.com. I would really like to allow bots to see the home page, but any other page needs to be blocked, as it is pointless to spider. In other words:

http://example.com & http://example.com/ should be allowed, but http://example.com/anything and http://example.com/someendpoint.aspx should be blocked.

Further, it would be great if I could allow certain query strings to pass through to the home page: http://example.com?okparam=true

but not http://example.com?anythingbutokparam=true


Solution

  • So after some research, here is what I found: a solution accepted by the major search providers, Google, Yahoo & MSN (I could only find a validator here):

    User-Agent: *
    # Block every path by default
    Disallow: /*
    # Let the home page through when its query string starts with okparam=
    Allow: /?okparam=
    # Allow the bare home page; $ marks the end of the URL
    Allow: /$
    

    The trick is using $ to mark the end of the URL, so Allow: /$ matches the bare home page and nothing else.
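
    To sanity-check the rules before deploying, below is a minimal Python sketch of the longest-match semantics Google documents for wildcard patterns (the standard library's urllib.robotparser treats * and $ as literal characters, so it is not used here). The RULES list mirrors the file above; to_regex and is_allowed are illustrative helpers, not part of any library.

    import re

    # Rules from the robots.txt above, as (kind, pattern) pairs.
    RULES = [
        ("allow", "/?okparam="),
        ("allow", "/$"),
        ("disallow", "/*"),
    ]

    def to_regex(pattern):
        # '*' matches any run of characters; '$' anchors the end of the URL.
        escaped = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
        return re.compile("^" + escaped)

    def is_allowed(path):
        # The longest matching pattern wins; Allow beats Disallow on a tie.
        best = None
        for kind, pattern in RULES:
            if to_regex(pattern).match(path):
                key = (len(pattern), kind == "allow")
                if best is None or key > best[0]:
                    best = (key, kind)
        return best is None or best[1] == "allow"

    for path in ["/", "/?okparam=true", "/?anythingbutokparam=true",
                 "/anything", "/someendpoint.aspx"]:
        print(path, "->", "allowed" if is_allowed(path) else "blocked")

    Running it prints allowed for / and /?okparam=true and blocked for the other three paths, which is the behaviour the question asks for. Note the tie-break: / matches both Allow: /$ and Disallow: /* with the same pattern length, and per Google's documented rules the less restrictive Allow wins.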