How to Check robots.txt Before Web Scraping
If you scrape or crawl websites, checking robots.txt is one of the fastest ways to understand what a site asks bots to avoid. It won't replace legal review or proper access controls, but it's the baseline step for preventing crawl errors, accidental overload, and avoidable blocking.
This guide shows practical ways to verify robots.txt (browser, command line, and tools) and how to read it correctly, especially how Allow/Disallow are evaluated for a specific URL.
- The core robots.txt rules and the quickest way to verify them
- How to interpret Disallow vs Allow using "longest match" logic
- Common gotchas (host, redirects, fetch failures, encoding) and how to handle them
What is robots.txt?
A robots.txt file is a plain text file placed at a site's root that tells crawlers (bots) which URL paths they may crawl and which they should avoid. The standard behind this is the Robots Exclusion Protocol (REP), and the file is expected to live at /robots.txt.
Key takeaway: Under REP, rules are served from /robots.txt (lowercase), and matching is evaluated from the start of the URL path. The most specific rule wins (the longest matching path). Also, /robots.txt itself is implicitly allowed.
Important: robots.txt is not an access-control mechanism. Anything you list can become a roadmap to sensitive areas. If you must protect admin panels or private files, use authentication, IP allowlists, or other security controls (REP also warns about this in its security considerations).
The fastest way to verify robots.txt
A practical workflow for checking robots.txt looks like this:
- Confirm it exists by opening it directly
- Verify HTTP status codes and redirects
- Read the contents (User-agent / Allow / Disallow / Sitemap)
- Decide whether a specific target URL is allowed for your crawler
Google also documents that it fetches and parses robots.txt, and that scope is determined per host, protocol, and port.
Step 1: Check via URL
Start by checking whether the site has a robots.txt at the root.
Focus on two things:
- Does it load? (Not a 404.)
- Does it look intentional? CMS themes and plugins sometimes auto-generate surprising rules.
Important: A robots.txt placed in a subdirectory won't apply. Crawlers look for it at /robots.txt (per the spec).
Step 2: Check with curl
Browser checks are useful, but they're easy to misread (especially redirects or unexpected status codes). For web scraping, always validate how the server responds at the HTTP level.
Check the status code
```
curl -I https://example.com/robots.txt
```

In most cases you want 200 OK. If you get a 3xx redirect or 4xx/5xx errors, your crawler's behavior may need to change; REP defines different handling for missing vs unreachable files.
Fetch the body
```
curl -s https://example.com/robots.txt
```

Check with your crawler's User-Agent
Some sites respond differently based on User-Agent. At minimum, confirm you can fetch the file using the same UA string your crawler will send.
```
curl -s -A "MyCrawler/1.0" https://example.com/robots.txt
```

Key takeaway: REP rules apply per User-agent group. If you don't identify which group your crawler matches, it's easy to misjudge whether a path is allowed.
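If your crawler itself is written in Python, you can script the same check. The sketch below is illustrative rather than canonical: it assumes the third-party requests library and a made-up MyCrawler/1.0 UA string, so swap in your real User-Agent and target host.

```python
# Sketch: fetch robots.txt with the same User-Agent your crawler will send.
# Assumes the third-party "requests" library (pip install requests).
import requests

ROBOTS_URL = "https://example.com/robots.txt"
USER_AGENT = "MyCrawler/1.0"  # replace with your crawler's real UA string

response = requests.get(
    ROBOTS_URL,
    headers={"User-Agent": USER_AGENT},
    timeout=10,
)

print("Status:", response.status_code)                  # expect 200 in most cases
print("Content-Type:", response.headers.get("Content-Type"))
print(response.text[:500])                              # first 500 chars of the body
```

A 200 response with a text/plain body is the normal case; anything else should feed into the failure handling covered under "Unreachable vs missing" below.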
Step 3: How to read robots.txt
Most robots.txt files follow a simple structure: a User-agent group followed by rules like Allow and Disallow.
```
User-agent: *
Disallow: /private/
Allow: /private/public-info/
Sitemap: https://example.com/sitemap.xml
```

The directives that matter most
- User-agent: which crawler(s) the rules apply to
- Disallow: path prefixes the crawler should not fetch
- Allow: exceptions that override a broader Disallow
- Sitemap: where the sitemap lives (not required by REP, but widely supported)
Key takeaway: Allow/Disallow match against the URL path from the beginning, and the most specific (longest) match wins. If an Allow rule and a Disallow rule match equally, REP recommends choosing Allow.
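If you want to automate these reads, Python's standard library includes urllib.robotparser. A minimal sketch follows, with one caveat: this parser applies rules in the order they appear (first match wins) rather than the longest-match rule described above, so overlapping Allow/Disallow rules can be judged differently.

```python
# Sketch: read a live robots.txt with Python's standard library.
# Caveat: urllib.robotparser evaluates rules in file order (first match wins),
# not by longest match, so overlapping Allow/Disallow rules may be judged
# differently than the REP longest-match logic described in this article.
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetches and parses the file

print(parser.can_fetch("MyCrawler/1.0", "https://example.com/private/page"))
print(parser.site_maps())  # list of Sitemap URLs, or None (Python 3.8+)
```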
Example decision
With the following rules, the site is allowing /private/public-info/ but blocking other URLs under /private/.
```
User-agent: *
Disallow: /private/
Allow: /private/public-info/
```

Step 4: Decide whether a URL is allowed
To decide whether a specific URL can be crawled, apply this checklist:
- Find the User-agent group that matches your crawler
- Compare which Allow/Disallow rules match the target URL's path
- Use the longest match (most specific rule) as the winner
Quick checklist
| Check | Where to look | How to judge |
|---|---|---|
| Group match | User-agent | Pick the most appropriate group for your UA string |
| Blocked paths | Disallow | Does the path match from the start? |
| Allowed exceptions | Allow | If it's more specific than Disallow, Allow wins |
| The robots file itself | /robots.txt | Implicitly allowed (fetch failures are a separate issue) |
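To make the longest-match rule concrete, here is a small self-contained sketch. It is deliberately simplified (no wildcard or $ support, no percent-encoding normalization, a single User-agent group), so treat it as an illustration of the decision logic rather than a full REP parser.

```python
# Sketch: longest-match evaluation for one User-agent group's rules.
# Simplified: no wildcard (*, $) support, no percent-encoding handling.

def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of ("allow" | "disallow", path_prefix) pairs."""
    best_len = -1
    best_verdict = True  # no matching rule means the path is allowed
    for verdict, prefix in rules:
        if not path.startswith(prefix):
            continue
        if len(prefix) > best_len:
            best_len = len(prefix)
            best_verdict = (verdict == "allow")
        elif len(prefix) == best_len and verdict == "allow":
            best_verdict = True  # equal-length tie: prefer Allow, per REP
    return best_verdict

rules = [
    ("disallow", "/private/"),
    ("allow", "/private/public-info/"),
]

print(is_allowed("/private/secret.html", rules))             # False (blocked)
print(is_allowed("/private/public-info/page.html", rules))   # True (exception wins)
print(is_allowed("/about/", rules))                          # True (no rule matches)
```

Run against the example rules from Step 3, it reproduces the decision above: the exception under /private/public-info/ is allowed, other paths under /private/ are blocked, and paths with no matching rule default to allowed.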
Common pitfalls when checking robots.txt
Wrong host (or subdomain)
robots.txt is scoped by host, protocol, and port. For example, www.example.com and example.com may have different rules. Make sure you're checking /robots.txt on the same host as the URLs you plan to crawl.
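One way to avoid host mix-ups is to derive the robots.txt URL from the exact page URL you intend to crawl. A sketch using only the standard library:

```python
# Sketch: build the robots.txt URL from the page URL you intend to crawl,
# so host, protocol, and port always match.
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(page_url: str) -> str:
    parts = urlsplit(page_url)
    # Keep scheme + host (+ port, if any); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url_for("https://www.example.com/products/item?id=1"))
# -> https://www.example.com/robots.txt
print(robots_url_for("http://example.com:8080/blog/"))
# -> http://example.com:8080/robots.txt
```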
Redirects
If fetching robots.txt returns a redirect, different crawlers may follow it differently (or treat it as an error). Check for 3xx responses with curl -I, and if needed, follow redirects to see what you actually get.
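If you would rather inspect the redirect chain from Python, the sketch below (assuming the third-party requests library, which follows redirects by default) prints each hop and the final URL so you can spot redirects that change host or protocol:

```python
# Sketch: see where a robots.txt request actually ends up.
# Assumes the third-party "requests" library; it follows redirects by default.
import requests

response = requests.get("https://example.com/robots.txt", timeout=10)

for hop in response.history:                 # one entry per redirect followed
    print(hop.status_code, hop.url, "->", hop.headers.get("Location"))

print("Final:", response.status_code, response.url)
```

With curl, the same check looks like this: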
```
curl -I -L https://example.com/robots.txt
```

Unreachable vs missing
A "missing file" (such as 404) and "unreachable" (such as 5xx) can be treated differently. REP defines behavior for cases where the file can't be reached (for example, assuming a temporary full disallow). In scraping systems, explicitly decide what your crawler should do when it can't fetch robots.txt: stop, retry, or pause and re-check later.
Important: Don't assume "I couldn't read robots.txt" means "I'm free to crawl." It may be a temporary outage or a WAF rule blocking your requests.
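One way to encode that decision is a small status-code policy. The sketch below reflects one common reading of REP (client errors such as 404 mean no restrictions; server errors or network failures mean assume full disallow and retry later), but the exact thresholds and retry behavior are assumptions you should adapt to your own project:

```python
# Sketch: decide how to proceed when robots.txt can't be fetched cleanly.
# Policy here mirrors one common REP reading: 4xx -> no restrictions,
# 5xx or network errors -> assume full disallow and retry later.
# Adjust to your own project's rules and legal guidance.
import requests

def robots_policy(robots_url: str) -> str:
    """Return 'parse', 'allow-all', or 'disallow-retry-later'."""
    try:
        response = requests.get(robots_url, timeout=10)
    except requests.RequestException:
        return "disallow-retry-later"         # unreachable: back off, re-check
    if response.status_code == 200:
        return "parse"                        # normal case: parse the rules
    if 400 <= response.status_code < 500:
        return "allow-all"                    # missing: treat as no restrictions
    return "disallow-retry-later"             # 5xx etc.: assume full disallow

print(robots_policy("https://example.com/robots.txt"))
```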
Character encoding and content-type
The spec expects a UTF-8 text file, typically served as text/plain. If you see garbled characters or inconsistent allow/deny results, inspect response headers (especially Content-Type) and confirm the file encoding.
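From Python (again assuming the requests library), you can check the declared Content-Type and force a UTF-8 decode to surface encoding problems early:

```python
# Sketch: verify the declared content type and decode the body as UTF-8.
# Assumes the third-party "requests" library.
import requests

response = requests.get("https://example.com/robots.txt", timeout=10)

print("Content-Type:", response.headers.get("Content-Type"))  # expect text/plain
body = response.content.decode("utf-8", errors="replace")     # U+FFFD marks bad bytes
if "\ufffd" in body:
    print("Warning: body is not clean UTF-8; check the file's encoding.")
```

With curl, you can inspect the same headers directly: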
```
curl -I https://example.com/robots.txt | sed -n '1,20p'
```

Practical notes for production web scraping
Checking robots.txt is only the entry point. In real projects, also verify:
- Terms of service and whether an official API exists (some sites prohibit scraping even if robots.txt is permissive)
- Rate limits and load management
- Whether the data includes personal data or other sensitive information
- Whether "bypass" techniques could trigger legal or policy issues (get legal sign-off when needed)
Need a robots.txt-safe scraping plan?
If you're moving from a quick proof-of-concept to production scraping, we can help you design safe robots.txt handling (including error fallbacks), rate limits, and operational guardrails to reduce blocks and incidents.
Summary
To check robots.txt, (1) open /robots.txt to confirm it exists, (2) use curl to verify status codes and fetch the contents, and (3) determine allow/deny by identifying the right User-agent group and applying longest-match logic for Allow/Disallow. For web scraping in particular, define safe behavior for robots fetch failures (stop vs retry) and validate policies, load, and legal constraints to avoid preventable problems.