
How to Check robots.txt Before Web Scraping

Learn how to verify robots.txt for web scraping: fetch it, check status/redirects, interpret Allow vs Disallow, and avoid common crawler pitfalls.

Ibuki Yamamoto
December 29, 2025 · 6 min read


If you scrape or crawl websites, checking robots.txt is one of the fastest ways to understand what a site asks bots to avoid. It won’t replace legal review or proper access controls, but it’s the baseline step for preventing crawl errors, accidental overload, and avoidable blocking.

This guide shows practical ways to verify robots.txt (browser, command line, and tools) and how to read it correctly—especially how Allow/Disallow are evaluated for a specific URL.

What You’ll Learn
  • The core robots.txt rules and the quickest way to verify them
  • How to interpret Disallow vs Allow using “longest match” logic
  • Common gotchas (host, redirects, fetch failures, encoding) and how to handle them

What is robots.txt?

A robots.txt file is a plain text file placed at a site’s root that tells crawlers (bots) which URL paths they may crawl and which they should avoid. The standard behind this is the Robots Exclusion Protocol (REP), and the file is expected to live at /robots.txt.

Key takeaway: Under REP, rules are served from /robots.txt (lowercase), and matching is evaluated from the start of the URL path. The most specific rule wins (the longest matching path). Also, /robots.txt itself is implicitly allowed.

Important: robots.txt is not an access-control mechanism. Anything you list can become a roadmap to sensitive areas. If you must protect admin panels or private files, use authentication, IP allowlists, or other security controls (REP also warns about this in its security considerations).

The fastest way to verify robots.txt

A practical workflow for checking robots.txt looks like this:

  1. Confirm it exists by opening it directly
  2. Verify HTTP status codes and redirects
  3. Read the contents (User-agent / Allow / Disallow / Sitemap)
  4. Decide whether a specific target URL is allowed for your crawler

Google's crawler documentation also describes how it fetches and parses robots.txt, and notes that a robots.txt file's scope is determined per protocol, host, and port.
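As a rough illustration of that scoping rule, the sketch below (Python, assuming the requests library is installed; MyCrawler/1.0 is a placeholder UA) builds the robots.txt URL from the target URL's own scheme, host, and port, then fetches it and reports the status code.

from urllib.parse import urlsplit, urlunsplit
import requests

def fetch_robots_txt(target_url, user_agent="MyCrawler/1.0"):
    # robots.txt scope is per protocol + host + port, so build its URL
    # from the same components as the URL you plan to crawl.
    parts = urlsplit(target_url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    resp = requests.get(robots_url, headers={"User-Agent": user_agent}, timeout=10)
    print(robots_url, "->", resp.status_code)
    return resp

# Checks https://example.com/robots.txt, not one buried in a subdirectory.
fetch_robots_txt("https://example.com/products/123")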

Step 1: Check via URL

Start by checking whether the site has a robots.txt at the root.

Focus on two things:

  • Does it load? (Not a 404.)
  • Does it look intentional? CMS themes and plugins sometimes auto-generate surprising rules.

Important: A robots.txt placed in a subdirectory won’t apply. Crawlers look for it at /robots.txt (per the spec).

Step 2: Check with curl

Browser checks are useful, but they’re easy to misread (especially redirects or unexpected status codes). For web scraping, always validate how the server responds at the HTTP level.

Check the status code

curl -I https://example.com/robots.txt

In most cases you want 200 OK. If you get a 3xx redirect or a 4xx/5xx error, your crawler may need to behave differently, because REP defines different handling for a missing file than for an unreachable one.

Fetch the body

curl -s https://example.com/robots.txt

Check with your crawler’s User-Agent

Some sites respond differently based on User-Agent. At minimum, confirm you can fetch the file using the same UA string your crawler will send.

curl -s -A "MyCrawler/1.0" https://example.com/robots.txt

Key takeaway: REP rules apply per User-agent group. If you don’t identify which group your crawler matches, it’s easy to misjudge whether a path is allowed.
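As a minimal sketch of that group-matching step (not a full REP parser; the group names and rules below are illustrative), the function picks the group whose User-agent value matches your crawler's product token case-insensitively, falling back to the * group.

def pick_group(groups, product_token):
    # groups: dict mapping each group's User-agent value (e.g. "MyCrawler", "*")
    #         to its list of Allow/Disallow rules.
    # REP matches the product token case-insensitively; "*" is the fallback.
    token = product_token.lower()
    for name, rules in groups.items():
        if name.lower() == token:
            return rules
    return groups.get("*", [])

# A crawler sending "MyCrawler/1.0" has the product token "MyCrawler",
# so it matches a "User-agent: MyCrawler" group; otherwise "*" applies.
print(pick_group({"MyCrawler": ["Disallow: /admin/"], "*": ["Disallow: /private/"]}, "mycrawler"))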

Step 3: How to read robots.txt

Most robots.txt files follow a simple structure: a User-agent group followed by rules like Allow and Disallow.

User-agent: *
Disallow: /private/
Allow: /private/public-info/
Sitemap: https://example.com/sitemap.xml

The directives that matter most

  • User-agent: which crawler(s) the rules apply to
  • Disallow: path prefixes the crawler should not fetch
  • Allow: exceptions that override a broader Disallow
  • Sitemap: where the sitemap lives (not required by REP, but widely supported)

Key takeaway: Allow/Disallow match against the URL path from the beginning, and the most specific (longest) match wins. If an Allow rule and a Disallow rule match equally, REP recommends choosing Allow.
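To make the longest-match rule concrete, here is a minimal sketch in Python. It ignores wildcards such as * and $ that many real parsers also support, so treat it as an illustration of the rule rather than a complete REP implementation.

def is_allowed(path, rules):
    # rules: (directive, value) pairs from the matching User-agent group,
    #        e.g. [("Disallow", "/private/"), ("Allow", "/private/public-info/")]
    # The longest matching value wins; on an exact tie, Allow wins.
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, value in rules:
        if value and path.startswith(value):
            if len(value) > best_len or (len(value) == best_len and directive == "Allow"):
                best_len = len(value)
                allowed = (directive == "Allow")
    return allowed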

Example decision

With the following rules, the site is allowing /private/public-info/ but blocking other URLs under /private/.

User-agent: *
Disallow: /private/
Allow: /private/public-info/
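
Running the sketch above against these rules (the paths are made up for illustration):

rules = [("Disallow", "/private/"), ("Allow", "/private/public-info/")]
print(is_allowed("/private/public-info/page.html", rules))  # True: Allow is the longer match
print(is_allowed("/private/internal.html", rules))          # False: only Disallow matches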

Step 4: Decide whether a URL is allowed

To decide whether a specific URL can be crawled, apply this checklist:

  • Find the User-agent group that matches your crawler
  • Compare which Allow/Disallow rules match the target URL’s path
  • Use the longest match (most specific rule) as the winner

Quick checklist

  • Group match (User-agent): pick the most appropriate group for your UA string
  • Blocked paths (Disallow): does the target path match the rule from the start?
  • Allowed exceptions (Allow): if the Allow rule is more specific than the matching Disallow, Allow wins
  • The robots file itself (/robots.txt): implicitly allowed (fetch failures are a separate issue)
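
For a quick end-to-end check, Python's standard library ships urllib.robotparser. One caveat: its implementation applies rules in the order they appear in the file rather than strictly by longest match, so its answer can differ from the logic above on files that rely on Allow exceptions. Treat it as a convenience check rather than the final word.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the file over HTTP
print(rp.can_fetch("MyCrawler", "https://example.com/private/public-info/"))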

Common pitfalls when checking robots.txt

Wrong host (or subdomain)

robots.txt is scoped by host, protocol, and port. For example, www.example.com and example.com may have different rules. Make sure you’re checking /robots.txt on the same host as the URLs you plan to crawl.
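
A simple guard against this (a minimal sketch; the URLs are placeholders) is to compare the scheme, host, and port of each target URL with those of the robots.txt you actually fetched.

from urllib.parse import urlsplit

def same_scope(robots_url, target_url):
    # robots.txt scope = protocol + host + port, so all three must match.
    # Default ports stay as None here; normalize them if you need ":443"
    # and "no port" on https to be treated as equal.
    r, t = urlsplit(robots_url), urlsplit(target_url)
    return (r.scheme, r.hostname, r.port) == (t.scheme, t.hostname, t.port)

print(same_scope("https://example.com/robots.txt", "https://www.example.com/page"))  # False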

Redirects

If fetching robots.txt returns a redirect, different crawlers may follow it differently (or treat it as an error). Check for 3xx responses with curl -I, and if needed, follow redirects to see what you actually get.

curl -I -L https://example.com/robots.txt
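
The same check in Python (assuming the requests library) exposes each redirect hop via response.history and the final URL via response.url.

import requests

resp = requests.get("https://example.com/robots.txt", allow_redirects=True, timeout=10)
for hop in resp.history:
    print(hop.status_code, hop.url)   # each redirect hop, if any
print(resp.status_code, resp.url)     # final status and the URL actually fetched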

Unreachable vs missing

A “missing file” (such as 404) and “unreachable” (such as 5xx) can be treated differently. REP defines behavior for cases where the file can’t be reached (for example, assuming a temporary full disallow). In scraping systems, explicitly decide what your crawler should do when it can’t fetch robots.txt: stop, retry, or pause and re-check later.

Important: Don’t assume “I couldn’t read robots.txt” means “I’m free to crawl.” It may be a temporary outage or a WAF rule blocking your requests.
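
One way to make that decision explicit is to fix the policy in code up front. The mapping below is a sketch of one reasonable policy, loosely following REP's guidance on missing versus unreachable files; adapt it to your own requirements.

def robots_fetch_policy(status_code):
    # Assumes redirects (3xx) were already followed before calling this.
    # 200        -> parse the file and obey its rules
    # 4xx        -> treat as "no robots.txt": crawling is not restricted by it
    # 5xx/others -> treat as temporarily disallowed; retry later instead of crawling
    if status_code == 200:
        return "parse"
    if 400 <= status_code < 500:
        return "assume-allow"
    return "assume-disallow-and-retry"

print(robots_fetch_policy(503))  # "assume-disallow-and-retry"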

Character encoding and content-type

The spec expects a UTF-8 text file, typically served as text/plain. If you see garbled characters or inconsistent allow/deny results, inspect response headers (especially Content-Type) and confirm the file encoding.

curl -I https://example.com/robots.txt | sed -n '1,20p'
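
The equivalent check in Python (assuming requests; example.com is a placeholder) inspects the Content-Type header and decodes the body explicitly as UTF-8 so that bad bytes surface instead of being hidden.

import requests

resp = requests.get("https://example.com/robots.txt", timeout=10)
print(resp.headers.get("Content-Type"))  # ideally text/plain, often with charset=utf-8
body = resp.content.decode("utf-8", errors="replace")  # replacement chars mark undecodable bytes
print(body.splitlines()[:5])  # first few lines as a sanity check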

Practical notes for production web scraping

Checking robots.txt is only the entry point. In real projects, also verify:

  • Terms of service and whether an official API exists (some sites prohibit scraping even if robots.txt is permissive)
  • Rate limits and load management
  • Whether the data includes personal data or other sensitive information
  • Whether “bypass” techniques could trigger legal or policy issues (get legal sign-off when needed)

Need a robots.txt-safe scraping plan?

If you’re moving from a quick proof-of-concept to production scraping, we can help you design safe robots.txt handling (including error fallbacks), rate limits, and operational guardrails to reduce blocks and incidents.


Summary

To check robots.txt, (1) open /robots.txt to confirm it exists, (2) use curl to verify status codes and fetch the contents, and (3) determine allow/deny by identifying the right User-agent group and applying longest-match logic to Allow/Disallow. For web scraping in particular, define safe behavior for robots.txt fetch failures (stop vs retry) and review site policies, server load, and legal constraints to avoid preventable problems.

About the Author

Ibuki Yamamoto

Web scraping engineer with over 10 years of practical experience who has worked on numerous large-scale data collection projects. Specializes in Python and JavaScript and shares practical scraping techniques on technical blogs.
