
RSL 1.0 and robots.txt: From Blocking Bots to Licensing AI Use

Learn how RSL 1.0 extends robots.txt with machine-readable licensing terms so you can allow crawling under clear conditions for AI and data use.

Ibuki Yamamoto
February 6, 2026 · 4 min read

robots.txt used to be a simple yes/no switch for crawlers. In the generative AI era, that framing breaks down. The real question isn’t whether a bot can fetch your pages—it’s how the collected data will be used after the fact. More site owners now want options like “allow under specific conditions” or “allow only with compensation,” rather than a binary allow/deny model.

That’s the gap RSL 1.0 (Really Simple Licensing) is designed to fill. RSL builds on robots.txt as the discovery entry point, but adds a machine-readable way to publish usage terms and require license retrieval before a crawler uses your content.

What You’ll Learn
  • What robots.txt can (and can’t) do in practice
  • How RSL 1.0 adds “license terms” as a first-class concept
  • A practical rollout plan—and the operational steps that make it effective

Conclusion

Crawler control is shifting from “block with Disallow” to “publish terms and allow only with agreement (license acquisition)”. robots.txt still matters as the front door, but standards like RSL 1.0 are emerging as a realistic option for AI crawling because they can express “allowed vs. prohibited uses,” “required compensation,” and “conditions you must follow.”

At a high level, RSL 1.0 adds a License directive to robots.txt so crawlers can discover the URL of a license document and are expected to retrieve and follow it.

robots.txt basics

robots.txt is standardized as the Robots Exclusion Protocol (REP). It’s a convention that lets crawlers read and follow a site owner’s published instructions. In RFC 9309, robots.txt is defined as a mechanism for providing crawler-facing directives—not as authentication/authorization, and not as access control with guaranteed enforcement.

Key points to remember

  • robots.txt is built from rules using User-agent plus Allow/Disallow
  • Scope is limited to the same host, protocol, and port
  • Most reputable crawlers comply, but robots.txt doesn’t technically stop a non-compliant scraper
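The Allow/Disallow model above is exactly what Python's standard library exposes. The sketch below uses `urllib.robotparser` with a hypothetical robots.txt body to show that REP answers only one question, "may this path be fetched?", and says nothing about downstream use:

```python
from urllib import robotparser

# Hypothetical robots.txt content, for illustration only.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# REP answers "may I fetch this path?" -- nothing about how the
# fetched content may be used afterwards.
print(rp.can_fetch("MyCrawler", "https://example.com/articles/1"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
```

Note that `can_fetch` has no concept of "fetch for AI training" versus "fetch for indexing"; that gap is what the rest of this article is about.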

In day-to-day operations, robots.txt is often treated as search-engine guidance. Google also documents placement, scope, and how Googlebot fetches and interprets robots.txt to help prevent accidental de-indexing.

Where “blocking” falls short

Robots.txt-first governance runs into hard limits in the following scenarios.

It can’t express AI usage conditions

Common requirements today look like: “indexing is fine, but AI training or AI summaries are not.” REP, however, is primarily an Allow/Disallow model, which makes it difficult to represent use-case-specific conditions in a machine-readable way.

It can’t stop bad actors

robots.txt is a signal of intent—not an HTTP-level gate. If you need to mitigate abusive scraping, you still need other layers such as a WAF, rate limiting, signed delivery, or authentication.

It’s a weak starting point for negotiation

What many organizations actually want is: “You may use this, but only under these terms.” Traditional robots.txt provides no standard way to publish conditions (payment, permitted uses, retention limits, redistribution rules, attribution requirements, and so on) in a way that automation can reliably consume. That pushes teams back into one-off emails, manual reviews, and legal processes.

Note

robots.txt is not a contract by itself. That said, publishing clear policies—and having a counterparty access your site after referencing them—can become relevant later in legal analysis (contract formation, tort claims, unauthorized access allegations, etc.). Design your governance with legal/compliance input rather than treating it as “just an engineering setting.”

What is RSL 1.0?

RSL 1.0 (Really Simple Licensing) extends robots.txt so you can publish machine-readable usage terms and licensing requirements, and so crawlers can discover, retrieve, and comply with those terms. In public coverage, RSL is framed as an attempt to standardize AI crawling terms—covering allowed usage and compensation (payments or other consideration)—while keeping robots.txt as the familiar entry point.


Add a License directive to robots.txt

RSL guidance requires that an RSL-enabled robots.txt include a License directive pointing to an RSL license document (or feed). In RSL documentation, that directive is described as a global declaration (i.e., not tied to a specific user agent) so crawlers can discover it before they access or process content.

# robots.txt (example)
License: https://example.com/license.xml

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /private/

From “go away” to “here are the terms”

The key shift is moving from “don’t crawl” to “you may crawl, but only under these terms.” That gives publishers, e-commerce sites, and data providers a clearer foundation for designing conditions by use case (AI training, AI answers/summaries, search indexing, partners, and more) without relying entirely on bespoke negotiation.

How to roll it out

This is where implementation becomes real. Adopting RSL 1.0 is not just a robots.txt tweak—you need an operational plan that connects policy, legal review, and enforcement layers.

1. Break down your objectives

Start by making the policy decisions explicit.

  • Do you want to allow search indexing?
  • Do you want to allow AI training, summarization, or answer generation?
  • If you allow some AI uses, what compensation and reuse limits apply?
  • Does this align with your internal policies (terms of service, privacy, copyright, data policy)?

2. Prepare a license document

RSL assumes a license document (or feed) that robots.txt can point to. That document is where you define what’s permitted, what’s prohibited, and what conditions apply (payment, attribution, retention, redistribution, etc.). In practice, you should treat legal review as mandatory.

3. Add License to robots.txt

Following RSL guidance, add the License directive. Use an absolute URL; the license can be hosted on a different host than the robots.txt origin.
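Since the license may live on a different origin than robots.txt, a relative URL would be ambiguous. A quick sanity check for the absolute-URL requirement, using only the standard library (the helper name is my own):

```python
from urllib.parse import urlparse

def is_absolute_license_url(url: str) -> bool:
    """An absolute URL carries both a scheme and a host, so the license
    target stays unambiguous even when it's hosted on a different origin
    than the robots.txt that references it."""
    parts = urlparse(url)
    return bool(parts.scheme) and bool(parts.netloc)

print(is_absolute_license_url("https://cdn.example.net/license.xml"))  # True
print(is_absolute_license_url("/license.xml"))                         # False
```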

4. Add enforcement in other layers

RSL can formalize the “contract front door,” but it doesn’t magically block adversaries. In production, you typically pair it with controls like:

  • Rate limiting (by IP / ASN / User-Agent)
  • Bot detection (fingerprinting, JS challenges, header validation)
  • WAF/CDN bot management
  • Authentication for sensitive pages or signed URLs for high-value assets
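To make the rate-limiting layer concrete, here is a minimal in-memory token-bucket sketch. In production this logic usually lives in a WAF/CDN or a shared store (e.g. Redis) keyed by IP, ASN, or User-Agent; this version only illustrates the mechanism:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens/sec up to `capacity`,
    so clients get short bursts but a bounded sustained request rate."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=5)  # ~2 req/sec, bursts of up to 5
results = [bucket.allow() for _ in range(7)]
print(results)  # the burst is allowed, then requests are throttled
```

One bucket per client identity (not a single global bucket) is what lets you throttle an abusive crawler without slowing everyone else down.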

Operational tips that actually make this work

  • Define “who and what you allow” internally first; then align technical controls to match
  • Structure access logs so you can classify traffic by use case (AI vs. search vs. partners)
  • Template your response plan for violations (block, notify, escalate legally)
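For the log-classification tip, a simple User-Agent triage function is often enough to start. The bot token lists below are illustrative; build yours from your own observed traffic and the vendors' published documentation:

```python
# Illustrative User-Agent tokens -- maintain your own lists in practice.
AI_BOTS = ("GPTBot", "ClaudeBot", "CCBot")
SEARCH_BOTS = ("Googlebot", "Bingbot")

def classify_user_agent(ua: str) -> str:
    """Bucket a raw User-Agent string by use case for log analysis."""
    ua_lower = ua.lower()
    if any(token.lower() in ua_lower for token in AI_BOTS):
        return "ai"
    if any(token.lower() in ua_lower for token in SEARCH_BOTS):
        return "search"
    return "other"

print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))     # ai
print(classify_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # search
```

Keep in mind that User-Agent strings can be spoofed, so treat this as a first-pass label and verify important cases (e.g. via the vendors' published IP ranges) before acting on it.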

robots.txt vs. RSL 1.0

If you’re wondering which one to use, the cleanest answer is: they serve different roles.

  • Primary purpose: robots.txt (REP) directs crawlers to allow or avoid crawling paths; RSL 1.0 publishes usage terms and licensing requirements
  • Expressiveness: robots.txt is mainly Allow/Disallow; RSL 1.0 is designed for conditions and license acquisition
  • Enforcement: robots.txt is advisory by design; RSL 1.0 is a contract framework whose effectiveness depends on adoption and operations
  • Intended counterparties: robots.txt mainly targets search crawlers; RSL 1.0 also covers AI/data-collection crawlers

Practical considerations

Keep it consistent with your terms

If you add RSL, align it with your Terms of Service, copyright policy, and API terms. “robots.txt allows it, but the terms forbid it” is the kind of inconsistency that can weaken your position in negotiations or disputes.

Don’t tank search traffic by accident

If your goal is “limit AI usage while preserving organic search,” you need careful rule design—often by separating crawler identities (User-Agent) where appropriate. Validate your approach against Google’s published interpretation to avoid unintended blocks.

Be explicit about scope

REP scope is limited to a given host/protocol/port. If you serve content across multiple hosts (subdomains, image CDNs, static asset hosts, APIs), you’ll likely need per-host policies and configuration.
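Per-origin scope is easy to get wrong when assets are spread across subdomains and CDNs. This small helper makes the rule tangible: each distinct scheme+host+port needs its own /robots.txt (and, if you adopt RSL, its own License directive):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL governing a given page. REP scope is
    per scheme+host+port, so every origin you serve from has its own
    policy file."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url_for("https://example.com/articles/1"))
# https://example.com/robots.txt
print(robots_url_for("https://cdn.example.com/img/a.png"))
# https://cdn.example.com/robots.txt -- a different origin, a different policy
```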

Note

RSL is still a relatively new standard. If a crawler doesn’t support RSL, you may not get the “license-based permissioning” effect you expect. Plan for a mixed world: confirm counterparties’ support and backstop with technical controls (WAF, rate limits, auth) for non-compliant clients.

Want AI crawler terms that hold up?

If you're moving beyond simple robots.txt blocking, we can help you design an RSL-based policy, align it with your terms, and back it with practical controls (WAF, rate limits, logging) that work in production.


Summary

  • robots.txt is standardized via REP, but it provides crawler instructions—not hard access control
  • In the generative AI era, more teams need “allow + conditions (licensing),” not just “block”
  • RSL 1.0 adds a License directive to robots.txt and frames license retrieval/compliance as part of crawling
  • Real-world effectiveness depends on crawler adoption plus operational enforcement (WAF, rate limiting, etc.)


About the Author

Ibuki Yamamoto

Web scraping engineer with over 10 years of practical experience, having worked on numerous large-scale data collection projects. Specializes in Python and JavaScript, sharing practical scraping techniques in technical blogs.
