robots.txt used to be a simple yes/no switch for crawlers. In the generative AI era, that framing breaks down. The real question isn't whether a bot can fetch your pages; it's how the collected data will be used after the fact. More site owners now want options like "allow under specific conditions" or "allow only with compensation," rather than a binary allow/deny model.
That's the gap RSL 1.0 (Really Simple Licensing) is designed to fill. RSL builds on robots.txt as the discovery entry point, but adds a machine-readable way to publish usage terms and require license retrieval before a crawler uses your content.
- What robots.txt can (and can't) do in practice
- How RSL 1.0 adds "license terms" as a first-class concept
- A practical rollout plan, and the operational steps that make it effective
Conclusion
Crawler control is shifting from "block with Disallow" to "publish terms and allow only with agreement (license acquisition)." robots.txt still matters as the front door, but standards like RSL 1.0 are emerging as a realistic option for AI crawling because they can express "allowed vs. prohibited uses," "required compensation," and "conditions you must follow."
At a high level, RSL 1.0 adds a License directive to robots.txt so crawlers can discover the URL of a license document and are expected to retrieve and follow it.
robots.txt basics
robots.txt is standardized as the Robots Exclusion Protocol (REP). It's a convention that lets crawlers read and follow a site owner's published instructions. In RFC 9309, robots.txt is defined as a mechanism for providing crawler-facing directives, not as authentication/authorization, and not as access control with guaranteed enforcement.
Key points to remember
- robots.txt is built from rules using User-agent plus Allow/Disallow
- Scope is limited to the same host, protocol, and port
- Most reputable crawlers comply, but robots.txt doesn't technically stop a non-compliant scraper
In day-to-day operations, robots.txt is often treated as search-engine guidance. Google also documents placement, scope, and how Googlebot fetches and interprets robots.txt to help prevent accidental de-indexing.
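The advisory nature of REP is easy to see from the client side: a well-behaved crawler parses robots.txt and checks every URL before fetching it. A minimal sketch with Python's standard-library parser (the rules and URLs here are invented for illustration):

```python
from urllib import robotparser

# A hypothetical robots.txt, parsed in-memory for illustration.
rules = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler consults the parser before every fetch.
print(rp.can_fetch("*", "https://example.com/private/report.html"))        # False
print(rp.can_fetch("Googlebot", "https://example.com/private/report.html"))  # True
```

Note what this sketch demonstrates by omission: nothing forces a client to call `can_fetch` at all, which is exactly why REP is advisory rather than enforcing.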
Where "blocking" falls short
Robots.txt-first governance runs into hard limits in the following scenarios.
It can't express AI usage conditions
Common requirements today look like: "indexing is fine, but AI training or AI summaries are not." REP, however, is primarily an Allow/Disallow model, which makes it difficult to represent use-case-specific conditions in a machine-readable way.
It can't stop bad actors
robots.txt is a signal of intent, not an HTTP-level gate. If you need to mitigate abusive scraping, you still need other layers such as a WAF, rate limiting, signed delivery, or authentication.
It's a weak starting point for negotiation
What many organizations actually want is: "You may use this, but only under these terms." Traditional robots.txt provides no standard way to publish conditions (payment, permitted uses, retention limits, redistribution rules, attribution requirements, and so on) in a way that automation can reliably consume. That pushes teams back into one-off emails, manual reviews, and legal processes.
Note
robots.txt is not a contract by itself. That said, publishing clear policies, and having a counterparty access your site after referencing them, can become relevant later in legal analysis (contract formation, tort claims, unauthorized access allegations, etc.). Design your governance with legal/compliance input rather than treating it as "just an engineering setting."
What is RSL 1.0?
RSL 1.0 (Really Simple Licensing) extends robots.txt so you can publish machine-readable usage terms and licensing requirements, and so crawlers can discover, retrieve, and comply with those terms. In public coverage, RSL is framed as an attempt to standardize AI crawling terms, covering allowed usage and compensation (payments or other consideration), while keeping robots.txt as the familiar entry point.
Add a License directive to robots.txt
RSL guidance requires that an RSL-enabled robots.txt includes a License directive pointing to an RSL license document (or feed). In RSL documentation, that directive is described as a global declaration (i.e., not tied to a specific user agent) so crawlers can discover it before they access or process content.
# robots.txt (example)
License: https://example.com/license.xml
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /private/

From "go away" to "here are the terms"
The key shift is moving from "don't crawl" to "you may crawl, but only under these terms." That gives publishers, e-commerce sites, and data providers a clearer foundation for designing conditions by use case (AI training, AI answers/summaries, search indexing, partners, and more) without relying entirely on bespoke negotiation.
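Because License is a global directive rather than part of a user-agent group, a crawler can pick it up with a simple scan of the robots.txt body. A minimal, hypothetical extraction sketch (this is not an official RSL client; the helper name is invented):

```python
def find_license_urls(robots_txt: str) -> list[str]:
    """Collect the values of global `License:` directives from robots.txt text."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # robots.txt directive names are conventionally matched case-insensitively.
        if line.lower().startswith("license:"):
            # Split only on the first colon so the URL's own "https:" survives.
            urls.append(line.split(":", 1)[1].strip())
    return urls

robots = """\
# robots.txt (example)
License: https://example.com/license.xml
User-agent: *
Disallow: /private/
"""
print(find_license_urls(robots))  # ['https://example.com/license.xml']
```

A real crawler would then fetch the discovered document and evaluate its terms before processing any content.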
How to roll it out
This is where implementation becomes real. Adopting RSL 1.0 is not just a robots.txt tweak; you need an operational plan that connects policy, legal review, and enforcement layers.
1. Break down your objectives
Start by making the policy decisions explicit.
- Do you want to allow search indexing?
- Do you want to allow AI training, summarization, or answer generation?
- If you allow some AI uses, what compensation and reuse limits apply?
- Does this align with your internal policies (terms of service, privacy, copyright, data policy)?
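One lightweight way to make these answers explicit before touching robots.txt is a per-use-case decision matrix that policy, legal, and engineering can all review. A hypothetical sketch (the use-case names, fields, and default-deny rule are our assumptions, not part of RSL):

```python
# Hypothetical policy matrix: every use case gets an explicit decision,
# plus the conditions that apply if it is allowed.
policy = {
    "search-indexing": {"allowed": True,  "conditions": None},
    "ai-training":     {"allowed": False, "conditions": None},
    "ai-summaries":    {"allowed": True,
                        "conditions": {"attribution": True, "payment": "per-crawl fee"}},
}

def decide(use_case: str) -> str:
    entry = policy.get(use_case)
    if entry is None:
        return "deny (no explicit policy)"  # default-deny for unlisted uses
    if not entry["allowed"]:
        return "deny"
    return f"allow, conditions={entry['conditions']}"

print(decide("ai-training"))  # deny
print(decide("ai-summaries"))
```

Whatever format you choose, the point is that each row becomes a concrete input to the license document in the next step, instead of an unstated assumption.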
2. Prepare a license document
RSL assumes a license document (or feed) that robots.txt can point to. That document is where you define whatâs permitted, whatâs prohibited, and what conditions apply (payment, attribution, retention, redistribution, etc.). In practice, you should treat legal review as mandatory.
3. Add License to robots.txt
Following RSL guidance, add the License directive. Use an absolute URL; the license can be hosted on a different host than the robots.txt origin.
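Since RSL expects an absolute URL here, a sanity check in whatever pipeline generates your robots.txt is cheap insurance. A small sketch using Python's standard library (the validation rule, scheme plus host present, is our simplification):

```python
from urllib.parse import urlparse

def is_absolute_url(value: str) -> bool:
    """True when the value carries both a scheme and a host,
    e.g. https://example.com/license.xml."""
    parsed = urlparse(value)
    return bool(parsed.scheme) and bool(parsed.netloc)

# A different host than the robots.txt origin is fine for the license document.
print(is_absolute_url("https://cdn.example.net/license.xml"))  # True
print(is_absolute_url("/license.xml"))                         # False
```

Wiring this into a CI check keeps a relative path from silently shipping in the License directive.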
4. Add enforcement in other layers

RSL can formalize the "contract front door," but it doesn't magically block adversaries. In production, you typically pair it with controls such as a WAF, rate limiting, signed delivery, or authentication.

Note
RSL is still a relatively new standard. If a crawler doesn't support RSL, you may not get the "license-based permissioning" effect you expect. Plan for a mixed world: confirm counterparties' support and backstop with technical controls (WAF, rate limits, auth) for non-compliant clients.

robots.txt vs. RSL 1.0

If you're wondering which one to use, the cleanest answer is: they serve different roles.

| Dimension | robots.txt (REP) | RSL 1.0 |
| --- | --- | --- |
| Primary purpose | Direct crawlers to allow or avoid crawling paths | Publish usage terms and licensing requirements |
| Expressiveness | Mainly Allow/Disallow | Designed for conditions and license acquisition |
| Enforcement | Advisory by design | Contract framework (effectiveness depends on adoption + operations) |
| Intended counterparties | Mainly search crawlers | Includes AI/data-collection crawlers |

Practical considerations

Keep it consistent with your terms
If you add RSL, align it with your Terms of Service, copyright policy, and API terms. "robots.txt allows it, but the terms forbid it" is the kind of inconsistency that can weaken your position in negotiations or disputes.

Don't tank search traffic by accident
If your goal is "limit AI usage while preserving organic search," you need careful rule design, often by separating crawler identities (User-Agent) where appropriate. Validate your approach against Google's published interpretation to avoid unintended blocks.

Be explicit about scope
REP scope is limited to a given host/protocol/port. If you serve content across multiple hosts (subdomains, image CDNs, static asset hosts, APIs), you'll likely need per-host policies and configuration.
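Whatever the licensing layer declares, non-compliant clients only respond to technical backstops such as rate limiting. A minimal token-bucket sketch of the idea (the capacity, refill rate, and client keying are invented for illustration; production setups usually do this in a WAF, reverse proxy, or CDN):

```python
import time

class TokenBucket:
    """Per-client token bucket: allow bursts up to `capacity` requests,
    refilled continuously at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per client identity (IP address, API token, or declared User-Agent).
bucket = TokenBucket(capacity=5, rate=1.0)  # burst of 5, then roughly 1 req/s
results = [bucket.allow() for _ in range(7)]
print(results)  # the first 5 requests pass; the burst is then throttled
```

Keying buckets by declared crawler identity also gives you per-counterparty limits, which pairs naturally with per-counterparty license terms.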
Want AI crawler terms that hold up?
If you're moving beyond simple robots.txt blocking, we can help you design an RSL-based policy, align it with your terms, and back it with practical controls (WAF, rate limits, logging) that work in production.
Summary
- robots.txt is standardized via REP, but it provides crawler instructions, not hard access control
- In the generative AI era, more teams need "allow + conditions (licensing)," not just "block"
- RSL 1.0 adds a License directive to robots.txt and frames license retrieval/compliance as part of crawling
- Real-world effectiveness depends on crawler adoption plus operational enforcement (WAF, rate limiting, etc.)