CAP for RSL: Implementing License Tokens for Crawlers
RSL (Really Simple Licensing) includes CAP (Crawler Authorization Protocol) to enforce licensing with real HTTP authentication—not the “please follow robots.txt” honor system. In practice, CAP makes crawlers present a License Token via an HTTP Authorization header. This guide walks through the shortest reliable implementation path: license discovery, token acquisition via OLP (/token), token presentation (Authorization: License), and error handling via WWW-Authenticate.
- The CAP-required HTTP headers and status codes
- How to obtain and validate License Tokens with OLP (/token and optionally /introspect)
- Implementation patterns for both crawlers and resource servers
How CAP works (the big picture)
CAP is the protocol that tells crawlers: “If this URL is RSL-licensed, you must present a token.” Technically, CAP uses the standard HTTP authentication framework by requiring a License authentication scheme in the Authorization header. The core request format is:
Authorization: License <license_token>
If the token is valid and the request is permitted, the server returns 200 OK with the content.
The key takeaways
- Crawlers request protected content using Authorization: License <token>.
- If the token is missing/invalid, the site returns 401 or 402, and explains what happened via WWW-Authenticate: License ....
- Crawlers obtain tokens from the license server via OLP /token, and (optionally) validate them with /introspect.
Prerequisite: Discovering RSL licensing
You can’t implement CAP unless a crawler can deterministically discover which resources are RSL-licensed and where the license terms live. A common pattern is to publish a license document URL in robots.txt via a License: directive. The crawler fetches robots.txt, finds the license URL, and downloads the RSL license XML.
RSL is designed to integrate with existing discovery mechanisms (for example: robots.txt, HTTP headers, and HTML link tags), so you’re not locked into a single channel.
Minimum robots.txt example
User-Agent: *
Allow: /
License: https://example.com/license.xml
If you only need to declare AI-related preferences (for example, whether training is allowed), you can also express policy using IETF “AI Preferences.” RSL’s role is broader: it aims to make the full licensing workflow machine-readable, including how to obtain permission and how payment/compensation rules apply.
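The robots.txt discovery step can be sketched in a few lines of Python. This is a minimal illustration, not a full robots.txt parser: the function name find_license_urls is hypothetical, and it assumes the License: directive name is matched case-insensitively and that "#" starts a comment (so a license URL containing a fragment would be truncated).

```python
from urllib.parse import urlparse


def find_license_urls(robots_txt: str) -> list:
    """Return the URLs named by License: directives in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Assumption: '#' starts a comment, as in ordinary robots.txt files.
        line = raw_line.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "license":
            continue
        value = value.strip()
        # Only keep absolute http(s) URLs; anything else is ignored.
        if urlparse(value).scheme in ("http", "https"):
            urls.append(value)
    return urls


sample = """User-Agent: *
Allow: /
License: https://example.com/license.xml
"""
print(find_license_urls(sample))  # ['https://example.com/license.xml']
```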
Crawler-side implementation
1) Locate the license URL
A typical crawler flow looks like this:
- Fetch robots.txt, then parse the License: directive to find the RSL license XML URL.
- Fetch the RSL license XML, then determine which URLs (or URL ranges) are covered and, if specified, the server attribute (the License Server base URL).
- Based on your intended usage (and any payment requirements), request a License Token from the license server using OLP.
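The second step above (reading the covered URLs and the server attribute from the license XML) might look like the sketch below. The document layout is an assumption: it supposes an rsl root element containing content elements that carry url and server attributes, in the namespace shown in this guide's token example. Check the RSL specification for the actual schema. The function name read_license_scopes is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <license> example in this guide.
RSL_NS = {"rsl": "https://rslstandard.org/rsl"}


def read_license_scopes(license_xml: str) -> list:
    """List each <content> element's URL pattern and optional license server.

    Assumes <content url="..." server="..."> elements below the root.
    """
    root = ET.fromstring(license_xml)
    scopes = []
    for content in root.iterfind(".//rsl:content", RSL_NS):
        scopes.append({
            "url": content.get("url"),
            "server": content.get("server"),  # None if no license server is declared
        })
    return scopes


sample = """<rsl xmlns="https://rslstandard.org/rsl">
  <content url="/data/*" server="https://api.example.com">
    <license><permits type="usage">all</permits></license>
  </content>
</rsl>"""
print(read_license_scopes(sample))
# [{'url': '/data/*', 'server': 'https://api.example.com'}]
```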
2) Obtain a License Token via OLP
RSL uses OLP (Open License Protocol) to issue License Tokens. From an implementation perspective, this behaves like an OAuth 2.0 token endpoint: you send an application/x-www-form-urlencoded request containing:
- resource: the URL you want to access/license
- license: the requested <license> XML element (URL-encoded)
The response returns an OAuth-style token, but with token_type set to license.
Common implementation pitfall
Don’t conflate two different authentication layers:
- CAP: how the crawler authenticates to the site (resource server) using Authorization: License ...
- OLP: how the crawler authenticates to the license server (typically via OAuth client authentication such as client_id/client_secret)
Example: requesting a token
# Example: request a license from /token (values are illustrative)
LICENSE_XML='<license xmlns="https://rslstandard.org/rsl"><permits type="usage">all</permits></license>'
ENC_LICENSE=$(python3 -c 'import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1], safe=""))' "$LICENSE_XML")
curl -sS -X POST "https://api.example.com/token" \
-u "${CLIENT_ID}:${CLIENT_SECRET}" \
-H "Content-Type: application/x-www-form-urlencoded" \
--data "grant_type=client_credentials" \
--data "resource=https%3A%2F%2Fexample.com%2F" \
--data "license=${ENC_LICENSE}" | jq
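For crawlers written in Python, the same request can be assembled with the standard library. The sketch below builds the request without sending it and checks the response shape. The helper names are hypothetical, and the access_token field name is an assumption based on the OAuth-style response described above; only token_type set to license comes from this guide.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret, resource, license_xml):
    """Build (without sending) a form-encoded OLP token request."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "resource": resource,
        "license": license_xml,  # urlencode percent-encodes the XML for us
    }).encode("ascii")
    # HTTP Basic client authentication, equivalent to curl's -u flag.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        token_url,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def parse_token_response(raw_json: str) -> dict:
    """Check that the response is a License Token before using it."""
    payload = json.loads(raw_json)
    if str(payload.get("token_type", "")).lower() != "license":
        raise ValueError("expected token_type 'license'")
    if not payload.get("access_token"):  # assumed OAuth-style field name
        raise ValueError("response has no access_token")
    return payload
```

Sending the request is then a matter of passing it to urllib.request.urlopen (or the HTTP client you already use) and feeding the response body to parse_token_response.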
3) Present the token with CAP
Once you have a License Token, include it on requests to the target site using the Authorization header:
GET /data HTTP/1.1
Host: example.com
User-Agent: YourCrawler/1.0
Authorization: License <license_token>
On success, CAP requires the server to return the content with 200 OK and include a Link header referencing the governing license.
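As a minimal sketch of the request above, the following builds (but does not send) a CAP request with Python's standard library. The function name build_cap_request, the URL, and the token value are illustrative assumptions.

```python
import urllib.request


def build_cap_request(url: str, license_token: str,
                      user_agent: str = "YourCrawler/1.0") -> urllib.request.Request:
    """Build a GET request carrying a License Token in the Authorization header."""
    # Reject empty tokens and tokens with whitespace, which would corrupt the header.
    if not license_token or any(c.isspace() for c in license_token):
        raise ValueError("license token must be a non-empty string without whitespace")
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": user_agent,
            "Authorization": f"License {license_token}",
        },
        method="GET",
    )


req = build_cap_request("https://example.com/data", "example-token")
print(req.get_header("Authorization"))  # License example-token
```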
4) Retry logic on failures
Under CAP, missing or invalid tokens produce 401 Unauthorized or 402 Payment Required, with details in WWW-Authenticate: License .... A robust crawler implementation reads those hints, re-discovers the license if needed, obtains/refreshes a token, and retries the request.
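One way to structure that retry logic is to parse the challenge first and then map the status code to a next step. The sketch below assumes a single License challenge per WWW-Authenticate header; the function names and the action labels it returns are hypothetical.

```python
import re
from typing import Optional

# Matches name="quoted value" or name=bare-token pairs.
_PARAM = re.compile(r'(\w+)\s*=\s*(?:"((?:[^"\\]|\\.)*)"|([^\s,"]+))')


def parse_license_challenge(header_value: str) -> Optional[dict]:
    """Parse a 'License ...' challenge into a dict; None for other schemes."""
    scheme, _, rest = header_value.strip().partition(" ")
    if scheme.lower() != "license":
        return None
    params = {}
    for name, quoted, bare in _PARAM.findall(rest):
        # Undo backslash escapes inside quoted strings.
        value = re.sub(r"\\(.)", r"\1", quoted) if quoted else bare
        params[name.lower()] = value
    return params


def next_action(status: int, challenge: Optional[dict]) -> str:
    """Choose the crawler's next step (labels are illustrative)."""
    if status == 200:
        return "done"
    if challenge is None:
        return "give_up"  # the failure is not a CAP License challenge
    if status == 401:
        return "acquire_token"
    if status == 402:
        return "pay_then_acquire_token"
    if status == 403:
        return "request_broader_license"
    return "give_up"


challenge = parse_license_challenge(
    'License error="invalid_request", '
    'error_description="Access to this resource requires a license"'
)
print(challenge["error"])           # invalid_request
print(next_action(401, challenge))  # acquire_token
```

In a real crawler it is worth capping the number of retries per URL so that a misconfigured site cannot cause a request loop.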
Site-side implementation
Decision points
The resource server (your site) has a straightforward set of responsibilities:
- Decide whether the requested URL is protected (i.e., falls under your published RSL scope).
- Check whether the Authorization header uses the License scheme.
- Validate the token and confirm that the request is permitted for the given resource (optionally by calling OLP /introspect).
- Return 200/401/402/403 and include CAP-required headers.
Status code mapping
| Situation | Recommended status | Required / recommended headers |
|---|---|---|
| No token presented, or token invalid | 401 | WWW-Authenticate: License + license reference (Link header or body) |
| Payment required | 402 | WWW-Authenticate: License + license reference (Link header or body) |
| Token is valid, but not permitted | 403 | WWW-Authenticate: License error="insufficient_scope" + Link |
| Permitted | 200 | Link: <…license.xml>; rel="license"; type="application/rsl+xml" |
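The table above can be expressed as a small, framework-independent function. This is a sketch under stated assumptions: cap_response and its payment_required flag are hypothetical, the introspection dict is assumed to carry the active and permitted fields described in this guide, and the invalid_token error value is borrowed from OAuth Bearer conventions (RFC 6750) rather than confirmed by the CAP text.

```python
LICENSE_LINK = '<https://example.com/license.xml>; rel="license"; type="application/rsl+xml"'


def cap_response(auth_header, introspection, payment_required=False):
    """Map a request's state to (status, headers) following the table above."""
    headers = {"Link": LICENSE_LINK}  # every response references the license
    scheme, _, token = (auth_header or "").strip().partition(" ")
    if scheme.lower() != "license" or not token.strip():
        # No usable License credentials were presented.
        if payment_required:
            headers["WWW-Authenticate"] = "License"
            return 402, headers
        headers["WWW-Authenticate"] = 'License error="invalid_request"'
        return 401, headers
    if not introspection or not introspection.get("active"):
        # Assumed error value; see the lead-in.
        headers["WWW-Authenticate"] = 'License error="invalid_token"'
        return 401, headers
    if not introspection.get("permitted"):
        headers["WWW-Authenticate"] = 'License error="insufficient_scope"'
        return 403, headers
    return 200, headers


status, headers = cap_response("License abc", {"active": True, "permitted": False})
print(status)  # 403
```

Keeping this decision in one pure function makes it easy to test the status mapping separately from the web framework and the license server call.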
Example: 401/402 responses
HTTP/1.1 401 Unauthorized
WWW-Authenticate: License error="invalid_request", error_description="Access to this resource requires a license"
Link: <https://example.com/license.xml>; rel="license"; type="application/rsl+xml"
Content-Type: text/plain; charset=UTF-8
In the HTTP authentication framework (RFC 7235), the server uses 401 plus WWW-Authenticate to tell the client what authentication scheme to use on retry. The client then resends the request with credentials in Authorization. CAP intentionally follows this model.
Token validation via /introspect
The most practical approach on the site side is often: forward the received License Token to the license server’s /introspect endpoint, then decide based on fields like active (token validity) and permitted (whether this resource is allowed).
curl -sS -X POST "https://api.example.com/introspect" \
-u "${CLIENT_ID}:${CLIENT_SECRET}" \
-H "Content-Type: application/x-www-form-urlencoded" \
--data "token=${LICENSE_TOKEN}" \
--data "resource=https%3A%2F%2Fexample.com%2Fdata" | jq
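On the receiving end, the site has to turn the introspection response into a decision. A minimal sketch, assuming the response is a JSON object with boolean active and permitted fields as described above; the function name and the returned labels are hypothetical. It fails closed, so malformed or unexpected responses are treated as an invalid token.

```python
import json


def interpret_introspection(raw_body: str) -> str:
    """Return 'allow', 'forbidden', or 'invalid' for an /introspect response."""
    try:
        data = json.loads(raw_body)
    except json.JSONDecodeError:
        return "invalid"  # fail closed on unparseable responses
    # Require literal JSON true, so strings like "false" never grant access.
    if not isinstance(data, dict) or data.get("active") is not True:
        return "invalid"
    if data.get("permitted") is not True:
        return "forbidden"
    return "allow"


print(interpret_introspection('{"active": true, "permitted": true}'))   # allow
print(interpret_introspection('{"active": true, "permitted": false}'))  # forbidden
print(interpret_introspection('{"active": false}'))                     # invalid
```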
Implementation gotchas
Caching and token refresh
Tokens include expires_in. A non-expiring license may use 0. Operationally, it’s safest to: (1) track expiration on the crawler side, and (2) assume tokens may still be revoked server-side—so call /introspect when it matters (or on suspicious/edge cases).
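Crawler-side expiry tracking can be as small as the sketch below. The class name CachedToken and the 30-second safety margin are illustrative assumptions; treating expires_in of 0 as non-expiring follows the description above. A token that looks fresh locally may still have been revoked, so this complements /introspect rather than replacing it.

```python
import time


class CachedToken:
    """Remember a License Token and when it should stop being used."""

    def __init__(self, token, expires_in, issued_at=None, skew=30.0):
        self.token = token
        self.issued_at = time.time() if issued_at is None else issued_at
        # expires_in == 0 is treated as "does not expire".
        self.expires_at = None if expires_in == 0 else self.issued_at + expires_in
        self.skew = skew  # refresh slightly early to absorb clock drift

    def is_fresh(self, now=None):
        """True while the token can still be sent without refreshing."""
        if self.expires_at is None:
            return True
        now = time.time() if now is None else now
        return now < self.expires_at - self.skew


token = CachedToken("example-token", expires_in=3600, issued_at=1000.0)
print(token.is_fresh(now=1000.0))  # True
print(token.is_fresh(now=4600.0))  # False
```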
Always return a license reference
CAP requires you to return a reference to the governing license even on error responses (for example via a Link header). This is crucial for interoperability: it’s how a crawler can reliably discover the terms and obtain/renew the correct token.
Be consistent about what 403 means
In CAP, 403 specifically means “the token is valid, but the request isn’t permitted” (i.e., insufficient_scope). If you return 403 for everything you want to block, crawlers can’t tell whether they should re-license, upgrade usage rights, or fix a token—so your intended licensing and payment flow breaks down.
Minimal implementation checklist
Here’s the smallest practical setup for implementing CAP-based License Tokens.
- Publisher (site/resource server): Publish the license URL in robots.txt. For protected URLs, require Authorization: License. If missing/invalid, return 401 (or 402 when payment is required) plus WWW-Authenticate and Link rel="license".
- Crawler: Follow robots.txt → license.xml, then acquire a token via OLP /token if needed. Send requests with Authorization: License <token>. On 401/402, read WWW-Authenticate, refresh/reacquire the token, and retry.
The fastest path to a robust implementation
Start with “site validates tokens by delegating to /introspect” and “crawler implements token acquisition + retry.” This approach is also more resilient if the draft spec evolves—you update the license server integration rather than reworking every edge case in your resource server.
Wrap-up
RSL’s CAP upgrades licensing enforcement from “robots.txt requests” to a real contract entry point using standard HTTP authentication (RFC 7235). If you implement (1) license discovery, (2) token acquisition via OLP, (3) token presentation with Authorization: License, and (4) error handling via WWW-Authenticate, you’ll cover the essential CAP behavior without getting lost in edge cases.