
CAP for RSL: Implementing License Tokens for Crawlers

Learn how RSL CAP enforces web crawler licensing with Authorization: License tokens, OLP /token issuance, /introspect validation, and 401/402/403 handling.

Ibuki Yamamoto
March 27, 2026 · 4 min read


RSL (Really Simple Licensing) includes CAP (Crawler Authorization Protocol) to enforce licensing with real HTTP authentication—not the “please follow robots.txt” honor system. In practice, CAP makes crawlers present a License Token via an HTTP Authorization header. This guide walks through the shortest reliable implementation path: license discovery, token acquisition via OLP (/token), token presentation (Authorization: License), and error handling via WWW-Authenticate.

What You’ll Learn
  • The CAP-required HTTP headers and status codes
  • How to obtain and validate License Tokens with OLP (/token and optionally /introspect)
  • Implementation patterns for both crawlers and resource servers

How CAP works (the big picture)

CAP is the protocol that tells crawlers: “If this URL is RSL-licensed, you must present a token.” Technically, CAP uses the standard HTTP authentication framework by requiring a License authentication scheme in the Authorization header. The core request format is:

Authorization: License <license_token>

If the token is valid and the request is permitted, the server returns 200 OK with the content.

The key takeaways

  • Crawlers request protected content using Authorization: License <token>.
  • If the token is missing/invalid, the site returns 401 or 402, and explains what happened via WWW-Authenticate: License ....
  • Crawlers obtain tokens from the license server via OLP /token, and (optionally) validate them with /introspect.

Prerequisite: Discovering RSL licensing

You can’t implement CAP unless a crawler can deterministically discover which resources are RSL-licensed and where the license terms live. A common pattern is to publish a license document URL in robots.txt via a License: directive. The crawler fetches robots.txt, finds the license URL, and downloads the RSL license XML.

RSL is designed to integrate with existing discovery mechanisms (for example: robots.txt, HTTP headers, and HTML link tags), so you’re not locked into a single channel.

Minimum robots.txt example

User-Agent: *
Allow: /

License: https://example.com/license.xml

If you only need to declare AI-related preferences (for example, whether training is allowed), you can also express policy using IETF “AI Preferences.” RSL’s role is broader: it aims to make the full licensing workflow machine-readable—including how to obtain permission and how payment/compensation rules apply.


Crawler-side implementation

1) Locate the license URL

A typical crawler flow looks like this:

  1. Fetch robots.txt, then parse the License: directive to find the RSL license XML URL.
  2. Fetch the RSL license XML, then determine which URLs (or URL ranges) are covered and—if specified—the server attribute (the License Server base URL).
  3. Based on your intended usage (and any payment requirements), request a License Token from the license server using OLP.

2) Obtain a License Token via OLP

RSL uses OLP (Open License Protocol) to issue License Tokens. From an implementation perspective, this behaves like an OAuth 2.0 token endpoint: you send an application/x-www-form-urlencoded request containing:

  • resource: the URL you want to access/license
  • license: the requested <license> XML element (URL-encoded)

The response returns an OAuth-style token, but with token_type set to license.
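A small sanity check on that response might look like this. The field names `access_token`, `token_type`, and `expires_in` follow the OAuth-style shape described above; the helper name and strictness are assumptions for illustration.

```python
import json

def parse_token_response(body: str) -> tuple[str, int]:
    """Extract the License Token and its lifetime from a /token response,
    rejecting responses whose token_type is not 'license'."""
    data = json.loads(body)
    if data.get("token_type") != "license":
        raise ValueError(f"expected token_type 'license', got {data.get('token_type')!r}")
    # expires_in of 0 can signal a non-expiring license (see gotchas below).
    return data["access_token"], int(data.get("expires_in", 0))
```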

Common implementation pitfall

Don’t conflate two different authentication layers:

  • CAP: how the crawler authenticates to the site (resource server) using Authorization: License ...
  • OLP: how the crawler authenticates to the license server (typically via OAuth client authentication such as client_id/client_secret)

Example: requesting a token

# Example: request a license from /token (values are illustrative)
LICENSE_XML='<license xmlns="https://rslstandard.org/rsl"><permits type="usage">all</permits></license>'
ENC_LICENSE=$(python -c 'import urllib.parse,sys; print(urllib.parse.quote(sys.argv[1]))' "$LICENSE_XML")

curl -sS -X POST "https://api.example.com/token" \
  -u "${CLIENT_ID}:${CLIENT_SECRET}" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  --data "grant_type=client_credentials" \
  --data "resource=https%3A%2F%2Fexample.com%2F" \
  --data "license=${ENC_LICENSE}" | jq

3) Present the token with CAP

Once you have a License Token, include it on requests to the target site using the Authorization header:

GET /data HTTP/1.1
Host: example.com
User-Agent: YourCrawler/1.0
Authorization: License <license_token>

On success, CAP requires the server to return the content with 200 OK and include a Link header referencing the governing license.

4) Retry logic on failures

Under CAP, missing or invalid tokens produce 401 Unauthorized or 402 Payment Required, with details in WWW-Authenticate: License .... A robust crawler implementation reads those hints, re-discovers the license if needed, obtains/refreshes a token, and retries the request.
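One way to structure that retry loop is sketched below. The function and parameter names are illustrative: `fetch` stands in for your HTTP client (returning status, the WWW-Authenticate challenge, and the body) and `get_token` for your OLP /token integration.

```python
def fetch_with_license(url, fetch, get_token, max_attempts=2):
    """Request a protected URL, (re)acquiring a License Token on 401/402.

    fetch(url, headers)  -> (status, www_authenticate, body)
    get_token(challenge) -> a fresh License Token string
    """
    token = None
    for _ in range(max_attempts):
        headers = {"Authorization": f"License {token}"} if token else {}
        status, challenge, body = fetch(url, headers)
        if status == 200:
            return body
        if status in (401, 402):
            # WWW-Authenticate: License ... explains what went wrong;
            # (re)acquire a token and retry.
            token = get_token(challenge)
            continue
        if status == 403:
            # Token is valid but this usage is not permitted; retrying
            # with the same license will not help.
            raise PermissionError(challenge or "insufficient_scope")
        raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError("could not obtain a usable license token")
```

Keeping the HTTP client and token acquisition injectable makes the retry policy easy to test without network access.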

Site-side implementation

Decision points

The resource server (your site) has a straightforward set of responsibilities:

  • Decide whether the requested URL is protected (i.e., falls under your published RSL scope).
  • Check whether the Authorization header uses the License scheme.
  • Validate the token and confirm that the request is permitted for the given resource (optionally by calling OLP /introspect).
  • Return 200/401/402/403 and include CAP-required headers.

Status code mapping

  • No token presented → 401, with WWW-Authenticate: License plus a license reference (Link header or body)
  • Payment required → 402, with WWW-Authenticate: License plus a license reference (Link header or body)
  • Token is valid, but not permitted → 403, with WWW-Authenticate: License error="insufficient_scope" plus a Link header
  • Permitted → 200, with Link: <…license.xml>; rel="license"; type="application/rsl+xml"
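That mapping can be reduced to one pure function. The argument names are assumptions; wire them to your own token validation and payment checks.

```python
def cap_status(has_token: bool, token_valid: bool,
               payment_required: bool, permitted: bool) -> int:
    """Map the CAP decision points to a response status code."""
    if not has_token or not token_valid:
        # Missing/invalid token: 402 when the license requires payment,
        # otherwise 401 with a WWW-Authenticate: License challenge.
        return 402 if payment_required else 401
    if not permitted:
        # Valid token, but this resource/usage is out of scope.
        return 403
    return 200
```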

Example: 401/402 responses

HTTP/1.1 401 Unauthorized
WWW-Authenticate: License error="invalid_request", error_description="Access to this resource requires a license"
Link: <https://example.com/license.xml>; rel="license"; type="application/rsl+xml"
Content-Type: text/plain; charset=UTF-8

In the HTTP authentication framework (RFC 7235), the server uses 401 plus WWW-Authenticate to tell the client what authentication scheme to use on retry. The client then resends the request with credentials in Authorization. CAP intentionally follows this model.
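Building those error-response headers is mechanical enough to centralize in one place. This helper is an illustrative sketch (the function name is an assumption), mirroring the 401 example above:

```python
def license_challenge(error: str, description: str, license_url: str) -> dict:
    """Build the CAP error-response headers: the License challenge and
    a Link header referencing the governing license."""
    return {
        "WWW-Authenticate": f'License error="{error}", error_description="{description}"',
        "Link": f'<{license_url}>; rel="license"; type="application/rsl+xml"',
    }
```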

Token validation via /introspect

The most practical approach on the site side is often: forward the received License Token to the license server’s /introspect endpoint, then decide based on fields like active (token validity) and permitted (whether this resource is allowed).

curl -sS -X POST "https://api.example.com/introspect" \
  -u "${CLIENT_ID}:${CLIENT_SECRET}" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  --data "token=${LICENSE_TOKEN}" \
  --data "resource=https%3A%2F%2Fexample.com%2Fdata" | jq
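Acting on the introspection response can then be a one-screen decision. The `active` and `permitted` fields follow the description above; treat the helper itself as a sketch, not a normative mapping.

```python
def allow_request(introspection: dict) -> int:
    """Decide a CAP status code from an OLP /introspect response."""
    if not introspection.get("active"):
        return 401  # token expired, revoked, or unknown
    if not introspection.get("permitted"):
        return 403  # valid token, but not for this resource/usage
    return 200
```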


Implementation gotchas

Caching and token refresh

Tokens include expires_in. A non-expiring license may use 0. Operationally, it’s safest to: (1) track expiration on the crawler side, and (2) assume tokens may still be revoked server-side—so call /introspect when it matters (or on suspicious/edge cases).

Always return a license reference

CAP requires you to return a reference to the governing license even on error responses (for example via a Link header). This is crucial for interoperability: it’s how a crawler can reliably discover the terms and obtain/renew the correct token.

Be consistent about what 403 means

In CAP, 403 specifically means “the token is valid, but the request isn’t permitted” (i.e., insufficient_scope). If you return 403 for everything you want to block, crawlers can’t tell whether they should re-license, upgrade usage rights, or fix a token—so your intended licensing and payment flow breaks down.

Minimal implementation checklist

Here’s the smallest practical setup for implementing CAP-based License Tokens.

  • Publisher (site/resource server): Publish the license URL in robots.txt. For protected URLs, require Authorization: License. If missing/invalid, return 401 (or 402 when payment is required) plus WWW-Authenticate and Link rel="license".
  • Crawler: Follow robots.txt to license.xml, then acquire a token via OLP /token if needed. Send requests with Authorization: License <token>. On 401/402, read WWW-Authenticate, refresh/reacquire the token, and retry.

The fastest path to a robust implementation

Start with “site validates tokens by delegating to /introspect” and “crawler implements token acquisition + retry.” This approach is also more resilient if the draft spec evolves—you update the license server integration rather than reworking every edge case in your resource server.

Need CAP/RSL working end-to-end?

If you're trying to enforce RSL terms in production, the hard part is usually the operational glue: token lifecycles, retries, and consistent 401/402/403 behavior. We can help you design, implement, and validate CAP/OLP flows for both crawlers and sites.

Feel free to reach out for scraping consultations and quotes.

Wrap-up

RSL's CAP upgrades licensing enforcement from "robots.txt requests" to a real contract entry point using standard HTTP authentication (RFC 7235). If you implement (1) license discovery, (2) token acquisition via OLP, (3) token presentation with Authorization: License, and (4) error handling via WWW-Authenticate, you'll cover the essential CAP behavior without getting lost in edge cases.


About the Author

Ibuki Yamamoto

Web scraping engineer with over 10 years of practical experience, having worked on numerous large-scale data collection projects. Specializes in Python and JavaScript, sharing practical scraping techniques in technical blogs.
