
The Reality of the 2026 AI Bot Surge: Exploring Web Media Strategies Based on TollBit and Akamai Metrics

AI bot traffic is rising fast. Use TollBit and Akamai signals to redesign measurement, rate limits, APIs, and contracts for sustainable web scraping control.

Ibuki Yamamoto
February 6, 2026 · 4 min read
What You’ll Learn
  • What the TollBit and Akamai metrics actually say about the AI-bot traffic surge
  • Where web publishers and media teams should rethink their anti-scraping strategy
  • Practical measurement, control, and contracting steps that hold up in 2026

Conclusion

Your 2026 data-collection strategy can't be reduced to "allow vs. block." The workable approach is to redesign operations around who collects what, how often, and through which path, and then run it in the order measure → classify → control → agree (contract).

TollBit reports that the share of AI-originated visits roughly quadrupled in a short period (from about 1 in 200 to 1 in 50 visits) and that bots ignoring robots.txt reached about 13%. Akamai has also warned that AI-driven automation traffic is accelerating and can undermine both business models and analytics. In this environment, relying on robots.txt alone is not a strategy. You need a combined playbook that includes API distribution, rate controls, bot monetization and contracts, and logging/analytics design.

What the Metrics Say About the Surge

TollBit's Observations

In its State of the Bots series, TollBit highlights the following trend lines across 2025:

  • The relative share of AI visits climbed quickly (early 2025: 1 AI visit per 200; later in the year: 1 per 50)
  • The proportion of AI bots bypassing robots.txt rises to around the 13% range
  • There are periods where RAG-style retrieval (fetching content to answer queries) exceeds bulk collection for model training

In other words, it's not just "grab everything once for training." The structural shift is toward continuous, distributed retrieval for daily answer generation. That hits hardest where freshness matters: prices and inventory, news, company profiles, and FAQ/help content.

(Summary) TollBit reports that the relative share of AI visits increased from 1/200 to 1/50, and that robots.txt-bypassing behavior rose to around 13%.


Akamai's Observations

In a press release dated November 4, 2025, Akamai argues that AI bots are meaningfully increasing automation traffic and can distort the foundations of web operations: business models, analytics, and performance. A key concern is that scraping of public content can take the value without sending users back, putting ad- and subscription-based revenue at risk.

Akamai's position is that AI-bot activity is rising fast enough to impact operations, measurement, and monetization. Publishers and web teams need metrics that don't automatically treat more traffic as more growth.

How to Use Supporting Data Points

As an additional signal, Cloudflare has publicly discussed blocking massive volumes of AI-bot requests. That kind of statement doesn't map 1:1 to your site (definitions and customer mixes differ), but it's a useful indicator that CDN and security providers also see this as non-trivial at scale.

Side-by-Side: What Each Metric Represents

Here's the same set of points organized by what is being measured. Note that each company uses different definitions and observation populations, so treat the numbers as directional rather than directly comparable.

  • TollBit
    What it measures: AI scraping/visit behavior across a publisher network
    What it suggests: Rising AI visit share, more robots.txt bypass, a shift toward RAG-style retrieval
    Practical takeaway: Traffic classification, bot monetization/contracts, and reference-retrieval API design
  • Akamai
    What it measures: Automation/fraud/bot trends across a broad customer base
    What it suggests: AI-bot growth can break the assumptions behind analytics and web business models
    Practical takeaway: Design bot controls not only as security, but as revenue and analytics integrity

Why This Becomes a 2026 Design Problem

The reason AI-bot growth forces a 2026 redesign comes down to three shifts:

  1. The goal of collection is changing: If RAG-style retrieval grows faster than bulk training crawls, access becomes ongoing rather than episodic.
  2. How it appears in logs is changing: Browser impersonation and human-like behaviors make simple User-Agent filtering less effective.
  3. Value transfer accelerates: If users get answers in summaries, click-through to the original source drops, pressuring ads and paywalls.

"Just block it" isn't a plan. Over-blocking can degrade UX for real users and can also catch legitimate crawlers (search engines, partners, accessibility tools). Start with measurement and classification, then tighten controls intentionally.

What Web Media Teams Should Revisit

Rebuild Measurement First

The first step is to measure with AI bots in mind. At minimum, update your access logs and analytics pipeline to capture:

  • A heuristic score using multiple signals (UA, ASN, IP ranges, JA3, etc.)
  • Whether the request path is HTML retrieval or API retrieval
  • Endpoint-level cost hotspots (search, product pages, images, APIs)
  • Observed robots.txt compliance (measure actual requests to disallowed paths)
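As a sketch of what that classification step can look like, here is a minimal heuristic scorer in Python. The signal names, weights, and ASN list are illustrative assumptions for this article, not a standard:

```python
# Hypothetical heuristic: combine weak signals into a single 0..1 bot score.
# Token list, ASNs, and weights are invented for illustration.

KNOWN_AI_UA_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")
DATACENTER_ASNS = {15169, 16509, 14618}  # example cloud/datacenter ASNs

def bot_score(ua: str, asn: int, executed_js: bool, hit_disallowed: bool) -> float:
    """Return a 0..1 score; higher means more likely automated."""
    score = 0.0
    if any(tok.lower() in ua.lower() for tok in KNOWN_AI_UA_TOKENS):
        score += 0.6  # self-identified AI crawler
    if asn in DATACENTER_ASNS:
        score += 0.2  # datacenter origin is a weak automation signal
    if not executed_js:
        score += 0.1  # no JS execution: common for simple fetchers
    if hit_disallowed:
        score += 0.3  # requested a robots.txt-disallowed path
    return min(score, 1.0)

print(bot_score("Mozilla/5.0 (compatible; GPTBot/1.0)", 14618, False, False))
```

In production you would calibrate the weights against labeled traffic and treat the score as an input to graduated controls, not a hard verdict.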

Use Graduated Controls (Not All-or-Nothing)

Operationally, staged controls tend to be more stable than full blocking:

  • Rate limiting (per IP, session, token/API key)
  • Isolating expensive paths (search results, deep pagination)
  • Clear monetization/contract routes (bot-facing terms, API key issuance)
  • Extra challenges only for suspicious traffic (JS challenges, proof-of-work, etc.)
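The first of those controls can be sketched as a per-key token bucket; the capacity and refill rate below are arbitrary example values, and a real deployment would usually live at the edge or in middleware:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Minimal per-key (IP, session, or API key) token-bucket rate limiter."""
    def __init__(self, capacity: int = 10, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.refill = refill_per_sec
        # key -> (remaining tokens, timestamp of last update)
        self.state = defaultdict(lambda: (capacity, time.monotonic()))

    def allow(self, key: str) -> bool:
        tokens, last = self.state[key]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill)
        if tokens >= 1:
            self.state[key] = (tokens - 1, now)
            return True
        self.state[key] = (tokens, now)
        return False

bucket = TokenBucket(capacity=3, refill_per_sec=0.5)
results = [bucket.allow("203.0.113.7") for _ in range(5)]
print(results)  # first 3 requests allowed, then denied until tokens refill
```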

The Limits of robots.txt

robots.txt works for crawlers that choose to comply, and fails for crawlers that don't. TollBit's reporting explicitly highlights rising robots.txt-bypass behavior.

robots.txt is a policy signal, not an enforcement mechanism. If you need enforcement, shift the design toward application- or edge-level controls.

At the same time, robots.txt is still useful for coordinating with search engines and other legitimate crawlers. In practice, the right move is usually not to remove it, but to run it in parallel with real controls.

Quick Check: The Official Spec

To avoid misusing robots.txt, it helps to anchor on the spec. The Robots Exclusion Protocol (REP) defines robots.txt as a mechanism that tells crawlers what is allowed or disallowed, with the expectation that the crawler implementation fetches, interprets, and chooses to follow it. It's not a technical access-control system.

(Summary) The Robots Exclusion Protocol defines how crawlers retrieve and interpret robots.txt to decide whether they should access specific paths.
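One way to measure observed compliance is to replay logged request paths against a snapshot of your robots.txt using Python's standard urllib.robotparser. The rules and log entries below are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt snapshot; in practice, load the file you actually serve.
robots_txt = """\
User-agent: *
Disallow: /search
Disallow: /api/internal
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Hypothetical (user-agent, path) pairs extracted from access logs.
logged_requests = [
    ("ExampleBot/1.0", "/articles/2026-outlook"),
    ("ExampleBot/1.0", "/search?q=pricing"),
    ("ExampleBot/1.0", "/api/internal/feed"),
]

violations = [(ua, path) for ua, path in logged_requests
              if not rp.can_fetch(ua, path)]
print(f"{len(violations)}/{len(logged_requests)} requests hit disallowed paths")
```

Tracking this ratio per crawler over time gives you the "observed robots.txt compliance" metric as hard data rather than an assumption.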

A 2026 Blueprint

Think in Four Layers

In practice, it's easier to design and operate controls when you separate them into four layers:

  1. Policy: What you allow vs. prohibit (by use case)
  2. Distribution: HTML, API, feeds, or licensed partnerships
  3. Control: Rate limiting, auth, WAF/bot management, monetization
  4. Measurement: AI vs. non-AI classification, cost, impact on revenue and referrals
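The four layers can be written down as a small policy matrix so the choices are explicit and reviewable. Every value below is an example choice, not a recommendation:

```python
# Illustrative four-layer control plan; keys and values are example choices.
CONTROL_PLAN = {
    "policy": {            # what you allow vs. prohibit, by use case
        "training_crawl": "contract_required",
        "rag_retrieval": "rate_limited",
        "search_engine": "allow",
    },
    "distribution": {      # preferred path per content type
        "news": "html+feed",
        "prices": "api_keyed",
        "archive": "licensed_partnership",
    },
    "control": {           # enforcement knobs
        "rate_limit_rps": 2,
        "auth": "api_key",
        "bot_management": "waf",
    },
    "measurement": {       # what you track to validate the plan
        "classify": ["ai", "search", "human"],
        "track": ["cost", "revenue", "referrals"],
    },
}

def decision_for(use_case: str) -> str:
    """Fall back to a challenge when a collector doesn't match a known use case."""
    return CONTROL_PLAN["policy"].get(use_case, "challenge")

print(decision_for("rag_retrieval"))
```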

Standardize Decisions with a Checklist

Make block vs. allow vs. charge a checklist decision, not a gut call:

  • Is the data high-frequency and time-sensitive? (If yes, APIs often win.)
  • Does retrieval primarily extract value, or does it drive referrals?
  • Where is load concentrated (search, listing pages, images, API endpoints)?
  • Can a contract realistically solve it (reachable party, identifiable entity)?
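The checklist can be encoded as a small routing function so the same questions produce the same answer every time. The logic below is a sketch of the decision's shape, not a vetted policy:

```python
# Hypothetical routing of a collector profile to block / allow / charge.
# The branch order and conditions are invented for illustration.

def route(high_freq: bool, drives_referrals: bool, load_hotspot: bool,
          contract_reachable: bool) -> str:
    """Return 'allow', 'charge', or 'block' for one collector profile."""
    if drives_referrals and not load_hotspot:
        return "allow"   # retrieval sends users back and is cheap: keep it open
    if contract_reachable and (high_freq or load_hotspot):
        return "charge"  # identifiable party plus real cost: monetize via API/contract
    if load_hotspot:
        return "block"   # anonymous heavy load on expensive paths
    return "allow"

print(route(high_freq=True, drives_referrals=False,
            load_hotspot=True, contract_reachable=True))
```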

FAQ

Is AI bot scraping illegal?

The legal analysis varies by jurisdiction, contract terms, collection methods, and the type of content. Operationally, start by reviewing risk across terms of service, authentication bypass, excessive load, personal data, and copyright/database rights. For high-risk cases, confirm with counsel.

Can you reliably tell AI bots from humans?

It's getting harder with single signals (like User-Agent) alone. A practical approach is to score traffic using multiple signals (behavior patterns, headers, ASN, cookies, JS execution) and run an operations loop that assumes false positives will happen.

What's the best-ROI mitigation?

In many environments, the best cost-performance comes from combining (1) rate limits on high-load paths, (2) caching optimization, and (3) a minimal, well-scoped API. Roll out more advanced controls in phases, starting where the financial impact is highest.

About the Author

Ibuki Yamamoto

Web scraping engineer with over 10 years of practical experience, having worked on numerous large-scale data collection projects. Specializes in Python and JavaScript, sharing practical scraping techniques in technical blogs.
