Web scraping is one of those terms everyone has heard—but many people still ask the same practical questions: “What is it, really?”, “Is it illegal?”, and “How is it different from crawling or using an API?” This guide answers those questions in a business-ready way, for both beginners and teams evaluating scraping for production use.
- What web scraping is—and how it fundamentally differs from crawling and APIs
- The most accurate answer to “Is scraping illegal?”—four legal risk categories and the key statutory hooks (Japan-focused)
- A real-world case to know: the Okazaki City Central Library incident (Japan)
- Common use cases and how to decide between in-house builds, tools, or outsourcing
- How to choose an implementation approach (Python, no-code tools, and more)
What is web scraping?
“Scraping” comes from the verb scrape—to rub or strip something off a surface. In tech, web scraping means automatically collecting information from websites and transforming it into a format you can use (spreadsheets, databases, JSON, CSV, and so on).
Think of tasks that would take hours in a browser—like checking product prices every day across multiple e-commerce sites, or filtering job listings into a spreadsheet. A scraper turns that manual work into an automated workflow that runs in seconds.
Web scraping in 30 seconds
- What it does: Automatically extracts the specific data you need from web pages
- How it works: Fetch the page → parse HTML → extract and normalize data
- Typical uses: Price monitoring, competitive analysis, market research, data collection for analytics/ML
- Is it illegal?: The act of scraping isn’t inherently illegal, but four legal risk areas matter in practice
Scraping is often discussed as if it’s one technique, but it’s helpful to separate two categories. In most modern contexts, “scraping” usually means the second one: web scraping.
- Screen scraping: Captures what is rendered on-screen (often used for legacy system migration or UI-driven automation)
- Web scraping: Fetches HTML over HTTP and parses it to extract specific elements (price, product name, URLs, etc.)
Web scraping vs. crawling vs. APIs
The most common source of confusion is how scraping relates to crawling and APIs. They solve different problems and play different roles, so getting the mental model right early makes everything else easier.
Crawling vs. scraping: “discover” vs. “extract”
Crawling is about discovering and collecting pages at scale—following links, expanding coverage, and building a list of URLs or a corpus. The canonical example is a search engine crawler like Googlebot.
Scraping is about extracting specific data from pages you already know you want.
| Aspect | Crawling | Scraping |
|---|---|---|
| Primary goal | Discover and collect pages across the web | Extract the data you need from a page |
| Scope | Wide and shallow (coverage-focused) | Narrow and deep (field-focused) |
| Main output | URL lists, full pages | Structured fields (price, name, stock, etc.) |
| Typical examples | Search engine crawlers, sitemap crawlers | Price trackers, competitive intelligence tools |
In production, you often combine both: crawl to collect URLs, then scrape each URL to extract structured data.
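To make that concrete, here's a minimal sketch (with hypothetical URLs and selectors) that crawls a listing page to discover product URLs and then scrapes each one for a specific field:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = "https://example.com/category/widgets"  # hypothetical listing page
soup = BeautifulSoup(requests.get(base, timeout=10).text, "html.parser")

# Crawl step: discover product page URLs linked from the listing
product_urls = [urljoin(base, a["href"]) for a in soup.select("a.product-link")]

# Scrape step: extract a specific field from each discovered page
for url in product_urls:
    page = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    print(url, page.select_one("h1").get_text(strip=True))
```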
APIs vs. scraping: “official interface” vs. “independent extraction”
An API is the service provider’s official interface—explicitly designed for programmatic access (often returning JSON). Familiar examples include the X (formerly Twitter) API and the Google Maps API.
Scraping, by contrast, extracts what’s visible in HTML, regardless of whether the site intended to provide a machine-friendly interface.
| Aspect | Using an API | Scraping |
|---|---|---|
| Provider posture | Explicitly allowed/published by the service | Collector extracts data independently |
| Accessible data | Only what the provider exposes | Anything rendered in HTML (in principle) |
| Format | Structured (JSON/XML) | Requires parsing HTML per page |
| Stability | Often versioned; changes may be announced | Breaks when the site’s DOM/layout changes |
| Legal/contract safety | Usually safe if you follow the terms | Higher risk: terms, copyright, privacy/data protection |
| Coverage | You can’t access data that isn’t exposed | If it’s on the page, you may be able to extract it |
A practical rule of thumb
If an API exists, use the API first. APIs typically win on stability, performance, and legal clarity. Scraping is best viewed as a way to compensate for API limitations—when there’s no API, the API doesn’t include what you need, or the limits/pricing make it impractical.
How web scraping works
Technically, web scraping breaks down into three steps. No matter which library or tool you use, the internal flow is usually a variation of this.
1. Fetch the page (HTTP request)
Send an HTTP request to the target URL and retrieve the response (HTML and related resources). In Python, people commonly use requests. In Node.js, fetch or axios is typical. For simple pages, this may be enough.
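As a rough illustration, here's a minimal Python sketch of this fetch step using requests; the URL and User-Agent string are placeholders, not a real target.

```python
import requests

url = "https://example.com/products"  # hypothetical target URL
response = requests.get(
    url,
    headers={"User-Agent": "example-scraper/1.0 (contact: ops@example.com)"},  # identify your bot
    timeout=10,
)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of silently continuing
html = response.text
print(html[:200])  # peek at the start of the raw HTML
```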
2. Parse the HTML
Convert the HTML string into a tree structure that’s easy to query. Popular options include Python’s BeautifulSoup and Node.js’s Cheerio. Once parsed, you can target elements via CSS selectors or XPath.
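Here's a minimal sketch of the parsing step with BeautifulSoup; the HTML fragment and class names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML fragment standing in for a fetched page
html = """
<div class="product">
  <h2 class="name">Example Widget</h2>
  <span class="price">$1,980</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
name = soup.select_one(".product .name").get_text(strip=True)
price = soup.select_one(".product .price").get_text(strip=True)
print(name, price)  # -> Example Widget $1,980
```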
3. Extract and normalize the data
Pull the fields you care about (product name, price, URL, etc.), then normalize them into your desired format—CSV, JSON, a database schema, and so on. This step usually includes the “unsexy” work: removing tags, converting encodings, and normalizing text like “$1,980” into 1980.
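And a minimal sketch of the normalization step; the field names and sample record are assumptions, but the price-cleaning logic is exactly the kind of "unsexy" work this step involves.

```python
import csv
import re

def parse_price(raw: str) -> int:
    """Turn a display string like '$1,980' into the integer 1980."""
    digits = re.sub(r"[^\d]", "", raw)
    return int(digits) if digits else 0

rows = [{"name": "Example Widget", "price": parse_price("$1,980")}]

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)  # one normalized row per product
```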
Whether a site is static or dynamic changes the difficulty
- Static (server-rendered) sites: The data is already in the HTML. requests + BeautifulSoup can be fast and reliable.
- Dynamic (JavaScript-rendered) sites / SPAs: Data appears after the browser runs JavaScript. You often need a headless browser like Selenium or Playwright, which increases CPU/memory cost significantly.
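For dynamic pages, a minimal Playwright sketch might look like the following (assuming Playwright and a Chromium browser are installed; the URL and selector are hypothetical):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-products", wait_until="networkidle")  # hypothetical SPA
    page.wait_for_selector(".price")  # wait until JavaScript has rendered the element
    price = page.inner_text(".price")
    browser.close()

print(price)
```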
Benefits of web scraping
The core value is simple: scraping makes continuous, large-scale data collection feasible at a cost and cadence that manual work can’t match.
- Automates large-scale collection: Collect thousands to millions of pages without manual effort—hours or days of copy/paste becomes a scheduled job.
- Accesses data with no public API: Many sites don’t provide an API. Scraping can be the only viable path when there’s no official channel.
- Supports near-real-time monitoring: Run hourly, daily, or on any schedule to track pricing and inventory changes continuously.
- Lets you design the output schema: Shape the data for your analytics stack, BI tooling, or ML pipeline instead of being constrained by a vendor’s format.
- Enables ongoing competitive and market intelligence: With consistent snapshots, you can detect trends that aren’t visible through one-off manual checks.
Downsides and operational pitfalls
Scraping gets hard after your first “it works.” Most of the real cost shows up in operations: maintenance, reliability, and compliance.
- It breaks when the site changes: A DOM change, class rename, or layout refactor can silently kill extraction. Plan for ongoing maintenance—forever.
- Rate limits and blocking are real: High-frequency access triggers IP blocks, throttling, CAPTCHAs, and bot detection. You’ll need sane concurrency, retries, and sometimes proxies (a minimal sketch follows this list).
- Dynamic sites raise the engineering bar: Headless browsers are slower and heavier; it’s not unusual to see an order-of-magnitude jump in compute usage.
- Legal/compliance adds ongoing work: You must keep re-checking terms, copyright considerations, and privacy obligations. In many companies, that means a repeatable legal review process.
- Anti-bot tech keeps evolving: Solutions like Cloudflare Bot Management, reCAPTCHA, and DataDome increase the cost of staying stable year over year.
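To make the rate-limit point concrete, here's a minimal sketch of polite throttling with retries and exponential backoff; the URL, delays, and retry counts are illustrative assumptions, not tuned production values.

```python
import time
import requests

def polite_get(url: str, max_retries: int = 3, delay_seconds: float = 1.0) -> requests.Response:
    """Fetch a URL with a fixed delay between attempts and backoff on failure."""
    for attempt in range(max_retries):
        time.sleep(delay_seconds)  # throttle: never hammer the target
        try:
            response = requests.get(url, timeout=10)
            if response.status_code == 429:  # rate limited: back off and retry
                time.sleep(2 ** attempt)
                continue
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before the next try
    raise RuntimeError("gave up after repeated rate limiting")

page = polite_get("https://example.com/products")  # hypothetical target
```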
Important: Operational cost usually exceeds build cost. If you start with a “one-and-done script” mindset, it often collapses within six months. Build monitoring, alerting, and change detection from day one.
Is web scraping illegal? Four legal risk areas (Japan)
Here’s the practical answer: web scraping itself isn’t inherently illegal. Technically, you’re sending HTTP requests and receiving public web content—similar to loading a page in a browser.
But the way you scrape, what you collect, and how you use it can trigger legal exposure. There are real cases where people were arrested or faced civil claims. Below are four common risk categories, framed with Japan-specific context.
① Business obstruction (Japan Penal Code Articles 233 / 234-2)
The most common path to a criminal case is causing disruption through excessive access. If your scraper destabilizes or takes down a target service, you may face allegations under Obstruction of Business provisions (e.g., Japan Penal Code Article 233 or Article 234-2).
Case: Okazaki City Central Library incident (2010, Japan)
In 2010, a man operated a crawler to collect “new arrivals” data from the Okazaki City Central Library catalog system. The library experienced access problems, and he was arrested on suspicion of obstructing business and reportedly detained for 22 days before prosecutors ultimately chose not to indict. Reporting at the time also noted that the crawler’s access pattern was not obviously aggressive (around ~1 request per second), and the system’s fragility played a major role. itmedia.co.jp
The lesson is uncomfortable but important: even “polite crawling” can create real risk if the target system is fragile. Estimate impact, throttle requests, and test carefully before production.
② Personal data / privacy law risk (Japan’s APPI)
If you collect personal data from the web (names, emails, phone numbers, etc.), Japan’s Act on the Protection of Personal Information (APPI) can apply depending on how you store and use it. Sensitive categories (“special care-required personal information”) are especially risky. For implementation planning, treat “scraping personal data for outreach lists” as a high-risk area that requires legal review.
- At collection time: ensure you have a lawful basis and a clear, documented purpose
- Sensitive categories: treat as high-risk and avoid collecting without robust justification and safeguards
- Sharing with third parties: can trigger additional obligations
For an English reference, see the translated APPI materials published by Japan’s Personal Information Protection Commission. ppc.go.jp
③ Terms of Service violations (civil liability)
If a site’s Terms of Service explicitly prohibit scraping or automated collection, violating those terms can create civil risk (claims for damages, injunctions, account termination, and more). Many major platforms prohibit automated collection in their terms.
In practice, treat terms review as a mandatory pre-flight checklist item, not an afterthought.
④ Copyright risk (reproduction right) and key exceptions
Scraping typically creates copies of web content (HTML, images, text) on your machines—at least transiently. Under Japanese copyright principles, copying can implicate the reproduction right.
Key exception in Japan: Copyright Act Article 30-4 (information analysis)
Japan’s 2018 copyright amendment introduced (and later implemented) a flexible exception commonly described as the “non-enjoyment purpose” rule. In plain English: if your purpose is not to enjoy the expressive content as a human (for example, using it for testing, data analysis, or machine processing), Article 30-4 can allow use without permission—within necessary scope. bunka.go.jp
That said, the exception is not unlimited. If your use unreasonably prejudices the rights holder’s interests—such as effectively copying a paid, analysis-ready database—risk rises quickly. nagashima.com
Should you follow robots.txt?
robots.txt is a configuration file that tells crawlers which paths the site would prefer bots not to access. In Japan, robots.txt is generally not treated as a binding legal control by itself. It’s best understood as a norm and a signal of intent.
Still, for global products and cross-border risk, ignoring robots.txt can look bad. In the US, for example, robots-related restrictions have appeared in litigation narratives (e.g., as evidence of intentional access). Even when it’s not determinative, it can make it harder to argue good faith.
Common web scraping use cases
“So what do teams actually do with scraping?” Here are the most common production use cases.
Price and inventory monitoring (e-commerce/retail)
Track competitor prices daily and adjust your own pricing dynamically. This kind of monitoring underpins pricing strategy, promotion detection, and in-stock/out-of-stock tracking across retailers (including Japan-specific platforms such as Rakuten).
Competitive and market research
Continuously capture competitor product launches, campaigns, and hiring signals. Automation turns days of manual research into daily decision inputs.
Aggregating real estate and job postings
Collect listings from multiple portals and normalize them into a searchable dataset for internal workflows or an aggregation product. (In Japan, examples include portals like SUUMO; globally, job boards like Indeed are commonly discussed in this context.)
News and social trend analysis
Gather brand or product mentions from media sites and social platforms for brand monitoring and sentiment analysis. PR and communications teams often rely on this for early detection.
Training data collection for AI/ML
Collect large-scale text and image data to train models or build domain-specific AI systems. In Japan, Copyright Act Article 30-4 is often referenced in discussions about enabling text-and-data mining for non-enjoyment purposes, but you still need to account for terms of service and privacy obligations. bunka.go.jp
SEO competitive analysis
Capture SERP rankings, competitor page structure, and backlink signals to plan SEO strategy. Many SEO platforms ultimately rely on crawling and extraction pipelines under the hood.
Three ways to implement web scraping
When you introduce scraping internally, the options typically fall into three buckets. The right choice depends on trade-offs between cost, flexibility, and operational load.
① Build it in-house (Python, etc.)
- Pros: Fully customizable, no license cost, you own the technical assets
- Cons: You own development, ops, and compliance; higher risk of knowledge silos; dynamic sites require advanced skills
- Best for: You have engineers and expect long-term operation, or requirements don’t fit off-the-shelf tools
② Use no-code / low-code tools (Octoparse, ParseHub, Apify, etc.)
- Pros: Fast to start, minimal coding, templates and UI-driven workflows
- Cons: Limited fine-grained control; recurring subscription costs; vendor lock-in risk
- Best for: No engineering team, quick validation, or targets with standard page structures
③ Outsource to a specialist vendor
- Pros: You can delegate implementation and operations; reduced failure risk via expert know-how; often includes compliance support
- Cons: Higher initial and ongoing cost; coordination overhead
- Best for: Large scale, ongoing operations, limited internal expertise, or a need to minimize legal/ops risk
| Option | Upfront cost | Operational burden | Flexibility | Legal support |
|---|---|---|---|---|
| In-house build | Low (depends on engineering time) | High (monitoring and fixes are on you) | Highest | None (in-house responsibility) |
| No-code tools | Low to medium | Medium (adjust configs when sites change) | Limited | Limited |
| Outsourcing | Medium to high | Low (vendor can run it end-to-end) | High | Strong |
Popular languages and tools for scraping
The right language and stack depend on what you’re scraping. Here are the most common production choices.
Python (the default choice for many teams)
Python is a common first pick thanks to mature libraries, huge community support, and strong ergonomics for data pipelines.
| Library | Main role | Best for |
|---|---|---|
| Requests | HTTP requests | Fast retrieval of static pages |
| BeautifulSoup | HTML/XML parsing | Readable extraction logic; a classic pairing with Requests |
| Scrapy | Crawling + extraction framework | Large-scale, concurrent, long-running production scrapers |
| Selenium | Browser automation | JS-rendered pages, login flows, complex interactions |
| Playwright | Browser automation | Modern, fast, cross-browser automation; strong alternative to Selenium |
JavaScript / Node.js
Node.js is a strong option when your organization is front-end heavy or when you’re targeting JS-heavy SPAs.
- Puppeteer: Headless Chrome automation library
- Playwright: Cross-browser automation library (Node.js version)
- Cheerio: Lightweight HTML parser with a jQuery-like API
- Crawlee: Production-grade crawling framework published by Apify
No-code tools
These tools let you build scraping workflows by recording browser-like actions and configuring extraction rules. They’re accessible to non-engineers.
| Tool | Strength | Watch-outs |
|---|---|---|
| Octoparse | Auto-detection features, lots of templates, beginner-friendly UI | High-volume runs typically require paid cloud plans |
| ParseHub | Handles complex dynamic sites (JS/AJAX) well | Can have a steeper learning curve |
| Apify | Large actor ecosystem and managed cloud execution | Not purely no-code; you’ll need to understand input schemas |
| Bright Data | Large proxy network and “web unlocker” tooling for bot detection hurdles | Often priced for enterprise; can be heavy for individual use |
A decision flow for adopting scraping
If you’re starting from scratch (or moving from a prototype into production), the order matters. Before you make anything “work,” confirm terms and alternatives.
Step 1: Check Terms of Service and robots.txt
Review the site’s Terms for keywords like “scraping,” “automated access,” “robots,” or “crawler.” You can read robots.txt at https://<domain>/robots.txt. As a baseline, avoid targets that explicitly forbid automated collection.
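If you want to automate the robots.txt part of this check, Python's standard library includes a parser; the domain and user agent below are placeholders.

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # hypothetical domain
robots.read()

user_agent = "example-scraper/1.0"
target = "https://example.com/products"

if robots.can_fetch(user_agent, target):
    print("robots.txt allows this path for our user agent")
else:
    print("robots.txt disallows this path; reconsider the target or approach")
```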
Step 2: Look for an API or alternatives
If the service provides an API, evaluate it first. Sometimes commercial datasets can replace scraping (e.g., market research providers or data marketplaces). Always validate: API if sufficient; dataset if sufficient; scrape only for what’s missing.
Step 3: Estimate frequency, volume, and target impact
Model request volume per day, number of pages, and expected load. As the Okazaki incident illustrates, “reasonable” access rates can still cause trouble if the target is fragile. Load-test carefully before production.
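As an illustrative back-of-the-envelope calculation: refreshing 20,000 pages once a day at one request per second means roughly 5.5 hours of continuous traffic against the target; compressing that into a shorter window raises the request rate, and the load on the target, accordingly.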
Step 4: Choose in-house vs. tool vs. outsourcing
- Engineers in-house + long-term operation: In-house builds can make sense
- No engineers + standard target sites + fast validation: No-code tools
- Large scale + ongoing operation + minimize legal/ops risk: Outsource to specialists
Step 5: Build monitoring and maintenance first
The safest mindset is “monitoring-first scraping”. At minimum, implement these from day one (a minimal sketch follows the list):
- Success-rate monitoring: Track success/failure by URL and by site
- Extraction quality monitoring: Missing required fields, type error rates
- Change monitoring: Detect abnormal day-over-day changes (often a sign selectors broke)
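As a starting point, here's a minimal sketch of what those checks can look like in code; the record structure, required fields, and alert threshold are assumptions to adapt to your own pipeline.

```python
from dataclasses import dataclass

@dataclass
class ScrapeResult:
    url: str
    ok: bool                   # did the fetch + extraction succeed?
    missing_fields: list[str]  # required fields that came back empty

def report(results: list[ScrapeResult], success_threshold: float = 0.95) -> None:
    total = len(results)
    success_rate = sum(r.ok for r in results) / total if total else 0.0
    missing = sum(1 for r in results if r.missing_fields)

    print(f"success rate: {success_rate:.1%}, records with missing fields: {missing}/{total}")
    if success_rate < success_threshold:
        # In production this would page someone or open a ticket
        print("ALERT: success rate below threshold - selectors may have broken")

report([
    ScrapeResult("https://example.com/p/1", True, []),
    ScrapeResult("https://example.com/p/2", False, ["price"]),
])
```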
Stuck Maintaining a Scraper in Production?
If your scraper works locally but breaks in production—DOM changes, bot blocks, throttling, or compliance reviews—we can help you design a stable, monitorable scraping pipeline from requirements to operations.
Web scraping FAQs
Q. Can web scraping be a crime?
Scraping isn’t inherently criminal. But if your access causes outages or disrupts service, you can face business obstruction allegations. If you mishandle personal data, privacy law risk increases. Terms violations can trigger civil claims. The accurate framing is: scraping can become illegal depending on how you do it.
Q. If I follow robots.txt, am I 100% safe?
No. robots.txt is a signal, not a full compliance strategy. Even if robots.txt allows crawling, Terms of Service can still forbid it. And even if robots.txt forbids crawling, that alone may not define legality. Treat this as a three-part check: Terms, robots.txt, and IP/privacy constraints.
Q. Is it okay to scrape pages that require login?
Login flows usually mean you’re explicitly accepting Terms. Many services prohibit automated access post-login, so scraping behind authentication is high-risk by default. If the data is business-critical, look for an API, partner agreement, or alternative licensed dataset.
Q. Can I do anything I want with scraped data?
Not necessarily. Even if collection is lawful, how you store, share, publish, or sell the data can trigger additional rules (copyright, privacy/data protection, trade secret/unfair competition). “You can collect it” does not automatically mean “you can use it freely.”
Q. If a site has an API, can I still scrape it?
You often can technically, but it’s rarely a good idea. APIs are more stable and clearer legally. Some providers also require that you use the API under their Terms. A healthy approach is: use the API by default, scrape only what the API can’t provide.
Q. Is scraping for AI training legal?
In Japan, Article 30-4 is frequently cited as enabling information analysis and text-and-data mining when the purpose is not “enjoyment” of the work. But it’s not a free pass: contractual restrictions (Terms) and privacy laws still apply, and uses that unreasonably harm rights holders can fall outside the exception. bunka.go.jp
Q. Is hobby scraping “safe” because it’s personal?
Personal/hobby intent doesn’t eliminate operational or legal risk. If you cause service disruption, the risk remains. The Okazaki case is often discussed precisely because it began as an individual technical project, yet escalated into a serious incident.
Summary
- Web scraping automatically extracts data from websites. The core flow is: fetch → parse → extract/normalize.
- Crawling is about discovery (“walking the web”). APIs are official interfaces. Scraping is independent extraction that often fills the gaps.
- Scraping isn’t inherently illegal, but you must manage four major risk areas: service disruption, privacy/personal data, Terms of Service, and copyright.
- In Japan, Copyright Act Article 30-4 can permit certain information analysis / TDM uses, but Terms and privacy constraints still apply.
- Python is a mainstream choice (Requests, BeautifulSoup, Scrapy, Playwright). Dynamic sites usually require headless browsers.
- Adopt scraping in this order: Terms/robots review → check API alternatives → estimate impact → choose build/tool/vendor → monitoring-first implementation.
Web scraping is powerful—but there’s a big gap between “a script that runs once” and “a system that runs safely in production.” If you design for compliance and operations from the start, you’ll move faster long-term.
References