Web Scraping with PHP: A Beginner-Friendly Guide
PHP web scraping usually comes down to two steps: fetch the HTML over HTTP, then parse it and extract the fields you need from the DOM. This guide shows a practical workflow using cURL (or Guzzle) for retrieval and DOMDocument/XPath or Symfony DomCrawler for extraction, plus the common failure points that trip beginners up.
- The core workflow for fetching HTML in PHP and extracting data reliably
- When to use DOMDocument + XPath vs. Symfony DomCrawler
- Error handling and responsible scraping practices (load, rate limits, site rules)
Web scraping in a nutshell
Web scraping is the end-to-end process of collecting information from web pages (typically HTML), transforming it, and storing it for later use. For beginners, the easiest mental model is a two-stage pipeline:
- Fetch the page via HTTP (GET)
- Parse the HTML and extract the elements you care about (DOM/CSS/XPath)
Start with static pages first. If a site renders content with JavaScript (a "dynamic" page), the HTML you download often won't contain the data you see in the browser, so extraction gets harder fast.
What to check before you scrape
Terms of service and robots.txt
Before you write code, check the target site's terms/guidelines and its robots.txt. Robots.txt is the standard mechanism for publishing crawler access policies, and the Robots Exclusion Protocol is formally documented as IETF RFC 9309 (Robots Exclusion Protocol).
Note: robots.txt rules aren't technically enforced, but they matter operationally. Scraping disallowed paths or sending high-frequency traffic can get your IP blocked, and depending on the situation, it can also create legal or contractual risk.
Decide how youâll retrieve the data
- Static HTML: Fetch with cURL/Guzzle, then parse and extract from the DOM
- Dynamic rendering: If there's an underlying API, prefer calling that API (within the site's rules; see the sketch below). Otherwise you may need browser automation/headless browsing.
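For the API route, the request is usually a plain JSON GET. A minimal sketch with cURL; the endpoint URL and the items/name keys are hypothetical and stand in for whatever the real API returns:

<?php
// Sketch: fetching a JSON API directly instead of scraping rendered HTML.
// The endpoint URL and the 'items'/'name' keys below are hypothetical.
$ch = curl_init('https://example.com/api/items?page=1');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT        => 20,
    CURLOPT_HTTPHEADER     => ['Accept: application/json'],
]);
$json = curl_exec($ch);
curl_close($ch);

if ($json === false) {
    throw new RuntimeException('Request failed');
}
$data = json_decode($json, true);          // decode into an associative array
foreach ($data['items'] ?? [] as $item) {  // assumed response shape
    echo ($item['name'] ?? ''), PHP_EOL;
}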
End-to-end workflow
To understand the implementation quickly, here's the minimal flow you'll use in most PHP scrapers:
- GET a URL and read the HTML string
- Fix encoding issues (only if needed)
- Parse the HTML into a DOM
- Locate nodes with XPath/CSS selectors and extract values
- Store results (CSV/DB/JSON); a minimal CSV sketch follows
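Step 5, storing results, can be as simple as writing a CSV file. A minimal sketch, assuming $rows holds whatever your extraction step produced:

<?php
// Sketch of step 5: writing extracted rows to CSV.
// $rows is a placeholder for the values your extraction step returned.
$rows = [
    ['Title A', 'https://example.com/a'],
    ['Title B', 'https://example.com/b'],
];
$fp = fopen(__DIR__ . '/results.csv', 'w');
foreach ($rows as $row) {
    fputcsv($fp, $row);
}
fclose($fp);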
Fetching HTML
Fetch with cURL
This approach uses PHP's built-in cURL extension. In practice, setting a User-Agent, following redirects, applying timeouts, and handling encoding reduces failures significantly. Commonly used options include CURLOPT_RETURNTRANSFER, CURLOPT_FOLLOWLOCATION, and CURLOPT_USERAGENT.
<?php
$url = 'https://example.com/';
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 20,
CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)',
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$error = curl_error($ch);
curl_close($ch);
if ($html === false) {
throw new RuntimeException('cURL error: ' . $error);
}
if ($httpCode < 200 || $httpCode >= 300) {
throw new RuntimeException('HTTP error: ' . $httpCode);
}
echo $html;

Why this matters: Some sites return 403 responses if you don't send a User-Agent. Always set a UA and reasonable timeouts as a baseline.
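Step 2 of the workflow, fixing encoding issues, matters when a site is not served as UTF-8. A minimal sketch, assuming the mbstring extension is available and that $html came from the fetch above; the candidate encoding list is an assumption you should adjust to your target pages:

<?php
// Sketch: normalize the fetched HTML to UTF-8 before parsing.
// $html is assumed to come from the fetch step above; requires ext-mbstring.
$encoding = mb_detect_encoding($html, ['UTF-8', 'SJIS-win', 'EUC-JP', 'ISO-8859-1'], true);
if ($encoding !== false && $encoding !== 'UTF-8') {
    $html = mb_convert_encoding($html, 'UTF-8', $encoding);
}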
Fetch with Guzzle
In production code, using Guzzle as your HTTP client often leads to cleaner structure: exceptions, headers, middleware, and retries are easier to manage. Guzzle is a widely used PHP HTTP client (see the Guzzle documentation) and can be installed via Composer.
composer require guzzlehttp/guzzle

<?php
require __DIR__ . '/vendor/autoload.php';
use GuzzleHttp\Client;
$client = new Client([
'timeout' => 20,
'connect_timeout' => 10,
'headers' => [
'User-Agent' => 'Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)',
],
]);
$response = $client->request('GET', 'https://example.com/');
$html = (string) $response->getBody();
echo $html;

Note: Guzzle throws an exception for connection failures and, by default, for 4xx/5xx responses, so wrap requests in try/catch where you expect failures.

Parsing HTML
DOMDocument caveats
To load HTML into PHP's built-in DOM, you'll typically use DOMDocument::loadHTML(). However, the official documentation warns that loadHTML() parses input using an HTML4 parser. Since modern browsers parse as HTML5, you can see differences in the resulting DOM structure, and it's not safe to rely on it for sanitizing HTML (PHP Manual: DOMDocument::loadHTML).
Common gotcha: The DOM you see in DevTools can differ from what DOMDocument builds. If extraction fails, check (1) whether you actually fetched the expected HTML, (2) whether the page is dynamically rendered, and (3) whether HTML4 vs. HTML5 parsing rules are changing the DOM structure.
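A quick way to check point (1) before blaming your selectors, assuming $html holds the page you fetched and 'Sample Product' stands in for text you expect to find:

<?php
// Sketch: sanity-check the raw HTML before parsing it.
// $html is assumed to come from the fetch step; 'Sample Product' is a placeholder.
if (strpos($html, 'Sample Product') === false) {
    // Likely dynamic rendering, a block/consent page, or the wrong URL.
    file_put_contents(__DIR__ . '/fetched.html', $html); // save it for inspection
}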
Extract with XPath
DOMDocument + DOMXPath is the most lightweight option: no extra libraries required. To avoid noisy warnings on imperfect markup, it's common to enable libxml internal errors and then clear (or log) them afterward (PHP Manual: libxml_use_internal_errors).
<?php
libxml_use_internal_errors(true);
// Load your Composer autoloader here if needed (not required for DOMDocument/DOMXPath)
$html = '<html><body><h1 class="title">Hello</h1></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//h1[@class='title']");
if ($nodes !== false && $nodes->length > 0) {
echo trim($nodes->item(0)->textContent);
}
libxml_clear_errors();

Extract with DomCrawler
Symfony's DomCrawler gives you a clean, jQuery-like API with CSS selectors. You can select nodes using CSS selectors via filter() or use XPath via filterXPath() (Symfony Docs: The DOM Crawler).
composer require symfony/dom-crawler symfony/css-selector

<?php
require __DIR__ . '/vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = '<html><body><h1 class="title">Hello</h1></body></html>';
$crawler = new Crawler($html);
$title = $crawler->filter('h1.title')->text();
echo trim($title);

How to choose: If you want minimal dependencies, use DOMDocument + XPath. If you expect a lot of selectors and prefer CSS syntax, DomCrawler is often easier to maintain.
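Real pages usually contain many repeated items rather than a single heading. Here is a DomCrawler sketch that loops over repeated elements with each() and reads an attribute with attr(); the li.item and a selectors are assumptions about the page's markup:

<?php
require __DIR__ . '/vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;

$html = '<ul>
  <li class="item"><a href="/a">Item A</a></li>
  <li class="item"><a href="/b">Item B</a></li>
</ul>';

$crawler = new Crawler($html);
// each() maps every matched node; text() and attr() read content and attributes
$items = $crawler->filter('li.item a')->each(function (Crawler $node) {
    return [
        'title' => trim($node->text()),
        'href'  => $node->attr('href'),
    ];
});
print_r($items);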
A practical starter template
This is a minimal "fetch, then extract" template you can run from a single file while learning (assumes a static page).
<?php
// 1) Fetch
function fetchHtml(string $url): string {
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 20,
CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)',
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$error = curl_error($ch);
curl_close($ch);
if ($html === false) {
throw new RuntimeException('cURL error: ' . $error);
}
if ($httpCode < 200 || $httpCode >= 300) {
throw new RuntimeException('HTTP error: ' . $httpCode);
}
return $html;
}
// 2) Extract (XPath)
function extractTitles(string $html): array {
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//h2");
$titles = [];
if ($nodes !== false) {
foreach ($nodes as $node) {
$titles[] = trim($node->textContent);
}
}
libxml_clear_errors();
return $titles;
}
$url = 'https://example.com/';
$html = fetchHtml($url);
$titles = extractTitles($html);
foreach ($titles as $t) {
echo $t . PHP_EOL;
}

Common failure modes (and how to debug them)
Fetched HTML is empty
- 403/429 (blocked or rate limited): check User-Agent, request frequency, IP reputation, and whether the site requires cookies
- Blocked after redirect: confirm you're following redirects with CURLOPT_FOLLOWLOCATION
- Timeouts: tune connect timeout vs. total timeout
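When a request comes back empty or keeps redirecting, it helps to log what cURL actually did. A small self-contained sketch using curl_getinfo() (call it before curl_close()):

<?php
// Sketch: log status, final URL after redirects, redirect count, and timing.
$ch = curl_init('https://example.com/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_TIMEOUT        => 20,
]);
$html = curl_exec($ch);
$info = curl_getinfo($ch);   // must be read before curl_close()
curl_close($ch);

error_log(sprintf(
    'status=%d final_url=%s redirects=%d total_time=%.2fs',
    $info['http_code'], $info['url'], $info['redirect_count'], $info['total_time']
));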
You canât find the element
- The page is dynamically rendered and the HTML response doesnât contain the data
- DOMDocument parsing differences change the structure (HTML4 parser warning; see PHP Manual: DOMDocument::loadHTML)
- Your selector is brittle (e.g., depends on frequently changing class names)
Too many libxml warnings
Messy HTML often triggers parsing warnings. Use libxml internal errors and log what you need for debugging (PHP Manual: libxml_use_internal_errors).
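If you want to see which warnings actually occurred, you can read them back with libxml_get_errors() before clearing them. A minimal sketch:

<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<p><customtag>hello</customtag></p>'); // non-standard tag to provoke a warning

// Each LibXMLError carries a severity level, a line number, and a message
foreach (libxml_get_errors() as $error) {
    error_log(sprintf('libxml [%d] line %d: %s', $error->level, $error->line, trim($error->message)));
}
libxml_clear_errors();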
Library comparison
To make selection easier, here's a quick comparison separated by "fetch" vs. "extract."
| Purpose | Option | Strengths | Trade-offs |
|---|---|---|---|
| HTTP fetching | cURL | Built-in and easy to keep dependency-free | Can become procedural and repetitive as requirements grow |
| HTTP fetching | Guzzle | Cleaner HTTP abstractions (PSR-7 ecosystem, middleware, better structure) | Requires Composer and adds dependencies |
| HTML extraction | DOMDocument + XPath | Lightweight; minimal dependencies | HTML4 vs HTML5 parsing differences can affect results |
| HTML extraction | Symfony DomCrawler | CSS selectors are readable and often more maintainable | Requires installing Symfony components |
Key points from official specs
DOMDocument::loadHTML uses an HTML4 parser; HTML5 parsing rules differ, so the resulting DOM structure may differ from what browsers produce (official documentation warning: PHP Manual, DOMDocument::loadHTML).
If your scraper "can't find anything," don't assume your code is wrong. First verify the HTML you fetched, then account for dynamic rendering, and finally consider parser behavior and specs.
Running scrapers responsibly
Keep request frequency low
- Sleep between requests (for example, 1â3 seconds)
- Cap retries and use backoff
- If you see 429/503, stop and increase the interval
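One way to express these rules in code, as a sketch that reuses the fetchHtml() helper from the starter template (the delay values are only examples to tune per site):

<?php
// Sketch: polite crawling loop with a fixed delay and simple backoff.
// fetchHtml() is the helper from the starter template; $urls is your own list.
$urls  = ['https://example.com/page1', 'https://example.com/page2'];
$delay = 2; // seconds between requests

foreach ($urls as $url) {
    try {
        $html = fetchHtml($url);
        // ... parse and store ...
    } catch (RuntimeException $e) {
        // On errors (including 429/503), wait longer before the next request
        $delay = min($delay * 2, 60);
    }
    sleep($delay);
}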
Cache your results
Simply avoiding repeated downloads of the same page reduces load and lowers the chance of getting blocked. Even while learning, save HTML locally and iterate on parsing and extraction offline.
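While learning, a tiny file cache is usually enough. A sketch that wraps the fetchHtml() helper from the starter template and keys cache files on the URL (the cache directory is arbitrary):

<?php
// Sketch: cache fetched HTML on disk so repeated runs don't re-download pages.
function fetchHtmlCached(string $url, string $cacheDir = __DIR__ . '/cache'): string {
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $cacheFile = $cacheDir . '/' . md5($url) . '.html';
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);   // reuse the local copy
    }
    $html = fetchHtml($url);                    // helper from the starter template
    file_put_contents($cacheFile, $html);
    return $html;
}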
Want a scraper that holds up in production?
If you have a PHP scraper working locally but keep hitting blocks, timeouts, or fragile selectors in real-world runs, we can help design and operate a safer, more stable collection pipeline.
Summary
- PHP scraping basics = fetching (cURL/Guzzle) + extraction (XPath/DomCrawler)
- DOMDocument can differ from browser DOMs due to HTML4 vs HTML5 parsing; verify the fetched HTML and parser behavior when elements "disappear"
- Check site rules and robots.txt, keep rates low, and cache responses for safer operation