Web Scraping with PHP: A Beginner-Friendly Guide

Learn PHP web scraping: fetch HTML with cURL or Guzzle, extract data via XPath or Symfony DomCrawler, and avoid common parsing and blocking pitfalls.

Ibuki Yamamoto
January 15, 2026 · 7 min read

PHP web scraping usually comes down to two steps: fetch the HTML over HTTP, then parse it and extract the fields you need from the DOM. This guide shows a practical workflow using cURL (or Guzzle) for retrieval and DOMDocument/XPath or Symfony DomCrawler for extraction—plus the common failure points that trip beginners up.

What You’ll Learn
  • The core workflow for fetching HTML in PHP and extracting data reliably
  • When to use DOMDocument + XPath vs. Symfony DomCrawler
  • Error handling and responsible scraping practices (load, rate limits, site rules)

Web scraping in a nutshell

Web scraping is the end-to-end process of collecting information from web pages (typically HTML), transforming it, and storing it for later use. For beginners, the easiest mental model is a two-stage pipeline:

  1. Fetch the page via HTTP (GET)
  2. Parse the HTML and extract the elements you care about (DOM/CSS/XPath)

Start with static pages first. If a site renders content with JavaScript (a “dynamic” page), the HTML you download often won’t contain the data you see in the browser—so extraction gets harder fast.
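
A quick way to tell the two cases apart is to fetch the page and search the raw response for a string you can see in the browser. A minimal sketch (the URL and the marker string are placeholders; str_contains needs PHP 8+, and file_get_contents over HTTP requires allow_url_fopen):

<?php
// Quick check: does the raw HTML contain text you can see in the
// browser? If not, the page is likely rendered client-side.
$html = file_get_contents('https://example.com/'); // placeholder URL

if ($html !== false && str_contains($html, 'Example Domain')) {
    echo 'Looks static: the data is in the HTML response.' . PHP_EOL;
} else {
    echo 'Possibly dynamic: check DevTools for an underlying API.' . PHP_EOL;
}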

What to check before you scrape

Terms of service and robots.txt

Before you write code, check the target site’s terms/guidelines and its robots.txt. Robots.txt is the standard mechanism for publishing crawler access policies, and the Robots Exclusion Protocol is formally documented as IETF RFC 9309.

Note: robots.txt rules aren’t “technically enforced,” but they matter operationally. Scraping disallowed paths or sending high-frequency traffic can get your IP blocked—and depending on the situation, it can also create legal or contractual risk.

Decide how you’ll retrieve the data

  • Static HTML: Fetch with cURL/Guzzle → parse and extract from the DOM
  • Dynamic rendering: If there’s an underlying API, prefer calling that API (within the site’s rules). Otherwise you may need browser automation/headless browsing.

End-to-end workflow

To understand the implementation quickly, here’s the minimal flow you’ll use in most PHP scrapers:

  1. GET a URL and read the HTML string
  2. Fix encoding issues (only if needed)
  3. Parse the HTML into a DOM
  4. Locate nodes with XPath/CSS selectors and extract values
  5. Store results (CSV/DB/JSON)
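
For step 2, the usual fix is normalizing the response to UTF-8 before parsing. A minimal sketch using mbstring; detection is heuristic, so tune the candidate encoding list for the sites you actually target:

<?php
// Step 2: normalize fetched HTML to UTF-8 before parsing.
// mb_detect_encoding is a best-effort guess; adjust candidates per site.
function toUtf8(string $html): string {
    $encoding = mb_detect_encoding($html, ['UTF-8', 'SJIS', 'EUC-JP', 'ISO-8859-1'], true);
    if ($encoding !== false && $encoding !== 'UTF-8') {
        return mb_convert_encoding($html, 'UTF-8', $encoding);
    }
    return $html;
}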

Fetching HTML

Fetch with cURL

This approach uses PHP’s built-in cURL extension. In practice, setting a User-Agent, following redirects, applying timeouts, and handling encoding reduces failures significantly. Commonly used options include CURLOPT_RETURNTRANSFER, CURLOPT_FOLLOWLOCATION, and CURLOPT_USERAGENT.

<?php
$url = 'https://example.com/';

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_CONNECTTIMEOUT => 10,
    CURLOPT_TIMEOUT => 20,
    CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)',
]);

$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$error = curl_error($ch);
curl_close($ch);

if ($html === false) {
    throw new RuntimeException('cURL error: ' . $error);
}
if ($httpCode < 200 || $httpCode >= 300) {
    throw new RuntimeException('HTTP error: ' . $httpCode);
}

echo $html;

Why this matters: Some sites return 403 responses if you don’t send a User-Agent. Always set a UA and reasonable timeouts as a baseline.

Fetch with Guzzle

In production code, using Guzzle as your HTTP client often leads to cleaner structure: exceptions, headers, middleware, and retries are easier to manage. Guzzle is a widely used PHP HTTP client (see the Guzzle documentation) and can be installed via Composer.

composer require guzzlehttp/guzzle

<?php
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client([
    'timeout' => 20,
    'connect_timeout' => 10,
    'headers' => [
        'User-Agent' => 'Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)',
    ],
]);

$response = $client->request('GET', 'https://example.com/');
$html = (string) $response->getBody();

echo $html;
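
One behavioral difference from raw cURL: by default Guzzle throws an exception for 4xx/5xx responses (the http_errors option), so production fetches usually live in a try/catch. A minimal sketch:

<?php
require __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\BadResponseException;
use GuzzleHttp\Exception\GuzzleException;

$client = new Client(['timeout' => 20, 'connect_timeout' => 10]);

try {
    $response = $client->request('GET', 'https://example.com/');
    $html = (string) $response->getBody();
    echo $html;
} catch (BadResponseException $e) {
    // The server answered with a 4xx/5xx status.
    error_log('HTTP error: ' . $e->getResponse()->getStatusCode());
} catch (GuzzleException $e) {
    // Connection failures, timeouts, and other transfer problems.
    error_log('Transfer failed: ' . $e->getMessage());
}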

Parsing HTML

DOMDocument caveats

To load HTML into PHP’s built-in DOM, you’ll typically use DOMDocument::loadHTML(). However, the official documentation warns that loadHTML() parses input using an HTML4 parser. Since modern browsers parse as HTML5, you can see differences in the resulting DOM structure, and it’s not safe to rely on it for sanitizing HTML (see the PHP manual entry for DOMDocument::loadHTML).

Common gotcha: The “DOM you see in DevTools” can differ from what DOMDocument builds. If extraction fails, check (1) whether you actually fetched the expected HTML, (2) whether the page is dynamically rendered, and (3) whether HTML4 vs. HTML5 parsing rules are changing the DOM structure.
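
A related pitfall: without a charset hint in the markup, loadHTML() assumes ISO-8859-1 and mangles UTF-8 text. A widely used workaround, not an official API, is to prepend an XML encoding declaration:

<?php
libxml_use_internal_errors(true);

$html = '<p>こんにちは UTF-8</p>'; // fragment without a charset declaration

$dom = new DOMDocument();
// The declaration forces loadHTML() to interpret the input as UTF-8
// instead of its ISO-8859-1 default.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

echo $dom->getElementsByTagName('p')->item(0)->textContent;
libxml_clear_errors();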

Extract with XPath

DOMDocument + DOMXPath is the most lightweight option: no extra libraries required. To avoid noisy warnings on imperfect markup, it’s common to enable libxml internal errors and then clear/log them afterward (see the PHP manual on libxml_use_internal_errors).

<?php
libxml_use_internal_errors(true);

$html = '<html><body><h1 class="title">Hello</h1></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//h1[@class='title']");

if ($nodes !== false && $nodes->length > 0) {
    echo trim($nodes->item(0)->textContent);
}

libxml_clear_errors();
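
XPath can also address attributes directly, which is handy when you are harvesting links rather than text. A small variation on the example above:

<?php
libxml_use_internal_errors(true);

$html = '<html><body><a href="/a">A</a> <a href="/b">B</a></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// //a/@href selects the href attribute node of every link.
foreach ($xpath->query('//a/@href') as $href) {
    echo $href->nodeValue . PHP_EOL; // "/a" then "/b"
}

libxml_clear_errors();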

Extract with DomCrawler

Symfony’s DomCrawler gives you a clean, jQuery-like API with CSS selectors. You can select nodes using CSS selectors via filter() or use XPath via filterXPath() (see the Symfony DomCrawler documentation).

composer require symfony/dom-crawler symfony/css-selector

<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><h1 class="title">Hello</h1></body></html>';
$crawler = new Crawler($html);

$title = $crawler->filter('h1.title')->text();
echo trim($title);
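
When a selector matches many nodes, each() maps a callback over every match and attr() reads attributes. A minimal sketch:

<?php
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>';
$crawler = new Crawler($html);

// each() collects the callback's return value for every matched node.
$links = $crawler->filter('li a')->each(
    fn (Crawler $node) => ['text' => $node->text(), 'href' => $node->attr('href')]
);

print_r($links);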

How to choose: If you want minimal dependencies, use DOMDocument + XPath. If you expect a lot of selectors and prefer CSS syntax, DomCrawler is often easier to maintain.

A practical starter template

This is a minimal “fetch → extract” template you can run from a single file while learning (assumes a static page).

<?php
// 1) ć–ćŸ—
function fetchHtml(string $url): string {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT => 20,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)',
    ]);

    $html = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $error = curl_error($ch);
    curl_close($ch);

    if ($html === false) {
        throw new RuntimeException('cURL error: ' . $error);
    }
    if ($httpCode < 200 || $httpCode >= 300) {
        throw new RuntimeException('HTTP error: ' . $httpCode);
    }

    return $html;
}

// 2) Extract with XPath
function extractTitles(string $html): array {
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $nodes = $xpath->query("//h2");
    $titles = [];

    if ($nodes !== false) {
        foreach ($nodes as $node) {
            $titles[] = trim($node->textContent);
        }
    }

    libxml_clear_errors();
    return $titles;
}

$url = 'https://example.com/';
$html = fetchHtml($url);
$titles = extractTitles($html);

foreach ($titles as $t) {
    echo $t . PHP_EOL;
}
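
To round out step 5 of the workflow, the extracted titles can go straight into a CSV file. A minimal sketch; saveCsv is an illustrative helper, not part of the template above:

<?php
// 3) Store results as CSV (step 5 of the workflow).
function saveCsv(string $path, array $titles): void {
    $fp = fopen($path, 'w');
    if ($fp === false) {
        throw new RuntimeException('Cannot open: ' . $path);
    }
    fputcsv($fp, ['title']); // header row
    foreach ($titles as $title) {
        fputcsv($fp, [$title]);
    }
    fclose($fp);
}

saveCsv(__DIR__ . '/titles.csv', $titles);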

Common failure modes (and how to debug them)

Fetched HTML is empty

  • 403/429 (blocked or rate limited): check User-Agent, request frequency, IP reputation, and whether the site requires cookies
  • Blocked after redirect: confirm you’re following redirects with CURLOPT_FOLLOWLOCATION
  • Timeouts: tune connect timeout vs. total timeout

You can’t find the element

  • The page is dynamically rendered and the HTML response doesn’t contain the data
  • DOMDocument parsing differences change the structure (see the HTML4 parser warning in the PHP manual entry for DOMDocument::loadHTML)
  • Your selector is brittle (e.g., depends on frequently changing class names)
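
On the last point, matching @class with strict equality breaks the moment a second class is added. The contains/concat pattern below is verbose but tolerant of extra classes:

<?php
libxml_use_internal_errors(true);

$html = '<html><body><h1 class="title featured">Hello</h1></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// @class='title' would miss this node; the token match below does not
// care which other classes surround "title".
$nodes = $xpath->query(
    "//h1[contains(concat(' ', normalize-space(@class), ' '), ' title ')]"
);

echo $nodes->item(0)->textContent; // "Hello"
libxml_clear_errors();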

Too many libxml warnings

Messy HTML often triggers parsing warnings. Use libxml internal errors and log what you need for debugging (see the PHP manual on libxml_use_internal_errors).

Library comparison

To make selection easier, here’s a quick comparison separated by “fetch” vs. “extract.”

HTTP fetching
  • cURL: built in and easy to keep dependency-free, but code can become procedural and repetitive as requirements grow
  • Guzzle: cleaner HTTP abstractions (PSR-7 ecosystem, middleware, better structure), but requires Composer and adds dependencies

HTML extraction
  • DOMDocument + XPath: lightweight with minimal dependencies, but HTML4 vs. HTML5 parsing differences can affect results
  • Symfony DomCrawler: CSS selectors are readable and often more maintainable, but requires installing Symfony components

Key points from official specs

DOMDocument::loadHTML uses an HTML4 parser; HTML5 parsing rules differ, so the resulting DOM structure may differ from what browsers produce (per the official warning in the PHP manual).

If your scraper “can’t find anything,” don’t assume your code is wrong. First verify the HTML you fetched, then account for dynamic rendering, and finally consider parser behavior and specs.

Running scrapers responsibly

Keep request frequency low

  • Sleep between requests (for example, 1–3 seconds)
  • Cap retries and use backoff
  • If you see 429/503, stop and increase the interval
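
Put together, that can look like the sketch below: a randomized pause before every request, exponential backoff on 429/503, and a hard retry cap. politeGet and its delay values are illustrative, not a standard:

<?php
// Polite fetching: randomized spacing between requests plus
// exponential backoff when the server answers 429/503.
function politeGet(string $url, int $maxRetries = 3): string {
    for ($attempt = 0; ; $attempt++) {
        usleep(random_int(1_000_000, 3_000_000)); // 1-3 s between requests

        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_TIMEOUT => 20,
            CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; MyScraper/1.0)',
        ]);
        $html = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($html !== false && $code >= 200 && $code < 300) {
            return $html;
        }
        if ($attempt >= $maxRetries || !in_array($code, [429, 503], true)) {
            throw new RuntimeException("Gave up on {$url} (HTTP {$code})");
        }
        sleep(2 ** ($attempt + 1)); // back off: 2 s, 4 s, 8 s ...
    }
}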

Cache your results

Simply avoiding repeated downloads of the same page reduces load and lowers the chance of getting blocked. Even while learning, save HTML locally and iterate on parsing and extraction offline.
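
While learning, a file-based cache can be just a few lines. A minimal sketch; the keying and lack of expiry are deliberately naive, and fetchHtml is the function from the starter template above:

<?php
// Naive file cache: each URL is downloaded at most once, so you can
// iterate on parsing offline. Assumes fetchHtml() from the template.
function cachedFetch(string $url, string $cacheDir = __DIR__ . '/cache'): string {
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $path = $cacheDir . '/' . sha1($url) . '.html';

    if (is_file($path)) {
        return (string) file_get_contents($path); // cache hit: no request
    }

    $html = fetchHtml($url);
    file_put_contents($path, $html);
    return $html;
}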

Want a scraper that holds up in production?

If you have a PHP scraper working locally but keep hitting blocks, timeouts, or fragile selectors in real-world runs, we can help design and operate a safer, more stable collection pipeline.

Feel free to reach out for scraping consultations and quotes.

Summary

  • PHP scraping basics = fetching (cURL/Guzzle) + extraction (XPath/DomCrawler)
  • DOMDocument can differ from browser DOMs due to HTML4 vs HTML5 parsing—verify the fetched HTML and parser behavior when elements “disappear”
  • Check site rules and robots.txt, keep rates low, and cache responses for safer operation

About the Author

Ibuki Yamamoto

Web scraping engineer with over 10 years of practical experience, having worked on numerous large-scale data collection projects. Specializes in Python and JavaScript, sharing practical scraping techniques in technical blogs.
