Web Scraping with PHP: A Beginner-Friendly Guide
PHP web scraping usually comes down to two steps: fetch the HTML over HTTP, then parse it and extract the fields you need from the DOM. This guide shows a practical workflow using cURL (or Guzzle) for retrieval and DOMDocument/XPath or Symfony DomCrawler for extraction, plus the common failure points that trip beginners up.
- The core workflow for fetching HTML in PHP and extracting data reliably
- When to use DOMDocument + XPath vs. Symfony DomCrawler
- Error handling and responsible scraping practices (load, rate limits, site rules)
Web scraping in a nutshell
Web scraping is the end-to-end process of collecting information from web pages (typically HTML), transforming it, and storing it for later use. For beginners, the easiest mental model is a two-stage pipeline:
- Fetch the page via HTTP (GET)
- Parse the HTML and extract the elements you care about (DOM/CSS/XPath)
Start with static pages first. If a site renders content with JavaScript (a "dynamic" page), the HTML you download often won't contain the data you see in the browser, so extraction gets harder fast.
What to check before you scrape
Terms of service and robots.txt
Before you write code, check the target site's terms/guidelines and its robots.txt. Robots.txt is the standard mechanism for publishing crawler access policies, and the Robots Exclusion Protocol is formally documented as IETF RFC 9309 (Robots Exclusion Protocol).
Note: robots.txt rules aren't technically enforced, but they matter operationally. Scraping disallowed paths or sending high-frequency traffic can get your IP blocked, and depending on the situation, it can also create legal or contractual risk.
Decide how youâll retrieve the data
- Static HTML: Fetch with cURL/Guzzle, then parse and extract from the DOM
- Dynamic rendering: If there's an underlying API, prefer calling that API (within the site's rules; see the sketch below). Otherwise you may need browser automation/headless browsing.
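For the API route, the request is usually a plain JSON GET. A minimal sketch with cURL; the endpoint URL and the items/name keys are hypothetical and stand in for whatever the real API returns:

<?php
// Sketch: fetching a JSON API directly instead of scraping rendered HTML.
// The endpoint URL and the 'items'/'name' keys below are hypothetical.
$ch = curl_init('https://example.com/api/items?page=1');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT        => 20,
    CURLOPT_HTTPHEADER     => ['Accept: application/json'],
]);
$json = curl_exec($ch);
curl_close($ch);

if ($json === false) {
    throw new RuntimeException('Request failed');
}
$data = json_decode($json, true);          // decode into an associative array
foreach ($data['items'] ?? [] as $item) {  // assumed response shape
    echo ($item['name'] ?? ''), PHP_EOL;
}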
End-to-end workflow
To understand the implementation quickly, here's the minimal flow you'll use in most PHP scrapers:
- GET a URL and read the HTML string
- Fix encoding issues (only if needed)
- Parse the HTML into a DOM
- Locate nodes with XPath/CSS selectors and extract values
- Store results (CSV/DB/JSON); a minimal CSV sketch follows
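Step 5, storing results, can be as simple as writing a CSV file. A minimal sketch, assuming $rows holds whatever your extraction step produced:

<?php
// Sketch of step 5: writing extracted rows to CSV.
// $rows is a placeholder for the values your extraction step returned.
$rows = [
    ['Title A', 'https://example.com/a'],
    ['Title B', 'https://example.com/b'],
];
$fp = fopen(__DIR__ . '/results.csv', 'w');
foreach ($rows as $row) {
    fputcsv($fp, $row);
}
fclose($fp);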
Fetching HTML
Fetch with cURL
This approach uses PHP's built-in cURL extension. In practice, setting a User-Agent, following redirects, applying timeouts, and handling encoding reduces failures significantly. Commonly used options include CURLOPT_RETURNTRANSFER, CURLOPT_FOLLOWLOCATION, and CURLOPT_USERAGENT.
<?php
$url = 'https://example.com/';
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 20,
CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)',
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$error = curl_error($ch);
curl_close($ch);
if ($html === false) {
throw new RuntimeException('cURL error: ' . $error);
}
if ($httpCode < 200 || $httpCode >= 300) {
throw new RuntimeException('HTTP error: ' . $httpCode);
}
echo $html;

Why this matters: Some sites return 403 responses if you don't send a User-Agent. Always set a UA and reasonable timeouts as a baseline.
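Step 2 of the workflow, fixing encoding issues, matters when a site is not served as UTF-8. A minimal sketch, assuming the mbstring extension is available and that $html came from the fetch above; the candidate encoding list is an assumption you should adjust to your target pages:

<?php
// Sketch: normalize the fetched HTML to UTF-8 before parsing.
// $html is assumed to come from the fetch step above; requires ext-mbstring.
$encoding = mb_detect_encoding($html, ['UTF-8', 'SJIS-win', 'EUC-JP', 'ISO-8859-1'], true);
if ($encoding !== false && $encoding !== 'UTF-8') {
    $html = mb_convert_encoding($html, 'UTF-8', $encoding);
}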
Fetch with Guzzle
In production code, using Guzzle as your HTTP client often leads to cleaner structure: exceptions, headers, middleware, and retries are easier to manage. Guzzle is a widely used PHP HTTP client (see the Guzzle documentation) and can be installed via Composer.
composer require guzzlehttp/guzzle

<?php
require __DIR__ . '/vendor/autoload.php';
use GuzzleHttp\Client;
$client = new Client([
'timeout' => 20,
'connect_timeout' => 10,
'headers' => [
'User-Agent' => 'Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)',
],
]);
$response = $client->request('GET', 'https://example.com/');
$html = (string) $response->getBody();
echo $html;

Note: Guzzle throws an exception for connection failures and, by default, for 4xx/5xx responses, so wrap requests in try/catch where you expect failures.

Parsing HTML
DOMDocument caveats
To load HTML into PHP's built-in DOM, you'll typically use DOMDocument::loadHTML(). However, the official documentation warns that loadHTML() parses input using an HTML4 parser. Since modern browsers parse as HTML5, you can see differences in the resulting DOM structure, and it's not safe to rely on it for sanitizing HTML (PHP Manual: DOMDocument::loadHTML).
Common gotcha: The DOM you see in DevTools can differ from what DOMDocument builds. If extraction fails, check (1) whether you actually fetched the expected HTML, (2) whether the page is dynamically rendered, and (3) whether HTML4 vs. HTML5 parsing rules are changing the DOM structure.
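A quick way to check point (1) before blaming your selectors, assuming $html holds the page you fetched and 'Sample Product' stands in for text you expect to find:

<?php
// Sketch: sanity-check the raw HTML before parsing it.
// $html is assumed to come from the fetch step; 'Sample Product' is a placeholder.
if (strpos($html, 'Sample Product') === false) {
    // Likely dynamic rendering, a block/consent page, or the wrong URL.
    file_put_contents(__DIR__ . '/fetched.html', $html); // save it for inspection
}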
Extract with XPath
DOMDocument + DOMXPath is the most lightweight option: no extra libraries required. To avoid noisy warnings on imperfect markup, it's common to enable libxml internal errors and then clear (or log) them afterward (PHP Manual: libxml_use_internal_errors).
<?php
libxml_use_internal_errors(true);
// Load your Composer autoloader here if needed (not required for DOMDocument/DOMXPath)
$html = '<html><body><h1 class="title">Hello</h1></body></html>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//h1[@class='title']");
if ($nodes !== false && $nodes->length > 0) {
echo trim($nodes->item(0)->textContent);
}
libxml_clear_errors();

Extract with DomCrawler
Symfony's DomCrawler gives you a clean, jQuery-like API with CSS selectors. You can select nodes using CSS selectors via filter() or use XPath via filterXPath() (Symfony Docs: The DOM Crawler).
composer require symfony/dom-crawler symfony/css-selector

<?php
require __DIR__ . '/vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = '<html><body><h1 class="title">Hello</h1></body></html>';
$crawler = new Crawler($html);
$title = $crawler->filter('h1.title')->text();
echo trim($title);

How to choose: If you want minimal dependencies, use DOMDocument + XPath. If you expect a lot of selectors and prefer CSS syntax, DomCrawler is often easier to maintain.
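Real pages usually contain many repeated items rather than a single heading. Here is a DomCrawler sketch that loops over repeated elements with each() and reads an attribute with attr(); the li.item and a selectors are assumptions about the page's markup:

<?php
require __DIR__ . '/vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;

$html = '<ul>
  <li class="item"><a href="/a">Item A</a></li>
  <li class="item"><a href="/b">Item B</a></li>
</ul>';

$crawler = new Crawler($html);
// each() maps every matched node; text() and attr() read content and attributes
$items = $crawler->filter('li.item a')->each(function (Crawler $node) {
    return [
        'title' => trim($node->text()),
        'href'  => $node->attr('href'),
    ];
});
print_r($items);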
A practical starter template
This is a minimal "fetch, then extract" template you can run from a single file while learning (assumes a static page).
<?php
// 1) Fetch
function fetchHtml(string $url): string {
$ch = curl_init($url);
curl_setopt_array($ch, [
CURLOPT_RETURNTRANSFER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_CONNECTTIMEOUT => 10,
CURLOPT_TIMEOUT => 20,
CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; MyScraper/1.0; +https://example.com/bot)',
]);
$html = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$error = curl_error($ch);
curl_close($ch);
if ($html === false) {
throw new RuntimeException('cURL error: ' . $error);
}
if ($httpCode < 200 || $httpCode >= 300) {
throw new RuntimeException('HTTP error: ' . $httpCode);
}
return $html;
}
// 2) Extract (XPath)
function extractTitles(string $html): array {
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//h2");
$titles = [];
if ($nodes !== false) {
foreach ($nodes as $node) {
$titles[] = trim($node->textContent);
}
}
libxml_clear_errors();
return $titles;
}
$url = 'https://example.com/';
$html = fetchHtml($url);
$titles = extractTitles($html);
foreach ($titles as $t) {
echo $t . PHP_EOL;
}

Common failure modes (and how to debug them)
Fetched HTML is empty
- 403/429 (blocked or rate limited): check User-Agent, request frequency, IP reputation, and whether the site requires cookies
- Blocked after redirect: confirm you're following redirects with CURLOPT_FOLLOWLOCATION
- Timeouts: tune connect timeout vs. total timeout
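When a request comes back empty or keeps redirecting, it helps to log what cURL actually did. A small self-contained sketch using curl_getinfo() (call it before curl_close()):

<?php
// Sketch: log status, final URL after redirects, redirect count, and timing.
$ch = curl_init('https://example.com/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_TIMEOUT        => 20,
]);
$html = curl_exec($ch);
$info = curl_getinfo($ch);   // must be read before curl_close()
curl_close($ch);

error_log(sprintf(
    'status=%d final_url=%s redirects=%d total_time=%.2fs',
    $info['http_code'], $info['url'], $info['redirect_count'], $info['total_time']
));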
You canât find the element
- The page is dynamically rendered and the HTML response doesnât contain the data
- DOMDocument parsing differences change the structure (HTML4 parser warning; see PHP Manual: DOMDocument::loadHTML)
- Your selector is brittle (e.g., depends on frequently changing class names)
Too many libxml warnings
Messy HTML often triggers parsing warnings. Use libxml internal errors and log what you need for debugging (PHP Manual: libxml_use_internal_errors).
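If you want to see which warnings actually occurred, you can read them back with libxml_get_errors() before clearing them. A minimal sketch:

<?php
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<p><customtag>hello</customtag></p>'); // non-standard tag to provoke a warning

// Each LibXMLError carries a severity level, a line number, and a message
foreach (libxml_get_errors() as $error) {
    error_log(sprintf('libxml [%d] line %d: %s', $error->level, $error->line, trim($error->message)));
}
libxml_clear_errors();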
Library comparison
To make selection easier, here's a quick comparison separated by "fetch" vs. "extract."
| Purpose | Option | Strengths | Trade-offs |
|---|---|---|---|
| HTTP fetching | cURL | Built-in and easy to keep dependency-free | Can become procedural and repetitive as requirements grow |
| HTTP fetching | Guzzle | Cleaner HTTP abstractions (PSR-7 ecosystem, middleware, better structure) | Requires Composer and adds dependencies |
| HTML extraction | DOMDocument + XPath | Lightweight; minimal dependencies | HTML4 vs HTML5 parsing differences can affect results |
| HTML extraction | Symfony DomCrawler | CSS selectors are readable and often more maintainable | Requires installing Symfony components |
Key points from official specs
DOMDocument::loadHTML uses an HTML4 parser; HTML5 parsing rules differ, so the resulting DOM structure may differ from what browsers produce (official documentation warning: PHP Manual, DOMDocument::loadHTML).
If your scraper "can't find anything," don't assume your code is wrong. First verify the HTML you fetched, then account for dynamic rendering, and finally consider parser behavior and specs.
Running scrapers responsibly
Keep request frequency low
- Sleep between requests (for example, 1â3 seconds)
- Cap retries and use backoff
- If you see 429/503, stop and increase the interval
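One way to express these rules in code, as a sketch that reuses the fetchHtml() helper from the starter template (the delay values are only examples to tune per site):

<?php
// Sketch: polite crawling loop with a fixed delay and simple backoff.
// fetchHtml() is the helper from the starter template; $urls is your own list.
$urls  = ['https://example.com/page1', 'https://example.com/page2'];
$delay = 2; // seconds between requests

foreach ($urls as $url) {
    try {
        $html = fetchHtml($url);
        // ... parse and store ...
    } catch (RuntimeException $e) {
        // On errors (including 429/503), wait longer before the next request
        $delay = min($delay * 2, 60);
    }
    sleep($delay);
}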
Cache your results
Simply avoiding repeated downloads of the same page reduces load and lowers the chance of getting blocked. Even while learning, save HTML locally and iterate on parsing and extraction offline.
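While learning, a tiny file cache is usually enough. A sketch that wraps the fetchHtml() helper from the starter template and keys cache files on the URL (the cache directory is arbitrary):

<?php
// Sketch: cache fetched HTML on disk so repeated runs don't re-download pages.
function fetchHtmlCached(string $url, string $cacheDir = __DIR__ . '/cache'): string {
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $cacheFile = $cacheDir . '/' . md5($url) . '.html';
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);   // reuse the local copy
    }
    $html = fetchHtml($url);                    // helper from the starter template
    file_put_contents($cacheFile, $html);
    return $html;
}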
Want a scraper that holds up in production?
If you have a PHP scraper working locally but keep hitting blocks, timeouts, or fragile selectors in real-world runs, we can help design and operate a safer, more stable collection pipeline.
Summary
- PHP scraping basics = fetching (cURL/Guzzle) + extraction (XPath/DomCrawler)
- DOMDocument can differ from browser DOMs due to HTML4 vs HTML5 parsing; verify the fetched HTML and parser behavior when elements "disappear"
- Check site rules and robots.txt, keep rates low, and cache responses for safer operation