If your goal in web scraping (crawling) is to collect only the URLs you actually need—as fast as possible, without missing anything—pure link-following is often the long way around. A faster starting point is an XML sitemap. This guide explains how to design a sitemap-first crawl plan and what to watch out for when you implement it, in beginner-friendly terms.
- The core idea behind sitemap-first crawl design
- A practical, prioritized workflow to collect URLs quickly
- How to handle common pitfalls (lastmod, duplicates, blocking)
Sitemap basics
An XML sitemap is a mechanism for telling search engines (and other crawlers) which URLs exist on a site. For web scraping, you can treat it as the site's official URL index and use it to build a complete target list before you start fetching pages.
Key elements in an XML sitemap
The essential tag is <loc> (the URL). Many sitemaps also include optional tags like <lastmod> (last modified date). The protocol also defines <changefreq> and <priority>, but support and accuracy vary, so treat them as hints, not guarantees. For the full definition, see the official protocol.
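To make those tags concrete, here is a minimal urlset (with a made-up URL and date) and how <loc> and <lastmod> read from Python:

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical urlset showing the two tags most useful for scraping.
# Bytes are used so the XML declaration parses cleanly with ElementTree.
SITEMAP_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/product/1</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP_XML)
for u in root.findall("sm:url", NS):
    print(u.find("sm:loc", NS).text, u.find("sm:lastmod", NS).text)
```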
Key takeaway: For scraping, the highest-value win is using <loc> to lock in your URL universe. If <lastmod> is trustworthy, it also makes incremental (diff) crawling much easier.
Designing the fastest crawl
Start by locking down the URL set
If you're optimizing for speed, the strategy is simple: confirm the full candidate URL set first, then discard unwanted URLs early. XML sitemaps fit this approach extremely well. In production workflows, the following priority order reduces mistakes and surprises:
- Check robots.txt for Sitemap: lines
- Fall back to well-known paths (/sitemap.xml and /sitemap_index.xml)
- Narrow the result to the sections you need by path prefix (for example, /product/)
How to prioritize what to crawl
Note: Even if the sitemap contains <changefreq> or <priority>, they aren't always maintained accurately. In scraping, prioritize your own definition of important URLs and treat sitemap hints as optional signals.
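One simple way to encode "your own definition of important URLs" is a small scoring function. This is a sketch; the path prefixes below are hypothetical examples, not anything the sitemap itself defines:

```python
def crawl_priority(path: str) -> int:
    """Rank URL paths by our own rules, not sitemap <priority> hints.

    Lower number = crawl sooner. The prefixes are illustrative only.
    """
    if path.startswith("/product/"):
        return 0  # the data we actually want
    if path.startswith("/category/"):
        return 1  # useful for coverage checks
    return 2      # everything else last

paths = ["/about/", "/product/42", "/category/shoes"]
print(sorted(paths, key=crawl_priority))
```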
Thinking in incremental crawls
If a site maintains <lastmod> correctly, you can re-fetch only URLs that changed since your previous crawl. That reduces runtime, bandwidth, and the odds of triggering blocking. One caveat: the protocol expects <lastmod> to reflect the page's actual modification time, not the sitemap generation time. Some sites get this wrong (for example, every URL shows "today"). Validate it by comparing against real signals such as timestamps in the HTML, API responses, or other authoritative page metadata.
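As a sketch, the diffing step can be a small filter over the collected URL items. changed_since is a hypothetical helper; it assumes lastmod values are ISO 8601 timestamps with timezone info consistent with the cutoff you pass in:

```python
from datetime import datetime

def changed_since(items, last_crawl_iso: str):
    """Keep only items whose lastmod is newer than the previous crawl.

    Items without a lastmod are kept, since we cannot prove they are
    unchanged. Timestamps must be comparable (all aware or all naive).
    """
    cutoff = datetime.fromisoformat(last_crawl_iso)
    out = []
    for it in items:
        lm = it.get("lastmod")
        if lm is None:
            out.append(it)  # no signal: re-fetch to be safe
            continue
        # Normalize a trailing "Z" so fromisoformat accepts it on older Pythons
        if datetime.fromisoformat(lm.replace("Z", "+00:00")) > cutoff:
            out.append(it)
    return out
```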
The shortest implementation path
This is where things get practical. Below is a simple Python example for sitemap-based URL discovery using requests and the standard XML parser.
Discover sitemaps
Fetch robots.txt and look for Sitemap: lines. This is the most reliable approach, and it also handles sites that publish multiple sitemap URLs.
```python
import re
import requests

def discover_sitemaps(base_url: str):
    robots_url = base_url.rstrip("/") + "/robots.txt"
    r = requests.get(robots_url, timeout=20)
    r.raise_for_status()
    sitemaps = []
    for line in r.text.splitlines():
        m = re.match(r"(?i)^sitemap:\s*(\S+)", line.strip())
        if m:
            sitemaps.append(m.group(1))
    return sitemaps

print(discover_sitemaps("https://example.com"))
```
Parse sitemaps
A sitemap can be either a direct URL list (urlset) or an index of sitemaps (sitemapindex). Handle both, and always return a final list of URL items.
```python
import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_xml(url: str) -> ET.Element:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    return ET.fromstring(r.content)

def parse_sitemap(url: str):
    root = fetch_xml(url)
    # sitemapindex: recurse into each child sitemap
    if root.tag.endswith("sitemapindex"):
        locs = [e.find("sm:loc", NS).text for e in root.findall("sm:sitemap", NS)]
        urls = []
        for loc in locs:
            urls.extend(parse_sitemap(loc))
        return urls
    # urlset: collect loc and the optional lastmod
    if root.tag.endswith("urlset"):
        out = []
        for u in root.findall("sm:url", NS):
            loc = u.find("sm:loc", NS).text
            lastmod_el = u.find("sm:lastmod", NS)
            lastmod = lastmod_el.text if lastmod_el is not None else None
            out.append({"loc": loc, "lastmod": lastmod})
        return out
    raise ValueError(f"Unknown sitemap root: {root.tag}")

items = parse_sitemap("https://example.com/sitemap.xml")
print(len(items), items[:3])
```
Practical tip: Separate URL discovery (sitemap) from page fetching (crawl). Once discovery runs as its own completed step, deduplication, prioritization, retries, and monitoring all become much easier.
Filtering and normalization
Sitemaps often include URLs you don't want (tag pages, internal search results, help pages, language variants, etc.). Filter aggressively, for example by path prefix.
```python
from urllib.parse import urlparse, urlunparse

def normalize_url(u: str) -> str:
    p = urlparse(u)
    # Drop query strings and fragments (keep them if your requirements need them)
    p = p._replace(query="", fragment="")
    # Standardize trailing slashes based on the site's behavior
    return urlunparse(p)

def filter_urls(items, allow_prefixes):
    out = []
    for it in items:
        loc = normalize_url(it["loc"])
        path = urlparse(loc).path
        if any(path.startswith(prefix) for prefix in allow_prefixes):
            out.append({**it, "loc": loc})
    return out

filtered = filter_urls(items, allow_prefixes=["/product/", "/item/"])
print(len(filtered))
```
Common pitfalls
There are multiple sitemaps
Large sites commonly split sitemaps and tie them together with a sitemap index. Make sure your implementation always supports sitemapindex.
lastmod can't be trusted
<lastmod> is great for incremental crawling, but sloppy operations can result in every URL being marked as modified "today". If that happens, switch to a more reliable strategy:
- Use HTTP ETag/If-Modified-Since conditional requests (if the server supports them)
- Crawl critical pages on a regular schedule, and de-prioritize the rest
- Extract and compare a "last updated" field inside the page content (for example, a product update date)
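The first fallback can be sketched with requests. fetch_if_changed and conditional_headers are hypothetical helpers; the validators come from a previous response's ETag and Last-Modified headers:

```python
import requests

def conditional_headers(etag=None, last_modified=None) -> dict:
    """Build validator headers for a conditional GET from a prior response."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def fetch_if_changed(url, etag=None, last_modified=None):
    """Return None if the server reports 304 Not Modified."""
    r = requests.get(url, headers=conditional_headers(etag, last_modified), timeout=30)
    if r.status_code == 304:
        return None  # unchanged since last crawl; skip re-parsing
    r.raise_for_status()
    return r
```

Store each URL's ETag and Last-Modified alongside it in your URL list so the next crawl can replay them.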
Canonical URLs and duplicates
Best practice is to include only canonical URLs in sitemaps, but real-world sitemaps may include parameterized URLs, alternate language paths, or inconsistent hostnames. URL normalization (query handling, trailing slash rules, case sensitivity, www/non-www) and deduplication are non-negotiable.
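Here is a minimal sketch of one possible normalization policy (lowercase scheme and host, strip www., drop query and fragment, collapse trailing slashes) plus deduplication. Adjust every rule to the site's actual behavior; none of these choices is universal:

```python
from urllib.parse import urlparse, urlunparse

def canonical_key(u: str) -> str:
    """One hypothetical normalization policy for dedup keys."""
    p = urlparse(u)
    host = p.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = p.path.rstrip("/") or "/"
    return urlunparse((p.scheme.lower(), host, path, "", "", ""))

def dedupe(urls):
    """Keep the first URL seen for each canonical key."""
    seen, out = set(), []
    for u in urls:
        k = canonical_key(u)
        if k not in seen:
            seen.add(k)
            out.append(u)
    return out
```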
Conflicts with robots.txt
Finding a URL in a sitemap doesn't mean you should fetch it. As a rule, do not crawl areas disallowed by robots.txt. Parse its rules and filter blocked paths before you enqueue anything.
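The standard library's urllib.robotparser can handle the rule parsing. allowed_urls is a hypothetical helper that filters a URL list against an already-fetched robots.txt body ("my-crawler" is a placeholder user agent):

```python
from urllib.robotparser import RobotFileParser

def allowed_urls(urls, robots_txt: str, user_agent: str = "my-crawler"):
    """Drop URLs disallowed by robots.txt before enqueuing them."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return [u for u in urls if rp.can_fetch(user_agent, u)]
```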
Reminder: Web scraping requires careful attention to the sites Terms of Service, applicable laws, and technical constraints (rate limits, anti-bot measures, etc.). A public sitemap does not mean you can scrape anything.
Comparing URL discovery approaches
Finally, here's a quick comparison of three common starting points: sitemaps, link discovery, and list/search pages. In a speed-focused design, teams often use sitemaps as the backbone, then supplement with other methods to catch anything missing.
| Approach | Strengths | Weaknesses | Best for |
|---|---|---|---|
| Sitemaps | Fast URL collection, high coverage | Can be incomplete or poorly maintained | Sites with large numbers of product/article URLs |
| Link discovery | Works even when there's no sitemap | Takes time to reach deep URL layers | Small sites with well-structured navigation |
| List pages as the seed | Reflects category structure well | Pagination/JS can easily hide URLs | When you need category-by-category collection |
Want a faster crawl plan?
If your crawler works but feels slow or unreliable, a sitemap-first design usually gets you to full URL coverage faster, with fewer retries, less bandwidth, and fewer blocks. We can help you design and operationalize a crawl strategy that fits your requirements.
Summary
- The first step toward a fast crawl is to define your URL set via the sitemap before doing link discovery
- If <lastmod> is reliable, it enables efficient incremental crawling, but validate it because misconfiguration is common
- Stable operations come from URL normalization, deduplication, and respecting robots.txt
Official specs and search engine guidelines help you confirm baseline behavior. Refer to them whenever you're unsure about protocol expectations.