
Sitemap-First Crawling: Find the Right URLs Fast

Use XML sitemaps to discover URLs fast, prioritize crawl targets, and run safer incremental crawls with lastmod, normalization, and robots.txt rules.

Ibuki Yamamoto
February 20, 2026 · 3 min read

If your goal in web scraping (crawling) is to collect only the URLs you actually need—as fast as possible, without missing anything—pure link-following is often the long way around. A faster starting point is an XML sitemap. This guide explains how to design a sitemap-first crawl plan and what to watch out for when you implement it, in beginner-friendly terms.

What You’ll Learn
  • The core idea behind sitemap-first crawl design
  • A practical, prioritized workflow to collect URLs quickly
  • How to handle common pitfalls (lastmod, duplicates, blocking)

Sitemap basics

An XML sitemap is a mechanism for telling search engines (and other crawlers) which URLs exist on a site. For web scraping, you can treat it as the site's official URL index and use it to build a complete target list before you start fetching pages.

Key elements in an XML sitemap

The essential tag is <loc> (the URL). Many sitemaps also include optional tags like <lastmod> (last modified date). The protocol also defines <changefreq> and <priority>, but support and accuracy vary, so treat them as hints, not guarantees. For the full definition, see the official protocol.
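For reference, a minimal urlset sitemap containing these tags looks like this (the URL and date are placeholder values):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/product/123</loc>
    <lastmod>2026-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

Only <loc> is required; the other three tags are optional per the protocol.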

Key takeaway: For scraping, the highest-value win is using <loc> to lock in your URL universe. If <lastmod> is trustworthy, it also makes incremental (diff) crawling much easier.

Designing the fastest crawl

If you're optimizing for speed, the strategy is simple: confirm the full candidate URL set first, then discard unwanted URLs early. XML sitemaps fit this approach extremely well.

Start by locking down the URL set

  1. Find the sitemap URL (common defaults are /sitemap.xml and /sitemap_index.xml)
  2. Fetch the sitemap (or sitemap index) and enumerate all URLs
  3. Normalize URLs (trailing slashes, query strings, http/https handling)
  4. Keep only the paths you actually need (for example, only /product/)

How to prioritize what to crawl

In production workflows, the following priority order reduces mistakes and surprises:

  • Business importance (your requirements): product pages, store lists, job listings: URLs that contain the data you're collecting
  • Update frequency: crawl frequently-updated areas earlier (but don't blindly trust sitemap tags; see below)
  • Cost: leave heavy pages (JS-heavy, huge HTML) or aggressively rate-limited areas for later

Note: Even if the sitemap contains <changefreq> or <priority>, they aren't always maintained accurately. In scraping, prioritize your own definition of important URLs and treat sitemap hints as optional signals.

Thinking in incremental crawls

If a site maintains <lastmod> correctly, you can re-fetch only URLs that changed since your previous crawl. That reduces runtime, bandwidth, and the odds of triggering blocking.

One caveat: the protocol expects <lastmod> to reflect the page's actual modification time, not the sitemap generation time. Some sites get this wrong (for example, every URL shows today). Validate it by comparing against real signals such as timestamps in the HTML, API responses, or other authoritative page metadata.
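As a sketch of that incremental selection (the function names here are illustrative, and the previous run's state is assumed to be a simple loc-to-lastmod mapping):

```python
from datetime import datetime, timezone


def parse_lastmod(value):
    """Parse a W3C-datetime lastmod value; return None if absent or invalid."""
    if not value:
        return None
    try:
        # fromisoformat handles both date-only and full datetime forms
        dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def select_changed(items, previous):
    """Keep URLs that are new or whose lastmod moved past the stored value.

    `items` is a list of {"loc", "lastmod"} dicts; `previous` maps
    loc -> lastmod string from the prior run. URLs without a usable
    lastmod are always re-fetched, which is the safe default.
    """
    changed = []
    for it in items:
        seen = parse_lastmod(previous.get(it["loc"]))
        now = parse_lastmod(it["lastmod"])
        if seen is None or now is None or now > seen:
            changed.append(it)
    return changed
```

The safe-default choice matters: when <lastmod> is missing or unparsable, treating the URL as changed costs bandwidth but never silently drops updates.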

The shortest implementation path

This is where things get practical. Below is a simple Python example for sitemap-based URL discovery using requests and the standard XML parser.

Discover sitemaps

Start with robots.txt and look for Sitemap: lines. This is the most reliable approach and it also handles sites that publish multiple sitemap URLs.

import re
import requests


def discover_sitemaps(base_url: str):
    """Collect all Sitemap: URLs declared in the site's robots.txt."""
    robots_url = base_url.rstrip("/") + "/robots.txt"
    r = requests.get(robots_url, timeout=20)
    r.raise_for_status()

    sitemaps = []
    for line in r.text.splitlines():
        # The "Sitemap:" directive is case-insensitive and may appear multiple times
        m = re.match(r"(?i)^sitemap:\s*(\S+)", line.strip())
        if m:
            sitemaps.append(m.group(1))
    return sitemaps


print(discover_sitemaps("https://example.com"))

Parse sitemaps

A sitemap can be either a direct URL list (urlset) or an index of sitemaps (sitemapindex). Handle both, and always return a final list of URL items.

import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_xml(url: str) -> ET.Element:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    return ET.fromstring(r.content)


def parse_sitemap(url: str):
    root = fetch_xml(url)

    # sitemapindex
    if root.tag.endswith("sitemapindex"):
        locs = [e.find("sm:loc", NS).text for e in root.findall("sm:sitemap", NS)]
        urls = []
        for loc in locs:
            urls.extend(parse_sitemap(loc))
        return urls

    # urlset
    if root.tag.endswith("urlset"):
        out = []
        for u in root.findall("sm:url", NS):
            loc = u.find("sm:loc", NS).text
            lastmod_el = u.find("sm:lastmod", NS)
            lastmod = lastmod_el.text if lastmod_el is not None else None
            out.append({"loc": loc, "lastmod": lastmod})
        return out

    raise ValueError(f"Unknown sitemap root: {root.tag}")


items = parse_sitemap("https://example.com/sitemap.xml")
print(len(items), items[:3])

Practical tip: Separate URL discovery (sitemap) from page fetching (crawl). Once URL discovery is complete, deduplication, prioritization, retries, and monitoring all become much easier.

Filtering and normalization

Sitemaps often include URLs you don't want (tag pages, internal search results, help pages, language variants, etc.). Filter aggressively, for example by path prefix.

from urllib.parse import urlparse, urlunparse


def normalize_url(u: str) -> str:
    p = urlparse(u)
    # Drop query strings and fragments (keep them if your requirements need them)
    p = p._replace(query="", fragment="")
    # Standardize trailing slashes based on the sites behavior
    return urlunparse(p)


def filter_urls(items, allow_prefixes):
    out = []
    for it in items:
        loc = normalize_url(it["loc"])
        path = urlparse(loc).path
        if any(path.startswith(prefix) for prefix in allow_prefixes):
            out.append({**it, "loc": loc})
    return out


filtered = filter_urls(items, allow_prefixes=["/product/", "/item/"])
print(len(filtered))

Common pitfalls

There are multiple sitemaps

Large sites commonly split sitemaps and tie them together with a sitemap index. Make sure your implementation always supports sitemapindex.

lastmod can't be trusted

<lastmod> is great for incremental crawling, but sloppy operations can result in every URL being marked as modified today. If that happens, switch to a more reliable strategy:

  • Use HTTP ETag / If-Modified-Since conditional requests (if the server supports them)
  • Crawl critical pages on a regular schedule, and de-prioritize the rest
  • Extract and compare a "last updated" field inside the page content (for example, a product update date)
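The conditional-request option can be sketched with requests as follows (the cache layout and function names are assumptions; a real crawler would persist the cache and add error handling):

```python
import requests


def conditional_headers(entry: dict) -> dict:
    """Build validator headers from a cached entry (which may be empty)."""
    headers = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers


def fetch_if_changed(url: str, cache: dict) -> str:
    """Conditional GET: return the cached body on 304 Not Modified,
    otherwise fetch, refresh the cache from response headers, and
    return the fresh body."""
    entry = cache.get(url, {})
    r = requests.get(url, headers=conditional_headers(entry), timeout=30)
    if r.status_code == 304:
        return entry["body"]  # unchanged since the last fetch
    r.raise_for_status()
    cache[url] = {
        "etag": r.headers.get("ETag"),
        "last_modified": r.headers.get("Last-Modified"),
        "body": r.text,
    }
    return r.text
```

This only helps when the server actually emits ETag or Last-Modified headers and honors the validators, so check a few responses before relying on it.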

Canonical URLs and duplicates

Best practice is to include only canonical URLs in sitemaps, but real-world sitemaps may include parameterized URLs, alternate language paths, or inconsistent hostnames. URL normalization (query handling, trailing slash rules, case sensitivity, www/non-www) and deduplication are non-negotiable.
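One way to sketch that normalization-plus-dedup step (whether each rule is safe, e.g. stripping "www." or dropping queries, depends on whether the site really treats the variants as identical content):

```python
from urllib.parse import urlparse, urlunparse


def canonicalize(u: str) -> str:
    """Lowercase the host, strip a leading "www.", drop the default port,
    and remove query/fragment. Verify these assumptions per site before
    adopting them."""
    p = urlparse(u)
    host = p.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Remove an explicit default port for the scheme (":80" / ":443")
    default_port = {"http": ":80", "https": ":443"}.get(p.scheme)
    if default_port and host.endswith(default_port):
        host = host[: -len(default_port)]
    return urlunparse(p._replace(netloc=host, query="", fragment=""))


def dedupe(urls):
    """Order-preserving dedup on canonical form; keeps the first copy."""
    seen, out = set(), []
    for u in urls:
        c = canonicalize(u)
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out
```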

Conflicts with robots.txt

Finding a URL in a sitemap doesn't mean you should fetch it. As a rule, do not crawl areas disallowed by robots.txt. Parse its rules and filter blocked paths before you enqueue anything.
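A minimal way to do that with the standard library's robotparser (the user-agent string is a placeholder; fetching robots.txt itself is left to your HTTP layer):

```python
from urllib.robotparser import RobotFileParser


def build_robots_filter(robots_txt: str, user_agent: str = "MyCrawler"):
    """Parse raw robots.txt text and return a predicate that reports
    whether this user agent may fetch a given URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return lambda url: rp.can_fetch(user_agent, url)
```

Apply the predicate before enqueueing, e.g. `allowed = [u for u in urls if can_fetch(u)]`, so disallowed paths never enter the crawl queue at all.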

Reminder: Web scraping requires careful attention to the site's Terms of Service, applicable laws, and technical constraints (rate limits, anti-bot measures, etc.). A public sitemap does not mean you can scrape anything.

Comparing URL discovery approaches

Finally, here's a quick comparison of three common starting points: sitemaps, link discovery, and list/search pages. In a speed-focused design, teams often use sitemaps as the backbone, then supplement with other methods to catch anything missing.

Approach | Strengths | Weaknesses | Best for
Sitemaps | Fast URL collection, high coverage | Can be incomplete or poorly maintained | Sites with large numbers of product/article URLs
Link discovery | Works even when there's no sitemap | Takes time to reach deep URL layers | Small sites with well-structured navigation
List pages as the seed | Reflects category structure well | Pagination/JS can easily hide URLs | When you need category-by-category collection

Want a faster crawl plan?

If your crawler works but feels slow or unreliable, a sitemap-first design usually gets you to full URL coverage faster, with fewer retries, less bandwidth, and fewer blocks. We can help you design and operationalize a crawl strategy that fits your requirements.


Summary

  • The first step toward a fast crawl is to define your URL set via the sitemap before doing link discovery
  • If <lastmod> is reliable, it enables efficient incremental crawling, but validate it because misconfiguration is common
  • Stable operations come from URL normalization, deduplication, and respecting robots.txt

Official specs and search engine guidelines help you confirm baseline behavior. Refer to them whenever you're unsure about protocol expectations.


About the Author

Ibuki Yamamoto

Web scraping engineer with over 10 years of practical experience, having worked on numerous large-scale data collection projects. Specializes in Python and JavaScript, sharing practical scraping techniques in technical blogs.
