Crawling & Indexing

Define what to crawl. We handle fetching, extraction, deduplication, and indexing. Configure policies to control scope, depth, and freshness.

Source configuration

Define sources by URL, domain, or sitemap

Seed URLs

Start from specific pages and follow links within scope. The crawler discovers outbound links and reports them for depth-based crawling.

Allowed domains

Scope crawling to specific domains and subdomains. Discovered links outside allowed domains are filtered out automatically.
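A minimal sketch of how a domain scope check could work, assuming that an allowed domain also covers its subdomains. The function names `in_scope` and `filter_links` are illustrative and not part of the product's API.

```python
from typing import Iterable, List, Set
from urllib.parse import urlsplit


def in_scope(url: str, allowed_domains: Set[str]) -> bool:
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False  # mailto:, javascript:, relative links, etc.
    # Match on a full label boundary so "notexample.com" does not match "example.com".
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def filter_links(links: Iterable[str], allowed_domains: Set[str]) -> List[str]:
    """Drop discovered links whose host falls outside the allowed domains."""
    return [u for u in links if in_scope(u, allowed_domains)]
```

Matching on the label boundary (`"." + d`) is what keeps look-alike hosts out of scope.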

Automatic sitemap discovery

Sitemaps are discovered automatically from robots.txt and /sitemap.xml for seed domains and all allowed domains. Sitemap indexes are resolved recursively.
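The discovery steps above can be sketched roughly as follows. This is an illustration under assumptions, not the product's implementation: `fetch` is a hypothetical callable that returns the body for a URL, and the depth cap is an arbitrary safeguard against runaway recursion.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, List, Optional, Set

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> List[str]:
    """Collect the URLs listed in 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def resolve_sitemap(
    url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
    max_depth: int = 5,
) -> Iterator[str]:
    """Yield page URLs, descending into sitemap indexes recursively."""
    seen = set() if seen is None else seen
    if url in seen or max_depth < 0:
        return  # already visited, or nested too deeply
    seen.add(url)
    root = ET.fromstring(fetch(url))
    locs = [(loc.text or "").strip() for loc in root.iter(NS + "loc")]
    if root.tag == NS + "sitemapindex":
        for child in locs:
            if child:
                yield from resolve_sitemap(child, fetch, seen, max_depth - 1)
    else:
        yield from (loc for loc in locs if loc)
```

The `seen` set guards against sitemap indexes that reference each other.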

Crawl policies

Fine-grained control over scope and behavior

Depth limits

Control how many links deep the crawler follows from seed pages.

Page budgets

Set maximum pages per source or per crawl run to control scope and cost.
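To illustrate how a depth limit and a page budget interact, here is a small breadth-first sketch. It assumes seeds sit at depth 0 and that the budget counts visited pages; `plan_crawl` and `get_links` are hypothetical names used only for this example.

```python
from collections import deque
from typing import Callable, Iterable, List


def plan_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    max_depth: int,
    max_pages: int,
) -> List[str]:
    """Visit pages breadth-first until the depth limit or page budget is hit."""
    start = list(dict.fromkeys(seeds))  # de-duplicate seeds, keep order
    seen = set(start)
    queue = deque((url, 0) for url in start)
    visited: List[str] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # at the depth limit: fetch the page but do not follow its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Breadth-first order means a tight page budget is spent on pages closest to the seeds first.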

Per-domain rate limiting

Each domain gets its own token-bucket rate limiter. The limiter respects Crawl-delay from robots.txt, and adaptive AIMD (additive-increase, multiplicative-decrease) control backs off on HTTP 429 responses.
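A minimal sketch of a per-domain token bucket with AIMD adjustment. The constants (halving on 429, adding 0.1 requests per second on success, the rate bounds) are assumptions chosen for illustration, and `DomainLimiter` is a hypothetical class name.

```python
import time
from typing import Callable


class DomainLimiter:
    """Token bucket for one domain; the refill rate adapts via AIMD."""

    def __init__(self, rate: float, burst: float = 1.0, min_rate: float = 0.1,
                 max_rate: float = 10.0, clock: Callable[[], float] = time.monotonic):
        self.rate = rate              # tokens (requests) added per second
        self.burst = burst            # bucket capacity
        self.min_rate = min_rate
        self.max_rate = max_rate
        self.clock = clock
        self.tokens = burst
        self.updated = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Take one token if available; callers wait and retry on False."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def on_response(self, status: int) -> None:
        """AIMD: halve the rate on 429, otherwise increase it by a small step."""
        self._refill()  # credit tokens earned at the old rate first
        if status == 429:
            self.rate = max(self.min_rate, self.rate / 2)
        else:
            self.rate = min(self.max_rate, self.rate + 0.1)

    def apply_crawl_delay(self, delay_seconds: float) -> None:
        """Cap the rate so requests are at least Crawl-delay seconds apart."""
        if delay_seconds and delay_seconds > 0:
            self.max_rate = min(self.max_rate, 1.0 / delay_seconds)
            self.min_rate = min(self.min_rate, self.max_rate)
            self.rate = min(self.rate, self.max_rate)
```

A crawler would keep one limiter per domain, for example in a dictionary keyed by hostname.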

Include/exclude URL patterns

Use regex patterns to include or exclude URLs, so the crawl stays focused on the content that matters.
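One plausible way to combine include and exclude patterns is sketched below. The precedence rules are assumptions: an exclude match always wins, and an empty include list allows everything. `compile_url_filter` is a hypothetical helper name.

```python
import re
from typing import Callable, Iterable


def compile_url_filter(
    include: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> Callable[[str], bool]:
    """Build a predicate that decides whether a URL should be crawled."""
    inc = [re.compile(p) for p in include]
    exc = [re.compile(p) for p in exclude]

    def allowed(url: str) -> bool:
        if any(p.search(url) for p in exc):
            return False  # assumed precedence: exclude beats include
        # With no include patterns, every non-excluded URL is allowed.
        return not inc or any(p.search(url) for p in inc)

    return allowed


# Example: crawl the docs tree but skip printable page variants.
docs_only = compile_url_filter(include=[r"^https://example\.com/docs/"],
                               exclude=[r"[?&]print=1"])
```

Compiling the patterns once up front avoids recompiling them for every discovered link.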

Recrawl scheduling

Set recrawl frequency per corpus. Incremental recrawls use conditional GET (If-None-Match/If-Modified-Since) to skip unchanged pages efficiently.
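The conditional GET flow can be sketched as follows, assuming the crawler stores the `ETag` and `Last-Modified` values from the previous fetch. The `Validators` type and helper names are illustrative only; the header names come from the HTTP specification.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class Validators:
    """Cache validators remembered from the last successful fetch of a URL."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(v: Validators) -> Dict[str, str]:
    """Build request headers that let the server answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if v.etag:
        headers["If-None-Match"] = v.etag
    if v.last_modified:
        headers["If-Modified-Since"] = v.last_modified
    return headers


def is_unchanged(status: int) -> bool:
    """A 304 carries no body: the stored copy is current and can be skipped."""
    return status == 304


def updated_validators(response_headers: Mapping[str, str]) -> Validators:
    """Extract fresh validators from a 200 response (header names are case-insensitive)."""
    lower = {k.lower(): v for k, v in response_headers.items()}
    return Validators(etag=lower.get("etag"), last_modified=lower.get("last-modified"))
```

On a first crawl there are no validators, so the header set is empty and the request is an ordinary GET.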

robots.txt

Respected by default, including Disallow rules, Crawl-delay, and Sitemap directives. An override is available when you have authorization to crawl restricted paths.
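For illustration, Python's standard library `urllib.robotparser` can evaluate all three kinds of directive. This is a sketch of the behaviour described above, not the platform's code; the `override` flag and the `ExampleBot` user agent are assumptions standing in for whatever authorization mechanism and crawler identity are actually used.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
Sitemap: https://example.com/sitemap.xml
"""


def load_robots(text: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str,
              override: bool = False) -> bool:
    """Honor Disallow rules unless an explicit, authorized override is set."""
    return override or parser.can_fetch(user_agent, url)


robots = load_robots(ROBOTS_TXT)
delay = robots.crawl_delay("ExampleBot")   # seconds between requests, or None
sitemaps = robots.site_maps()              # list of sitemap URLs, or None
```

`site_maps()` requires Python 3.8 or later.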

Content extraction

What gets extracted and indexed from each page

Title extraction with fallback hierarchy

Main content with boilerplate removal (Readability-style)

Heading structure (H1-H3)

Meta description and Open Graph metadata

Published date extraction (meta tags, JSON-LD, <time> elements)

Language detection from extracted text

URL canonicalization and tracking parameter stripping

SHA-256 content hashing for change detection

Link discovery and filtering by scope rules

PDF and DOCX text extraction
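Two of the steps listed above, URL canonicalization with tracking-parameter stripping and SHA-256 content hashing, can be sketched as follows. The tracking-parameter list, the sorted query order, and the whitespace normalization before hashing are assumptions made for this example; user info and fragments are dropped, and IPv6 hosts are not handled.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed examples of tracking parameters; a real deployment would configure these.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Normalize a URL so equivalent variants map to one key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"  # keep only non-default ports
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_PARAMS
        and not k.lower().startswith(TRACKING_PREFIXES)
    )
    # The empty last component drops the fragment.
    return urlunsplit((scheme, host, parts.path or "/", urlencode(query), ""))


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the extracted text with whitespace collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_hash: str, new_text: str) -> bool:
    """Compare the stored hash with a fresh one to detect a content change."""
    return previous_hash != content_hash(new_text)
```

Hashing the extracted text rather than the raw HTML means changes confined to markup or boilerplate do not register as content changes.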

Connectors

Built-in web connector and custom extensibility

Built-in web connector

SQS-driven, stateless Python workers that handle HTTP fetching with conditional GET, robots.txt compliance, rate limiting, redirect handling, HTML/PDF/DOCX extraction, snapshot storage to S3, and automatic sitemap discovery. Workers scale horizontally, and if one crashes, another picks up its task automatically.

Custom connectors via Apps SDK (coming soon)

Build connectors for sources that need API access, authentication, or specialized extraction. Custom connectors extend the Connector base class, define a manifest with config and attribute schemas, and run as apps on the platform.

Reliability

Production-grade crawling infrastructure

Automatic retry with exponential backoff on transient failures

Redirect following with loop detection

Encoding detection (charset headers, meta tags, BOM, fallback)

Large page handling with streaming and configurable size limits

Raw HTML snapshots stored to S3 for reprocessing without refetching

Structured logging and OpenTelemetry tracing per task
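As a rough illustration of the first two items above, the sketch below shows retry with exponential backoff (with jitter) and redirect following with loop detection. All names, the `TransientError` type, and the default limits are assumptions for this example; `next_location` is a hypothetical callable that returns the redirect target for a URL, or `None` when the response is final.

```python
import random
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stands in for timeouts, connection resets, and 5xx responses."""


def retry(fn: Callable[[], T], attempts: int = 4, base: float = 0.5,
          cap: float = 30.0, sleep: Callable[[float], None] = time.sleep,
          rng: Callable[[], float] = random.random) -> T:
    """Call fn, retrying transient failures with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            delay = min(cap, base * 2 ** attempt)   # 0.5s, 1s, 2s, ...
            sleep(delay * rng())                    # jitter spreads out retries
    raise ValueError("attempts must be at least 1")


def follow_redirects(url: str, next_location: Callable[[str], Optional[str]],
                     max_hops: int = 10) -> str:
    """Follow redirects to the final URL, failing on loops or long chains."""
    chain: List[str] = [url]
    while True:
        target = next_location(chain[-1])
        if target is None:
            return chain[-1]
        if target in chain:
            raise ValueError(f"redirect loop detected at {target}")
        if len(chain) > max_hops:
            raise ValueError("too many redirects")
        chain.append(target)
```

Injecting `sleep` and `rng` keeps the backoff logic testable without real delays.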

Start crawling

Define your sources and let RLVNCE handle the rest.
