Crawling & Indexing
Define what to crawl. We handle fetching, extraction, deduplication, and indexing. Configure policies to control scope, depth, and freshness.
Source configuration
Three ways to define what to crawl
Seed URLs
Start from specific pages and follow links within scope.
Domains
Crawl an entire domain or subdomain with automatic sitemap discovery.
Sitemaps
Provide sitemap URLs for precise page selection and priority hints.
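The three source types above could be expressed as configuration like the following sketch. All field names here are illustrative assumptions, not the product's actual API; a small validator shows the intent that each source names exactly one type.

```python
# Hypothetical source definitions -- field names are illustrative, not the real API.
sources = [
    {"type": "seed_urls", "urls": ["https://example.com/docs/intro"], "follow_links": True},
    {"type": "domain", "domain": "docs.example.com", "discover_sitemaps": True},
    {"type": "sitemap", "sitemap_url": "https://example.com/sitemap.xml"},
]

def validate_source(source: dict) -> bool:
    """Check that a source dict declares one of the three types and its required field."""
    required = {"seed_urls": "urls", "domain": "domain", "sitemap": "sitemap_url"}
    kind = source.get("type")
    return kind in required and required[kind] in source

# Every example source above passes validation.
assert all(validate_source(s) for s in sources)
```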
Crawl policies
Fine-grained control over scope and behavior
Depth limits
Control how many links deep the crawler follows from seed pages.
Page budgets
Set maximum pages per source or per crawl run to control scope and cost.
Rate limits
Per-domain rate limiting keeps the crawler a good citizen. Delays between requests are configurable.
Include/exclude rules
URL patterns to include or exclude. Focus on the content that matters.
Recrawl scheduling
Set recrawl frequency per source. Incremental recrawls detect new and changed pages.
robots.txt
Respected by default. Override available when you have authorization to crawl restricted paths.
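Taken together, these policies amount to a scope check applied to every candidate URL before it is fetched. The sketch below shows one way such a check could work, assuming glob-style include/exclude patterns; the key names and pattern semantics are assumptions, not the real policy schema.

```python
import fnmatch

# Hypothetical policy shape -- keys and values are illustrative only.
policy = {
    "max_depth": 3,            # depth limit from seed pages
    "max_pages": 500,          # page budget per crawl run
    "delay_seconds": 1.0,      # per-domain delay between requests
    "include": ["https://example.com/docs/*"],
    "exclude": ["*/archive/*"],
    "respect_robots": True,    # on by default
    "recrawl_hours": 24,       # incremental recrawl frequency
}

def in_scope(url: str, depth: int, policy: dict) -> bool:
    """Apply the depth limit and include/exclude rules to a candidate URL."""
    if depth > policy["max_depth"]:
        return False
    if any(fnmatch.fnmatch(url, pat) for pat in policy["exclude"]):
        return False
    return any(fnmatch.fnmatch(url, pat) for pat in policy["include"])
```

In this sketch, exclude rules win over include rules, which is the usual convention when both match the same URL.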
Connectors
Built-in and custom data sources
The built-in web connector handles standard HTTP crawling. For sources that need API access, authentication, or specialized extraction, build a custom connector using the Connector SDK.
Custom connectors define a manifest, configuration schema, and extraction logic. Deploy them as private apps for your organization or publish to the marketplace.
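A custom connector might look like the following sketch: a manifest, a config schema checked at construction, and fetch/extract logic. The class shape, field names, and the stubbed ticket source are all assumptions for illustration; the actual Connector SDK interface may differ.

```python
# Illustrative connector shape -- the real Connector SDK interface may differ.
class TicketConnector:
    manifest = {"name": "example-tickets", "version": "0.1.0"}
    config_schema = {"api_base": str, "api_token": str}

    def __init__(self, config: dict):
        # Validate the supplied config against the declared schema.
        for key, expected in self.config_schema.items():
            if not isinstance(config.get(key), expected):
                raise ValueError(f"config field {key!r} must be {expected.__name__}")
        self.config = config

    def fetch(self):
        """Yield raw records from the source API (stubbed here)."""
        yield {"id": "T-1", "subject": "Login fails", "body": "Steps to reproduce..."}

    def extract(self, record: dict) -> dict:
        """Map a raw record to an indexable document."""
        return {"id": record["id"], "title": record["subject"], "content": record["body"]}

connector = TicketConnector({"api_base": "https://api.example.com", "api_token": "secret"})
docs = [connector.extract(r) for r in connector.fetch()]
```

Separating fetch (transport, auth) from extract (mapping to documents) keeps the specialized extraction logic testable without network access.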
Content extraction
What gets indexed from each page
Title extraction with fallback hierarchy
Main content with boilerplate removal
Heading structure (h1-h6)
Document metadata (dates, authors, categories)
Canonicalization and deduplication
Content hashing for change detection
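The last two items, deduplication and change detection, both rest on hashing normalized content rather than raw bytes, so cosmetic differences do not register as changes. A minimal sketch, assuming whitespace and case normalization (the actual normalization rules are not specified here):

```python
import hashlib
import re

def content_hash(text: str) -> str:
    """Hash normalized main content so cosmetic edits don't look like changes."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Whitespace and casing differences hash identically...
assert content_hash("Hello  World") == content_hash("hello world\n")
# ...while real content changes do not.
assert content_hash("Hello World") != content_hash("Goodbye World")
```

On recrawl, a page whose stored hash matches the freshly computed one can be skipped without reindexing; two URLs with the same hash are duplicates to collapse under one canonical URL.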