Crawling & Indexing
Define what to crawl. We handle fetching, extraction, deduplication, and indexing. Configure policies to control scope, depth, and freshness.
Source configuration
Three ways to define what to crawl
Seed URLs
Start from specific pages and follow links within scope.
Domains
Crawl an entire domain or subdomain with automatic sitemap discovery.
Sitemaps
Provide sitemap URLs for precise page selection and priority hints.
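The three source types above could be expressed as configuration like the following sketch. All field names here are illustrative assumptions, not the product's actual API; a small validator shows the intent that each source names exactly one type.

```python
# Hypothetical source definitions -- field names are illustrative, not the real API.
sources = [
    {"type": "seed_urls", "urls": ["https://example.com/docs/intro"], "follow_links": True},
    {"type": "domain", "domain": "docs.example.com", "discover_sitemaps": True},
    {"type": "sitemap", "sitemap_url": "https://example.com/sitemap.xml"},
]

def validate_source(source: dict) -> bool:
    """Check that a source dict declares one of the three types and its required field."""
    required = {"seed_urls": "urls", "domain": "domain", "sitemap": "sitemap_url"}
    kind = source.get("type")
    return kind in required and required[kind] in source

# Every example source above passes validation.
assert all(validate_source(s) for s in sources)
```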
Crawl policies
Fine-grained control over scope and behavior
Depth limits
Control how many links deep the crawler follows from seed pages.
Page budgets
Set maximum pages per source or per crawl run to control scope and cost.
Rate limits
Per-domain rate limiting keeps the crawler a good citizen. Delays between requests are configurable.
Include/exclude rules
URL patterns to include or exclude. Focus on the content that matters.
Recrawl scheduling
Set recrawl frequency per source. Incremental recrawls detect new and changed pages.
robots.txt
Respected by default. Override available when you have authorization to crawl restricted paths.
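Taken together, these policies amount to a scope check applied to every candidate URL before it is fetched. The sketch below shows one way such a check could work, assuming glob-style include/exclude patterns; the key names and pattern semantics are assumptions, not the real policy schema.

```python
import fnmatch

# Hypothetical policy shape -- keys and values are illustrative only.
policy = {
    "max_depth": 3,            # depth limit from seed pages
    "max_pages": 500,          # page budget per crawl run
    "delay_seconds": 1.0,      # per-domain delay between requests
    "include": ["https://example.com/docs/*"],
    "exclude": ["*/archive/*"],
    "respect_robots": True,    # on by default
    "recrawl_hours": 24,       # incremental recrawl frequency
}

def in_scope(url: str, depth: int, policy: dict) -> bool:
    """Apply the depth limit and include/exclude rules to a candidate URL."""
    if depth > policy["max_depth"]:
        return False
    if any(fnmatch.fnmatch(url, pat) for pat in policy["exclude"]):
        return False
    return any(fnmatch.fnmatch(url, pat) for pat in policy["include"])
```

In this sketch, exclude rules win over include rules, which is the usual convention when both match the same URL.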
Connectors
Built-in and custom data sources
The built-in web connector handles standard HTTP crawling. For sources that need API access, authentication, or specialized extraction, build a custom connector using the Connector SDK.
Custom connectors define a manifest, configuration schema, and extraction logic. Deploy them as private apps for your organization or publish to the marketplace.
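A custom connector might look like the following sketch: a manifest, a config schema checked at construction, and fetch/extract logic. The class shape, field names, and the stubbed ticket source are all assumptions for illustration; the actual Connector SDK interface may differ.

```python
# Illustrative connector shape -- the real Connector SDK interface may differ.
class TicketConnector:
    manifest = {"name": "example-tickets", "version": "0.1.0"}
    config_schema = {"api_base": str, "api_token": str}

    def __init__(self, config: dict):
        # Validate the supplied config against the declared schema.
        for key, expected in self.config_schema.items():
            if not isinstance(config.get(key), expected):
                raise ValueError(f"config field {key!r} must be {expected.__name__}")
        self.config = config

    def fetch(self):
        """Yield raw records from the source API (stubbed here)."""
        yield {"id": "T-1", "subject": "Login fails", "body": "Steps to reproduce..."}

    def extract(self, record: dict) -> dict:
        """Map a raw record to an indexable document."""
        return {"id": record["id"], "title": record["subject"], "content": record["body"]}

connector = TicketConnector({"api_base": "https://api.example.com", "api_token": "secret"})
docs = [connector.extract(r) for r in connector.fetch()]
```

Separating fetch (transport, auth) from extract (mapping to documents) keeps the specialized extraction logic testable without network access.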
Content extraction
What gets indexed from each page
Title extraction with fallback hierarchy
Main content with boilerplate removal
Heading structure (h1-h6)
Document metadata (dates, authors, categories)
Canonicalization and deduplication
Content hashing for change detection
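The last two items, deduplication and change detection, both rest on hashing normalized content rather than raw bytes, so cosmetic differences do not register as changes. A minimal sketch, assuming whitespace and case normalization (the actual normalization rules are not specified here):

```python
import hashlib
import re

def content_hash(text: str) -> str:
    """Hash normalized main content so cosmetic edits don't look like changes."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Whitespace and casing differences hash identically...
assert content_hash("Hello  World") == content_hash("hello world\n")
# ...while real content changes do not.
assert content_hash("Hello World") != content_hash("Goodbye World")
```

On recrawl, a page whose stored hash matches the freshly computed one can be skipped without reindexing; two URLs with the same hash are duplicates to collapse under one canonical URL.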