Web Scraper
The web scraper library extracts clean text content from web pages, optimized for LLM consumption. It began as `html_fetcher.py` in the recipes server and was split out into a standalone, reusable library.
Quick Reference
| | |
|---|---|
| Package | `jarvis-web-scraper` |
| Source | `jarvis-web-scraper/` |
| Tests | 27 tests |
Usage
Async API
All methods are async and must be awaited.
```python
import asyncio

from jarvis_web_scraper import WebScraper


async def main() -> None:
    scraper = WebScraper()

    # Extract clean text from a single URL
    result = await scraper.fetch_and_extract(url="https://example.com/article")

    if result.ok:
        print(result.text_content)   # Clean extracted text
        print(result.title)          # Page title
        print(result.word_count)     # Word count of extracted text
        print(result.fetch_time_ms)  # Fetch + extraction latency
    else:
        print(result.error)          # Error message if fetch/parse failed


asyncio.run(main())
```
Batch Fetching
Fetch multiple URLs concurrently with batch_fetch():
```python
# Run inside an async function, with `scraper` created as shown earlier
urls = ["https://example.com/page1", "https://example.com/page2"]
results = await scraper.batch_fetch(urls, max_concurrent=3)
# Returns list[ScrapedPage] in the same order as input
```
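How a concurrency cap and input-order results can coexist is easiest to see in a small standalone sketch. This is not the library's implementation: `bounded_gather` and `fake_fetch` are hypothetical names, and the sketch assumes an `asyncio.Semaphore` combined with `asyncio.gather`, which returns results in the order the awaitables were passed in.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def bounded_gather(
    urls: List[str],
    fetch: Callable[[str], Awaitable[T]],
    max_concurrent: int = 3,
) -> List[T]:
    """Run fetch(url) for every URL, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> T:
        # Each task waits here until one of the max_concurrent slots is free.
        async with semaphore:
            return await fetch(url)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real network request.
    await asyncio.sleep(0.01)
    return f"content of {url}"


def demo() -> List[str]:
    urls = [f"https://example.com/page{i}" for i in range(1, 6)]
    return asyncio.run(bounded_gather(urls, fake_fetch, max_concurrent=2))
```

A lower `max_concurrent` is gentler on the target site and on local sockets, at the cost of a longer total wall-clock time for the batch.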
ScrapedPage Fields
| Field | Type | Description |
|---|---|---|
| `url` | `str` | Final URL after redirects |
| `title` | `str \| None` | Page `<title>` |
| `text_content` | `str` | Clean extracted text, stripped of nav/ads/boilerplate |
| `word_count` | `int` | Word count of `text_content` |
| `fetch_time_ms` | `int` | Total latency in milliseconds |
| `error` | `str \| None` | Error message if the request failed |
| `ok` | `bool` | `True` if fetch succeeded and content was extracted |
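To make the field semantics concrete, here is a hypothetical stand-in for the result type. `ScrapedPageSketch` is not the library's class. Deriving `ok` from `error` and `text_content`, and counting words by whitespace splitting, are both assumptions based on the descriptions in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScrapedPageSketch:
    """Hypothetical mirror of the documented fields (not the real class)."""

    url: str
    title: Optional[str] = None
    text_content: str = ""
    fetch_time_ms: int = 0
    error: Optional[str] = None

    @property
    def word_count(self) -> int:
        # Assumption: words are whitespace-separated tokens.
        return len(self.text_content.split())

    @property
    def ok(self) -> bool:
        # Assumption: success means no error and non-empty extracted text.
        return self.error is None and bool(self.text_content.strip())


good = ScrapedPageSketch(
    url="https://example.com/a", title="A", text_content="hello clean text"
)
bad = ScrapedPageSketch(url="https://example.com/b", error="timeout")
```

Under these assumptions, checking `ok` before reading `text_content` is enough to separate usable pages from failed ones when filtering batch results.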
Configuration
Pass a FetchConfig to customize scraping behavior:
```python
from jarvis_web_scraper import WebScraper, FetchConfig

config = FetchConfig(
    timeout=15,                # Request timeout in seconds
    max_chars=8000,            # Cap on characters returned per page
    block_private_hosts=True,  # SSRF protection
    user_agent="Jarvis/1.0",
)
scraper = WebScraper(config=config)
```
| Parameter | Default | Description |
|---|---|---|
| `timeout` | `15` | Request timeout in seconds |
| `max_chars` | `8000` | Maximum characters returned per page (truncates at word boundary) |
| `max_redirects` | `5` | Maximum HTTP redirects to follow |
| `block_private_hosts` | `True` | Block requests to private/loopback IP ranges (SSRF protection) |
| `user_agent` | (default UA) | `User-Agent` header |
| `headers` | `{}` | Additional HTTP headers |
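The "truncates at word boundary" behavior of `max_chars` can be illustrated with a short sketch. The helper name `truncate_at_word_boundary` is hypothetical and the exact rules the library applies are not documented here. This version assumes that a partially cut word is dropped entirely, and that a single token longer than the limit is hard-cut.

```python
def truncate_at_word_boundary(text: str, max_chars: int) -> str:
    """Return at most max_chars characters without splitting a word."""
    if max_chars <= 0:
        return ""
    if len(text) <= max_chars:
        return text

    window = text[:max_chars]
    # The cut splits a word only if both sides of it are non-whitespace.
    cut_inside_word = not window[-1].isspace() and not text[max_chars].isspace()
    if cut_inside_word:
        parts = window.rsplit(None, 1)
        # With no earlier whitespace (one long token), keep the hard cut.
        if len(parts) == 2:
            window = parts[0]
    return window.rstrip()
```

Cutting on a word boundary avoids handing a dangling word fragment to a downstream model as the last token of the context.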
Features
- Extracts main content, stripping navigation, ads, and boilerplate
- Returns clean text suitable for LLM context windows
- Handles common web page structures and formats
- Private IP blocking protects against SSRF in multi-tenant setups
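As an illustration of the private IP blocking listed above, the following sketch shows one common way to reject internal targets using only the standard library. The function names are hypothetical and this is not the library's code. It assumes that "private" covers private, loopback, link-local, reserved, and unspecified ranges.

```python
import ipaddress
import socket
from typing import List
from urllib.parse import urlsplit


def is_internal_ip(address: str) -> bool:
    """True for addresses a public-facing scraper should refuse to contact."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_host(hostname: str) -> List[str]:
    """Return every IP a hostname resolves to; literal IPs pass through."""
    try:
        return [str(ipaddress.ip_address(hostname))]
    except ValueError:
        infos = socket.getaddrinfo(hostname, None)
        # Drop any IPv6 scope suffix such as "%eth0" before parsing.
        return sorted({info[4][0].split("%")[0] for info in infos})


def is_blocked_url(url: str) -> bool:
    hostname = urlsplit(url).hostname
    if hostname is None:
        return True  # No host to validate: refuse rather than guess.
    # Block if any resolved address is internal, not only the first one.
    return any(is_internal_ip(ip) for ip in resolve_host(hostname))
```

A production guard would also need to repeat this check on every redirect hop and connect to the address it validated, since a hostname can resolve differently between the check and the request.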
Consumers
- jarvis-command-center — deep research tool (web search → scrape → summarize)
- jarvis-recipes-server — URL recipe import (HTML parsing)