smolcrawl - simple web crawling for LLMs

Introduction
Today, I’m excited to announce SmolCrawl: a lightweight Python tool that makes crawling websites and building searchable knowledge bases remarkably simple. Whether you’re a developer looking to create documentation search, a researcher collecting information, or someone who wants to build personal knowledge collections, SmolCrawl streamlines the process of extracting, organizing, and searching web content.
SmolCrawl emerged from a common challenge: the need for an easy way to transform web content into searchable knowledge that works locally. While there are many complex web crawlers and search solutions available, I wanted something that was simple to use, required minimal setup, and produced clean, useful output in multiple formats.
The project bridges the gap between web content and local knowledge management, enabling you to build your own mini search engines and knowledge bases with just a few commands. What makes SmolCrawl different is its focus on simplicity without sacrificing powerful features like content extraction, search indexing, and markdown conversion.
Problem Statement
Creating personal knowledge bases from web content has traditionally been a complex endeavor. Developers and researchers often face several challenges:
- Content Acquisition: Most web crawlers are either too simplistic (just downloading HTML) or overly complex, requiring extensive configuration and infrastructure.
- Content Quality: Raw HTML is messy and contains a lot of non-essential elements like navigation, ads, and footers. Extracting the meaningful content requires significant preprocessing.
- Search Capabilities: Once content is collected, making it searchable often involves setting up complex search infrastructure or relying on third-party services.
- Technical Overhead: Many existing solutions require significant technical expertise across multiple domains like web scraping, content processing, and search indexing.
- Flexibility Limitations: Most tools force you into a specific output format, making it difficult to integrate with other systems or workflows.
There’s a clear gap in the market for a lightweight, easy-to-use solution that handles the entire pipeline from web crawling to searchable knowledge base creation. This gap is particularly evident for individual developers, small teams, and researchers who need something more powerful than simple scraping but less complex than enterprise-grade systems.
SmolCrawl aims to fill this gap by providing a balanced solution that’s both powerful and approachable.
Key Features
SmolCrawl combines simplicity with power through a set of carefully designed features:
Simple Web Crawling
With just a single command, you can crawl an entire website. SmolCrawl handles URL discovery, request management, and content extraction automatically:
smolcrawl crawl https://example.com
Intelligent Content Extraction
Not all HTML is created equal. SmolCrawl uses advanced readability algorithms to extract the meaningful content from web pages, filtering out navigation menus, advertisements, footers, and other non-essential elements. This results in clean, focused content that contains just the information you need.
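As a rough illustration of what this step involves, here is a minimal sketch calling readabilipy directly (the library SmolCrawl builds on for extraction); this is not SmolCrawl's internal API, and the URL is hypothetical:
# Sketch: pull the main article content out of raw HTML with readabilipy.
import requests
from readabilipy import simple_json_from_html_string

html = requests.get("https://example.com/some-article").text  # hypothetical URL
article = simple_json_from_html_string(html, use_readability=False)
print(article["title"])    # page title
print(article["content"])  # simplified HTML without navigation, ads, or footers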
Clean Markdown Conversion
HTML is great for browsers but terrible for reading and processing. SmolCrawl automatically converts HTML content into clean, readable markdown format that preserves the important structural elements while eliminating the complexity of HTML tags.
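The conversion step itself is small. Here is a minimal sketch using markdownify (the library SmolCrawl uses for this step, as described in the architecture section), applied to a toy HTML snippet:
# Sketch: convert cleaned HTML into readable markdown with markdownify.
from markdownify import markdownify as md

html = '<h1>Title</h1><p>Some <strong>important</strong> text with a <a href="https://example.com">link</a>.</p>'
print(md(html, heading_style="ATX"))
# # Title
#
# Some **important** text with a [link](https://example.com).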
Fast Search Indexing
Built on Tantivy, a Rust-based search library, SmolCrawl provides blazing-fast full-text search capabilities. This enables you to quickly find relevant information across all your crawled content with proper relevance ranking:
smolcrawl query my_index "your search query" --limit 10
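For a sense of what the Tantivy layer looks like underneath the CLI, here is a minimal indexing-and-querying sketch using the tantivy Python bindings; the schema and field names are illustrative, not SmolCrawl's actual schema:
# Sketch: build and query a Tantivy index via the tantivy Python bindings.
import tantivy

schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True)
schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

index = tantivy.Index(schema)  # in-memory here; pass a path to persist it on disk
writer = index.writer()
writer.add_document(tantivy.Document(title=["Example page"], body=["Crawled content about configuration options."]))
writer.commit()
index.reload()

searcher = index.searcher()
query = index.parse_query("configuration options", ["title", "body"])
for score, address in searcher.search(query, 10).hits:
    print(score, searcher.doc(address)["title"])  # stored field values come back as lists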
Efficient Caching
SmolCrawl implements disk-based caching to prevent redundant crawling of the same content. This not only speeds up subsequent runs but also helps you be a good citizen of the web by reducing unnecessary requests to target servers.
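SmolCrawl's own cache code isn't shown here, but the general pattern is straightforward: key the cache by URL and persist the processed result to disk. A minimal sketch of that pattern (paths and helper names are hypothetical, not SmolCrawl's actual implementation):
# Sketch of URL-keyed disk caching; not SmolCrawl's actual implementation.
import hashlib
from pathlib import Path

CACHE_DIR = Path("./smolcrawl-data/cache")  # hypothetical location
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def get_or_fetch(url: str, fetch) -> str:
    path = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".md")
    if path.exists():          # cache hit: no network request at all
        return path.read_text()
    content = fetch(url)       # cache miss: fetch and process the page
    path.write_text(content)   # persist so later runs can skip this URL
    return content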
Flexible Output Options
Different workflows require different outputs. SmolCrawl supports multiple output formats:
- Search indexes for fast querying
- Markdown files for reading and publishing
- XML exports for integration with other systems
Easy-to-Use CLI
All of SmolCrawl’s functionality is accessible through a simple, intuitive command-line interface that follows modern CLI design patterns:
# List available indices
smolcrawl list_indices
# Index a website into a searchable database
smolcrawl index https://example.com my_index_name
These features work together to create a seamless experience from web content to searchable knowledge base, all with minimal configuration and technical overhead.
Getting Started
Getting up and running with SmolCrawl is designed to be quick and painless. Here’s how to start building your knowledge base in minutes:
Installation
SmolCrawl is available on PyPI and can be installed using pip:
pip install smolcrawl
For the latest development version, you can install directly from the repository:
# Clone the repository
git clone https://github.com/bllchmbrs/smolcrawl.git
cd smolcrawl
# Install the package
pip install -e .
SmolCrawl requires Python 3.11 or higher. All dependencies will be automatically installed.
Basic Usage
Crawl a Website
The simplest way to use SmolCrawl is to crawl a website:
smolcrawl crawl https://example.com
This command will crawl the website and display information about the crawled pages.
Create a Searchable Index
To create a searchable index of a website:
smolcrawl index https://example.com my_index_name
This will crawl the website and create a search index named my_index_name.
Search Your Index
Once you’ve created an index, you can search it:
smolcrawl query my_index_name "your search query"
This will return the most relevant results from your index.
Common Workflows
Documentation Search for an Open-Source Project
# Create a search index for a project's documentation
smolcrawl index https://project-docs.example.com project_docs
# Search for specific information in the documentation
smolcrawl query project_docs "configuration options"
Personal Knowledge Base from Multiple Sources
# Index multiple websites into a single knowledge base
smolcrawl index https://blog.example.com my_knowledge_base
smolcrawl index https://docs.example.com my_knowledge_base
smolcrawl index https://wiki.example.com my_knowledge_base
# Search across all sources
smolcrawl query my_knowledge_base "topic of interest"
Generate Markdown Files for Publishing
# Create markdown files from a website
smolcrawl index https://blog.example.com blog_content --index-type markdown
# The markdown files will be stored in your storage directory
# (default: ./smolcrawl-data/markdown_files/blog_content)
Create XML Export for Integration
# Create an XML export of a website
smolcrawl index https://api-docs.example.com api_docs --index-type xml
# The XML file will be stored in your storage directory
# (default: ./smolcrawl-data/xml_files/api_docs.xml)
With these basic commands, you can quickly build powerful knowledge bases from any web content.
Technical Architecture
SmolCrawl’s architecture is designed to be modular and efficient, ensuring a smooth pipeline from web crawling to searchable knowledge.
Component Structure
SmolCrawl consists of three main components:
- Crawler: Handles web requests, URL discovery, and content extraction using BeautifulSoupCrawler and readabilipy.
- Content Processor: Extracts meaningful content from HTML and converts it to markdown using readability algorithms and markdownify.
- Indexers: Store content and make it searchable through different backends:
  - TantivyIndexer: Creates a full-text search index using Tantivy
  - MarkdownFileIndexer: Generates markdown files organized by URL structure
  - XmlFileIndexer: Produces a single XML file containing all content
These components are integrated through a simple CLI interface built with Typer.
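To give a feel for that integration, here is a minimal Typer skeleton; the command names mirror the documented CLI, but the signatures and defaults are illustrative, not SmolCrawl's actual source:
# Minimal Typer skeleton showing how a CLI layer can wrap the components above.
import typer

app = typer.Typer()

@app.command()
def crawl(url: str):
    """Crawl a site and report the pages found."""
    ...

@app.command()
def index(url: str, index_name: str, index_type: str = "tantivy"):
    """Crawl a site and write it into the chosen index backend."""
    ...

@app.command()
def query(index_name: str, query_text: str, limit: int = 10):
    """Search an existing index and print the top results."""
    ...

if __name__ == "__main__":
    app()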
Crawling and Indexing Process
Here’s how the process works end-to-end (a stripped-down code sketch follows the list):
- URL Entry Point: The process begins with a single URL provided by the user.
- Discovery and Crawling: The crawler visits the URL, extracts content, and discovers new URLs within the same domain.
- Content Extraction: For each page, the readability algorithm identifies and extracts the meaningful content (main article, documentation, etc.) while filtering out navigation, ads, and other non-essential elements.
- Markdown Conversion: The extracted HTML content is converted to markdown format, preserving headings, lists, links, and other important structural elements.
- Caching: Processed pages are cached on disk to avoid redundant crawling in subsequent runs.
- Indexing: Depending on the selected output type, the content is either:
  - Added to a Tantivy search index for fast querying
  - Written to markdown files mirroring the site’s URL structure
  - Compiled into a single XML file with appropriate metadata
- Querying (for search indexes): When searching, the query is processed by Tantivy, which returns relevant documents with scores based on content similarity.
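Tying those steps together, a stripped-down version of the crawl-and-convert loop might look like the following. This sketch ignores caching and parallelism, uses requests and BeautifulSoup directly, and is not SmolCrawl's actual implementation:
# Stripped-down crawl loop: fetch, extract, convert, discover in-domain links, repeat.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
from readabilipy import simple_json_from_html_string

def crawl_site(start_url: str) -> list[dict]:
    domain = urlparse(start_url).netloc
    to_visit, seen, pages = [start_url], set(), []
    while to_visit:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=30).text
        article = simple_json_from_html_string(html, use_readability=False)
        pages.append({"url": url, "title": article["title"], "markdown": md(article["content"])})
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:  # stay within the same domain
                to_visit.append(link)
    return pages  # ready to hand to an indexer (Tantivy, markdown files, or XML)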
Performance Considerations
SmolCrawl incorporates several performance optimizations:
- Disk-based Caching: Prevents redundant crawling and processing of the same URLs across multiple runs.
- Rust-based Search: Using Tantivy (a Rust implementation similar to Lucene) provides fast search capabilities even for large document collections.
- Parallel Crawling: The underlying crawler handles multiple requests in parallel to speed up the crawling process.
- Efficient Storage: Content is stored in optimized formats, with the raw HTML kept only when necessary.
- Memory Management: SmolCrawl is designed to handle large websites without excessive memory usage by processing pages incrementally.
This architecture balances simplicity with power, allowing SmolCrawl to handle a wide range of use cases from small personal websites to large documentation portals, all while remaining easy to use and requiring minimal resources.
Use Cases
SmolCrawl’s flexibility makes it suitable for a wide range of applications. Here are some of the most compelling use cases:
Creating Documentation Search for Open-Source Projects
Open-source projects often have extensive documentation spread across multiple pages. SmolCrawl can transform this documentation into a searchable knowledge base:
- Index the entire documentation site with a single command
- Provide fast, accurate search across all documentation pages
- Enable offline access to the documentation
- Create custom search experiences for project-specific terminology
For maintainers, this means empowering users to find answers quickly without building and maintaining a dedicated search infrastructure.
Building Personal Research Knowledge Bases
Researchers often collect information from multiple sources. SmolCrawl enables you to:
- Aggregate content from multiple research websites and papers
- Create a unified, searchable database of research material
- Extract and index only the relevant parts of papers and articles
- Build a personal research assistant for quick reference
By making your research material searchable, you can make connections between related concepts more easily and find relevant information faster.
Archiving Website Content for Offline Access
Whether you’re preparing for a trip with limited internet access or want to preserve content that might change or disappear, SmolCrawl provides:
- Complete website archives in clean, readable markdown format
- Preservation of the essential content without the clutter
- Organized storage that mirrors the original site structure
- Offline searchability for accessing information without an internet connection
This is especially valuable for reference materials, documentation, or content you rely on regularly.
Generating Searchable Collections for AI Tools and Agents
AI tools and agents often need structured data to work with. SmolCrawl can:
- Create clean, structured datasets from web content
- Format content in XML or markdown for easy ingestion by AI systems
- Build domain-specific knowledge bases for retrieval-augmented generation (RAG)
- Enable semantic search across collected documents
By providing high-quality, pre-processed content, you can significantly improve the performance of AI systems that work with your data.
Content Monitoring and Analysis
Track changes and analyze content across websites:
- Monitor documentation for updates or changes
- Analyze content trends across multiple sites
- Create a historical archive of changing information
- Compare different versions of the same content over time
These use cases demonstrate SmolCrawl’s versatility in turning web content into valuable, searchable knowledge bases for various purposes. Whether you’re a developer, researcher, content creator, or knowledge worker, SmolCrawl provides a simple yet powerful way to make web content more accessible and useful.
Responsible Crawling Guidelines
Web crawling comes with responsibilities. While SmolCrawl makes it easy to crawl websites, it’s important to do so ethically and responsibly. Here are some guidelines to follow:
Best Practices for Ethical Web Crawling
- Get Permission When Possible: If you’re crawling a website for commercial purposes or extensive data collection, consider reaching out to the site owner for permission.
- Identify Your Crawler: Use an identifiable user agent in your requests so site owners can see who is crawling their site. While SmolCrawl doesn’t currently support custom user agents, this is planned for a future release.
- Limit the Scope: Only crawl what you need. Use SmolCrawl to target specific sections of websites rather than crawling entire domains unnecessarily.
- Minimize Impact: Schedule large crawling jobs during off-peak hours to reduce the impact on the target server.
- Be Transparent: If you’re building a service based on crawled data, be clear about your data sources and how the information is being used.
Respecting robots.txt and Rate Limits
- Check robots.txt Files: Before crawling a website, check its robots.txt file (usually found at https://example.com/robots.txt) to see which parts of the site are off-limits to crawlers (see the sketch after this list).
- Honor Crawl-Delay Directives: Some robots.txt files specify a crawl delay, which indicates how long you should wait between requests. While SmolCrawl doesn’t currently implement automatic robots.txt parsing, honoring these guidelines manually is important.
- Respect Rate Limits: Pay attention to HTTP 429 (Too Many Requests) responses, which indicate you’re hitting the site too frequently. If you receive these, reduce your crawling frequency.
- Consider Implementing Delays: For intensive crawling operations, consider adding delays between requests using environment variables or configuration options (planned for a future release).
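Checking robots.txt is easy to script with the Python standard library before you kick off a crawl. SmolCrawl doesn’t do this for you yet, so a small helper like this sketch can fill the gap:
# Check whether a path may be crawled, using the standard library's robots.txt parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.can_fetch("*", "https://example.com/docs/page.html"))  # True if crawling is allowed
print(rp.crawl_delay("*"))  # Crawl-delay in seconds, or None if the site doesn't set one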
Avoiding Server Overload
- Parallelize Carefully: While parallel requests can speed up crawling, too many simultaneous connections can overwhelm smaller servers. SmolCrawl’s underlying crawler has some parallelization, but use it judiciously.
- Implement Exponential Backoff: If you encounter server errors (5xx responses), back off exponentially to give the server time to recover before retrying (see the sketch after this list).
- Monitor Server Response Times: If response times start increasing significantly, it could be a sign that you’re putting too much load on the server. Consider slowing down your crawling rate.
- Limit Crawling Depth and Breadth: For large websites, consider limiting the crawl depth or focusing on specific URL patterns to avoid excessive requests.
- Use Caching Effectively: Take advantage of SmolCrawl’s caching capabilities to avoid recrawling the same content unnecessarily.
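As a sketch of what backoff and rate-limit handling can look like if you fetch pages yourself (these knobs aren’t built into SmolCrawl’s CLI today):
# Fetch a URL, backing off exponentially on 429 and 5xx responses.
import time
import requests

def polite_get(url: str, retries: int = 5, base_delay: float = 1.0) -> requests.Response:
    for attempt in range(retries):
        response = requests.get(url, timeout=30)
        if response.status_code == 429 or response.status_code >= 500:
            time.sleep(base_delay * (2 ** attempt))  # wait 1s, 2s, 4s, 8s, ...
            continue
        response.raise_for_status()
        return response
    raise RuntimeError(f"Giving up on {url} after {retries} attempts")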
By following these guidelines, you can ensure that your use of SmolCrawl remains responsible and respectful of website owners and their resources. Remember that responsible crawling is not just about technical compliance—it’s about being a good citizen of the web ecosystem.
Get Involved
SmolCrawl is an open-source project, and we’d love for you to be part of its development and growth. Here’s how you can get involved:
GitHub Repository
The project is hosted on GitHub at:
https://github.com/bllchmbrs/smolcrawl
Star the repository to show your support and stay updated with new releases and changes!
How to Contribute
Contributing to SmolCrawl is straightforward:
- Fork the Repository: Create your own fork of the project.
- Set Up Your Environment: Clone your fork and install the development dependencies:
  git clone https://github.com/yourusername/smolcrawl.git
  cd smolcrawl
  pip install -e ".[dev]"  # Install with development dependencies
- Create a Feature Branch:
  git checkout -b feature/your-amazing-feature
- Make Your Changes: Implement your feature or fix with clear, well-documented code.
- Write Tests: Add tests for your changes to ensure they work as expected.
- Run the Test Suite: Make sure all tests pass:
  pytest
- Submit a Pull Request: Push your changes to your fork and submit a pull request to the main repository.
- Engage in Code Review: Respond to feedback and make necessary adjustments.
We welcome contributions of all kinds, from documentation improvements to new features. Even if you’re new to open source, we’re happy to help you through the process!
Areas Where Help is Needed
Here are some specific areas where contributions would be especially valuable:
- Documentation: Improving the README, adding more examples, and creating tutorials.
- Testing: Expanding test coverage and adding integration tests.
- Features: Implementing items from the roadmap.
- Bug Fixes: Addressing issues reported by users.
- Examples: Creating example projects and use cases.
- Performance: Optimizing crawling and indexing performance.
Community Channels and Support Options
While we’re still growing our community, here are ways to get help and connect with other users:
- GitHub Issues: For bug reports, feature requests, and general questions.
- Discussions: Use GitHub Discussions for broader topics and community engagement.
- Email Support: For direct assistance, you can reach out to [email protected]
Code of Conduct
We’re committed to providing a welcoming and inclusive environment for everyone. All contributors are expected to adhere to our Code of Conduct, which promotes respect, empathy, and constructive collaboration.
Recognition
Contributors will be acknowledged in our documentation and release notes. We believe in giving credit where it’s due!
Whether you’re a developer, writer, designer, or just an enthusiastic user, there’s a place for you in the SmolCrawl community. Your contributions, feedback, and ideas will help shape the future of this tool and make it even more valuable for knowledge workers everywhere.
Conclusion
SmolCrawl represents a step towards making web content more accessible, organized, and searchable for individuals and small teams. By simplifying the process of crawling websites and creating knowledge bases, we hope to empower developers, researchers, and knowledge workers to build valuable resources from the wealth of information available on the web.
Try SmolCrawl Today
If you’re looking for a simple way to create searchable knowledge bases from web content, I invite you to give SmolCrawl a try:
pip install smolcrawl
In just a few minutes, you can transform a website into a searchable resource that works offline and integrates with your existing workflows. Whether you’re creating documentation search for an open-source project, building a research database, or archiving important content, SmolCrawl provides the tools you need.
Share Your Feedback
Your feedback is incredibly valuable as we continue to develop and improve SmolCrawl. After trying it out, please consider:
- Opening an issue on GitHub with any bugs or challenges you encounter
- Suggesting features that would make SmolCrawl more useful for your workflow
- Sharing your success stories and use cases to inspire others
The Future of Knowledge Management
We believe that personal knowledge management is evolving rapidly, and tools like SmolCrawl play an important role in this evolution. As information continues to grow exponentially, the ability to curate, organize, and search through content becomes increasingly important.
SmolCrawl aims to be part of the solution by providing a bridge between the vast, unstructured web and your personal, searchable knowledge base. By making web content more accessible and useful, we hope to contribute to a future where knowledge is more easily shared, discovered, and applied.
Thank you for your interest in SmolCrawl. We’re excited to see what you’ll build with it and how it will enhance your knowledge management workflows. Together, we can make the web’s information more accessible and useful for everyone.
Happy crawling!