Saturday - Search Robot and Search Engine

Project Overview
Saturday is a search robot and search engine built in Go, designed to index web content efficiently and return relevant search results. It crawls specified websites, extracts and processes text, and builds a search index stored in BadgerDB, supporting queries ranked with a hybrid approach that combines traditional scoring algorithms with machine learning.
Core Features
Search Robot Features
- Multi-site Crawling: The search robot can crawl multiple websites with configurable depth limits.
- Rate Limiting: The search robot respects server load by controlling request frequency.
- Worker Pool Architecture: Enables concurrent crawling with efficient resource use.
- Sitemap Support: The search robot leverages sitemap.xml files for comprehensive URL discovery (assumed functionality based on typical search robot design).
- Robots.txt Compliance: The search robot adheres to website crawling policies (assumed based on standard practice).
Search Features
- Text Processing: Applies stemming and stop word filtering for enhanced search quality.
- Hybrid Search Ranking: Combines TF-IDF, BM25, and BERT-tiny model for precise results.
- Neural Embeddings: Uses BERT-tiny for generating document and query embeddings, enabling semantic search capabilities.
System Features
- Highly Configurable: Customizes behavior via JSON configuration files.
- REST API: Controls the search robot and search functionality remotely through HTTP endpoints.
- Persistent Storage: Uses BadgerDB for scalable, disk-based index storage.
System Architecture
Saturday is organized into modular components with distinct responsibilities, as derived from the provided code:
- Search Robot: Crawls webpages, extracts links, and processes content (implemented in the srv package via SearchIndex).
- Search Index: Indexes documents and handles queries (part of the indexRepository and srv packages).
- Worker Pool: Manages concurrent task execution with controlled parallelism (workerPool package).
- Text Processing: Applies English stemming and stop word filtering (stemmer package).
- Rate Limiter: Ensures respectful request rates to target servers (configurable via Rate in CrawlRequest).
- Document Handler: Extracts text and metadata from content (assumed functionality within SearchIndex).
- Logger: Provides asynchronous logging for performance and debugging (logger package).
- REST Server: Exposes functionality via HTTP endpoints (srv package).
- Index Repository: Manages persistent storage and retrieval of documents and visited URLs (indexRepository package).
Technical Implementation
Search Robot Process
The search robot's process, inferred from the srv and indexRepository packages, proceeds as follows:
- Initialization: Begins with base URLs specified in the configuration (BaseUrls in CrawlRequest).
- Per URL:
  - Fetches page content and extracts links and text (assumed within SearchIndex.Index).
  - Normalizes URLs to avoid redundancy (handled by visitedURLs in indexRepository).
  - Applies text processing using stemming and stop word removal (stemmer package).
  - Indexes processed content in BadgerDB (IndexDocument in indexRepository).
  - Enqueues new links within depth and domain constraints (MaxDepthCrawl and OnlySameDomain).
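A minimal sketch of this per-URL loop is shown below. The fetch and index helpers are hypothetical stand-ins for the corresponding SearchIndex and indexRepository logic, not the project's actual functions.

package crawler

import (
	"net/url"
	"strings"
)

type page struct {
	URL   string
	Text  string
	Links []string
}

// normalizeURL resolves a link against its base and strips fragments,
// so the visited set can deduplicate equivalent addresses.
func normalizeURL(base *url.URL, href string) (string, bool) {
	u, err := base.Parse(strings.TrimSpace(href))
	if err != nil || (u.Scheme != "http" && u.Scheme != "https") {
		return "", false
	}
	u.Fragment = ""
	return u.String(), true
}

// crawlOne processes a single URL: fetch, index, then enqueue new links
// within the depth and domain constraints described above.
func crawlOne(rawURL string, depth, maxDepth int, sameDomain bool,
	visited map[string]bool, enqueue func(string, int),
	fetch func(string) (page, error),
	index func(pageURL, text string) error) error {

	p, err := fetch(rawURL)
	if err != nil {
		return err
	}
	// Index the processed text (stemming/stop-word removal happens inside).
	if err := index(rawURL, p.Text); err != nil {
		return err
	}
	if depth >= maxDepth {
		return nil
	}
	base, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	for _, href := range p.Links {
		link, ok := normalizeURL(base, href)
		if !ok || visited[link] {
			continue
		}
		if sameDomain {
			u, _ := url.Parse(link)
			if u == nil || u.Host != base.Host {
				continue
			}
		}
		visited[link] = true
		enqueue(link, depth+1)
	}
	return nil
}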
Search Functionality
Saturday's search engine, implemented in the srv package, processes queries as follows:
- Query Processing: Tokenizes and stems the user's query using stemmer.NewEnglishStemmer().
- Document Retrieval: Fetches documents matching query terms from the index (GetDocumentsByWord).
- Semantic Scoring: Generates embeddings with the BERT-tiny model for both query and documents.
- Hybrid Ranking: Combines traditional scoring (TF-IDF, BM25) with cosine similarity of BERT-tiny embeddings.
- Result Ranking: Sorts results by relevance using the combined score.
- Response: Returns URLs and descriptions (SearchResponse).
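The exact weighting used by the srv package is not documented here; the sketch below shows one plausible way to blend TF-IDF and BM25 scores with cosine similarity of BERT-tiny embeddings. The weights and helper names are illustrative assumptions.

package ranking

import "math"

// cosine computes cosine similarity between two embedding vectors,
// e.g. query and document embeddings produced by BERT-tiny.
func cosine(a, b []float32) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// hybridScore blends lexical scores with semantic similarity.
// The weights here are illustrative, not the project's actual values.
func hybridScore(tfidf, bm25 float64, queryEmb, docEmb []float32) float64 {
	const wLexical, wSemantic = 0.6, 0.4
	lexical := 0.5*tfidf + 0.5*bm25
	return wLexical*lexical + wSemantic*cosine(queryEmb, docEmb)
}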
Concurrency Model
The workerPool package implements a worker pool pattern for optimal performance:
- Workers: Configurable number of goroutines process tasks (WorkerCount).
- Task Queue: Manages tasks with a customizable capacity (TaskCount).
- Graceful Shutdown: Supports stopping via channels (Stop()).
- Resource Management: Prevents system overload using atomic counters.
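The workerPool internals are not reproduced here; below is a minimal sketch of the pattern described above, assuming a buffered task channel, a WaitGroup for shutdown, and an atomic counter for in-flight tasks.

package workerpool

import (
	"sync"
	"sync/atomic"
)

// Pool is an illustrative worker pool: WorkerCount goroutines consume
// tasks from a channel with TaskCount capacity.
type Pool struct {
	tasks   chan func()
	wg      sync.WaitGroup
	pending atomic.Int64 // tasks submitted but not yet finished
}

func New(workerCount, taskCount int) *Pool {
	p := &Pool{tasks: make(chan func(), taskCount)}
	for i := 0; i < workerCount; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for task := range p.tasks {
				task()
				p.pending.Add(-1)
			}
		}()
	}
	return p
}

// Submit enqueues a task; it blocks when the queue is full, which
// applies natural back-pressure to the crawler.
func (p *Pool) Submit(task func()) {
	p.pending.Add(1)
	p.tasks <- task
}

// Pending reports how many submitted tasks have not yet completed.
func (p *Pool) Pending() int64 { return p.pending.Load() }

// Stop closes the queue and waits for workers to drain it.
func (p *Pool) Stop() {
	close(p.tasks)
	p.wg.Wait()
}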
Configuration Options
Saturday is configured via a JSON request to the REST API. Example:
{
  "base_urls": ["https://example.com/"],
  "worker_count": 10,
  "task_count": 100,
  "max_links_in_page": 50,
  "max_depth_crawl": 3,
  "only_same_domain": true,
  "rate": 5
}
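For reference, a Go struct matching this JSON might look like the sketch below; the field names are inferred from the JSON keys and may differ from the project's actual CrawlRequest definition.

package srv // package location assumed

// CrawlRequest mirrors the JSON configuration shown above.
type CrawlRequest struct {
	BaseUrls       []string `json:"base_urls"`
	WorkerCount    int      `json:"worker_count"`
	TaskCount      int      `json:"task_count"`
	MaxLinksInPage int      `json:"max_links_in_page"`
	MaxDepthCrawl  int      `json:"max_depth_crawl"`
	OnlySameDomain bool     `json:"only_same_domain"`
	Rate           int      `json:"rate"`
}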
Key Parameters
- base_urls: Starting URLs for the search robot (required).
- worker_count: Number of concurrent workers.
- task_count: Task queue capacity.
- max_links_in_page: Maximum links extracted per page.
- max_depth_crawl: Maximum crawl depth.
- only_same_domain: Restricts crawling to the base URL domains.
- rate: Requests per second.
Usage Examples
REST Server Mode
Run Saturday as a REST server:
./saturday --srv-port=8080
API Endpoints Documentation
Authentication and Encryption
All endpoints except /public and /aes require encrypted request bodies using AES encryption.
1. Get Public Key
GET /public
Description: Retrieves the RSA public key for encrypting the AES key.
Response: PEM-encoded RSA public key
Content-Type: application/x-pem-file
Status Codes:
- 200: Success
- 500: Server error getting/encoding public key
2. Set AES Key
POST /aes
Description: Sends RSA-encrypted AES key for subsequent requests
Request Body: RSA-encrypted AES key (base64 encoded)
Status Codes:
- 200: Success
- 400: Invalid request body
- 500: Decryption failed
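A client-side sketch of this key exchange is shown below. It assumes RSA-OAEP padding, a PKIX-encoded public key, and a 32-byte AES key; none of these details are documented, so treat them as assumptions.

package client

import (
	"bytes"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/pem"
	"fmt"
	"io"
	"net/http"
)

// exchangeAESKey fetches the server's RSA public key from /public,
// encrypts a freshly generated AES key with it, and posts the result
// to /aes. OAEP padding and a 256-bit key are assumptions.
func exchangeAESKey(baseURL string) ([]byte, error) {
	resp, err := http.Get(baseURL + "/public")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	pemBytes, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return nil, fmt.Errorf("no PEM block in /public response")
	}
	pub, err := x509.ParsePKIXPublicKey(block.Bytes)
	if err != nil {
		return nil, err
	}
	rsaPub, ok := pub.(*rsa.PublicKey)
	if !ok {
		return nil, fmt.Errorf("unexpected public key type %T", pub)
	}

	// Generate a random 32-byte AES key for subsequent requests.
	aesKey := make([]byte, 32)
	if _, err := rand.Read(aesKey); err != nil {
		return nil, err
	}

	encrypted, err := rsa.EncryptOAEP(sha256.New(), rand.Reader, rsaPub, aesKey, nil)
	if err != nil {
		return nil, err
	}
	body := base64.StdEncoding.EncodeToString(encrypted)

	// Content type is an assumption; the server only documents the body format.
	resp2, err := http.Post(baseURL+"/aes", "text/plain", bytes.NewBufferString(body))
	if err != nil {
		return nil, err
	}
	defer resp2.Body.Close()
	if resp2.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("/aes returned status %d", resp2.StatusCode)
	}
	return aesKey, nil
}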
Search Robot Control
3. Start Crawling
POST /crawl/start
Description: Initiates a new crawling job
Request Body:
{
  "base_urls": ["https://example.com"],
  "worker_count": 10,
  "task_count": 100,
  "max_links_in_page": 50,
  "max_depth_crawl": 3,
  "only_same_domain": true,
  "rate": 5
}
Response:
{
  "job_id": "uuid-string",
  "status": "started"
}
Status Codes:
- 200: Job started successfully
- 400: Invalid request parameters
- 500: Internal server error
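Because request bodies must be AES-encrypted, a client first marshals the JSON above and then encrypts it with the previously exchanged key. The sketch below assumes AES-GCM with the nonce prepended and a base64-encoded result; the project's actual cipher mode and framing may differ.

package client

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"net/http"
)

// encryptBody encrypts a plaintext request body with AES-GCM.
// Prepending the nonce and base64-encoding the result are assumptions
// about the wire format, not documented behaviour.
func encryptBody(aesKey, plaintext []byte) (string, error) {
	blockCipher, err := aes.NewCipher(aesKey)
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(blockCipher)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	sealed := gcm.Seal(nonce, nonce, plaintext, nil)
	return base64.StdEncoding.EncodeToString(sealed), nil
}

// startCrawl posts an encrypted CrawlRequest JSON document to /crawl/start.
func startCrawl(baseURL string, aesKey, crawlJSON []byte) (*http.Response, error) {
	body, err := encryptBody(aesKey, crawlJSON)
	if err != nil {
		return nil, err
	}
	return http.Post(baseURL+"/crawl/start", "application/json",
		bytes.NewBufferString(body))
}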
4. Stop Crawling
POST /crawl/stop
Description: Stops an active crawling job
Request Body:
{
  "job_id": "uuid-string"
}
Response:
{
  "status": "stopped"
}
Status Codes:
- 200: Job stopped successfully
- 400: Invalid job ID
- 404: Job not found
5. Get Crawling Status
GET /crawl/status?job_id=uuid-string
Description: Retrieves the current status of a crawling job
Query Parameters:
- job_id: UUID of the crawling job
Response:
{
  "status": "running",
  "pages_crawled": 150
}
Possible Status Values:
- "initializing": Job is starting up
- "running": Job is actively crawling
- "stopping": Job is gracefully shutting down
- "completed": Job finished successfully
- "failed": Job encountered an error
- "not_found": Job ID doesn't exist
Status Codes:
- 200: Status retrieved successfully
- 400: Missing or invalid job ID
- 404: Job not found
Search Operations
6. Search Indexed Content
POST /search
Description: Searches the indexed content using hybrid ranking
Request Body:
{
  "job_id": "uuid-string",
  "query": "search terms",
  "max_results": 10
}
Response:
[
  {
    "url": "https://example.com/page1",
    "description": "Page description with highlighted search terms"
  }
]
Request Body Fields:
- query: Search terms (required)
- max_results: Maximum number of results to return (default: 10)
- job_id: Optional ID to search within specific crawl results
Ranking Factors:
- Word frequency (TF-IDF)
- BM25 score
- Semantic similarity (BERT embeddings)
- Word inclusion count
Status Codes:
- 200: Search completed successfully
- 400: Invalid search parameters
- 500: Search operation failed
Response Encryption
All successful responses (except /public) are encrypted using the following format:
{
  "data": "base64-encoded-encrypted-response"
}
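A matching client-side decryption step might look like the sketch below. It assumes the same AES-GCM layout as in the request example (nonce prepended, base64 payload in the "data" field); this is an assumption, not documented behaviour.

package client

import (
	"crypto/aes"
	"crypto/cipher"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// encryptedResponse mirrors the {"data": "..."} envelope described above.
type encryptedResponse struct {
	Data string `json:"data"`
}

// decryptResponse unwraps and decrypts an encrypted response body.
// AES-GCM with a prepended nonce is assumed.
func decryptResponse(aesKey, body []byte) ([]byte, error) {
	var env encryptedResponse
	if err := json.Unmarshal(body, &env); err != nil {
		return nil, err
	}
	ciphertext, err := base64.StdEncoding.DecodeString(env.Data)
	if err != nil {
		return nil, err
	}
	blockCipher, err := aes.NewCipher(aesKey)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(blockCipher)
	if err != nil {
		return nil, err
	}
	if len(ciphertext) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext shorter than nonce")
	}
	nonce, sealed := ciphertext[:gcm.NonceSize()], ciphertext[gcm.NonceSize():]
	return gcm.Open(nil, nonce, sealed, nil)
}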
Error Responses
Error responses follow the standard HTTP status codes and include a plain text error message in the response body.
Technical Requirements
- Go: 1.23.3
- Dependencies:
github.com/dgraph-io/badger/v3 (BadgerDB)
github.com/google/uuid (UUID generation)
golang.org/x/net (HTML parsing)
- Standard library packages such as net/http and sync are also used.
License
Saturday is licensed under the MIT License (assumed based on typical Go projects).