Saturday - Search Robot and Search Engine

Project Overview
Saturday is a search robot and search engine built in Go, designed to index web content efficiently and return relevant search results. It crawls specified websites, extracts and processes text, and builds a search index stored in BadgerDB, supporting queries ranked with a hybrid approach that combines traditional scoring algorithms with machine learning.
Core Features
Search Robot Features
- Multi-site Crawling: The search robot can crawl multiple websites with configurable depth limits.
- Rate Limiting: The search robot respects server load by controlling request frequency.
- Worker Pool Architecture: Enables concurrent crawling with efficient resource use.
- Sitemap Support: The search robot leverages sitemap.xml files for comprehensive URL discovery (assumed functionality based on typical search robot design).
- Robots.txt Compliance: The search robot adheres to website crawling policies (assumed based on standard practice).
Search Features
- Text Processing: Applies stemming and stop word filtering for enhanced search quality.
- Hybrid Search Ranking: Combines TF-IDF, BM25, and BERT-tiny model for precise results.
- Neural Embeddings: Uses BERT-tiny for generating document and query embeddings, enabling semantic search capabilities.
System Features
- Highly Configurable: Customizes behavior via JSON configuration files.
- REST API: Controls the search robot and search functionality remotely through HTTP endpoints.
- Persistent Storage: Uses BadgerDB for scalable, disk-based index storage.
System Architecture
Saturday is organized into modular components with distinct responsibilities, as derived from the provided code:
- Search Robot: Crawls webpages, extracts links, and processes content (implemented in the srv package via SearchIndex).
- Search Index: Indexes documents and handles queries (part of the indexRepository and srv packages).
- Worker Pool: Manages concurrent task execution with controlled parallelism (workerPool package).
- Text Processing: Applies English stemming and stop word filtering (stemmer package).
- Rate Limiter: Ensures respectful request rates to target servers (configurable via Rate in CrawlRequest).
- Document Handler: Extracts text and metadata from content (assumed functionality within SearchIndex).
- Logger: Provides asynchronous logging for performance and debugging (logger package).
- REST Server: Exposes functionality via HTTP endpoints (srv package).
- Index Repository: Manages persistent storage and retrieval of documents and visited URLs (indexRepository package).
Technical Implementation
Search Robot Process
The search robot's process, inferred from the srv and indexRepository packages, proceeds as follows:
- Initialization: Begins with base URLs specified in the configuration (BaseUrls in CrawlRequest).
- Per URL:
  - Fetches page content and extracts links and text (assumed within SearchIndex.Index).
  - Normalizes URLs to avoid redundancy (handled by visitedURLs in indexRepository).
  - Applies text processing using stemming and stop word removal (stemmer package).
  - Indexes processed content in BadgerDB (IndexDocument in indexRepository).
  - Enqueues new links within depth and domain constraints (MaxDepthCrawl and OnlySameDomain).
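A minimal sketch of this per-URL loop is shown below. The fetch and index helpers are hypothetical stand-ins for the corresponding SearchIndex and indexRepository logic, not the project's actual functions.

package crawler

import (
	"net/url"
	"strings"
)

type page struct {
	URL   string
	Text  string
	Links []string
}

// normalizeURL resolves a link against its base and strips fragments,
// so the visited set can deduplicate equivalent addresses.
func normalizeURL(base *url.URL, href string) (string, bool) {
	u, err := base.Parse(strings.TrimSpace(href))
	if err != nil || (u.Scheme != "http" && u.Scheme != "https") {
		return "", false
	}
	u.Fragment = ""
	return u.String(), true
}

// crawlOne processes a single URL: fetch, index, then enqueue new links
// within the depth and domain constraints described above.
func crawlOne(rawURL string, depth, maxDepth int, sameDomain bool,
	visited map[string]bool, enqueue func(string, int),
	fetch func(string) (page, error),
	index func(pageURL, text string) error) error {

	p, err := fetch(rawURL)
	if err != nil {
		return err
	}
	// Index the processed text (stemming/stop-word removal happens inside).
	if err := index(rawURL, p.Text); err != nil {
		return err
	}
	if depth >= maxDepth {
		return nil
	}
	base, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	for _, href := range p.Links {
		link, ok := normalizeURL(base, href)
		if !ok || visited[link] {
			continue
		}
		if sameDomain {
			u, _ := url.Parse(link)
			if u == nil || u.Host != base.Host {
				continue
			}
		}
		visited[link] = true
		enqueue(link, depth+1)
	}
	return nil
}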
Search Functionality
Saturday's search engine, implemented in the srv package, processes queries as follows:
- Query Processing: Tokenizes and stems the user's query using stemmer.NewEnglishStemmer().
- Document Retrieval: Fetches documents matching query terms from the index (GetDocumentsByWord).
- Semantic Scoring: Generates embeddings with the BERT-tiny model for both query and documents.
- Hybrid Ranking: Combines traditional scoring (TF-IDF, BM25) with cosine similarity of BERT-tiny embeddings.
- Result Ranking: Sorts results by relevance using the combined score.
- Response: Returns URLs and descriptions (SearchResponse).
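The exact weighting used by the srv package is not documented here; the sketch below shows one plausible way to blend TF-IDF and BM25 scores with cosine similarity of BERT-tiny embeddings. The weights and helper names are illustrative assumptions.

package ranking

import "math"

// cosine computes cosine similarity between two embedding vectors,
// e.g. query and document embeddings produced by BERT-tiny.
func cosine(a, b []float32) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// hybridScore blends lexical scores with semantic similarity.
// The weights here are illustrative, not the project's actual values.
func hybridScore(tfidf, bm25 float64, queryEmb, docEmb []float32) float64 {
	const wLexical, wSemantic = 0.6, 0.4
	lexical := 0.5*tfidf + 0.5*bm25
	return wLexical*lexical + wSemantic*cosine(queryEmb, docEmb)
}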
Concurrency Model
The workerPool package implements a worker pool pattern for optimal performance:
- Workers: Configurable number of goroutines process tasks (WorkerCount).
- Task Queue: Manages tasks with a customizable capacity (TaskCount).
- Graceful Shutdown: Supports stopping via channels (Stop()).
- Resource Management: Prevents system overload using atomic counters.
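The workerPool internals are not reproduced here; below is a minimal sketch of the pattern described above, assuming a buffered task channel, a WaitGroup for shutdown, and an atomic counter for in-flight tasks.

package workerpool

import (
	"sync"
	"sync/atomic"
)

// Pool is an illustrative worker pool: WorkerCount goroutines consume
// tasks from a channel with TaskCount capacity.
type Pool struct {
	tasks   chan func()
	wg      sync.WaitGroup
	pending atomic.Int64 // tasks submitted but not yet finished
}

func New(workerCount, taskCount int) *Pool {
	p := &Pool{tasks: make(chan func(), taskCount)}
	for i := 0; i < workerCount; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for task := range p.tasks {
				task()
				p.pending.Add(-1)
			}
		}()
	}
	return p
}

// Submit enqueues a task; it blocks when the queue is full, which
// applies natural back-pressure to the crawler.
func (p *Pool) Submit(task func()) {
	p.pending.Add(1)
	p.tasks <- task
}

// Pending reports how many submitted tasks have not yet completed.
func (p *Pool) Pending() int64 { return p.pending.Load() }

// Stop closes the queue and waits for workers to drain it.
func (p *Pool) Stop() {
	close(p.tasks)
	p.wg.Wait()
}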
Configuration Options
Saturday is configured via a JSON request to the REST API. Example:
{
  "base_urls": ["https://example.com/"],
  "worker_count": 10,
  "task_count": 100,
  "max_links_in_page": 50,
  "max_depth_crawl": 3,
  "only_same_domain": true,
  "rate": 5
}
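For reference, a Go struct matching this JSON might look like the sketch below; the field names are inferred from the JSON keys and may differ from the project's actual CrawlRequest definition.

package srv // package location assumed

// CrawlRequest mirrors the JSON configuration shown above.
type CrawlRequest struct {
	BaseUrls       []string `json:"base_urls"`
	WorkerCount    int      `json:"worker_count"`
	TaskCount      int      `json:"task_count"`
	MaxLinksInPage int      `json:"max_links_in_page"`
	MaxDepthCrawl  int      `json:"max_depth_crawl"`
	OnlySameDomain bool     `json:"only_same_domain"`
	Rate           int      `json:"rate"`
}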
Key Parameters
- base_urls: Starting URLs for the search robot (required).
- worker_count: Number of concurrent workers.
- task_count: Task queue capacity.
- max_links_in_page: Maximum links extracted per page.
- max_depth_crawl: Maximum crawl depth.
- only_same_domain: Restricts crawling to the base URL domains.
- rate: Requests per second.
Usage Examples
REST Server Mode
Run Saturday as a REST server:
./saturday --srv-port=8080
API Endpoints Documentation
Authentication and Encryption
All endpoints except /public and /aes require encrypted request bodies using AES encryption.
1. Get Public Key
GET /public
Description: Retrieves the RSA public key for encrypting the AES key.
Response: PEM-encoded RSA public key
Content-Type: application/x-pem-file
Status Codes:
- 200: Success
- 500: Server error getting/encoding public key
2. Set AES Key
POST /aes
Description: Sends RSA-encrypted AES key for subsequent requests
Request Body: RSA-encrypted AES key (base64 encoded)
Status Codes:
- 200: Success
- 400: Invalid request body
- 500: Decryption failed
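A client-side sketch of this key exchange is shown below. It assumes RSA-OAEP padding, a PKIX-encoded public key, and a 32-byte AES key; none of these details are documented, so treat them as assumptions.

package client

import (
	"bytes"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/pem"
	"fmt"
	"io"
	"net/http"
)

// exchangeAESKey fetches the server's RSA public key from /public,
// encrypts a freshly generated AES key with it, and posts the result
// to /aes. OAEP padding and a 256-bit key are assumptions.
func exchangeAESKey(baseURL string) ([]byte, error) {
	resp, err := http.Get(baseURL + "/public")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	pemBytes, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	block, _ := pem.Decode(pemBytes)
	if block == nil {
		return nil, fmt.Errorf("no PEM block in /public response")
	}
	pub, err := x509.ParsePKIXPublicKey(block.Bytes)
	if err != nil {
		return nil, err
	}
	rsaPub, ok := pub.(*rsa.PublicKey)
	if !ok {
		return nil, fmt.Errorf("unexpected public key type %T", pub)
	}

	// Generate a random 32-byte AES key for subsequent requests.
	aesKey := make([]byte, 32)
	if _, err := rand.Read(aesKey); err != nil {
		return nil, err
	}

	encrypted, err := rsa.EncryptOAEP(sha256.New(), rand.Reader, rsaPub, aesKey, nil)
	if err != nil {
		return nil, err
	}
	body := base64.StdEncoding.EncodeToString(encrypted)

	// Content type is an assumption; the server only documents the body format.
	resp2, err := http.Post(baseURL+"/aes", "text/plain", bytes.NewBufferString(body))
	if err != nil {
		return nil, err
	}
	defer resp2.Body.Close()
	if resp2.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("/aes returned status %d", resp2.StatusCode)
	}
	return aesKey, nil
}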
Search Robot Control
3. Start Crawling
POST /crawl/start
Description: Initiates a new crawling job
Request Body:
{
  "base_urls": ["https://example.com"],
  "worker_count": 10,
  "task_count": 100,
  "max_links_in_page": 50,
  "max_depth_crawl": 3,
  "only_same_domain": true,
  "rate": 5
}
Response:
{
  "job_id": "uuid-string",
  "status": "started"
}
Status Codes:
- 200: Job started successfully
- 400: Invalid request parameters
- 500: Internal server error
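Because request bodies must be AES-encrypted, a client first marshals the JSON above and then encrypts it with the previously exchanged key. The sketch below assumes AES-GCM with the nonce prepended and a base64-encoded result; the project's actual cipher mode and framing may differ.

package client

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/base64"
	"net/http"
)

// encryptBody encrypts a plaintext request body with AES-GCM.
// Prepending the nonce and base64-encoding the result are assumptions
// about the wire format, not documented behaviour.
func encryptBody(aesKey, plaintext []byte) (string, error) {
	blockCipher, err := aes.NewCipher(aesKey)
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(blockCipher)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return "", err
	}
	sealed := gcm.Seal(nonce, nonce, plaintext, nil)
	return base64.StdEncoding.EncodeToString(sealed), nil
}

// startCrawl posts an encrypted CrawlRequest JSON document to /crawl/start.
func startCrawl(baseURL string, aesKey, crawlJSON []byte) (*http.Response, error) {
	body, err := encryptBody(aesKey, crawlJSON)
	if err != nil {
		return nil, err
	}
	return http.Post(baseURL+"/crawl/start", "application/json",
		bytes.NewBufferString(body))
}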
4. Stop Crawling
POST /crawl/stop
Description: Stops an active crawling job
Request Body:
{
  "job_id": "uuid-string"
}
Response:
{
  "status": "stopped"
}
Status Codes:
- 200: Job stopped successfully
- 400: Invalid job ID
- 404: Job not found
5. Get Crawling Status
GET /crawl/status?job_id=uuid-string
Description: Retrieves the current status of a crawling job
Query Parameters:
- job_id: UUID of the crawling job
Response:
{
  "status": "running",
  "pages_crawled": 150
}
Possible Status Values:
- "initializing": Job is starting up
- "running": Job is actively crawling
- "stopping": Job is gracefully shutting down
- "completed": Job finished successfully
- "failed": Job encountered an error
- "not_found": Job ID doesn't exist
Status Codes:
- 200: Status retrieved successfully
- 400: Missing or invalid job ID
- 404: Job not found
Search Operations
6. Search Indexed Content
POST /search
Description: Searches the indexed content using hybrid ranking
Request Body:
{
  "job_id": "uuid-string",
  "query": "search terms",
  "max_results": 10
}
Response:
[
  {
    "url": "https://example.com/page1",
    "description": "Page description with highlighted search terms"
  }
]
Request Body Fields:
- query: Search terms (required)
- max_results: Maximum number of results to return (default: 10)
- job_id: Optional ID to search within specific crawl results
Ranking Factors:
- Word frequency (TF-IDF)
- BM25 score
- Semantic similarity (BERT embeddings)
- Word inclusion count
Status Codes:
- 200: Search completed successfully
- 400: Invalid search parameters
- 500: Search operation failed
Response Encryption
All successful responses (except /public) are encrypted using the following format:
{
  "data": "base64-encoded-encrypted-response"
}
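A matching client-side decryption step might look like the sketch below. It assumes the same AES-GCM layout as in the request example (nonce prepended, base64 payload in the "data" field); this is an assumption, not documented behaviour.

package client

import (
	"crypto/aes"
	"crypto/cipher"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// encryptedResponse mirrors the {"data": "..."} envelope described above.
type encryptedResponse struct {
	Data string `json:"data"`
}

// decryptResponse unwraps and decrypts an encrypted response body.
// AES-GCM with a prepended nonce is assumed.
func decryptResponse(aesKey, body []byte) ([]byte, error) {
	var env encryptedResponse
	if err := json.Unmarshal(body, &env); err != nil {
		return nil, err
	}
	ciphertext, err := base64.StdEncoding.DecodeString(env.Data)
	if err != nil {
		return nil, err
	}
	blockCipher, err := aes.NewCipher(aesKey)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(blockCipher)
	if err != nil {
		return nil, err
	}
	if len(ciphertext) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext shorter than nonce")
	}
	nonce, sealed := ciphertext[:gcm.NonceSize()], ciphertext[gcm.NonceSize():]
	return gcm.Open(nil, nonce, sealed, nil)
}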
Error Responses
Error responses follow the standard HTTP status codes and include a plain text error message in the response body.
Technical Requirements
- Go: 1.23.3
- Dependencies:
github.com/dgraph-io/badger/v3 (BadgerDB)
github.com/google/uuid (UUID generation)
golang.org/x/net (HTML parsing)
- Standard library packages such as net/http and sync are also used.
License
Saturday is licensed under the MIT License (assumed based on typical Go projects).