Documentation
¶
Overview ¶
Package internal provides caching functionality for content extraction results. It implements a thread-safe LRU cache with TTL support to improve performance for repeated extractions of the same content.
Package internal provides centralized constant definitions for internal use.
Package internal provides character encoding detection and conversion functionality. It supports 15+ encodings including Unicode variants, Western European, and East Asian character sets, with intelligent auto-detection capabilities.
Package internal provides implementation details for the cybergodev/html library. It contains content extraction, table processing, and text manipulation functionality that is not part of the public API.
Package internal provides URL parsing and resolution utilities.
Index ¶
- func CalculateContentDensity(n *html.Node) float64
- func CleanContentNode(node *html.Node) *html.Node
- func CleanText(text string, whitespaceRegex *regexp.Regexp) string
- func ConvertToUTF8(data []byte, charset string) ([]byte, error)
- func CountChildElements(n *html.Node, tag string) int
- func CountTags(n *html.Node) int
- func DetectAndConvertToUTF8(data []byte) ([]byte, string, error)
- func DetectAndConvertToUTF8String(data []byte, forcedEncoding string) (string, string, error)
- func DetectAudioType(url string) string
- func DetectCharsetFromBytes(data []byte) string
- func DetectVideoType(url string) string
- func ExtractBaseFromURL(url string) string
- func ExtractDomain(url string) string
- func ExtractTextWithStructureAndImages(node *html.Node, sb *strings.Builder, _ int, imageCounter *int, ...)
- func FindElementByTag(doc *html.Node, tagName string) *html.Node
- func GetLinkDensity(node *html.Node) float64
- func GetTextContent(node *html.Node) string
- func GetTextLength(node *html.Node) int
- func IsBlockElement(tag string) bool
- func IsDifferentDomain(baseURL, targetURL string) bool
- func IsExternalURL(url string) bool
- func IsInlineElement(tag string) bool
- func IsNonContentElement(tag string) bool
- func IsValidURL(url string) bool
- func IsVideoURL(url string) bool
- func MatchesPattern(value string, patterns map[string]bool) bool
- func NormalizeBaseURL(baseURL string) string
- func RemoveTagContent(content, tag string) string
- func ReplaceHTMLEntities(text string) string
- func ResolveURL(baseURL, relativeURL string) string
- func SanitizeHTML(htmlContent string) string
- func ScoreAttributes(n *html.Node) int
- func ScoreContentNode(node *html.Node) int
- func SelectBestCandidate(candidates map[*html.Node]int) *html.Node
- func ShouldRemoveElement(n *html.Node) bool
- func WalkNodes(node *html.Node, fn func(*html.Node) bool)
- type Cache
- type EncodingDetector
- func (ed *EncodingDetector) DetectAndConvert(data []byte) ([]byte, string, error)
- func (ed *EncodingDetector) DetectCharset(data []byte) string
- func (ed *EncodingDetector) DetectCharsetBasic(data []byte) string
- func (ed *EncodingDetector) DetectCharsetSmart(data []byte) EncodingMatch
- func (ed *EncodingDetector) ToUTF8(data []byte, charset string) ([]byte, error)
- type EncodingMatch
Examples ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CalculateContentDensity ¶
CalculateContentDensity calculates text-to-tag ratio. This is the exported version that uses the internal calculateDensityFromMetrics.
func ConvertToUTF8 ¶ added in v1.2.0
ConvertToUTF8 is a convenience function that converts data to UTF-8
func CountChildElements ¶
CountChildElements counts child elements of specific tag type.
func DetectAndConvertToUTF8 ¶ added in v1.2.0
DetectAndConvertToUTF8 is a convenience function that detects charset and converts to UTF-8
func DetectAndConvertToUTF8String ¶ added in v1.2.0
DetectAndConvertToUTF8String detects encoding and converts to UTF-8 string. If forcedEncoding is not empty, it will use that encoding instead of auto-detection. Returns a UTF-8 string and the detected/used encoding.
func DetectAudioType ¶
DetectAudioType detects the audio MIME type from a URL
func DetectCharsetFromBytes ¶ added in v1.2.0
DetectCharsetFromBytes is a convenience function that detects charset from byte data
func DetectVideoType ¶
DetectVideoType detects the video MIME type from a URL
func ExtractBaseFromURL ¶ added in v1.2.0
ExtractBaseFromURL extracts the base URL (scheme://domain/) from a URL. Returns the base URL including trailing slash, or empty string for invalid URLs.
func ExtractDomain ¶ added in v1.2.0
ExtractDomain extracts the domain from a URL. Returns the domain portion (scheme://domain) or empty string for invalid URLs.
func GetLinkDensity ¶
func GetTextContent ¶
Example ¶
ExampleGetTextContent demonstrates the GetTextContent function with HTML entities.
html := `<p> © 2025 — All rights reserved </p>` doc, _ := stdxhtml.Parse(strings.NewReader(html)) result := GetTextContent(doc) fmt.Println(result)
Output: © 2025 — All rights reserved
func GetTextLength ¶
func IsBlockElement ¶
func IsDifferentDomain ¶ added in v1.2.0
IsDifferentDomain checks if two URLs have different domains. Returns false if either URL is not external.
func IsExternalURL ¶
IsExternalURL checks if a URL is an external HTTP(S) URL or protocol-relative URL.
func IsInlineElement ¶ added in v1.2.0
IsInlineElement returns true if the tag is a known inline element. Inline elements should not add newlines or paragraph spacing.
func IsNonContentElement ¶
func IsValidURL ¶ added in v1.2.0
IsValidURL checks if a URL is valid and safe for processing. This is a centralized URL validation function with size limits for security.
func IsVideoURL ¶
IsVideoURL checks if a URL is a video based on extension or embed pattern
func MatchesPattern ¶
MatchesPattern is the exported version of matchesPattern for testing purposes. It checks if value contains any pattern from the map with word boundaries.
func NormalizeBaseURL ¶ added in v1.2.0
NormalizeBaseURL ensures a base URL ends with a slash. Returns empty string for non-HTTP URLs (javascript:, data:, mailto:, etc.).
func RemoveTagContent ¶
RemoveTagContent removes all occurrences of the specified HTML tag and its content. This function uses string-based parsing as the primary method to handle edge cases like unclosed tags, malformed HTML, and to preserve original character case.
func ReplaceHTMLEntities ¶
ReplaceHTMLEntities replaces HTML entities with their corresponding characters. It handles both named entities (like &, ) and numeric entities (like A, A). For unknown entities, it falls back to the standard library's html.UnescapeString. Optimized with a fast path for the most common entities.
Example ¶
ExampleReplaceHTMLEntities demonstrates the ReplaceHTMLEntities function.
input := " © 2025 — Test €100" result := ReplaceHTMLEntities(input) fmt.Println(result)
Output: © 2025 — Test €100
func ResolveURL ¶ added in v1.2.0
ResolveURL resolves a relative URL against a base URL. Handles absolute URLs, protocol-relative URLs, absolute paths, and relative paths.
func SanitizeHTML ¶
func ScoreAttributes ¶
func ScoreContentNode ¶
ScoreContentNode calculates a relevance score for content extraction. Higher scores indicate more likely main content. Negative scores suggest non-content elements. This function has been optimized to reduce DOM traversals by combining multiple metrics.
func ShouldRemoveElement ¶
Types ¶
type EncodingDetector ¶ added in v1.2.0
type EncodingDetector struct {
// User-specified encoding override (optional)
ForcedEncoding string
// Smart detection options
EnableSmartDetection bool // Enable intelligent encoding detection
MaxSampleSize int // Max bytes to analyze for statistical detection
}
EncodingDetector handles charset detection and conversion
func NewEncodingDetector ¶ added in v1.2.0
func NewEncodingDetector() *EncodingDetector
NewEncodingDetector creates a new encoding detector with smart detection enabled
func (*EncodingDetector) DetectAndConvert ¶ added in v1.2.0
func (ed *EncodingDetector) DetectAndConvert(data []byte) ([]byte, string, error)
DetectAndConvert detects charset and converts to UTF-8 in one step
func (*EncodingDetector) DetectCharset ¶ added in v1.2.0
func (ed *EncodingDetector) DetectCharset(data []byte) string
DetectCharset attempts to detect the character encoding from HTML content
func (*EncodingDetector) DetectCharsetBasic ¶ added in v1.2.0
func (ed *EncodingDetector) DetectCharsetBasic(data []byte) string
DetectCharsetBasic performs basic charset detection (BOM, meta tags, UTF-8 validation)
func (*EncodingDetector) DetectCharsetSmart ¶ added in v1.2.0
func (ed *EncodingDetector) DetectCharsetSmart(data []byte) EncodingMatch
DetectCharsetSmart performs intelligent charset detection using statistical analysis