internal

package

v1.2.0 Published: Feb 7, 2026 License: MIT Imports: 18 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/cybergodev/html

Documentation ¶

Overview ¶

Package internal provides caching functionality for content extraction results. It implements a thread-safe LRU cache with TTL support to improve performance for repeated extractions of the same content.

Package internal provides centralized constant definitions for internal use.

Package internal provides character encoding detection and conversion functionality. It supports 15+ encodings including Unicode variants, Western European, and East Asian character sets, with intelligent auto-detection capabilities.

Package internal provides implementation details for the cybergodev/html library. It contains content extraction, table processing, and text manipulation functionality that is not part of the public API.

Package internal provides URL parsing and resolution utilities.

Index ¶

func CalculateContentDensity(n *html.Node) float64
func CleanContentNode(node *html.Node) *html.Node
func CleanText(text string, whitespaceRegex *regexp.Regexp) string
func ConvertToUTF8(data []byte, charset string) ([]byte, error)
func CountChildElements(n *html.Node, tag string) int
func CountTags(n *html.Node) int
func DetectAndConvertToUTF8(data []byte) ([]byte, string, error)
func DetectAndConvertToUTF8String(data []byte, forcedEncoding string) (string, string, error)
func DetectAudioType(url string) string
func DetectCharsetFromBytes(data []byte) string
func DetectVideoType(url string) string
func ExtractBaseFromURL(url string) string
func ExtractDomain(url string) string
func ExtractTextWithStructureAndImages(node *html.Node, sb *strings.Builder, _ int, imageCounter *int, ...)
func FindElementByTag(doc *html.Node, tagName string) *html.Node
func GetLinkDensity(node *html.Node) float64
func GetTextContent(node *html.Node) string
func GetTextLength(node *html.Node) int
func IsBlockElement(tag string) bool
func IsDifferentDomain(baseURL, targetURL string) bool
func IsExternalURL(url string) bool
func IsInlineElement(tag string) bool
func IsNonContentElement(tag string) bool
func IsValidURL(url string) bool
func IsVideoURL(url string) bool
func MatchesPattern(value string, patterns map[string]bool) bool
func NormalizeBaseURL(baseURL string) string
func RemoveTagContent(content, tag string) string
func ReplaceHTMLEntities(text string) string
func ResolveURL(baseURL, relativeURL string) string
func SanitizeHTML(htmlContent string) string
func ScoreAttributes(n *html.Node) int
func ScoreContentNode(node *html.Node) int
func SelectBestCandidate(candidates map[*html.Node]int) *html.Node
func ShouldRemoveElement(n *html.Node) bool
func WalkNodes(node *html.Node, fn func(*html.Node) bool)
type Cache
- func NewCache(maxEntries int, ttl time.Duration) *Cache
- func (c *Cache) Clear()
- func (c *Cache) Get(key string) any
- func (c *Cache) Set(key string, value any)
type EncodingDetector
- func NewEncodingDetector() *EncodingDetector
- func (ed *EncodingDetector) DetectAndConvert(data []byte) ([]byte, string, error)
- func (ed *EncodingDetector) DetectCharset(data []byte) string
- func (ed *EncodingDetector) DetectCharsetBasic(data []byte) string
- func (ed *EncodingDetector) DetectCharsetSmart(data []byte) EncodingMatch
- func (ed *EncodingDetector) ToUTF8(data []byte, charset string) ([]byte, error)
type EncodingMatch

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func CalculateContentDensity ¶

func CalculateContentDensity(n *html.Node) float64

CalculateContentDensity calculates text-to-tag ratio. This is the exported version that uses the internal calculateDensityFromMetrics.

func CleanContentNode ¶

func CleanContentNode(node *html.Node) *html.Node

func CleanText ¶

func CleanText(text string, whitespaceRegex *regexp.Regexp) string

func ConvertToUTF8 ¶ added in v1.2.0

func ConvertToUTF8(data []byte, charset string) ([]byte, error)

ConvertToUTF8 is a convenience function that converts data from the given charset to UTF-8.

func CountChildElements ¶

func CountChildElements(n *html.Node, tag string) int

CountChildElements counts child elements of specific tag type.

func CountTags ¶

func CountTags(n *html.Node) int

func DetectAndConvertToUTF8 ¶ added in v1.2.0

func DetectAndConvertToUTF8(data []byte) ([]byte, string, error)

DetectAndConvertToUTF8 is a convenience function that detects the charset and converts the data to UTF-8.

func DetectAndConvertToUTF8String ¶ added in v1.2.0

func DetectAndConvertToUTF8String(data []byte, forcedEncoding string) (string, string, error)

DetectAndConvertToUTF8String detects the encoding of data and converts it to a UTF-8 string. If forcedEncoding is non-empty, that encoding is used instead of auto-detection. It returns the UTF-8 string, the name of the encoding that was detected or forced, and an error.

func DetectAudioType ¶

func DetectAudioType(url string) string

DetectAudioType detects the audio MIME type from a URL.

func DetectCharsetFromBytes ¶ added in v1.2.0

func DetectCharsetFromBytes(data []byte) string

DetectCharsetFromBytes is a convenience function that detects the charset of byte data.

func DetectVideoType ¶

func DetectVideoType(url string) string

DetectVideoType detects the video MIME type from a URL
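
The idea of deriving a MIME type from a URL can be sketched with the standard library alone. This is an illustration, not the package's code: the function name videoType, the extension table, and the empty-string result for unknown extensions are all assumptions.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// videoTypes is a hypothetical extension table; the extensions the
// package actually recognizes are not documented here.
var videoTypes = map[string]string{
	".mp4":  "video/mp4",
	".webm": "video/webm",
	".ogv":  "video/ogg",
	".mov":  "video/quicktime",
}

// videoType is a hypothetical stand-in for DetectVideoType. It looks only
// at the extension of the URL path, ignoring query string and fragment.
func videoType(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return ""
	}
	return videoTypes[strings.ToLower(path.Ext(u.Path))]
}

func main() {
	fmt.Println(videoType("https://example.com/v/clip.MP4?t=10"))
	fmt.Printf("%q\n", videoType("https://example.com/page.html"))
}
```

Parsing the URL first keeps a query string such as ?t=10 from being mistaken for part of the extension.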

func ExtractBaseFromURL ¶ added in v1.2.0

func ExtractBaseFromURL(url string) string

ExtractBaseFromURL extracts the base URL (scheme://domain/) from a URL. Returns the base URL including trailing slash, or empty string for invalid URLs.
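
A minimal sketch of the documented behavior, built on net/url. The name extractBase is hypothetical, and treating any URL without a scheme or host as invalid is an assumption; the package may apply stricter rules.

```go
package main

import (
	"fmt"
	"net/url"
)

// extractBase is a hypothetical stand-in for ExtractBaseFromURL.
// It returns "scheme://host/" or "" when the input has no scheme or host.
func extractBase(raw string) string {
	u, err := url.Parse(raw)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return ""
	}
	return u.Scheme + "://" + u.Host + "/"
}

func main() {
	fmt.Println(extractBase("https://example.com/a/b?x=1"))
	fmt.Printf("%q\n", extractBase("/relative/path"))
}
```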

func ExtractDomain ¶ added in v1.2.0

func ExtractDomain(url string) string

ExtractDomain extracts the scheme and domain (scheme://domain) from a URL. Returns an empty string for invalid URLs.

func ExtractTextWithStructureAndImages ¶

func ExtractTextWithStructureAndImages(node *html.Node, sb *strings.Builder, _ int, imageCounter *int, tableFormat string)

func FindElementByTag ¶

func FindElementByTag(doc *html.Node, tagName string) *html.Node

func GetLinkDensity ¶

func GetLinkDensity(node *html.Node) float64

func GetTextContent ¶

func GetTextContent(node *html.Node) string

Example ¶

ExampleGetTextContent demonstrates the GetTextContent function with HTML entities.

html := `<p>&nbsp;&copy; 2025 &mdash; All rights reserved&nbsp;</p>`
doc, _ := stdxhtml.Parse(strings.NewReader(html))
result := GetTextContent(doc)
fmt.Println(result)

Output:

© 2025 — All rights reserved

func GetTextLength ¶

func GetTextLength(node *html.Node) int

func IsBlockElement ¶

func IsBlockElement(tag string) bool

func IsDifferentDomain ¶ added in v1.2.0

func IsDifferentDomain(baseURL, targetURL string) bool

IsDifferentDomain checks if two URLs have different domains. Returns false if either URL is not external.
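
The rule stated above (compare domains, but only when both URLs are external) might look like the following sketch. The names isExternal and isDifferentDomain are hypothetical, and comparing hostnames case-insensitively while ignoring ports is an assumption about what "domain" means here.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isExternal reports whether raw is an http(s) or protocol-relative URL.
func isExternal(raw string) bool {
	lower := strings.ToLower(raw)
	return strings.HasPrefix(lower, "http://") ||
		strings.HasPrefix(lower, "https://") ||
		strings.HasPrefix(raw, "//")
}

// isDifferentDomain is a hypothetical stand-in for IsDifferentDomain.
// It returns false unless both URLs are external and parse cleanly.
func isDifferentDomain(baseURL, targetURL string) bool {
	if !isExternal(baseURL) || !isExternal(targetURL) {
		return false
	}
	b, err := url.Parse(baseURL)
	if err != nil {
		return false
	}
	t, err := url.Parse(targetURL)
	if err != nil {
		return false
	}
	return !strings.EqualFold(b.Hostname(), t.Hostname())
}

func main() {
	fmt.Println(isDifferentDomain("https://example.com/a", "https://other.org/b"))
	fmt.Println(isDifferentDomain("https://example.com/a", "//example.com/b"))
	fmt.Println(isDifferentDomain("https://example.com/a", "/local/path"))
}
```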

func IsExternalURL ¶

func IsExternalURL(url string) bool

IsExternalURL checks if a URL is an external HTTP(S) URL or protocol-relative URL.
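
As a rough illustration, the check described above reduces to three prefix tests. The name isExternal is hypothetical, and matching the scheme case-insensitively is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// isExternal is a hypothetical stand-in for IsExternalURL: true for
// http://, https://, and protocol-relative (//host/...) URLs.
func isExternal(raw string) bool {
	lower := strings.ToLower(raw)
	return strings.HasPrefix(lower, "http://") ||
		strings.HasPrefix(lower, "https://") ||
		strings.HasPrefix(raw, "//")
}

func main() {
	for _, u := range []string{
		"https://example.com",
		"//cdn.example.com/app.js",
		"/images/a.png",
		"mailto:someone@example.com",
	} {
		fmt.Printf("%-30s %v\n", u, isExternal(u))
	}
}
```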

func IsInlineElement ¶ added in v1.2.0

func IsInlineElement(tag string) bool

IsInlineElement returns true if the tag is a known inline element. Inline elements should not add newlines or paragraph spacing.

func IsNonContentElement ¶

func IsNonContentElement(tag string) bool

func IsValidURL ¶ added in v1.2.0

func IsValidURL(url string) bool

IsValidURL checks if a URL is valid and safe for processing. This is a centralized URL validation function with size limits for security.

func IsVideoURL ¶

func IsVideoURL(url string) bool

IsVideoURL checks whether a URL points to a video, based on its file extension or a known embed pattern.

func MatchesPattern ¶

func MatchesPattern(value string, patterns map[string]bool) bool

MatchesPattern is the exported version of matchesPattern for testing purposes. It checks if value contains any pattern from the map with word boundaries.
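
One way to read "contains any pattern with word boundaries" is sketched below. The documentation does not define a word boundary, so this sketch assumes it means the match is not adjacent to an ASCII letter or digit; it also assumes map entries set to false are ignored. The names matches and isWordChar are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// isWordChar reports whether b is an ASCII letter or digit.
func isWordChar(b byte) bool {
	return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9')
}

// matches is a hypothetical stand-in for MatchesPattern. It returns true
// when any enabled pattern occurs in value with no word character
// directly before or after the occurrence.
func matches(value string, patterns map[string]bool) bool {
	for p, enabled := range patterns {
		if !enabled || p == "" {
			continue
		}
		for start := 0; start < len(value); {
			i := strings.Index(value[start:], p)
			if i < 0 {
				break
			}
			i += start
			end := i + len(p)
			leftOK := i == 0 || !isWordChar(value[i-1])
			rightOK := end == len(value) || !isWordChar(value[end])
			if leftOK && rightOK {
				return true
			}
			start = i + 1 // keep scanning for a later occurrence
		}
	}
	return false
}

func main() {
	patterns := map[string]bool{"sidebar": true, "ad": true}
	fmt.Println(matches("main-sidebar left", patterns))
	fmt.Println(matches("header shadow", patterns))
}
```

The boundary check is what keeps a short pattern such as "ad" from matching inside "header" or "shadow".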

func NormalizeBaseURL ¶ added in v1.2.0

func NormalizeBaseURL(baseURL string) string

NormalizeBaseURL ensures a base URL ends with a slash. Returns empty string for non-HTTP URLs (javascript:, data:, mailto:, etc.).
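
A minimal sketch of the two rules above. The name normalizeBase is hypothetical, and the scheme check is simplified to a case-insensitive prefix test, which may differ from the package's own validation.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeBase is a hypothetical stand-in for NormalizeBaseURL.
// Non-HTTP(S) inputs yield ""; everything else gets a trailing slash.
func normalizeBase(base string) string {
	lower := strings.ToLower(base)
	if !strings.HasPrefix(lower, "http://") && !strings.HasPrefix(lower, "https://") {
		return ""
	}
	if strings.HasSuffix(base, "/") {
		return base
	}
	return base + "/"
}

func main() {
	fmt.Println(normalizeBase("https://example.com"))
	fmt.Println(normalizeBase("https://example.com/docs/"))
	fmt.Printf("%q\n", normalizeBase("javascript:void(0)"))
}
```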

func RemoveTagContent ¶

func RemoveTagContent(content, tag string) string

RemoveTagContent removes all occurrences of the specified HTML tag and its content. This function uses string-based parsing as the primary method to handle edge cases like unclosed tags, malformed HTML, and to preserve original character case.

func ReplaceHTMLEntities ¶

func ReplaceHTMLEntities(text string) string

ReplaceHTMLEntities replaces HTML entities with their corresponding characters. It handles both named entities (like &amp;amp;, &amp;nbsp;) and numeric entities (like &amp;#65;, &amp;#x41;). For unknown entities, it falls back to the standard library's html.UnescapeString. Optimized with a fast path for the most common entities.
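
The fast-path-plus-fallback structure described above could be sketched as follows. This is not the package's implementation: the name unescapeEntities and the contents of the common table are assumptions, and the standard library's html.UnescapeString does the work for everything outside that table.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// common is a hypothetical fast-path table of frequently seen entities.
var common = map[string]string{
	"&amp;":  "&",
	"&lt;":   "<",
	"&gt;":   ">",
	"&quot;": "\"",
	"&nbsp;": "\u00a0",
}

// unescapeEntities is a hypothetical stand-in for ReplaceHTMLEntities.
// Each "&...;" run is looked up in common first and otherwise handed to
// html.UnescapeString. Text is decoded exactly once.
func unescapeEntities(text string) string {
	var sb strings.Builder
	for len(text) > 0 {
		amp := strings.IndexByte(text, '&')
		if amp < 0 {
			sb.WriteString(text)
			break
		}
		sb.WriteString(text[:amp])
		text = text[amp:]
		semi := strings.IndexByte(text, ';')
		if semi < 0 {
			// No terminator left: let the standard library decide.
			sb.WriteString(html.UnescapeString(text))
			break
		}
		ent := text[:semi+1]
		if r, ok := common[ent]; ok {
			sb.WriteString(r)
		} else {
			sb.WriteString(html.UnescapeString(ent))
		}
		text = text[semi+1:]
	}
	return sb.String()
}

func main() {
	fmt.Println(unescapeEntities("&copy; 2025 &mdash; Test &euro;100"))
	fmt.Println(unescapeEntities("&#65;&#x41; &amp;lt;"))
}
```

Processing one entity at a time means an input such as "&amp;lt;" decodes to the literal text "&lt;" rather than being unescaped twice.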

Example ¶

ExampleReplaceHTMLEntities demonstrates the ReplaceHTMLEntities function.

input := "&nbsp;&copy; 2025 &mdash; Test &euro;100"
result := ReplaceHTMLEntities(input)
fmt.Println(result)

Output:

© 2025 — Test €100

func ResolveURL ¶ added in v1.2.0

func ResolveURL(baseURL, relativeURL string) string

ResolveURL resolves a relative URL against a base URL. Handles absolute URLs, protocol-relative URLs, absolute paths, and relative paths.
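
All four cases listed above are covered by the standard library's URL.ResolveReference, so a sketch can be quite short. The name resolve is hypothetical, and returning the reference unchanged when parsing fails is an assumption about error handling.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolve is a hypothetical stand-in for ResolveURL built on net/url.
func resolve(baseURL, relativeURL string) string {
	base, err := url.Parse(baseURL)
	if err != nil {
		return relativeURL
	}
	ref, err := url.Parse(relativeURL)
	if err != nil {
		return relativeURL
	}
	return base.ResolveReference(ref).String()
}

func main() {
	base := "https://example.com/docs/guide/"
	for _, ref := range []string{
		"https://other.org/x",      // absolute URL
		"//cdn.example.com/app.js", // protocol-relative URL
		"/img/logo.png",            // absolute path
		"../intro.html",            // relative path
	} {
		fmt.Println(resolve(base, ref))
	}
}
```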

func SanitizeHTML ¶

func SanitizeHTML(htmlContent string) string

func ScoreAttributes ¶

func ScoreAttributes(n *html.Node) int

func ScoreContentNode ¶

func ScoreContentNode(node *html.Node) int

ScoreContentNode calculates a relevance score for content extraction. Higher scores indicate more likely main content. Negative scores suggest non-content elements. This function has been optimized to reduce DOM traversals by combining multiple metrics.

func SelectBestCandidate ¶

func SelectBestCandidate(candidates map[*html.Node]int) *html.Node

func ShouldRemoveElement ¶

func ShouldRemoveElement(n *html.Node) bool

func WalkNodes ¶

func WalkNodes(node *html.Node, fn func(*html.Node) bool)

Types ¶

type Cache ¶

type Cache struct {
	// contains filtered or unexported fields
}

func NewCache ¶

func NewCache(maxEntries int, ttl time.Duration) *Cache

func (*Cache) Clear ¶

func (c *Cache) Clear()

func (*Cache) Get ¶

func (c *Cache) Get(key string) any

func (*Cache) Set ¶

func (c *Cache) Set(key string, value any)

type EncodingDetector ¶ added in v1.2.0

type EncodingDetector struct {
	// User-specified encoding override (optional)
	ForcedEncoding string

	// Smart detection options
	EnableSmartDetection bool // Enable intelligent encoding detection
	MaxSampleSize        int  // Max bytes to analyze for statistical detection
}

EncodingDetector handles charset detection and conversion

func NewEncodingDetector ¶ added in v1.2.0

func NewEncodingDetector() *EncodingDetector

NewEncodingDetector creates a new encoding detector with smart detection enabled

func (*EncodingDetector) DetectAndConvert ¶ added in v1.2.0

func (ed *EncodingDetector) DetectAndConvert(data []byte) ([]byte, string, error)

DetectAndConvert detects charset and converts to UTF-8 in one step

func (*EncodingDetector) DetectCharset ¶ added in v1.2.0

func (ed *EncodingDetector) DetectCharset(data []byte) string

DetectCharset attempts to detect the character encoding from HTML content

func (*EncodingDetector) DetectCharsetBasic ¶ added in v1.2.0

func (ed *EncodingDetector) DetectCharsetBasic(data []byte) string

DetectCharsetBasic performs basic charset detection (BOM, meta tags, UTF-8 validation)
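
The three checks named above can be sketched with the standard library. Everything here is an assumption about how such a detector might work rather than a description of this method: the name detectBasic, the order of the checks, the 1024-byte scan window for meta tags, the lowercase charset names, and the empty-string result when nothing matches.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"strings"
	"unicode/utf8"
)

// metaCharset matches charset=... inside a <meta> tag, which covers both
// <meta charset="x"> and the http-equiv Content-Type form.
var metaCharset = regexp.MustCompile(`(?i)<meta[^>]*charset\s*=\s*["']?([a-z0-9_-]+)`)

// detectBasic is a hypothetical stand-in for DetectCharsetBasic.
func detectBasic(data []byte) string {
	// 1. Byte order marks. UTF-32 BOMs are ignored in this sketch.
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8"
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}):
		return "utf-16le"
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}):
		return "utf-16be"
	}
	// 2. A charset declared in a <meta> tag near the start of the document.
	head := data
	if len(head) > 1024 {
		head = head[:1024]
	}
	if m := metaCharset.FindSubmatch(head); m != nil {
		return strings.ToLower(string(m[1]))
	}
	// 3. Fall back to checking whether the bytes are valid UTF-8.
	if utf8.Valid(data) {
		return "utf-8"
	}
	return ""
}

func main() {
	fmt.Println(detectBasic([]byte("\xEF\xBB\xBFhello")))
	fmt.Println(detectBasic([]byte(`<meta http-equiv="Content-Type" content="text/html; charset=GB2312">`)))
	fmt.Printf("%q\n", detectBasic([]byte{0xE9, 0x20, 0x41}))
}
```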

func (*EncodingDetector) DetectCharsetSmart ¶ added in v1.2.0

func (ed *EncodingDetector) DetectCharsetSmart(data []byte) EncodingMatch

DetectCharsetSmart performs intelligent charset detection using statistical analysis

func (*EncodingDetector) ToUTF8 ¶ added in v1.2.0

func (ed *EncodingDetector) ToUTF8(data []byte, charset string) ([]byte, error)

ToUTF8 converts the given data from the charset named by the charset argument to UTF-8.

type EncodingMatch ¶ added in v1.2.0

type EncodingMatch struct {
	Charset    string
	Confidence int  // 0-100
	Score      int  // Detailed score
	Valid      bool // Whether decoding produced valid UTF-8
}

EncodingMatch represents a detected encoding with confidence score
