doc2text

v0.3.4
Published: Feb 12, 2026 License: GPL-3.0 Imports: 12 Imported by: 0

README

doc2text

performant tool for converting .doc/.docx files to raw text (.docx files can also be converted to json, though the result is rough)

  • only one dependency (CGO, credits grobian/antiword)
  • portable, NO external dependencies required
  • OLE2 support

usage

  1. run in your console (go 1.25+)
    • go install github.com/pyrorhythm/doc2text/cmd/doc2text@latest
  2. or download a prebuilt binary from releases
  3. or build it yourself:
    • run in your console: just
    • requires just, go, and gcc to be installed (clang also works, but needs a change in the makefile)

contributing

do what the fuck you want to; PRs will eventually be reviewed by me

Documentation

Functions

func ExtractRawTextFromDoc

func ExtractRawTextFromDoc(docPath string) (string, error)

ExtractRawTextFromDoc extracts plain text from a .doc file

func ExtractRawTextFromDocx

func ExtractRawTextFromDocx(docxPath string) (string, error)

ExtractRawTextFromDocx extracts plain text from a DOCX file

func ExtractTextFromDoc

func ExtractTextFromDoc(docPath string) (string, error)

ExtractTextFromDoc extracts plain text from a .doc file

func ExtractTextFromDocCGO

func ExtractTextFromDocCGO(docPath string) (string, error)

func IsDocFile

func IsDocFile(filename string) bool

IsDocFile checks if a file has a .doc extension

func IsDocxFile

func IsDocxFile(filename string) bool

IsDocxFile checks if a file has a .docx extension
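
The extension checks and the two raw-text extractors above can be combined into a small dispatcher. The following is a minimal sketch, not the package's own code: `hasExt` and `extractByExtension` are hypothetical helpers, and the extractors are passed in as function values so the sketch runs without the package or real files. Whether `IsDocFile` and `IsDocxFile` compare extensions case-insensitively is not stated on this page; the sketch assumes they might.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// extractor matches the signature shared by ExtractRawTextFromDoc and
// ExtractRawTextFromDocx: func(path string) (string, error).
type extractor func(path string) (string, error)

// hasExt is a hypothetical stand-in for IsDocFile / IsDocxFile.
// Assumption: extensions are compared case-insensitively.
func hasExt(filename, ext string) bool {
	return strings.EqualFold(filepath.Ext(filename), ext)
}

// extractByExtension routes a path to the extractor matching its extension.
func extractByExtension(path string, fromDoc, fromDocx extractor) (string, error) {
	switch {
	case hasExt(path, ".docx"):
		return fromDocx(path)
	case hasExt(path, ".doc"):
		return fromDoc(path)
	default:
		return "", errors.New("unsupported file type: " + filepath.Ext(path))
	}
}

func main() {
	// Stub extractors keep the sketch runnable on its own.
	fromDoc := func(p string) (string, error) { return "doc text from " + p, nil }
	fromDocx := func(p string) (string, error) { return "docx text from " + p, nil }

	for _, p := range []string{"a.doc", "b.DOCX", "c.pdf"} {
		text, err := extractByExtension(p, fromDoc, fromDocx)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(text)
	}
}
```

In real use, the two stubs would presumably be replaced by `ExtractRawTextFromDoc` and `ExtractRawTextFromDocx` from this package, since both have the same `func(string) (string, error)` shape.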

Types

type AbstractNum

type AbstractNum struct {
	AbstractNumID int              `xml:"abstractNumId,attr"`
	Levels        []NumberingLevel `xml:"lvl"`
}

AbstractNum represents an abstract numbering definition

type AbstractNumIdVal

type AbstractNumIdVal struct {
	Val int `xml:"val,attr"`
}

AbstractNumIdVal represents the value of an abstractNumId element

type ContentElement

type ContentElement struct {
	Type    string     `json:"type"`
	Level   int        `json:"level,omitempty"`
	Text    string     `json:"text,omitempty"`
	Rows    [][]string `json:"rows,omitempty"`
	Items   []string   `json:"items,omitempty"`
	Ordered bool       `json:"ordered,omitempty"`
}

type Document

type Document struct {
	Paragraphs []Paragraph `xml:"body>p"`
	Tables     []Table     `xml:"body>tbl"`
}

type DocumentOutput

type DocumentOutput struct {
	Metadata Metadata         `json:"metadata"`
	Content  []ContentElement `json:"content"`
}

func ExtractFromDoc

func ExtractFromDoc(docPath string) (*DocumentOutput, error)

ExtractFromDoc extracts content from a .doc file and returns it in the same format as DOCX

func ExtractFromDocx

func ExtractFromDocx(docxPath string) (*DocumentOutput, error)
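
To make the JSON shape implied by the struct tags concrete, here is a minimal sketch that re-declares `DocumentOutput`, `Metadata`, and `ContentElement` locally (copied from this page) and marshals a hand-built value. The `Type` strings ("heading", "list", "table") and the timestamp format are assumptions for illustration; this page does not document which values the extractors actually emit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the types documented on this page.
type Metadata struct {
	Source      string `json:"source"`
	ExtractedAt string `json:"extracted_at"`
}

type ContentElement struct {
	Type    string     `json:"type"`
	Level   int        `json:"level,omitempty"`
	Text    string     `json:"text,omitempty"`
	Rows    [][]string `json:"rows,omitempty"`
	Items   []string   `json:"items,omitempty"`
	Ordered bool       `json:"ordered,omitempty"`
}

type DocumentOutput struct {
	Metadata Metadata         `json:"metadata"`
	Content  []ContentElement `json:"content"`
}

// sample builds a hand-written value; the Type strings are assumed.
func sample() DocumentOutput {
	return DocumentOutput{
		Metadata: Metadata{Source: "report.docx", ExtractedAt: "2026-02-12T00:00:00Z"},
		Content: []ContentElement{
			{Type: "heading", Level: 1, Text: "Summary"},
			{Type: "list", Items: []string{"first", "second"}, Ordered: true},
			{Type: "table", Rows: [][]string{{"a", "b"}}},
		},
	}
}

// encode returns the compact JSON form of out.
func encode(out DocumentOutput) (string, error) {
	b, err := json.Marshal(out)
	return string(b), err
}

func main() {
	b, err := json.MarshalIndent(sample(), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Because every field except `type` carries `omitempty`, each element serializes only the fields that are set. One consequence is that an unordered list and an ordered list differ only by the presence of `"ordered": true`.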

type IlvlVal

type IlvlVal struct {
	Val int `xml:"val,attr"`
}

type JcVal

type JcVal struct {
	Val string `xml:"val,attr"`
}

type Metadata

type Metadata struct {
	Source      string `json:"source"`
	ExtractedAt string `json:"extracted_at"`
}

type Num

type Num struct {
	NumID         int              `xml:"numId,attr"`
	AbstractNumId AbstractNumIdVal `xml:"abstractNumId"`
}

Num maps a numId to an abstractNumId

type NumFmtVal

type NumFmtVal struct {
	Val string `xml:"val,attr"`
}

NumFmtVal represents the value of a numFmt element

type NumIdVal

type NumIdVal struct {
	Val int `xml:"val,attr"`
}

type NumPr

type NumPr struct {
	Ilvl  IlvlVal  `xml:"ilvl"`
	NumId NumIdVal `xml:"numId"`
}

type Numbering

type Numbering struct {
	AbstractNums []AbstractNum `xml:"abstractNum"`
	Nums         []Num         `xml:"num"`
}

Numbering contains all numbering definitions

func (*Numbering) GetNumberFormat

func (n *Numbering) GetNumberFormat(numID int, level int) string

GetNumberFormat returns the number format for a given numId and level

func (*Numbering) IsOrdered

func (n *Numbering) IsOrdered(numID int) bool

IsOrdered returns true if the list is ordered (numbered), false for bullets

type NumberingLevel

type NumberingLevel struct {
	Level     int       `xml:"ilvl,attr"`
	NumFormat NumFmtVal `xml:"numFmt"` // "decimal", "bullet", "lowerLetter", etc.
}

NumberingLevel represents a level in a numbering definition
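
The numbering types above mirror the two-step indirection in a .docx `numbering.xml`: a `numId` points to an `abstractNumId`, which holds the per-level formats. The sketch below re-declares the types locally, decodes a hand-written fragment with `encoding/xml`, and resolves a format. `lookupFormat` and `isOrdered` are hypothetical re-implementations; the actual logic of `GetNumberFormat` and `IsOrdered` is not shown on this page and may differ (for example in how missing levels are handled).

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Local copies of the numbering types documented on this page.
type NumFmtVal struct {
	Val string `xml:"val,attr"`
}
type NumberingLevel struct {
	Level     int       `xml:"ilvl,attr"`
	NumFormat NumFmtVal `xml:"numFmt"`
}
type AbstractNum struct {
	AbstractNumID int              `xml:"abstractNumId,attr"`
	Levels        []NumberingLevel `xml:"lvl"`
}
type AbstractNumIdVal struct {
	Val int `xml:"val,attr"`
}
type Num struct {
	NumID         int              `xml:"numId,attr"`
	AbstractNumId AbstractNumIdVal `xml:"abstractNumId"`
}
type Numbering struct {
	AbstractNums []AbstractNum `xml:"abstractNum"`
	Nums         []Num         `xml:"num"`
}

// sampleNumbering is a hand-written fragment in the style of numbering.xml.
const sampleNumbering = `<w:numbering xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0"><w:numFmt w:val="decimal"/></w:lvl>
    <w:lvl w:ilvl="1"><w:numFmt w:val="lowerLetter"/></w:lvl>
  </w:abstractNum>
  <w:abstractNum w:abstractNumId="1">
    <w:lvl w:ilvl="0"><w:numFmt w:val="bullet"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
  <w:num w:numId="2"><w:abstractNumId w:val="1"/></w:num>
</w:numbering>`

func parseNumbering(data string) (*Numbering, error) {
	var n Numbering
	if err := xml.Unmarshal([]byte(data), &n); err != nil {
		return nil, err
	}
	return &n, nil
}

// lookupFormat resolves numId -> abstractNumId -> level format.
// Hypothetical: returns "" when nothing matches.
func lookupFormat(n *Numbering, numID, level int) string {
	for _, num := range n.Nums {
		if num.NumID != numID {
			continue
		}
		for _, abs := range n.AbstractNums {
			if abs.AbstractNumID != num.AbstractNumId.Val {
				continue
			}
			for _, lvl := range abs.Levels {
				if lvl.Level == level {
					return lvl.NumFormat.Val
				}
			}
		}
	}
	return ""
}

// isOrdered treats any known, non-bullet level-0 format as ordered (an assumption).
func isOrdered(n *Numbering, numID int) bool {
	f := lookupFormat(n, numID, 0)
	return f != "" && f != "bullet"
}

func main() {
	n, err := parseNumbering(sampleNumbering)
	if err != nil {
		panic(err)
	}
	fmt.Println(lookupFormat(n, 1, 1), isOrdered(n, 1), isOrdered(n, 2))
}
```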

type PStyleVal

type PStyleVal struct {
	Val string `xml:"val,attr"`
}

type Paragraph

type Paragraph struct {
	Properties ParagraphProperties `xml:"pPr"`
	Runs       []Run               `xml:"r"`
}

type ParagraphProperties

type ParagraphProperties struct {
	PStyle PStyleVal `xml:"pStyle"`
	Jc     JcVal     `xml:"jc"`
	NumPr  NumPr     `xml:"numPr"`
}
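
As a rough illustration of how `ParagraphProperties` and its value wrappers map onto WordprocessingML, the sketch below decodes a hand-written `<w:pPr>` element into locally re-declared copies of the types. `isListItem` is a hypothetical helper: it assumes a `numId` of 0 (which is also what an absent `numPr` decodes to) means the paragraph is not part of a list.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Local copies of the paragraph-property types documented on this page.
type PStyleVal struct {
	Val string `xml:"val,attr"`
}
type JcVal struct {
	Val string `xml:"val,attr"`
}
type IlvlVal struct {
	Val int `xml:"val,attr"`
}
type NumIdVal struct {
	Val int `xml:"val,attr"`
}
type NumPr struct {
	Ilvl  IlvlVal  `xml:"ilvl"`
	NumId NumIdVal `xml:"numId"`
}
type ParagraphProperties struct {
	PStyle PStyleVal `xml:"pStyle"`
	Jc     JcVal     `xml:"jc"`
	NumPr  NumPr     `xml:"numPr"`
}

// samplePPr is a hand-written paragraph-properties element.
const samplePPr = `<w:pPr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:pStyle w:val="Heading2"/>
  <w:jc w:val="center"/>
  <w:numPr><w:ilvl w:val="1"/><w:numId w:val="3"/></w:numPr>
</w:pPr>`

func parseProps(data string) (ParagraphProperties, error) {
	var pp ParagraphProperties
	err := xml.Unmarshal([]byte(data), &pp)
	return pp, err
}

// isListItem is hypothetical: a zero numId is taken to mean "no numbering".
func isListItem(pp ParagraphProperties) bool {
	return pp.NumPr.NumId.Val != 0
}

func main() {
	pp, err := parseProps(samplePPr)
	if err != nil {
		panic(err)
	}
	fmt.Printf("style=%s align=%s list=%t level=%d\n",
		pp.PStyle.Val, pp.Jc.Val, isListItem(pp), pp.NumPr.Ilvl.Val)
}
```

Since the fields are plain structs rather than pointers, a missing element and an element with a zero value are indistinguishable after decoding; code that needs that distinction would have to use its own pointer-typed variants.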

type Run

type Run struct {
	Properties RunProperties `xml:"rPr"`
	Texts      []Text        `xml:"t"`
}

type RunProperties

type RunProperties struct {
	Bold      bool   `xml:"b"`
	Italic    bool   `xml:"i"`
	FontSize  string `xml:"sz"`
	FontColor string `xml:"color"`
}

type Table

type Table struct {
	Rows []TableRow `xml:"tr"`
}

type TableCell

type TableCell struct {
	Text string `xml:"p>r>t"`
}

type TableRow

type TableRow struct {
	Cells []TableCell `xml:"tc"`
}

type Text

type Text struct {
	Text string `xml:",chardata"`
}
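
The body types above can be exercised with `encoding/xml` directly. The sketch below uses trimmed local copies (the `Properties` fields of `Paragraph` and `Run` are left out for brevity) and a hand-written `document.xml` fragment; `paragraphText` is a hypothetical helper, and none of this is necessarily how `ExtractFromDocx` works internally.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Trimmed local copies of the types documented on this page.
type Text struct {
	Text string `xml:",chardata"`
}
type Run struct {
	Texts []Text `xml:"t"`
}
type Paragraph struct {
	Runs []Run `xml:"r"`
}
type TableCell struct {
	Text string `xml:"p>r>t"`
}
type TableRow struct {
	Cells []TableCell `xml:"tc"`
}
type Table struct {
	Rows []TableRow `xml:"tr"`
}
type Document struct {
	Paragraphs []Paragraph `xml:"body>p"`
	Tables     []Table     `xml:"body>tbl"`
}

// sampleDocument is a hand-written fragment in the style of word/document.xml.
const sampleDocument = `<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>A1</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>B1</w:t></w:r></w:p></w:tc>
      </w:tr>
    </w:tbl>
  </w:body>
</w:document>`

func parseDocument(data string) (*Document, error) {
	var d Document
	if err := xml.Unmarshal([]byte(data), &d); err != nil {
		return nil, err
	}
	return &d, nil
}

// paragraphText joins the text of every run in a paragraph.
func paragraphText(p Paragraph) string {
	var sb strings.Builder
	for _, r := range p.Runs {
		for _, t := range r.Texts {
			sb.WriteString(t.Text)
		}
	}
	return sb.String()
}

func main() {
	d, err := parseDocument(sampleDocument)
	if err != nil {
		panic(err)
	}
	for _, p := range d.Paragraphs {
		fmt.Println(paragraphText(p))
	}
	for _, row := range d.Tables[0].Rows {
		for _, c := range row.Cells {
			fmt.Print(c.Text, " ")
		}
		fmt.Println()
	}
}
```

Two properties of this layout are worth keeping in mind when decoding into these structs alone: `body>p` matches only paragraphs that are direct children of the body, so table-cell paragraphs do not appear in `Paragraphs`, and because paragraphs and tables land in separate slices, their relative order in the document is not preserved.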

Directories

Path: cmd/doc2text (command)
