go-tokenizer

A general-purpose tokenizer and Markdown parser with HTML rendering for Go.

Features

Lexical Scanner: Tokenizes text into identifiers, numbers, strings, operators, and punctuation
Markdown Parser: Converts Markdown text into an Abstract Syntax Tree (AST)
HTML Renderer: Renders Markdown AST to HTML with proper escaping
Configurable: Optional features like comment parsing, newline handling, and float parsing

Installation

go get github.com/mutablelogic/go-tokenizer

Requires Go 1.23 or later.

Quick Start

Tokenizing Text

package main

import (
    "fmt"
    "strings"
    
    "github.com/mutablelogic/go-tokenizer"
)

func main() {
    scanner := tokenizer.NewScanner(strings.NewReader("hello world 123"), tokenizer.Pos{})
    for {
        tok := scanner.Next()
        if tok.Kind == tokenizer.EOF {
            break
        }
        fmt.Printf("%s: %q\n", tok.Kind, tok.Value)
    }
}

Output:

Ident: "hello"
Space: " "
Ident: "world"
Space: " "
NumberInteger: "123"

Parsing Markdown

package main

import (
    "fmt"
    "strings"
    
    "github.com/mutablelogic/go-tokenizer"
    "github.com/mutablelogic/go-tokenizer/pkg/markdown"
    "github.com/mutablelogic/go-tokenizer/pkg/markdown/html"
)

func main() {
    input := `# Hello World

This is **bold** and _italic_ text.

- Item 1
- Item 2
- Item 3
`
    doc := markdown.Parse(strings.NewReader(input), tokenizer.Pos{})
    output := html.RenderString(doc)
    fmt.Println(output)
}

Output:

<h1>Hello World</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p><ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>

Packages

`tokenizer` (root package)

The lexical scanner that breaks input text into tokens.

Token Types:

Ident - Identifiers (hello, world)
NumberInteger, NumberFloat, NumberHex, NumberOctal, NumberBinary - Numbers
String, QuotedString - String literals
Hash, Asterisk, Underscore, Backtick, Tilde - Special characters
Space, Newline - Whitespace
Comment - Comments (when enabled)
And more...

Scanner Features:

// Enable features with bitwise OR
scanner := tokenizer.NewScanner(r, pos, 
    tokenizer.HashComment |      // # style comments
    tokenizer.LineComment |      // // style comments  
    tokenizer.BlockComment |     // /* */ style comments
    tokenizer.NewlineToken |     // Emit newlines as separate tokens
    tokenizer.UnderscoreToken |  // Emit underscores as separate tokens
    tokenizer.NumberFloatToken,  // Parse floating point numbers
)
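
The option list above combines feature flags with bitwise OR. As a rough, standalone illustration of that pattern (not the library's source), the sketch below defines a hypothetical `Feature` type with a hypothetical `Has` method. The constant names mirror the ones shown above, but their actual types and values in `tokenizer` are assumptions here.

```go
package main

import "fmt"

// Feature is a hypothetical bit-flag type used only for this sketch;
// the real tokenizer package may define its options differently.
type Feature uint

const (
	HashComment  Feature = 1 << iota // # style comments
	LineComment                      // // style comments
	BlockComment                     // /* */ style comments
	NewlineToken                     // emit newlines as separate tokens
)

// Has reports whether every bit in flag is also set in f.
func (f Feature) Has(flag Feature) bool {
	return f&flag == flag
}

func main() {
	// Combine two independent options into a single value.
	features := HashComment | NewlineToken

	fmt.Println(features.Has(HashComment))  // true
	fmt.Println(features.Has(LineComment))  // false
	fmt.Println(features.Has(NewlineToken)) // true
}
```

Because each constant occupies its own bit, any subset of options fits in one value and can be tested independently.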

`pkg/ast`

Defines the AST node types and tree traversal.

// Node interface
type Node interface {
    Kind() Kind
    Children() []Node
}

// Walk the AST
ast.Walk(doc, func(node ast.Node, depth int) error {
    fmt.Printf("%s%s\n", strings.Repeat("  ", depth), node.Kind())
    return nil
})
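
To make the traversal concrete without depending on the library, here is a minimal sketch of a depth-first, pre-order walk over the `Node` interface shown above. The `node` struct, the `walk` and `outline` helpers, and the choice of `string` as the underlying type of `Kind` are all assumptions for illustration; `ast.Walk` itself may be implemented differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind is assumed to be string-like for this sketch.
type Kind string

// Node matches the interface shape shown in the README.
type Node interface {
	Kind() Kind
	Children() []Node
}

// node is a hypothetical concrete implementation of Node.
type node struct {
	kind     Kind
	children []Node
}

func (n *node) Kind() Kind       { return n.kind }
func (n *node) Children() []Node { return n.children }

// walk visits n before its children (pre-order) and stops at the first error.
func walk(n Node, depth int, fn func(Node, int) error) error {
	if err := fn(n, depth); err != nil {
		return err
	}
	for _, child := range n.Children() {
		if err := walk(child, depth+1, fn); err != nil {
			return err
		}
	}
	return nil
}

// outline returns one line per node, indented two spaces per level.
func outline(root Node) string {
	var b strings.Builder
	_ = walk(root, 0, func(n Node, depth int) error {
		b.WriteString(strings.Repeat("  ", depth))
		b.WriteString(string(n.Kind()))
		b.WriteString("\n")
		return nil
	})
	return b.String()
}

func main() {
	doc := &node{kind: "Document", children: []Node{
		&node{kind: "Heading", children: []Node{&node{kind: "Text"}}},
		&node{kind: "Paragraph", children: []Node{
			&node{kind: "Text"},
			&node{kind: "Strong"},
		}},
	}}
	fmt.Print(outline(doc))
}
```

Returning an error from the callback gives callers a way to stop the traversal early, for example once a particular node kind has been found.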

`pkg/markdown`

Parses Markdown text into an AST.

Supported Syntax:

Headings: # H1 through ###### H6
Paragraphs: Text separated by blank lines
Emphasis: _italic_ or *italic*
Strong: __bold__ or **bold**
Strikethrough: ~~deleted~~
Inline code: `code`
Code blocks: ```language ... ```
Links: [text](url) or <url>
Images: ![alt](url)
Blockquotes: > quoted text
Unordered lists: - item, * item, or + item
Ordered lists: 1. item or 1) item
Horizontal rules: ---, ***, or ___
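
As an illustration of how one of these constructs can be recognised, the sketch below classifies a line as a `#`-style heading using a simplified rule: one to six `#` characters followed by a space or the end of the line. The `headingLevel` function is hypothetical and is not taken from `pkg/markdown`, whose parser may handle more cases (such as leading indentation or trailing `#` characters).

```go
package main

import (
	"fmt"
	"strings"
)

// headingLevel is a hypothetical helper. It returns the heading level (1-6),
// the trimmed heading text, and whether the line is a heading at all.
func headingLevel(line string) (int, string, bool) {
	// Count the run of leading '#' characters.
	level := 0
	for level < len(line) && line[level] == '#' {
		level++
	}
	if level == 0 || level > 6 {
		return 0, "", false
	}
	// The marker must be followed by a space or by the end of the line.
	rest := line[level:]
	if rest != "" && rest[0] != ' ' {
		return 0, "", false
	}
	return level, strings.TrimSpace(rest), true
}

func main() {
	for _, line := range []string{"# Hello World", "### Details", "#hashtag", "####### too deep"} {
		level, text, ok := headingLevel(line)
		fmt.Printf("%q -> level=%d text=%q heading=%v\n", line, level, text, ok)
	}
}
```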

`pkg/markdown/html`

Renders Markdown AST to HTML.

// Render to string
output := html.RenderString(doc)

// Render to io.Writer with indentation
renderer := html.NewRenderer(w).WithIndent(true)
err := renderer.Render(doc)

Features:

HTML escaping of rendered text, which helps prevent XSS
Optional indented output for readability
Language classes on code blocks: <code class="language-go">
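
The escaping and language-class behaviour listed above can be sketched with Go's standard library `html` package (not this project's `pkg/markdown/html`). The `renderCodeBlock` helper is hypothetical and only approximates the kind of output described; the renderer's real implementation is not shown in this README.

```go
package main

import (
	"fmt"
	"html"
)

// renderCodeBlock is a hypothetical helper: it escapes the source text and,
// when a language is given, adds a language-<name> class to the <code> tag.
func renderCodeBlock(lang, source string) string {
	class := ""
	if lang != "" {
		class = fmt.Sprintf(` class="language-%s"`, html.EscapeString(lang))
	}
	return fmt.Sprintf("<pre><code%s>%s</code></pre>", class, html.EscapeString(source))
}

func main() {
	fmt.Println(renderCodeBlock("go", "if a < b && ok { run() }"))
	// <pre><code class="language-go">if a &lt; b &amp;&amp; ok { run() }</code></pre>

	fmt.Println(renderCodeBlock("", "<script>alert(1)</script>"))
	// <pre><code>&lt;script&gt;alert(1)&lt;/script&gt;</code></pre>
}
```

Escaping `<`, `>`, and `&` before writing text into the output is what keeps user-supplied Markdown from injecting live markup.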

AST Node Types

Kind	Description	HTML Output
`Document`	Root node	(container)
`Paragraph`	Text block	`<p>...</p>`
`Heading`	H1-H6	`<h1>...</h1>`
`Text`	Plain text	(escaped text)
`Emphasis`	Italic	`<em>...</em>`
`Strong`	Bold	`<strong>...</strong>`
`Strikethrough`	Deleted	`<del>...</del>`
`Code`	Inline code	`<code>...</code>`
`CodeBlock`	Fenced code	`<pre><code>...</code></pre>`
`Link`	Hyperlink	`<a href="...">...</a>`
`Image`	Image	`<img src="..." alt="..."/>`
`Blockquote`	Quote	`<blockquote>...</blockquote>`
`List`	Ordered/Unordered	`<ol>...</ol>` or `<ul>...</ul>`
`ListItem`	List item	`<li>...</li>`
`HorizontalRule`	Divider	`<hr/>`

License

Apache 2.0 - see LICENSE for details.
