go-tokenizer

A general-purpose tokenizer and Markdown parser with HTML rendering for Go.

Features

  • Lexical Scanner: Tokenizes text into identifiers, numbers, strings, operators, and punctuation
  • Markdown Parser: Converts Markdown text into an Abstract Syntax Tree (AST)
  • HTML Renderer: Renders Markdown AST to HTML with proper escaping
  • Configurable: Optional features like comment parsing, newline handling, and float parsing

Installation

go get github.com/mutablelogic/go-tokenizer

Requires Go 1.23 or later.

Quick Start

Tokenizing Text

package main

import (
    "fmt"
    "strings"
    
    "github.com/mutablelogic/go-tokenizer"
)

func main() {
    scanner := tokenizer.NewScanner(strings.NewReader("hello world 123"), tokenizer.Pos{})
    for {
        tok := scanner.Next()
        if tok.Kind == tokenizer.EOF {
            break
        }
        fmt.Printf("%s: %q\n", tok.Kind, tok.Value)
    }
}

Output:

Ident: "hello"
Space: " "
Ident: "world"
Space: " "
NumberInteger: "123"

Parsing Markdown

package main

import (
    "fmt"
    "strings"
    
    "github.com/mutablelogic/go-tokenizer"
    "github.com/mutablelogic/go-tokenizer/pkg/markdown"
    "github.com/mutablelogic/go-tokenizer/pkg/markdown/html"
)

func main() {
    input := `# Hello World

This is **bold** and _italic_ text.

- Item 1
- Item 2
- Item 3
`
    doc := markdown.Parse(strings.NewReader(input), tokenizer.Pos{})
    output := html.RenderString(doc)
    fmt.Println(output)
}

Output:

<h1>Hello World</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p><ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>

Packages

tokenizer (root package)

The lexical scanner that breaks input text into tokens.

Token Types:

  • Ident - Identifiers (hello, world)
  • NumberInteger, NumberFloat, NumberHex, NumberOctal, NumberBinary - Numbers
  • String, QuotedString - String literals
  • Hash, Asterisk, Underscore, Backtick, Tilde - Special characters
  • Space, Newline - Whitespace
  • Comment - Comments (when enabled)
  • And more...
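
To make the token kinds above concrete, here is a toy scanner that groups consecutive characters of the same class into a single token. It is only a sketch under simplifying assumptions and not the library's implementation: the `token` struct, the `classify` and `scan` helpers, and the `Punct` kind are hypothetical, and a real scanner handles identifiers containing digits, strings, comments, and number formats that this one ignores.

```go
package main

import (
	"fmt"
	"unicode"
)

// token is a hypothetical stand-in for the library's token type.
type token struct {
	Kind  string
	Value string
}

// classify maps a rune to a coarse token kind.
// "Punct" is a made-up catch-all used only in this sketch.
func classify(r rune) string {
	switch {
	case unicode.IsSpace(r):
		return "Space"
	case unicode.IsDigit(r):
		return "NumberInteger"
	case unicode.IsLetter(r):
		return "Ident"
	default:
		return "Punct"
	}
}

// scan groups consecutive runes of the same kind into one token.
// Punctuation is emitted one rune at a time.
func scan(input string) []token {
	var tokens []token
	for _, r := range input {
		kind := classify(r)
		if n := len(tokens); n > 0 && tokens[n-1].Kind == kind && kind != "Punct" {
			tokens[n-1].Value += string(r)
			continue
		}
		tokens = append(tokens, token{Kind: kind, Value: string(r)})
	}
	return tokens
}

func main() {
	for _, tok := range scan("hello world 123") {
		fmt.Printf("%s: %q\n", tok.Kind, tok.Value)
	}
}
```

For the input `hello world 123` this prints the same five lines as the Quick Start output, which is the only case the sketch is meant to reproduce.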

Scanner Features:

// Enable features with bitwise OR
scanner := tokenizer.NewScanner(r, pos, 
    tokenizer.HashComment |      // # style comments
    tokenizer.LineComment |      // // style comments  
    tokenizer.BlockComment |     // /* */ style comments
    tokenizer.NewlineToken |     // Emit newlines as separate tokens
    tokenizer.UnderscoreToken |  // Emit underscores as separate tokens
    tokenizer.NumberFloatToken,  // Parse floating point numbers
)
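
Combining options with bitwise OR works because each option occupies its own bit. The sketch below shows the general pattern in plain Go. The constant names are borrowed from the snippet above, but the `Feature` type, the bit values, and the `Has` method are assumptions made for illustration and may not match the library's actual definitions.

```go
package main

import "fmt"

// Feature is a hypothetical bit-flag type; the library's real type may differ.
type Feature uint

// Each constant gets a distinct bit, so any combination can be stored
// in a single value. The ordering here is an assumption.
const (
	HashComment Feature = 1 << iota
	LineComment
	BlockComment
	NewlineToken
	UnderscoreToken
	NumberFloatToken
)

// Has reports whether every bit in flag is set in f.
func (f Feature) Has(flag Feature) bool {
	return f&flag == flag
}

func main() {
	features := HashComment | NewlineToken
	fmt.Println(features.Has(HashComment))  // true
	fmt.Println(features.Has(BlockComment)) // false
}
```

A scanner built this way can check a single bit before deciding, for example, whether `#` starts a comment or is emitted as a `Hash` token.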

pkg/ast

Defines the AST node types and tree traversal.

// Node interface
type Node interface {
    Kind() Kind
    Children() []Node
}

// Walk the AST
ast.Walk(doc, func(node ast.Node, depth int) error {
    fmt.Printf("%s%s\n", strings.Repeat("  ", depth), node.Kind())
    return nil
})

pkg/markdown

Parses Markdown text into an AST.

Supported Syntax:

  • Headings: # H1 through ###### H6
  • Paragraphs: Text separated by blank lines
  • Emphasis: _italic_ or *italic*
  • Strong: __bold__ or **bold**
  • Strikethrough: ~~deleted~~
  • Inline code: `code`
  • Code blocks: ```language ... ```
  • Links: [text](url) or <url>
  • Images: ![alt](url)
  • Blockquotes: > quoted text
  • Unordered lists: - item, * item, or + item
  • Ordered lists: 1. item or 1) item
  • Horizontal rules: ---, ***, or ___

pkg/markdown/html

Renders Markdown AST to HTML.

// Render to string
output := html.RenderString(doc)

// Render to io.Writer with indentation
renderer := html.NewRenderer(w).WithIndent(true)
err := renderer.Render(doc)

Features:

  • Proper HTML escaping for XSS prevention
  • Optional indented output for readability
  • Language classes on code blocks: <code class="language-go">
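
The escaping and language-class behaviour listed above can be illustrated with the standard library alone. The `renderCodeBlock` helper below is hypothetical and is not part of this package; it only sketches how a renderer might escape code content and attach a `language-` class, using `html.EscapeString` from Go's standard `html` package.

```go
package main

import (
	"fmt"
	"html"
)

// renderCodeBlock is a hypothetical helper, not the package's renderer.
// It escapes both the language name and the source so that neither can
// inject markup into the output.
func renderCodeBlock(lang, source string) string {
	if lang == "" {
		return "<pre><code>" + html.EscapeString(source) + "</code></pre>"
	}
	return fmt.Sprintf(`<pre><code class="language-%s">%s</code></pre>`,
		html.EscapeString(lang), html.EscapeString(source))
}

func main() {
	fmt.Println(renderCodeBlock("go", "if a < b { return }"))
	fmt.Println(renderCodeBlock("", "<script>alert(1)</script>"))
}
```

Because `<`, `>`, `&`, and quotes are replaced with entities, a code sample containing `<script>` is displayed as text rather than executed by the browser.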

AST Node Types

| Kind | Description | HTML Output |
|------|-------------|-------------|
| Document | Root node | (container) |
| Paragraph | Text block | <p>...</p> |
| Heading | H1-H6 | <h1>...</h1> |
| Text | Plain text | (escaped text) |
| Emphasis | Italic | <em>...</em> |
| Strong | Bold | <strong>...</strong> |
| Strikethrough | Deleted | <del>...</del> |
| Code | Inline code | <code>...</code> |
| CodeBlock | Fenced code | <pre><code>...</code></pre> |
| Link | Hyperlink | <a href="...">...</a> |
| Image | Image | <img src="..." alt="..."/> |
| Blockquote | Quote | <blockquote>...</blockquote> |
| List | Ordered/Unordered | <ol>...</ol> or <ul>...</ul> |
| ListItem | List item | <li>...</li> |
| HorizontalRule | Divider | <hr/> |

License

Apache 2.0 - see LICENSE for details.
