This repository serves as the spec test suite for the lexmatch feature in the MoonBit language.
MoonBit is committed to being the perfect language for data processing tasks. Currently, the match expression is the primary way to destructure and analyze data. However, when it comes to string processing, it is not as powerful as regular expressions. The lexmatch feature combines the capabilities of match and regular expressions, providing a more flexible and powerful way to analyze and destructure strings.
The lexmatch feature in MoonBit comes in two forms: lexmatch expressions and lexmatch? expressions.
- `lexmatch` expression: Similar to the existing `match` expression, but the patterns in `lexmatch` expressions are "lex patterns", which differ from ordinary patterns. These patterns can be used to match `StringView` and `BytesView` types and can capture substrings during matching.
- `lexmatch?` expression: Similar to the existing `is` expression, but the pattern on the right side is a "lex pattern" used to check whether a value of `StringView` or `BytesView` type conforms to a certain lexical structure, and it can capture substrings. The overall result of the expression is a boolean value indicating whether the match succeeded.
    ///|
    /// Accumulates line, word, and character counts over `input`.
    pub fn wordcount(
      input : BytesView,
      lines : Int,
      words : Int,
      chars : Int,
    ) -> (Int, Int, Int) {
      lexmatch input with longest {
        // A newline starts a new line.
        ("\n", rest) => wordcount(rest, lines + 1, words, chars)
        // A run of non-whitespace characters is one word.
        ("[^ \t\r\n]+" as word, rest) =>
          wordcount(rest, lines, words + 1, chars + word.length())
        // Any other single character counts as one character.
        (".", rest) => wordcount(rest, lines, words, chars + 1)
        // Empty input: return the accumulated counts.
        "" => (lines, words, chars)
        // Catch-all case for anything the patterns above do not match.
        _ => panic()
      }
    }

The above example demonstrates how to use the `lexmatch` expression to perform lexical analysis on input, counting the number of lines, words, and characters. The patterns can capture substrings: for example, the pattern `("[^ \t\r\n]+" as word, rest)` matches a sequence of non-whitespace characters and captures it as `word`.
    ///|
    pub fn downloadable_protocol(url : StringView) -> StringView? {
      if url lexmatch? (("(?i:ftp|http(s)?)" as protocol) "://", _) with longest {
        Some(protocol)
      } else {
        None
      }
    }

This example demonstrates how to use the `lexmatch?` expression to check whether a URL starts with a downloadable protocol (ftp, http, https) and capture that protocol. It also demonstrates the case-insensitive modifier `(?i:...)`.
- Target: The `StringView` or `BytesView` being matched by `lexmatch`.
- Match Strategy: The strategy used to match patterns, which can be `longest` (longest match) or `first` (first match, default, currently unavailable). This proposal focuses on the `longest` match strategy.
- Catch-all case: A branch whose left side is a variable or wildcard `_`, which can match any target. It must be placed at the end of the `lexmatch` branches to handle unmatched cases.
- Lex Pattern: The pattern part on the left side of the `lexmatch` branch (before `=>`), which differs from the guard part (currently `lexmatch` does not support guards). A lex pattern can be one of the following forms:
  - Bare regex pattern: A regex pattern that matches the entire target. Example: `""`
  - Regex pattern + rest variable: A regex pattern that matches a prefix of the target, with the rest variable bound to the remaining suffix. The rest variable can be a variable or wildcard `_`. This form requires parentheses for readability. Example: `("\n", rest)`
- Regex Pattern: Regex patterns have three forms:
  - Regex literal: A string literal representing a regex pattern. Example: `"[^ \t\r\n]+"`. Note that regex literals must be enclosed in double quotes and do not require double escaping. For example, to match a backslash character, use `"\\"` instead of `"\\\\"`.
  - Capture: A regex pattern followed by `as` and a variable name to capture the matched substring. Example: `"[^ \t\r\n]+" as word`. If the lex pattern is a bare regex pattern of this form, parentheses are required.
  - Sequence: A sequence of regex patterns separated by spaces. Example: `"//" ("[^\r\n]*" as comment)`. If the lex pattern is a bare regex pattern of this form, parentheses are required.

  Regex patterns can be nested to form more complex patterns.
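The lex pattern and regex pattern forms described above can be combined as in the following sketch. The function names `is_blank` and `line_comment` are hypothetical and not part of this test suite; the syntax is modeled on the examples earlier in this document and assumes the `longest` strategy.

```moonbit
///|
/// Hypothetical helper: true if `line` consists only of spaces and tabs.
pub fn is_blank(line : StringView) -> Bool {
  lexmatch line with longest {
    // Bare regex pattern: it has to match the entire target.
    "[ \t]*" => true
    // Catch-all case for every other target.
    _ => false
  }
}

///|
/// Hypothetical helper: returns the text after `//` up to the end of the
/// line if `input` starts with a line comment.
pub fn line_comment(input : StringView) -> StringView? {
  lexmatch input with longest {
    // Sequence of a regex literal and a capture, plus a wildcard rest variable.
    ("//" ("[^\r\n]*" as comment), _) => Some(comment)
    // Catch-all case: the input does not start with a line comment.
    _ => None
  }
}
```

Under the semantics described in this document, `is_blank` also accepts the empty target, because a bare pattern matches the whole target and `*` allows zero repetitions.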
- Literal Characters: Characters other than special characters (`\`, `[`, `]`, `(`, `)`, `{`, `}`, `.`, `*`, `+`, `?`, `|`, `^`, `$`) match their literal value. Examples: `a`, `Z`, `0`, `@`, etc.
- `.`: Matches any single character, including newlines
- Escape Characters:
  - `\n`: Newline character
  - `\r`: Carriage return character
  - `\t`: Tab character
  - `\\`: Backslash character
  - `\[`, `\]`, `\(`, `\)`, `\{`, `\}`, `\.`, `\*`, `\+`, `\?`, `\|`, `\^`, `\$`: Match the corresponding literal character
  - `\xhh`: Matches a character with hexadecimal value `hh` (`hh` is two hexadecimal digits)
  - `\uhhhh`: Matches a character with Unicode code point `hhhh` (`hhhh` is four hexadecimal digits). Note that this escape sequence is invalid when the target is `BytesView`.
  - `\u{h...}`: Matches a character with Unicode code point `h...` (`h...` is one or more hexadecimal digits). Note that this escape sequence is invalid when the target is `BytesView`.
- Character Classes:
  - `\s`: Matches any whitespace character in the ASCII range, equivalent to `[ \t\r\n\f\v]`
  - `\S`: Matches any non-whitespace character in the ASCII range, equivalent to `[^ \t\r\n\f\v]`
  - `\d`: Matches any digit character in the ASCII range, equivalent to `[0-9]`
  - `\D`: Matches any non-digit character in the ASCII range, equivalent to `[^0-9]`
  - `\w`: Matches any word character in the ASCII range, equivalent to `[a-zA-Z0-9_]`
  - `\W`: Matches any non-word character in the ASCII range, equivalent to `[^a-zA-Z0-9_]`
- Character Sets:
  - `[abc]`: Matches character `a`, `b`, or `c`
  - `[a-z]`: Matches any character from `a` to `z`
  - `[^abc]`: Matches any character except `a`, `b`, and `c`
  - `[^a-z]`: Matches any character not in the range `a` to `z`
  - `[\d\s]`: Matches any digit or whitespace character
- Quantifiers:
  - `*`: Matches the preceding sub-expression zero or more times
  - `+`: Matches the preceding sub-expression one or more times
  - `?`: Matches the preceding sub-expression zero or one time
  - `{n}`: Matches the preceding sub-expression exactly n times
  - `{n,}`: Matches the preceding sub-expression at least n times
  - `{n,m}`: Matches the preceding sub-expression at least n times, but no more than m times
- Anchors:
  - `$`: Matches the end position of the input string
- Scoped Modifiers:
  - `(?i:...)`: Case-insensitive matching for the sub-expression within the parentheses
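As an illustration of the syntax listed above, the sketch below combines a character class (`\d`), quantifiers (`+`, `?`), a parenthesized group, and an escaped `.`. The name `leading_number` is hypothetical. The parentheses only group; the whole match is captured with `as`.

```moonbit
///|
/// Hypothetical helper: splits a leading decimal number (such as `42` or
/// `3.14`) off the front of `input`.
pub fn leading_number(input : StringView) -> (StringView, StringView)? {
  lexmatch input with longest {
    // `\d+` is the integer part, `(\.\d+)?` an optional fractional part.
    ("\d+(\.\d+)?" as number, rest) => Some((number, rest))
    // Catch-all case: the input does not start with a digit.
    _ => None
  }
}
```

Because the fractional part requires at least one digit after the dot, a trailing `.` with no digits is left in the rest variable rather than consumed by the number.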
The lexmatch expression works similarly to the match expression, but with the following differences:
- The target of a `lexmatch` expression must be `StringView` or `BytesView`.
- Except for the catch-all branch, the left side of each `lexmatch` branch must be a lex pattern.
- A match strategy can be specified after the `with` keyword. If not specified, the default strategy is `first` (currently unavailable).
- Regex patterns in lex patterns match the target using the specified match strategy.
- If a regex pattern matches the target, any capture variables in the pattern will be bound to the corresponding matched substrings.
- If a regex pattern followed by a comma and rest variable matches the target, the regex pattern will match a prefix of the target, and the rest variable will be bound to the remaining suffix.
- If no lex pattern matches the target, the catch-all branch will be executed.
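A minimal sketch of how capture variables and the rest variable are bound under the rules above. `split_key_value` is a hypothetical helper, and its pattern follows the shape of the `downloadable_protocol` example.

```moonbit
///|
/// Hypothetical helper: splits `key=value` into its two parts.
pub fn split_key_value(line : StringView) -> (StringView, StringView)? {
  lexmatch line with longest {
    // `key` is bound to the captured prefix, `value` to the remaining suffix
    // after the `=` sign.
    (("[a-z]+" as key) "=", value) => Some((key, value))
    // Catch-all case: executed when the lex pattern does not match.
    _ => None
  }
}
```

Since `[a-z]+` cannot match `=`, the key always ends at the first `=`, and any later `=` stays in the rest variable.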
- When a capture matches a single character, the captured value is a `Char` or `Byte`, not a `StringView` or `BytesView`. Example: `"[+\-]" as sign`
- Regex literals are essentially equivalent to JavaScript regular expression syntax with the `v` flag enabled, but with the following differences:
  - The `"(abc)"` regex pattern does not introduce a capture group. To capture a matched substring, use the `as` syntax. Example: `"abc" as group` instead of `"(abc)"`.
  - Non-multiline mode is the default (multiline mode is not currently supported): `$` matches the end of the string, and `.` matches all characters, including newlines.
  - Scoped modifier syntax (like `(?i:...)`) can only enable one modifier per group. For example, `(?im:...)` is not currently supported. Negated modifiers like `(?-i:...)` are also not supported. Currently, the only supported modifier is `i` (case-insensitive).
  - To leave room for possible future interpolation support, regex literals do not support using "{" to match the left brace character, nor "}" to match the right brace character. To match brace characters, use the character sets `[{]` and `[}]`.
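The single-character capture rule can be sketched as follows, assuming the capture of `"[+\-]"` on a `StringView` target is a `Char` as stated above. `parse_sign` is a hypothetical helper, not part of this test suite.

```moonbit
///|
/// Hypothetical helper: reads an optional leading sign and returns the
/// sign factor together with the remaining input.
pub fn parse_sign(input : StringView) -> (Int, StringView) {
  lexmatch input with longest {
    ("[+\-]" as sign, rest) => {
      // `sign` is assumed to be a `Char` here, so it is compared with a
      // character literal rather than a string.
      let factor = if sign == '-' { -1 } else { 1 }
      (factor, rest)
    }
    // Catch-all case: no explicit sign, so the input is returned unchanged.
    _ => (1, input)
  }
}
```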
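    pub fn search_marker(str : StringView) -> StringView? {
      for curr = str {
        lexmatch curr with longest {
          // End of input reached without finding the marker.
          "" => return None
          // Marker found: return everything after it.
          ("MARKER", right) => return Some(right)
          // Otherwise skip one character and keep searching.
          (".", rest) => continue rest
          _ => panic()
        }
      }
    }

Q: Why not use regex patterns directly in match expressions?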
A: match expressions are designed for structural pattern matching, while lexmatch expressions are designed for lexical analysis. Mixing these two concepts can lead to confusion and complexity. By introducing a separate expression for lexical analysis, we can maintain semantic clarity and focus.