This is just my pet-project for practice different skills with some simple program.
The most common task that can be solved with regular expressions, or regexps, is finding whether or not given string (or word) match (or belongs to) specified regexp (or language). It can be said that any regexp defines some set of words. Sometimes it is useful to somehow enumerate the whole set and iterate through it. The most obvious reason to do that is bruteforcing something.
(Sort of such functionality can be found in John the Ripper software,
but it's rules pretty differ from regular expression syntax)
Just write regular expression and see set of words it generates:
$ ./reglan "abc"
abc
$ ./reglan "(abc|def|ghi)"
abc
def
ghi
$ ./reglan "[0-1]{1,2}"
0
1
00
01
10
11
$ ./reglan "a*b?c+"
c
ac
bc
cc
aac
abc
acc
bcc
ccc
aaac
aabc
aacc
abcc
accc
bccc
cccc
aaaac
............
Backreferences:
$ ./reglan "([a-z]+)@\1\.(com|net|org)"
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
.............
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
.............
References to "dictionary", i.e. file that contains one possible word per line:
$ ./reglan "(19[6-9][0-9])_(?F./names.txt)"
1960_jack
1960_john
1960_james
1960_max
1960_bill
1960_scott
1961_jack
1961_john
1961_james
1961_max
1961_bill
1961_scott
1962_jack
...........
1998_max
1998_bill
1998_scott
1999_jack
1999_john
1999_james
1999_max
1999_bill
1999_scott
$
Skip arbitrary number of words:
$ ./reglan -o 1234567890 -n 3 "[1-9]\d*"
1234567891
1234567892
1234567893
$ ./reglan -o 65536 -n 3 "[0-9a-f]{8}"
00010000
00010001
00010002
$
Combine with John the Ripper:
$ ./reglan "(19[4-9][0-9]|20[01][0-9])-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1])" | sudo john --stdin /etc/shadow
'.'- any character'[abc]'- any of specified characters'[0-9]'- any character within range'[^abc0-9]'- any character that is not specified or within range'*'- any number of repetitions'+'- one or more repetition'?'- zero or one repetition'{m,n}'- frommtonrepetitions, inclusively'{m,}'-mor more repetitions'{m}'- exacltymrepetitions'\\'- escape character'\x0a'- char code'\t', '\r', '\n', '\f', '\v\'- whitespace characters'\d', '\D', '\s', '\S', '\w', '\W'- charset classes'ab|cd'- alternative subexprs'(...)'- grouping subexprs'(?:...)'- grouping subexprs without numeration (matter for backrefs)'(?F<path>)'- any line from file<path>\N- backreference to subexpression with numberN(between1and9), strictly equals to corresponding subexression in each word
$ ./reglan -u
Usage: ./reglan [-v] [-h] [-u] [-p] [-d] [-c] [-o <offset>] [-n <max_number>] [<regexpr>]*
-v print version
-h, -u print usage (this info)
-p print parsed expression
-d print debug info, i.e. expression template state for each word
-c do not print generated words, but print their total count after generation finishs
-o <N> skip <N> words from the beginning (default - do not skip anything)
-n <N> limit generated words count to this value (default - unlimited)
Unless -c is given, reglan will output every word in strict order,
one per line, until either all possible words printed or <max_number>
limit reached, for every <regexpr> separately.
- Unicode is not supported (yet)
- Syntactically wrong input may lead to undefined behaviour
- Undefined backreferences will be silently dropped
- Cyclic backreferences (e.g.
(\1)or(\2)(\1)) lead to undefined behaviour - Although it is guaranteed (I hope) that every word of specified
language will be printed at least once, it is not guaranteed that
it will be printed at most once, i.e. some words can be printed
multiple times; e.g. regexp
'a?a?'gives list of four words['', 'a', 'a', 'aa'] - internally
long longis used for storing word numbers, hence the limit on-s <N>and-n <N>arguments
It shoud be enough to run make from src directory. Only standard
library is needed, no third-party dependencies.
There is simple shellscript that do the following:
- run
make allfromsrcdirectory - copy all binaries (
reglan,libreglan.a,libreglan.so) and headerreglan.htotestdirectory - run simple python-based test script
test.pywhich exptects that executablereglanis near; it generates some random finite regular languages and checks thatreglanoutputs only proper words and that total amount of words is correct - compiles simple
test.cprogram twice: statically linked withlibreglan.aand dynamically linkes withlibreglan.so, then checks that produced executable runs correctly
Python extension can be built:
$ cd python && python setup.py build && sudo python setup.py install
$ python
Type "help", "copyright", "credits" or "license" for more information.
>>> import reglan
>>> list(reglan.reglan("[012][34][56]"))
['035', '036', '046', '046', '136', '136', '146', '146', '236', '236', '246', '246']
>>> for word in reglan.reglan("a*b?c{2,4}")[100:105]:
... print(word)
...
aaaaaaaaaaaaaaaaabcc
aaaaaaaaaaaaaaaaaccc
aaaaaaaaaaaaaaaabccc
aaaaaaaaaaaaaaaacccc
aaaaaaaaaaaaaaabcccc
- better documentation, including:
- comments everywhere
- doxygen comments
- man and/or info format
- add flexibility to extended syntax:
- (?:)
- aliases, some way to define "subexpressions" and then reuse them;
- some way to reproduce
johnfunctionality, e.g. uppercase-lowercase word, capitalize first/last letter etc. - unicode support;
- test with different environments, including Windows;
- add wrappers for different languages/platforms:
- nodejs, ruby, perl, php
- java, .net/mono
- mysql, mssql, postgres
- packages:
- deb
- rpm
- improve performance:
- algorithm / structures
- OpenMP
- SSE/AVX
- educational-aimed GUI