WIP: Add timit recipe #96
luomingshuang wants to merge 10 commits into k2-fsa:master from luomingshuang:add-timit-recipe
Conversation
#
# - $dl_dir/lm
# This directory contains the language model (LM) downloaded from
# https://huggingface.co/luomingshuang/timit_lm, and the LM is based
Could you please describe how lm_tgmed.arpa is obtained?
Is it possible to train it inside icefall?
Em... the lm_tgmed.arpa is obtained by this train_lms.sh, which follows Kaldi. About training the LM inside icefall, I think it is a good idea. I have wondered before whether we can train the LM in Python; there are some methods for it using KenLM. Maybe I can have a look.
ok, train_lms.sh uses https://github.com/danpovey/kaldi_lm.git
I will wrap it in Python with pybind11 when I have time.
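For reference, a minimal sketch of what KenLM-based training could look like from Python, assuming KenLM's `lmplz` binary is installed and on PATH; the file names `train_text.txt` and `lm_tgmed.arpa` are placeholders, not this recipe's actual paths:

```python
# Hedged sketch: train a trigram ARPA LM with KenLM's lmplz from Python.
# Assumes `lmplz` is on PATH; file names are placeholders.
import subprocess

with open("train_text.txt") as fin, open("lm_tgmed.arpa", "w") as fout:
    # -o 3 requests a 3-gram model, matching the "tg" in lm_tgmed.
    subprocess.run(["lmplz", "-o", "3"], stdin=fin, stdout=fout, check=True)
```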
2021-10-28 13:20:42,952 INFO [decode.py:360] Wrote detailed error stats to tdnn_lstm_ctc/exp/errs-TEST-lm_scale_2.0.txt
2021-10-28 13:20:42,986 INFO [decode.py:374]
For TEST, PER of different settings are:
lm_scale_0.1	20.82	best for TEST
Could you try smaller lm scale values? The best one (0.1) is at the edge of the searched range, so the true optimum may lie below it.
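A hedged sketch of extending the searched range, assuming the decode script builds a list of scales (the variable name `lm_scale_list` follows other icefall decode scripts but is an assumption for this recipe):

```python
# Hedged sketch: include scales below 0.1 so the best value is not at
# the boundary of the range. round() keeps the result keys readable.
lm_scale_list = [0.01, 0.02, 0.05] + [round(0.1 * i, 1) for i in range(1, 21)]
```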
    recordings=m["recordings"],
    supervisions=m["supervisions"],
)
if "train" in partition:
Please note that in librispeech, the names of the training datasets begin with train (lowercase).
In TIMIT, I find that it is TRAIN (uppercase), see line 52 in this file, so this if statement is never executed.
Please change train to TRAIN and re-run your experiments.
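A case-insensitive check would also avoid this class of mismatch; a minimal sketch (the helper name is hypothetical):

```python
def is_train_partition(partition: str) -> bool:
    # Matches both "train-clean-100" (LibriSpeech) and "TRAIN" (TIMIT).
    return "train" in partition.lower()
```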
Oh... will do it.
load_dicts = json.load(load_f)
for load_dict in load_dicts:
    text = load_dict["text"]
    phones_list = list(filter(None, text.split(" ")))
Could it be changed to
phones_list = text.split()?
It's simpler and easier to understand.
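A quick check that the two forms agree on space-separated transcripts (the sample string is a hypothetical phone sequence):

```python
# str.split() with no argument already discards the empty strings that
# repeated spaces would otherwise produce.
text = "sil dh ax  k ae t sil"  # hypothetical phone transcript
assert list(filter(None, text.split(" "))) == text.split()
```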
phones_list = list(filter(None, text.split(" ")))

for phone in phones_list:
    if phone not in phones:
Could you use a set to represent phones, not a list?
A set is more efficient for lookups.
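A minimal sketch of the suggested change, reusing the names from the excerpt above:

```python
# With a set, duplicates are handled automatically and membership tests
# are O(1) on average, so the explicit "not in" check can be dropped.
phones_list = ["sil", "ax", "sil"]  # hypothetical; comes from the loop above
phones = set()                      # was: a list
phones.update(phones_list)          # replaces the per-phone loop and check
```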
with open(lexicon, "w") as f:
    for phone in sorted(phones):
        f.write(str(phone) + " " + str(phone))
phone is already of type str, so can we remove the str() calls here?
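The simplification might look like this, mirroring the excerpt above; the trailing newline is an assumption, since the excerpt does not show how lines are terminated:

```python
with open(lexicon, "w") as f:       # lexicon and phones as in the excerpt
    for phone in sorted(phones):
        # phone is already a str; the trailing "\n" is an assumption.
        f.write(f"{phone} {phone}\n")
```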
# We assume that you have installed the git-lfs, if not, you could install it
# using: `sudo apt-get install git-lfs && git-lfs install`
[ ! -e $dl_dir/lm ] && mkdir -p $dl_dir/lm
git clone https://huggingface.co/luomingshuang/timit_lm $dl_dir/lm
Please add a check that lm_tgmed.arpa is downloaded correctly.
Some users may forget to run git lfs install.
You can add an extra statement:
( cd $dl_dir/lm && git lfs pull )
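Beyond `git lfs pull`, a hedged Python sketch of an explicit check: when git-lfs is not installed, the cloned "file" is a small pointer stub with a fixed signature line (the path assumes `$dl_dir` is `download`):

```python
from pathlib import Path

lm = Path("download/lm/lm_tgmed.arpa")  # assumes $dl_dir == download
head = lm.read_bytes()[:64] if lm.exists() else b""
# Git LFS pointer stubs begin with this version line.
if head.startswith(b"version https://git-lfs.github.com/spec"):
    raise SystemExit("lm_tgmed.arpa is an LFS pointer; run: git lfs pull")
```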
if [ $stage -le 6 ] && [ $stop_stage -ge 6 ]; then
  log "Stage 6: Prepare G"
  # We assume you have install kaldilm, if not, please install
typo: install -> installed
--read-symbol-table="data/lang_phone/words.txt" \
--disambig-symbol='#0' \
--max-order=4 \
$dl_dir/lm/lm_tgmed.arpa > data/lm/G_4_gram.fst.txt
tgmed means this arpa is a trigram of medium size, I think.
Please use a 4-gram arpa to generate G_4_gram.fst.txt if you need one for decoding/rescoring.
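Otherwise, a hedged sketch of a consistent trigram invocation, driving the same kaldilm CLI from Python; `--max-order=3` and the `G_3_gram` output name match a trigram arpa, and `download/lm` is an assumption for `$dl_dir/lm`:

```python
import subprocess

cmd = [
    "python3", "-m", "kaldilm",
    "--read-symbol-table=data/lang_phone/words.txt",
    "--disambig-symbol=#0",
    "--max-order=3",              # lm_tgmed.arpa is a trigram
    "download/lm/lm_tgmed.arpa",  # assumes $dl_dir == download
]
with open("data/lm/G_3_gram.fst.txt", "w") as out:
    subprocess.run(cmd, stdout=out, check=True)
```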
@@ -0,0 +1,97 @@
#!/usr/bin/env bash
This file is shared across various recipes.
Could you make it a symlink, like what we are doing in the librispeech recipe?
@@ -0,0 +1,400 @@
FADG0_SI1279 TEST/DR4/FADG0/SI1279.WAV
Can this file be generated by some scripts? If so, we don't need to check it in.
Em... about the {train, dev, test} split files, I haven't found any scripts that generate them. In Kaldi, they are kept in list files. In SpeechBrain, they are placed in timit_prepare.py, which lists the speakers in a list. One option for us is to follow SpeechBrain and use a list of speaker names during data preparation, as sketched below. I will add it to Lhotse.
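A hedged sketch of that SpeechBrain-style approach: keep the dev/test speaker IDs in Python collections and route each utterance by its speaker prefix. FADG0 comes from the test list excerpt above; the other IDs are placeholders, not the real TIMIT split:

```python
DEV_SPEAKERS = {"FAKS0"}            # placeholder ID
TEST_SPEAKERS = {"FADG0", "MJSW0"}  # FADG0 appears in the list above

def partition_of(utt_id: str) -> str:
    # TIMIT utterance IDs look like "FADG0_SI1279": speaker, then sentence.
    speaker = utt_id.split("_")[0]
    if speaker in DEV_SPEAKERS:
        return "dev"
    if speaker in TEST_SPEAKERS:
        return "test"
    return "train"

print(partition_of("FADG0_SI1279"))  # -> "test"
```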
Add timit recipe for icefall. This recipe uses phones as modeling units and aims to compute the PER; the target output is a list of phones. The {dev, test} split follows Kaldi ({kaldi-timit-dev, kaldi-timit-test}). At present, the recipe contains tdnn_lstm_ctc; I will add other models and methods (such as conformer, crdnn, and MMI) later.
In fact, I have done some experiments for TIMIT based on snowfall: k2-fsa/snowfall#247
The current result is not the best. I will continue to improve it.
log-train-2021-10-28-15-24-21.txt
https://tensorboard.dev/experiment/twUbZTxoTAK32bPCJsYF7Q/#scalars
TODOs: