This repository contains the code to generate the word2vec model for Prometheus.
Check out the word2vec submodule (the reference implementation) and then build it:
```
git submodule init
git submodule update
cd word2vec
make
```

We use the following plaintext corpora:
- sv - Vocab size: 1163288, Words in train file: 284410463
- en - Vocab size: 4891175, Words in train file: 2989787812
To allow for an unknown-word vector, first create the vocabulary, manually append an unknown word to it, and then train the model:
```
./word2vec -train <input.txt> -save-vocab vocab.txt
echo "__UNKNOWN__ 0" >> vocab.txt
./word2vec -train <input.txt> -binary 1 -output <model.bin> -size 300 -window 5 -sample 1e-4 -negative 5 -hs 0 -cbow 1 -iter 3 -read-vocab vocab.txt -threads 4
```

Thanks to Marcus Klang there is a way to create an extremely fast binary model, which is read using memory mapping in Java at near I/O speed.
The optimized model is created from the text model file; to produce that file, run the training command above with the -binary 0 flag.
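Once created (see the steps below), the optimized model is read by memory mapping the file rather than parsing it. The sketch below illustrates that mechanism only: the actual layout of the optimized file is defined by vectortool, so the flat little-endian float matrix, the class name, and the fixed DIM are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Illustrates memory-mapped reading of a vector file. The layout assumed here
// (a flat matrix of DIM-float rows, little-endian) is hypothetical; see
// vectortool for the real format of the optimized model.
public class MappedVectorsDemo {
    static final int DIM = 300; // assumed vector size, matching the -size flag used above

    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            // Map the file into virtual memory: nothing is copied onto the Java heap,
            // and only the pages that are actually touched are read from disk.
            // (Files larger than 2 GB would need to be split over several mappings.)
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            mapped.order(ByteOrder.LITTLE_ENDIAN);
            FloatBuffer floats = mapped.asFloatBuffer();

            int row = 42;                 // fetch the vector stored at this row
            float[] vec = new float[DIM];
            floats.position(row * DIM);
            floats.get(vec);
            System.out.println("First component of row " + row + ": " + vec[0]);
        }
    }
}
```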
```
cd vectortool
mvn package
cd target
java -jar vectortool-1.0-SNAPSHOT.jar convert ../../model.txt model.opt
```

Once the model is created, it can be accessed using:
```
java -jar vectortool-1.0-SNAPSHOT.jar closest ../../model.opt
```

It is also possible to read the model from your own Java/Scala program; for how that is done, look at the Word2vec.java class.
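If you only need the plain text model (the -binary 0 output) rather than the optimized format, it can also be parsed with nothing but the standard library. A minimal sketch, assuming the usual word2vec text layout (a "vocabSize dimensions" header line, then one word followed by its floats per line); TextModelReader is a hypothetical name, not a class in this repository:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Loads a word2vec text model (-binary 0 output) into a map of word -> vector.
public class TextModelReader {
    public static Map<String, float[]> load(String path) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String[] header = reader.readLine().trim().split("\\s+"); // "vocabSize dimensions"
            int dim = Integer.parseInt(header[1]);
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length != dim + 1) continue;                // skip malformed lines
                float[] vec = new float[dim];
                for (int i = 0; i < dim; i++) {
                    vec[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], vec);
            }
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Loaded " + load(args[0]).size() + " vectors");
    }
}
```

Note that loading the full English model this way needs several gigabytes of heap, which is exactly the cost the memory-mapped format avoids.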