This repository contains the code to generate the word2vec model for Prometheus.
Check out the word2vec submodule (the reference implementation) and then build it:
```
git submodule init
git submodule update
cd word2vec
make
```

We use the following plaintext corpora:
- sv - Vocab size: 1163288, Words in train file: 284410463
- en - Vocab size: 4891175, Words in train file: 2989787812
To allow for an unknown-word vector, first create the vocabulary, manually append an unknown word to it, and then train the model:
```
./word2vec -train <input.txt> -save-vocab vocab.txt
echo "__UNKNOWN__ 0" >> vocab.txt
./word2vec -train <input.txt> -binary 1 -output <model.bin> -size 300 -window 5 -sample 1e-4 -negative 5 -hs 0 -cbow 1 -iter 3 -read-vocab vocab.txt -threads 4
```

Thanks to Marcus Klang there is a way to create an extremely fast binary model, which is read using memory mapping in Java at near I/O speed.
The optimized model is created from the text model file; to produce that file, run the training command above with the -binary 0 flag.
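Once created (see the steps below), the optimized model is read by memory mapping the file rather than parsing it. The sketch below illustrates that mechanism only: the actual layout of the optimized file is defined by vectortool, so the flat little-endian float matrix, the class name, and the fixed DIM are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Illustrates memory-mapped reading of a vector file. The layout assumed here
// (a flat matrix of DIM-float rows, little-endian) is hypothetical; see
// vectortool for the real format of the optimized model.
public class MappedVectorsDemo {
    static final int DIM = 300; // assumed vector size, matching the -size flag used above

    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            // Map the file into virtual memory: nothing is copied onto the Java heap,
            // and only the pages that are actually touched are read from disk.
            // (Files larger than 2 GB would need to be split over several mappings.)
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            mapped.order(ByteOrder.LITTLE_ENDIAN);
            FloatBuffer floats = mapped.asFloatBuffer();

            int row = 42;                 // fetch the vector stored at this row
            float[] vec = new float[DIM];
            floats.position(row * DIM);
            floats.get(vec);
            System.out.println("First component of row " + row + ": " + vec[0]);
        }
    }
}
```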
```
cd vectortool
mvn package
cd target
java -jar vectortool-1.0-SNAPSHOT.jar convert ../../model.txt model.opt
```

Once the model is created, it can be accessed using:
```
java -jar vectortool-1.0-SNAPSHOT.jar closest ../../model.opt
```

It is also possible to read the model from your own Java/Scala program; for how that is done, look at the Word2vec.java class.
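If you only need the plain text model (the -binary 0 output) rather than the optimized format, it can also be parsed with nothing but the standard library. A minimal sketch, assuming the usual word2vec text layout (a "vocabSize dimensions" header line, then one word followed by its floats per line); TextModelReader is a hypothetical name, not a class in this repository:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Loads a word2vec text model (-binary 0 output) into a map of word -> vector.
public class TextModelReader {
    public static Map<String, float[]> load(String path) throws IOException {
        Map<String, float[]> vectors = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
            String[] header = reader.readLine().trim().split("\\s+"); // "vocabSize dimensions"
            int dim = Integer.parseInt(header[1]);
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts.length != dim + 1) continue;                // skip malformed lines
                float[] vec = new float[dim];
                for (int i = 0; i < dim; i++) {
                    vec[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], vec);
            }
        }
        return vectors;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Loaded " + load(args[0]).size() + " vectors");
    }
}
```

Note that loading the full English model this way needs several gigabytes of heap, which is exactly the cost the memory-mapped format avoids.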