This is the repository for the paper Beyond Link Prediction: On Pre-Training Knowledge Graph Embeddings, published at RepL4NLP@ACL2024. In this paper, we encourage the study of new ways to train knowledge graph embedding models, so that they better capture a knowledge graph and are thus potentially more useful for downstream tasks.
This codebase is based on LibKGE, so the installation process is similar:
```sh
git clone https://github.com/uma-pi1/kge-pretraining.git
cd kge-pretraining
pip install -e .
```

Note that many dependencies are pinned to specific versions in setup.py, and this code may not work if those versions are not installed.
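If the installation succeeded, the kge command-line tool should be on your path; a quick sanity check:

```sh
# Lists the available subcommands (start, resume, etc.)
kge --help
```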
The examples folder includes some examples of configuration files used to train models, including hyperparameter optimization. For example, here's how to train a DistMult model on all training objectives except the two tasks for link prediction:
```sh
cd examples/distmult_fb15k-237_no_link_prediction
kge resume . --job.device <your_device_id>
```

Each trial of the hyperparameter optimization process creates a separate training job, whose folder is one of the numbered subfolders inside the job folder where we ran the command above. Each training job folder contains all checkpoints from that job.
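For orientation, the resulting folder layout looks roughly like the sketch below (trial and checkpoint names are illustrative and depend on the search settings):

```sh
distmult_fb15k-237_no_link_prediction/
├── config.yaml             # configuration of the search job
├── kge.log                 # log of the search job
├── 00000/                  # first trial, a separate training job
│   ├── config.yaml         # configuration of this trial
│   ├── checkpoint_00010.pt # checkpoint after a given epoch
│   └── checkpoint_best.pt  # best checkpoint of this trial
├── 00001/                  # second trial
│   └── ...
└── ...
```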
To train models using different combinations of the pre-training objectives studied in the paper, create new configuration files following those found in the examples folder. For pre-training on Wikidata5M, please get the training data from LibKGE.
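One way to start is to copy an example folder and edit its configuration; the sketch below assumes the example folder contains only the configuration file, and the target folder name is hypothetical:

```sh
# Use an existing example as a template for a new combination of objectives
cp -r examples/distmult_fb15k-237_no_link_prediction examples/my_objectives
# Edit config.yaml to select the desired training objectives, then train
cd examples/my_objectives
kge resume . --job.device <your_device_id>
```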
The examples folder also includes several scripts that make it easy to evaluate all checkpoints, both on the pre-training tasks and on downstream tasks. For example, here's how to get the performance on all pre-training tasks of all checkpoints produced by the training job we ran above:
```sh
cd examples
./rank_eval_all_checkpoints.sh --folder distmult_fb15k-237_no_link_prediction --device <your_device_id>
```

Similarly, here's how to evaluate those same checkpoints on the entity classification tasks for the FB15K-237 dataset:
```sh
cd examples
./237_ec_all_checkpoints.sh --folder distmult_fb15k-237_no_link_prediction --device <your_device_id>
```

The examples folder contains various other scripts for evaluating on other combinations of pre-training tasks or on the downstream tasks for the YAGO3-10 and Wikidata5M datasets. To evaluate on the downstream tasks from Wikidata5M, you first need to uncompress the training data of each task.
The main extensions to LibKGE for training models are found in the following files:
- train_1vsAll_hybrid.py: 1vsAll generalized for more tasks
- train_KvsAll_hybrid.py: KvsAll generalized for more tasks
- train_negative_sampling_hybrid.py: NegSamp generalized for more tasks
Note that they all follow the internal notation of representing task queries using an underscore to mark the target slot of a query and the symbol ^ to mark a wildcard slot. E.g., s^_ represents the (subject) entity neighborhood task. To add a new training objective, we suggest adding a new training file, using any of the files listed above as an example. Similarly, to evaluate a new task implemented for a training objective, add support for that task in the RankingEvaluation job, which implements the more general evaluation protocol covering all tasks in our study.
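To illustrate the notation, here are a few example readings that follow from the convention above (the exact set of supported patterns is defined in the training files):

```sh
# sp_  given subject and relation, predict the object (link prediction)
# _po  given relation and object, predict the subject (link prediction)
# s^_  given a subject and a wildcard relation, predict a neighboring entity
```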
Each downstream task is associated with an existing pre-training graph, so you would first have to add the task's data to the corresponding graph, e.g. as done with date_of_birth for FB15K-237. The main extension to LibKGE for evaluating on downstream tasks is found in the file downstream_task.py. So, in addition to adding your new data, you would have to modify that file to support your new downstream task.
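For instance, adding a new downstream task for FB15K-237 might involve steps along these lines (the file name and format are hypothetical; mirror the existing date_of_birth setup for the actual conventions):

```sh
# 1. Add the new task's data to the folder of its pre-training graph
cp my_new_task_data.tsv data/fb15k-237/
# 2. Extend downstream_task.py so it can load and evaluate the new task
```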
If you use our data, code or compare against our results, please cite the following publication:
```bibtex
@inproceedings{ruffinelli2024beyond,
    title={Beyond Link Prediction: On Pre-Training Knowledge Graph Embeddings},
    author={Ruffinelli, Daniel and Gemulla, Rainer},
    booktitle={Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)},
    year={2024}
}
```