
Commit 5da46d8

Merge pull request #68 from christinaexyou/add-lmeval-tier1-tasks
feat: Add LMEval Tier 1 tasks
2 parents a041a18 + 35f68ea commit 5da46d8

1 file changed: 47 additions & 0 deletions
@@ -1,3 +1,50 @@
= LM-Eval

image::lm-eval-architecture.svg[LM-Eval architecture diagram]

== LM-Eval Task Support

TrustyAI supports a subset of LMEval tasks to ensure reproducibility and reliability of the evaluation results. Tasks are categorized into three tiers based on our level of support: *Tier 1*, *Tier 2*, and *Tier 3*.

=== Tier 1 Tasks

These tasks are fully supported by TrustyAI, with guaranteed fixes and maintenance. They have been tested, validated, and monitored in CI for reliability and reproducibility.footnote:[Tier 1 tasks were selected according to their presence on the OpenLLM leaderboard or their popularity (>10,000 downloads on Hugging Face).]

[cols="1,2a", options="header"]
|===
|Name |https://github.com/opendatahub-io/lm-evaluation-harness/tree/incubation/lm_eval/tasks[Task Group Description]
| `arc_easy` | Tasks involving complex reasoning over a diverse set of questions.
| `bbh` | Tasks focused on deep semantic understanding through hypothesization and reasoning.
| `bbh_fewshot_snarks` | Tasks focused on deep semantic understanding through hypothesization and reasoning.
| `belebele_ckb_Arab` | Language understanding tasks in a variety of languages and scripts.
| `cb` | A suite of challenging tasks designed to test a range of language understanding skills.
| `ceval-valid_law` | Tasks that evaluate language understanding and reasoning in an educational context.
| `commonsense_qa` | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge.
| `gpqa_main_n_shot` | Graduate-level, Google-proof question-answering tasks for expert-domain knowledge verification.
| `gsm8k` | A benchmark of grade school math problems aimed at evaluating reasoning capabilities.
| `hellaswag` | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity.
| `humaneval` | Code generation tasks that measure functional correctness for synthesizing programs from docstrings.
| `ifeval` | Instruction-following evaluation tasks that test whether a model complies with verifiable instructions.
| `kmmlu_direct_law` | Knowledge-based multi-subject multiple-choice questions for academic evaluation.
| `lambada_openai` | Tasks designed to predict the endings of text passages, testing language prediction skills.
| `lambada_standard` | Tasks designed to predict the endings of text passages, testing language prediction skills.
| `leaderboard_math_algebra_hard` | Task group used by Hugging Face's Open LLM Leaderboard v2. These tasks are static and will not change over time.
| `mbpp` | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions.
| `minerva_math_precalc` | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills.
| `mmlu_anatomy` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
| `mmlu_pro_law` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
| `mmlu_pro_plus_law` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
| `openbookqa` | Open-book question answering tasks that require external knowledge and reasoning.
| `piqa` | Physical Interaction Question Answering tasks to test physical commonsense reasoning.
| `rte` | General Language Understanding Evaluation benchmark to test broad language abilities.
| `sciq` | Science Question Answering tasks to assess understanding of scientific concepts.
| `social_iqa` | Social Interaction Question Answering to evaluate common sense and social reasoning.
| `triviaqa` | A large-scale dataset for trivia question answering to test general knowledge.
| `truthfulqa_mc2` | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses.
| `wikitext` | Tasks based on text from Wikipedia articles to assess language modeling and generation.
| `winogrande` | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge.
| `wmdp_bio` | A benchmark where the objective is to minimize performance, based on potentially sensitive multiple-choice knowledge questions.
| `wsc273` | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution.
| `xlsum_es` | Collection of tasks in Spanish encompassing various evaluation areas.
| `xnli_tr` | Cross-Lingual Natural Language Inference to test understanding across different languages.
| `xwinograd_zh` | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages.
|===
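
As a rough illustration (not part of the documentation change itself), the sketch below shows how a couple of these Tier 1 task names could be evaluated locally, assuming the lm-evaluation-harness fork linked above keeps upstream's `lm_eval.simple_evaluate` Python API; the model name, few-shot count, and batch size are illustrative placeholders.

[source,python]
----
# Minimal sketch: run two Tier 1 tasks with the lm-evaluation-harness Python API.
# Assumes `lm_eval` (v0.4+) is installed; the model and settings are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face Transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # hypothetical example model
    tasks=["arc_easy", "hellaswag"],                 # Tier 1 task names from the table
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (e.g. acc, acc_norm) are keyed by task name.
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
----

Whatever entry point is used to launch the evaluation, the identifiers in the table above are the task names to supply.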
