|
= LM-Eval

image::lm-eval-architecture.svg[LM-Eval architecture diagram]

== LM-Eval Task Support
TrustyAI supports a subset of LMEval tasks to ensure reproducibility and reliability of the evaluation results. Tasks are categorized into three tiers based on our level of support: *Tier 1*, *Tier 2*, and *Tier 3*.

=== Tier 1 Tasks
These tasks are fully supported by TrustyAI with guaranteed fixes and maintenance. They have been tested, validated, and monitored in CI for reliability and reproducibility.footnote:[Tier 1 tasks were selected according to their presence on the Open LLM Leaderboard or their popularity (>10,000 downloads on Hugging Face).]

[cols="1,2a", options="header"]
|===
|Name |https://github.com/opendatahub-io/lm-evaluation-harness/tree/incubation/lm_eval/tasks[Task Group Description]
| `arc_easy` | Tasks involving complex reasoning over a diverse set of questions.
| `bbh` | Tasks focused on deep semantic understanding through hypothesization and reasoning.
| `bbh_fewshot_snarks` | Tasks focused on deep semantic understanding through hypothesization and reasoning.
| `belebele_ckb_Arab` | Language understanding tasks in a variety of languages and scripts.
| `cb` | A suite of challenging tasks designed to test a range of language understanding skills.
| `ceval-valid_law` | Tasks that evaluate language understanding and reasoning in an educational context.
| `commonsense_qa` | CommonsenseQA, a multiple-choice QA dataset for measuring commonsense knowledge.
| `gpqa_main_n_shot` | Graduate-level, Google-proof question answering tasks designed to test expert-domain knowledge.
| `gsm8k` | A benchmark of grade school math problems aimed at evaluating reasoning capabilities.
| `hellaswag` | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity.
| `humaneval` | Code generation tasks that measure functional correctness for synthesizing programs from docstrings.
| `ifeval` | Instruction-following evaluation tasks that test whether models satisfy verifiable instructions.
| `kmmlu_direct_law` | Knowledge-based multi-subject multiple choice questions for academic evaluation.
| `lambada_openai` | Tasks designed to predict the endings of text passages, testing language prediction skills.
| `lambada_standard` | Tasks designed to predict the endings of text passages, testing language prediction skills.
| `leaderboard_math_algebra_hard` | Task group used by Hugging Face's Open LLM Leaderboard v2. These tasks are static and will not change over time.
| `mbpp` | A benchmark designed to measure the ability to synthesize short Python programs from natural language descriptions.
| `minerva_math_precalc` | Mathematics-focused tasks requiring numerical reasoning and problem-solving skills.
| `mmlu_anatomy` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
| `mmlu_pro_law` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
| `mmlu_pro_plus_law` | Massive Multitask Language Understanding benchmark for broad domain language evaluation. Several variants are supported.
| `openbookqa` | Open-book question answering tasks that require external knowledge and reasoning.
| `piqa` | Physical Interaction Question Answering tasks to test physical commonsense reasoning.
| `rte` | General Language Understanding Evaluation benchmark to test broad language abilities.
| `sciq` | Science Question Answering tasks to assess understanding of scientific concepts.
| `social_iqa` | Social Interaction Question Answering to evaluate common sense and social reasoning.
| `triviaqa` | A large-scale dataset for trivia question answering to test general knowledge.
| `truthfulqa_mc2` | A QA task aimed at evaluating the truthfulness and factual accuracy of model responses.
| `wikitext` | Tasks based on text from Wikipedia articles to assess language modeling and generation.
| `winogrande` | A large-scale dataset for coreference resolution, inspired by the Winograd Schema Challenge.
| `wmdp_bio` | A benchmark with the objective of minimizing performance, based on potentially sensitive multiple-choice knowledge questions.
| `wsc273` | The Winograd Schema Challenge, a test of commonsense reasoning and coreference resolution.
| `xlsum_es` | Collection of tasks in Spanish encompassing various evaluation areas.
| `xnli_tr` | Cross-Lingual Natural Language Inference to test understanding across different languages.
| `xwinograd_zh` | Cross-lingual Winograd schema tasks for coreference resolution in multiple languages.
|===
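
The snippet below is a minimal sketch of how one of the Tier 1 tasks above can be run locally against the upstream `lm-evaluation-harness` Python entry point (`lm_eval.simple_evaluate`), for example to sanity-check a task before launching it through TrustyAI. The model name `EleutherAI/pythia-160m`, the batch size, and the `limit` value are illustrative placeholders, not TrustyAI defaults.

[source,python]
----
# Assumes a local install of the harness, e.g. `pip install lm-eval`.
import lm_eval

# Evaluate a small Hugging Face model on two Tier 1 tasks from the table above.
results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["arc_easy", "hellaswag"],                 # Tier 1 task names
    num_fewshot=0,
    batch_size=8,
    limit=100,  # evaluate only 100 documents per task for a quick smoke test
)

# Print the per-task metrics reported by the harness.
for task_name, metrics in results["results"].items():
    print(task_name, metrics)
----

Leave `limit` unset when producing numbers you intend to report; it is only useful for quickly verifying that a task runs end to end.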