|
29 | 29 |
|
30 | 30 | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
31 | 31 |
|
32 | | -paperai is an AI application for medical and scientific papers. |
| 32 | +`paperai` is an AI application for medical and scientific papers. |
33 | 33 |
|
34 | 34 |  |
35 | 35 |
|
36 | | -Applications range from semantic search indexes that find matches for medical/scientific queries to full-fledged reporting applications powered by machine learning. |
| 36 | +⚡ Supercharge research tasks with AI-driven report generation. A `paperai` application goes through repositories of articles and generates bulk answers to questions backed by Large Language Model (LLM) prompts and Retrieval Augmented Generation (RAG) pipelines. |
| 37 | + |
| 38 | +A `paperai` configuration file enables bulk LLM inference operations in a performant manner. Think of it like kicking off hundreds of ChatGPT prompts over your data. |
37 | 39 |
|
38 | 40 |  |
39 | 41 |  |
40 | 42 |
|
41 | | -paperai and/or NeuML has been recognized in the following articles: |
42 | | - |
43 | | -- [Machine-Learning Experts Delve Into 47,000 Papers on Coronavirus Family](https://www.wsj.com/articles/machine-learning-experts-delve-into-47-000-papers-on-coronavirus-family-11586338201) |
44 | | -- [Data scientists assist medical researchers in the fight against COVID-19](https://cloud.google.com/blog/products/ai-machine-learning/how-kaggle-data-scientists-help-with-coronavirus) |
45 | | -- [CORD-19 Kaggle Challenge Awards](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/161447) |
| 43 | +`paperai` can generate reports in Markdown and CSV, and can annotate answers directly on PDFs (when available). |
46 | 44 |
|
47 | 45 | ## Installation |
48 | 46 |
|
@@ -88,6 +86,7 @@ The following notebooks and applications demonstrate the capabilities provided b |
88 | 86 | | Notebook | Description | | |
89 | 87 | |:----------|:-------------|------:| |
90 | 88 | | [Introducing paperai](https://github.com/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) | Overview of the functionality provided by paperai | [](https://colab.research.google.com/github/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) | |
| 89 | +| [Medical Research Project](https://github.com/neuml/paperai/blob/master/examples/02_Medical_Research_Project.ipynb) | Research young onset colon cancer | [](https://colab.research.google.com/github/neuml/paperai/blob/master/examples/02_Medical_Research_Project.ipynb) | |
91 | 90 |
|
92 | 91 | ### Applications |
93 | 92 |
|
@@ -126,28 +125,122 @@ paperai <path to model directory> |
126 | 125 |
|
127 | 126 | A prompt will come up. Queries can be typed directly into the console. |
128 | 127 |
|
| 128 | +## Report schema |
| 129 | +
|
| 130 | +The following steps through an example `paperai` report configuration file and describes each section. |
| 131 | +
|
| 132 | +```yaml |
| 133 | +name: ColonCancer |
| 134 | +options: |
| 135 | + llm: Intelligent-Internet/II-Medical-8B-1706-GGUF/II-Medical-8B-1706.Q4_K_M.gguf |
| 136 | + system: You are a medical literature document parser. You extract fields from data. |
| 137 | + template: | |
| 138 | + Quickly extract the following field using the provided rules and context. |
| 139 | +
|
| 140 | + Rules: |
| 141 | + - Keep it simple, don't overthink it |
| 142 | + - ONLY extract the data |
| 143 | + - NEVER explain why the field is extracted |
| 144 | + - NEVER restate the field name only give the field value |
| 145 | + - Say no data if the field can't be found within the context |
| 146 | +
|
| 147 | + Field: |
| 148 | + {question} |
| 149 | +
|
| 150 | + Context: |
| 151 | + {context} |
| 152 | +
|
| 153 | + context: 5 |
| 154 | + params: |
| 155 | + maxlength: 4096 |
| 156 | + stripthink: True |
| 157 | +
|
| 158 | +Research: |
| 159 | + query: colon cancer young adults |
| 160 | + columns: |
| 161 | + - name: Date |
| 162 | + - name: Study |
| 163 | + - name: Study Link |
| 164 | + - name: Journal |
| 165 | + - {name: Sample Size, query: number of patients, question: Sample Size} |
| 166 | + - {name: Objective, query: objective, question: Study Objective} |
| 167 | + - {name: Causes, query: possible causes, question: List of possible causes} |
| 168 | + - {name: Detection, query: diagnosis, question: List of ways to diagnose} |
| 169 | +``` |
| 170 | + |
| 171 | +### Configuration |
| 172 | + |
| 173 | +The following table describes the top-level configuration options. |
| 174 | + |
| 175 | +| Field | Description | |
| 176 | +|:------------ |:-------------| |
| 177 | +| name | Report name | |
| 178 | +| options | RAG pipeline options - set the LLM, prompt templates, max length and more | |
| 179 | +| report | Each unique top level parameter sets the report name. In the example above, it's called `Research` | |
| 180 | +| query | Vector query that identifies the top n documents | |
| 181 | +| columns | List of columns | |
| 182 | + |
| 183 | +### Standard columns |
| 184 | + |
| 185 | +Standard columns copy fields from the article metadata in the data store directly into a report. Set the column `name` to one of the values below. |
| 186 | + |
| 187 | +| Field | Description | |
| 188 | +|:------------ |:-------------| |
| 189 | +| Id | Article unique identifier | |
| 190 | +| Date | Article publication date | |
| 191 | +| Study | Title of the article | |
| 192 | +| Study Link | HTTP link to the study | |
| 193 | +| Journal | Publication name | |
| 194 | +| Source | Data source name | |
| 195 | +| Entry | Article entry date | |
| 196 | +| Matches | Sections that caused this article to match the report query | |
| 197 | + |
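As a rough illustration of how the column list in the example configuration divides into the two kinds of columns, the sketch below separates standard columns from generated ones. The `split_columns` helper and its validation rule are hypothetical and are not part of `paperai`; the only details taken from this README are the standard column names and the `name`, `query` and `question` keys.

```python
# Standard column names, copied from the table above
STANDARD = {"Id", "Date", "Study", "Study Link", "Journal", "Source", "Entry", "Matches"}

def split_columns(columns):
    """Hypothetical helper: returns (standard, generated) lists of column names."""
    standard, generated = [], []
    for column in columns:
        # Assumption: any column carrying a query or question is a generated column
        if "query" in column or "question" in column:
            generated.append(column["name"])
        elif column["name"] in STANDARD:
            standard.append(column["name"])
        else:
            # Illustrative check only, paperai may handle unknown names differently
            raise ValueError(f"Unknown standard column: {column['name']}")
    return standard, generated

# Subset of the columns from the example configuration, written as Python dicts
columns = [
    {"name": "Date"},
    {"name": "Study"},
    {"name": "Study Link"},
    {"name": "Journal"},
    {"name": "Sample Size", "query": "number of patients", "question": "Sample Size"},
    {"name": "Objective", "query": "objective", "question": "Study Objective"},
]

standard, generated = split_columns(columns)
print("Standard:", standard)
print("Generated:", generated)
```
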
| 198 | +### Generated columns |
| 199 | + |
| 200 | +The most novel feature of `paperai` is the ability to generate dynamic columns driven by a RAG pipeline. Each generated column takes the following parameters. |
| 201 | + |
| 202 | +| Parameter | Description | |
| 203 | +|:------------ |:-------------| |
| 204 | +| name | Column name | |
| 205 | +| query | Search/similarity query that ranks article sections | |
| 206 | +| question | Question inserted into the LLM prompt template | |
| 207 | + |
| 208 | +For each matching article, the `query` ranks that article's sections by relevance. It can run as a vector, keyword or hybrid query, depending on the embeddings index configuration. The `question` is inserted into the RAG pipeline template as `{question}`, along with the top n matching sections as `{context}` (n is set by the `context` option, 5 in the example above). The generated column is stored as `name` in the report output. |
| 209 | + |
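The flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how `paperai` or txtai implement it: the `score` function is a toy keyword-overlap stand-in for the embeddings index, the `build_prompt` helper is hypothetical, the section strings are placeholder text, and no LLM is called. Only the `{question}` and `{context}` placeholders and the top n idea come from the configuration shown earlier.

```python
# Shortened version of the prompt template from the example configuration
TEMPLATE = """Field:
{question}

Context:
{context}"""

def score(query, section):
    """Toy relevance score: fraction of query terms found in the section.
    Stand-in for a vector, keyword or hybrid search."""
    terms = query.lower().split()
    words = set(section.lower().split())
    return sum(term in words for term in terms) / len(terms)

def build_prompt(sections, query, question, topn=5):
    """Hypothetical helper: ranks sections by the query, keeps the top n
    and fills the template with the question and that context."""
    ranked = sorted(sections, key=lambda section: score(query, section), reverse=True)
    context = "\n".join(ranked[:topn])
    return TEMPLATE.format(question=question, context=context)

# Placeholder sections for a single article (not real data)
sections = [
    "Section about the number of patients enrolled",
    "Section describing the study objective",
    "Section listing funding sources",
]

# Mirrors the Sample Size column: query ranks the sections, question fills the prompt
prompt = build_prompt(sections, query="number of patients", question="Sample Size", topn=1)
print(prompt)
```

In a real run, the filled prompt would be sent to the configured LLM and the response stored under the column `name`.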
129 | 210 | ## Building a report file |
130 | 211 |
|
131 | | -Reports support generating output in multiple formats. An example report call: |
| 212 | +Reports can generate output in multiple formats. An example report call: |
132 | 213 |
|
133 | 214 | ``` |
134 | | -python -m paperai.report report.yml 50 md <path to model directory> |
| 215 | +python -m paperai.report crc.yml 10 csv <path to model directory> |
135 | 216 | ``` |
136 | 217 |
|
| 218 | +The example above creates a file named `Research.csv` containing the 10 most relevant articles. |
| 219 | + |
137 | 220 | The following report formats are supported: |
138 | 221 |
|
139 | 222 | - Markdown (Default) - Renders a Markdown report. Columns and answers are extracted from articles with the results stored in a Markdown file. |
140 | 223 | - CSV - Renders a CSV report. Columns and answers are extracted from articles with the results stored in a CSV file. |
141 | 224 | - Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files. |
142 | 225 |
|
143 | | -In the example above, a file named report.md will be created. Example report configuration files can be found [here](https://github.com/neuml/cord19q/tree/master/tasks). |
| 226 | +See the [examples](https://github.com/neuml/paperai/tree/master/examples) directory for report examples. Additional historical report configuration files can be found [here](https://github.com/neuml/cord19q/tree/master/tasks). |
144 | 227 |
|
145 | 228 | ## Tech Overview |
146 | 229 |
|
147 | | -paperai is a combination of a [txtai](https://github.com/neuml/txtai) embeddings index and a SQLite database with the articles. Each article is parsed into sentences and stored in SQLite along with the article metadata. Embeddings are built over the full corpus. |
| 230 | +paperai is a combination of a [txtai](https://github.com/neuml/txtai) embeddings index, a SQLite database with the articles and an LLM. These components are joined together in a [txtai RAG pipeline](https://neuml.github.io/txtai/pipeline/text/rag/). |
| 231 | + |
| 232 | +Each article is parsed into sections and stored in a data store along with the article metadata. Embeddings are built over the full corpus. The LLM analyzes context-limited requests and generates outputs. |
148 | 233 |
|
149 | 234 | Multiple entry points exist to interact with the model. |
150 | 235 |
|
151 | 236 | - paperai.report - Builds a report for a series of queries. For each query, the top scoring articles are shown along with matches from those articles. There is also a highlights section showing the most relevant results. |
152 | 237 | - paperai.query - Runs a single query from the terminal |
153 | 238 | - paperai.shell - Allows running multiple queries from the terminal |
| 239 | + |
| 240 | +## Recognition |
| 241 | + |
| 242 | +paperai and/or NeuML has been recognized in the following articles. |
| 243 | + |
| 244 | +- [Machine-Learning Experts Delve Into 47,000 Papers on Coronavirus Family](https://www.wsj.com/articles/machine-learning-experts-delve-into-47-000-papers-on-coronavirus-family-11586338201) |
| 245 | +- [Data scientists assist medical researchers in the fight against COVID-19](https://cloud.google.com/blog/products/ai-machine-learning/how-kaggle-data-scientists-help-with-coronavirus) |
| 246 | +- [CORD-19 Kaggle Challenge Awards](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/161447) |