Commit 17f92bf

Update README to better show capabilities of paperai, closes #84
1 parent 8a701f9 commit 17f92bf

4 files changed, +260 -102 lines changed

README.md

Lines changed: 104 additions & 11 deletions
@@ -29,20 +29,18 @@
2929

3030
-------------------------------------------------------------------------------------------------------------------------------------------------------
3131

32-
paperai is an AI application for medical and scientific papers.
32+
`paperai` is an AI application for medical and scientific papers.
3333

3434
![demo](https://raw.githubusercontent.com/neuml/paperai/master/demo.png)
3535

36-
Applications range from semantic search indexes that find matches for medical/scientific queries to full-fledged reporting applications powered by machine learning.
36+
⚡ Supercharge research tasks with AI-driven report generation. A `paperai` application goes through repositories of articles and generates bulk answers to questions backed by Large Language Model (LLM) prompts and Retrieval Augmented Generation (RAG) pipelines.
37+
38+
A `paperai` configuration file enables bulk LLM inference operations in a performant manner. Think of it like kicking off hundreds of ChatGPT prompts over your data.
3739

3840
![architecture](https://raw.githubusercontent.com/neuml/paperai/master/images/architecture.png#gh-light-mode-only)
3941
![architecture](https://raw.githubusercontent.com/neuml/paperai/master/images/architecture-dark.png#gh-dark-mode-only)
4042

41-
paperai and/or NeuML has been recognized in the following articles:
42-
43-
- [Machine-Learning Experts Delve Into 47,000 Papers on Coronavirus Family](https://www.wsj.com/articles/machine-learning-experts-delve-into-47-000-papers-on-coronavirus-family-11586338201)
44-
- [Data scientists assist medical researchers in the fight against COVID-19](https://cloud.google.com/blog/products/ai-machine-learning/how-kaggle-data-scientists-help-with-coronavirus)
45-
- [CORD-19 Kaggle Challenge Awards](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/161447)
43+
`paperai` can generate reports in Markdown and CSV, and can annotate answers directly on PDFs (when available).
4644

4745
## Installation
4846

@@ -88,6 +86,7 @@ The following notebooks and applications demonstrate the capabilities provided b
8886
| Notebook | Description | |
8987
|:----------|:-------------|------:|
9088
| [Introducing paperai](https://github.com/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) | Overview of the functionality provided by paperai | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) |
89+
| [Medical Research Project](https://github.com/neuml/paperai/blob/master/examples/02_Medical_Research_Project.ipynb) | Research young onset colon cancer | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/paperai/blob/master/examples/02_Medical_Research_Project.ipynb) |
9190

9291
### Applications
9392

@@ -126,28 +125,122 @@ paperai <path to model directory>
126125
127126
A prompt will come up. Queries can be typed directly into the console.
128127
128+
## Report schema
129+
130+
The following steps through an example `paperai` report configuration file and describes each section.
131+
132+
```yaml
133+
name: ColonCancer
134+
options:
135+
  llm: Intelligent-Internet/II-Medical-8B-1706-GGUF/II-Medical-8B-1706.Q4_K_M.gguf
136+
  system: You are a medical literature document parser. You extract fields from data.
137+
  template: |
138+
    Quickly extract the following field using the provided rules and context.
139+
140+
    Rules:
141+
    - Keep it simple, don't overthink it
142+
    - ONLY extract the data
143+
    - NEVER explain why the field is extracted
144+
    - NEVER restate the field name only give the field value
145+
    - Say no data if the field can't be found within the context
146+
147+
    Field:
148+
    {question}
149+
150+
    Context:
151+
    {context}
152+
153+
  context: 5
154+
  params:
155+
    maxlength: 4096
156+
    stripthink: True
157+
158+
Research:
159+
  query: colon cancer young adults
160+
  columns:
161+
    - name: Date
162+
    - name: Study
163+
    - name: Study Link
164+
    - name: Journal
165+
    - {name: Sample Size, query: number of patients, question: Sample Size}
166+
    - {name: Objective, query: objective, question: Study Objective}
167+
    - {name: Causes, query: possible causes, question: List of possible causes}
168+
    - {name: Detection, query: diagnosis, question: List of ways to diagnose}
```
170+
171+
### Configuration
172+
173+
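The following shows the configuration options. `name`, `options` and the report sections are top-level keys; `query` and `columns` are set within each report section.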
174+
175+
| Field | Description |
176+
|:------------ |:-------------|
177+
| name | Report name |
178+
| options | RAG pipeline options - set the LLM, prompt templates, max length and more|
179+
| report | Each unique top level parameter sets the report name. In the example above, it's called `Research` |
180+
| query | Vector query that identifies the top n documents |
181+
| columns | List of columns |
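
As a rough illustration of the layout described in the table, the sketch below separates the reserved `name` and `options` keys from the report sections of an already-parsed configuration. The `split_config` helper and the inline dictionary are hypothetical and written only for this example; they are not part of the `paperai` API.

```python
# Hypothetical helper, not part of paperai: splits a parsed configuration
# into its reserved keys and its report sections.
RESERVED = {"name", "options"}

def split_config(config):
    """Return (name, options, reports) for a parsed report configuration."""
    name = config.get("name")
    options = config.get("options", {})
    # Every other top-level key is treated as a report section
    reports = {key: value for key, value in config.items() if key not in RESERVED}
    return name, options, reports

# Dictionary mirroring part of the example configuration, as a YAML parser might load it
config = {
    "name": "ColonCancer",
    "options": {"context": 5, "params": {"maxlength": 4096, "stripthink": True}},
    "Research": {
        "query": "colon cancer young adults",
        "columns": [{"name": "Date"}, {"name": "Study"}],
    },
}

name, options, reports = split_config(config)
print(name, list(reports))
```

Under this reading, the example configuration has a single report section, `Research`, which carries its own `query` and `columns`.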
182+
183+
### Standard columns
184+
185+
Standard columns copy fields from the article data store metadata directly into a report. Set the column `name` to one of the values below.
186+
187+
| Field | Description |
188+
|:------------ |:-------------|
189+
| Id | Article unique identifier |
190+
| Date | Article publication date |
191+
| Study | Title of the article |
192+
| Study Link | HTTP link to the study |
193+
| Journal | Publication name |
194+
| Source | Data source name |
195+
| Entry | Article entry date |
196+
| Matches | Sections that caused this article to match the report query |
197+
198+
### Generated columns
199+
200+
The most novel feature of `paperai` is the ability to generate dynamic columns driven by a RAG pipeline. Each generated column takes the following parameters.
201+
202+
| Parameter | Description |
203+
|:------------ |:-------------|
204+
| name | Column name |
205+
| query | Search/similarity query |
205+
| question | LLM question parameter |
207+
208+
For each matching article, the `query` sorts that article's sections by relevance. Depending on the embeddings index configuration, this can be a vector, keyword or hybrid query. The `question` is plugged into the RAG pipeline template along with the top n matching context elements from the query. The generated column is stored as `name` in the report output.
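
The flow above can be sketched in a few lines of plain Python. This is only an illustration of the idea: a keyword-overlap score stands in for the embeddings query, no LLM is called, and the `score` and `build_prompt` helpers, the shortened template and the sample sections are all made up for this example rather than taken from `paperai`.

```python
# Illustrative only: a keyword-overlap scorer stands in for the embeddings
# query, and no LLM is called. None of these helpers are paperai functions.
TEMPLATE = """Field:
{question}

Context:
{context}"""

def score(query, section):
    """Count how many query terms appear in the section (stand-in for similarity)."""
    words = set(section.lower().split())
    return sum(1 for term in query.lower().split() if term in words)

def build_prompt(sections, query, question, n):
    """Rank sections by relevance to the query and fill the template with the top n."""
    ranked = sorted(sections, key=lambda s: score(query, s), reverse=True)
    context = "\n".join(ranked[:n])
    return TEMPLATE.format(question=question, context=context)

# Made-up sections for a single article
sections = [
    "The study was funded by a national grant.",
    "A total number of 120 patients were enrolled.",
    "Patients were followed for two years.",
]

prompt = build_prompt(sections, query="number of patients", question="Sample Size", n=2)
print(prompt)
```

In a real run, the filled prompt would be sent to the configured LLM and the response stored under the column `name`, for example `Sample Size`.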
209+
129210
## Building a report file
130211

131-
Reports support generating output in multiple formats. An example report call:
212+
Reports can generate output in multiple formats. An example report call:
132213

133214
```
134-
python -m paperai.report report.yml 50 md <path to model directory>
215+
python -m paperai.report crc.yml 10 csv <path to model directory>
135216
```
136217

218+
In the example above, a file named `Research.csv` will be created with the top 10 most relevant articles.
219+
137220
The following report formats are supported:
138221

139222
- Markdown (Default) - Renders a Markdown report. Columns and answers are extracted from articles with the results stored in a Markdown file.
140223
- CSV - Renders a CSV report. Columns and answers are extracted from articles with the results stored in a CSV file.
141224
- Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files.
142225

143-
In the example above, a file named report.md will be created. Example report configuration files can be found [here](https://github.com/neuml/cord19q/tree/master/tasks).
226+
See the [examples](https://github.com/neuml/paperai/tree/master/examples) directory for report examples. Additional historical report configuration files can be found [here](https://github.com/neuml/cord19q/tree/master/tasks).
144227

145228
## Tech Overview
146229

147-
paperai is a combination of a [txtai](https://github.com/neuml/txtai) embeddings index and a SQLite database with the articles. Each article is parsed into sentences and stored in SQLite along with the article metadata. Embeddings are built over the full corpus.
230+
paperai is a combination of a [txtai](https://github.com/neuml/txtai) embeddings index, a SQLite database with the articles and an LLM. These components are joined together in a [txtai RAG pipeline](https://neuml.github.io/txtai/pipeline/text/rag/).
231+
232+
Each article is parsed into sections and stored in a data store along with the article metadata. Embeddings are built over the full corpus. The LLM analyzes context-limited requests and generates outputs.
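
To make the storage idea concrete, here is a minimal sketch of articles and their sections held in SQLite, using Python's standard `sqlite3` module. The table names, columns and sample rows are assumptions chosen for illustration; the actual `paperai` data store schema may differ.

```python
import sqlite3

# Hypothetical schema for illustration; paperai's actual tables may differ.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT, published TEXT)")
db.execute("CREATE TABLE sections (id INTEGER PRIMARY KEY, article TEXT, text TEXT)")

# One made-up article with two sections
db.execute("INSERT INTO articles VALUES (?, ?, ?)", ("a1", "Example study", "2024-01-01"))
db.executemany(
    "INSERT INTO sections (article, text) VALUES (?, ?)",
    [("a1", "Background text."), ("a1", "Methods text.")],
)

# Join sections back to their article metadata
rows = db.execute(
    "SELECT a.title, s.text FROM sections s JOIN articles a ON a.id = s.article ORDER BY s.id"
).fetchall()
print(rows)
```

Keeping metadata and sections in separate tables lets a report copy standard fields from the article row while the embeddings index and LLM work over the section text.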
148233

149234
Multiple entry points exist to interact with the model.
150235

151236
- paperai.report - Builds a report for a series of queries. For each query, the top scoring articles are shown along with matches from those articles. There is also a highlights section showing the most relevant results.
152237
- paperai.query - Runs a single query from the terminal
153238
- paperai.shell - Allows running multiple queries from the terminal
239+
240+
## Recognition
241+
242+
paperai and/or NeuML has been recognized in the following articles.
243+
244+
- [Machine-Learning Experts Delve Into 47,000 Papers on Coronavirus Family](https://www.wsj.com/articles/machine-learning-experts-delve-into-47-000-papers-on-coronavirus-family-11586338201)
245+
- [Data scientists assist medical researchers in the fight against COVID-19](https://cloud.google.com/blog/products/ai-machine-learning/how-kaggle-data-scientists-help-with-coronavirus)
246+
- [CORD-19 Kaggle Challenge Awards](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/161447)

images/architecture-dark.png

-4.56 KB