neuml
diff --git a/README.md b/README.md
Lines changed: 104 additions & 11 deletions
diff --git a/images/architecture-dark.png b/images/architecture-dark.png
-4.56 KB
@@ -29,20 +29,18 @@
 
 -------------------------------------------------------------------------------------------------------------------------------------------------------
 
-paperai is an AI application for medical and scientific papers.
+`paperai` is an AI application for medical and scientific papers.
 
 ![demo](https://raw.githubusercontent.com/neuml/paperai/master/demo.png)
 
-Applications range from semantic search indexes that find matches for medical/scientific queries to full-fledged reporting applications powered by machine learning.
+⚡ Supercharge research tasks with AI-driven report generation. A `paperai` application goes through repositories of articles and generates bulk answers to questions backed by Large Language Model (LLM) prompts and Retrieval Augmented Generation (RAG) pipelines.
+
+A `paperai` configuration file enables bulk LLM inference operations in a performant manner. Think of it like kicking off hundreds of ChatGPT prompts over your data.
 
 ![architecture](https://raw.githubusercontent.com/neuml/paperai/master/images/architecture.png#gh-light-mode-only)
 ![architecture](https://raw.githubusercontent.com/neuml/paperai/master/images/architecture-dark.png#gh-dark-mode-only)
 
-paperai and/or NeuML has been recognized in the following articles:
-
-- [Machine-Learning Experts Delve Into 47,000 Papers on Coronavirus Family](https://www.wsj.com/articles/machine-learning-experts-delve-into-47-000-papers-on-coronavirus-family-11586338201)
-- [Data scientists assist medical researchers in the fight against COVID-19](https://cloud.google.com/blog/products/ai-machine-learning/how-kaggle-data-scientists-help-with-coronavirus)
-- [CORD-19 Kaggle Challenge Awards](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/161447)
+`paperai` can generate reports in Markdown and CSV, and can annotate answers directly on PDFs (when available).
 
 ## Installation
 
@@ -88,6 +86,7 @@ The following notebooks and applications demonstrate the capabilities provided b
 | Notebook  | Description  |       |
 |:----------|:-------------|------:|
 | [Introducing paperai](https://github.com/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) | Overview of the functionality provided by paperai | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/paperai/blob/master/examples/01_Introducing_paperai.ipynb) |
+| [Medical Research Project](https://github.com/neuml/paperai/blob/master/examples/02_Medical_Research_Project.ipynb) | Research young onset colon cancer | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/paperai/blob/master/examples/02_Medical_Research_Project.ipynb) |
 
 ### Applications
 
@@ -126,28 +125,122 @@ paperai <path to model directory>
 
 A prompt will come up. Queries can be typed directly into the console.
 
+## Report schema
+
+The following steps through an example `paperai` report configuration file and describes each section.
+
+```yaml
+name: ColonCancer
+options:
+    llm: Intelligent-Internet/II-Medical-8B-1706-GGUF/II-Medical-8B-1706.Q4_K_M.gguf
+    system: You are a medical literature document parser. You extract fields from data.
+    template: |
+        Quickly extract the following field using the provided rules and context.
+
+        Rules:
+          - Keep it simple, don't overthink it
+          - ONLY extract the data
+          - NEVER explain why the field is extracted
+          - NEVER restate the field name only give the field value
+          - Say no data if the field can't be found within the context
+
+        Field:
+        {question}
+
+        Context:
+        {context}
+
+    context: 5
+    params:
+        maxlength: 4096
+        stripthink: True
+
+Research:
+    query: colon cancer young adults
+    columns:
+        - name: Date
+        - name: Study
+        - name: Study Link
+        - name: Journal
+        - {name: Sample Size, query: number of patients, question: Sample Size}
+        - {name: Objective, query: objective, question: Study Objective}
+        - {name: Causes, query: possible causes, question: List of possible causes}
+        - {name: Detection, query: diagnosis, question: List of ways to diagnose}
+```
+
+### Configuration
+
+The following table describes the configuration fields. `name`, `options` and the report section are top-level keys. `query` and `columns` are set inside each report section.
+
+| Field  | Description  |
+|:------------ |:-------------|
+| name | Report name |
+| options | RAG pipeline options - set the LLM, prompt templates, max length and more |
+| report | Each additional top-level key defines a report section, and the key sets the report name. In the example above, it's called `Research` |
+| query | Vector query that identifies the top n documents |
+| columns | List of columns |
+
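As a rough illustration of the layout described above, the sketch below separates the shared `options` from the report sections. It is not taken from the `paperai` code base: `split_config` and `RESERVED` are hypothetical names, and the dictionary stands in for the parsed YAML, abbreviated from the example configuration.

```python
# Stand-in for the example configuration after YAML parsing (abbreviated).
config = {
    "name": "ColonCancer",
    "options": {"context": 5, "params": {"maxlength": 4096, "stripthink": True}},
    "Research": {
        "query": "colon cancer young adults",
        "columns": [
            {"name": "Date"},
            {"name": "Study"},
            {"name": "Sample Size", "query": "number of patients", "question": "Sample Size"},
        ],
    },
}

# Top-level keys with a fixed meaning, per the table above.
RESERVED = {"name", "options"}


def split_config(config):
    """Hypothetical helper: separate shared options from report sections."""
    options = config.get("options", {})
    # Every remaining top-level key is treated as a report section.
    reports = {key: value for key, value in config.items() if key not in RESERVED}
    return options, reports


options, reports = split_config(config)
for report_name, section in reports.items():
    print(report_name, section["query"], len(section["columns"]))
```

Treating every non-reserved key as a report section follows the table's description of the `report` field. How `paperai` itself validates the file may differ.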
+### Standard columns
+
+Standard columns copy fields from the article metadata in the data store directly into a report. Set the column `name` to one of the values below.
+
+| Field  | Description  |
+|:------------ |:-------------|
+| Id | Article unique identifier |
+| Date | Article publication date |
+| Study | Title of the article |
+| Study Link | HTTP link to the study | 
+| Journal | Publication name | 
+| Source | Data source name | 
+| Entry | Article entry date |
+| Matches | Sections that caused this article to match the report query | 
+
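The distinction between the two column types can be sketched as follows. This is an assumption-labeled illustration, not `paperai`'s own logic: `classify` and `STANDARD_COLUMNS` are hypothetical names, and the rule that any entry carrying `query` or `question` counts as generated is inferred from the example configuration and the next section.

```python
# Standard column names, copied from the table above.
STANDARD_COLUMNS = {
    "Id", "Date", "Study", "Study Link", "Journal", "Source", "Entry", "Matches",
}


def classify(column):
    """Hypothetical helper: label a column entry as standard or generated."""
    # Assumption: entries with a query or question are driven by the RAG pipeline.
    if "query" in column or "question" in column:
        return "generated"
    if column.get("name") in STANDARD_COLUMNS:
        return "standard"
    raise ValueError(f"Unknown column: {column!r}")


columns = [
    {"name": "Date"},
    {"name": "Study Link"},
    {"name": "Objective", "query": "objective", "question": "Study Objective"},
]

kinds = {column["name"]: classify(column) for column in columns}
print(kinds)
```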
+### Generated columns
+
+The most novel feature of `paperai` is the ability to generate dynamic columns driven by a RAG pipeline. Each generated column takes the following parameters.
+
+| Parameter  | Description  |
+|:------------ |:-------------|
+| name | Column name |
+| query | Search/similarity query used to rank article sections |
+| question | Question inserted into the LLM prompt template |
+
+For each matching article, the `query` ranks the article's sections by relevance. Depending on the embeddings index configuration, this runs as a vector, keyword or hybrid query. The `question` is inserted into the RAG pipeline template along with the top n matching context elements from the query. The generated value is stored under `name` in the report output.
+
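The prompt assembly for a generated column can be sketched as below. In `paperai` this step is handled by the txtai RAG pipeline; the code here only imitates it. `build_prompt` is a hypothetical helper, the template is shortened from the example configuration, the section strings are placeholders, and joining context elements with newlines and truncating to `limit` (assumed to correspond to `context: 5`) are assumptions rather than confirmed behavior.

```python
# Shortened version of the template from the example configuration.
TEMPLATE = """Quickly extract the following field using the provided rules and context.

Field:
{question}

Context:
{context}
"""


def build_prompt(template, question, sections, limit=5):
    """Hypothetical helper: fill the template for one generated column."""
    # sections are assumed to be sorted by relevance to the column query already.
    context = "\n".join(sections[:limit])
    return template.format(question=question, context=context)


# Placeholder text standing in for ranked article sections.
ranked_sections = [
    "Placeholder section 1 (most relevant)",
    "Placeholder section 2",
    "Placeholder section 3 (least relevant)",
]

prompt = build_prompt(TEMPLATE, "Study Objective", ranked_sections, limit=2)
print(prompt)
```

With `limit=2`, only the first two sections reach the prompt, which shows why the column `query` matters: it decides which parts of an article the LLM sees.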
 ## Building a report file
 
-Reports support generating output in multiple formats. An example report call:
+Reports can generate output in multiple formats. An example report call:
 
 ```
-python -m paperai.report report.yml 50 md <path to model directory>
+python -m paperai.report crc.yml 10 csv <path to model directory>
 ```
 
+The example above creates a file named `Research.csv` containing the 10 most relevant articles.
+
 The following report formats are supported:
 
 - Markdown (Default) - Renders a Markdown report. Columns and answers are extracted from articles with the results stored in a Markdown file.
 - CSV - Renders a CSV report. Columns and answers are extracted from articles with the results stored in a CSV file.
 - Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files.
 
-In the example above, a file named report.md will be created. Example report configuration files can be found [here](https://github.com/neuml/cord19q/tree/master/tasks).
+See the [examples](https://github.com/neuml/paperai/tree/master/examples) directory for report examples. Additional historical report configuration files can be found [here](https://github.com/neuml/cord19q/tree/master/tasks).
 
 ## Tech Overview
 
-paperai is a combination of a [txtai](https://github.com/neuml/txtai) embeddings index and a SQLite database with the articles. Each article is parsed into sentences and stored in SQLite along with the article metadata. Embeddings are built over the full corpus.
+paperai is a combination of a [txtai](https://github.com/neuml/txtai) embeddings index, a SQLite database with the articles and an LLM. These components are joined together in a [txtai RAG pipeline](https://neuml.github.io/txtai/pipeline/text/rag/).
+
+Each article is parsed into sections and stored in a data store along with the article metadata. Embeddings are built over the full corpus. The LLM receives prompts limited to the retrieved context and generates the outputs.
 
 Multiple entry points exist to interact with the model.
 
 - paperai.report - Builds a report for a series of queries. For each query, the top scoring articles are shown along with matches from those articles. There is also a highlights section showing the most relevant results.
 - paperai.query - Runs a single query from the terminal
 - paperai.shell - Allows running multiple queries from the terminal
+
+## Recognition
+
+paperai and/or NeuML has been recognized in the following articles.
+
+- [Machine-Learning Experts Delve Into 47,000 Papers on Coronavirus Family](https://www.wsj.com/articles/machine-learning-experts-delve-into-47-000-papers-on-coronavirus-family-11586338201)
+- [Data scientists assist medical researchers in the fight against COVID-19](https://cloud.google.com/blog/products/ai-machine-learning/how-kaggle-data-scientists-help-with-coronavirus)
+- [CORD-19 Kaggle Challenge Awards](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/discussion/161447)