Commit b7926f6

update docs
1 parent 1876b75 commit b7926f6

10 files changed

+598
-238
lines changed

README.md

Lines changed: 44 additions & 50 deletions
@@ -1,4 +1,4 @@
1-
# augmenta
1+
# Augmenta
22

33
[![lifecycle](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental)
44
[![PyPI](https://img.shields.io/pypi/v/augmenta.svg)](https://pypi.org/project/augmenta/)
@@ -10,9 +10,9 @@ Augmenta is an AI agent for enhancing datasets with information from the interne
1010

1111
## Why?
1212

13-
Large Language Models (LLMs) can be powerful tools for processing a lot of information very quickly. However, they don't do it entirely accurately. LLMs are prone to [hallucinations](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)), making them unreliable sources of truth, particularly when it comes to tasks that require domain-specific knowledge.
13+
Large Language Models (LLMs) can be powerful tools for processing large volumes of information very quickly. However, they are prone to [hallucinations](https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)), making them unreliable sources of truth, particularly when it comes to tasks that require domain-specific knowledge.
1414

15-
Augmenta aims to address this shortcoming by allowing LLMs to search and browse the internet for information. This technique is known as "search-based [Retrieval-Augmented Generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation)", or "[grounding](https://techcommunity.microsoft.com/blog/fasttrackforazureblog/grounding-llms/3843857)", and has been shown to significantly improve output quality. It does not, however, eliminate the risk of hallucinations entirely, so you should always verify the results before publishing them.
15+
Augmenta aims to address this shortcoming by "[grounding](https://techcommunity.microsoft.com/blog/fasttrackforazureblog/grounding-llms/3843857)" LLMs with information from the internet. This has been shown to significantly improve output quality. It does not, however, eliminate hallucinations entirely, so you should always verify the results before publishing them.
1616

1717
## Installation
1818

@@ -23,15 +23,18 @@ Augmenta aims to address this shortcoming by allowing LLMs to search and browse
2323
If you're using [uv](https://docs.astral.sh/uv/), open your terminal and run the following command to install Augmenta:
2424

2525
```bash
26-
uvx install git+https://github.com/Global-Witness/augmenta.git
26+
uvx install augmenta
2727
```
2828

2929
You may wish to do this in a virtual environment to avoid conflicts with other Python packages. This will limit Augmenta's scope to the current directory.
3030

3131
```bash
3232
uv venv
33-
source .venv/bin/activate # On Windows: .venv\Scripts\activate
34-
uv pip install git+https://github.com/Global-Witness/augmenta.git
33+
# on Linux/macOS
34+
source .venv/bin/activate
35+
# on Windows
36+
.venv\Scripts\activate
37+
uv pip install augmenta
3538
```
3639

3740
</details>
@@ -45,15 +48,18 @@ First, make sure you have Python 3.10 or later and [`pipx`](https://pipx.pypa.io
4548
Then, open your terminal and run the following command to install Augmenta:
4649

4750
```bash
48-
pipx install git+https://github.com/Global-Witness/augmenta.git
51+
pipx install augmenta
4952
```
5053

5154
You may wish to do this in a virtual environment to avoid conflicts with other Python packages. This will limit Augmenta's scope to the current directory.
5255

5356
```bash
5457
python -m venv .venv
55-
source .venv/bin/activate # On Windows: .venv\Scripts\activate
56-
pip install git+https://github.com/Global-Witness/augmenta.git
58+
# on Linux/macOS
59+
source .venv/bin/activate
60+
# on Windows
61+
.venv\Scripts\activate
62+
pip install augmenta
5763
```
5864

5965
</details>
@@ -62,31 +68,36 @@ pip install git+https://github.com/Global-Witness/augmenta.git
6268
## Usage
6369

6470
> [!TIP]
65-
> If you would rather follow an example, [go here](https://github.com/Global-Witness/orcl?tab=readme-ov-file#augmenta).
71+
> If you would rather follow an example, [go here](https://github.com/Global-Witness/augmenta/tree/main/docs/examples/donations).
6672
67-
Start by creating a new directory for your project. This will contain all your data, configuration files, as well as some temporary files that Augmenta will create while it runs.
73+
Each Augmenta project is a self-contained directory containing all the files needed to make it run:
74+
75+
- **input data**: a CSV file in a [tidy format](https://research-hub.auckland.ac.nz/managing-research-data/organising-and-describing-data/tidy-data), where each row is an entity you want to process (eg. company), and each column is a different attribute of that entity (eg. industry, address, revenue, etc.)
76+
- **configuration file**: a YAML file that tells Augmenta how to process your data (see below)
77+
- **credentials**: a `.env` file containing your API keys (see below)
78+
- **cache**: Augmenta will automatically create some cache files while it runs, which you can ignore
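
Before running anything, it can help to confirm that the input file really is tidy. The following is a minimal sketch using only the Python standard library, not part of Augmenta itself; the `check_tidy` helper and the sample rows are hypothetical, and the column names simply mirror the `{{DonorName}}` and `{{DonorStatus}}` placeholders used in the prompt.

```python
import csv
import io

# Hypothetical sample data; the column names mirror the prompt placeholders.
SAMPLE = """DonorName,DonorStatus
Unite the Union,Trade Union
Google UK Limited,Company
"""


def check_tidy(text: str) -> list[dict[str, str]]:
    """Parse CSV text and verify that every row has exactly one cell per column."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    for number, row in enumerate(rows, start=2):  # line 1 is the header
        # DictReader stores surplus cells under the key None
        # and fills missing cells with the value None.
        if None in row or any(value is None for value in row.values()):
            raise ValueError(f"line {number} does not match the header")
    return rows


rows = check_tidy(SAMPLE)
print(len(rows), list(rows[0]))
```

A row with too many or too few cells raises a `ValueError` naming the offending line, which is usually easier to fix before a run than after it.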
6879

69-
Copy the data you want processed into this directory (or a subdirectory). Augmenta currently supports CSV files in a [tidy format](https://research-hub.auckland.ac.nz/managing-research-data/organising-and-describing-data/tidy-data), where each row will contain an entity you want to process (eg. company), and each column will contain a different attribute of that entity (eg. industry, address, revenue, etc.).
7080

7181
### Configuration file
7282

7383
The LLM needs instructions on how to process your data. Create a new file called `config.yaml` (you can change the name if you prefer) somewhere in your project directory and open it with a text editor. Copy this into it:
7484

7585
```yaml
76-
input_csv: path/to/original_data.csv
77-
output_csv: path/to/processed_data.csv
86+
input_csv: data/donations.csv
87+
output_csv: data/donations_classified.csv
7888
model:
7989
provider: openai
8090
name: gpt-4o-mini
8191
search:
82-
engine: brightdata_google
92+
engine: brave
93+
results: 20
8394
prompt:
84-
system: You are an expert researcher whose job is to classify organisations based on the industry they belong to.
95+
system: You are an expert researcher whose job is to classify individuals and companies based on their industry.
8596
user: |
8697
# Instructions
8798
8899
Your job is to research "{{DonorName}}", a donor to a political party in the UK. You will determine what industry {{DonorName}} belongs to. The entity could be a company, a trade group, a union, an individual, etc.
89-
100+
90101
If {{DonorName}} is an individual, you should classify them based on their profession or the industry they are most closely associated with. If the documents are about multiple individuals, or if it's not clear which individual the documents refer to, please set the industry to "Don't know" and the confidence level to 1. For example, there's no way to know for certain that someone named "John Smith" in the documents is the same person as the donor in the Electoral Commission.
91102
92103
We also know that the donor is a {{DonorStatus}}.
@@ -95,8 +106,8 @@ prompt:
95106
96107
In most cases, you should start by searching for {{DonorName}} without any additional parameters. Where relevant, remove redundant words like "company", "limited", "plc", etc from the search query. If you need to perform another search, try to refine it by adding relevant keywords like "industry", "job", "company", etc. Note that each case will be different, so be flexible and adaptable. Unless necessary, limit your research to two or three searches.
97108
98-
With each search, select a few sources with `mcp-server-fetch-python` that are most likely to provide relevant information. Access them using the tools provided. Be critical and use common sense. ALWAYS cite your sources.
99-
109+
With each search, select a few sources that are most likely to provide relevant information. Access them using the tools provided. Be critical and use common sense. Use the sequential thinking tool to think about your next steps. ALWAYS cite your sources.
110+
100111
Now, please proceed with your analysis and classification of {{DonorName}}.
101112
structure:
102113
industry:
@@ -135,35 +146,40 @@ examples:
135146
industry: Financial and insurance activities
136147
explanation: |
137148
According to [the Wall Street Journal](https://www.wsj.com/market-data/quotes/SFNC/company-people/executive-profile/247375783), Mr. Charles Alexander DANIEL-HOBBS is the Chief Financial Officer and Executive Vice President of Simmons First National Corp, a bank holding company.
138-
149+
139150
A Charles Alexander DANIEL-HOBBS also operates several companies, such as [DIBDEN PROPERTY LIMITED](https://find-and-update.company-information.service.gov.uk/company/10126637), which Companies House classifies as "Other letting and operating of own or leased real estate". However, the information is not clear on whether these are the same person.
151+
confidence: 2
140152
- input: "Unite the Union"
141153
output:
142154
industry: Trade union
143155
explanation: |
144156
Unite is [one of the two largest trade unions in the UK](https://en.wikipedia.org/wiki/Unite_the_Union), with over 1.2 million members. It represents various industries, such as construction, manufacturing, transport, logistics and other sectors.
157+
confidence: 7
145158
- input: "Google UK Limited"
146159
output:
147160
industry: Information and communication
148161
explanation: |
149162
Google UK Limited is a [subsidiary of Google LLC](https://about.google/intl/ALL_uk/google-in-uk/), a multinational technology company that specializes in Internet-related services and products.
150163
151-
The company [provides various web based business services](https://www.bloomberg.com/profile/company/1200719Z:LN), including a web based search engine which includes various options such as web, image, directory, and news searches.
164+
The company [provides various web based business services](https://www.bloomberg.com/profile/company/1200719Z:LN), including a web based search engine which includes various options such as web, image, directory, and news searches.
165+
confidence: 10
152166
- input: "John Smith"
153167
output:
154168
industry: Don't know
155169
explanation: |
156170
The documents about John Smith refer to multiple people (a [British politician](https://en.wikipedia.org/wiki/John_Smith_(Labour_Party_leader)), an [explorer](https://en.wikipedia.org/wiki/John_Smith_(explorer)), a [singer-songwriter](https://johnsmithjohnsmith.com/)), so there's no way to accurately assess what industry this particular individual belongs to.
171+
confidence: 1
172+
logfire: true
157173
```
158174
159-
You will need to edit this file to suit your project. Let's break all this down:
175+
You will need to adapt this configuration file to suit your project. Let's break it all down:
160176
161-
- `input_csv` and `output_csv` are the names of the data you want to process and where you want to save the results, respectively.
177+
- `input_csv` and `output_csv` are the paths to your original data and where you want to save the results, respectively.
162178
- `model`: The LLM you want to use. You can find a list of supported models [here](https://ai.pydantic.dev/models/). Note that you need to provide both a `provider` and model `name` (eg. `anthropic` and `claude-3.5-sonnet`). You will also likely need to set up an API key (see [credentials below](#credentials)).
163179
- `search`: The search engine you want to use. You can find a list of supported search engines [here](/docs/search.md). You will also likely need to set up an API key here (see [credentials](#credentials)).
164180
- `prompt`: LLMs take in a [system prompt](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/system-prompts) and a user prompt. Think of the system prompt as explaining to the LLM what its role is, and the user prompt as the instructions you want it to follow. You can use double curly braces (`{{ }}`) to refer to columns in your input CSV. There are some tips on writing good prompts [here](docs/prompt.md).
165181
- `structure`: The structure of the output data. You can think of this as the columns you want added to your original CSV.
166-
- `examples`: Examples of the output data. These will help the AI better understand what you're trying to do.
182+
- `examples` (optional): Examples of the output data. These will help the AI better understand what you're trying to do.
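
To make the `{{ }}` placeholders concrete, here is a minimal sketch of how a prompt template could be filled in from one CSV row. It is an illustration only and does not claim to be how Augmenta renders prompts internally; the `render` function and the `PLACEHOLDER` pattern are hypothetical names introduced for this example.

```python
import re

# Matches {{ColumnName}}, with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, row: dict[str, str]) -> str:
    """Replace each {{Column}} in the template with the value from one CSV row."""

    def substitute(match: re.Match) -> str:
        column = match.group(1)
        if column not in row:
            # Fail loudly rather than send a half-filled prompt to the LLM.
            raise KeyError(f"prompt refers to unknown column {column!r}")
        return row[column]

    return PLACEHOLDER.sub(substitute, template)


row = {"DonorName": "Unite the Union", "DonorStatus": "Trade Union"}
print(render('Research "{{DonorName}}". The donor is a {{DonorStatus}}.', row))
```

The practical point is that every placeholder in the prompt should match a column header in the input CSV exactly, including capitalisation.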
167183

168184
### Credentials
169185

@@ -185,11 +201,11 @@ Make sure you have saved both your `config.yaml` and `.env` files. Open a **new*
185201
augmenta config.yaml
186202
```
187203

188-
It might be a few seconds before you see any output, but once it does, you will see a progress bar.
204+
It might be a few seconds before you see a progress bar.
189205

190206
By default, Augmenta will save your progress so that you can resume if the process gets interrupted at any point. You can find options for working with the cache [here](docs/cache.md).
191207

192-
Start with a subset of your data (5-10 rows) to test your configuration and that you are happy with the results. [Adjust your prompt often](docs/prompt.md). You can then rerun Augmenta on the full dataset.
208+
Start with a subset of your data (5-10 rows) to test your configuration and check that you are happy with the results. [Adjust your prompt often](docs/prompt.md). You can then run Augmenta on the full dataset.
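
One way to produce such a subset is sketched below with the Python standard library. The `take_sample` helper is hypothetical and not an Augmenta feature; the example uses in-memory buffers so it runs as-is.

```python
import csv
import io
from itertools import islice


def take_sample(source, target, limit: int = 10) -> int:
    """Copy the header and the first `limit` data rows from source to target."""
    reader = csv.reader(source)
    writer = csv.writer(target)
    writer.writerow(next(reader))  # header row
    copied = 0
    for row in islice(reader, limit):
        writer.writerow(row)
        copied += 1
    return copied


# In-memory stand-ins for real files, with made-up donor names.
source = io.StringIO(
    "DonorName,DonorStatus\n" + "".join(f"Donor {i},Company\n" for i in range(25))
)
target = io.StringIO()
print(take_sample(source, target, limit=5))
```

With real files, open both with `newline=""` as the `csv` module recommends, then point `input_csv` in your YAML at the sample file while you iterate on the prompt.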
193209

194210
#### Monitoring
195211

@@ -203,7 +219,7 @@ Add `logfire: true` to your YAML and run Augmenta in verbose mode:
203219
augmenta -v config.yaml
204220
```
205221

206-
If everything is set up correctly, you should have a link to your logfire dashboard in the terminal. You will be able to monitor how Augmenta is running, which tools are using, any potential errors or inconsistencies, etc.
222+
If everything is set up correctly, you should have a link to your logfire dashboard in the terminal. You will be able to monitor how Augmenta is running, which tools it is using, any potential errors or inconsistencies, etc.
207223

208224
![Screenshot of a Logfire dashboard showing an Augmenta run](docs/logfire-demo.png "Logfire demo")
209225

@@ -213,26 +229,4 @@ If everything is set up correctly, you should have a link to your logfire dashbo
213229
- [Adding new tools to Augmenta](/docs/tools.md)
214230
- [Writing a good prompt](/docs/prompt.md)
215231
- [How caching works](/docs/cache.md)
216-
- [An example in action](/docs/examples/donations/README.md)
217-
218-
## Development
219-
220-
Create a new virtual environment:
221-
222-
```bash
223-
cd augmenta
224-
python -m venv .venv
225-
source .venv/bin/activate # On Windows: .venv\Scripts\activate
226-
```
227-
228-
Now install the dependencies and test dependencies:
229-
230-
```bash
231-
python -m pip install -e '.[test]'
232-
```
233-
234-
To run the tests:
235-
236-
```bash
237-
python -m pytest
238-
```
232+
- [An example in action](/docs/examples/donations/README.md)
