Commit c911863

Merge pull request #991 from bebatut/host-contamination-removal
Add 2 workflows (long and short-reads) for host and contamination removal from microbiome data
2 parents 379166f + 7420550 commit c911863

10 files changed: +1053 -0 lines changed
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+version: 1.2
+workflows:
+- name: main
+  subclass: Galaxy
+  publish: true
+  primaryDescriptorPath: /host-or-contamination-removal-on-long-reads.ga
+  testParameterFiles:
+  - /host-or-contamination-removal-on-long-reads-tests.yml
+  authors:
+  - name: Paul Zierep
+    orcid: 0000-0003-2982-388X
+  - name: "B\xE9r\xE9nice Batut"
+    orcid: 0000-0001-9852-1987
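The `orcid` values in the descriptor above end in a check character, so a typo in an identifier can be caught before publishing. The following is a minimal sketch, assuming the standard ISO 7064 MOD 11-2 scheme that ORCID iDs use. The function names are hypothetical and are not part of Dockstore or Galaxy tooling.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, the 15 leading digits, and the trailing check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


# The two identifiers listed under `authors` in the descriptor.
for orcid in ("0000-0003-2982-388X", "0000-0001-9852-1987"):
    print(orcid, is_valid_orcid(orcid))
```

Both identifiers pass this check. An identifier with a single mistyped digit would fail it.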
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# Changelog
+
+## [0.1] 2025-12-03
+
+First release.
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+# Host or Contamination removal on long-reads
+
+Microbiome DNA or RNA extracts are usually contaminated by host (e.g. human) DNA or RNA, and sometimes by other contaminants. It is important to remove all host and contaminant sequences and retain only microbiome sequences, both to speed up downstream steps and to prevent host or contaminant sequences from compromising the analysis.
+
+This workflow takes Nanopore fastq(.gz) files and executes the following steps:
+1. Mapping of the reads against a reference genome of the host or contaminant (e.g. human) using **Minimap2**
+2. Filtering of the generated BAM using **BAMtools** and **Samtools** to keep only the reads that do not align
+3. Generation of mapping statistics using **QualiMap**
+4. Aggregation of the mapping statistics using **MultiQC**
+
+## Input Datasets
+
+- A list of datasets corresponding to reads in `fastqsanger` or `fastqsanger.gz` format.
+- Reference genome
+- Profile for mapping
+
+## Output Datasets
+
+- A list of datasets corresponding to unmapped reads in `fastqsanger` or `fastqsanger.gz` format.
+- A list of QualiMap reports, one per sample, that can be used as inputs for additional MultiQC runs
+- MultiQC report of the mapping statistics in HTML
+
+## When to use this workflow
+
+Use this workflow for **long-read sequencing data** (e.g., Nanopore, PacBio). For short-read Illumina data, see the [Host or Contamination removal on short-reads](../host-contamination-removal-short-reads/) workflow.
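The filtering step described in the README keeps only reads that do not align to the host or contaminant reference. The sketch below illustrates that idea on plain SAM text, assuming the standard SAM convention that flag bit `0x4` marks an unmapped read. It is not how the workflow's BAMtools and Samtools steps are implemented, and the function name and example records are hypothetical.

```python
UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped


def unmapped_reads_as_fastq(sam_lines):
    """Yield FASTQ records for SAM alignment lines whose flag has bit 0x4 set."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no reads
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & UNMAPPED:
            yield f"@{name}\n{seq}\n+\n{qual}\n"


# Hypothetical two-read example: read_a maps to the host, read_b does not.
example_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "read_a\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read_b\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]
kept = list(unmapped_reads_as_fastq(example_sam))
print("".join(kept), end="")
```

Only `read_b` survives, which corresponds to the "Reads without Host or Contamination" output of the workflow.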
@@ -0,0 +1,183 @@
+- doc: Test outline for host-or-contamination-removal-on-long-reads
+  job:
+    Long-reads:
+      class: Collection
+      collection_type: list
+      elements:
+      - class: File
+        identifier: Spike3bBarcode10
+        location: https://zenodo.org/record/12190648/files/collection_of_all_samples_Spike3bBarcode10.fastq.gz
+        filetype: fastqsanger.gz
+      - class: File
+        identifier: Spike3bBarcode12
+        location: https://zenodo.org/record/12190648/files/collection_of_all_samples_Spike3bBarcode12.fastq.gz
+        filetype: fastqsanger.gz
+    Host/Contaminant Reference Genome (long-reads): apiMel3
+    Profile of preset options for the mapping (long-read): map-pb
+  outputs:
+    QualiMap Statistics:
+      element_tests:
+        Spike3bBarcode10:
+          elements:
+            genome_results:
+              asserts:
+              - that: has_text
+                text: "Spike3bBarcode10"
+              - that: has_text
+                text: "586,300,787 bp"
+            coverage_across_reference:
+              asserts:
+                has_text:
+                  text: "#Position (bp)"
+                has_n_lines:
+                  value: 416
+            coverage_histogram:
+              asserts:
+                has_text:
+                  text: "Number of genomic locations"
+                has_n_lines:
+                  value: 10
+            genome_fraction_coverage:
+              asserts:
+                has_text:
+                  text: "#Coverage (X)"
+                has_n_lines:
+                  value: 51
+            duplication_rate_histogram:
+              asserts:
+              - that: has_text
+                text: "#Duplication rate"
+              - that: has_text
+                text: "17.0"
+            homopolymer_indels:
+              asserts:
+              - that: has_text
+                text: "#Type of indel"
+              - that: has_text
+                text: "polyN"
+            insert_size_across_reference:
+              asserts:
+                has_size:
+                  value: 0
+            insert_size_histogram:
+              asserts:
+                has_size:
+                  value: 0
+            mapped_reads_clipping_profile:
+              asserts:
+              - that: has_text
+                text: "#Read position (bp)"
+              - that: has_text
+                text: "38.123"
+            mapped_reads_gc-content_distribution:
+              asserts:
+                has_text:
+                  text: "#GC Content (%)"
+                has_n_lines:
+                  value: 100
+            mapped_reads_nucleotide_content:
+              asserts:
+                has_text:
+                  text: "6.25"
+            mapping_quality_across_reference:
+              asserts:
+                has_text:
+                  text: "#Position (bp)"
+                has_n_lines:
+                  value: 416
+            mapping_quality_histogram:
+              asserts:
+                has_text:
+                  text: "#Mapping quality"
+                has_n_lines:
+                  value: 13
+        Spike3bBarcode12:
+          elements:
+            genome_results:
+              asserts:
+              - that: has_text
+                text: "Spike3bBarcode12"
+              - that: has_text
+                text: "586,300,787 bp"
+            coverage_across_reference:
+              asserts:
+                has_text:
+                  text: "#Position (bp)"
+                has_n_lines:
+                  value: 416
+            coverage_histogram:
+              asserts:
+                has_text:
+                  text: "Number of genomic locations"
+                has_n_lines:
+                  value: 6
+            genome_fraction_coverage:
+              asserts:
+                has_text:
+                  text: "#Coverage (X)"
+                has_n_lines:
+                  value: 51
+            duplication_rate_histogram:
+              asserts:
+              - that: has_text
+                text: "#Duplication rate"
+              - that: has_text
+                text: "8.0"
+            homopolymer_indels:
+              asserts:
+              - that: has_text
+                text: "#Type of indel"
+              - that: has_text
+                text: "polyN"
+            insert_size_across_reference:
+              asserts:
+                has_size:
+                  value: 0
+            insert_size_histogram:
+              asserts:
+                has_size:
+                  value: 0
+            mapped_reads_clipping_profile:
+              asserts:
+              - that: has_text
+                text: "#Read position (bp)"
+              - that: has_text
+                text: "0.03930972"
+            mapped_reads_gc-content_distribution:
+              asserts:
+                has_text:
+                  text: "#GC Content (%)"
+                has_n_lines:
+                  value: 100
+            mapped_reads_nucleotide_content:
+              asserts:
+                has_text:
+                  text: "16.0"
+            mapping_quality_across_reference:
+              asserts:
+                has_text:
+                  text: "#Position (bp)"
+                has_n_lines:
+                  value: 416
+            mapping_quality_histogram:
+              asserts:
+                has_text:
+                  text: "#Mapping quality"
+                has_n_lines:
+                  value: 4
+    MultiQC HTML Report:
+      asserts:
+      - that: has_text
+        text: "Spike3bBarcode10"
+      - that: has_text
+        text: "Spike3bBarcode12"
+    Reads without Host or Contamination:
+      element_tests:
+        Spike3bBarcode10:
+          asserts:
+            has_text:
+              text: "@0a0c4d2c-291f-46a4-87d5-625efbfed6a0"
+        Spike3bBarcode12:
+          asserts:
+            has_text:
+              text: "@0a0c4e88-893a-4284-9119-ab4274e05445"
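The test file relies on three assertion types: `has_text`, `has_n_lines` and `has_size`. The sketch below shows roughly what each one checks. It is a simplified stand-in written for illustration, not Galaxy's implementation. Representing each assertion as a dictionary with a `that` key is an assumption of this sketch, and the sample report content is hypothetical.

```python
def has_text(content: bytes, text: str) -> bool:
    """True if the output contains the given substring."""
    return text.encode() in content


def has_n_lines(content: bytes, value: int) -> bool:
    """True if the output has exactly `value` lines."""
    return len(content.splitlines()) == value


def has_size(content: bytes, value: int) -> bool:
    """True if the output is exactly `value` bytes long."""
    return len(content) == value


CHECKS = {"has_text": has_text, "has_n_lines": has_n_lines, "has_size": has_size}


def run_asserts(content: bytes, asserts: list) -> list:
    """Return the assertions that fail for this content (empty list means pass)."""
    failures = []
    for assertion in asserts:
        params = {k: v for k, v in assertion.items() if k != "that"}
        if not CHECKS[assertion["that"]](content, **params):
            failures.append(assertion)
    return failures


# Hypothetical three-line report standing in for a QualiMap raw data file.
report = b"#Mapping quality\tCoverage\n0\t1\n60\t5\n"
checks = [
    {"that": "has_text", "text": "#Mapping quality"},
    {"that": "has_n_lines", "value": 3},
]
print(run_asserts(report, checks))
print(run_asserts(b"", [{"that": "has_size", "value": 0}]))
```

Because the assertions are held in a list, the same assertion type can appear more than once for one output, which a mapping with repeated keys cannot express reliably.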
