Main Output files
There are three main output files generated by ICEscreen that can be found in the directory ICEscreen_results/*{genbank file name analysed}*/detected_mobile_elements/:
*{genbank file name analysed}*_detected_SP_withMEIds.tsv
: List of the signature proteins detected by the tool and their possible assignment to an ICE or IME element.
*{genbank file name analysed}*_detected_ME.tsv
: List of the ICE and IME elements detected by the tool, including information about the signature proteins they contain.
*{genbank file name analysed}*_detected_ME.summary
: Statistical summary by category of result.
File *{genbank file name analysed}*_detected_SP_withMEIds.tsv
This file lists the signature proteins (SPs) detected by the tool and their possible assignment to an ICE or IME element. It is a tab-separated table with a header. Each line represents a signature protein detected by ICEscreen.
Description of the columns
Information associated with the ICE/IME element detected by ICEscreen
N° Column | Column name | Description |
---|---|---|
1 | ICE_IME_id | ICE / IME identifier attributed to a signature protein (SP) associated with high confidence with a structure. In other words, all the SPs with the same ICE_IME_id belong to the same ICE / IME structure according to the software. |
2 | ICE_IME_id_need_manual_curation | ICE / IME identifier attributed to a signature protein (SP) associated with a structure with lower confidence. In other words, this SP could be attributed to the ICE / IME structure, but this requires manual curation. The identifiers used here are the same as those in the column “ICE_IME_id”. |
3 | Segment_number | Number of the segment in which the signature protein is located. One of the first steps of the algorithm is to cut the ordered list of SPs into smaller segments in which two consecutive SPs cannot be separated by more than 100 CDSs. |
4 | Comments_ICE_IME_structure | Specifies information regarding special decisions made by the algorithm on attributing the SP to a structure or not. An example of comment: “The SP NP_XXXXXX is not a VirB4, constitutes a conjugaison module by itself, and no integrase has been attributed to the element 3, please manually check.”. |
5 | Is_hit_blast | Indicates whether the signature protein was detected by BlastP (1: Yes, 0: No). |
6 | Is_hit_HMM | Indicates whether the signature protein was detected by HMM (1: Yes, 0: No). |
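The segmentation rule behind the Segment_number column can be illustrated with a minimal sketch. It is not ICEscreen's actual implementation: the function name `split_into_segments` is hypothetical, each SP is represented only by its CDS number, and "separated by more than 100 CDSs" is assumed to mean that more than 100 CDSs lie strictly between two consecutive SPs.

```python
def split_into_segments(cds_nums, max_gap=100):
    """Group the CDS numbers of SPs into segments.

    A new segment starts when more than `max_gap` CDSs lie between
    two consecutive SPs (assumed convention: gap = difference - 1).
    """
    segments = []
    for num in sorted(cds_nums):
        if segments and num - segments[-1][-1] - 1 <= max_gap:
            # Close enough to the previous SP: same segment.
            segments[-1].append(num)
        else:
            # First SP, or gap too large: open a new segment.
            segments.append([num])
    return segments


# Hypothetical CDS numbers of five detected SPs.
example_segments = split_into_segments([5, 40, 141, 400, 450])
```

With these made-up positions, the first three SPs fall into one segment and the last two into another, because 258 CDSs separate CDS 141 from CDS 400.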
Information extracted from the GenBank annotation
N° Column | Column name | Description |
---|---|---|
7 | CDS_num | CDS number of the signature protein on the genome when all the CDSs of the genome are ordered according to their start position. |
8 | Genome_accession | Identifier of the genome that harbours the CDS. |
9 | Genome_accession_rank | Rank of the genome that harbours the CDS in the original genbank file. If multiple genomes are stored back to back in a gbff file, the rank of the first record is 1, the rank of the second record is 2, and so on. |
10 | CDS_locus_tag | The locus tag (unique identifier) of the CDS. |
11 | CDS_protein_id | The protein id associated with the CDS. Different CDSs can share the same protein id if they encode the same protein. |
12 | CDS_strand | Strand of the gene encoding the CDS. |
13 | CDS_start | CDS start position. |
14 | CDS_end | CDS end position. |
15 | CDS_length | CDS length (in aa). |
16 | Is_pseudo | Indicates whether the CDS is annotated as pseudogene in the GenBank annotation. |
Information on BlastP results
Information on raw BlastP results and BlastP results enriched with ICEscreen annotation
N° Column | Column name | Description |
---|---|---|
17 and 18 | CDS_Protein_type and CDS_Protein_type_blast | Protein type (genbank annotation) of the CDS (VirB4, Coupling protein, Relaxase, DDE transposase, Tyrosine integrase, Serine integrase). |
19 | Description_of_blast_most_similar_ref_SP | Description (genbank annotation) of the reference SP most similar to the CDS. |
20 | Id_of_blast_most_similar_ref_SP | Protein identifier of the reference SP most similar to the CDS (BlastP search). |
21 | Length_of_blast_most_similar_ref_SP | Size of the reference SP most similar to the CDS. |
22 | Blast_ali_length | BlastP alignment length. |
23 | Blast_ali_start_CDS | Start (in bp) of the alignment with regard to the CDS protein sequence. |
24 | Blast_ali_end_CDS | Stop (in bp) of the alignment with regard to the CDS protein sequence. |
25 | Blast_ali_start_Query_blast | Start (in bp) of the alignment with regard to the protein sequence of the reference SP most similar to the CDS. |
26 | Blast_ali_end_Query_blast | Stop (in bp) of the alignment with regard to the protein sequence of the reference SP most similar to the CDS. |
27 | Blast_ali_identity_perc | BlastP Alignment Identity Percentage. |
28 | Blast_ali_E-value | E-value of the BlastP alignment. |
29 | Blast_ali_bitscore | BlastP Alignment bitscore. |
30 | CDS_coverage_blast | BlastP Alignment coverage for the CDS of the analysed genome. |
31 | Blast_ali_coverage_most_similar_ref_SP | BlastP Alignment coverage for most similar reference SP used as a blast query. |
32 | Protein_type_of_blast_most_similar_ref_SP | Protein type of the most similar reference SP used as a blast query (VirB4, Coupling protein, Relaxase, integrase). |
33 | Associated_element_type_of_blast_most_similar_ref_SP | Indicates whether the mobile element of the most similar reference SP used as a blast query is an ICE or an IME. |
34 | ICE_superfamily_of_most_similar_ref_SP | ICE superfamily of the mobile element which harbors the protein used as query for the BlastP search. |
35 | ICE_family_of_most_similar_ref_SP | ICE family of the mobile element which harbors the protein used as query for the BlastP search. |
36 | IME_superfamily_of_most_similar_ref_SP | IME superfamily of the mobile element which harbors the protein used as query for the BlastP search. This family matches the IME relaxase family. |
37 | Relaxase_family_domain_of_most_similar_ref_SP | Family of the domain of the relaxase of the mobile element which harbors the protein used as query for the BlastP search. |
38 | Relaxase_family_MOB_of_most_similar_ref_SP | MOB of the Relaxase of the mobile element which harbors the protein used as query for the BlastP search (if the mobile element is an IME). |
39 | Coupling_type_of_most_similar_ref_SP | Type of the coupling protein of the mobile element which harbors the protein used as query for the BlastP search (if the mobile element is an IME). |
40 | False_positives | Indicates whether ICEscreen considers the signature protein to be a false positive. |
41 | SP_blast_validation | Indicates whether the prediction of the signature protein by BlastP is validated. |
42 | Use_annotation | Indicates whether the expert DINAMIC annotation of the most similar reference SP is transferred to this CDS (based on percentage identity). |
Information on the results obtained with HMM
N° Column | Column name | Description |
---|---|---|
43 | Protein_type_of_matching_HMM_profile | Type of the protein associated with the HMM profile used for the HMM search (VirB4, Coupling protein, Relaxase, DDE transposase, Tyrosine integrase, Serine integrase). |
44 | Description_of_matching_HMM_profile | Description of the HMM profile with the best score (best HMM result). |
45 | Profile_name | Name of the HMM profile with the best score (best HMM result). |
46 | Length_of_matching_HMM_profile | Length of the HMM profile with the best score (best HMM result). |
47 | HMM_ali_i-Evalue | i-Evalue of the best HMM result (The i-Evalue for HMM is the equivalent to the E-value for BlastP). |
48 | HMM_ali_E-value | E-value of the best HMM result (taking into account multi-alignments). |
49 | HMM_ali_Score | Score of the best HMM result. |
50 | HMM_ali_Bias | Bias of the best HMM result. |
51 | HMM_ali_Global_score | Global score of the best HMM result. |
52 | HMM_ali_Global_bias | Global bias of the best HMM result. |
53 | HMM_coverage | Percentage coverage of the HMM profile found in the alignment of the best HMM result (Proportion of the HMM profile found in the HMM alignment). |
54 | CDS_coverage_hmm | Percentage coverage of the CDS found in the alignment of the best HMM result (Proportion of the CDS found in the HMM alignment). |
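As an illustration of how the columns described above can be used, the following sketch groups signature proteins by the element they were assigned to. It relies on two assumptions that should be checked against a real output file: the table is tab-separated (as the .tsv extension suggests), and SPs that were not assigned to any element have an empty ICE_IME_id. The function name `group_sp_by_element` and the demo rows are made up.

```python
import csv
import io
from collections import defaultdict


def group_sp_by_element(handle):
    """Map each ICE_IME_id to the locus tags of its signature proteins."""
    reader = csv.DictReader(handle, delimiter="\t")
    elements = defaultdict(list)
    for row in reader:
        element_id = row["ICE_IME_id"].strip()
        if element_id:  # assumption: empty means "not assigned"
            elements[element_id].append(row["CDS_locus_tag"])
    return dict(elements)


# Tiny made-up table containing only the columns used here.
demo = io.StringIO(
    "ICE_IME_id\tCDS_locus_tag\tIs_hit_blast\n"
    "1\tTAG_0001\t1\n"
    "1\tTAG_0002\t1\n"
    "\tTAG_0100\t0\n"
)
grouped = group_sp_by_element(demo)
```

For a real file, the same function can be given a handle opened with `open(path, newline="")`.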
File *{genbank file name analysed}*_detected_ME.tsv
This file lists the ICE and IME mobile elements (MEs) detected by ICEscreen, including the SPs they contain. It is a tab-separated table with a header. Information in this file is similar to the output file _withICEIMEIds.tsv (option -m) but centered around a list of ICE / IME structures instead of a list of SPs. The identifiers for the CDSs reported in this file are the locus tag if present, or [Protein identifier]-[CDS start position] if not (in the case of a gbff file with multiple genome records, the genome identifier is added like so: [Genome identifier]-[Protein identifier]-[CDS start position]).
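The identifier rule described above can be sketched as a small helper. The function name `cds_identifier` and its parameters are hypothetical, and the sketch assumes that the genome identifier is only prepended to the fallback form, which is how the description reads.

```python
def cds_identifier(locus_tag, protein_id, cds_start, genome_id=None):
    """Build a CDS identifier following the rule described in the text.

    Assumption: the genome identifier is only added to the fallback
    form, i.e. when no locus tag is available.
    """
    if locus_tag:
        return locus_tag
    parts = [protein_id, str(cds_start)]
    if genome_id:
        # Multi-record gbff file: prefix with the genome identifier.
        parts.insert(0, genome_id)
    return "-".join(parts)
```

For example, with made-up values, `cds_identifier("", "WP_000001", 1200, genome_id="NC_000001")` builds the three-part form, while any non-empty locus tag is returned unchanged.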
Description of the columns
N° Column | Column name | Description |
---|---|---|
1 | ICE_IME_id | Identifier of the mobile element detected by ICEscreen. |
2 | Segment_number | Segment number in which the element is located. |
3 | Genome_accession | Identifier of the genome that harbours the CDS. |
4 | Category_of_element | Type of the mobile element defined by its composition in total SPs (transfer module + integrase): ICE, IME, complete, partial, etc. |
5 | Category_of_integrase | Number and type of integrase(s). |
6 | Host_ICE_IME_ids | The other mobile element that hosts this mobile element. |
7 | Guest_ICE_IME_ids | List of other mobile elements that are integrated in this mobile element. |
8 | Colocalized_ICE_IME_ids | Other mobile elements located in the same segment. The mobile elements located in the same segment are not necessarily in accretion. |
9 | ICEline_format | Description of the element’s SPs in the ICEline format. The ICEline format is a single line description of the content of SP and the number of CDS between each of them. See the section on ICEline format for more details. |
10 | ICE_consensus_superfamily_SP_conj_module | ICE Superfamily if consensus between the different SP of the conjugation module. |
11 | ICE_consensus_family_SP_conj_module | ICE family if consensus between the different SP of the conjugation module. |
12 | IME_relaxase_family_domains_blast | List of the IME relaxase family domains. |
13 | HMM_family_SP_conj_module | Family of the SPs of the conjugation module as reported by SPs hits with the HMM profiles. This information is displayed only if no family was reported by BlastP hits. In other words, this is a listing of the type of HMM profile used for relaxase and / or coupling protein and / or VirB4. |
14 | Integrase_upstream | Integrase(s) whose genomic position is located before the SPs of the conjugation module of the mobile element. |
15 | Integrase_downstream | Integrase(s) whose genomic position is located after the SPs of the conjugation module of the mobile element. |
16 | Relaxase | Relaxase of the mobile element. |
17 | Coupling_protein | Coupling protein of the mobile element. |
18 | VirB4 | VirB4 of the mobile element. |
19 | List_SP_ordered_genomic_position | List of the signature proteins reported in the columns above ordered according to their position on the genome and separated by commas. |
20 | Start_of_most_upstream_SP | The start of the most upstream signature protein. This is not to be mistaken for the start of the element however. |
21 | Stop_of_most_downstream_SP | The stop of the most downstream signature protein. This is not to be mistaken for the stop of the element however. |
22 | Other_potential_SP_conj_module_need_manual_curation_and_review | List of other signature proteins of the conjugation module that can potentially be associated with the transfer module of the mobile element, manual verification is required. |
23 | Other_potential_integrase_need_manual_curation_and_review | List of other integrase(s) that can potentially be associated with the integration module of the mobile element, manual verification is required. |
24 | Comments_regarding_structure | Detailed explanation of why some signature proteins need manual curation or could not be associated with the element with high confidence. Other types of comments include the rationale of the algorithm for special case situations. |
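A minimal sketch of reading this table, using only column names listed above: it splits List_SP_ordered_genomic_position on commas and computes the distance covered by the SPs. The function name `summarize_elements` and the demo row (including the category value) are made up, and the span is only that of the signature proteins, not of the element itself.

```python
import csv
import io


def summarize_elements(handle):
    """Return (id, category, number of SPs, SP span) for each element."""
    reader = csv.DictReader(handle, delimiter="\t")
    summary = []
    for row in reader:
        sp_field = row["List_SP_ordered_genomic_position"]
        sp_ids = [s.strip() for s in sp_field.split(",") if s.strip()]
        # Span between the outermost SPs, not the element boundaries.
        span = (
            int(row["Stop_of_most_downstream_SP"])
            - int(row["Start_of_most_upstream_SP"])
            + 1
        )
        summary.append(
            (row["ICE_IME_id"], row["Category_of_element"], len(sp_ids), span)
        )
    return summary


# Made-up single-row table restricted to the columns used here.
demo_me = io.StringIO(
    "ICE_IME_id\tCategory_of_element\tList_SP_ordered_genomic_position\t"
    "Start_of_most_upstream_SP\tStop_of_most_downstream_SP\n"
    "1\tICE\tTAG_1,TAG_2,TAG_3,TAG_4\t1000\t21000\n"
)
me_summary = summarize_elements(demo_me)
```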
File *{genbank file name analysed}*_detected_ME.summary
This file summarizes the statistics regarding the ICE and IME structures and the SPs: (1) the number of elements detected and their type, (2) the number of signature proteins detected, their type and their relationship with a mobile element, and (3) the segments and their content. Examples of statistics include “Total number of segments”, “Number of segments with one element”, “Number of segments with several elements”, “Number of segments with nested elements”, “Number of segments with no element”, “Number of complete ICE (4 types of SP)”, etc. If multiple genomes are stored back to back in a gbff file, the summaries for each genome identifier are displayed back to back according to their order in the original genbank file.
Detailed architecture of the ICEscreen output folder, including intermediate files
All the files generated by the different steps of ICEscreen are stored in a folder named ICEscreen_results. Below is the folder’s content:
└── ICEscreen_results/
    └── *{genbank file name analysed}*/
        └── detected_mobile_elements/
            ├── *{genbank file name analysed}*_detected_ME.summary
            ├── *{genbank file name analysed}*_detected_ME.tsv
            ├── *{genbank file name analysed}*_detected_SP_withMEIds.tsv
            ├── standard_genome_annotation_formats/
            |   ├── *{genbank file name analysed}*_icescreen.embl.gz
            |   ├── *{genbank file name analysed}*_icescreen.gb.gz
            |   ├── *{genbank file name analysed}*_icescreen.gff.gz
            |   ├── *{genbank file name analysed}*_source.fa.gz
            |   └── *{genbank file name analysed}*_source.gff.gz
            ├── param.conf.gz
            └── tmp_intermediate_files.tar.gz
Here are the details about each output folder and file:
detected_mobile_elements
: This folder contains the final result files; see the section on the main output files above.
standard_genome_annotation_formats
: Files for visualizing the ICEscreen results in genome visualization software. Tested with Artemis, JBrowse, and IGV. These files are compressed with gzip and need to be extracted before use. SPs of mobile elements that are to be manually reviewed are missing from these files.
*{genbank file name analysed}*_icescreen.embl.gz
: Mobile elements and signature proteins detected by ICEscreen in EMBL format. The original genome annotations are not in this file.
*{genbank file name analysed}*_icescreen.gb.gz
: Original genbank file modified with the addition of the results of ICEscreen (the mobile elements and signature proteins detected) in genbank format.
*{genbank file name analysed}*_icescreen.gff.gz
: Mobile elements and signature proteins detected by ICEscreen in GFF3 format. The original genome annotations are not in this file.
*{genbank file name analysed}*_source.fa.gz
: FASTA sequence of the whole original genome.
*{genbank file name analysed}*_source.gff.gz
: GFF3 file with annotation extracted from the original genbank file.
param.conf.gz
: Parameters used by ICEscreen to generate the results. This file is compressed with gzip and needs to be extracted before use.
tmp_intermediate_files.tar.gz
: Temporary intermediate files generated by ICEscreen. This file is compressed with tar and gzip and needs to be extracted before use. Once extracted, it yields the following directory structure:
├── detection_SP/
|   ├── Blast_mode/
|   |   ├── blastp_output/
|   |   ├── filtered_results/
|   |   ├── unfiltered_results/
|   |   └── *{genbank file name analysed}*_blast_SP.tsv
|   ├── hits_cleaning/
|   |   ├── proteins_to_remove/
|   |   ├── *{genbank file name analysed}*_detected_SP_hmm_cleaned.tsv
|   |   ├── *{genbank file name analysed}*_detected_SP_hmm_cleaned_reannotated.tsv
|   |   ├── *{genbank file name analysed}*_detected_SP_source.faa
|   |   └── *{genbank file name analysed}*__detected_SP_source.tsv
|   ├── HMM_mode/
|   |   ├── filtered_results/
|   |   ├── hmmscan_output/
|   |   ├── unfiltered_results/
|   |   └── *{genbank file name analysed}*_hmm_SP.tsv
|   └── *{genbank file name analysed}*_detected_SP.tsv
└── *{genbank file name analysed}*.faa
detection_SP
: Temporary intermediate files generated by the detection of the signature proteins. This directory comprises files generated during the blast and HMM searches at various stages of the cleaning process.
*{genbank file name analysed}*.faa
: Temporary intermediate multifasta file of the protein products with annotation extracted from the original genbank file.
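Since several of the files described above are gzip-compressed and the intermediate files are stored in a tar.gz archive, a short sketch of extracting them with the Python standard library may be useful. The helper names `gunzip` and `extract_archive` are hypothetical and are not part of ICEscreen; the usual command-line tools work just as well.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a .gz file; by default, drop the .gz suffix."""
    src = Path(src)
    dest = Path(dest) if dest else src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def extract_archive(archive, target_dir):
    """Extract a .tar.gz archive into target_dir."""
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
    return Path(target_dir)
```

For example, `gunzip("param.conf.gz")` would write `param.conf` next to the compressed file, and `extract_archive("tmp_intermediate_files.tar.gz", "tmp_intermediate_files")` would recreate the directory structure shown above inside the target folder.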