Main Output files
There are 4 main output files generated by ICEscreen. They can be found in the directory ICEscreen_results/results/detection_ME/:
- *_detected_SP_withMEIds.tsv: List of the signature proteins detected by the tool and their possible assignment to an ICE or IME element.
- *_detected_ME.tsv: List of the ICE and IME elements detected by the tool, including information about the signature proteins they contain.
- *_detected_ME.summary: Statistical summary by category of result.
- *_detected_ME.log: Parameters used by ICEscreen to generate the results.
The file *_detected_SP_withMEIds.tsv
This file lists the signature proteins detected by the tool and their possible assignment to an ICE or IME element. It is a tab-separated table with a header. Each line represents a signature protein detected by ICEscreen.
Description of the columns
Information associated with the ICE/IME element detected by ICEscreen
N° Column | Column name | Description |
---|---|---|
1 | ICE_IME_id | ICE / IME identifier attributed to a signature protein (SP) associated with a structure with high confidence. In other words, all the SPs sharing the same ICE_IME_id belong to the same ICE / IME structure according to the software. |
2 | ICE_IME_id_need_manual_curation | ICE / IME identifier attributed to a signature protein (SP) associated with a structure with lower confidence. In other words, this SP could be attributed to the ICE / IME structure, but this requires manual curation. The identifiers used here are the same as in the column “ICE_IME_id”. |
3 | Segment_number | Number of the segment in which the signature protein is located. One of the first steps of the algorithm is to cut the ordered list of SPs into smaller segments in which two subsequent SPs cannot be separated by more than 100 CDSs. |
4 | Comments_ICE_IME_structure | Specifies information regarding special decisions made by the algorithm on attributing the SP to a structure or not. An example of comment: “The SP NP_XXXXXX is not a VirB4, constitutes a conjugaison module by itself, and no integrase has been attributed to the element 3, please manually check.”. |
5 | Is_hit_blast | Indicates whether the signature protein was detected by BlastP (1: Yes, 0: No). |
6 | Is_hit_HMM | Indicates whether the signature protein was detected by HMM (1: Yes, 0: No). |
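The segmentation step described for the Segment_number column can be illustrated with a short sketch. This is not ICEscreen's own code: the function name `split_into_segments` and the interpretation of "separated by" (the number of CDSs strictly between two SPs, computed from their CDS_num values) are assumptions made for illustration.

```python
def split_into_segments(sp_cds_numbers, max_cds_between=100):
    """Group the CDS numbers of detected SPs into segments.

    Assumption: the number of CDSs separating two SPs with CDS numbers
    a < b is b - a - 1. A new segment starts when that count exceeds
    max_cds_between.
    """
    segments = []
    for cds_num in sorted(sp_cds_numbers):
        if segments and cds_num - segments[-1][-1] - 1 <= max_cds_between:
            # Close enough to the previous SP: same segment.
            segments[-1].append(cds_num)
        else:
            # First SP, or gap too large: open a new segment.
            segments.append([cds_num])
    return segments


if __name__ == "__main__":
    # Made-up CDS_num values for the SPs of one genome.
    for number, segment in enumerate(split_into_segments([12, 15, 40, 250, 260, 900]), start=1):
        print(number, segment)
```

The threshold is exposed as a parameter because the summary file reports the maximum number of CDSs between subsequent SPs as a run parameter.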
Information extracted from the GenBank annotation
N° Column | Column name | Description |
---|---|---|
7 | CDS_num | CDS number of the signature protein on the genome when all the CDSs of the genome are ordered according to their start position. |
8 | Genome_accession | Identifier of the genome that harbours the CDS. |
9 | Genome_accession_rank | Rank of the genome that harbours the CDS in the original genbank file. If multiple genomes are stored back to back in a gbff file, the rank of the first record is 1, the rank of the second record is 2, and so on. |
10 | CDS_locus_tag | The locus tag (unique identifier) of the CDS. |
11 | CDS_protein_id | The protein id associated with the CDS. Different CDSs can have the same protein id if they encode the same protein. |
12 | CDS_strand | Strand of the gene encoding the CDS. |
13 | CDS_start | CDS start position. |
14 | CDS_end | CDS end position. |
15 | CDS_length | CDS length (in aa). |
16 | Is_pseudo | Indicates whether the CDS is annotated as pseudogene in the GenBank annotation. |
Information on BlastP results
Information on raw BlastP results and BlastP results enriched with ICEscreen annotation
N° Column | Column name | Description |
---|---|---|
17 and 18 | CDS_Protein_type and CDS_Protein_type_blast | Protein type (genbank annotation) of the CDS (VirB4, Coupling protein, Relaxase, DDE transposase, Tyrosine integrase, Serine integrase). |
19 | Description_of_blast_most_similar_ref_SP | Description (genbank annotation) of the reference SP most similar to the CDS. |
20 | Id_of_blast_most_similar_ref_SP | Protein identifier of the reference SP most similar to the CDS (BlastP search). |
21 | Length_of_blast_most_similar_ref_SP | Size of the reference SP most similar to the CDS. |
22 | Blast_ali_length | BlastP alignment length. |
23 | Blast_ali_start_CDS | Start (in bp) of the alignment with regard to the CDS protein sequence. |
24 | Blast_ali_end_CDS | Stop (in bp) of the alignment with regard to the CDS protein sequence. |
25 | Blast_ali_start_Query_blast | Start (in bp) of the alignment with regard to the protein sequence of the reference SP most similar to the CDS. |
26 | Blast_ali_end_Query_blast | Stop (in bp) of the alignment with regard to the protein sequence of the reference SP most similar to the CDS. |
27 | Blast_ali_identity_perc | BlastP Alignment Identity Percentage. |
28 | Blast_ali_E-value | E-value of the BlastP alignment. |
29 | Blast_ali_bitscore | BlastP Alignment bitscore. |
30 | CDS_coverage_blast | BlastP Alignment coverage for the CDS of the analysed genome. |
31 | Blast_ali_coverage_most_similar_ref_SP | BlastP Alignment coverage for most similar reference SP used as a blast query. |
32 | Protein_type_of_blast_most_similar_ref_SP | Protein type of the most similar reference SP used as a blast query (VirB4, Coupling protein, Relaxase, integrase). |
33 | Associated_element_type_of_blast_most_similar_ref_SP | Indicates whether the mobile element of the most similar reference SP used as a blast query is an ICE or an IME. |
34 | ICE_superfamily_of_most_similar_ref_SP | ICE superfamily of the mobile element which harbors the protein used as query for the BlastP search. |
35 | ICE_family_of_most_similar_ref_SP | ICE family of the mobile element which harbors the protein used as query for the BlastP search. |
36 | IME_superfamily_of_most_similar_ref_SP | IME superfamily of the mobile element which harbors the protein used as query for the BlastP search. This family matches the IME relaxase family. |
37 | Relaxase_family_domain_of_most_similar_ref_SP | Family of the domain of the relaxase of the mobile element which harbors the protein used as query for the BlastP search. |
38 | Relaxase_family_MOB_of_most_similar_ref_SP | MOB of the Relaxase of the mobile element which harbors the protein used as query for the BlastP search (if the mobile element is an IME). |
39 | Coupling_type_of_most_similar_ref_SP | Type of the coupling protein of the mobile element which harbors the protein used as query for the BlastP search (if the mobile element is an IME). |
40 | False_positives | Indicates whether ICEscreen considers the signature protein to be a false positive. |
41 | SP_blast_validation | Indicates whether the prediction of the signature protein by BlastP is validated. |
42 | Use_annotation | Indicates whether the expert DINAMIC annotation of the most similar reference SP is transferred to this CDS (based on percentage identity). |
Information on the results obtained with HMM
N° Column | Column name | Description |
---|---|---|
43 | Protein_type_of_matching_HMM_profile | Type of the protein associated with the HMM profile used for the HMM search (VirB4, Coupling protein, Relaxase, DDE transposase, Tyrosine integrase, Serine integrase). |
44 | Description_of_matching_HMM_profile | Description of the HMM profile with the best score (best HMM result). |
45 | Profile_name | Name of the HMM profile with the best score (best HMM result). |
46 | Length_of_matching_HMM_profile | Length of the HMM profile with the best score (best HMM result). |
47 | HMM_ali_i-Evalue | i-Evalue of the best HMM result (The i-Evalue for HMM is the equivalent to the E-value for BlastP). |
48 | HMM_ali_E-value | E-value of the best HMM result (taking into account multi-alignments). |
49 | HMM_ali_Score | Score of the best HMM result. |
50 | HMM_ali_Bias | Bias of the best HMM result. |
51 | HMM_ali_Global_score | Global score of the best HMM result. |
52 | HMM_ali_Global_bias | Global bias of the best HMM result. |
53 | HMM_coverage | Percentage coverage of the HMM profile found in the alignment of the best HMM result (Proportion of the HMM profile found in the HMM alignment). |
54 | CDS_coverage_hmm | Percentage coverage of the CDS found in the alignment of the best HMM result (Proportion of the CDS found in the HMM alignment). |
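As an illustration of how this table can be consumed, the sketch below groups signature proteins by the element they were assigned to, using only the column names ICE_IME_id and CDS_locus_tag documented above. The function name `group_sp_by_element` is hypothetical. Two assumptions are made: the delimiter defaults to a tab (based on the .tsv extension, and left as a parameter), and SPs without a high-confidence assignment have an empty ICE_IME_id cell.

```python
import csv
import io
from collections import defaultdict


def group_sp_by_element(handle, delimiter="\t"):
    """Map each ICE_IME_id to the locus tags of its signature proteins."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        element_id = (row.get("ICE_IME_id") or "").strip()
        if element_id:
            # Assumption: an empty cell means "not assigned to an element".
            groups[element_id].append(row["CDS_locus_tag"])
    return dict(groups)


if __name__ == "__main__":
    # Made-up excerpt containing only three of the documented columns.
    example = io.StringIO(
        "ICE_IME_id\tCDS_locus_tag\tIs_hit_blast\n"
        "1\tABC_0010\t1\n"
        "1\tABC_0012\t1\n"
        "\tABC_0500\t0\n"
        "2\tABC_0900\t1\n"
    )
    for element_id, locus_tags in group_sp_by_element(example).items():
        print(element_id, locus_tags)
```

With a real file, the same function can be called on `open(path, newline="")`.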
File *_detected_ME.tsv
This file lists the ICE / IME / partial element structures detected by the tool, including the SPs they contain. It is a tab-separated table with a header. Information in this file is similar to the output file _withICEIMEIds.tsv (option -m) but centered around a list of ICE / IME structures instead of a list of SPs. The identifiers for the CDSs reported in this file are the locus tag if present, or [Protein identifier]-[CDS start position] if not (in the case of a gbff file with multiple genome records, the genome identifier is added like so: [Genome identifier]-[Protein identifier]-[CDS start position]).
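The identifier rule above can be written as a small helper. This is only a sketch of the rule as stated in this section, not code taken from ICEscreen; the function name `cds_identifier` and the example values are hypothetical.

```python
def cds_identifier(locus_tag, protein_id, cds_start, genome_id=None):
    """Build a CDS identifier following the rule described above.

    The locus tag is used when present. Otherwise the identifier is
    [Protein identifier]-[CDS start position], prefixed with the genome
    identifier when one is given (gbff file with multiple records).
    """
    if locus_tag:
        return locus_tag
    parts = [protein_id, str(cds_start)]
    if genome_id:
        parts.insert(0, genome_id)
    return "-".join(parts)


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    print(cds_identifier("ABC_0010", "WP_000000001.1", 1520))
    print(cds_identifier(None, "WP_000000001.1", 1520))
    print(cds_identifier(None, "WP_000000001.1", 1520, genome_id="NC_000000.1"))
```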
Description of the columns
N° Column | Column name | Description |
---|---|---|
1 | ICE_IME_id | Identifier of the mobile element detected by ICEscreen. |
2 | Segment_number | Segment number in which the element is located. |
3 | Genome_accession | Identifier of the genome that harbours the CDS. |
4 | Category_of_element | Type of the mobile element defined by its composition in total SPs (transfer module + integrase): ICE, IME, complete, partial, etc. |
5 | Category_of_integrase | Number and type of integrase(s). |
6 | Host_ICE_IME_ids | The other mobile element that hosts this mobile element. |
7 | Guest_ICE_IME_ids | List of other mobile elements that are integrated in this mobile element. |
8 | Colocalized_ICE_IME_ids | Other mobile elements located in the same segment. The mobile elements located in the same segment are not necessarily in accretion. |
9 | ICEline_format | Description of the element’s SPs in the ICEline format. The ICEline format is a single line description of the content of SP and the number of CDS between each of them. See the section on ICEline format for more details. |
10 | ICE_consensus_superfamily_SP_conj_module | ICE superfamily, reported if there is a consensus between the different SPs of the conjugation module. |
11 | ICE_consensus_family_SP_conj_module | ICE family, reported if there is a consensus between the different SPs of the conjugation module. |
12 | IME_relaxase_family_domains_blast | List of the IME relaxase family domains. |
13 | HMM_family_SP_conj_module | Family of the SPs of the conjugation module as reported by SPs hits with the HMM profiles. This information is displayed only if no family was reported by BlastP hits. In other words, this is a listing of the type of HMM profile used for relaxase and / or coupling protein and / or VirB4. |
14 | Integrase_upstream | Integrase(s) whose genomic position is located before the SPs of the conjugation module of the mobile element. |
15 | Integrase_downstream | Integrase(s) whose genomic position is located after the SPs of the conjugation module of the mobile element. |
16 | Relaxase | Relaxase of the mobile element. |
17 | Coupling_protein | Coupling protein of the mobile element. |
18 | VirB4 | VirB4 of the mobile element. |
19 | List_SP_ordered_genomic_position | List of the signature proteins reported in the columns above ordered according to their position on the genome and separated by commas. |
20 | Start_of_most_upstream_SP | The start of the most upstream signature protein. This is not to be mistaken for the start of the element however. |
21 | Stop_of_most_downstream_SP | The stop of the most downstream signature protein. This is not to be mistaken for the stop of the element however. |
22 | Other_potential_SP_conj_module_need_manual_curation_and_review | List of other signature proteins of the conjugation module that can potentially be associated with the transfer module of the mobile element, manual verification is required. |
23 | Other_potential_integrase_need_manual_curation_and_review | List of other integrase(s) that can potentially be associated with the integration module of the mobile element, manual verification is required. |
24 | Comments_regarding_structure | Detailed explanation of why some signature proteins need manual curation or could not be associated with the element with high confidence. Other types of comments include the rationale of the algorithm for special-case situations. |
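A minimal sketch of reading this table is given below. It relies only on the column names ICE_IME_id, Category_of_element and List_SP_ordered_genomic_position documented above, and on the tab-separated layout stated for this file. The function name `summarize_elements` and the category labels and identifiers in the example are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarize_elements(handle):
    """Count elements per category and list the SPs of each element."""
    categories = Counter()
    sp_lists = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        categories[row["Category_of_element"]] += 1
        raw = row["List_SP_ordered_genomic_position"]
        # The SPs are documented as separated by commas.
        sp_lists[row["ICE_IME_id"]] = [sp.strip() for sp in raw.split(",") if sp.strip()]
    return categories, sp_lists


if __name__ == "__main__":
    # Made-up rows; the category labels are placeholders, not ICEscreen values.
    example = io.StringIO(
        "ICE_IME_id\tCategory_of_element\tList_SP_ordered_genomic_position\n"
        "1\tcategory_A\tABC_0010, ABC_0012, ABC_0020\n"
        "2\tcategory_B\tABC_0900\n"
    )
    categories, sp_lists = summarize_elements(example)
    print(dict(categories))
    print(sp_lists)
```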
File *_detected_ME.summary
This file summarizes the main parameters and statistics regarding the ICE / IME structures and the SPs. The parameters section includes the maximum number of CDSs between subsequent SPs in the same segment and the maximum number of CDSs between subsequent SPs for an IME. The general statistics cover (1) the number of elements detected and their type, (2) the number of signature proteins detected, their type and their relationship with a mobile element, and (3) the segments and their content. Examples of statistics include “Total number of segments”, “Number of segments with one element”, “Number of segments with several elements”, “Number of segments with nested elements”, “Number of segments with no element”, “Number of complete ICE (4 types of SP)”, etc. If multiple genomes are stored back to back in a gbff file, the summaries for each genome identifier are displayed back to back according to their order in the original genbank file.
File *_detected_ME.log
This file contains the detailed internal parameters and algorithms (step by step) used by ICEscreen to generate the results.
Detailed architecture of the ICEscreen output folder, including intermediate files
All the files generated by the different steps of ICEscreen are stored in a folder named ICEscreen_results. Below is the folder’s content:
└── ICEscreen_results/
    ├── faa/
    |   └── *.faa
    └── results/
        └── <genbank name>/
            ├── detection_ME/
            |   ├── *_detected_ME.log
            |   ├── *_detected_ME.summary
            |   ├── *_detected_ME.tsv
            |   └── *_detected_SP_withMEIds.tsv
            ├── detection_SP/
            |   ├── Blast_mode/
            |   |   ├── blastp_output/
            |   |   ├── filtered_results/
            |   |   ├── unfiltered_results/
            |   |   └── *_blast_SP.tsv
            |   ├── hits_cleaning/
            |   |   ├── proteins_to_remove/
            |   |   ├── *_detected_SP_hmm_cleaned.tsv
            |   |   ├── *_detected_SP_hmm_cleaned_reannotated.tsv
            |   |   ├── *_detected_SP_source.faa
            |   |   └── *__detected_SP_source.tsv
            |   ├── HMM_mode/
            |   |   ├── filtered_results/
            |   |   ├── hmmscan_output/
            |   |   ├── unfiltered_results/
            |   |   └── *_hmm_SP.tsv
            |   └── *_detected_SP.tsv
            ├── visualization_files/
            |   ├── *_icescreen.embl
            |   ├── *_icescreen.gb
            |   ├── *_icescreen.gff
            |   ├── *_source.fa
            |   └── *_source.gff
            └── icescreen.conf
Here are the details of each output folder and file:
- faa
  - *.faa: Multifasta of the protein products annotated in the genbank files.
- results: All results of ICEscreen.
  - detection_ME: This folder includes the final result files; see the section on the main output files.
  - detection_SP: Files generated by the first step of the pipeline, which detects the signature proteins. The signature proteins detected in this step are stored in the files *_detected_SP.tsv. Each row corresponds to a detected signature protein.
  - visualization_files: Files for visualizing the ICEscreen results in genome visualization software. Tested with Artemis, JBrowse, and IGV. Only SPs of mobile elements that are not to be manually reviewed can be visualized.
    - *_source.fa: FASTA sequence of the genome in the genbank file.
    - *_source.gff: GFF3 file with the annotation extracted from the genbank file.
    - *_icescreen.gff: Results of ICEscreen: the mobile elements and signature proteins detected are annotated in GFF3 format.
    - *_icescreen.embl: Results of ICEscreen: the mobile elements and signature proteins detected are annotated in EMBL format. This format is recommended with Artemis.
    - *_icescreen.gb: Original genbank file modified with the addition of the results of ICEscreen (the mobile elements and signature proteins detected) in genbank format.
- icescreen.conf: Config file with the absolute paths of all input and output files used by the ICEscreen pipeline. Parameters of each step are also provided.
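Given the folder layout described in this section, the main output files can be collected programmatically. The sketch below assumes only that they live in folders named detection_ME somewhere under the ICEscreen_results folder and that their names end with the four suffixes listed in this documentation. The function name `find_main_outputs` is hypothetical and not part of ICEscreen.

```python
from pathlib import Path

# Suffixes of the four main output files described in this documentation.
MAIN_SUFFIXES = (
    "_detected_SP_withMEIds.tsv",
    "_detected_ME.tsv",
    "_detected_ME.summary",
    "_detected_ME.log",
)


def find_main_outputs(results_root):
    """Return {parent folder name: {suffix: path}} for each detection_ME folder."""
    found = {}
    for folder in sorted(Path(results_root).rglob("detection_ME")):
        if not folder.is_dir():
            continue
        files = {}
        for suffix in MAIN_SUFFIXES:
            matches = sorted(folder.glob(f"*{suffix}"))
            if matches:
                files[suffix] = matches[0]
        # The parent folder is expected to be named after the genbank file.
        found[folder.parent.name] = files
    return found


if __name__ == "__main__":
    for name, files in find_main_outputs("ICEscreen_results").items():
        print(name)
        for suffix, path in files.items():
            print("  ", suffix, "->", path)
```

Searching recursively for detection_ME keeps the sketch independent of how many folder levels separate it from the results folder.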