Description of the output files

Main Output files

ICEscreen generates three main output files, which can be found in the directory ICEscreen_results/*{genbank file name analysed}*/detected_mobile_elements/:

*{genbank file name analysed}*_detected_SP_withMEIds.tsv: List of the signature proteins detected by the tool and their possible assignment to an ICE or IME element.
*{genbank file name analysed}*_detected_ME.tsv: List of the ICE and IME elements detected by the tool, including information about the signature proteins they contain.
*{genbank file name analysed}*_detected_ME.summary: Statistical summary by category of result.
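
As a convenience, the three paths above can be assembled programmatically. The following is a minimal sketch only: the function name `main_output_paths` and the example name `my_genome` are hypothetical, and it assumes the results folder sits in the current working directory and that the placeholder *{genbank file name analysed}* is replaced verbatim in both the folder name and the file names.

```python
from pathlib import Path


def main_output_paths(results_root, genbank_name):
    """Build the expected paths of the three main output files.

    `genbank_name` stands for the {genbank file name analysed} placeholder;
    how ICEscreen derives it from the input file is not assumed here.
    """
    base = Path(results_root) / genbank_name / "detected_mobile_elements"
    return {
        "signature_proteins": base / f"{genbank_name}_detected_SP_withMEIds.tsv",
        "mobile_elements": base / f"{genbank_name}_detected_ME.tsv",
        "summary": base / f"{genbank_name}_detected_ME.summary",
    }


# "my_genome" is a made-up name used for illustration only.
paths = main_output_paths("ICEscreen_results", "my_genome")
for label, path in paths.items():
    print(label, path)
```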

File `{genbank file name analysed}_detected_SP_withMEIds.tsv`

This file lists the signature proteins (SPs) detected by the tool and their possible assignment to an ICE or IME element. It is a tab-separated table with a header. Each line represents one signature protein detected by ICEscreen.
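
The table can be loaded with any delimited-text reader. Below is a minimal sketch using Python's standard `csv` module, assuming the file is tab-delimited as its `.tsv` extension suggests. The helper name `read_signature_proteins` is hypothetical, and the in-memory excerpt is made up and reduced to five of the columns described below; a real file would be opened with `open(path, newline="")` instead.

```python
import csv
import io


def read_signature_proteins(handle):
    """Return one dict per signature protein, keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up excerpt with only a handful of columns, for illustration.
example = io.StringIO(
    "ICE_IME_id\tSegment_number\tCDS_locus_tag\tIs_hit_blast\tIs_hit_HMM\n"
    "1\t1\tTAG_0001\t1\t1\n"
    "1\t1\tTAG_0007\t0\t1\n"
)

rows = read_signature_proteins(example)

# All values are read as strings, so flags are compared to "1".
blast_hits = [r["CDS_locus_tag"] for r in rows if r["Is_hit_blast"] == "1"]
print(blast_hits)
```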

Description of the columns

Information associated with the ICE/IME element detected by ICEscreen

N° Column	Column name	Description
1	ICE_IME_id	ICE / IME identifier attributed to a signature protein (SP) associated with a structure with high confidence. In other words, all the SPs with the same ICE_IME_id belong to the same ICE / IME structure according to the software.
2	ICE_IME_id_need_manual_curation	ICE / IME identifier attributed to a signature protein (SP) associated with a structure with lower confidence. In other words, this SP could be attributed to the ICE / IME structure, but this requires some manual curation. The identifiers used here are the same as those in the column “ICE_IME_id”.
3	Segment_number	Number of the segment in which the signature protein is located. One of the first steps of the algorithm is to cut the ordered list of SPs into smaller segments in which two subsequent SPs cannot be separated by more than 100 CDSs.
4	Comments_ICE_IME_structure	Specifies information regarding special decisions made by the algorithm on attributing the SP to a structure or not. An example of comment: “The SP NP_XXXXXX is not a VirB4, constitutes a conjugaison module by itself, and no integrase has been attributed to the element 3, please manually check.”.
5	Is_hit_blast	Indicates whether the signature protein was detected by BlastP (1: Yes, 0: No).
6	Is_hit_HMM	Indicates whether the signature protein was detected by HMM (1: Yes, 0: No).
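
The segmentation step described for the Segment_number column can be sketched as follows. This is an illustration under stated assumptions, not ICEscreen's actual implementation: the function name `split_into_segments` is hypothetical, the input is the CDS_num of each SP, and "separated by more than 100 CDSs" is interpreted as the number of CDSs lying strictly between two consecutive SPs.

```python
def split_into_segments(cds_nums, max_gap=100):
    """Group SP positions (CDS_num values) into segments.

    Two consecutive SPs stay in the same segment when at most `max_gap`
    CDSs lie between them (assumed reading of the 100-CDS rule).
    """
    segments = []
    for num in sorted(cds_nums):
        if segments and num - segments[-1][-1] - 1 <= max_gap:
            segments[-1].append(num)  # close enough: extend current segment
        else:
            segments.append([num])  # too far (or first SP): start a new one
    return segments


# Made-up CDS numbers: 100 CDSs lie between 50 and 151, 148 between 151 and 300.
print(split_into_segments([10, 50, 151, 300, 320]))
# [[10, 50, 151], [300, 320]]
```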

Information extracted from the GenBank annotation

N° Column	Column name	Description
7	CDS_num	CDS number of the signature protein on the genome when all the CDSs of the genome are ordered according to their start position.
8	Genome_accession	Identifier of the genome that harbours the CDS.
9	Genome_accession_rank	Rank of the genome that harbours the CDS in the original genbank file. If multiple genomes are stored back to back in a gbff file, the rank of the first record is 1, the rank of the second record is 2, and so on.
10	CDS_locus_tag	The locus tag (unique identifier) of the CDS.
11	CDS_protein_id	The protein id associated with the CDS. Different CDSs can have the same protein id if they encode the same protein.
12	CDS_strand	Strand of the gene encoding the CDS.
13	CDS_start	CDS start position.
14	CDS_end	CDS end position.
15	CDS_length	CDS length (in aa).
16	Is_pseudo	Indicates whether the CDS is annotated as pseudogene in the GenBank annotation.

Information on BlastP results

Information on raw BlastP results and BlastP results enriched with ICEscreen annotation

N° Column	Column name	Description
17 and 18	CDS_Protein_type and CDS_Protein_type_blast	Protein type (genbank annotation) of the CDS (VirB4, Coupling protein, Relaxase, DDE transposase, Tyrosine integrase, Serine integrase).
19	Description_of_blast_most_similar_ref_SP	Description (genbank annotation) of the reference SP most similar to the CDS.
20	Id_of_blast_most_similar_ref_SP	Protein identifier of the reference SP most similar to the CDS (BlastP search).
21	Length_of_blast_most_similar_ref_SP	Size of the reference SP most similar to the CDS.
22	Blast_ali_length	BlastP alignment length.
23	Blast_ali_start_CDS	Start (in bp) of the alignment with regard to the CDS protein sequence.
24	Blast_ali_end_CDS	Stop (in bp) of the alignment with regard to the CDS protein sequence.
25	Blast_ali_start_Query_blast	Start (in bp) of the alignment with regard to the protein sequence of the reference SP most similar to the CDS.
26	Blast_ali_end_Query_blast	Stop (in bp) of the alignment with regard to the protein sequence of the reference SP most similar to the CDS.
27	Blast_ali_identity_perc	BlastP Alignment Identity Percentage.
28	Blast_ali_E-value	E-value of the BlastP alignment.
29	Blast_ali_bitscore	BlastP Alignment bitscore.
30	CDS_coverage_blast	BlastP Alignment coverage for the CDS of the analysed genome.
31	Blast_ali_coverage_most_similar_ref_SP	BlastP Alignment coverage for most similar reference SP used as a blast query.
32	Protein_type_of_blast_most_similar_ref_SP	Protein type of the most similar reference SP used as a blast query (VirB4, Coupling protein, Relaxase, integrase).
33	Associated_element_type_of_blast_most_similar_ref_SP	Is the mobile element of the most similar reference SP used as a blast query an ICE or an IME.
34	ICE_superfamily_of_most_similar_ref_SP	ICE superfamily of the mobile element which harbors the protein used as query for the BlastP search.
35	ICE_family_of_most_similar_ref_SP	ICE family of the mobile element which harbors the protein used as query for the BlastP search.
36	IME_superfamily_of_most_similar_ref_SP	IME superfamily of the mobile element which harbors the protein used as query for the BlastP search. This family matches the IME relaxase family.
37	Relaxase_family_domain_of_most_similar_ref_SP	Family of the domain of the relaxase of the mobile element which harbors the protein used as query for the BlastP search.
38	Relaxase_family_MOB_of_most_similar_ref_SP	MOB of the Relaxase of the mobile element which harbors the protein used as query for the BlastP search (if the mobile element is an IME).
39	Coupling_type_of_most_similar_ref_SP	Type of the coupling protein of the mobile element which harbors the protein used as query for the BlastP search (if the mobile element is an IME).
40	False_positives	Indicates whether ICEscreen considers the signature protein to be a false positive.
41	SP_blast_validation	Indicates whether the prediction of the signature protein by BlastP is validated.
42	Use_annotation	Indicates whether the expert DINAMIC annotation of the most similar reference SP is transferred to this CDS (based on percentage identity).
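
To make the coverage columns more concrete, here is a minimal sketch of how a percentage coverage can be derived from alignment coordinates and a sequence length. The formula (inclusive, 1-based coordinates, result expressed as a percentage) is an assumption made for illustration; the exact computation used by ICEscreen for CDS_coverage_blast and Blast_ali_coverage_most_similar_ref_SP is not specified here, and the function name `percent_coverage` is hypothetical.

```python
def percent_coverage(ali_start, ali_end, total_length):
    """Percentage of a sequence spanned by an alignment.

    Assumes 1-based inclusive coordinates expressed in the same unit
    as `total_length`.
    """
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    aligned = ali_end - ali_start + 1
    return 100.0 * aligned / total_length


# Made-up values: an alignment spanning positions 11 to 110 of a
# sequence of length 200 covers half of it.
coverage = percent_coverage(11, 110, 200)
print(coverage)
```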

Information on the results obtained with HMM

N° Column	Column name	Description
43	Protein_type_of_matching_HMM_profile	Type of the protein associated with the HMM profile used for the HMM search (VirB4, Coupling protein, Relaxase, DDE transposase, Tyrosine integrase, Serine integrase).
44	Description_of_matching_HMM_profile	Description of the HMM profile with the best score (best HMM result).
45	Profile_name	Name of the HMM profile with the best score (best HMM result).
46	Length_of_matching_HMM_profile	Length of the HMM profile with the best score (best HMM result).
47	HMM_ali_i-Evalue	i-Evalue of the best HMM result (The i-Evalue for HMM is the equivalent to the E-value for BlastP).
48	HMM_ali_E-value	E-value of the best HMM result (taking into account multi-alignments).
49	HMM_ali_Score	Score of the best HMM result.
50	HMM_ali_Bias	Bias of the best HMM result.
51	HMM_ali_Global_score	Global score of the best HMM result.
52	HMM_ali_Global_bias	Global bias of the best HMM result.
53	HMM_coverage	Percentage coverage of the HMM profile found in the alignment of the best HMM result (Proportion of the HMM profile found in the HMM alignment).
54	CDS_coverage_hmm	Percentage coverage of the CDS found in the alignment of the best HMM result (Proportion of the CDS found in the HMM alignment).

File `{genbank file name analysed}_detected_ME.tsv`

This file lists the ICE and IME mobile elements (MEs) detected by ICEscreen, including the SPs they contain. It is a tab-separated table with a header. Information in this file is similar to the output file _withICEIMEIds.tsv (option -m) but centered around a list of ICE / IME structures instead of a list of SPs. The identifiers for the CDSs reported in this file are the locus tag if present, or [Protein identifier]-[CDS start position] if not (in the case of a gbff file with multiple genome records, the genome identifier is added like so: [Genome identifier]-[Protein identifier]-[CDS start position]).
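
The CDS identifier rule described above can be sketched as a small function. This is an illustration only: the function name `cds_identifier` and the accessions in the example are made up, and the sketch assumes that the genome identifier prefix applies to the fallback form and not to locus tags.

```python
def cds_identifier(locus_tag, protein_id, cds_start, genome_id=None):
    """Build the identifier used to report a CDS.

    The locus tag is used when present; otherwise the identifier is
    [Protein identifier]-[CDS start position], prefixed with the genome
    identifier when the input holds multiple genome records.
    """
    if locus_tag:
        return locus_tag
    parts = [protein_id, str(cds_start)]
    if genome_id:
        parts.insert(0, genome_id)
    return "-".join(parts)


# Made-up identifiers for illustration.
print(cds_identifier("TAG_0001", "WP_000000001.1", 1200))
print(cds_identifier(None, "WP_000000001.1", 1200))
print(cds_identifier(None, "WP_000000001.1", 1200, genome_id="NC_000000.1"))
```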

Description of the columns

N° Column	Column name	Description
1	ICE_IME_id	Identifier of the mobile element detected by ICEscreen.
2	Segment_number	Segment number in which the element is located.
3	Genome_accession	Identifier of the genome that harbours the CDS.
4	Category_of_element	Type of the mobile element defined by its composition in total SPs (transfer module + integrase): ICE, IME, complete, partial, etc.
5	Category_of_integrase	Number and type of integrase(s).
6	Host_ICE_IME_ids	The other mobile element that hosts this mobile element.
7	Guest_ICE_IME_ids	List of other mobile elements that are integrated in this mobile element.
8	Colocalized_ICE_IME_ids	Other mobile elements located in the same segment. The mobile elements located in the same segment are not necessarily in accretion.
9	ICEline_format	Description of the element’s SPs in the ICEline format. The ICEline format is a single line description of the content of SP and the number of CDS between each of them. See the section on ICEline format for more details.
10	ICE_consensus_superfamily_SP_conj_module	ICE Superfamily if consensus between the different SP of the conjugation module.
11	ICE_consensus_family_SP_conj_module	ICE family if consensus between the different SP of the conjugation module.
12	IME_relaxase_family_domains_blast	List of the IME relaxase family domains.
13	HMM_family_SP_conj_module	Family of the SPs of the conjugation module as reported by SPs hits with the HMM profiles. This information is displayed only if no family was reported by BlastP hits. In other words, this is a listing of the type of HMM profile used for relaxase and / or coupling protein and / or VirB4.
14	Integrase_upstream	Integrase(s) whose genomic position is located before the SPs of the conjugation module of the mobile element.
15	Integrase_downstream	Integrase(s) whose genomic position is located after the SPs of the conjugation module of the mobile element.
16	Relaxase	Relaxase of the mobile element.
17	Coupling_protein	Coupling protein of the mobile element.
18	VirB4	VirB4 of the mobile element.
19	List_SP_ordered_genomic_position	List of the signature proteins reported in the columns above ordered according to their position on the genome and separated by commas.
20	Start_of_most_upstream_SP	The start of the most upstream signature protein. This is not to be mistaken for the start of the element however.
21	Stop_of_most_downstream_SP	The stop of the most downstream signature protein. This is not to be mistaken for the stop of the element however.
22	Other_potential_SP_conj_module_need_manual_curation_and_review	List of other signature proteins of the conjugation module that can potentially be associated with the transfer module of the mobile element; manual verification is required.
23	Other_potential_integrase_need_manual_curation_and_review	List of other integrase(s) that can potentially be associated with the integration module of the mobile element; manual verification is required.
24	Comments_regarding_structure	Detailed explanation of why some signature proteins need manual curation or could not be associated with the element with high confidence. Other types of comments include the rationale of the algorithm for special case situations.
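
As a usage illustration for this table, the sketch below counts elements per Category_of_element and splits the comma-separated List_SP_ordered_genomic_position column into a list. The excerpt is made up and reduced to three columns, and the category labels `category_A` and `category_B` are placeholders, since the exact labels written by ICEscreen are not listed here.

```python
import csv
import io
from collections import Counter

# Made-up excerpt; a real file would be opened with open(path, newline="").
example = io.StringIO(
    "ICE_IME_id\tCategory_of_element\tList_SP_ordered_genomic_position\n"
    "1\tcategory_A\tTAG_0001,TAG_0007,TAG_0012\n"
    "2\tcategory_B\tTAG_0450,TAG_0452\n"
    "3\tcategory_A\tTAG_0900\n"
)

rows = list(csv.DictReader(example, delimiter="\t"))

# Number of detected elements per category.
counts = Counter(row["Category_of_element"] for row in rows)

# Ordered list of SP identifiers for each element.
sp_by_element = {
    row["ICE_IME_id"]: [
        sp.strip()
        for sp in row["List_SP_ordered_genomic_position"].split(",")
        if sp.strip()
    ]
    for row in rows
}

print(counts)
print(sp_by_element["1"])
```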

File `{genbank file name analysed}_detected_ME.summary`

This file summarizes statistics on the ICE and IME structures and the SPs: (1) the number of elements detected and their type, (2) the number of signature proteins detected, their type and their relationship with a mobile element, and (3) the segments and their content. Examples of statistics include “Total number of segments”, “Number of segments with one element”, “Number of segments with several elements”, “Number of segments with nested elements”, “Number of segments with no element”, “Number of complete ICE (4 types of SP)”, etc. If multiple genomes are stored back to back in a gbff file, the summaries for each genome identifier are displayed back to back, in the order in which they appear in the original genbank file.

Detailed architecture of the ICEscreen output folder, including intermediate files

All the files generated by the different steps of ICEscreen are stored in a folder named ICEscreen_results. Below is the folder’s content:

└── ICEscreen_results/
    └── *{genbank file name analysed}*/
        └── detected_mobile_elements/
            ├── *{genbank file name analysed}*_detected_ME.summary
            ├── *{genbank file name analysed}*_detected_ME.tsv
            ├── *{genbank file name analysed}*_detected_SP_withMEIds.tsv
            ├── standard_genome_annotation_formats/
            |   ├── *{genbank file name analysed}*_icescreen.embl.gz
            |   ├── *{genbank file name analysed}*_icescreen.gb.gz
            |   ├── *{genbank file name analysed}*_icescreen.gff.gz
            |   ├── *{genbank file name analysed}*_source.fa.gz
            |   └── *{genbank file name analysed}*_source.gff.gz
            ├── param.conf.gz
            └── tmp_intermediate_files.tar.gz

Here are the details about each output folder and file:

detected_mobile_elements: This folder includes the final results files, see the section on the main output files above.
standard_genome_annotation_formats: Files for visualizing the ICEscreen results in genome visualization software. Tested with Artemis, JBrowse, and IGV. These files are compressed with gzip and need to be extracted before use. SPs of mobile elements that are to be manually reviewed are not included in these files.
- *{genbank file name analysed}*_icescreen.embl.gz: Mobile elements and signature proteins detected by ICEscreen in EMBL format. The original genome annotations are not in this file.
- *{genbank file name analysed}*_icescreen.gb.gz: Original genbank file modified with the addition of the results of ICEscreen (the mobile elements and signature proteins detected) in genbank format.
- *{genbank file name analysed}*_icescreen.gff.gz: Mobile elements and signature proteins detected are annotated in GFF3 format. The original genome annotations are not in this file.
- *{genbank file name analysed}*_source.fa.gz: FASTA sequence of the whole original genome.
- *{genbank file name analysed}*_source.gff.gz: GFF3 file with annotation extracted from the original genbank file.
param.conf.gz: Parameters used by ICEscreen to generate the results. This file is compressed with gzip and needs to be extracted before use.
tmp_intermediate_files.tar.gz: Temporary intermediate files generated by ICEscreen. This file is compressed with tar and gzip and needs to be extracted before use. Once extracted, the archive yields the following directory structure:

            ├── detection_SP/
            |   ├── Blast_mode/
            |   |   ├── blastp_output/
            |   |   ├── filtered_results/
            |   |   ├── unfiltered_results/
            |   |   └── *{genbank file name analysed}*_blast_SP.tsv
            |   ├── hits_cleaning/
            |   |   ├── proteins_to_remove/
            |   |   ├── *{genbank file name analysed}*_detected_SP_hmm_cleaned.tsv
            |   |   ├── *{genbank file name analysed}*_detected_SP_hmm_cleaned_reannotated.tsv
            |   |   ├── *{genbank file name analysed}*_detected_SP_source.faa
            |   |   └── *{genbank file name analysed}*__detected_SP_source.tsv
            |   ├── HMM_mode/
            |   |   ├── filtered_results/
            |   |   ├── hmmscan_output/
            |   |   ├── unfiltered_results/
            |   |   └── *{genbank file name analysed}*_hmm_SP.tsv
            |   └── *{genbank file name analysed}*_detected_SP.tsv
            └── *{genbank file name analysed}*.faa

detection_SP: Temporary intermediate files generated by the detection of the signature proteins. This directory comprises files generated during the blast and HMM searches at various stages of the cleaning process.
*{genbank file name analysed}*.faa: Temporary intermediate multifasta file of the protein products with annotation extracted from the original genbank file.
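
The gzip-compressed files and the tar.gz archive described above can be extracted with standard command-line tools or, as sketched below, with Python's standard library. The helper names `gunzip` and `untar` are hypothetical, and the demonstration runs on a throwaway file in a temporary directory rather than on real ICEscreen output.

```python
import gzip
import shutil
import tarfile
import tempfile
from pathlib import Path


def gunzip(src):
    """Decompress a .gz file next to itself and return the new path."""
    src = Path(src)
    dest = src.with_suffix("")  # e.g. param.conf.gz -> param.conf
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def untar(src, dest_dir):
    """Extract a .tar.gz archive (e.g. tmp_intermediate_files.tar.gz)."""
    with tarfile.open(src, "r:gz") as archive:
        archive.extractall(dest_dir)
    return Path(dest_dir)


# Demonstration on a made-up file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "param.conf.gz"
    with gzip.open(gz_path, "wb") as handle:
        handle.write(b"demo\n")
    extracted = gunzip(gz_path)
    demo_text = extracted.read_text()

print(extracted.name, repr(demo_text))
```

On real output, `gunzip` would be called on each `.gz` file of interest and `untar` on `tmp_intermediate_files.tar.gz`, with a destination directory of your choice.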