Description of the output files

Main Output files

ICEscreen generates four main output files, which can be found in the directory ICEscreen_results/results/<genbank name>/detection_ME/:

  • *_detected_SP_withMEIds.tsv: List of the signature proteins detected by the tool and their possible assignment to an ICE or IME element.
  • *_detected_ME.tsv: List of the ICE and IME elements detected by the tool, including information about the signature proteins they contain.
  • *_detected_ME.summary: Statistical summary by category of result.
  • *_detected_ME.log: Parameters used by ICEscreen to generate the results.

The file *_detected_SP_withMEIds.tsv

This file lists the signature proteins detected by the tool and their possible assignment to an ICE or IME element. It is a tab separated table with a header. Each line represents a signature protein detected by ICEscreen.

Description of the columns

Information associated with the ICE/IME element detected by ICEscreen
N° | Column name | Description
1 | ICE_IME_id | ICE / IME identifier attributed to a signature protein (SP) associated with a structure with high confidence. In other words, all SPs with the same ICE_IME_id belong to the same ICE / IME structure according to the software.
2 | ICE_IME_id_need_manual_curation | ICE / IME identifier attributed to a signature protein (SP) associated with a structure with lower confidence. In other words, this SP could be attributed to the ICE / IME structure, but this requires manual curation. This identifier is identical to the one used in the column “ICE_IME_id”.
3 | Segment_number | Number of the segment in which the signature protein is located. One of the first steps of the algorithm is to cut the ordered list of SPs into smaller segments in which two subsequent SPs cannot be separated by more than 100 CDSs.
4 | Comments_ICE_IME_structure | Information about special decisions made by the algorithm when attributing (or not attributing) the SP to a structure. An example of a comment: “The SP NP_XXXXXX is not a VirB4, constitutes a conjugaison module by itself, and no integrase has been attributed to the element 3, please manually check.”.
5 | Is_hit_blast | Indicates whether the signature protein was detected by BlastP (1: Yes, 0: No).
6 | Is_hit_HMM | Indicates whether the signature protein was detected by HMM (1: Yes, 0: No).
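As an illustration of the segmentation step mentioned for Segment_number, here is a minimal Python sketch. It is not ICEscreen's actual code: the function name `split_into_segments`, the input (a list of CDS_num values of the detected SPs), and the way the gap is counted (number of CDSs lying strictly between two SPs) are assumptions made for the example.

```python
def split_into_segments(sp_cds_nums, max_gap=100):
    """Group the CDS numbers of detected SPs into segments.

    Assumption: the gap between two subsequent SPs is the number of CDSs
    lying strictly between them (current - previous - 1). ICEscreen may
    count the gap differently.
    """
    segments = []
    for cds_num in sorted(sp_cds_nums):
        # Extend the current segment while the gap stays within max_gap.
        if segments and cds_num - segments[-1][-1] - 1 <= max_gap:
            segments[-1].append(cds_num)
        else:
            # Otherwise start a new segment with this SP.
            segments.append([cds_num])
    return segments


# Made-up CDS numbers: the last SP is too far away and opens a new segment.
print(split_into_segments([5, 20, 121, 300]))
```

With these made-up values, the first three SPs end up in one segment and the SP at CDS 300 in a second one.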
Information extracted from the GenBank annotation
N° | Column name | Description
7 | CDS_num | CDS number of the signature protein on the genome when all the CDSs of the genome are ordered according to their start position.
8 | Genome_accession | Identifier of the genome that harbours the CDS.
9 | Genome_accession_rank | Rank of the genome that harbours the CDS in the original genbank file. If multiple genomes are stored back to back in a gbff file, the rank of the first record is 1, the rank of the second record is 2, and so on.
10 | CDS_locus_tag | The locus tag (unique identifier) of the CDS.
11 | CDS_protein_id | The protein id associated with the CDS. Different CDSs can have the same protein id if they encode the same protein.
12 | CDS_strand | Strand of the gene encoding the CDS.
13 | CDS_start | CDS start position.
14 | CDS_end | CDS end position.
15 | CDS_length | CDS length (in aa).
16 | Is_pseudo | Indicates whether the CDS is annotated as a pseudogene in the GenBank annotation.
Information on BlastP results
Information on raw BlastP results and BlastP results enriched with ICEscreen annotation
N° | Column name | Description
17 and 18 | CDS_Protein_type and CDS_Protein_type_blast | Protein type (genbank annotation) of the CDS (VirB4, Coupling protein, Relaxase, DDE transposase, Tyrosine integrase, Serine integrase).
19 | Description_of_blast_most_similar_ref_SP | Description (genbank annotation) of the reference SP most similar to the CDS.
20 | Id_of_blast_most_similar_ref_SP | Protein identifier of the reference SP most similar to the CDS (BlastP search).
21 | Length_of_blast_most_similar_ref_SP | Size of the reference SP most similar to the CDS.
22 | Blast_ali_length | BlastP alignment length.
23 | Blast_ali_start_CDS | Start (in bp) of the alignment with regard to the CDS protein sequence.
24 | Blast_ali_end_CDS | Stop (in bp) of the alignment with regard to the CDS protein sequence.
25 | Blast_ali_start_Query_blast | Start (in bp) of the alignment with regard to the protein sequence of the reference SP most similar to the CDS.
26 | Blast_ali_end_Query_blast | Stop (in bp) of the alignment with regard to the protein sequence of the reference SP most similar to the CDS.
27 | Blast_ali_identity_perc | BlastP alignment identity percentage.
28 | Blast_ali_E-value | E-value of the BlastP alignment.
29 | Blast_ali_bitscore | BlastP alignment bitscore.
30 | CDS_coverage_blast | BlastP alignment coverage for the CDS of the analysed genome.
31 | Blast_ali_coverage_most_similar_ref_SP | BlastP alignment coverage for the most similar reference SP used as a blast query.
32 | Protein_type_of_blast_most_similar_ref_SP | Protein type of the most similar reference SP used as a blast query (VirB4, Coupling protein, Relaxase, integrase).
33 | Associated_element_type_of_blast_most_similar_ref_SP | Indicates whether the mobile element of the most similar reference SP used as a blast query is an ICE or an IME.
34 | ICE_superfamily_of_most_similar_ref_SP | ICE superfamily of the mobile element which harbors the protein used as query for the BlastP search.
35 | ICE_family_of_most_similar_ref_SP | ICE family of the mobile element which harbors the protein used as query for the BlastP search.
36 | IME_superfamily_of_most_similar_ref_SP | IME superfamily of the mobile element which harbors the protein used as query for the BlastP search. This family matches the IME relaxase family.
37 | Relaxase_family_domain_of_most_similar_ref_SP | Family of the domain of the relaxase of the mobile element which harbors the protein used as query for the BlastP search.
38 | Relaxase_family_MOB_of_most_similar_ref_SP | MOB of the relaxase of the mobile element which harbors the protein used as query for the BlastP search (if the mobile element is an IME).
39 | Coupling_type_of_most_similar_ref_SP | Type of the coupling protein of the mobile element which harbors the protein used as query for the BlastP search (if the mobile element is an IME).
40 | False_positives | Indicates whether ICEscreen considers the signature protein to be a false positive.
41 | SP_blast_validation | Indicates whether the prediction of the signature protein by BlastP is validated.
42 | Use_annotation | Indicates whether the expert DINAMIC annotation of the most similar reference SP is transferred to this CDS (based on percentage identity).
Information on the results obtained with HMM
N° | Column name | Description
43 | Protein_type_of_matching_HMM_profile | Type of the protein associated with the HMM profile used for the HMM search (VirB4, Coupling protein, Relaxase, DDE transposase, Tyrosine integrase, Serine integrase).
44 | Description_of_matching_HMM_profile | Description of the HMM profile with the best score (best HMM result).
45 | Profile_name | Name of the HMM profile with the best score (best HMM result).
46 | Length_of_matching_HMM_profile | Length of the HMM profile with the best score (best HMM result).
47 | HMM_ali_i-Evalue | i-Evalue of the best HMM result (the i-Evalue for HMM is the equivalent of the E-value for BlastP).
48 | HMM_ali_E-value | E-value of the best HMM result (taking into account multi-alignments).
49 | HMM_ali_Score | Score of the best HMM result.
50 | HMM_ali_Bias | Bias of the best HMM result.
51 | HMM_ali_Global_score | Global score of the best HMM result.
52 | HMM_ali_Global_bias | Global bias of the best HMM result.
53 | HMM_coverage | Percentage coverage of the HMM profile found in the alignment of the best HMM result (proportion of the HMM profile found in the HMM alignment).
54 | CDS_coverage_hmm | Percentage coverage of the CDS found in the alignment of the best HMM result (proportion of the CDS found in the HMM alignment).
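The columns described above can be read with any tabular parser. The following Python sketch groups signature proteins by ICE_IME_id. It assumes the file is tab separated (as the .tsv extension suggests) and that an empty ICE_IME_id cell means the SP was not assigned to an element with high confidence; both points should be checked against a real output file. The function name `group_sp_by_element` and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict


def group_sp_by_element(handle):
    """Map each ICE_IME_id to the locus tags of its signature proteins."""
    reader = csv.DictReader(handle, delimiter="\t")
    elements = defaultdict(list)
    for row in reader:
        element_id = (row["ICE_IME_id"] or "").strip()
        # Assumption: an empty cell means the SP is not assigned to an element.
        if element_id:
            elements[element_id].append(row["CDS_locus_tag"])
    return dict(elements)


# Made-up rows using only three of the documented columns.
sample = (
    "ICE_IME_id\tCDS_locus_tag\tIs_hit_blast\n"
    "1\tTAG_0001\t1\n"
    "1\tTAG_0002\t1\n"
    "\tTAG_0500\t0\n"
)
print(group_sp_by_element(io.StringIO(sample)))
```

For a real run, the `io.StringIO` object would be replaced by an open file handle on the *_detected_SP_withMEIds.tsv file.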

File *_detected_ME.tsv

This file lists the ICE, IME, and partial element structures detected by the tool, including the SPs they contain. It is a tab separated table with a header. Information in this file is similar to the output file _withICEIMEIds.tsv (option -m) but centered around a list of ICE / IME structures instead of a list of SPs. The identifiers for the CDSs reported in this file are the locus tag if present, or [Protein identifier]-[CDS start position] if not (in the case of a gbff file with multiple genome records, the genome identifier is added like so: [Genome identifier]-[Protein identifier]-[CDS start position]).
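The CDS identifier rule described above can be sketched in Python as follows. The function name `cds_identifier` and its parameters are hypothetical, and the sketch assumes that the genome identifier is only prepended to the fallback form (when no locus tag is available), which is one possible reading of the rule.

```python
def cds_identifier(locus_tag, protein_id, cds_start, genome_id=None):
    """Build a CDS identifier following the rule described in the text.

    Assumption: the genome identifier is only added to the fallback form
    [Protein identifier]-[CDS start position], not to the locus tag.
    """
    if locus_tag:
        return locus_tag
    parts = [protein_id, str(cds_start)]
    if genome_id:
        # Multi-record gbff file: prefix with the genome identifier.
        parts.insert(0, genome_id)
    return "-".join(parts)


# Made-up identifiers for illustration.
print(cds_identifier("TAG_0001", "WP_000001", 1500))
print(cds_identifier(None, "WP_000001", 1500))
print(cds_identifier(None, "WP_000001", 1500, genome_id="NC_000001"))
```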

Description of the columns

N° | Column name | Description
1 | ICE_IME_id | Identifier of the mobile element detected by ICEscreen.
2 | Segment_number | Segment number in which the element is located.
3 | Genome_accession | Identifier of the genome that harbours the CDS.
4 | Category_of_element | Type of the mobile element defined by its composition in total SPs (transfer module + integrase): ICE, IME, complete, partial, etc.
5 | Category_of_integrase | Number and type of integrase(s).
6 | Host_ICE_IME_ids | The other mobile element that hosts this mobile element.
7 | Guest_ICE_IME_ids | List of other mobile elements that are integrated in this mobile element.
8 | Colocalized_ICE_IME_ids | Other mobile elements located in the same segment. The mobile elements located in the same segment are not necessarily in accretion.
9 | ICEline_format | Description of the element’s SPs in the ICEline format. The ICEline format is a single-line description of the SP content and the number of CDSs between each SP. See the section on ICEline format for more details.
10 | ICE_consensus_superfamily_SP_conj_module | ICE superfamily, if there is a consensus between the different SPs of the conjugation module.
11 | ICE_consensus_family_SP_conj_module | ICE family, if there is a consensus between the different SPs of the conjugation module.
12 | IME_relaxase_family_domains_blast | List of the IME relaxase family domains.
13 | HMM_family_SP_conj_module | Family of the SPs of the conjugation module as reported by SP hits with the HMM profiles. This information is displayed only if no family was reported by BlastP hits. In other words, this is a listing of the type of HMM profile used for relaxase and / or coupling protein and / or VirB4.
14 | Integrase_upstream | Integrase(s) whose genomic position is located before the SPs of the conjugation module of the mobile element.
15 | Integrase_downstream | Integrase(s) whose genomic position is located after the SPs of the conjugation module of the mobile element.
16 | Relaxase | Relaxase of the mobile element.
17 | Coupling_protein | Coupling protein of the mobile element.
18 | VirB4 | VirB4 of the mobile element.
19 | List_SP_ordered_genomic_position | List of the signature proteins reported in the columns above, ordered according to their position on the genome and separated by commas.
20 | Start_of_most_upstream_SP | The start of the most upstream signature protein. This is not to be mistaken for the start of the element, however.
21 | Stop_of_most_downstream_SP | The stop of the most downstream signature protein. This is not to be mistaken for the stop of the element, however.
22 | Other_potential_SP_conj_module_need_manual_curation_and_review | List of other signature proteins of the conjugation module that can potentially be associated with the transfer module of the mobile element; manual verification is required.
23 | Other_potential_integrase_need_manual_curation_and_review | List of other integrase(s) that can potentially be associated with the integration module of the mobile element; manual verification is required.
24 | Comments_regarding_structure | Detailed explanation of why some signature proteins need manual curation or could not be associated with the element with high confidence. Other types of comments include the rationale of the algorithm for special-case situations.
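As a usage illustration for this file, the Python sketch below lists the elements for which at least one of the two manual curation columns is filled in. It assumes that an empty cell means there is nothing to review; the function name `elements_needing_review`, the sample rows, and the category labels in them are made up for the example.

```python
import csv
import io

CURATION_COLUMNS = (
    "Other_potential_SP_conj_module_need_manual_curation_and_review",
    "Other_potential_integrase_need_manual_curation_and_review",
)


def elements_needing_review(handle):
    """Return (ICE_IME_id, Category_of_element) for elements to review."""
    reader = csv.DictReader(handle, delimiter="\t")
    flagged = []
    for row in reader:
        # Assumption: a non-empty curation cell means manual review is needed.
        if any((row.get(col) or "").strip() for col in CURATION_COLUMNS):
            flagged.append((row["ICE_IME_id"], row["Category_of_element"]))
    return flagged


# Made-up rows using only four of the documented columns.
sample_me = (
    "ICE_IME_id\tCategory_of_element\t" + "\t".join(CURATION_COLUMNS) + "\n"
    "1\texample category A\t\t\n"
    "2\texample category B\tTAG_0900\t\n"
)
print(elements_needing_review(io.StringIO(sample_me)))
```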

File *_detected_ME.summary

This file summarizes the main parameters and statistics regarding the ICE / IME structures and the SPs. The parameters section includes the maximum number of CDSs between subsequent SPs in the same segment and the maximum number of CDSs between subsequent SPs for an IME. The general statistics cover (1) the number of elements detected and their type, (2) the number of signature proteins detected, their type and their relationship with a mobile element, and (3) the segments and their content. Examples of statistics include “Total number of segments”, “Number of segments with one element”, “Number of segments with several elements”, “Number of segments with nested elements”, “Number of segments with no element”, “Number of complete ICE (4 types of SP)”, etc. If multiple genomes are stored back to back in a gbff file, the summaries for each genome identifier are displayed back to back according to their order in the original genbank file.

File *_detected_ME.log

This file contains the detailed internal parameters and algorithms (step by step) used by ICEscreen to generate the results.

Detailed architecture of the ICEscreen output folder, including intermediate files

All the files generated by the different steps of ICEscreen are stored in a folder named ICEscreen_results. Below is the folder’s content:

└── ICEscreen_results/
    ├── faa/
    |   └── *.faa
    └── results/
        └── <genbank name>/
            ├── detection_ME/
            |   ├── *_detected_ME.log
            |   ├── *_detected_ME.summary
            |   ├── *_detected_ME.tsv
            |   └── *_detected_SP_withMEIds.tsv
            ├── detection_SP/
            |   ├── Blast_mode/
            |   |   ├── blastp_output/
            |   |   ├── filtered_results/
            |   |   ├── unfiltered_results/
            |   |   └── *_blast_SP.tsv
            |   ├── hits_cleaning/
            |   |   ├── proteins_to_remove/
            |   |   ├── *_detected_SP_hmm_cleaned.tsv
            |   |   ├── *_detected_SP_hmm_cleaned_reannotated.tsv
            |   |   ├── *_detected_SP_source.faa
            |   |   └── *__detected_SP_source.tsv
            |   ├── HMM_mode/
            |   |   ├── filtered_results/
            |   |   ├── hmmscan_output/
            |   |   ├── unfiltered_results/
            |   |   └── *_hmm_SP.tsv
            |   └── *_detected_SP.tsv
            ├── visualization_files/
            |   ├── *_icescreen.embl
            |   ├── *_icescreen.gb
            |   ├── *_icescreen.gff
            |   ├── *_source.fa
            |   └── *_source.gff
            └── icescreen.conf

Here are the details of each output folder and file:

  • faa
    • *.faa: Multifasta of the protein products annotated in the genbank files.
  • results: All results of ICEscreen.
    • detection_ME: This folder includes the final result files; see the section on the main output files.
    • detection_SP: Files generated by the first step of the pipeline, which detects the signature proteins. The signature proteins detected in this step are stored in the files *_detected_SP.tsv. Each row corresponds to a detected signature protein.
    • visualization_files: Files for visualizing the ICEscreen results in genome visualization software. Tested with Artemis, JBrowse, and IGV. Only SPs of mobile elements that are not to be manually reviewed can be visualized.
      • *_source.fa: FASTA sequence of the genome in genbank file.
      • *_source.gff: GFF3 file with annotation extracted from genbank file.
      • *_icescreen.gff: Results of ICEscreen: the mobile elements and signature proteins detected are annotated in GFF3 format.
      • *_icescreen.embl: Results of ICEscreen: the mobile elements and signature proteins detected are annotated in EMBL format. This format is recommended with Artemis.
      • *_icescreen.gb: Original genbank file modified with the addition of the results of ICEscreen (the mobile elements and signature proteins detected) in genbank format.
    • icescreen.conf: Config file with absolute paths of all inputs and outputs files used by the ICEscreen pipeline. Parameters of each step are also provided.
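
Given the folder layout above, the four main output files of each analysed genbank file can be collected with a short Python sketch such as the one below. It assumes the layout shown in the tree (results/<genbank name>/detection_ME/); the function name `find_main_outputs` is hypothetical and is not part of ICEscreen.

```python
from pathlib import Path

# File name patterns of the four main output files.
MAIN_OUTPUT_PATTERNS = (
    "*_detected_SP_withMEIds.tsv",
    "*_detected_ME.tsv",
    "*_detected_ME.summary",
    "*_detected_ME.log",
)


def find_main_outputs(results_root):
    """Map each <genbank name> folder to the main output files it contains.

    Assumption: results_root is the ICEscreen_results folder and follows
    the layout results/<genbank name>/detection_ME/.
    """
    root = Path(results_root)
    found = {}
    for detection_dir in sorted(root.glob("results/*/detection_ME")):
        files = []
        for pattern in MAIN_OUTPUT_PATTERNS:
            files.extend(sorted(detection_dir.glob(pattern)))
        found[detection_dir.parent.name] = files
    return found


if __name__ == "__main__":
    for genbank_name, files in find_main_outputs("ICEscreen_results").items():
        print(genbank_name, [path.name for path in files])
```

If the folder does not exist or contains no results, the function simply returns an empty dictionary.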