Detection of ICE and IME signature proteins

The first step of ICEscreen is to search the coding sequences (CDSs) of the genome of interest for signature proteins (SPs) of ICEs and IMEs. SPs are proteins encoded by ICEs and IMEs that perform essential functions. ICEscreen searches for four types of SPs: (1) the integrase, which is required for the excision and integration of the element; (2) the relaxase and (3) the coupling protein, which are necessary for the transfer of the element; and (4) VirB4, which plays an essential role in the placement and functionality of the conjugation pore. ICEscreen uses two methods to search for SPs: BlastP and HMMscan.

BlastP to search for close homologs

ICEscreen improves on the ICE/IME Finder tool (Ambroset et al., 2016; Coluzzi et al., 2017), which uses BlastP to find close homologs of a curated resource of SPs from streptococci. Compared to HMM profiles, BlastP allows a more accurate assignment of hits to a given ICE/IME superfamily or family. The resulting alignments are filtered in two stages in order to remove as many false positives as possible. The first stage applies thresholds to the alignments with the reference sequences in order to remove non-significant results and/or non-homologous proteins. These thresholds were defined by manual curation of known elements in Streptococcus. Four filters are used: a minimum percentage identity, a maximum E-value, a minimum coverage rate, and a minimum and maximum length for the different types of SPs. The second stage is a targeted removal of known false positives using dedicated protein sequences and HMM profiles. Some known false positives still pass the first-stage criteria: in particular, ATP-binding cassette transporter proteins and proteins carrying the DUF853 domain of unknown function can be falsely detected as coupling proteins. These proteins have functional domains that are similar but not identical to those of the desired SPs, and they are removed using HMM profiles specifically designed to identify them (domains COG1126 and COG4586). Likewise, the XerS site-specific tyrosine recombinases require a targeted strategy: CDSs whose alignments have more than 70% identity with ACO17137 are classified as streptococcal XerS and are therefore removed from the list of detected SPs, whereas CDSs whose alignments have more than 70% identity with WP_011825230 are classified as tyrosine integrases.
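The two filtering stages described above can be illustrated with a minimal Python sketch. It is not taken from the ICEscreen code base: the class and function names are hypothetical, the numeric thresholds in EXAMPLE_THRESHOLDS are placeholders (the actual values are not given in this section), and coverage is assumed here to mean the fraction of the CDS covered by the alignment. Only the 70% identity cutoff and the two accessions ACO17137 and WP_011825230 come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlastHit:
    query_id: str        # identifier of the CDS from the genome of interest
    subject_id: str      # accession of the reference protein that was hit
    pct_identity: float  # percentage identity of the alignment
    evalue: float        # E-value of the alignment
    align_len: int       # number of CDS residues covered by the alignment
    query_len: int       # length of the CDS in residues


@dataclass
class Thresholds:
    min_identity: float  # minimum percentage identity
    max_evalue: float    # maximum E-value
    min_coverage: float  # minimum fraction of the CDS covered (assumed definition)
    min_len: int         # minimum CDS length for this SP type
    max_len: int         # maximum CDS length for this SP type


# Placeholder values for illustration only, NOT the thresholds used by ICEscreen.
EXAMPLE_THRESHOLDS = Thresholds(
    min_identity=40.0, max_evalue=1e-5, min_coverage=0.6, min_len=300, max_len=800
)


def passes_stage_one(hit: BlastHit, t: Thresholds) -> bool:
    """First stage: the four alignment-based filters must all pass."""
    coverage = hit.align_len / hit.query_len
    return (
        hit.pct_identity >= t.min_identity
        and hit.evalue <= t.max_evalue
        and coverage >= t.min_coverage
        and t.min_len <= hit.query_len <= t.max_len
    )


def classify_tyrosine_recombinase(hit: BlastHit) -> Optional[str]:
    """Targeted rule from the text: more than 70% identity with a dedicated reference."""
    if hit.pct_identity > 70.0:
        if hit.subject_id == "ACO17137":
            return "XerS"  # streptococcal XerS, removed from the detected SPs
        if hit.subject_id == "WP_011825230":
            return "tyrosine integrase"
    return None  # no targeted rule applies to this hit
```

In such a scheme, a hit would be kept as an SP only if passes_stage_one returns True and classify_tyrosine_recombinase does not return "XerS"; the HMM-based removal of ABC transporter and DUF853 proteins is not shown.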

HMM profiles to search for distant homologs

ICEscreen implements a search for more distant homologs of SPs in order to generalize the tool to Bacillota. It does so by running HMMscan with HMM profiles for each family of SPs. The HMM profiles were either imported from trusted resources or created and curated when needed. The following criteria are used to filter out false positives after the HMM profile search: CDS length (criteria identical to the BlastP approach), i-Evalue < 10⁻⁵, and coverage of the HMM profile domain > 40% (except for PF01719-like, for which > 80% is used). An additional filter is applied specifically to the TcpA coupling protein profile: coverage of the CDS by the domain > 40%. These criteria were set following a manual review of the results on the FirmiData dataset (see below). A second filtration stage, identical to that of the BlastP approach, is then applied.
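As an illustration of how these criteria combine, here is a minimal Python sketch. It is not ICEscreen's implementation: the DomainHit fields, the function names, and the way profiles are recognized by name prefix ("PF01719", "TcpA") are assumptions made for the example, and the length bounds passed to the function are hypothetical. The TcpA filter is interpreted here as the fraction of the CDS covered by the aligned domain.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    cds_id: str      # identifier of the CDS
    profile: str     # name of the HMM profile that was hit
    i_evalue: float  # independent E-value of the domain
    hmm_from: int    # first matched position on the profile (1-based)
    hmm_to: int      # last matched position on the profile (inclusive)
    hmm_len: int     # total length of the profile
    ali_from: int    # first aligned position on the CDS (1-based)
    ali_to: int      # last aligned position on the CDS (inclusive)
    cds_len: int     # length of the CDS in residues


def profile_coverage(h: DomainHit) -> float:
    """Fraction of the HMM profile covered by the domain alignment."""
    return (h.hmm_to - h.hmm_from + 1) / h.hmm_len


def cds_coverage(h: DomainHit) -> float:
    """Fraction of the CDS covered by the domain alignment."""
    return (h.ali_to - h.ali_from + 1) / h.cds_len


def passes_hmm_filters(h: DomainHit, min_len: int, max_len: int) -> bool:
    """Apply the length, i-Evalue and coverage criteria described in the text."""
    if not (min_len <= h.cds_len <= max_len):
        return False
    if not h.i_evalue < 1e-5:
        return False
    # Stricter profile coverage for PF01719-like profiles (naming is assumed).
    min_profile_cov = 0.80 if h.profile.startswith("PF01719") else 0.40
    if not profile_coverage(h) > min_profile_cov:
        return False
    # Extra CDS coverage requirement for the TcpA profile only (naming is assumed).
    if h.profile.startswith("TcpA") and not cds_coverage(h) > 0.40:
        return False
    return True
```

For example, a domain covering 70% of a generic profile would pass the coverage test, whereas the same alignment against a PF01719-like profile would be rejected by the 80% requirement.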