This section of the algorithm analyses co-localized groups of signature proteins (Relaxase, Coupling Protein, VirB4, Integrase) and finds mobile element structures based on their ICE/IME conjugation modules (see Ambroset et al. 2016). It tries to resolve nested structures (guest/host structures) and signature proteins that could be attributed to multiple structures. The rules implemented for detecting ICE/IME structures are based on the biological nature of ICEs/IMEs and on empirical evidence from over 130 curated structures in Streptococcus genomes analyzed by the DynAMic research team. The three main steps of the algorithm are (i) finding anchors of signature proteins (SPs) from the conjugation module and extending those anchors sequentially and bi-directionally, (ii) where possible, merging distant compatible anchors to find nested structures, and (iii) finding the integrases that belong to the structures.
Finding anchors of signature proteins
The input data is the sequence of detected SPs ordered by their genomic positions resulting from the previous stage (see step “Detection of the ICEs and IMEs signature proteins”).
As the largest ICE observed in Streptococcus is ~100 kb, a preliminary step cuts the sequence into segments whenever there are more than 100 CDSs between two consecutive SPs.
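The segmentation step can be illustrated with a minimal sketch. The SignatureProtein record, its cds_rank field (the rank of the CDS along the genome) and the function names are assumptions made for this example and are not taken from the program itself. The sketch also assumes that "CDSs between two SPs" means the CDSs lying strictly between them.

```python
from dataclasses import dataclass
from typing import List

MAX_CDS_GAP = 100  # threshold stated in the text


@dataclass
class SignatureProtein:
    name: str
    kind: str      # "relaxase", "coupling", "virB4" or "integrase"
    cds_rank: int  # hypothetical field: rank of the CDS along the genome


def cds_between(left: SignatureProtein, right: SignatureProtein) -> int:
    """Number of CDSs lying strictly between two SPs (assumed definition)."""
    return abs(right.cds_rank - left.cds_rank) - 1


def split_into_segments(sps: List[SignatureProtein],
                        max_gap: int = MAX_CDS_GAP) -> List[List[SignatureProtein]]:
    """Cut the ordered SP sequence wherever the gap exceeds max_gap CDSs."""
    segments: List[List[SignatureProtein]] = []
    current: List[SignatureProtein] = []
    for sp in sorted(sps, key=lambda s: s.cds_rank):
        if current and cds_between(current[-1], sp) > max_gap:
            segments.append(current)
            current = []
        current.append(sp)
    if current:
        segments.append(current)
    return segments


sps = [
    SignatureProtein("sp1", "relaxase", 10),
    SignatureProtein("sp2", "coupling", 50),
    SignatureProtein("sp3", "virB4", 151),      # 100 CDSs after sp2: same segment
    SignatureProtein("sp4", "integrase", 300),  # 148 CDSs after sp3: new segment
]
segments = split_into_segments(sps)
print([[sp.name for sp in seg] for seg in segments])
# [['sp1', 'sp2', 'sp3'], ['sp4']]
```

A gap of exactly 100 CDSs does not trigger a cut here, because the text only cuts when there are more than 100 CDSs.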
The steps described below are carried out within each segment. The first step is to find anchors based on consecutive SPs of the conjugation module. Inference of structures whose SPs are not consecutive (i.e. nested structures) is dealt with at a later stage (see merging of compatible anchors below). The SPs from the conjugation module that are used to find the anchors are relaxase, coupling, and virB4. They are quite indicative of an ICE/IME conjugation module when they are found in the genomic vicinity (<100 kb) of at least one other SP. Integrases are not part of the conjugation module and are less specific to ICE/IME structures, as they may also belong to other mobile elements (e.g. prophages for tyrosine or serine integrases, transposons or insertion sequences for DDE integrases). Integrases are always found at the border of the mobile element and are dealt with at a later stage. The sequence of ordered SPs is scanned from left to right and an anchor is created whenever one of the conjugation module’s SPs (relaxase, coupling, or virB4) is found.
The sequence of SPs continues to be scanned from left to right to try to extend the current anchor. The extension stops when adding the next SP would cause the anchor to contain: (i) two SPs separated by more than 100 CDSs, (ii) two virB4 or two coupling proteins, (iii) two relaxases, unless they are adjacent on the genome or separated by a single CDS, (iv) an integrase, since integrases are dealt with at a later stage, or (v) SPs of different superfamilies (e.g. * = ICESt3, ¤ = Tn916). Families and superfamilies of ICEs and IMEs were curated from known elements in Streptococcus (see the section on the detection of SPs). BlastP hits of the same family are preferably grouped within an anchor, while BlastP hits of different superfamilies are separated into different elements. SPs without any family or superfamily information (i.e. HMM hits) can be added to any anchor regardless of the family criterion.
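The five stopping conditions can be expressed as a single predicate. The following is a sketch under stated assumptions, not the program's own code: the SP record, its fields and the can_extend name are hypothetical, the gap in condition (i) is measured to the nearest SP already in the anchor, and condition (iii) is read as "a further relaxase is accepted only if at most one CDS separates it from a relaxase already in the anchor". The preference for grouping hits of the same family is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_CDS_GAP = 100  # threshold stated in the text


@dataclass
class SP:
    kind: str                          # "relaxase", "coupling", "virB4", "integrase"
    cds_rank: int                      # hypothetical: rank of the CDS on the genome
    superfamily: Optional[str] = None  # None for hits without family information (HMM hits)


def cds_between(a: SP, b: SP) -> int:
    """Number of CDSs lying strictly between two SPs (assumed definition)."""
    return abs(a.cds_rank - b.cds_rank) - 1


def can_extend(anchor: List[SP], candidate: SP, max_gap: int = MAX_CDS_GAP) -> bool:
    """Return True if candidate may join a non-empty anchor."""
    # (iv) integrases are handled at a later stage
    if candidate.kind == "integrase":
        return False
    # (i) no more than max_gap CDSs to the nearest SP of the anchor
    if min(cds_between(sp, candidate) for sp in anchor) > max_gap:
        return False
    same_kind = [sp for sp in anchor if sp.kind == candidate.kind]
    # (ii) at most one virB4 and one coupling protein per anchor
    if candidate.kind in ("virB4", "coupling") and same_kind:
        return False
    # (iii) a second relaxase must be adjacent or separated by a single CDS
    if candidate.kind == "relaxase" and same_kind:
        if min(cds_between(sp, candidate) for sp in same_kind) > 1:
            return False
    # (v) superfamilies must agree; SPs without superfamily fit anywhere
    known = {sp.superfamily for sp in anchor if sp.superfamily is not None}
    if candidate.superfamily is not None and known and candidate.superfamily not in known:
        return False
    return True


anchor = [SP("relaxase", 10, "Tn916")]
print(can_extend(anchor, SP("coupling", 12, "Tn916")))   # True
print(can_extend(anchor, SP("coupling", 12, "ICESt3")))  # False: other superfamily
print(can_extend(anchor, SP("virB4", 20)))               # True: no superfamily information
print(can_extend(anchor, SP("relaxase", 13, "Tn916")))   # False: two CDSs in between
```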
Once an anchor has been created and possibly extended from left to right, the algorithm tries to extend it from right to left (same conditions for stopping the extension as from left to right). ICEs and IMEs have no direction on the genome so it is important that the algorithm is independent of the choice of the initial scanning direction.
The steps of finding anchors of SPs from the conjugation module and extending them sequentially and bi-directionally are repeated until the whole sequence of SPs has been scanned. Some SPs may be attributed to two different anchors at this stage (shown in a dotted black box in the example below).
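To illustrate how bi-directional extension can attribute one SP to two anchors, here is a small sketch of the scan. It is an interpretation of the description above and not the program's actual code: the build_anchors function, the dictionary-based SP records and the choice to restart the scan at the first SP that stopped the rightward extension are all assumptions. The compatibility test is passed in as a function, and the one used in the example (no_duplicate_kind) is deliberately simplified to "no integrase, no repeated SP type".

```python
from typing import Callable, Dict, List, Sequence

SPRecord = Dict[str, str]  # hypothetical minimal record: {"name": ..., "kind": ...}


def build_anchors(sps: Sequence[SPRecord],
                  compatible: Callable[[List[SPRecord], SPRecord], bool]
                  ) -> List[List[SPRecord]]:
    """Seed an anchor on each unscanned conjugation SP, extend right, then left."""
    anchors: List[List[SPRecord]] = []
    i = 0
    while i < len(sps):
        if sps[i]["kind"] == "integrase":  # integrases never seed an anchor
            i += 1
            continue
        anchor = [sps[i]]
        right = i + 1
        while right < len(sps) and compatible(anchor, sps[right]):
            anchor.append(sps[right])
            right += 1
        left = i - 1
        # The leftward pass may pick up SPs already placed in an earlier anchor.
        while left >= 0 and compatible(anchor, sps[left]):
            anchor.insert(0, sps[left])
            left -= 1
        anchors.append(anchor)
        i = right  # resume at the SP that stopped the rightward extension
    return anchors


def no_duplicate_kind(anchor: List[SPRecord], candidate: SPRecord) -> bool:
    """Simplified stand-in for the real extension conditions."""
    return (candidate["kind"] != "integrase"
            and all(sp["kind"] != candidate["kind"] for sp in anchor))


sps = [
    {"name": "A", "kind": "coupling"},
    {"name": "B", "kind": "relaxase"},
    {"name": "C", "kind": "coupling"},
    {"name": "D", "kind": "virB4"},
]
anchors = build_anchors(sps, no_duplicate_kind)
print([[sp["name"] for sp in anchor] for anchor in anchors])
# [['A', 'B'], ['B', 'C', 'D']]
```

In this toy input the relaxase B is compatible with both neighbouring anchors, so it appears in each of them. This is the kind of double attribution that later steps try to resolve.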
Merging distant compatible anchors to find nested structures
Following this first step of finding anchors of signature proteins, the algorithm tries to merge distant compatible anchors to find nested structures. The merging is exhaustive: all combinations of anchors are tested. If there are multiple possibilities, priority is given to merging the closest anchors. The algorithm is recursive, so it can detect multiple levels of nesting as well as ICEs/IMEs that are “split apart” into more than two pieces. The conditions for merging anchors are identical to the conditions for extending an anchor.
The merging of distant compatible anchors can sometimes help resolve SPs previously attributed to two different anchors.
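The merging step can be sketched as a loop that repeatedly merges the closest compatible pair of anchors until no compatible pair remains. This is a simplified, greedy illustration and not the program's implementation: the tuple-based SP records, the gap and merge_closest functions and the toy compatibility test (no SP type shared between the two anchors) are assumptions, whereas the real conditions are those used for extending an anchor.

```python
from itertools import combinations
from typing import Callable, List, Tuple

SP = Tuple[str, int]  # hypothetical minimal record: (kind, CDS rank)
Anchor = List[SP]


def gap(a: Anchor, b: Anchor) -> int:
    """Smallest CDS-rank distance between any SP of a and any SP of b."""
    return min(abs(x[1] - y[1]) for x in a for y in b)


def merge_closest(anchors: List[Anchor],
                  can_merge: Callable[[Anchor, Anchor], bool]) -> List[Anchor]:
    """Merge the closest compatible pair, then start over, until nothing merges."""
    anchors = [list(a) for a in anchors]
    while True:
        candidates = [
            (gap(anchors[i], anchors[j]), i, j)
            for i, j in combinations(range(len(anchors)), 2)
            if can_merge(anchors[i], anchors[j])
        ]
        if not candidates:
            return anchors
        _, i, j = min(candidates)  # closest pair first
        anchors[i] = sorted(anchors[i] + anchors[j], key=lambda sp: sp[1])
        del anchors[j]             # j > i, so index i stays valid


def no_shared_kind(a: Anchor, b: Anchor) -> bool:
    """Simplified stand-in for the real merging conditions."""
    return {kind for kind, _ in a}.isdisjoint(kind for kind, _ in b)


host_left = [("relaxase", 10)]
guest = [("relaxase", 20), ("coupling", 22), ("virB4", 25)]
host_right = [("coupling", 40), ("virB4", 43)]

merged = merge_closest([host_left, guest, host_right], no_shared_kind)
print(merged)
# [[('relaxase', 10), ('coupling', 40), ('virB4', 43)],
#  [('relaxase', 20), ('coupling', 22), ('virB4', 25)]]
```

In this toy case the two halves of the host element are reunited around the guest element, which stays a separate structure. Restarting the search after each merge is one way to handle elements split into more than two pieces.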
Finding the integrases that belong to the structures
The last step for inferring the ICE/IME structures is finding the integrases. Integrases are always at the border of the mobile element and can be upstream or downstream of the conjugation module. Integrases that are next to the conjugation module anchor in the SP sequence and within the 100-CDS distance are the best candidates, but the algorithm also considers more distant integrases in the case of nested ICEs/IMEs. Any integrase can be associated with SPs of the conjugation module regardless of families or superfamilies.
Some special cases can occur regarding the integrases. For example, there can be a duo or a trio of serine integrases at adjacent genomic positions or separated by a single CDS. In the case of an ICE, the integrase is oriented facing away from the structure (downstream integrases are found on the + strand, upstream integrases on the - strand). Attributing an integrase to a conjugation module anchor is sometimes difficult, and the algorithm may not be able to choose between upstream and downstream integrases if both are valid candidates.
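The orientation rule for ICE integrases can be written as a short check. This is a sketch only: the Integrase record, the function names and the convention that "upstream" means a lower CDS rank than the conjugation module are assumptions for this example. Distance limits, nested elements and groups of serine integrases are not handled.

```python
from typing import Dict, List, NamedTuple, Union


class Integrase(NamedTuple):
    name: str
    cds_rank: int  # hypothetical: rank of the CDS on the genome
    strand: str    # "+" or "-"


def faces_away(integrase: Integrase, module_start: int, module_end: int) -> bool:
    """True if the integrase is oriented away from the conjugation module."""
    if integrase.cds_rank > module_end:    # downstream of the module
        return integrase.strand == "+"
    if integrase.cds_rank < module_start:  # upstream of the module
        return integrase.strand == "-"
    return False                           # inside the module: not at a border


def candidate_integrases(integrases: List[Integrase], module_start: int,
                         module_end: int) -> Dict[str, Union[List[str], bool]]:
    """List correctly oriented integrases on each side and flag ambiguity."""
    valid = [i for i in integrases if faces_away(i, module_start, module_end)]
    upstream = [i.name for i in valid if i.cds_rank < module_start]
    downstream = [i.name for i in valid if i.cds_rank > module_end]
    return {
        "upstream": upstream,
        "downstream": downstream,
        "ambiguous": bool(upstream and downstream),  # both sides are valid
    }


integrases = [
    Integrase("intA", 5, "-"),   # upstream, faces away
    Integrase("intB", 60, "+"),  # downstream, faces away
    Integrase("intC", 70, "-"),  # downstream, faces the module: rejected
]
result = candidate_integrases(integrases, module_start=20, module_end=50)
print(result)
# {'upstream': ['intA'], 'downstream': ['intB'], 'ambiguous': True}
```

When both sides yield a valid candidate, the sketch only raises a flag and does not pick one, mirroring the case where the algorithm cannot choose.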
Assessing confidence and classifying the ICE/IME structures
The program distinguishes three levels of confidence for SPs within an ICE/IME structure. When there is no ambiguity, the SPs are attributed to a structure with high confidence. Other SPs within a structure are reported as “to manually verify” along with a comment explaining why the algorithm is unsure about the SP’s relationship to the structure. Examples of the “to manually verify” category include singleton conjugation module SPs, conjugation module SPs that could be attributed to different structures, and upstream and downstream candidate integrases that are both valid. Finally, SPs such as singleton integrases are highly ambiguous and do not result in a structure. ICE/IME structures are classified into categories based on the completeness of their conjugation module and on whether they were attributed an integrase. Using R for relaxase, C for coupling protein, V for virB4 and I for integrase, the categories are: ICE (R+C+V+I), IME (R+I or R+C+I with distance <= 10 CDSs), conjugation module (R+C+V), mobilizable element (R+C with distance <= 10 CDSs), partial ICE (any structure that contains at least a V and that is not a complete ICE), and other partial elements (any structure that contains at least two SPs and that does not fall into the previous categories).
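The classification rules can be sketched as a decision function. Several points are assumptions made for this illustration and are not confirmed by the text: the categories are tested in the order in which they are listed and the first match wins, the "distance <= 10 CDSs" criterion is taken to be the distance between the relaxase and the coupling protein, and structures with a single SP are returned as "unclassified". The classify function and its parameters are hypothetical names.

```python
from typing import Optional, Sequence

MAX_RC_GAP = 10  # "distance <= 10 CDSs" from the text


def classify(kinds: Sequence[str], rc_distance: Optional[int] = None) -> str:
    """Classify a structure from its SP types: "R", "C", "V" and "I".

    rc_distance is the assumed number of CDSs between relaxase and coupling
    protein; it is only consulted when both are present.
    """
    has_r, has_c = "R" in kinds, "C" in kinds
    has_v, has_i = "V" in kinds, "I" in kinds
    rc_close = (has_r and has_c and rc_distance is not None
                and rc_distance <= MAX_RC_GAP)

    if has_r and has_c and has_v and has_i:
        return "ICE"
    if has_r and has_i and not has_v and (not has_c or rc_close):
        return "IME"
    if has_r and has_c and has_v:
        return "conjugation module"
    if has_r and has_c and not has_i and rc_close:
        return "mobilizable element"
    if has_v:
        return "partial ICE"
    if len(kinds) >= 2:
        return "other partial element"
    return "unclassified"


print(classify(["R", "C", "V", "I"]))               # ICE
print(classify(["R", "C", "I"], rc_distance=4))     # IME
print(classify(["R", "C", "I"], rc_distance=30))    # other partial element
print(classify(["C", "V", "I"]))                    # partial ICE
```

Because the rules are tested in order, an R+C+V structure without integrase is reported as a conjugation module and not as a partial ICE, even though it contains a V and is not a complete ICE.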