Terminator Evaluation

Transcription termination in bacteria is a crucial regulatory step that ensures accurate gene expression. Terminators are broadly classified into Rho-dependent and Rho-independent types. Rho-dependent terminators require the Rho protein to dissociate RNA polymerase, whereas Rho-independent terminators function autonomously, relying solely on intrinsic sequence features.

Rho-independent terminators are typically characterised by a GC-rich stem-loop flanked by short poly-A and poly-U tracts. The upstream poly-A tract stabilises stem-loop formation, whereas the downstream poly-U tract promotes RNA polymerase release. The stem-loop itself, consisting of a paired stem and a connecting loop, creates a physical barrier that destabilises the RNA–DNA hybrid and facilitates termination.

In contrast, Rho-dependent terminators lack a stable hairpin and poly-U tract. Instead, they contain a C-rich, G-poor, unstructured sequence known as a Rho utilisation (rut) site. The rut site provides the RNA docking platform for the Rho helicase, which loads onto the transcript, translocates downstream using ATP hydrolysis, and ultimately catches up with the paused RNA polymerase. The resulting destabilisation of the transcription complex causes dissociation of the RNA–DNA hybrid and efficient termination.
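The sequence features described above can be illustrated with a small heuristic scan. The following is a minimal sketch, not the feature extraction used for the models in this work: it looks for a GC-rich perfect inverted repeat (a candidate stem) followed by a T-rich tail on the DNA sense strand, and computes a simple C-minus-G score as a rough indicator of rut-like composition. The function names and all thresholds (stem length, loop range, tail length, minimum GC fraction) are illustrative assumptions; real terminator predictors generally allow mismatches in the stem and use folding energies.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_hairpin_tail_candidates(seq, stem_len=6, min_loop=3, max_loop=8,
                                 tail_len=8, min_t=5, min_gc=0.6):
    """Return (stem_start, loop_length) pairs for candidate intrinsic terminators.

    A candidate is a GC-rich stem arm whose exact reverse complement occurs
    after a short loop and is immediately followed by a T-rich tail.
    All thresholds are assumed values chosen for illustration only.
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    last_start = len(seq) - (2 * stem_len + min_loop + tail_len)
    for i in range(last_start + 1):
        left = seq[i:i + stem_len]
        if gc_fraction(left) < min_gc:
            continue
        target = reverse_complement(left)
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop              # start of the right stem arm
            right = seq[j:j + stem_len]
            tail = seq[j + stem_len:j + stem_len + tail_len]
            if len(right) < stem_len or len(tail) < tail_len:
                break                            # longer loops would run off the end
            if right == target and tail.count("T") >= min_t:
                hits.append((i, loop))
    return hits


def c_minus_g_score(seq: str) -> float:
    """(C count - G count) / length; positive values mean C-rich and G-poor."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("C") - seq.count("G")) / len(seq)
```

For example, calling `find_hairpin_tail_candidates` on a sequence that contains `GCCGCC`, a four-base loop, `GGCGGC`, and then a run of eight T bases reports the position of that stem among its candidates, while a sliding-window `c_minus_g_score` could be used to flag C-rich, G-poor stretches.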

To predict both Rho-dependent and Rho-independent transcription terminators, we trained models on experimentally validated terminator and non-terminator sequences encompassing both classes. The initial dataset was unbalanced, with a terminator-to-non-terminator ratio of approximately 1:10. To correct for this, 35,048 non-terminator sequences were randomly sampled to match the 35,048 terminator sequences. Ten independent training datasets were generated in this way, and random forest and XGBoost models were trained on each. The models from the ten datasets were subsequently combined into an ensemble to provide a robust evaluation of terminator sequences. This strategy is intended to let the models capture key sequence and structural determinants of transcription termination while mitigating the effects of class imbalance.
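The resampling and ensembling steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual training pipeline: it uses only the Python standard library, the function names `make_balanced_datasets` and `ensemble_probability` are hypothetical, and each trained model is represented by a plain callable that returns a terminator probability for a sequence. In practice those callables would wrap the fitted random forest and XGBoost classifiers, and averaging is only one of several possible ways to combine them.

```python
import random
from statistics import mean


def make_balanced_datasets(positives, negatives, n_datasets=10, seed=0):
    """Build n_datasets balanced training sets by undersampling negatives.

    Each dataset keeps every positive (label 1) and draws, without
    replacement, an equally sized random subset of negatives (label 0).
    Requires len(negatives) >= len(positives).
    """
    rng = random.Random(seed)          # fixed seed for reproducibility (assumed choice)
    negatives = list(negatives)
    datasets = []
    for _ in range(n_datasets):
        sampled = rng.sample(negatives, k=len(positives))
        data = [(s, 1) for s in positives] + [(s, 0) for s in sampled]
        rng.shuffle(data)              # avoid label-ordered training data
        datasets.append(data)
    return datasets


def ensemble_probability(models, sequence):
    """Average the terminator probabilities returned by each model callable."""
    return mean(model(sequence) for model in models)
```

Drawing a fresh negative subset for each dataset means that, across the ten datasets, the ensemble sees a larger share of the non-terminator sequences than any single balanced model would.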