SEQUENCE SLIDER-ML - Help Documentation

Overview

Accurate assignment of amino-acid side chains remains a major bottleneck in macromolecular structure determination, particularly for low-resolution structures, samples derived from natural sources, and proteins exhibiting sequence heterogeneity. Even when global validation metrics are satisfactory, local side-chain ambiguities can propagate errors into functional interpretation and downstream computational analyses. The SEQUENCE SLIDER framework provides a foundation for automated sequence assignment by integrating structural biology data and phylogenetic analysis (PMID: 32133987; PMID: 35104880). However, it must be installed locally and requires large-scale calculations, which limits its accessibility for many experimental biologists.

SEQUENCE SLIDER-ML is an interactive web server that enables users to evaluate, rank, and visualize residue-specific side-chain hypotheses directly from structural data. Users upload a coordinate file (PDB or mmCIF) together with an experimental electron density map. The server analyzes the local structural environment and ranks alternative amino-acid hypotheses for each residue using an internal machine-learning–assisted confidence scoring model.

Results are presented through interactive sequence-level and 3D visualizations (3D views use Moorhen; see its GitHub repository), allowing users to inspect alternative hypotheses in the context of the experimental map. The server highlights low-confidence cases that require expert judgment, providing objective decision support rather than automated assignments.

The machine-learning component is used exclusively as an internal scoring engine. It is trained on curated experimentally solved protein structures and validated using blind, independent test sets, with model interpretability provided through SHAP analysis.

The server complements existing model-building and refinement software and is particularly useful for ambiguous regions, heterogeneous samples, and post-refinement validation.

Usage

Select PDB File:

Load the PDB file to be validated and evaluated. The model should be as complete and refined as possible.

Select MTZ File:

Load an MTZ file containing a single label for either amplitudes or intensities.

Click on RUN SEQUENCE SLIDER:

Each of the 20 natural amino acids is modeled at every residue position, and its theoretical electron density is compared with the experimental density
Local structural analysis is performed
The ML model scores the probability and confidence of each hypothesis

System Log and Status:

Shows progress of calculations
Reports each of the 20 natural amino acids as it is modeled at each residue position
When the run finishes, a link to the results is shown (see Output)

Output

Sequence Confidence Visualization

A figure showing the sequence of each chain, with the confidence of the assignment for every residue present in the PDB file:

Residue Inspection

To evaluate an individual residue, go to the Inspect field, type its single-letter chain identifier in the "Chain" field and its residue number in the "Residue" field, and click "Go":

A Moorhen page opens showing the structure and the side-chain polder omit map (PMID: 28177311).
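To find valid values for the "Chain" and "Residue" fields before inspecting, the chains and residue numbers in the uploaded model can be listed locally. The following is a minimal sketch, not part of the server: the function name `list_residues` and the example coordinates are hypothetical, and it assumes a standard fixed-column PDB file (it reads only ATOM records and ignores insertion codes).

```python
def list_residues(pdb_text):
    """Return (chain, residue number, residue name) tuples in file order."""
    seen = {}  # dict keys keep insertion order and drop repeated atoms
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM  "):
            continue
        chain = line[21]                  # column 22: chain identifier
        number = int(line[22:26])         # columns 23-26: residue number
        name = line[17:20].strip()        # columns 18-20: residue name
        seen.setdefault((chain, number, name), None)
    return list(seen)


# Hypothetical three-atom excerpt, for illustration only.
EXAMPLE_PDB = "\n".join([
    "ATOM      1  N   TRP A  28      11.104  13.207   2.100  1.00 20.00           N",
    "ATOM      2  CA  TRP A  28      12.560  13.207   2.100  1.00 20.00           C",
    "ATOM      3  N   GLY A  29      13.100  14.400   2.500  1.00 20.00           N",
])

for chain, number, name in list_residues(EXAMPLE_PDB):
    print(chain, number, name)
```

Each printed pair of chain and residue number is what would be typed into the "Chain" and "Residue" fields.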

All 20 Natural Modelled Side-Chains:

Correct Assignment Example (28W):

Other Output Files:

FASTA Files: predicted_sequence.fasta - Contains the best-scoring sequence
Prediction Figure (25 residues per chunk): <pdb_id>_logo_chunk_<number>.jpg - Contains the global probabilities for each residue
Prediction Results (Table): <pdb_name>_result_df.cs - Contains PDBid, Chain, ResN (residue number), ResT (residue type), probabilities for each amino acid (A, C, D, ..., Y), Predicted_AA, and Probability
Logs: sequence_slider.log
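
The results table can be post-processed to list the residues that most need manual inspection. The sketch below is illustrative only: it assumes the table is comma-separated text with the column names listed above, the function name `low_confidence_residues` and the 0.5 threshold are hypothetical choices, and the sample rows are invented (the per-amino-acid probability columns are omitted for brevity).

```python
import csv
import io


def low_confidence_residues(table_text, threshold=0.5):
    """Return (Chain, ResN, ResT, Predicted_AA, Probability) for rows whose
    top probability falls below the threshold."""
    flagged = []
    for row in csv.DictReader(io.StringIO(table_text)):
        probability = float(row["Probability"])
        if probability < threshold:
            flagged.append((row["Chain"], int(row["ResN"]), row["ResT"],
                            row["Predicted_AA"], probability))
    return flagged


# Invented sample rows, assuming a comma-separated layout.
SAMPLE_TABLE = """PDBid,Chain,ResN,ResT,Predicted_AA,Probability
1abc,A,28,W,W,0.97
1abc,A,29,L,M,0.41
1abc,A,30,G,G,0.88
"""

for entry in low_confidence_residues(SAMPLE_TABLE):
    print(entry)
```

Residues returned by this filter are natural candidates for the Residue Inspection step described above.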

Color Scheme

The sequence logos and HTML visualization use the following color scheme for amino acids:

Acidic (D, E) #FF0000

Basic (K, R) #0000FF

Basic (H) #0066CC

Polar (S, T, N, Q) #00AA00

Polar (Y) #00CC66

Hydrophobic (A, V, I, L, M, F, W, P) #1f77b4

Glycine (G) #FFAA00

Cysteine (C) #FFFF00
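
For users who want to reproduce this coloring in their own figures, the scheme above can be written as a lookup table. This is a sketch built only from the colors listed here; the names `COLOR_GROUPS`, `AA_COLORS`, and `color_sequence` are hypothetical and do not come from the server code.

```python
# (group label, one-letter codes, hex color) as listed in the color scheme.
COLOR_GROUPS = [
    ("Acidic", "DE", "#FF0000"),
    ("Basic", "KR", "#0000FF"),
    ("Basic (H)", "H", "#0066CC"),
    ("Polar", "STNQ", "#00AA00"),
    ("Polar (Y)", "Y", "#00CC66"),
    ("Hydrophobic", "AVILMFWP", "#1f77b4"),
    ("Glycine", "G", "#FFAA00"),
    ("Cysteine", "C", "#FFFF00"),
]

# Flatten the groups into a per-residue lookup.
AA_COLORS = {aa: color for _, codes, color in COLOR_GROUPS for aa in codes}


def color_sequence(sequence):
    """Return the hex color for each residue; unknown letters map to None."""
    return [AA_COLORS.get(aa) for aa in sequence]


print(color_sequence("WGD"))
```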

Machine-learning Model Performance and Evaluation

Per-class performance of the XGBoost model on a blind validation set, comparing the true residue types with the predicted ones:

Class	Precision	Recall	F1-Score	Support
Performance metrics for the XGBoost model on a blind validation set
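
For readers unfamiliar with the column headings, the sketch below shows how precision, recall, F1-score, and support are conventionally computed per class from true and predicted residue types. It uses the standard definitions and is not the server's own evaluation code; the function name `per_class_metrics` and the six example labels are invented for illustration.

```python
from collections import Counter


def per_class_metrics(true_labels, predicted_labels):
    """Compute precision, recall, F1, and support for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(true_labels, predicted_labels):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[true] += 1   # missed the true class
    support = Counter(true_labels)  # number of true instances per class
    metrics = {}
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        predicted_total = tp[cls] + fp[cls]
        true_total = tp[cls] + fn[cls]
        precision = tp[cls] / predicted_total if predicted_total else 0.0
        recall = tp[cls] / true_total if true_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[cls] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": support[cls]}
    return metrics


# Invented labels: one alanine is mispredicted as glycine.
TRUE = list("AAAGGW")
PRED = list("AAGGGW")

for cls, values in per_class_metrics(TRUE, PRED).items():
    print(cls, values)
```

In this toy case alanine has perfect precision but a recall of 2/3, while glycine has perfect recall but a precision of 2/3, which is the kind of asymmetry the per-class table is meant to expose.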

Contact

For support or inquiries, contact:

Rafael Junqueira Borges:

rjborges@unicamp.br

License

This software is provided for research purposes. Please contact the authors for licensing details.