TEXTAL
New! (Sept 2006)
A version of TEXTAL
customized for molecular replacement
Overview
TEXTAL is a program for automated protein model-building based on
pattern recognition techniques. It tries to model (build coordinates
for) regions of an electron-density map by searching a database of
previously-solved maps for the most similar regions it has seen
before, taking the coordinates from those regions, transforming them
into the new map, and concatenating them. It relies on the extraction
of rotation-invariant features that characterize 3D patterns in the
density, as well as local density correlation calculations.
The original program took an electron-density map as input,
and output a PDB file with a partial model. We have expanded the current
version to take a reflection file (.mtz or .hkl) as input, so the user
is no longer required to create a map first. In addition, we have developed
an automated routine for centering the map on a contiguous molecular
region (Findmol).
An important point about resolution: TEXTAL was optimized
for building models in density maps generated at 2.8A. Its
pattern-recognition routines were trained on databases of
electron-density maps at this resolution. If you have
higher-resolution data, that is OK; TEXTAL will simply truncate the
higher-resolution reflections when generating maps internally for
building. TEXTAL can also work on slightly lower (worse) resolution
data. It has been demonstrated to work robustly on datasets
whose resolution limit is between 2.6 and 3.2A. If you have much
higher resolution data, e.g. <2.4A, you might as well use ARP/wARP,
which has been shown to do quite well at building very accurate models
on a number of high-resolution datasets. However, ARP/wARP does not
do so well in medium-resolution ranges, where the data/parameter ratio
(number of reflections to coordinates) is lower. We intentionally
chose 2.8A as a target resolution for TEXTAL because this represents a
very common resolution range (2.5-3.0A) for many MAD datasets
collected at synchrotrons. At this resolution, the challenges of
automatically interpreting an electron-density map are substantial.
There are many ambiguities in the density (individual atoms cannot
be resolved), and often a great deal of noise.
TEXTAL has been developed since around 1998 at Texas A&M University
through a collaboration between Thomas R. Ioerger (Dept. of
Computer Science) and James
C. Sacchettini (Dept. of Biochemistry and Biophysics), with
funding from the NIH, and contributions from many graduate students
and post-docs over the years.
Major Steps
To explain some of our terminology, here are the
major steps TEXTAL goes through:
- FINDMOL - takes a reflection file and identifies a symmetry-unique
region of space containing a contiguous molecule (or set of
tightly-packed sub-units). FINDMOL works by first constructing a
trace of the ASU, and then incrementally re-arranging the pseudo-atoms
using symmetry operators so that they are clustered together in space.
Building in this region facilitates model-building, compared to using
the ASU, which tends to cut molecules into arbitrarily-disconnected
chains. This automation avoids requiring the user to manually center
the map over a molecule.
- scaling the electron density map - For an electron density map
calculated from structure factors, or for a map input by the user, it
might be on an arbitrary scale (in terms of magnitude of density). To
make maps comparable for pattern-matching, it is important to put them
all on a uniform scale. Therefore, we always scale maps before using them
for model building. The scaling routine is roughly similar to creating
a "1 sigma" map (with mean=0 and std dev=1), though it tries not to be
sensitive to things like variations in solvent content.
- tracing the map - creates pseudo-atoms at roughly 0.5A spacing that
represent the medial axis of the density contours
- CAPRA - pattern-recognition algorithm for predicting the backbone of the protein; uses feature-extraction and a neural net to identify likely positions of C-alpha atoms, and uses a variety of heuristics such as preserving geometry of secondary structure to link them together
- LOOKUP - does side-chain modeling by looking through a database of
previously-solved maps for regions with similar patterns of density
surrounding each putative C-alpha; uses feature-matching to select
candidate regions, and then evaluates them by doing local density
correlation to pick the best one. This process takes a long time,
typically on the order of an hour for a medium-sized protein,
depending on the speed of your machine.
- sequence alignment - Although LOOKUP does its best to build
side-chains based on local patterns of density, this often results in a
number of mis-identified amino acids (typically, accuracy is 20-50%).
This is due to a number of reasons: first, some amino acids (like Thr
and Val) look structurally similar and cannot be distinguished in
electron density; second, the density for many side chains might be
blurred due to motion (high B-factors), or corrupted by noise. In an
effort to address this, we have implemented a method that compares the
predicted sequence of fragments (chains) in the model to the true
amino acid sequence, and determines most likely assignments using a
Smith-Waterman-type algorithm (allowing gaps). In some cases, a few
correct amino acids are enough to find where a chain maps to in the
sequence, and then this can be used to detect and fix other incorrect
amino acids in between. The algorithm uses a substitution matrix that
is based on typical ambiguities between amino acid density patterns,
rather than biochemical similarity or evolutionary conservation. The
algorithm also tries to estimate the secondary structure of the
sequence, to map apparent alpha-helices or beta-strands to appropriate
places. The format of the sequence file, if provided, is one-letter
amino acid codes (like FASTA, but without the '>' header line). One
sequence (on separate lines) should be included for each molecule in
the ASU, though you do not need to include duplicates, e.g. for
homodimers, NCS symmetry copies, etc. Work in progress: We are adding
a new feature that can take advantage of coordinates of selenium sites
(if available) to help find methionines, which can then facilitate
making the correct alignment (in some cases). The output PDB from
this stage contains codes in the far-right column that indicate which
chains have been confidently aligned (SEQU), putatively aligned
(PUTA), or unaligned. After determining the alignment, the set of
C-alpha chains is edited with corrected amino acid identities and
fed back into LOOKUP, which will force it to pick the
highest-correlation side-chain match of the right residue type for
each C-alpha. In good maps (low noise), this process (alignment, plus
a second pass through LOOKUP) can often improve the accuracy of amino
acid identities in the model up to ~90%.
- rsfit - real-space refinement of model; fixes some (but not
all) geometry problems, especially along the backbone
- SA - simulated annealing of final model against structure
factors (calls phenix.refine), mainly to estimate R-factor as an
indicator of quality of the model.
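The sequence alignment step can be illustrated with a generic Smith-Waterman local alignment. This is only a minimal sketch, not TEXTAL's implementation: the function name local_align, the gap penalty, and the scores are made up for illustration, and the small set of "look-alike" residue pairs merely stands in for the density-based substitution matrix described above, which is not reproduced here.

```python
# Generic Smith-Waterman local alignment with a linear gap penalty.
# All scores are illustrative placeholders, NOT TEXTAL's actual matrix.

# Hypothetical pairs of residues assumed to look alike in density.
SIMILAR = {frozenset("VT"), frozenset("DN"), frozenset("EQ")}
GAP = -2


def score(a, b):
    """Reward exact matches, give partial credit to look-alike residues."""
    if a == b:
        return 2
    if frozenset((a, b)) in SIMILAR:
        return 1
    return -1


def local_align(predicted, true_seq):
    """Return (best_score, start, end) such that true_seq[start:end] is
    the stretch of the true sequence the predicted fragment aligns to."""
    rows, cols = len(predicted) + 1, len(true_seq) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0,
                H[i - 1][j - 1] + score(predicted[i - 1], true_seq[j - 1]),
                H[i - 1][j] + GAP,  # gap in the true sequence
                H[i][j - 1] + GAP,  # gap in the predicted fragment
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score(predicted[i - 1], true_seq[j - 1]):
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    return best, j, end


true_seq = "MKTAYIAKQRQISFVKSHFSRQ"
# A fragment predicted with Thr where the true residue is Val still
# maps to the right place, because T/V is scored as a near-match.
best, start, end = local_align("SFTKSH", true_seq)
print(best, true_seq[start:end])
```

Once a fragment is located in the true sequence like this, the mismatched positions within the aligned stretch are the candidates for correction in the second LOOKUP pass.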
Expectations
Accuracy:
- backbone - C-alpha atoms typically within 1A RMS
- breaks - due to noise or disordered regions, TEXTAL often outputs multiple chains for a single protein; chain length usually ranges between 10 and 100 amino acids
- in/dels - there are occasionally insertions or deletions (extra or missing C-alphas), but this occurs rarely (once every 30 residues or so)
- percent built - typically up to 90%, depending on quality of map
- amino acid identity
- without sequence alignment - typically 20-50%, due to ambiguities among structurally similar side-chains (e.g. Val and Thr), noise, disorder (surface residues with high B-factor), etc.
- with sequence alignment - typically 50-90% (however, this is very sensitive to the quality of the map, and does not always work reliably on lower-quality maps with a large amount of noise)
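As a rough illustration of how the backbone and identity figures above could be checked against a reference structure, the sketch below computes the RMS distance between paired C-alpha positions and the percentage of matching residue types. The function names are hypothetical (not part of TEXTAL), and the code assumes the atoms have already been paired one-to-one and placed in the same coordinate frame.

```python
import math


def ca_rms(built, reference):
    """RMS distance (in A) between paired C-alpha coordinates.
    Both arguments are equal-length lists of (x, y, z) tuples."""
    if len(built) != len(reference) or not built:
        raise ValueError("need two non-empty, equal-length coordinate lists")
    total = sum(
        (bx - rx) ** 2 + (by - ry) ** 2 + (bz - rz) ** 2
        for (bx, by, bz), (rx, ry, rz) in zip(built, reference)
    )
    return math.sqrt(total / len(built))


def percent_identity(built_seq, reference_seq):
    """Percentage of positions with the same one-letter residue code."""
    if len(built_seq) != len(reference_seq) or not built_seq:
        raise ValueError("need two non-empty, equal-length sequences")
    matches = sum(a == b for a, b in zip(built_seq, reference_seq))
    return 100.0 * matches / len(built_seq)


print(ca_rms([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
print(percent_identity("AVGL", "ATGL"))
```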
Run Time:
- It usually takes on the order of a couple of hours to
build a complete model for a medium sized structure (a few hundred
amino acids), depending on the speed of your machine. The program
also takes a lot of memory, and might start swapping if your machine
does not have enough RAM. Building just a backbone (C-alpha chains) is much
faster, usually taking only a few minutes.
Assumptions
We generally read and write electron-density maps in XPLOR format,
and models in PDB format. We use TER records to separate chains.
In intermediate files, the occupancy and B-factor fields are often used
for other purposes, so be careful. However, in the final model, we try
to set them to reasonable values.
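Because chains are delimited by TER records, an output model can be split into chains with a few lines of Python. This is a minimal sketch that relies only on standard PDB column positions; the function names split_chains and ca_count are hypothetical and not part of TEXTAL.

```python
def split_chains(pdb_lines):
    """Group ATOM records into chains, starting a new chain at each TER."""
    chains, current = [], []
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "ATOM":
            current.append(line)
        elif record == "TER":
            if current:
                chains.append(current)
            current = []
    if current:  # the last chain may lack a trailing TER
        chains.append(current)
    return chains


def ca_count(chain):
    """Number of C-alpha atoms in a chain (atom name is in columns 13-16)."""
    return sum(1 for line in chain if line[12:16].strip() == "CA")


def chain_lengths(path):
    """Return the number of C-alphas in each chain of a PDB file."""
    with open(path) as handle:
        return [ca_count(chain) for chain in split_chains(handle)]
```

For example, chain_lengths("textal-capra.pdb") would list the length of each C-alpha chain in a backbone-only model.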
TEXTAL is optimized for recognizing patterns in 2.8A maps.
It is OK if you have higher-resolution data; TEXTAL will automatically
truncate your structure factors when generating maps internally for
model-building.
TEXTAL only knows how to build peptide structures. If you have other
molecules in the crystal, ranging from solvent molecules to ligands
and co-factors to nucleic acids, TEXTAL will try its best to
interpret them as protein.
It is relatively important to use density-modified phases.
Initial phasing is often not sufficient to produce a map that is clean
enough to build a model by pattern recognition. However, in our
experience, simple things like solvent-flattening go a long way to
producing interpretable density. Keep in mind that TEXTAL builds
by pattern-recognition, so if you cannot possibly visually interpret
something in the density, then TEXTAL probably cannot either.
Currently, TEXTAL does *not* do iterative improvement of phases, i.e.
cycling between model-building and combination of the experimental
phases with phases from the partial model. So don't expect to
give poorly phased data to TEXTAL and have it output a complete and
refined model. This is a future objective. For now, it just builds
what it sees, and thus is dependent on the quality of the density.
Again, if you cannot see some disordered region in density, TEXTAL
probably cannot either.
How to Use TEXTAL
Right now, the main distribution of TEXTAL is through PHENIX
(phenix-online.org).
It takes a long time to download and install, but it is worth it.
Phenix has a great deal of functionality, including cctbx, Phaser, and Resolve,
as well as TEXTAL.
For everything below, you must first source the phenix_env script in
the root directory of the PHENIX installation to set up appropriate
environment variables.
A note on reflection file handling:
- Either .mtz or .hkl format is acceptable.
- Symmetry refers to 6 unit cell params and
space group, e.g. "50.2,60.3,70.4,90,90,90,P422". You do not need
to specify symmetry for .mtz files, since the parameters are contained in them.
You need to give symmetry for .hkl files, though. An alternative is just
to give the name of a .inp file from CNS, and it will extract the parameters
from that.
- Column names for amplitudes, phases, and FOM can often be guessed.
If it is unambiguous (e.g. if there is only one column of each, or they
use obvious names), the code in cctbx can usually pick the right columns
for you, so you might be able to get away with leaving them blank.
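To make the symmetry string format concrete, here is a minimal sketch of how such a string could be split into unit cell parameters and a space group. The function parse_symmetry is a hypothetical helper for illustration; it is not how TEXTAL or cctbx parses the option, and it does not handle the alternative of passing a CNS .inp file name.

```python
def parse_symmetry(text):
    """Split 'a,b,c,alpha,beta,gamma,SG' into (cell, space_group)."""
    fields = [field.strip() for field in text.split(",")]
    if len(fields) != 7:
        raise ValueError("expected 6 unit cell parameters plus a space group")
    cell = tuple(float(field) for field in fields[:6])
    return cell, fields[6]


cell, space_group = parse_symmetry("50.2,60.3,70.4,90,90,90,P422")
print(cell, space_group)
```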
Using TEXTAL Through the Phenix GUI
Under "Tasks" and "Strategies" there is a single entry for model-building
with Textal. This task is designed to do multiple things, depending on
what data you give it and what options you select. In particular, you can:
- build a backbone only (C-alpha chains) - fast, takes only a few minutes
- build side-chains from user-supplied C-alphas - slow, takes on the order of an hour
- build complete model (backbone plus side-chains)
When building a complete model, the user has the option of running sequence
alignment, and doing simulated annealing as a post-processing step.
Input may either be a reflection file (in .mtz or .hkl format), or a
pre-generated map supplied by the user for a region he or she wishes to
build.
No DISPLAY items. Note that in the current implementation of
the Textal task/strategy, there are no objects to be displayed as
output (i.e. the "magnifying glass" button is non-functional). All
output is written to disk as files with a common prefix (e.g. "textal-...").
Using TEXTAL from the Command Line
- textal.build - builds a backbone (C-alpha chains), side-chains, or complete
model, given a reflection file (or user-supplied XPLOR map).
- This is the main routine! It runs all the steps, from
FINDMOL, through CAPRA, LOOKUP, and even post-processing with SA. If
you call "textal.build --reflections=my_prot.mtz...", it will
build a complete model by default, and run it through SA. If you
specify "--capra_only", it will just build the backbone. For
more options (e.g. to specify amino acid sequence), see below.
Running textal.build generates a number of files in your local
directory. The main one to look for at the end is either
textal-capra.pdb (C-alpha chains only) or
textal-model.pdb (complete model), depending on whether you
elect to build side-chains. It also outputs
textal-scaled.xplor and textal-trace.pdb for the region
built, since those are often useful to look at. The prefix 'textal'
can be changed by the user. If you ran it through SA, you might want
to look at textal-refine.pdb and textal-refine.log.
This command-line program calls the same code that underlies the
Textal task in the Phenix GUI. Thus the options are analogous.
However, the command-line version prints out a few more messages and
generates a few more intermediate files and log files. (Don't bother
looking at the log files - they're mostly for developers.)
- textal.makemap - make electron-density maps from a reflection file
- makemap calculates electron density maps (in XPLOR format) from
reflection files with a variety of options. The default is standard
Fourier maps, using amplitudes, phases, and FOM specified by the user.
Patterson maps may also be constructed. A resolution (range) may be
specified for limiting (by truncation) the structure factors used.
Finally, if it is desired for the map to cover a particular molecule,
as opposed to the ASU, then the model may be specified. makemap will
determine borders of the map based on extremes of coordinates in the
model, plus some buffer distance in each direction. The user may also
indicate whether density surrounding the model (e.g. solvent) should
be masked to 0.
- textal.findmol - finds contiguous molecular region (as an
alternative to the ASU).
- Outputs a set of pseudo-atoms representing
the trace, but grouped together representing a contiguous molecule.
Also outputs a map covering this region (with surrounding density masked to 0).
The atoms and region are guaranteed to be "symmetry unique."
- textal.scale - similar to "1 sigma" normalization (mean=0, std dev=1)
- Given a map, scale outputs a similar map but with density values
scaled in a consistent way (roughly like a "1 sigma" map, with 0 mean).
Most of the TEXTAL routines require this scaled version of the user's
input map (which may be scaled arbitrarily).
- textal.trace - skeletonization, kind of like Bones
- textal.join - runs patch and stitch to try to connect chains
- Given a set of C-alpha chains (e.g. from CAPRA), 'join' will
attempt to find additional connections between nearby endpoints,
possibly representing breaks in longer chains. Join combines two
separate methods: patch and stitch. Patch analyzes the scaled
electron density map to look for cases where connectivity between
nearby endpoints would have been established if the contour threshold
(used for tracing) had just been lowered slightly (e.g. from 1.0 to
0.5). Stitch uses a fragment-matching approach to search a database
for stretches of amino acids in other known protein structures that
match the geometry of C-alphas at the ends of two chains, and can be
used to predict missing C-alphas in between. When patch and stitch
are complete, join filters out any remaining chains whose length is
below some threshold, typically 6 residues long, because they are
often unreliable and represent noise.
- textal.run_sa - runs simulated annealing (calls phenix.refine)
- Not a thorough and robust approach to refinement - just a
quick attempt at post-processing the model, with some generic
parameter settings. Main purpose is to compute the R-factor to get a sense
of the quality of the model built by TEXTAL. You might have to do a
more careful job, such as by fine-tuning B-factors (anisotropic?
grouped?), dealing with TLS or bulk-solvent, etc. Probably won't fix
major geometry problems, but could improve model in minor ways. Makes
up its own free set (user can't specify theirs). It currently does
only 3 fixed macro-cycles. Outputs textal-refine.pdb and
textal-refine.log. If it crashes, just use textal-model.pdb. Issues to
consider: which resolution should be used? Which phases (probably
experimental, rather than density-modified)?
- textal.trim_chains - eliminates fragments in one model that don't overlap another
- textal.pdb2seq - extracts 1-letter amino acid codes from PDB
- textal.variance_map - kind of like solvent-masking
- Given an input XPLOR map, it creates another map covering the
same region at the same resolution that contains the local variance in a
5A sphere around each lattice point. Effectively, this distinguishes
regions of protein from solvent (try using a contour level of 1.0), even
in fairly noisy maps, since protein regions have higher variance than
solvent regions. Note, the output map contains continuous values (variance),
not booleans (0 and 1, like an actual mask).
- textal.linearize_trace - strips off branches from trace
- textal.superpose - RMSD-minimizing superposition of one molecule onto another with paired atoms
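The "1 sigma" normalization that textal.scale is compared to can be sketched as follows. This is a plain mean/standard-deviation normalization over a flat list of density values, written only to show the idea; the function name normalize is hypothetical, and the real routine also tries to be insensitive to things like solvent content, which this sketch does not attempt.

```python
import statistics


def normalize(values):
    """Shift and scale density values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("flat map: cannot scale")
    return [(value - mean) / sigma for value in values]


scaled = normalize([1.0, 2.0, 3.0, 4.0])
print(scaled)
```

After this kind of scaling, a contour threshold such as 1.0 means the same thing in every map, which is what makes maps comparable for pattern-matching.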
The command-line script 'textal.build' is intended to mirror what you can
do through the Textal task in the Phenix GUI. It contains many options, as
follows:
> textal.build
usage: textal.build [options]
options:
--reflections=<filename> (default=None)
.mtz or .hkl file
--symmetry=<params or filename> (default=None)
only needed for .hkl files; unit cell params and space group separated by commas, or CNS .inp file
--amplitudes=<column name> (default=None)
e.g. FP; optional, will try to guess from file
--phases=<column name> (default=None)
e.g. PHIB; optional, will try to guess from file
--FOM=<column name> (default=None)
e.g. FOM; optional, will try to guess from file
--resolution=<number or range> (default=2.8)
for truncating SFs to make maps; best to leave set at 2.8, since that is optimal for Textal; can also give range like 2.8-20.0
--threshold=<number> (default=1.0)
contour threshold; affects connectivity
--min_chain_len=<integer> (default=6)
shorter chains are filtered out of model
--input_map=<filename> (default=None)
alternative to giving reflection file
--input_model=<filename> (default=None)
for user-defined C-alpha atoms
--sequence=<filename> (default=None)
single-letter amino-acid codes
--se_sites=<filename> (default=None)
selenium sites, if known
--copies=<integer> (default=1)
expected number of NCS symmetry copies in ASU
--prefix=<string> (default=textal)
base name for output files
--capra_only
build C-alpha chains only, without modeling side-chains (faster)
--preserve_identities
force LOOKUP to use amino acid types in user's input model instead of predicting side-chain identities based on density
--asu
skip Findmol; build model in map of ASU
--no_sa
skip simulated annealing after model-building
--sa_only
just run simulated annealing on input model
Help on command-line programs can be accessed by typing "textal.help"
on the command line. Also, most programs use the convention that if
you try them without any arguments, they will output a usage statement
that describes what arguments and options they take.
Examples:
# with no args, prints out help/usage like above, as do all textal programs
textal.build
# textal can guess the ampl/phase columns because they are unambiguous
# outputs textal-refine.pdb after simulated annealing
textal.build --reflections=if5a.mtz --sequence=if5a.seq
# builds only C-alpha chains
# must supply symmetry info (unit cell params, space group), which are in CNS .inp file
textal.build --reflections=a2u-globulin.hkl --symmetry=a2u-globulin.inp --capra_only
# must give column names here, since there are multiple choices in this mtz
# also note: 2 copies of molecules expected in ASU
textal.build --reflections=czra-ref.mtz --amplitudes=FOBS --phases=PHASE --FOM=FOM --sequence=czra.seq --copies=2
# skips FINDMOL; builds backbone model in ASU without re-centering
# also changes the output file prefix, giving "new_run-trace.pdb", etc. instead of the default "textal-trace.pdb", etc.
textal.build --reflections=if5a.mtz --capra_only --asu --prefix=new_run
# builds and refines side-chains for user-supplied backbone
textal.build --reflections=mvk-dm.mtz --amplitudes=FP --phases=PHIDM --FOM=FOMDM --sequence=mvk.seq --input_model=my-backbone-model.pdb
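When scripting many runs, command lines like these can be assembled programmatically. The sketch below is a hypothetical helper, not something shipped with TEXTAL; it only uses options listed in the usage statement above, and it builds the argument list without executing anything.

```python
def build_command(reflections, sequence=None, copies=1,
                  capra_only=False, prefix="textal"):
    """Assemble a textal.build argument list from documented options.
    Options left at their documented defaults are omitted."""
    cmd = ["textal.build", f"--reflections={reflections}"]
    if sequence:
        cmd.append(f"--sequence={sequence}")
    if copies != 1:
        cmd.append(f"--copies={copies}")
    if prefix != "textal":
        cmd.append(f"--prefix={prefix}")
    if capra_only:
        cmd.append("--capra_only")
    return cmd


cmd = build_command("if5a.mtz", capra_only=True, prefix="new_run")
print(" ".join(cmd))
# To actually run it (requires a sourced PHENIX environment):
#   import subprocess
#   subprocess.run(cmd, check=True)
```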
Calling TEXTAL from Python Scripts
Basically, importing the textal.pytex module will give you access to
many of the functions. For example, to scale a map, you could do:
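from textal.pytex import *
# read an XPLOR map (avoid the name "map", which shadows a Python builtin)
density = emap(file_name="my_map.xplor")
# scale() returns a tuple: the scaled map and a log
(scaled_map, log) = scale(density)
scaled_map.write("my_map_scaled.xplor")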
Note that in this case, as with many of the functions in pytex, a tuple
is returned that contains the output object and a log file with some
(rather arbitrary and uninterpretable) text.
Help on pytex functions can be accessed by typing "textal.pytex_help"
on the command line.
References
The main reference for Textal is:
- Ioerger, T.R. and Sacchettini, J.C. (2003).
The TEXTAL system: Artificial Intelligence techniques for automated protein model building. Methods in Enzymology, 374:244-270.
More references may be found here.
Etc.
More documentation may be found at textal.tamu.edu
You may contact us by sending email to: [email protected]