TEXTAL

New! (Sept 2006) A version of TEXTAL customized for molecular replacement

Overview

TEXTAL is a program for automated protein model-building based on pattern-recognition techniques. To model (build coordinates for) a region of an electron-density map, it searches a database of previously solved maps for the most similar regions it has seen before, takes the coordinates from those regions, transforms them into the new map, and concatenates them. It relies on the extraction of rotation-invariant features that characterize 3D patterns in the density, as well as local density correlation calculations.

The original program took an electron-density map as input, and output a PDB file with a partial model. We have expanded the current version to take a reflection file (.mtz or .hkl) as input, so the user is no longer required to create a map first. In addition, we have developed an automated routine for centering the map on a contiguous molecular region (Findmol).

An important point about resolution: TEXTAL was optimized for building models in density maps generated at 2.8A. Its pattern-recognition routines were trained on databases of electron-density maps at this resolution. If you have higher-resolution data, that is OK; TEXTAL will simply truncate the higher-resolution reflections when generating maps internally for building. TEXTAL can also work on slightly lower (worse) resolution data. It has been demonstrated to work robustly on datasets whose upper resolution limit is between 2.6 and 3.2A. If you have much higher resolution data, e.g. <2.4A, you might as well use ARP/wARP, which has been shown to build very accurate models on a number of high-resolution datasets. However, ARP/wARP does not do as well in medium-resolution ranges, where the data/parameter ratio (number of reflections to coordinates) is lower. We intentionally chose 2.8A as a target resolution for TEXTAL because it falls in a very common resolution range (2.5-3.0A) for MAD datasets collected at synchrotrons. At this resolution, the challenges of automatically interpreting an electron-density map are substantial: there are many ambiguities in the density (individual atoms cannot be seen), and often a great deal of noise.
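
The truncation of higher-resolution reflections can be pictured with a small sketch. This is not TEXTAL's code: it assumes an orthorhombic cell (all angles 90 degrees) so that the d-spacing formula stays simple, and the names `d_spacing` and `truncate_reflections` are made up for illustration.

```python
import math

def d_spacing(hkl, cell):
    """Return the d-spacing (in Angstroms) of reflection (h, k, l).

    Assumes an orthorhombic cell given as (a, b, c) in Angstroms, where
    1/d^2 = h^2/a^2 + k^2/b^2 + l^2/c^2. The (0, 0, 0) term is not handled.
    """
    h, k, l = hkl
    a, b, c = cell
    inv_d_sq = (h / a) ** 2 + (k / b) ** 2 + (l / c) ** 2
    return 1.0 / math.sqrt(inv_d_sq)

def truncate_reflections(reflections, cell, d_min=2.8):
    """Keep only reflections whose d-spacing is at least d_min."""
    return [hkl for hkl in reflections if d_spacing(hkl, cell) >= d_min]

# Hypothetical cell and Miller indices, for illustration only.
cell = (50.0, 60.0, 70.0)
reflections = [(1, 0, 0), (10, 10, 10), (20, 0, 0), (0, 0, 30)]
kept = truncate_reflections(reflections, cell)
```

Here the two reflections with d-spacing below 2.8A, (20, 0, 0) and (0, 0, 30), are dropped, which is the sense in which higher-resolution data is "OK": it is simply not used.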

TEXTAL has been developed since around 1998 at Texas A&M University through a collaboration between Thomas R. Ioerger (Dept. of Computer Science) and James C. Sacchettini (Dept. of Biochemistry and Biophysics), with funding from the NIH, and contributions from many graduate students and post-docs over the years.

Major Steps

To explain some of our terminology, here are the major steps TEXTAL goes through:

FINDMOL - takes a reflection file and identifies a symmetry-unique region of space containing a contiguous molecule (or set of tightly-packed sub-units). FINDMOL works by first constructing a trace of the ASU, and then incrementally re-arranging the pseudo-atoms using symmetry operators so that they are clustered together in space. This facilitates model-building, rather than using the ASU, which tends to cut molecules into arbitrarily-disconnected chains. This automation avoids requiring the user to manually center the map over a molecule.
scaling the electron density map - An electron density map, whether calculated from structure factors or input by the user, might be on an arbitrary scale (in terms of magnitude of density). To make maps comparable for pattern-matching, it is important to put them all on a uniform scale. Therefore, we always scale maps before using them for model building. The scaling routine is roughly similar to creating a "1 sigma" map (with mean=0 and std dev=1), though it tries not to be sensitive to things like variations in solvent content.
tracing the map - creates pseudo-atoms at roughly 0.5A spacing that represent the medial axis of the density contours
CAPRA - pattern-recognition algorithm for predicting the backbone of the protein; uses feature-extraction and a neural net to identify likely positions of C-alpha atoms, and uses a variety of heuristics such as preserving geometry of secondary structure to link them together
LOOKUP - does side-chain modeling by looking through a database of previously-solved maps for regions with similar patterns of density surrounding each putative C-alpha; uses feature-matching to select candidate regions, and then evaluates them by doing local density correlation to pick the best one. This process takes a long time, typically on the order of an hour for a medium-sized protein, depending on the speed of your machine.
sequence alignment - Although LOOKUP does its best to build side-chains based on local patterns of density, this often results in a number of mis-identified amino acids (typically, accuracy is 20-50%). There are several reasons for this: first, some amino acids (like Thr and Val) look structurally similar and cannot be distinguished in electron density; second, the density for many side chains might be blurred due to motion (high B-factors), or corrupted by noise. To address this, we have implemented a method that compares the predicted sequence of fragments (chains) in the model to the true amino acid sequence, and determines the most likely assignments using a Smith-Waterman-type algorithm (allowing gaps). In some cases, a few correct amino acids are enough to find where a chain maps to in the sequence, and then this can be used to detect and fix other incorrect amino acids in between. The algorithm uses a substitution matrix that is based on typical ambiguities between amino acid density patterns, rather than biochemical similarity or evolutionary conservation. The algorithm also tries to estimate the secondary structure of the sequence, to map apparent alpha-helices or beta-strands to appropriate places. The format of the sequence file, if provided, is one-letter amino acid codes (like FASTA, but without the header line '>'). One sequence (on separate lines) should be included for each molecule in the ASU, though you do not need to include duplicates, e.g. for homodimers, NCS symmetry copies, etc. Work in progress: We are adding a new feature that can take advantage of coordinates of selenium sites (if available) to help find methionines, which can then facilitate making the correct alignment (in some cases). The output PDB from this stage contains codes in the far-right column that indicate which chains have been confidently aligned (SEQU), putatively aligned (PUTA), or unaligned.
After determining the alignment, the set of C-alpha chains is edited with corrected amino acid identities and fed back into LOOKUP, which forces it to pick the highest-correlation side-chain match of the right residue type for each C-alpha. In good maps (low noise), this process (alignment, plus a second pass through LOOKUP) can often improve the accuracy of amino acid identities in the model up to ~90%.
rsfit - real-space refinement of model; fixes some (but not all) geometry problems, especially along the backbone
SA - simulated annealing of final model against structure factors (calls phenix.refine), mainly to estimate R-factor as an indicator of quality of the model.
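
To make the sequence alignment step more concrete, here is a minimal sketch of a Smith-Waterman-type local alignment with a density-based substitution score. It is only an illustration of the idea, not TEXTAL's implementation: the look-alike groups in `LOOKALIKE`, the score values, and the linear gap penalty are all invented for this example, and the secondary-structure and selenium-site information mentioned above is ignored.

```python
# Hypothetical groups of residues that look alike in density (illustration only).
LOOKALIKE = [set("VT"), set("DN"), set("EQ"), set("FY")]

def density_score(a, b):
    """Score aligning predicted residue a with true residue b."""
    if a == b:
        return 2
    if any(a in group and b in group for group in LOOKALIKE):
        return 1   # plausible confusion between similar density patterns
    return -1

def smith_waterman(predicted, true_seq, score, gap=-2):
    """Find the best local alignment of a predicted fragment to true_seq.

    Returns (best_score, start, end); the best-scoring local match covers
    true_seq[start:end].
    """
    rows, cols = len(predicted) + 1, len(true_seq) + 1
    H = [[0] * cols for _ in range(rows)]
    # origin[i][j]: index in true_seq where the alignment ending at (i, j) starts
    origin = [list(range(cols)) for _ in range(rows)]
    best = (0, 0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            candidates = [
                (0, j),  # start a fresh alignment after this cell
                (H[i - 1][j - 1] + score(predicted[i - 1], true_seq[j - 1]),
                 origin[i - 1][j - 1]),
                (H[i - 1][j] + gap, origin[i - 1][j]),   # gap in true sequence
                (H[i][j - 1] + gap, origin[i][j - 1]),   # gap in predicted chain
            ]
            H[i][j], origin[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best[0]:
                best = (H[i][j], origin[i][j], j)
    return best

# A fragment predicted from density, with Thr for Val and Asn for Asp.
result = smith_waterman("ATLKNE", "GGGAVLKDEFGGG", density_score)
```

In this toy case the fragment still maps to the stretch "AVLKDE" of the true sequence even though two of its six residues were mis-identified, which is the situation the alignment step is meant to exploit.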

Expectations

Accuracy:

backbone - C-alpha atoms typically within 1A RMS
breaks - due to noise or disordered regions, TEXTAL often outputs multiple chains for a single protein; chain length usually ranges between 10 and 100 amino acids
in/dels - there are occasionally insertions or deletions (extra or missing C-alphas), but this occurs rarely (once every 30 residues or so)
percent built - typically up to 90%, depending on quality of map
amino acid identity
- without sequence alignment - typically 20-50%, due to ambiguities among structurally similar side-chains (e.g. Val and Thr), noise, disorder (surface residues with high B-factor), etc.
- with sequence alignment - typically 50-90% (however, this is very sensitive to the quality of the map, and does not always work reliably on lower-quality maps with a large amount of noise)

Run Time:

It usually takes on the order of a couple of hours to build a complete model for a medium-sized structure (a few hundred amino acids), depending on the speed of your machine. The program also uses a lot of memory, and might start swapping if your machine does not have enough RAM. Building just a backbone (C-alpha chains) is much faster, usually taking only a few minutes.

Assumptions

We generally read and write electron-density maps in XPLOR format, and models in PDB format. We use TER records to separate chains. The occupancy and B-factor fields are often used for other purposes so be careful. However, in the final model, we try to set them to reasonable values.
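
Because chains are separated by TER records, a script that post-processes a TEXTAL model can recover the chains without relying on chain IDs. The following is a minimal sketch under that assumption; the function name `split_chains` and the sample coordinates are made up for illustration, and HETATM and other record types are simply ignored.

```python
def split_chains(pdb_lines):
    """Group ATOM records into chains, starting a new chain at each TER."""
    chains, current = [], []
    for line in pdb_lines:
        record = line[:6].strip()   # PDB record name lives in columns 1-6
        if record == "ATOM":
            current.append(line)
        elif record == "TER" and current:
            chains.append(current)
            current = []
    if current:                     # last chain may lack a trailing TER
        chains.append(current)
    return chains

# Three made-up C-alpha atoms forming two chains.
sample = [
    "ATOM      1  CA  ALA A   1      11.000  12.000  13.000  1.00 20.00",
    "ATOM      2  CA  GLY A   2      14.000  12.500  13.500  1.00 20.00",
    "TER",
    "ATOM      3  CA  SER A  10      30.000  31.000  32.000  1.00 20.00",
]
chains = split_chains(sample)
```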

TEXTAL is optimized for recognizing patterns in 2.8A maps. It is OK if you have higher-resolution data; TEXTAL will automatically truncate your structure factors when generating maps internally for model-building.

TEXTAL only knows how to build peptide structures. If you have other molecules in the crystal, ranging from solvent molecules to ligands and co-factors to nucleic acids, TEXTAL will try its best to interpret them as protein.

It is relatively important to use density-modified phases. Initial phasing is often not sufficient to produce a map that is clean enough to build a model by pattern recognition. However, in our experience, simple things like solvent-flattening go a long way to producing interpretable density. Keep in mind that TEXTAL builds by pattern-recognition, so if you cannot possibly visually interpret something in the density, then TEXTAL probably cannot either.

Currently, TEXTAL does *not* iteratively improve phases by alternating between model-building and combining the experimental phases with phases from the partial model. So don't expect to give poorly phased data to TEXTAL and have it output a complete and refined model. This is a future objective. For now, it just builds what it sees, and thus is dependent on the quality of the density. Again, if you cannot see some disordered region in density, TEXTAL probably cannot either.

How to Use TEXTAL

Right now, the main distribution of TEXTAL is through PHENIX (phenix-online.org). It takes a long time to download and install, but it is worth it. Phenix has a great deal of functionality, including cctbx, Phaser, and Resolve, as well as TEXTAL.

For everything below, you must first source the phenix_env script in the root directory of the PHENIX installation to set up the appropriate environment variables.

A note on reflection file handling:

Either .mtz or .hkl format is acceptable.
Symmetry refers to 6 unit cell params and space group, e.g. "50.2,60.3,70.4,90,90,90,P422". You do not need to specify symmetry for .mtz files, since the parameters are contained in them. You need to give symmetry for .hkl files, though. An alternative is just to give the name of a .inp file from CNS, and it will extract the parameters from that.
Column names for amplitudes, phases, and FOM can often be guessed. If the choice is unambiguous (e.g. if there is only one column of each, or they use obvious names), the code in cctbx can usually pick the right columns for you, so you might be able to get away with leaving them blank.
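
As an illustration of the symmetry string format described above, here is a small sketch that splits it into unit cell parameters and a space group. The function name `parse_symmetry` is hypothetical and this is not how TEXTAL or cctbx parse the option; it only handles the comma-separated form, not the alternative of a CNS .inp file name.

```python
def parse_symmetry(text):
    """Split '50.2,60.3,70.4,90,90,90,P422' into (cell, space_group).

    cell is a tuple (a, b, c, alpha, beta, gamma) of floats.
    """
    fields = [field.strip() for field in text.split(",")]
    if len(fields) != 7:
        raise ValueError("expected 6 unit cell parameters and a space group")
    cell = tuple(float(field) for field in fields[:6])
    return cell, fields[6]

cell, space_group = parse_symmetry("50.2,60.3,70.4,90,90,90,P422")
```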

Using TEXTAL Through the Phenix GUI

Under "Tasks" and "Strategies" there is a single entry for model-building with Textal. This task is designed to do multiple things, depending on what data you give it and what options you select. In particular, you can:

build a backbone only (C-alpha chains) - fast, takes only a few minutes
build side-chains from user-supplied C-alphas - slow, takes on the order of an hour
build complete model (backbone plus side-chains)

When building a complete model, the user has the option of running sequence alignment, and of doing simulated annealing as a post-processing step.

Input may either be a reflection file (in .mtz or .hkl format), or a pre-generated map supplied by the user for a region he or she wishes to build.

No DISPLAY items. Note that in the current implementation of the Textal task/strategy, there are no objects to be displayed as output (i.e. the "magnifying glass" button is non-functional). All output is written to disk as files with a common prefix (e.g. "textal-...").

Using TEXTAL from the Command Line

textal.build - builds a backbone (C-alpha chains), side-chains, or complete model, given a reflection file (or user-supplied XPLOR map).
- This is the main routine! It runs all the steps, from FINDMOL, through CAPRA, LOOKUP, and even post-processing with SA. If you call "textal.build --reflections=my_prot.mtz...", it will build a complete model by default, and run it through SA. If you specify "--capra_only", it will just build the backbone. For more options (e.g. to specify amino acid sequence), see below. Running textal.build generates a number of files in your local directory. The main one to look for at the end is either textal-capra.pdb (C-alpha chains only) or textal-model.pdb (complete model), depending on whether you elect to build side-chains. It also outputs textal-scaled.xplor and textal-trace.pdb for the region built, since those are often useful to look at. The prefix 'textal' can be changed by the user. If you ran it through SA, you might want to look at textal-refine.pdb and textal-refine.log. This command-line program calls the same code that underlies the Textal task in the Phenix GUI. Thus the options are analogous. However, the command-line version prints out a few more messages and generates a few more intermediate files and log files. (Don't bother looking at the log files - they're mostly for developers.)
textal.makemap - make electron-density maps from a reflection file
- makemap calculates electron density maps (in XPLOR format) from reflection files with a variety of options. The default is standard Fourier maps, using amplitudes, phases, and FOM specified by the user. Patterson maps may also be constructed. A resolution (range) may be specified for limiting (by truncation) the structure factors used. Finally, if it is desired for the map to cover a particular molecule, as opposed to the ASU, then the model may be specified. It will determine the borders of the map based on the extremes of the coordinates in the model, plus some buffer distance in each direction. The user may also indicate whether density surrounding the model (e.g. solvent) should be masked to 0.
textal.findmol - finds contiguous molecular region (as an alternative to the ASU).
- Outputs a set of pseudo-atoms representing the trace, but grouped together representing a contiguous molecule. Also outputs a map covering this region (with surrounding density masked to 0). The atoms and region are guaranteed to be "symmetry unique."
textal.scale - similar to "1 sigma" normalization (mean=0, std dev=1)
- Given a map, scale outputs a similar map but with density values scaled in a consistent way (roughly like a "1 sigma" map, with 0 mean). Most of the TEXTAL routines require this scaled version of the user's input map (which may be scaled arbitrarily).
textal.trace - skeletonization, kind of like Bones
textal.join - runs patch and stitch to try to connect chains
- Given a set of C-alpha chains (e.g. from CAPRA), 'join' will attempt to find additional connections between nearby endpoints, possibly representing breaks in longer chains. Join combines two separate methods: patch and stitch. Patch analyzes the scaled electron density map to look for cases where connectivity between nearby endpoints would have been established if the contour threshold (used for tracing) had just been lowered slightly (e.g. from 1.0 to 0.5). Stitch uses a fragment-matching approach to search a database for stretches of amino acids in other known protein structures that match the geometry of C-alphas at the ends of two chains, and can be used to predict missing C-alphas in between. When patch and stitch are complete, join filters out any remaining chains whose length is below some threshold, typically 6 residues, because they are often unreliable and represent noise.
textal.run_sa - runs simulated annealing (calls phenix.refine)
- Not a thorough and robust approach to refinement - just a quick attempt at post-processing the model, with some generic parameter settings. Main purpose is to compute the R-factor to get a sense of the quality of the model built by TEXTAL. You might have to do a more careful job, such as by fine-tuning B-factors (anisotropic? grouped?), dealing with TLS or bulk-solvent, etc. Probably won't fix major geometry problems, but could improve model in minor ways. Makes up its own free set (user can't specify theirs). It currently does only 3 fixed macro-cycles. Outputs textal-refine.pdb and textal-refine.log. If it crashes, just use textal-model.pdb. Issues to consider: which resolution should be used? which phases (probably experimental, rather than density-modified)?
textal.trim_chains - eliminates fragments in one model that don't overlap another
textal.pdb2seq - extracts 1-letter amino acid codes from PDB
textal.variance_map - kind of like solvent-masking
- Given an input XPLOR map, it creates another map covering the same region at the same resolution that contains the local variance in a 5A sphere around each lattice point. Effectively, this distinguishes regions of protein from solvent (try using a contour level of 1.0), even in fairly noisy maps, since protein regions have higher variance than solvent regions. Note, the output map contains continuous values (variance), not booleans (0 and 1, like an actual mask).
textal.linearize_trace - strips off branches from trace
textal.superpose - RMSD-minimizing superposition of one molecule onto another with paired atoms
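
The idea behind textal.variance_map can be sketched in a few lines of plain Python. This is an illustration of "local variance in a sphere around each lattice point", not the program's actual code: the grid is a nested list treated as periodic, the radius is given in grid units rather than Angstroms (a real map would convert 5A using its grid spacing), and the toy density values are invented.

```python
def sphere_offsets(radius):
    """Integer grid offsets (di, dj, dk) within `radius` grid units."""
    r = int(radius)
    return [(di, dj, dk)
            for di in range(-r, r + 1)
            for dj in range(-r, r + 1)
            for dk in range(-r, r + 1)
            if di * di + dj * dj + dk * dk <= radius * radius]

def variance_map(density, radius):
    """Local variance of density[i][j][k] in a sphere around each point."""
    ni, nj, nk = len(density), len(density[0]), len(density[0][0])
    offsets = sphere_offsets(radius)
    out = [[[0.0] * nk for _ in range(nj)] for _ in range(ni)]
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                # Wrap indices so the grid behaves as periodic.
                values = [density[(i + di) % ni][(j + dj) % nj][(k + dk) % nk]
                          for di, dj, dk in offsets]
                mean = sum(values) / len(values)
                out[i][j][k] = sum((v - mean) ** 2 for v in values) / len(values)
    return out

# Toy 12x4x4 grid: flat "solvent" for i < 6, fluctuating "protein" for i >= 6.
density = [[[0.0 if i < 6 else 2.0 * ((i + j + k) % 2)
             for k in range(4)] for j in range(4)] for i in range(12)]
var = variance_map(density, radius=1)
```

Points deep in the flat half get a variance of exactly zero, while points in the fluctuating half get a clearly positive value, so contouring the variance separates the two regions even though the output is continuous rather than a 0/1 mask.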

The command-line script 'textal.build' is intended to mirror what you can do through the Textal task in the Phenix GUI. It contains many options, as follows:

> textal.build
usage: textal.build [options]
 options:
  --reflections=<filename> (default=None)
     .mtz or .hkl file
  --symmetry=<params or filename> (default=None)
     only needed for .hkl files; unit cell params and space group separated by commas, or CNS .inp file
  --amplitudes=<column name> (default=None)
     e.g. FP; optional, will try to guess from file
  --phases=<column name> (default=None)
     e.g. PHIB; optional, will try to guess from file
  --FOM=<column name> (default=None)
     e.g. FOM; optional, will try to guess from file
  --resolution=<number or range> (default=2.8)
     for truncating SFs to make maps; best to leave set at 2.8, since that is optimal for Textal; can also give range like 2.8-20.0
  --threshold=<number> (default=1.0)
     contour threshold; affects connectivity
  --min_chain_len=<integer> (default=6)
     shorter chains are filtered out of model
  --input_map=<filename> (default=None)
     alternative to giving reflection file
  --input_model=<filename> (default=None)
     for user-defined C-alpha atoms
  --sequence=<filename> (default=None)
     single-letter amino-acid codes
  --se_sites=<filename> (default=None)
     selenium sites, if known
  --copies=<integer> (default=1)
     expected number of NCS symmetry copies in ASU
  --prefix=<string> (default=textal)
     base name for output files
  --capra_only
     build C-alpha chains only, without modeling side-chains (faster)
  --preserve_identities
     force LOOKUP to use amino acid types in user's input model instead of predicting side-chain identities based on density
  --asu
     skip Findmol; build model in map of ASU
  --no_sa
     skip simulated annealing after model-building
  --sa_only
     just run simulated annealing on input model

Help on command-line programs can be accessed by typing "textal.help" on the command line. Also, most programs follow the convention that if you run them without any arguments, they will output a usage statement that describes what arguments and options they take.

Examples:

// with no args, prints out help/usage like above, as do all textal programs

textal.build 

// textal can guess the ampl/phase columns because they are unambiguous
// outputs textal-refine.pdb after simulated annealing

textal.build --reflections=if5a.mtz --sequence=if5a.seq

// builds only C-alpha chains
// must supply symmetry info (unit cell params, space group), which are in CNS .inp file

textal.build --reflections=a2u-globulin.hkl --symmetry=a2u-globulin.inp --capra_only

// must give column names here, since there are multiple choices in this mtz
// also note: 2 copies of molecules expected in ASU

textal.build --reflections=czra-ref.mtz --amplitudes=FOBS --phases=PHASE --FOM=FOM --sequence=czra.seq --copies=2

// skips FINDMOL; builds backbone model in ASU without re-centering
// also changes prefix for output files to be "new_run-trace.pdb", etc. instead of "textal-trace.pdb", etc. by default

textal.build --reflections=if5a.mtz --capra_only --asu --prefix=new_run

// builds and refines side-chains for user-supplied backbone

textal.build --reflections=mvk-dm.mtz --amplitudes=FP --phases=PHIDM --FOM=FOMDM --sequence=mvk.seq --input_model=my-backbone-model.pdb
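
If you want to launch runs like these from a script, one option is to assemble the argument list in Python. The helper below is a hypothetical convenience, not part of TEXTAL; it only uses option names from the usage listing above, covers a subset of them, and does not run anything itself.

```python
def textal_build_command(reflections, symmetry=None, sequence=None,
                         copies=1, prefix="textal", capra_only=False,
                         asu=False, no_sa=False):
    """Assemble an argv list for textal.build from documented options."""
    argv = ["textal.build", "--reflections=%s" % reflections]
    if symmetry is not None:
        argv.append("--symmetry=%s" % symmetry)
    if sequence is not None:
        argv.append("--sequence=%s" % sequence)
    if copies != 1:                 # 1 is the documented default
        argv.append("--copies=%d" % copies)
    if prefix != "textal":          # "textal" is the documented default
        argv.append("--prefix=%s" % prefix)
    if capra_only:
        argv.append("--capra_only")
    if asu:
        argv.append("--asu")
    if no_sa:
        argv.append("--no_sa")
    return argv

# Reproduces the fourth example above as an argument list.
argv = textal_build_command("if5a.mtz", capra_only=True, asu=True,
                            prefix="new_run")
```

The resulting list could then be passed to `subprocess.run(argv, check=True)` from a shell in which the phenix_env script has already been sourced.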

Calling TEXTAL from Python Scripts

Basically, importing the textal.pytex module will give you access to many of the functions. For example, to scale a map, you could do:

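from textal.pytex import *
# Load the map, scale it, and write the scaled copy back to disk.
density_map = emap(file_name="my_map.xplor")  # avoids shadowing the builtin map()
(scaled_map, log) = scale(density_map)        # scale returns (output object, log text)
scaled_map.write("my_map_scaled.xplor")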

Note that in this case, as with many of the functions in pytex, a tuple is returned that contains the output object and a log with some (rather arbitrary and uninterpretable) text.
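
For intuition about what scaling a map means, here is a plain "1 sigma" normalization written with the standard library only. It is a simplified stand-in: it works on a flat list of density values, the name `one_sigma_scale` is made up, and it does not attempt the solvent-content insensitivity that TEXTAL's own scaling routine aims for.

```python
import statistics

def one_sigma_scale(values):
    """Shift and scale density values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values, mu=mean)   # population std dev
    if sigma == 0:
        raise ValueError("map is flat; cannot scale")
    return [(v - mean) / sigma for v in values]

# Made-up density values on an arbitrary scale.
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = one_sigma_scale(raw)
```

After this transformation a contour level of 1.0 means "one standard deviation above the mean" regardless of the scale of the input map.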

Help on pytex functions can be accessed by typing "textal.pytex_help" on the command line.

References

The main reference for Textal is:

Ioerger, T.R. and Sacchettini, J.C. (2003). The TEXTAL system: Artificial Intelligence techniques for automated protein model building. Methods in Enzymology, 374:244-270.

More references may be found here.

Etc.

More documentation may be found at textal.tamu.edu

You may contact us by sending email to: [email protected]