A Molecular Replacement version of TEXTAL

command-line: textal.build_mr
Phenix GUI task: MR_Build (listed under "textal")

This is a customized version of TEXTAL for molecular replacement (MR) applications. If you have phased your dataset by molecular replacement, using phases calculated from a homologous model, you can give TEXTAL the coordinates of the transformed MR solution along with an alignment to the sequence of your new protein, and it will automatically build the new model with the correct amino acids substituted in.

Of course, the quality of the model built depends on the quality of the phases, and that depends on the degree of homology. It is also important to supply an alignment that is as accurate as possible; errors in the alignment (e.g. gap placement) will produce errors in the model (incorrect amino acid side-chains).

The program currently makes only a limited attempt to build divergent regions (e.g. loops that differ between the two structures) and to bridge gaps between broken chains. At the moment, it only attempts to bridge gaps of 5 residues or fewer. (Methods for handling longer gaps will be added in the future.) It applies the following three strategies in sequence:

First, it attempts to fill in gaps between the ends of chains by looking for existing candidate C-alpha atoms from the CAPRA backbone analysis (i.e. where connected density exists between the chains, but the path simply differs from the homologous model).
Second, it tries to bridge the gap using a fragment library (4188 unique 9-mers chosen from a database of 238 non-homologous structures using a minimum RMS difference of 1.25 A). The ends of each fragment are superposed on the ends of the existing chains, and the fragment contains the number of intervening C-alphas expected from the size of the gap. Fragments are selected first by RMS fit to the ends of the existing chains, and then ranked by the fit of the intervening C-alphas to the density (maximizing the sum of density at the coordinates of the inserted atoms).
Third, it runs the standard patch and stitch routines in Textal. This last step is more indiscriminate: whereas the first two strategies look for a specific number of C-alpha atoms with specific amino-acid identities to fill in the gap, the third strategy just tries to connect chains wherever it can, based only on the appearance of the density. It may therefore create a short-cut (e.g. linking i to i+2 directly) or insert extra residues that are not in the actual sequence. However, this is sometimes a necessary recourse when no path with exactly the expected number of intervening C-alphas can be found between the ends of two chains, but only one with a few more or a few fewer (for example, due to an alignment error).
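
The selection logic of the second strategy can be illustrated with a minimal sketch. It is not Textal's implementation: it assumes the fragments have already been superposed onto the chain ends, it measures the end fit on a single terminal C-alpha per side, and it samples density through a caller-supplied function. The names rank_fragments, end_rms, density_at and the max_rms cutoff are hypothetical.

```python
import math

def end_rms(fragment, start_anchor, end_anchor):
    """RMS distance between the fragment's terminal C-alphas and the two chain ends."""
    d2 = (math.dist(fragment[0], start_anchor) ** 2
          + math.dist(fragment[-1], end_anchor) ** 2)
    return math.sqrt(d2 / 2.0)

def rank_fragments(fragments, start_anchor, end_anchor, density_at, max_rms=1.0):
    """Keep fragments whose ends fit the anchors, then sort them by the summed
    density at the intervening C-alpha positions (best first).

    fragments  : list of fragments, each a list of (x, y, z) C-alpha coordinates
    density_at : function (x, y, z) tuple -> density value (assumed, e.g. map interpolation)
    max_rms    : hypothetical cutoff for the end fit, in Angstroms
    """
    fitting = [f for f in fragments
               if end_rms(f, start_anchor, end_anchor) <= max_rms]

    def score(fragment):
        # Only the inserted atoms count; the terminal atoms overlap existing chains.
        return sum(density_at(xyz) for xyz in fragment[1:-1])

    return sorted(fitting, key=score, reverse=True)
```

Given a toy density that peaks along one axis, a fragment whose inserted atom lies on that axis ranks ahead of a bent one, and a fragment whose ends are far from the anchors is dropped before density is considered at all.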

While it would seem undesirable to introduce 'fictional' residues into the given sequence just to patch up gaps, we find that this is often helpful for compensating for errors in the sequence alignment between the protein and the homology model (such errors are common, especially in the precise placement of gaps). The use of this third step is controlled by the --connectivity option, which has two possible values: 'conservative' and 'aggressive' (the default). With --connectivity=aggressive (which need not be specified explicitly), the program runs patch/stitch and does a better job of connecting chains, though you are advised to inspect and verify the residues in these regions, which are listed in the program output/log. To turn this step off, run textal.build_mr with the option --connectivity=conservative.

The residues in the output model will be numbered from 0 to N-1, where N is the number of amino acids (non-gap characters) in the sequence of the protein given in the alignment file.
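
As an illustration of this numbering convention, the following sketch walks a pair of aligned sequences and assigns 0-based numbers to the non-gap characters of the new protein. It is only meant to show how positions in the alignment relate to residue numbers in the output; the function name number_target_residues and its return values are hypothetical, and how Textal performs this step internally is not described here.

```python
def number_target_residues(target_aln, model_aln):
    """Relate aligned columns to 0-based residue numbers of the new protein.

    target_aln : aligned sequence of the protein being built ('-' for gaps)
    model_aln  : aligned sequence of the homology model ('-' for gaps)

    Returns (mapping, unmodelled, n):
      mapping    : one entry per model residue, either (number, amino_acid)
                   of the aligned target residue or None if aligned to a gap
      unmodelled : (number, amino_acid) for target residues with no model residue
      n          : total number of target residues (numbers run 0..n-1)
    """
    if len(target_aln) != len(model_aln):
        raise ValueError("aligned sequences must have equal length")
    mapping, unmodelled = [], []
    number = 0  # next residue number in the new protein
    for t_aa, m_aa in zip(target_aln, model_aln):
        if t_aa == "-" and m_aa == "-":
            continue  # empty column, nothing to assign
        if m_aa != "-":
            mapping.append((number, t_aa) if t_aa != "-" else None)
        else:
            unmodelled.append((number, t_aa))
        if t_aa != "-":
            number += 1
    return mapping, unmodelled, number
```

Runs of consecutive entries in the second returned list correspond to the kind of gaps the program tries to bridge between chains.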

Usage

The procedure is straightforward to use. From the command line, the command textal.build_mr (as is the usual convention for Textal programs, typing this command without any arguments invokes help) takes a reflection file, a model, and an alignment, along with some options for specifying symmetry (--symmetry) and column names (--amplitudes, --phases, --FOM). As usual, most of the options have reasonable defaults or can be inferred from the input data, but they can be specified explicitly if there is any ambiguity.

> textal.build_mr --symmetry=human-otc.inp human-otc.hkl a1s_mr_solution.pdb alignment
making map covering model
scaling map
  output: human-otc-mr-scaled.xplor
tracing map
  output: human-otc-mr-trace.pdb
running Capra
  output: human-otc-mr-chains.pdb
  output: human-otc-mr-waypts.pdb
  output: human-otc-mr-links.pdb
using molecular-replacement model to identify side-chains

gap between chains 0 (end=LYS-4) and 1 (start=ARG-6): G
failed to construct bridge via candidate CA's
failed to construct bridge via fragment matching
failed to connect

gap between chains 1 (end=ARG-32) and 2 (start=LYS-34): I
constructed bridge via candidate CAs
#  ATOM     17  CA  ILE    33      30.350  44.389  52.411  1.81  0.38

gap between chains 2 (end=ASP-81) and 3 (start=VAL-86): IHLG
constructed bridge via candidate CAs
#  ATOM   1352  CA  ILE    82      56.850  64.389  49.911  1.31  0.87
#  ATOM   1357  CA  HIS    83      56.350  67.889  48.911  1.94  2.16
#  ATOM   1688  CA  LEU    84      55.850  70.389  50.411  1.50  1.76
#  ATOM   1693  CA  GLY    85      56.350  73.389  51.411  1.96  2.37
...
  output: human-otc-mr-aa.pdb
patching small breaks
  output: human-otc-mr-patch.pdb
  output: human-otc-mr-patch.log
the following residues were connected by 'patch' and should be inspected:
  4..6
  109..111
  145..147
  178..180
  261..263
  (an attempt was made to bridge these gaps, but they are not necessarily consistent with the alignment)
refining C-alpha coords
  output: human-otc-mr-capra.pdb
summary: #chains=3, #CAs=294, expected=310, percent built=94.0%
building side-chains
original_clookup_main: Amino acid corrections will be applied.
original_clookup_main:  Partial assignments detected.  Chain direction override disabled!
original_clookup_main: Lookup is  3% complete, estimate 184m10s remaining for 265 residues...
original_clookup_main: Lookup is  7% complete, estimate 120m03s remaining for 255 residues...
...
original_clookup_main: Lookup is 98% complete, estimate 1m17s remaining for 5 residues...
original_clookup_main: Lookup took 71m13s for the first phase.
  output: human-otc-mr-lookup.pdb
applying real-space refinement
  output: human-otc-mr-model.pdb
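
The residue ranges flagged by 'patch' in a log like the one above can be collected for inspection with a short script. This is a sketch that assumes the log text looks exactly like the example (a notice line containing "connected by 'patch'" followed by lines of the form first..last); the function name patched_ranges is hypothetical, and the log format may differ between versions.

```python
import re

def patched_ranges(log_text):
    """Return (first, last) residue-number pairs listed after the 'patch' notice."""
    ranges = []
    in_block = False
    for line in log_text.splitlines():
        if "connected by 'patch'" in line:
            in_block = True   # the range lines start on the next line
            continue
        if in_block:
            match = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", line)
            if match:
                ranges.append((int(match.group(1)), int(match.group(2))))
            else:
                in_block = False  # first non-range line ends the list
    return ranges
```

The resulting pairs can then be used to jump to the corresponding residues of the output model in a graphics program.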

If you want to run a quick test (e.g. 5 minutes vs. an hour), you can give the program a flag '-c' to build backbone only (C-alpha chains) but not the side chains. This will allow you to see what parts of the model will be built and to inspect the assignment of amino acid identities (residue names of C-alpha atoms in <prot>-mr-capra.pdb).

Variations:

// as above...

> textal.build_mr --symmetry=human-otc.inp human-otc.hkl a1s_mr_solution.pdb alignment

// build backbone (C-alpha chains) only (much faster)

> textal.build_mr -c --symmetry=human-otc.inp human-otc.hkl a1s_mr_solution.pdb alignment

// specify specific columns to use

> textal.build_mr --amplitudes=FULL_MOD --phases=PA_MOD --FOM=FOM_MOD --symmetry=human-otc.inp human-otc.hkl a1s_mr_solution.pdb alignment

// suppress running patch, if you are sure your alignment is accurate (output may be more disconnected but more veridical)

> textal.build_mr --connectivity=conservative --symmetry=human-otc.inp human-otc.hkl a1s_mr_solution.pdb alignment
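
When running many datasets, the variations above can be assembled from a script. The sketch below only builds the argument list, using the flags shown in this document (-c, --symmetry, --amplitudes, --phases, --FOM, --connectivity); the helper name build_mr_command and its keyword names are hypothetical, and the order in which options are emitted is an arbitrary choice.

```python
def build_mr_command(hkl, model, alignment, symmetry=None, backbone_only=False,
                     connectivity=None, amplitudes=None, phases=None, fom=None):
    """Assemble a textal.build_mr argument list from the options described above."""
    if connectivity not in (None, "conservative", "aggressive"):
        raise ValueError("connectivity must be 'conservative' or 'aggressive'")
    cmd = ["textal.build_mr"]
    if backbone_only:
        cmd.append("-c")  # build C-alpha chains only (much faster)
    options = (("--amplitudes", amplitudes), ("--phases", phases), ("--FOM", fom),
               ("--connectivity", connectivity), ("--symmetry", symmetry))
    for flag, value in options:
        if value is not None:  # unspecified options are left to the program's defaults
            cmd.append(f"{flag}={value}")
    cmd += [hkl, model, alignment]
    return cmd
```

The returned list can be printed for checking, or passed to subprocess.run(cmd, check=True) on a machine where textal.build_mr is installed.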

There is also a new task for the Phenix GUI (listed under tasks/textal/MR_Build.py) that serves as an interface to the program, with inputs that parallel those for the command line program.

Format of the Alignment File

(This is a bit clunky right now, and may be simplified in the future...) The alignment file is currently expected to contain two lines: the first for the sequence of the protein being built, and the second for the homology model. Each sequence should be in one-letter codes, all on one line, with no other characters. Gaps should be indicated by a '-'.

There must be exactly one amino acid in the sequence for each residue in the homology model (MR solution structure). If not, the program will complain. Note that this means removing amino acids from the normal sequence of the homology protein if they are absent from the model (such as a disordered terminus or loop region). You can extract the sequences using textal.pdb2seq, but trim off the header and any non-peptide characters ('X', e.g. for water or ligands).

If there are multiple chains in the structure, then the alignment file should provide a pair of lines for each, with the sequence of the new protein first and the sequence of the chain in the homology model second. Thus, if there are N chains in the homology model (even if they are NCS symmetry copies or otherwise identical, as in homo-oligomers), then there should be 2N lines in the alignment file.
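
Before launching a long run, the rules above can be checked with a few lines of Python. This is a sketch rather than the program's own validation: the function name check_alignment_lines is hypothetical, it assumes only the 20 standard one-letter codes plus '-' are allowed, and it ignores blank lines (which the program itself may not tolerate). The optional model_chain_lengths argument is the number of residues in each chain of the MR solution, counted by whatever means you prefer.

```python
VALID = set("ACDEFGHIKLMNPQRSTVWY-")  # assumed alphabet: 20 standard residues plus gap

def check_alignment_lines(lines, model_chain_lengths=None):
    """Return a list of problems found in the lines of an alignment file."""
    problems = []
    rows = [ln.strip() for ln in lines if ln.strip()]
    if not rows or len(rows) % 2:
        problems.append(f"expected an even, non-zero number of lines, got {len(rows)}")
    for n, row in enumerate(rows, start=1):
        bad = sorted(set(row) - VALID)
        if bad:
            problems.append(f"line {n}: unexpected characters {bad}")
    n_pairs = len(rows) // 2
    for i in range(n_pairs):
        target, model = rows[2 * i], rows[2 * i + 1]  # new protein first, model second
        if len(target) != len(model):
            problems.append(
                f"pair {i + 1}: aligned lengths differ ({len(target)} vs {len(model)})")
        if model_chain_lengths is not None and i < len(model_chain_lengths):
            n_model = len(model.replace("-", ""))
            if n_model != model_chain_lengths[i]:
                problems.append(
                    f"pair {i + 1}: {n_model} residues in alignment, "
                    f"{model_chain_lengths[i]} in model chain")
    if model_chain_lengths is not None and len(model_chain_lengths) != n_pairs:
        problems.append(
            f"{n_pairs} pairs of lines for {len(model_chain_lengths)} model chains")
    return problems
```

An empty list means none of these particular checks failed; it says nothing about whether the alignment itself is biologically sensible.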

To generate alignments, I suggest using a web server that runs Bill Pearson's LALIGN. Be sure to set the 'global' alignment option, to force it to include all residues and not drop any at the ends. Note that for 'real' runs of this program, I recommend paying careful attention to parameters such as the gap penalties and substitution matrix, to generate as accurate an alignment as possible, since the quality of the resulting model depends in part on getting the alignment right.

Here is an example of the alignment file for nitrite-reduct with kbv, from the internal Phenix database of test structures (use nitrite-reduct/model/kbv_mr_solution.pdb as the input model to textal.build_mr). For this example, CE was used to make a structural alignment, which removes the uncertainty about the alignment itself that is inherent in purely sequence-based methods like LALIGN:

ATAAEIAALPRQKVELVDPPFVHAHSQVAEGGPKVVEFTMVIEEKKIVIDDAGTEVHAMAFNGTVPGPLMVVHQDDYLELTLINPETNTLMHNIDFHAATGALGGGGLTEINPGEKTILRFKATKPGVFVYHCAPPGMVPWHVVSGMNGAIMVLPREGLHDGKGKALTYDKIYYVGEQDFYVPRDENGKYKKYEAPGDAYED---------TVKVMRTLTPTHVVFNGAVGALTGDKAMTAAVGEKVLIVHSQAN--RDTRPDLIGGHGDYVWATGKFNTPPDVDQETWFIPGGAAGAAFYTFQQPGIYAYVNHNLIEAFELGAAAHFKVTGEWNDDLMTSVLAPSGTIE
-------ELPVIDAVTTHAPEVPPAI--DRDYPAKVRVKMETVEKTMKMDD-GVEYRYWTFDGDVPGRMIRVREGDTVEVEFSNNPSSTVPHNVDFHAATGQGGGAAATFTAPGRTSTFSFKALQPGLYIYHCAV-APVGMHIANGMYGLILVEPKEGL-------PKVDKEFYIVQGDFYTKG-----------------KKGAQGLQPFDMDKAVAEQPEYVVFNGHVGALTGDNALKAKAGETVRMYVGNGGPNLVSSFHVIGEIFDKVYVEGG--KLINENVQSTIVPAGGSAIVEFKVDIPGNYTLVDHSIFRAFNKGALGQLKVEGAENPEIM-----------

Here is an example of the alignment file for human-otc (with a1s):

-VQLKGRDLLTLKNFTGEEIKYMLWLSADLKFRIKQKGEYLPLLQGKSLGMIFEKRSTRTRLSTETGFALLGGHPCFLTTQDIHLGVNESLTDTARVLSSMADAVLARVYKQSDLDTLAKEASIPIINGLSDLYHPIQILADYLTLQEHYSSLKGLTLSWIGDGNNILHSIMMSAAKFGMHLQAATPKGYEPDASVTKLAEQYAKENGTKLLLTNDPLEAAHGGNVLITDTWISMGREEEKKKRLQAFQGYQVTMKTAKVAASDWTFLHCLPRKP-EEVDDEVFYSPRSLVFPEAENRKWTIMAVMVSLLTD--
VVSLAGRDLLCLQDYTAEEIWTILETAKMFKIW-QKIGKPHRLLEGKTLAMIFQKPSTRTRVSFEVAMAHLGGHALYLNAQDLQLRRGETIADTARVLSRYVDAIMARVYDHKDVEDLAKYATVPVINGLSDFSHPCQALADYMTIWEKKGTIKGVKVVYVGDGNNVAHSLMIAGTKLGADVVVATPEGYEPDEKVIKWAEQNAAESGGSFELLHDPVKAVKDADVIYTDVWASMGQEAEAEERRKIFRPFQVNKDLVKHAKPDYMFMHCLPAHRGEEVTDDVIDSPNSVVWDQAENRLHAQKAVLALVMGGIK

Here is an example of the alignment file for a2u-globulin (with mup); note that there are 4 chains (homotetramer in the ASU), hence 4 pairs of lines:

EEASSTRGNLDVAKLNGDWFSIVVASNKREKIEENGSMRVFMQHIDVLENSLGFKFRIKENGECRELYLVAYKTPEDGEYFVEYDGGNTFTILKTDYDRYVMFHLINFKNGETFQLMVLYGRTKDLSSDIKEKFAKLCEAHGITRDNIIDLTKTDRCL
EEASSTGRNFNVEKINGEWHTIILASDKREKIEDNGNFRLFLEQIHVLEKSLVLKFHTVRDEECSELSMVADKTEKAGEYSVTYDGFNTFTIPKTDYDNFLMAHLINEKDGETFQLMGLYGREPDLSSDIKERFAQLCEEHGILRENIIDLSNANRC-
EEASSTRGNLDVAKLNGDWFSIVVASNKREKIEENGSMRVFMQHIDVLENSLGFKFRIKENGECRELYLVAYKTPEDGEYFVEYDGGNTFTILKTDYDRYVMFHLINFKNGETFQLMVLYGRTKDLSSDIKEKFAKLCEAHGITRDNIIDLTKTDRCL
EEASSTGRNFNVEKINGEWHTIILASDKREKIEDNGNFRLFLEQIHVLEKSLVLKFHTVRDEECSELSMVADKTEKAGEYSVTYDGFNTFTIPKTDYDNFLMAHLINEKDGETFQLMGLYGREPDLSSDIKERFAQLCEEHGILRENIIDLSNANRC-
EEASSTRGNLDVAKLNGDWFSIVVASNKREKIEENGSMRVFMQHIDVLENSLGFKFRIKENGECRELYLVAYKTPEDGEYFVEYDGGNTFTILKTDYDRYVMFHLINFKNGETFQLMVLYGRTKDLSSDIKEKFAKLCEAHGITRDNIIDLTKTDRCL
EEASSTGRNFNVEKINGEWHTIILASDKREKIEDNGNFRLFLEQIHVLEKSLVLKFHTVRDEECSELSMVADKTEKAGEYSVTYDGFNTFTIPKTDYDNFLMAHLINEKDGETFQLMGLYGREPDLSSDIKERFAQLCEEHGILRENIIDLSNANRC-
EEASSTRGNLDVAKLNGDWFSIVVASNKREKIEENGSMRVFMQHIDVLENSLGFKFRIKENGECRELYLVAYKTPEDGEYFVEYDGGNTFTILKTDYDRYVMFHLINFKNGETFQLMVLYGRTKDLSSDIKEKFAKLCEAHGITRDNIIDLTKTDRCL
EEASSTGRNFNVEKINGEWHTIILASDKREKIEDNGNFRLFLEQIHVLEKSLVLKFHTVRDEECSELSMVADKTEKAGEYSVTYDGFNTFTIPKTDYDNFLMAHLINEKDGETFQLMGLYGREPDLSSDIKERFAQLCEEHGILRENIIDLSNANRC-