Python-based Hierarchical ENvironment for Integrated Xtallography

Tutorial 3: Solving a structure with MIR data

Introduction
Setting up to run PHENIX
Running the demo rh-dehalogenase data with AutoSol
Where are my files?
What parameters did I use?
Reading the log files for your AutoSol run file
Summary of the command-line arguments
Reading the datafiles.
ImportRawData.
Guessing cell contents
Running phenix.xtriage
Testing for anisotropy in the data
Scaling MIR data
Running HYSS to find the heavy-atom substructure
Finding the hand and scoring heavy-atom solutions
Finding origin shifts between heavy-atom solutions for different derivatives and combining phases
Finding additional sites by density modification and heavy-atom difference Fouriers
Final phasing with SOLVE
Statistical density modification with RESOLVE
Generation of FreeR flags
Model-building with RESOLVE
The AutoSol_summary.dat summary file
How do I know if I have a good solution?
What to do next
Additional information

Introduction

This tutorial will use some very good MIR data (Native and 5 derivatives from a rh-dehalogenase protein MIR dataset analyzed at 2.8 A) as an example of how to solve a MIR dataset with AutoSol. It is designed to be read all the way through, giving pointers for you along the way. Once you have read it all and run the example data and looked at the output files, you will be in a good position to run your own data through AutoSol.

Setting up to run PHENIX

If PHENIX is already installed and your environment is all set, then if you type:

echo $PHENIX
then you should get back something like this:
/xtal/phenix-1.3
If instead you get:
PHENIX: undefined variable
then you need to set up your PHENIX environment. See the PHENIX installation page for details of how to do this. If you are using the C-shell environment (csh) then all you will need to do is add one line to your .cshrc (or equivalent) file that looks like this:
source /xtal/phenix-1.3/phenix_env
(except that the path in this statement will be where your PHENIX is installed). Then the next time you log in $PHENIX will be defined.

Running the demo rh-dehalogenase data with AutoSol

To run AutoSol on the demo rh-dehalogenase data, make yourself a tutorials directory and cd into that directory:

mkdir tutorials
cd tutorials 
Now type the phenix command:
phenix.run_example --help 
to list the available examples. Choosing rh-dehalogenase-mir for this tutorial, you can now use the phenix command:
phenix.run_example rh-dehalogenase-mir 
to solve the rh-dehalogenase structure with AutoSol. This command will copy the directory $PHENIX/examples/rh-dehalogenase-mir to your current directory (tutorials) and call it tutorials/rh-dehalogenase-mir/ . Then it will run AutoSol using the command file run.csh that is present in this tutorials/rh-dehalogenase-mir/ directory. Running an MIR dataset is a little different from running a MAD, SAD, or SIR dataset because you cannot use the standard command-line control for MIR. Instead you have to run a script. It is not hard, just different. (You can run those other experiment types from a script too; it is just even easier to run them from the command line.) This command file run.csh is simple. It says:
#!/bin/csh
echo "Running AutoSol on rhodococcus dehalogenase data..."
echo "NOTE: command-line not available for MIR..using script instead"
phenix.runWizard AutoSol Facts.list
The first line (#!/bin/csh) tells the system to interpret the remainder of the text in the file using the C-shell (csh). The two echo lines simply print messages. The last line, phenix.runWizard AutoSol Facts.list, runs the AutoSol Wizard and uses the contents of the file Facts.list as parameters (see Automated Structure Solution using AutoSol for all the details about AutoSol including a full list of keywords). Now let's look at the Facts.list file. Here is the first relevant part of the file:

sequence_file sequence.dat
thoroughness thorough
cell 93.796  79.849  43.108  90.000  90.000  90.00   # cell params
resolution 2.8                             #  Resolution

expt_type       sir                        # MIR dataset is set of SIR datasets

input_file_list  rt_rd_1.sca auki_rd_1.sca # list of input .sca files
                                           #  Native deriv 1

nat_der_list    Native  Au                 # identify files in input_file_list
                                           # as Native or the heavy-atom name
                                           # such as se.

inano_list      noinano inano              # inano/noinano/anoonly: identify
                                           # if ano diffs to be used for derivs

n_ha_list       0    5                     # number of heavy-atoms for each
                                           # file for mir/sir  (0 for native)

This part of the script tells AutoSol the resolution, the data files for the first native-derivative combination, and the heavy atoms for these files (Native and Au). It also specifies whether anomalous differences are to be included for each file (noinano for the Native means do not include them; inano for the Au derivative means do include them) and the number of heavy atoms in each file (0 for the Native, 5 for the derivative).

Note that this first native-derivative combination in this MIR dataset is being treated as an SIRAS dataset. This is the way the AutoSol Wizard works for MIR. The individual derivatives are all solved separately (except that difference Fouriers can be used to locate sites in one derivative using phases from a solution for another). Then, when all are finished, all the SIR or SIRAS datasets are phased together with SOLVE Bayesian correlated phasing. This approach works well because a substructure determination is done separately for each derivative, and if any one of them works well, then all the derivatives can be solved.

This part of the script also tells AutoSol to use defaults for a thorough analysis (thoroughness thorough). Usually this is the best choice for MIR, while for SAD and MAD experiments a quick analysis is fine. The MIR script then continues with data for the second, third, and later derivatives. These parts of the script all look like this:
############## NEW DATASET ################
run_list        start                      # run "start" method.
                                           # read in datafiles for this dataset

run_list        read_another_dataset       # starting a new dataset here

input_file_list  rt_rd_1.sca hgki_rd_1.sca # list of input .sca files
                                           #  Native deriv 1

nat_der_list    Native Hg                  # identify files in input_file_list
                                           # as Native or the heavy-atom name
                                           # such as se.

inano_list      noinano inano              # inano/noinano/anoonly: identify
                                           # if ano diffs to be used for derivs

n_ha_list       0    5                     # number of heavy-atoms for each
                                           # file for mir/sir  (0 for native)
Here the run_list start line is a command to AutoSol. It means "run the following list of AutoSol methods: start". So the AutoSol Wizard runs the "start" method and stops. This basically reads in the datafiles from the previous dataset. The next line says to read another dataset. Now we are ready to provide the data for the second native-derivative combination, again as an SIR dataset. We provide the same native as before (although we don't have to) and a new derivative, this time an Hg derivative, again with anomalous data. This procedure is repeated for each derivative. The AutoSol Wizard will then scale all the datasets and find heavy-atom solutions for some of them by direct methods, then use difference Fouriers to find the solutions for the others.

Although the phenix.run_example rh-dehalogenase-mir command has just run AutoSol from a script (run.csh), you can run AutoSol yourself from this script with the same phenix.runWizard AutoSol Facts.list command. You can also run AutoSol from a GUI. All these possibilities are described in Running a Wizard from a GUI, the command-line, or a script.
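To keep track of what a Facts.list file contains, it can help to read it programmatically. The following is a minimal Python sketch, not part of PHENIX. It assumes the layout seen in the excerpts above: a keyword followed by whitespace-separated values, with # starting a comment. The function name parse_facts is hypothetical. Results are returned as an ordered list because keywords such as run_list and input_file_list appear once per dataset.

```python
def parse_facts(text):
    """Parse Facts.list-style text into an ordered list of (keyword, values).

    Assumed layout (inferred from the tutorial excerpts, not from PHENIX
    source): 'keyword value value ...' with '#' starting a comment.
    """
    facts = []
    for raw_line in text.splitlines():
        # Drop everything after '#', then strip surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only continuation line
        keyword, *values = line.split()
        facts.append((keyword, values))
    return facts


example = """\
expt_type       sir                        # MIR dataset is set of SIR datasets
input_file_list  rt_rd_1.sca auki_rd_1.sca # list of input .sca files
                                           #  Native deriv 1
nat_der_list    Native  Au                 # identify files in input_file_list
n_ha_list       0    5                     # number of heavy-atoms for each
run_list        start                      # run "start" method.
"""

facts = parse_facts(example)
for keyword, values in facts:
    print(keyword, values)
```

All values are kept as strings; converting n_ha_list entries to integers or cell parameters to floats is left to the caller.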

Where are my files?

Once you have started AutoSol or another Wizard, an output directory will be created in your current (working) directory. The first time you run AutoSol in this directory, this output directory will be called AutoSol_run_1_ (or AutoSol_run_1_/, where the slash at the end just indicates that this is a directory). All of the output from run 1 of AutoSol will be in this directory. If you run AutoSol again, a new subdirectory called AutoSol_run_2_ will be created. Inside the directory AutoSol_run_1_ there will be one or more temporary directories such as TEMP0 created while the Wizard is running. The files in this temporary directory may be useful sometimes in figuring out what the Wizard is doing (or not doing!). By default these directories are emptied when the Wizard finishes (but you can keep their contents with the command clean_up=False if you want.)
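If you have run AutoSol several times, a short script can locate the most recent output directory. This is a sketch that assumes only the AutoSol_run_N_ naming convention described above; the helper names run_number and latest_run_dir are hypothetical.

```python
import re
from pathlib import Path

# Matches directory names of the form AutoSol_run_N_ described above.
RUN_DIR_PATTERN = re.compile(r"^AutoSol_run_(\d+)_$")


def run_number(name):
    """Return N for a name like 'AutoSol_run_N_', or None if it does not match."""
    match = RUN_DIR_PATTERN.match(name)
    return int(match.group(1)) if match else None


def latest_run_dir(working_dir="."):
    """Return the highest-numbered AutoSol_run_N_ directory, or None if absent."""
    candidates = [
        (run_number(path.name), path)
        for path in Path(working_dir).iterdir()
        if path.is_dir() and run_number(path.name) is not None
    ]
    if not candidates:
        return None
    # Compare by run number so that run 10 sorts after run 2.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(latest_run_dir("."))
```

Sorting on the integer run number rather than the directory name avoids treating AutoSol_run_10_ as older than AutoSol_run_2_.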

What parameters did I use?

When the AutoSol Wizard runs from a script it does not write out a parameters file. The parameters from your Facts.list are echoed in the AutoSol log file, but otherwise Facts.list itself is your record of the parameters that were used.

Reading the log files for your AutoSol run file

While the AutoSol wizard is running, there are several places you can look to see what is going on. The most important one is the overall log file for the AutoSol run. This log file is located in:

AutoSol_run_1_/AutoSol_run_1_1.log
for run 1 of AutoSol. (The second 1 in this log file name will be incremented if you stop this run in the middle and restart it with a command like phenix.autosol run=1). The AutoSol_run_1_1.log file is a running summary of what the AutoSol Wizard is doing. Here are a few of the key sections of the log files produced for the rh-dehalogenase MIR dataset.

Summary of the command-line arguments

Near the top of the log file you will find:

READING FACTS FROM Facts.list
NEW FACT from Facts.list :
cell [93.796000000000006, 79.849000000000004, 43.107999999999997, 90.0, 90.0, 90.0]
NEW FACT from Facts.list :resolution 2.8
NEW FACT from Facts.list :expt_type sir
NEW FACT from Facts.list :input_file_list ['rt_rd_1.sca', 'auki_rd_1.sca']
NEW FACT from Facts.list :nat_der_list ['Native', 'Au']
NEW FACT from Facts.list :inano_list ['noinano', 'inano']
NEW FACT from Facts.list :n_ha_list [0, 5]
NEW FACT from Facts.list :run_list ['start']

This is just a repeat of the parameters in your Facts.list script. The last fact is the "run_list start" command, which tells the AutoSol Wizard to read in the data (recall that we put in this command after each native-derivative combination so the Wizard could read it in as an SIR dataset).

Reading the datafiles.

The AutoSol Wizard will read in your datafiles and check their contents, printing out a summary for each one. This is done one dataset at a time (each native-derivative pair) until all have been read in. Here is the summary for the first derivative:

HKLIN ENTRY:  rt_rd_1.sca
FILE TYPE scalepack_no_merge_original_index
GUESS FILE TYPE MERGE TYPE sca unmerged
LABELS['I', 'SIGI']
CONTENTS: ['rt_rd_1.sca', 'sca', 'unmerged', 'P 21 21 2', None, None, 
['I', 'SIGI']]
Not checking SG as cell or sg not yet defined
SG from  rt_rd_1.sca  is:  P 21 21 2
HKLIN ENTRY:  auki_rd_1.sca
FILE TYPE scalepack_no_merge_original_index
GUESS FILE TYPE MERGE TYPE sca unmerged
LABELS['I', 'SIGI']
CONTENTS: ['auki_rd_1.sca', 'sca', 'unmerged', 'P 21 21 21', None, None, 
['I', 'SIGI']]
Converting the files ['rt_rd_1.sca', 'auki_rd_1.sca'] to sca format before proceeding

ImportRawData.

The input data files rt_rd_1.sca and auki_rd_1.sca are in unmerged Scalepack format. The AutoSol wizard converts everything to premerged Scalepack format before proceeding. Here is where the AutoSol Wizard identifies the format and then calls the ImportRawData Wizard:

Running import directly...
WIZARD:  ImportRawData
followed eventually by...
List of output files :
File 1: rt_rd_1_PHX.sca
File 2: auki_rd_1_PHX.sca
These output files are in premerged Scalepack format. After completing the ImportRawData step, the AutoSol Wizard goes back to the beginning, but uses the newly-converted files rt_rd_1_PHX.sca and auki_rd_1_PHX.sca:
HKLIN ENTRY:  AutoSol_run_1_/rt_rd_1_PHX.sca
FILE TYPE scalepack_merge
GUESS FILE TYPE MERGE TYPE sca premerged
LABELS['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']
Unit cell: (93.796, 79.849, 43.108, 90, 90, 90)
Space group: P 21 21 2 (No. 18)
CONTENTS: ['AutoSol_run_1_/rt_rd_1_PHX.sca', 'sca', 'premerged', 'P 21 21 2', 
[93.796000000000006, 79.849000000000004, 43.107999999999997, 90.0, 90.0, 90.0],
 2.4307589843043771, ['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']]
HKLIN ENTRY:  AutoSol_run_1_/auki_rd_1_PHX.sca
FILE TYPE scalepack_merge
GUESS FILE TYPE MERGE TYPE sca premerged
LABELS['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']
Unit cell: (93.796, 79.849, 43.108, 90, 90, 90)
Space group: P 21 21 2 (No. 18)
CONTENTS: ['AutoSol_run_1_/auki_rd_1_PHX.sca', 'sca', 'premerged', 'P 21 21 2', 
[93.796000000000006, 79.849000000000004, 43.107999999999997, 90.0, 90.0, 90.0], 
2.430806639777233, ['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']]
Total of 2 input data files
['AutoSol_run_1_/rt_rd_1_PHX.sca', 'AutoSol_run_1_/auki_rd_1_PHX.sca']

Guessing cell contents

The AutoSol Wizard uses the sequence information in your sequence file (sequence.dat) and the cell parameters and space group to guess the number of NCS copies and the solvent fraction.

 
AutoSol_guess_setup_for_scaling  AutoSol  Run 1 Fri Mar  7 01:24:08 2008

Solvent fraction and resolution and ha types/scatt fact
Guessing setup for scaling dataset 1
SG P 21 21 2
cell [93.796000000000006, 79.849000000000004, 43.107999999999997, 90.0, 90.0, 90.0]
Number of residues in unique chains in seq file: 294
Unit cell: (93.796, 79.849, 43.108, 90, 90, 90)
Space group: P 21 21 2 (No. 18)
CELL VOLUME :322858.090387
N_EQUIV:4
GUESS OF NCS COPIES: 1
SOLVENT FRACTION ESTIMATE: 0.51
Total residues:294
Total Met:6
resolution estimate: 2.8
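The solvent fraction in this log can be roughly reproduced with a Matthews-coefficient estimate. The sketch below is not necessarily the calculation AutoSol performs. It assumes a mean residue mass of 110 Da and a protein volume constant of 1.23 A^3/Da, both common rules of thumb that do not come from the log. The cell volume, N_EQUIV and residue count are taken from the log above.

```python
CELL_VOLUME = 322858.090387  # A^3, from the log above
N_EQUIV = 4                  # symmetry-equivalent positions, from the log
N_RESIDUES = 294             # residues in the sequence file, from the log

# Assumptions, not taken from the AutoSol log:
MEAN_RESIDUE_MASS = 110.0    # Da per residue (rule of thumb)
PROTEIN_CONSTANT = 1.23      # A^3/Da (Matthews protein volume constant)


def solvent_fraction(cell_volume, n_equiv, n_residues, ncs_copies):
    """Estimate the solvent fraction from the Matthews coefficient V_M."""
    mass_per_asu = n_residues * MEAN_RESIDUE_MASS * ncs_copies
    v_m = cell_volume / (n_equiv * mass_per_asu)  # A^3 per Da
    return 1.0 - PROTEIN_CONSTANT / v_m


for copies in (1, 2):
    fraction = solvent_fraction(CELL_VOLUME, N_EQUIV, N_RESIDUES, copies)
    print(f"NCS copies = {copies}: solvent fraction ~ {fraction:.2f}")
```

Under these assumptions one copy gives a solvent fraction of about 0.51, in line with the log, while two copies would leave almost no room for solvent, which is consistent with the guess of a single NCS copy.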

Running phenix.xtriage

The AutoSol Wizard automatically runs phenix.xtriage on each of your input datafiles to analyze them for twinning, outliers, translational symmetry, and other special conditions that you should be aware of. You can read more about xtriage in Data quality assessment with phenix.xtriage. Part of the summary output from xtriage for this dataset looks like this:

 
No (pseudo)merohedral twin laws were found.

Patterson analyses
  - Largest peak height   : 6.680
   (corresponding p value : 0.56306)

The largest off-origin peak in the Patterson function is 6.68% of the
height of the origin peak. No significant pseudotranslation is detected.

The results of the L-test indicate that the intensity statistics
behave as expected. No twinning is suspected.

In this space group (P 21 21 2) with the cell dimensions in this structure, no twin laws are possible, so you do not have to worry about twinning. There is also no large off-origin peak in the native Patterson, so there does not appear to be any translational pseudo-symmetry.

Testing for anisotropy in the data

After all the SIR datasets are read in, the AutoSol Wizard tests for anisotropy by determining the range of effective anisotropic B values along the principal lattice directions. If this range is large and the ratio of the largest to the smallest value is also large then the data are by default corrected to make the anisotropy small (see Analyzing and scaling the data in the AutoSol web page for more discussion of the anisotropy correction). In the rh-dehalogenase case, the range of anisotropic B values is small and no correction is made:

 Range of aniso B:  13.06 19.68
Not using aniso-corrected data files as the range of aniso b  is 
only  6.62  and 'correct_aniso' is not set
Note that if any one of the datafiles in a MIR dataset has a high anisotropy, then by default all of them will be corrected for anisotropy.
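The two quantities the Wizard looks at, the range and the ratio of the anisotropic B values, are simple to compute from the values in the log. This sketch only reproduces that arithmetic; it deliberately does not encode AutoSol's thresholds, which are not stated here, and the function name aniso_summary is hypothetical.

```python
def aniso_summary(b_values):
    """Return (range, ratio) of anisotropic B values along the principal axes."""
    b_min, b_max = min(b_values), max(b_values)
    return b_max - b_min, b_max / b_min


# Smallest and largest aniso B from the log above.
b_range, b_ratio = aniso_summary([13.06, 19.68])
print(f"range = {b_range:.2f}, ratio = {b_ratio:.2f}")
```

For this dataset the range is 6.62, matching the value reported in the log, and the largest value is about 1.5 times the smallest.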

Scaling MIR data

The AutoSol Wizard uses SOLVE localscaling to scale MIR data. The procedure is basically to scale all the data to the native. During this process, outliers that deviate from the reference values by more than ratio_out (default=3) standard deviations (using all data in the appropriate resolution shell to estimate the SD) are rejected.
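The outlier rule can be illustrated with a short sketch. This is not SOLVE's implementation: it assumes the deviations from the reference within one resolution shell are already available as a list, and it estimates the SD as the RMS deviation about zero. Only the name ratio_out and its default of 3 come from the text above.

```python
from math import sqrt


def reject_outliers(deviations, ratio_out=3.0):
    """Split deviations from the reference into (kept, rejected) lists.

    The SD is estimated from all deviations in the shell (RMS about zero);
    values more than ratio_out SDs from the reference are rejected.
    """
    if not deviations:
        return [], []
    sd = sqrt(sum(d * d for d in deviations) / len(deviations))
    limit = ratio_out * sd
    kept = [d for d in deviations if abs(d) <= limit]
    rejected = [d for d in deviations if abs(d) > limit]
    return kept, rejected


# Hypothetical deviations for one resolution shell; the last is an outlier.
shell = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, 0.0, 6.0]
kept, rejected = reject_outliers(shell)
print("rejected:", rejected)
```

Because the outlier itself inflates the SD estimate, a test of this kind is conservative in shells with few reflections.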

Running HYSS to find the heavy-atom substructure

The HYSS (hybrid substructure search) procedure for heavy-atom searching uses a combination of a Patterson search for 2-site solutions with direct methods recycling. The search ends when the same solution is found beginning with several different starting points. The HYSS log files are named after the datafile that they are based on and the type of differences (ano, iso) that are being used. In this rh-dehalogenase MIR dataset, the HYSS logfile for the HgKI derivative is hgki_rd_1_PHX.sca_iso_2.sca_hyss.log. The key part of this HYSS log file is:

Entering search loop:

p = peaklist index in Patterson map
f = peaklist index in two-site translation function
cc = correlation coefficient after extrapolation scan
r = number of dual-space recycling cycles
cc = final correlation coefficient

p=000 f=000 cc=0.190 r=015 cc=0.250 [ best cc: 0.250 ]
p=000 f=001 cc=0.191 r=015 cc=0.242 [ best cc: 0.250 0.242 ]
Number of matching sites of top 2 structures: 3
p=000 f=002 cc=0.174 r=015 cc=0.200 [ best cc: 0.250 0.242 ]
p=001 f=000 cc=0.167 r=015 cc=0.230 [ best cc: 0.250 0.242 0.230 ]
Number of matching sites of top 2 structures: 3
Number of matching sites of top 3 structures: 2
...
p=011 f=002 cc=0.165 r=015 cc=0.229 [ best cc: 0.293 0.279 0.277 0.276 ]
p=012 f=000 cc=0.184 r=015 cc=0.250 [ best cc: 0.293 0.279 0.277 0.276 ]
p=012 f=001 cc=0.148 r=015 cc=0.292 [ best cc: 0.293 0.292 0.279 0.277 ]
Number of matching sites of top 2 structures: 7
Number of matching sites of top 3 structures: 7
Number of matching sites of top 4 structures: 6

As a guide, a correlation coefficient of 0.5 is very good (0.1 is hopeless, 0.2 is possible, 0.3 is good). In the excerpt above, 3 sites matched between the top two structures after the first two tries. The program continues until the top 4 structures all have 6 matching sites, then ends and prints out the final correlations, after taking the top 5 sites.
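To follow the progress of a HYSS search without reading the whole log, the trial lines can be extracted with a regular expression. The sketch below assumes the line layout shown in the excerpt above (p=, f=, cc=, r=, cc=); the function name parse_trials and the dictionary keys are hypothetical.

```python
import re

# One search-loop line: Patterson index, translation-function index,
# cc after the extrapolation scan, recycling cycles, final cc.
TRIAL = re.compile(
    r"p=(?P<p>\d+)\s+f=(?P<f>\d+)\s+cc=(?P<cc_start>[\d.]+)\s+"
    r"r=(?P<r>\d+)\s+cc=(?P<cc_final>[\d.]+)"
)


def parse_trials(log_text):
    """Return one dict per search-loop line found in log_text."""
    return [
        {
            "p": int(m["p"]),
            "f": int(m["f"]),
            "cc_start": float(m["cc_start"]),
            "r": int(m["r"]),
            "cc_final": float(m["cc_final"]),
        }
        for m in TRIAL.finditer(log_text)
    ]


excerpt = """\
p=000 f=001 cc=0.191 r=015 cc=0.242 [ best cc: 0.250 0.242 ]
Number of matching sites of top 2 structures: 3
p=000 f=002 cc=0.174 r=015 cc=0.200 [ best cc: 0.250 0.242 ]
p=001 f=000 cc=0.167 r=015 cc=0.230 [ best cc: 0.250 0.242 0.230 ]
p=012 f=001 cc=0.148 r=015 cc=0.292 [ best cc: 0.293 0.292 0.279 0.277 ]
"""

trials = parse_trials(excerpt)
best = max(trials, key=lambda t: t["cc_final"])
print(f"{len(trials)} trials, best final cc = {best['cc_final']}")
```

Lines that do not match the pattern, such as the "Number of matching sites" lines, are simply skipped.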

Finding the hand and scoring heavy-atom solutions

Normally either hand of the heavy-atom substructure is a possible solution. Because both hands give the same statistics for all heavy-atom analysis and phasing steps, both must be tested by calculating phases, examining the electron density map, and carrying out density modification. Note that in chiral space groups (those that have a handedness, such as P61), both hands of the space group must be tested. The AutoSol Wizard will do this for you, inverting the hand of the heavy-atom substructure and the space group at the same time. For example, in space group P61 the hand of the substructure is inverted and then it is placed in space group P65.

The AutoSol Wizard scores heavy-atom solutions based on two criteria by default. The first criterion is the skew of the electron density in the map (SKEW). Good values for the skew are anything greater than 0.1. In a MIR structure determination, the heavy-atom solution with the correct hand may have a more positive skew than the one with the inverse hand. The second criterion is the correlation of local RMS density (CORR_RMS). This is a measure of how contiguous the solvent and non-solvent regions are in the map. (If the local rms is low at one point and also low at neighboring points, then the solvent region must be relatively contiguous, and not split up into small regions.)

For MIR datasets, SOLVE is used for calculating phases. For a MIR dataset, a figure of merit of 0.5 is acceptable, 0.6 is fine and anything above 0.7 is very good. The scores are listed in the AutoSol log file. Here is the scoring for solution 4 (the best initial map):

AutoSol_run_1_/TEMP0/resolve.scores SKEW 0.2797302
AutoSol_run_1_/TEMP0/resolve.scores CORR_RMS 0.9306123

CC-EST (BAYES-CC) SKEW : 57.8 +/- 17.0
CC-EST (BAYES-CC) CORR_RMS : 63.3 +/- 28.2
ESTIMATED MAP CC x 100:  60.8 +/- 13.3

This is a good solution, with a high (and positive) skew (0.28) and a high correlation of local rms density (0.93). The ESTIMATED MAP CC x 100 is an estimate of the quality of the experimental electron density map (not the density-modified one). A set of real structures was used to calibrate the range of values of each score that were obtained for phases of varying quality. The resulting probability distributions are used above to estimate the correlation between the experimental map and an ideal map for this structure. Then all the estimates are combined to yield an overall Bayesian estimate of the map quality. These are reported as CC x 100 +/- 2SD. These estimated map CC values are usually fairly close to the true values, so if the estimate is 60.8 +/- 13.3 then you can be confident that your structure is solved and that the density-modified map will be quite good.

In this case the datasets used to find heavy-atom substructures were the isomorphous differences for each derivative. For each dataset one solution was found, and that solution and its inverse were scored. The scores were (skipping extra text below):
SCORING SOLUTION 1: Solution 1 using HYSS on 
/net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/
AutoSol_run_1_/auki_rd_1_PHX.sca_iso_1.sca. Dataset #1, with 5 sites
ESTIMATED MAP CC x 100:  29.6 +/- 34.8

SCORING SOLUTION 2: Solution  2 using HYSS on 
/net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/
AutoSol_run_1_/auki_rd_1_PHX.sca_iso_1.sca and taking inverse. Dataset #1, with 5 sites
ESTIMATED MAP CC x 100:  41.7 +/- 26.5

SCORING SOLUTION 3: Solution 3 using HYSS on 
/net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/
AutoSol_run_1_/hgki_rd_1_PHX.sca_iso_2.sca. Dataset #2, with 5 sites
ESTIMATED MAP CC x 100:  40.4 +/- 30.2

SCORING SOLUTION 4: Solution  4 using HYSS on 
/net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/
AutoSol_run_1_/hgki_rd_1_PHX.sca_iso_2.sca and taking inverse. Dataset #2, with 5 sites
ESTIMATED MAP CC x 100:  60.8 +/- 13.3

In this case the best score was solution 4 (as shown above), based on the HGKI derivative and taking the inverse of the heavy-atom sites, with an ESTIMATED MAP CC x 100 of 60.8 +/- 13.3. The score from the opposite hand (solution 3) was just 40.4 +/- 30.2, and so the hand was clear.
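The comparison of scores can also be done by pulling the ESTIMATED MAP CC lines out of the log. The following sketch assumes the "SCORING SOLUTION n:" and "ESTIMATED MAP CC x 100:" layout shown in the excerpt above. The function name parse_scores is hypothetical, and the sample text is abbreviated from the log (the long directory paths are omitted).

```python
import re

# Pair each 'SCORING SOLUTION n:' header with the next estimated map CC line.
SCORE = re.compile(
    r"SCORING SOLUTION\s+(\d+):.*?"
    r"ESTIMATED MAP CC x 100:\s+([\d.]+)\s+\+/-\s+([\d.]+)",
    re.DOTALL,
)


def parse_scores(log_text):
    """Return {solution number: (estimated CC x 100, reported uncertainty)}."""
    return {
        int(number): (float(cc), float(spread))
        for number, cc, spread in SCORE.findall(log_text)
    }


excerpt = """\
SCORING SOLUTION 1: Solution 1 using HYSS on
AutoSol_run_1_/auki_rd_1_PHX.sca_iso_1.sca. Dataset #1, with 5 sites
ESTIMATED MAP CC x 100:  29.6 +/- 34.8

SCORING SOLUTION 2: Solution  2 using HYSS on
AutoSol_run_1_/auki_rd_1_PHX.sca_iso_1.sca and taking inverse. Dataset #1, with 5 sites
ESTIMATED MAP CC x 100:  41.7 +/- 26.5

SCORING SOLUTION 3: Solution 3 using HYSS on
AutoSol_run_1_/hgki_rd_1_PHX.sca_iso_2.sca. Dataset #2, with 5 sites
ESTIMATED MAP CC x 100:  40.4 +/- 30.2

SCORING SOLUTION 4: Solution  4 using HYSS on
AutoSol_run_1_/hgki_rd_1_PHX.sca_iso_2.sca and taking inverse. Dataset #2, with 5 sites
ESTIMATED MAP CC x 100:  60.8 +/- 13.3
"""

scores = parse_scores(excerpt)
best_solution = max(scores, key=lambda n: scores[n][0])
print(best_solution, scores[best_solution])
```

Ranking here uses the estimated CC alone; how AutoSol itself weighs the uncertainty when choosing a solution is not covered by this sketch.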

Finding origin shifts between heavy-atom solutions for different derivatives and combining phases

Depending on the space group, there may be a few or infinitely many totally equivalent heavy-atom substructures for a particular native-derivative pair. These are related to each other by translations that can be thought of as offsets of the origins for the two substructures. The AutoSol Wizard identifies the allowed offsets for the space group. Then it aligns the solutions from different derivatives by finding the origin offset that maximizes the correlation of electron density in the native Fouriers for the two. Then it combines the phases from the two using addition of Hendrickson-Lattman coefficients. These combined phases are then used to score the phasing obtained by combining the two derivatives. The best combinations are iteratively combined until all available derivatives are considered and combined in an optimal fashion. Once an optimal set of derivatives and sites is found, SOLVE Bayesian correlated phasing is used to calculate a final set of native phases from the native and all the derivatives at once. Here is the best pair of derivatives from this first cycle:

Getting origin shift for 2 mapped on to 4
Phases from solution 4:solve_4.mtz
Phases from solution 2:solve_2.mtz
Merged ha files in ha_4_2.pdb
Merged files in merged_4_2.mtz
FOM solution 4: 0.56    FOM solution 2: 0.44    
Correlation of maps: 0.249    Ideal map correlation: 0.2464

RESULT: FOM solution 4: 0.56    FOM solution 2: 0.44    
Correlation of maps: 0.249    Ideal map correlation: 0.2464
 Origin offset of solution 2: [-0.5, 0.0, 0.0]
Here solutions 2 and 4 have a map correlation of 0.25, just about the same as expected based on the FOM of the two solutions (0.56 and 0.44) and assuming random errors. The two solutions differ by an origin shift of 0.5 along x. The two solutions are then phased as a group to use as the basis for density modification:
Merging a set of solutions and phasing the group with SOLVE
...
 PHASED SOLUTION: Solution 9 based on MIR phasing starting from 
solutions 4 (dataset #2)  and 2 (dataset #1)
However in this case after phasing with the two derivatives together, the score is not improved over the HGKI derivative by itself:
AutoSol_run_1_/TEMP0/resolve.scores SKEW 0.1357029
AutoSol_run_1_/TEMP0/resolve.scores CORR_RMS 0.9086851

CC-EST (BAYES-CC) SKEW : 37.3 +/- 31.3
CC-EST (BAYES-CC) CORR_RMS : 60.8 +/- 31.5
ESTIMATED MAP CC x 100:  49.1 +/- 20.2
Though worse than the HGKI solution by itself, this is a reasonably good solution, with a moderate positive skew (0.14) and a good correlation of local rms density (0.91). As the original HGKI solution was the best, it is used for density modification and finding additional sites:
SOLUTION USED TO START DEN MOD:
Solution  4 using HYSS on 
/net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/
AutoSol_run_1_/hgki_rd_1_PHX.sca_iso_2.sca and taking inverse. Dataset #2
HKLIN: solve_4.mtz
Testing density modification with mask_type = wang
RFACTOR:  0.2475
Best mask type so far is  wang
Testing density modification with mask_type = histograms
RFACTOR:  0.2553

Note that two types of masks (wang and histograms) are tested in the density modification procedure. This is because sometimes one method for identification of the solvent region is better than the other. The wang method chooses the solvent region as those points surrounded by regions of low variation. The histograms method chooses points instead based on the similarity of the histograms of density nearby to those of idealized solvent and protein regions. The R-factor for density modification is used to choose which is working best in this case (the wang method).
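Returning to the origin-shift output earlier in this section, the "Ideal map correlation" reported there is numerically equal to the product of the two figures of merit. That is what one would expect if each map correlates with the true map at roughly its FOM and the errors in the two maps are independent. The sketch below only checks that arithmetic; that the Wizard computes the value exactly this way is an assumption, and the function name is hypothetical.

```python
def expected_map_correlation(fom_a, fom_b):
    """Expected correlation of two maps with independent (random) errors.

    Assumption: each map's correlation to the true map is about its FOM,
    so the correlation between the two maps is about the product.
    """
    return fom_a * fom_b


# FOM values for solutions 4 and 2, from the log earlier in this section.
ideal = expected_map_correlation(0.56, 0.44)
observed = 0.249  # "Correlation of maps" from the same log
print(f"ideal = {ideal:.4f}, observed = {observed:.3f}")
```

An observed correlation close to this product suggests the two solutions are on the same origin and hand and that their errors are largely independent.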

Finding additional sites by density modification and heavy-atom difference Fouriers

When AutoSol is used with the default keyword of thoroughness=thorough as in this example, additional heavy-atom sites are found in three steps: phasing using the current model, carrying out density modification to improve the phases, and using the improved phases along with isomorphous differences and the phase difference between the heavy atoms and the non-heavy atoms to calculate Fourier maps showing the positions of the heavy atoms. The top peaks in these maps are used as trial heavy-atom sites (if they are not already part of the heavy-atom model). In this example solution 4 from derivative 2 is used for this phasing/density modification/Fourier procedure. Sites are found for all the derivatives, and new solutions are created and scored using the top sites for each derivative. The combinations are then tested as above, and the highest-scoring ones are kept again. The best solution found is #107:

 PHASED SOLUTION: Solution 107 based on MIR phasing starting from 
solutions 4 (dataset #2)  and 48 (dataset #1)
...
AutoSol_run_1_/TEMP0/resolve.scores SKEW 0.3863446
AutoSol_run_1_/TEMP0/resolve.scores CORR_RMS 0.9383891

CC-EST (BAYES-CC) SKEW : 66.8 +/- 15.3
CC-EST (BAYES-CC) CORR_RMS : 63.8 +/- 27.5
ESTIMATED MAP CC x 100:  68.4 +/- 12.4

This is quite a good solution, with high skew (0.39) and correlation of local rms density (0.94). This solution is the best overall and is used for final phasing and density modification. Notice that it only contains two of the five derivatives. The merging procedure identifies which combinations of derivatives give the best phasing, and all the other derivatives are ignored.

Final phasing with SOLVE

Once the best heavy-atom solution or solutions are chosen based on Z-scores, these are used in a final round of phasing with SOLVE (for MIR phasing). In this case several nearly-equally-good solutions are available, and all are used in phasing, density modification and initial model-building, with the R-factor in density modification and the model-map correlation in model-building being used to identify the best solutions. The log file from phasing for solution 107 is in solve_107.prt. The heavy-atom model is refined and phases are calculated with Bayesian correlated MIR phasing. An important part of this phasing method is a statistical method of taking into account the correlation of non-isomorphism among derivatives. The extent of this correlation is listed in the solve_107.prt summary file:

SUMMARY OF CORRELATED ERRORS AMONG DERIVATIVES

 DERIVATIVE:            1
 CENTRIC REFLECTIONS:
 DMIN:            ALL     10.22   6.41   4.99   4.23   3.73   3.37   3.10   2.89
 RMS errors correlated and uncorrelated with others in group:
      Correlated:   51.5   69.4   50.8   42.4   52.7   57.8   55.0   40.9   30.5
    Uncorrelated:   31.8   26.0   31.1   41.7   22.7   29.1   12.4   38.1   43.0

 Correlation of errors with other derivs:
 DERIV 2:           0.58   0.74   0.61   0.44   0.54   0.65   0.63   0.51   0.37

Here the centric reflections in derivative 1 have non-isomorphism errors related to those in derivative 2, with a correlation coefficient overall of 0.58. Another way to look at this is that the RMS correlated error is 51.5 and the RMS uncorrelated (random) error is just 31.8. That means that a big part of the errors is correlated and should be treated as such. The final occupancies and coordinates are listed at the end:
                    SITE  ATOM       OCCUP     X       Y       Z         B
 CURRENT VALUES:      1    Hg       0.3587  0.7222  0.2799  0.4213    6.2541
 CURRENT VALUES:      2    Hg       0.3862  0.1884  0.1574  0.4392   30.8420
 CURRENT VALUES:      3    Hg       0.2914  0.7366  0.2517  0.4165    8.5235
 CURRENT VALUES:      4    Hg       0.2864  0.7111  0.2957  0.4698    8.0632

                    SITE  ATOM       OCCUP     X       Y       Z         B
 CURRENT VALUES:      1    Au       0.5620  0.7124  0.2832  0.4337   33.6309
 CURRENT VALUES:      2    Au       0.3820  0.2087  0.1867  0.4641   19.5002
 CURRENT VALUES:      3    Au       0.2807  0.3614  0.3383  0.4823   13.1097
 CURRENT VALUES:      4    Au       0.0610  0.3693  0.4041  0.3216    1.0000

In this case the occupancies of the top sites are about 1/3 to 2/3, which is fine for MIR (particularly with such heavy atoms as Hg and Au).
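For the correlated-error summary earlier in this section, one way to put a number on "a big part of the errors" is the fraction of the error variance that is correlated in each resolution shell. The sketch below assumes the correlated and uncorrelated RMS errors add in quadrature; this is a simplification for illustration, not a description of how SOLVE treats them. The values are copied from the summary for derivative 1.

```python
# Values from the SUMMARY OF CORRELATED ERRORS table for derivative 1.
shells = ["ALL", "10.22", "6.41", "4.99", "4.23", "3.73", "3.37", "3.10", "2.89"]
correlated = [51.5, 69.4, 50.8, 42.4, 52.7, 57.8, 55.0, 40.9, 30.5]
uncorrelated = [31.8, 26.0, 31.1, 41.7, 22.7, 29.1, 12.4, 38.1, 43.0]


def correlated_variance_fraction(rms_corr, rms_uncorr):
    """Fraction of total error variance that is correlated.

    Assumes the two RMS components are independent and add in quadrature.
    """
    return rms_corr**2 / (rms_corr**2 + rms_uncorr**2)


fractions = {
    shell: correlated_variance_fraction(c, u)
    for shell, c, u in zip(shells, correlated, uncorrelated)
}
for shell, fraction in fractions.items():
    print(f"DMIN {shell:>5}: {fraction:.2f}")
```

Under this assumption roughly 72% of the overall error variance for derivative 1 is correlated with the other derivative.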

Statistical density modification with RESOLVE

After MIR phases are calculated with SOLVE, the AutoSol Wizard uses RESOLVE density modification to improve the quality of the electron density map. The statistical density modification in RESOLVE takes advantage of the flatness of the solvent region and the expected distribution of electron density in the region containing the macromolecule, as well as any NCS that can be found from the heavy-atom substructure. The weighted structure factors and phases (FP, PHIB) from SOLVE are used to calculate the starting map for RESOLVE, and the experimental structure factor amplitudes (FP) and MIR Hendrickson-Lattman coefficients from SOLVE are used in the density modification process. The output from RESOLVE for solution 107 can be found in resolve_107.log.

Here are key sections of this output. The first is a plot of how many points in the "protein" region of the map have each possible value of electron density. The plot below is normalized so that a density of zero is the mean of the solvent region, and the standard deviation of the density in the map is 1.0. A perfect map has a lot of points with density slightly less than zero on this scale (the points between atoms) and a few points with very high density (the points near atoms), and no points with very negative density. Such a map has a very high skew (think "skewed off to the right"). This map is good, with a positive skew, though it is not perfect.


 Plot of Observed (o) and model (x) electron density distributions for protein
 region, where the model distribution is given by,
  p_model(beta*(rho+offset)) = p_ideal(rho)
 and then convoluted with a gaussian with width of sigma
 where sigma, offset and beta are given below under "Error estimate."

                          0.03..................................................
                              .                   .                            .
                              .                xx .                            .
                              .              xxooxx                            .
                              .             xxo   .xx                          .
                              .            xoo    . xo                         .
                              .           xx      .  xxo                       .
                p(rho)        .          x        .    xoo                     .
                              .         ox        .     xxoo                   .
                              .        ox         .       xxoo                 .
                              .       ox          .         xxxo               .
                              .      oxx          .            xxx             .
                              .     oxx           .              oxxx          .
                              .   oxx             .                ooxxx       .
                              .ooxxx              .                   ooxxxxxx .
                         0.0  xxx.........................................oooxxx

                             -2        -1         0         1         2        3

                                  normalized rho (0 = mean of solvent region)

 -------------------------------------------------------------------------------

After density modification, the curve is more ideal, with a very strong positive skew:

                          0.03..................................................
                              .                   .                            .
                              .                   .                            .
                              .          xxxxxxx  .                            .
                              .         xo  oooxx .                            .
                              .        xo     o oxx                            .
                              .       x          oxxo                          .
                p(rho)        .      xx           . xo                         .
                              .    oxx            .  xxoo                      .
                              .    x              .    xxoooo                  .
                              .   xx              .      xxxxoooo              .
                              .  xx               .          xxxxxxxxoo        .
                              .oxo                .                 oxxxxxxxo  .
                              xx                  .                        xxxxo
                              x                   .                            x
                         0.0  x................................................x

                             -2        -1         0         1         2        3

                                  normalized rho (0 = mean of solvent region)
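To get a feel for what "positive skew" means for these distributions, here is a minimal sketch that computes the usual sample skewness (third central moment divided by the cube of the standard deviation) for two synthetic sets of density values. The data are invented for illustration, and the exact skew statistic that RESOLVE reports may be defined differently.

```python
import random

def skew(values):
    """Sample skewness: third central moment divided by sigma cubed."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5

rng = random.Random(0)

# Noise-like map: density values distributed symmetrically about the mean.
noisy_map = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Idealized map (synthetic): most points slightly below zero (between atoms),
# a small fraction strongly positive (near atoms).
good_map = [
    rng.gauss(-0.3, 0.4) if rng.random() < 0.9 else rng.gauss(2.5, 0.8)
    for _ in range(20000)
]

print("skew of noise-like map: %.2f" % skew(noisy_map))
print("skew of idealized map:  %.2f" % skew(good_map))
```

The noise-like values give a skew near zero, while the values with a long tail toward high density give a clearly positive skew, which is the behaviour described for a good map.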
  
The key statistic from this RESOLVE density modification is the R-factor for comparison of observed structure factor amplitudes (FP) with those calculated from the density modification procedure (FC). In this rh-dehalogenase MIR phasing the R-factor is very low:
Overall R-factor for FC vs FP: 0.254 for       8313 reflections
An acceptable value is anything below 0.35, and a value below 0.30 is good. The R-factors for all the solutions considered at this stage were nearly the same: #105 (R=0.254), #107 (R=0.254) and #103 (R=0.257), and all of them were used for initial model-building as well.
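As a rough illustration of this statistic, the sketch below uses the conventional crystallographic R-factor, R = sum(|FP - FC|) / sum(FP), and applies the two thresholds quoted above. It assumes FP and FC are already on the same scale; the scaling and any weighting that RESOLVE applies internally are not reproduced here, and the function names are invented for this example.

```python
def r_factor(f_obs, f_calc):
    """Conventional R-factor, assuming f_obs and f_calc are on the same scale."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    return sum(abs(o - c) for o, c in zip(f_obs, f_calc)) / sum(f_obs)

def rate_density_modification(r):
    """Apply the rule of thumb from the text: <0.30 good, <0.35 acceptable."""
    if r < 0.30:
        return "good"
    if r < 0.35:
        return "acceptable"
    return "above the acceptable range"

# Toy amplitudes, invented for illustration only.
fp = [100.0, 200.0, 300.0]
fc = [90.0, 220.0, 270.0]
print("R = %.3f" % r_factor(fp, fc))

# The value reported for solution 107 in this tutorial.
print(rate_density_modification(0.254))
```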

Generation of FreeR flags

The AutoSol Wizard will create a set of free R flags indicating which reflections are not to be used in refinement. By default 5% of reflections (up to a maximum of 2000) are reserved for this test set. If you want to supply a reflection file hires.mtz that has higher resolution than the data used to solve the structure, or has a test set already marked, then you can do this with the keyword input_refinement_file=hires.mtz. The files to be used for model-building and refinement are listed in the AutoSol log file.
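The default size of the test set follows directly from the two numbers above (5%, capped at 2000). The sketch below works this out and marks a random subset of reflections. Picking the reflections uniformly at random, and the rounding used, are assumptions made for this example; the Wizard's actual selection procedure is not described here.

```python
import random

def choose_free_flags(n_reflections, fraction=0.05, max_free=2000, seed=0):
    """Return a list of booleans, True for reflections set aside as the test set.

    Uniform random selection is an assumption for illustration only.
    """
    n_free = min(int(round(fraction * n_reflections)), max_free)
    rng = random.Random(seed)
    free = set(rng.sample(range(n_reflections), n_free))
    return [i in free for i in range(n_reflections)]

small = choose_free_flags(10000)    # 5% of 10000 is under the cap
large = choose_free_flags(100000)   # 5% would be 5000, so the cap applies
print(sum(small), sum(large))
```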

 

Model-building with RESOLVE

The AutoSol Wizard by default uses a very quick method to build just the secondary structure of your macromolecule. This is controlled by the keyword helices_strands_only=True. The Wizard will guess from your sequence file whether the structure is protein or RNA or DNA (but you can tell it if you want with chain_type=PROTEIN). If the quick model-building does not build a satisfactory model (if the correlation of map and model is less than acceptable_secondary_structure_cc=0.35), then model-building is tried again with the standard build procedure. This is essentially the same as one cycle of model-building with the AutoBuild Wizard (see the web page Automated Model Building and Rebuilding with AutoBuild), except that if you specify thoroughness=quick as we have in this example, the model-building is done less comprehensively to speed things up. In this case the secondary-structure-only model-building using solution #107 produces an initial model with 162 residues built and side chains assigned to 52, and which has a model-map correlation of 0.42:

Model with helices and strands is in  Build_2.pdb
Log for helices and strands is in  Build_2.log
Model aligned with  AutoSol_run_1_/TEMP0/coords.pdb  is in  resolve_compare.pdb
Final file:  AutoSol_run_1_/TEMP0/Build_2.pdb
Log file:  Build_2.log  copied to  Build_2.log
Model 2: Residues built=162  placed=52  Chains=18  Model-map CC=0.42
This is new best model with cc =  0.42
Getting R for model:  Build_2.pdb
Model: AutoSol_run_1_/TEMP0/refine_2.pdb  R/Rfree=0.53/0.54
This is quite an adequate secondary-structure-only model. It is just a preliminary model, but it is good enough to tell that the structure is solved. For full model-building you will want to go on and use the AutoBuild Wizard (see the web page Automated Model Building and Rebuilding with AutoBuild).
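If you are scripting around AutoSol, the "Model N:" line in the log carries the numbers discussed here. The following sketch extracts them and compares the model-map correlation with the acceptable_secondary_structure_cc threshold of 0.35 mentioned above. The regular expression assumes the line format shown in this log excerpt, and both function names are hypothetical.

```python
import re

# Assumed format, taken from the log excerpt above.
MODEL_RE = re.compile(
    r"Model\s+(\d+):\s+Residues built=(\d+)\s+placed=(\d+)\s+"
    r"Chains=(\d+)\s+Model-map CC=([\d.]+)"
)

def parse_model_line(line):
    """Return the build statistics from a 'Model N:' log line, or None."""
    m = MODEL_RE.search(line)
    if m is None:
        return None
    return {
        "model": int(m.group(1)),
        "residues_built": int(m.group(2)),
        "side_chains_placed": int(m.group(3)),
        "chains": int(m.group(4)),
        "model_map_cc": float(m.group(5)),
    }

def needs_standard_build(model_map_cc, acceptable_secondary_structure_cc=0.35):
    """True if the quick build falls below the acceptance threshold."""
    return model_map_cc < acceptable_secondary_structure_cc

stats = parse_model_line(
    "Model 2: Residues built=162  placed=52  Chains=18  Model-map CC=0.42"
)
print(stats)
print("standard build needed:", needs_standard_build(stats["model_map_cc"]))
```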

The AutoSol_summary.dat summary file

A quick summary of the results of your AutoSol run is in the AutoSol_summary.dat file in your output directory. This file lists the key files that were produced in your run of AutoSol (all these are in the output directory) and some of the key statistics for the run, including the scores for the heavy-atom substructure and the model-building and refinement statistics. These statistics are listed for all the solutions obtained, with the highest-scoring solutions first. Here is part of the summary for this rh-dehalogenase MIR dataset:

 

-----------CURRENT SOLUTIONS FOR RUN 1 : -------------------

 *** FILES ARE IN THE DIRECTORY: AutoSol_run_1_ ****

Solution # 107  BAYES-CC: 68.4 +/- 12.4 Dataset #0   FOM: 0.64 ----------------

Solution 107 based on MIR phasing starting from solutions 4 (dataset #2)  and 48 (dataset #1)
This solution is a composite of solutions:  4 48 (Already used for 
Phasing at resol of 2.8)      Refined Sites: 4
NCS information  in: AutoSol_107.ncs_spec
Experimental phases in: solve_107.mtz
Experimental phases plus FreeR_flags for refinement in: 
exptl_fobs_phases_freeR_flags_107.mtz
Density-modified phases in: resolve_107.mtz
HA sites (PDB format) in: ha_107.pdb_formatted.pdb
Sequence file in: sequence.dat
Model in: refine_2.pdb
  Residues built: 162
  Side-chains built: 52
  Chains: 18
  Overall model-map correlation: 0.42
  R/R-free: 0.53/0.54
Phasing logfile in: solve_107.prt
Density modification logfile in: resolve_107.log (R=0.25)
Build logfile in: Build_2.log

 Score type:     SKEW    CORR_RMS
Raw scores:     0.39      0.94
BAYES-CC:      66.76     63.84

Refined heavy atom sites (fractional):
deriv 1
xyz       0.722      0.280      0.421
xyz       0.188      0.157      0.439
xyz       0.737      0.252      0.416
xyz       0.711      0.296      0.470
deriv 2
xyz       0.712      0.283      0.434
xyz       0.209      0.187      0.464
xyz       0.361      0.338      0.482
xyz       0.369      0.404      0.322
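Because the summary file is plain text, the headline numbers for each solution can be collected with a few lines of Python. This is a sketch that assumes the "Solution # ..." header line looks exactly like the one in the excerpt above; the function name and dictionary keys are made up for illustration.

```python
import re

# Assumed header format, taken from the excerpt above:
# "Solution # 107  BAYES-CC: 68.4 +/- 12.4 Dataset #0   FOM: 0.64 -----"
HEADER_RE = re.compile(
    r"Solution #\s*(\d+)\s+BAYES-CC:\s*([\d.]+)\s*\+/-\s*([\d.]+).*?FOM:\s*([\d.]+)"
)

def parse_summary(text):
    """Return one dict per solution header, in the order they appear."""
    solutions = []
    for line in text.splitlines():
        m = HEADER_RE.search(line)
        if m:
            solutions.append({
                "solution": int(m.group(1)),
                "bayes_cc": float(m.group(2)),
                "uncertainty": float(m.group(3)),
                "fom": float(m.group(4)),
            })
    return solutions

excerpt = """
-----------CURRENT SOLUTIONS FOR RUN 1 : -------------------
Solution # 107  BAYES-CC: 68.4 +/- 12.4 Dataset #0   FOM: 0.64 ----------------
Solution 107 based on MIR phasing starting from solutions 4 (dataset #2)  and 48 (dataset #1)
"""

parsed = parse_summary(excerpt)
print(parsed)
```

To use it on a real run you would read the file yourself, for example with open("AutoSol_run_1_/AutoSol_summary.dat").read(), adjusting the path to your own output directory.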

How do I know if I have a good solution?

Here are some of the things to look for to tell if you have obtained a correct solution:

  • How much of the model was built? More than 50% is good, particularly if you are using the default of helices_strands_only=True. If less than 25% of the model is built, then it may be entirely incorrect. Have a look at the model. If you see clear sets of parallel or antiparallel strands, or if you see helices and strands with the expected relationships, your model is very likely to be correct. If you see a lot of short fragments everywhere, your model and solution are probably incorrect. How many side-chains were fitted to density? More than 25% is ok, more than 50% is very good.
  • What is the R-factor of the model? This only applies if you are building a full model (not for helices_strands_only=True). For a solution at moderate to high resolution (2.5 A or better), an R-factor in the low 30's indicates a very good model. For lower-resolution data, a model with an R-factor in the low 40's is probably largely correct, but it is not a very good model.
  • What are the individual BAYES-CC estimates of map correlation for your top solution? For a good solution they are all around 50 or more, with 2SD uncertainties of about 10-20.
  • What is the overall "ESTIMATED MAP CC x 100" of your top solution? This should also be 50 or more for a good solution. This is an estimate of the map correlation before density modification, so if you have a lot of solvent or several NCS-related copies in the asymmetric unit, then lower values may still give you a good map.
  • What is the difference in "ESTIMATED MAP CC x 100" between the top solution and its inverse? If this is large (more than the 2SD values for each) that is a good sign.
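Some of the checks in this list are simple numerical comparisons, so they can be written down as a small function. The sketch below encodes only the thresholds quoted above (fraction built, estimated map CC of 50 or more, and a gap between the two hands larger than both 2SD values). The function name, the expected number of residues (300) and the values for the inverse hand are hypothetical and are not taken from this dataset.

```python
def assess_solution(residues_built, residues_expected,
                    bayes_cc, bayes_cc_2sd,
                    inverse_bayes_cc, inverse_2sd):
    """Apply the rule-of-thumb thresholds from the checklist (illustrative only)."""
    fraction_built = residues_built / residues_expected
    if fraction_built > 0.50:
        build = "good"
    elif fraction_built < 0.25:
        build = "may be entirely incorrect"
    else:
        build = "inspect the model"
    hand_gap = bayes_cc - inverse_bayes_cc
    return {
        "fraction_built": fraction_built,
        "build": build,
        "map_cc_ok": bayes_cc >= 50.0,
        # "Large" means bigger than the 2SD value of each hand.
        "hand_clear": hand_gap > max(bayes_cc_2sd, inverse_2sd),
    }

# 162 residues, 68.4 and 12.4 come from the summary above;
# 300, 30.0 and 15.0 are hypothetical values for illustration.
report = assess_solution(162, 300, 68.4, 12.4, 30.0, 15.0)
print(report)
```

Numbers like these are no substitute for looking at the model and the map, which remains the most reliable check.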

What to do next

Once you have run AutoSol and have obtained a good solution and model, the next thing to do is to run the AutoBuild Wizard. If you run it in the same directory where you ran AutoSol, the AutoBuild Wizard will pick up where the AutoSol Wizard left off and carry out iterative model-building, density modification and refinement to improve your model and map. See the web page Automated Model Building and Rebuilding with AutoBuild for details on how to run AutoBuild. If you do not obtain a good solution, then it's not time to give up yet. There are a number of standard things to try that may improve the structure determination. Here are a few that you should always try:

  • Have a careful look at all the output files. Work your way through the main log file (e.g., AutoSol_run_1_1.log) and all the other principal log files in order, beginning with scaling (dataset_1_scale.log), then looking at heavy-atom searching (e.g., auki_rd_1_PHX.sca_iso_1.sca_hyss.log), phasing (e.g., solve_107.log or solve_xx.log depending on which solution xx was the top solution) and density modification (e.g., resolve_xx.log). Is there anything strange or unusual in any of them that may give you a clue as to what to try next? For example, did the phasing work well (high figure of merit) yet the density modification failed? (Perhaps the hand is incorrect.) Was the solvent content estimated correctly? (You can specify it yourself if you want.) What does the xtriage output say? Is there twinning or strong translational symmetry? Are there problems with reflections near ice rings? Are there many outlier reflections?
  • Try a different resolution cutoff, for example 0.5 A lower resolution than you tried before. Often the highest-resolution shells have little useful information for structure solution (though the data may be useful in refinement and density modification).
  • Try a different rejection criterion for outliers. The default is ratio_out=3.0 (toss reflections with delta F more than 3 times the rms delta F of all reflections in the shell). Try instead ratio_out=5.0 to keep almost everything.
  • If the heavy-atom substructure search did not yield plausible solutions, try searching with HYSS using the command-line interface, and vary the resolution and number of sites you look for. Can you find a solution that has a higher CC than the one found in AutoSol? If so, you can read your solution in to AutoSol with sites_file=my_sites.pdb.
  • Was an anisotropy correction applied in AutoSol? If there is some anisotropy but no correction was applied, you can force AutoSol to apply the correction with correct_aniso=True.
  • Try including more phased solutions from each derivative with the keyword min_phased_each_deriv=8 instead of the default 1.
  • Try related space groups. If you are not positive that your space group is P212121, then try other possibilities with different or no screw axes.
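When you try several of these variations it helps to keep the keyword changes in one place. The sketch below simply formats a dictionary of overrides as keyword=value strings of the kind used throughout this tutorial. The keywords and values are the ones named in the list above; the helper function is hypothetical, and how you append the result to your own AutoSol command is left to you.

```python
def format_keywords(overrides):
    """Turn a dict of keyword overrides into 'keyword=value' strings."""
    return ["%s=%s" % (key, value) for key, value in overrides.items()]

# Keywords and values taken from the suggestions in the list above.
retry = {
    "ratio_out": 5.0,              # keep almost all reflections
    "correct_aniso": True,         # force the anisotropy correction
    "min_phased_each_deriv": 8,    # phase more solutions per derivative
}

extra_args = " ".join(format_keywords(retry))
print(extra_args)
```

Changing one keyword at a time makes it easier to see which change, if any, improved the solution.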

Additional information

For details about the AutoSol Wizard, see Automated structure solution with AutoSol. For help on running Wizards, see Running a Wizard from a GUI, the command-line, or a script.