| Python-based Hierarchical ENvironment for Integrated Xtallography |
Tutorial 3: Solving a structure with MIR data
Introduction

This tutorial will use some very good MIR data (Native and 5 derivatives from a rh-dehalogenase protein MIR dataset analyzed at 2.8 A) as an example of how to solve a MIR dataset with AutoSol. It is designed to be read all the way through, giving pointers for you along the way. Once you have read it all and run the example data and looked at the output files, you will be in a good position to run your own data through AutoSol.

Setting up to run PHENIX

If PHENIX is already installed and your environment is all set, then if you type:

echo $PHENIX

then you should get back something like this:

/xtal//phenix-1.3

If instead you get:

PHENIX: undefined variable

then you need to set up your PHENIX environment. See the PHENIX installation page for details of how to do this. If you are using the C-shell environment (csh) then all you will need to do is add one line to your .cshrc (or equivalent) file that looks like this:

source /xtal/phenix-1.3/phenix_env

(except that the path in this statement will be where your PHENIX is installed). Then the next time you log in $PHENIX will be defined.

Running the demo rh-dehalogenase data with AutoSol

To run AutoSol on the demo rh-dehalogenase data, make yourself a tutorials directory and cd into that directory:

mkdir tutorials
cd tutorials

Now type the phenix command:

phenix.run_example --help

to list the available examples. Choosing rh-dehalogenase-mir for this tutorial, you can now use the phenix command:

phenix.run_example rh-dehalogenase-mir

to solve the rh-dehalogenase structure with AutoSol. This command will copy the directory $PHENIX/examples/rh-dehalogenase-mir to your current directory (tutorials) and call it tutorials/rh-dehalogenase-mir/ . Then it will run AutoSol using the command file run.csh that is present in this tutorials/rh-dehalogenase-mir/ directory.

Running an MIR dataset is a little different than running a MAD or SAD or SIR dataset because you cannot use the standard command-line control for MIR.
Instead you have to run a script. It is not hard, just different. (You can do all of those other things from a script too; it's just even easier to do them from the command line.) This command file run.csh is simple. It says:

#!/bin/csh
echo "Running AutoSol on rhodococcus dehalogenase data..."
echo "NOTE: command-line not available for MIR..using script instead"
phenix.runWizard AutoSol Facts.list

The first line (#!/bin/csh) tells the system to interpret the remainder of the text in the file using the C-shell (csh). The last line runs the AutoSol Wizard with phenix.runWizard and uses the contents of the file Facts.list as parameters (see Automated Structure Solution using AutoSol for all the details about AutoSol, including a full list of keywords). Now let’s look at the Facts.list file. Here is the first relevant part of the file:
sequence_file sequence.dat
thoroughness thorough
cell 93.796 79.849 43.108 90.000 90.000 90.00 # cell params
resolution 2.8 # Resolution
expt_type sir # MIR dataset is set of SIR datasets
input_file_list rt_rd_1.sca auki_rd_1.sca # list of input .sca files
# Native deriv 1
nat_der_list Native Au # identify files in input_file_list
# as Native or the heavy-atom name
# such as se.
inano_list noinano inano # inano/noinano/anoonly: identify
# if ano diffs to be used for derivs
n_ha_list 0 5 # number of heavy-atoms for each
# file for mir/sir (0 for native)
This part of the script tells AutoSol about the resolution, the data files
for the first native-derivative combination, and the heavy atoms for these
files (Native and Au), and whether anomalous differences are to be included
for each (noinano for Native means do not include them; inano for the
Au derivative means do include them for this derivative), and the number of
heavy-atoms in each file (0 for the Native, 5 for the derivative).
Note that this first native-derivative combination in this MIR dataset
is being treated as an SIRAS dataset. This is the way the AutoSol Wizard
works for MIR. The individual derivatives are all solved separately (except using
difference Fouriers to phase one derivative using a solution from another). Then
when all are finished all the SIR or SIRAS datasets are phased all together
with SOLVE Bayesian correlated phasing. This approach works well because
a substructure determination is done separately for each
derivative, and if any one of them works well, then all the derivatives
can be solved.
This part of the script also tells AutoSol to use defaults for a
thorough analysis. Usually for MIR this is the best idea, while for SAD
and MAD experiments a quick analysis is fine.
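To make the layout of the Facts.list lines above concrete, here is a minimal Python sketch that reads "keyword value ... # comment" lines and checks that the four per-file lists have the same length. It is only an illustration: the function names parse_facts and check_lists are hypothetical, and this is not the parser that the AutoSol Wizard itself uses.

```python
def parse_facts(text):
    """Return a list of (keyword, values) pairs from Facts.list-style text.

    Everything after '#' on a line is treated as a comment; blank and
    comment-only lines are skipped.
    """
    facts = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        keyword, *values = line.split()
        facts.append((keyword, values))
    return facts


def check_lists(facts):
    """True if the per-file lists all have one entry per input file."""
    per_file = ["input_file_list", "nat_der_list", "inano_list", "n_ha_list"]
    lengths = {len(facts[name]) for name in per_file}
    return len(lengths) == 1


# The first part of the Facts.list file shown in this tutorial.
example = """\
sequence_file sequence.dat
thoroughness thorough
cell 93.796 79.849 43.108 90.000 90.000 90.00 # cell params
resolution 2.8 # Resolution
expt_type sir # MIR dataset is set of SIR datasets
input_file_list rt_rd_1.sca auki_rd_1.sca # list of input .sca files
nat_der_list Native Au # identify files in input_file_list
inano_list noinano inano # inano/noinano/anoonly
n_ha_list 0 5 # number of heavy-atoms for each
"""

facts = dict(parse_facts(example))
print(facts["nat_der_list"])
print(check_lists(facts))
```

A list of pairs is returned rather than a dictionary because keywords such as run_list and input_file_list are repeated later in a MIR script; converting to a dictionary is only safe for a single dataset block like this one.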
The MIR script then continues with data for the second, third... derivatives.
These parts of the script all look like this:
############## NEW DATASET ################
run_list start # run "start" method.
# read in datafiles for this dataset
run_list read_another_dataset # starting a new dataset here
input_file_list rt_rd_1.sca hgki_rd_1.sca # list of input .sca files
# Native deriv 1
nat_der_list Native Hg # identify files in input_file_list
# as Native or the heavy-atom name
# such as se.
inano_list noinano inano # inano/noinano/anoonly: identify
# if ano diffs to be used for derivs
n_ha_list 0 5 # number of heavy-atoms for each
# file for mir/sir (0 for native)
Here the run_list start line is a command to AutoSol. It means
"run the following list of AutoSol methods: start " . So the
AutoSol Wizard runs the "start" method and stops. This basically reads
in the datafiles from the previous dataset. The next line says to read another
dataset. Now we are ready to provide the data for the second native-derivative
combination, again as an SIR dataset. We provide the same native as before
(although we don't have to) and a new derivative, this time an Hg derivative,
again with anomalous data.
This procedure is repeated for each derivative. The AutoSol Wizard will
then scale all the datasets and find heavy-atom solutions for some of them
by direct methods, then use difference Fouriers to find the solutions for the
others.
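Because every derivative repeats the same four keywords, the per-dataset part of a Facts.list can be generated rather than typed. The sketch below is a hypothetical helper (dataset_lines and mir_dataset_section are not part of PHENIX) that writes only the keywords shown in the excerpts above; the header lines such as sequence_file, cell and resolution are assumed to be added separately.

```python
def dataset_lines(native_file, derivative_file, heavy_atom, n_sites,
                  use_anomalous=True):
    """Facts.list lines for one native-derivative (SIR or SIRAS) pair."""
    inano = "inano" if use_anomalous else "noinano"
    return [
        f"input_file_list {native_file} {derivative_file}",
        f"nat_der_list Native {heavy_atom}",
        f"inano_list noinano {inano}",
        f"n_ha_list 0 {n_sites}",
    ]


def mir_dataset_section(derivatives, native_file):
    """Join the blocks for all derivatives.

    Every block after the first is preceded by 'run_list start' and
    'run_list read_another_dataset', as in the tutorial script.
    """
    lines = []
    for index, (derivative_file, heavy_atom, n_sites) in enumerate(derivatives):
        if index > 0:
            lines += ["run_list start", "run_list read_another_dataset"]
        lines += dataset_lines(native_file, derivative_file, heavy_atom, n_sites)
    return "\n".join(lines)


# Only the two derivatives named in this tutorial are used here.
section = mir_dataset_section(
    [("auki_rd_1.sca", "Au", 5), ("hgki_rd_1.sca", "Hg", 5)],
    native_file="rt_rd_1.sca",
)
print(section)
```

The generated text omits the explanatory comments of the real file, which the Wizard does not need.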
Although the phenix.run_example rh-dehalogenase-mir command has just run
AutoSol from a script (run.csh), you can run AutoSol yourself from
this script with the same phenix.runWizard AutoSol Facts.list command.
You can also run AutoSol from a GUI. All these possibilities are described in
Running a Wizard from a GUI, the command-line, or a script.
Where are my files?

Once you have started AutoSol or another Wizard, an output directory will be created in your current (working) directory. The first time you run AutoSol in this directory, this output directory will be called AutoSol_run_1_ (or AutoSol_run_1_/, where the slash at the end just indicates that this is a directory). All of the output from run 1 of AutoSol will be in this directory. If you run AutoSol again, a new subdirectory called AutoSol_run_2_ will be created.

Inside the directory AutoSol_run_1_ there will be one or more temporary directories such as TEMP0 created while the Wizard is running. The files in this temporary directory may be useful sometimes in figuring out what the Wizard is doing (or not doing!). By default these directories are emptied when the Wizard finishes (but you can keep their contents with the command clean_up=False if you want.)

What parameters did I use?

When the AutoSol wizard runs from a script it does not write out a parameters file. The parameters from your Facts.list are echoed in the AutoSol log file, but otherwise the Facts.list is your record of what the parameters used were.

Reading the log files for your AutoSol run file

While the AutoSol wizard is running, there are several places you can look to see what is going on. The most important one is the overall log file for the AutoSol run. This log file is located in:

AutoSol_run_1_/AutoSol_run_1_1.log

for run 1 of AutoSol. (The second 1 in this log file name will be incremented if you stop this run in the middle and restart it with a command like phenix.autosol run=1). The AutoSol_run_1_1.log file is a running summary of what the AutoSol Wizard is doing. Here are a few of the key sections of the log files produced for the rh-dehalogenase MIR dataset.
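The naming scheme described above (AutoSol_run_N_ directories, each holding AutoSol_run_N_M.log files) can be used to find the most recent log file automatically. The following Python sketch assumes only that naming pattern; latest_run_log is a hypothetical helper, and the demonstration builds an empty directory tree in a temporary location instead of using real AutoSol output.

```python
import re
import tempfile
from pathlib import Path

RUN_DIR = re.compile(r"^AutoSol_run_(\d+)_$")


def latest_run_log(workdir="."):
    """Return the newest log of the highest-numbered AutoSol run, or None."""
    runs = []
    for entry in Path(workdir).iterdir():
        match = RUN_DIR.match(entry.name)
        if entry.is_dir() and match:
            runs.append((int(match.group(1)), entry))
    if not runs:
        return None
    run_number, run_dir = max(runs, key=lambda item: item[0])

    # The second number in the log name increases when a run is restarted.
    log_name = re.compile(rf"^AutoSol_run_{run_number}_(\d+)\.log$")
    logs = []
    for entry in run_dir.iterdir():
        match = log_name.match(entry.name)
        if match:
            logs.append((int(match.group(1)), entry))
    return max(logs, key=lambda item: item[0])[1] if logs else None


# Demonstration on a made-up tree: run 1 (one log) and run 2 (restarted once).
with tempfile.TemporaryDirectory() as tmp:
    for run, n_logs in ((1, 1), (2, 2)):
        run_dir = Path(tmp) / f"AutoSol_run_{run}_"
        run_dir.mkdir()
        for restart in range(1, n_logs + 1):
            (run_dir / f"AutoSol_run_{run}_{restart}.log").touch()
    demo_log_name = latest_run_log(tmp).name

print(demo_log_name)  # AutoSol_run_2_2.log
```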
Summary of the command-line arguments

Near the top of the log file you will find:

READING FACTS FROM Facts.list
NEW FACT from Facts.list : cell [93.796000000000006, 79.849000000000004, 43.107999999999997, 90.0, 90.0, 90.0]
NEW FACT from Facts.list :resolution 2.8
NEW FACT from Facts.list :expt_type sir
NEW FACT from Facts.list :input_file_list ['rt_rd_1.sca', 'auki_rd_1.sca']
NEW FACT from Facts.list :nat_der_list ['Native', 'Au']
NEW FACT from Facts.list :inano_list ['noinano', 'inano']
NEW FACT from Facts.list :n_ha_list [0, 5]
NEW FACT from Facts.list :run_list ['start']

This is just a repeat of the parameters in your Facts.list script. The last fact is the "run_list start" command, which tells the AutoSol Wizard to read in the data (recall that we put in this command after each native-derivative combination so the Wizard could read it in as an SIR dataset).

Reading the datafiles.

The AutoSol Wizard will read in your datafiles and check their contents, printing out a summary for each one. This is done one dataset at a time (each native-derivative pair) until all have been read in. Here is the summary for the first derivative:

HKLIN ENTRY: rt_rd_1.sca
FILE TYPE scalepack_no_merge_original_index
GUESS FILE TYPE MERGE TYPE sca unmerged
LABELS['I', 'SIGI']
CONTENTS: ['rt_rd_1.sca', 'sca', 'unmerged', 'P 21 21 2', None, None, ['I', 'SIGI']]
Not checking SG as cell or sg not yet defined
SG from rt_rd_1.sca is: P 21 21 2
HKLIN ENTRY: auki_rd_1.sca
FILE TYPE scalepack_no_merge_original_index
GUESS FILE TYPE MERGE TYPE sca unmerged
LABELS['I', 'SIGI']
CONTENTS: ['auki_rd_1.sca', 'sca', 'unmerged', 'P 21 21 21', None, None, ['I', 'SIGI']]
Converting the files ['rt_rd_1.sca', 'auki_rd_1.sca'] to sca format before proceeding

ImportRawData.

The input data files rt_rd_1.sca and auki_rd_1.sca are in unmerged Scalepack format. The AutoSol wizard converts everything to premerged Scalepack format before proceeding.
Here is where the AutoSol Wizard identifies the format and then calls the ImportRawData Wizard:

Running import directly...
WIZARD: ImportRawData

followed eventually by...

List of output files :
File 1: rt_rd_1_PHX.sca
File 2: auki_rd_1_PHX.sca

These output files are in premerged Scalepack format. After completing the ImportRawData step, the AutoSol Wizard goes back to the beginning, but uses the newly-converted files rt_rd_1_PHX.sca and auki_rd_1_PHX.sca:

HKLIN ENTRY: AutoSol_run_1_/rt_rd_1_PHX.sca
FILE TYPE scalepack_merge
GUESS FILE TYPE MERGE TYPE sca premerged
LABELS['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']
Unit cell: (93.796, 79.849, 43.108, 90, 90, 90)
Space group: P 21 21 2 (No. 18)
CONTENTS: ['AutoSol_run_1_/rt_rd_1_PHX.sca', 'sca', 'premerged', 'P 21 21 2', [93.796000000000006, 79.849000000000004, 43.107999999999997, 90.0, 90.0, 90.0], 2.4307589843043771, ['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']]
HKLIN ENTRY: AutoSol_run_1_/auki_rd_1_PHX.sca
FILE TYPE scalepack_merge
GUESS FILE TYPE MERGE TYPE sca premerged
LABELS['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']
Unit cell: (93.796, 79.849, 43.108, 90, 90, 90)
Space group: P 21 21 2 (No. 18)
CONTENTS: ['AutoSol_run_1_/auki_rd_1_PHX.sca', 'sca', 'premerged', 'P 21 21 2', [93.796000000000006, 79.849000000000004, 43.107999999999997, 90.0, 90.0, 90.0], 2.430806639777233, ['IPLUS', 'SIGIPLUS', 'IMINU', 'SIGIMINU']]
Total of 2 input data files ['AutoSol_run_1_/rt_rd_1_PHX.sca', 'AutoSol_run_1_/auki_rd_1_PHX.sca']

Guessing cell contents

The AutoSol Wizard uses the sequence information in your sequence file (sequence.dat) and the cell parameters and space group to guess the number of NCS copies and the solvent fraction.
AutoSol_guess_setup_for_scaling AutoSol Run 1 Fri Mar 7 01:24:08 2008
Solvent fraction and resolution and ha types/scatt fact
Guessing setup for scaling dataset 1
SG P 21 21 2
cell [93.796000000000006, 79.849000000000004, 43.107999999999997, 90.0, 90.0, 90.0]
Number of residues in unique chains in seq file: 294
Unit cell: (93.796, 79.849, 43.108, 90, 90, 90)
Space group: P 21 21 2 (No. 18)
CELL VOLUME :322858.090387
N_EQUIV:4
GUESS OF NCS COPIES: 1
SOLVENT FRACTION ESTIMATE: 0.51
Total residues:294
Total Met:6
resolution estimate: 2.8

Running phenix.xtriage

The AutoSol Wizard automatically runs phenix.xtriage on each of your input datafiles to analyze them for twinning, outliers, translational symmetry, and other special conditions that you should be aware of. You can read more about xtriage in Data quality assessment with phenix.xtriage. Part of the summary output from xtriage for this dataset looks like this:

No (pseudo)merohedral twin laws were found.
Patterson analyses
- Largest peak height : 6.680
(corresponding p value : 0.56306)
The largest off-origin peak in the Patterson function is 6.68% of the height of the origin peak. No significant pseudotranslation is detected.
The results of the L-test indicate that the intensity statistics behave as expected. No twinning is suspected.

In this space group (P 21 21 2) with the cell dimensions in this structure, there are no ways to create a twinned crystal, so you do not have to worry about twinning. There is also no large off-origin peak in the native Patterson, so there does not appear to be any translational pseudo-symmetry.

Testing for anisotropy in the data

After all the SIR datasets are read in, the AutoSol Wizard tests for anisotropy by determining the range of effective anisotropic B values along the principal lattice directions.
If this range is large and the ratio of the largest to the smallest value is also large then the data are by default corrected to make the anisotropy small (see Analyzing and scaling the data in the AutoSol web page for more discussion of the anisotropy correction). In the rh-dehalogenase case, the range of anisotropic B values is small and no correction is made:

Range of aniso B: 13.06 19.68
Not using aniso-corrected data files as the range of aniso b is only 6.62 and 'correct_aniso' is not set

Note that if any one of the datafiles in a MIR dataset has a high anisotropy, then by default all of them will be corrected for anisotropy.

Scaling MIR data

The AutoSol Wizard uses SOLVE localscaling to scale MIR data. The procedure is basically to scale all the data to the native. During this process outliers that deviate from the reference values by more than ratio_out (default=3) standard deviations (using all data in the appropriate resolution shell to estimate the SD) are rejected.

Running HYSS to find the heavy-atom substructure

The HYSS (hybrid substructure search) procedure for heavy-atom searching uses a combination of a Patterson search for 2-site solutions with direct methods recycling. The search ends when the same solution is found beginning with several different starting points. The HYSS log files are named after the datafile that they are based on and the type of differences (ano, iso) that are being used. In this rh-dehalogenase MIR dataset, the HYSS logfile for the HgKI derivative is hgki_rd_1_PHX.sca_iso_2.sca_hyss.log.
The key part of this HYSS log file is:

Entering search loop:
p = peaklist index in Patterson map
f = peaklist index in two-site translation function
cc = correlation coefficient after extrapolation scan
r = number of dual-space recycling cycles
cc = final correlation coefficient
=0.190 r=015 cc=0.250 [ best cc: 0.250 ]
p=000 f=001 cc=0.191 r=015 cc=0.242 [ best cc: 0.250 0.242 ]
Number of matching sites of top 2 structures: 3
p=000 f=002 cc=0.174 r=015 cc=0.200 [ best cc: 0.250 0.242 ]
p=001 f=000 cc=0.167 r=015 cc=0.230 [ best cc: 0.250 0.242 0.230 ]
Number of matching sites of top 2 structures: 3
Number of matching sites of top 3 structures: 2
...
p=011 f=002 cc=0.165 r=015 cc=0.229 [ best cc: 0.293 0.279 0.277 0.276 ]
p=012 f=000 cc=0.184 r=015 cc=0.250 [ best cc: 0.293 0.279 0.277 0.276 ]
p=012 f=001 cc=0.148 r=015 cc=0.292 [ best cc: 0.293 0.292 0.279 0.277 ]
Number of matching sites of top 2 structures: 7
Number of matching sites of top 3 structures: 7
Number of matching sites of top 4 structures: 6

Here a correlation coefficient of 0.5 is very good (0.1 is hopeless, 0.2 is possible, 0.3 is good). In the excerpt above, 3 sites matched between the top two structures after the first two tries, and the best correlation reached was 0.293. The program continues until 4 structures all have 6 matching sites, then ends and prints out the final correlations, after taking the top 5 sites.

Finding the hand and scoring heavy-atom solutions

Normally either hand of the heavy-atom substructure is a possible solution, and both must be tested by calculating phases and examining the electron density map and by carrying out density modification, as they will give the same statistics for all heavy-atom analysis and phasing steps. Note that in chiral space groups (those that have a handedness, such as P61), both hands of the space group must be tested. The AutoSol Wizard will do this for you, inverting the hand of the heavy-atom substructure and the space group at the same time.
For example, in space group P61 the hand of the substructure is inverted and then it is placed in space group P65.

The AutoSol Wizard scores heavy-atom solutions based on two criteria by default. The first criterion is the skew of the electron density in the map (SKEW). Good values for the skew are anything greater than 0.1. In a MIR structure determination, the heavy-atom solution with the correct hand may have a more positive skew than the one with the inverse hand. The second criterion is the correlation of local RMS density (CORR_RMS). This is a measure of how contiguous the solvent and non-solvent regions are in the map. (If the local rms is low at one point and also low at neighboring points, then the solvent region must be relatively contiguous, and not split up into small regions.)

For MIR datasets, SOLVE is used for calculating phases. For a MIR dataset, a figure of merit of 0.5 is acceptable, 0.6 is fine and anything above 0.7 is very good. The scores are listed in the AutoSol log file. Here is the scoring for solution 4 (the best initial map):

AutoSol_run_1_/TEMP0/resolve.scores SKEW 0.2797302
AutoSol_run_1_/TEMP0/resolve.scores CORR_RMS 0.9306123
CC-EST (BAYES-CC) SKEW : 57.8 +/- 17.0
CC-EST (BAYES-CC) CORR_RMS : 63.3 +/- 28.2
ESTIMATED MAP CC x 100: 60.8 +/- 13.3

This is a good solution, with a high (and positive) skew (0.28) and a high correlation of local rms density (0.93). The ESTIMATED MAP CC x 100 is an estimate of the quality of the experimental electron density map (not the density-modified one). A set of real structures was used to calibrate the range of values of each score that were obtained for phases with varying quality. The resulting probability distributions are used above to estimate the correlation between the experimental map and an ideal map for this structure. Then all the estimates are combined to yield an overall Bayesian estimate of the map quality. These are reported as CC x 100 +/- 2SD.
These estimated map CC values are usually fairly close, so if the estimate is 60.8 +/- 13.3 then you can be confident that your structure is solved and that the density-modified map will be quite good.

In this case the datasets used to find heavy-atom substructures were the isomorphous differences for each derivative. For each dataset one solution was found, and that solution and its inverse were scored. The scores were (skipping extra text below):

SCORING SOLUTION 1: Solution 1 using HYSS on /net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/ AutoSol_run_1_/auki_rd_1_PHX.sca_iso_1.sca. Dataset #1, with 5 sites
ESTIMATED MAP CC x 100: 29.6 +/- 34.8
SCORING SOLUTION 2: Solution 2 using HYSS on /net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/ AutoSol_run_1_/auki_rd_1_PHX.sca_iso_1.sca and taking inverse. Dataset #1, with 5 sites
ESTIMATED MAP CC x 100: 41.7 +/- 26.5
SCORING SOLUTION 3: Solution 3 using HYSS on /net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/ AutoSol_run_1_/hgki_rd_1_PHX.sca_iso_2.sca. Dataset #2, with 5 sites
ESTIMATED MAP CC x 100: 40.4 +/- 30.2
SCORING SOLUTION 4: Solution 4 using HYSS on /net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/ AutoSol_run_1_/hgki_rd_1_PHX.sca_iso_2.sca and taking inverse. Dataset #2, with 5 sites
ESTIMATED MAP CC x 100: 60.8 +/- 13.3

In this case the best score was solution 4 (as shown above), based on the HgKI derivative and taking the inverse of the heavy-atom sites, with an ESTIMATED MAP CC x 100 of 60.8 +/- 13.3. The score from the opposite hand (solution 3) was just 40.4 +/- 30.2 and so the hand was clear.

Finding origin shifts between heavy-atom solutions for different derivatives and combining phases

Depending on the space group, there may be a few or infinitely many totally equivalent heavy-atom substructures for a particular native-derivative pair. These are related to each other by translations that can be thought of as offsets of the origins for the two substructures.
The AutoSol Wizard identifies the allowed offsets for the space group. Then it aligns the solutions from different derivatives by finding the origin offset that maximizes the correlation of electron density in the native Fouriers for the two. Then it combines the phases from the two using addition of Hendrickson-Lattman coefficients. These combined phases are then used to score the phasing obtained by combining the two derivatives. The best combinations are iteratively combined until all available derivatives are considered and combined in an optimal fashion. Once an optimal set of derivatives and sites is found, SOLVE Bayesian correlated phasing is used to calculate a final set of native phases from the native and all the derivatives at once. Here is the best pair of derivatives from this first cycle:

Getting origin shift for 2 mapped on to 4
Phases from solution 4:solve_4.mtz
Phases from solution 2:solve_2.mtz
Merged ha files in ha_4_2.pdb
Merged files in merged_4_2.mtz
FOM solution 4: 0.56 FOM solution 2: 0.44
Correlation of maps: 0.249 Ideal map correlation: 0.2464
RESULT: FOM solution 4: 0.56 FOM solution 2: 0.44
Correlation of maps: 0.249 Ideal map correlation: 0.2464
Origin offset of solution 2: [-0.5, 0.0, 0.0]

Here solutions 2 and 4 have a map correlation of 0.25, just about the same as expected based on the FOM of the two solutions (0.56 and 0.44) and assuming random errors. The two solutions differ by an origin shift of 0.5 along x. The two solutions are then phased as a group to use as the basis for density modification:

Merging a set of solutions and phasing the group with SOLVE
...
PHASED SOLUTION: Solution 9 based on MIR phasing starting from solutions 4 (dataset #2) and 2 (dataset #1)

However, in this case after phasing with the two derivatives together, the score is not improved over the HgKI derivative by itself:

AutoSol_run_1_/TEMP0/resolve.scores SKEW 0.1357029
AutoSol_run_1_/TEMP0/resolve.scores CORR_RMS 0.9086851
CC-EST (BAYES-CC) SKEW : 37.3 +/- 31.3
CC-EST (BAYES-CC) CORR_RMS : 60.8 +/- 31.5
ESTIMATED MAP CC x 100: 49.1 +/- 20.2

Though worse than the HgKI solution by itself, this is a reasonably good solution, with a moderate positive skew (0.14) and a good correlation of local rms density (0.91). As the original HgKI solution was the best, it is used for density modification and finding additional sites:

SOLUTION USED TO START DEN MOD: Solution 4 using HYSS on /net/moonbird/scratch1/terwill/run_072908a/rh-dehalogenase-mir/ AutoSol_run_1_/hgki_rd_1_PHX.sca_iso_2.sca and taking inverse. Dataset #2
HKLIN: solve_4.mtz
Testing density modification with mask_type = wang
RFACTOR: 0.2475
Best mask type so far is wang
Testing density modification with mask_type = histograms
RFACTOR: 0.2553

Note that two types of masks (wang and histograms) are tested in the density modification procedure. This is because sometimes one method for identification of the solvent region is better than the other. The wang method chooses the solvent region as those points surrounded by regions of low variation. The histograms method chooses points instead based on the similarity of the histograms of density nearby to those of idealized solvent and protein regions. The R-factor for density modification is used to choose which is working best in this case (the wang method).
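The "Ideal map correlation" of 0.2464 reported earlier in this section is numerically equal to the product of the two figures of merit (0.56 and 0.44). The short sketch below reproduces that arithmetic. It rests on an assumption that is not stated in the log: that each FOM approximates the correlation of that map with the true map and that the errors in the two maps are independent, so that the expected correlation between the maps is the product. The function name expected_map_cc is hypothetical.

```python
def expected_map_cc(fom_a, fom_b):
    """Expected correlation between two maps with independent errors.

    Assumes each figure of merit approximates the correlation of its map
    with the true map, so the two-map correlation is the product.
    """
    return fom_a * fom_b


ideal = expected_map_cc(0.56, 0.44)   # FOMs of solutions 4 and 2
observed = 0.249                      # "Correlation of maps" from the log

print(round(ideal, 4))                # 0.2464
print(round(observed - ideal, 4))     # 0.0026
```

An observed correlation close to this product suggests that the two derivatives carry largely independent phase information after the origin shift has been applied; a much lower value would hint that the solutions are not on the same origin or hand.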
Finding additional sites by density modification and heavy-atom difference Fouriers

When AutoSol is used with the keyword thoroughness=thorough as in this example, additional heavy-atom sites are found by phasing using the current model, carrying out density modification to improve the phases, and using the improved phases along with isomorphous differences and the phase difference between the heavy atoms and the non-heavy atoms to calculate Fourier maps showing the positions of the heavy atoms. The top peaks in these maps are used as trial heavy-atom sites (if they are not already part of the heavy-atom model). In this example solution 4 from derivative 2 is used for this phasing/density modification/Fourier procedure. Sites are found for all the derivatives and new solutions are created and scored using the top sites for each derivative. The combinations are then tested as above, and the highest-scoring ones are kept again. The best solution found is #107:

PHASED SOLUTION: Solution 107 based on MIR phasing starting from solutions 4 (dataset #2) and 48 (dataset #1)
...
AutoSol_run_1_/TEMP0/resolve.scores SKEW 0.3863446
AutoSol_run_1_/TEMP0/resolve.scores CORR_RMS 0.9383891
CC-EST (BAYES-CC) SKEW : 66.8 +/- 15.3
CC-EST (BAYES-CC) CORR_RMS : 63.8 +/- 27.5
ESTIMATED MAP CC x 100: 68.4 +/- 12.4

This is quite a good solution, with high skew (0.39) and correlation of local rms density (0.94). This solution is the best overall and is used for final phasing and density modification. Notice that it only contains two of the five derivatives. The merging procedure identifies which combinations of derivatives give the best phasing, and all the other derivatives are ignored.

Final phasing with SOLVE

Once the best heavy-atom solution or solutions are chosen based on Z-scores, these are used in a final round of phasing with SOLVE (for MIR phasing).
In this case several nearly-equally-good solutions are available, and all are used in phasing, density modification and initial model-building, with the R-factor in density modification and the model-map correlation in model-building being used to identify the best solutions. The log file from phasing for solution 107 is in solve_107.prt. The heavy-atom model is refined and phases are calculated with Bayesian correlated MIR phasing. An important part of this phasing method is a statistical method of taking into account the correlation of non-isomorphism among derivatives. The extent of this correlation is listed in the solve_107.prt summary file:
SUMMARY OF CORRELATED ERRORS AMONG DERIVATIVES
DERIVATIVE: 1
CENTRIC REFLECTIONS:
DMIN: ALL 10.22 6.41 4.99 4.23 3.73 3.37 3.10 2.89
RMS errors correlated and uncorrelated with others in group:
Correlated: 51.5 69.4 50.8 42.4 52.7 57.8 55.0 40.9 30.5
Uncorrelated: 31.8 26.0 31.1 41.7 22.7 29.1 12.4 38.1 43.0
Correlation of errors with other derivs:
DERIV 2: 0.58 0.74 0.61 0.44 0.54 0.65 0.63 0.51 0.37
Here the centric reflections in derivative 1 have non-isomorphism
errors related to those in derivative 2, with a correlation coefficient
overall of 0.58. Another way to look at this is that the RMS correlated
error is 51.5 and the RMS uncorrelated (random) error is just 31.8. That
means that a large part of the error is correlated and should be treated
as such.
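One way to put a number on "a large part" is to ask what fraction of the total error variance is correlated. The sketch below does this for the values in the table above, under the assumption (made here for illustration, not stated in the SOLVE output) that the correlated and uncorrelated components are independent and therefore add in quadrature. The function name correlated_fraction is hypothetical.

```python
import math


def correlated_fraction(rms_correlated, rms_uncorrelated):
    """Fraction of error variance that is correlated, assuming the two
    components are independent and add in quadrature."""
    total_variance = rms_correlated ** 2 + rms_uncorrelated ** 2
    return rms_correlated ** 2 / total_variance


# "ALL" column and the eight resolution shells, copied from the table above.
correlated = [51.5, 69.4, 50.8, 42.4, 52.7, 57.8, 55.0, 40.9, 30.5]
uncorrelated = [31.8, 26.0, 31.1, 41.7, 22.7, 29.1, 12.4, 38.1, 43.0]

overall = correlated_fraction(correlated[0], uncorrelated[0])
total_rms = math.hypot(correlated[0], uncorrelated[0])
print(round(overall, 2))     # 0.72
print(round(total_rms, 1))   # 60.5

for corr, uncorr in zip(correlated[1:], uncorrelated[1:]):
    print(round(correlated_fraction(corr, uncorr), 2))
```

Under this assumption roughly 70% of the overall error variance for the centric reflections is shared with derivative 2, and the fraction falls in the highest-resolution shell, where the uncorrelated error is the larger of the two.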
The final occupancies and coordinates are listed at the end:
SITE ATOM OCCUP X Y Z B
CURRENT VALUES: 1 Hg 0.3587 0.7222 0.2799 0.4213 6.2541
CURRENT VALUES: 2 Hg 0.3862 0.1884 0.1574 0.4392 30.8420
CURRENT VALUES: 3 Hg 0.2914 0.7366 0.2517 0.4165 8.5235
CURRENT VALUES: 4 Hg 0.2864 0.7111 0.2957 0.4698 8.0632
SITE ATOM OCCUP X Y Z B
CURRENT VALUES: 1 Au 0.5620 0.7124 0.2832 0.4337 33.6309
CURRENT VALUES: 2 Au 0.3820 0.2087 0.1867 0.4641 19.5002
CURRENT VALUES: 3 Au 0.2807 0.3614 0.3383 0.4823 13.1097
CURRENT VALUES: 4 Au 0.0610 0.3693 0.4041 0.3216 1.0000
In this case the occupancies of the top sites are about 1/3 to 2/3, which
is fine for MIR (particularly with such heavy atoms as Hg and Au).
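The coordinates in these tables are fractional. Because the cell is orthorhombic (all angles 90 degrees), converting to Cartesian coordinates in A is just a multiplication by the cell edges, which makes it easy to compare sites between derivatives. The sketch below is an illustration only: the function names are hypothetical, the distance ignores symmetry-equivalent positions and allowed origin shifts, and the simple conversion is valid only for cells with 90-degree angles.

```python
import math

CELL = (93.796, 79.849, 43.108)  # a, b, c in A; all angles are 90 degrees


def to_cartesian(fractional, cell=CELL):
    """Convert fractional coordinates to A for an orthorhombic cell."""
    return tuple(f * edge for f, edge in zip(fractional, cell))


def distance(site_a, site_b, cell=CELL):
    """Distance in A between two fractional sites, with no symmetry applied."""
    xyz_a = to_cartesian(site_a, cell)
    xyz_b = to_cartesian(site_b, cell)
    return math.dist(xyz_a, xyz_b)


hg_site_1 = (0.7222, 0.2799, 0.4213)  # first Hg site from the table above
au_site_1 = (0.7124, 0.2832, 0.4337)  # first Au site from the table above

print([round(value, 2) for value in to_cartesian(hg_site_1)])
print(round(distance(hg_site_1, au_site_1), 2))
```

With these numbers the first Hg and first Au sites come out about 1.1 A apart, which would be consistent with the two compounds binding at nearly the same position on the protein.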
Statistical density modification with RESOLVE

After MIR phases are calculated with SOLVE, the AutoSol Wizard uses RESOLVE density modification to improve the quality of the electron density map. The statistical density modification in RESOLVE takes advantage of the flatness of the solvent region and the expected distribution of electron density in the region containing the macromolecule, as well as any NCS that can be found from the heavy-atom substructure. The weighted structure factors and phases (FP, PHIB) from SOLVE are used to calculate the starting map for RESOLVE, and the experimental structure factor amplitudes (FP) and MIR Hendrickson-Lattman coefficients from SOLVE are used in the density modification process. The output from RESOLVE for solution 107 can be found in resolve_107.log.

Here are key sections of this output. First, the plot of how many points in the "protein" region of the map have each possible value of electron density. The plot below is normalized so that a density of zero is the mean of the solvent region, and the standard deviation of the density in the map is 1.0. A perfect map has a lot of points with density slightly less than zero on this scale (the points between atoms) and a few points with very high density (the points near atoms), and no points with very negative density. Such a map has a very high skew (think "skewed off to the right"). This map is good, with a positive skew, though it is not perfect.
Plot of Observed (o) and model (x) electron density distributions for protein
region, where the model distribution is given by,
p_model(beta*(rho+offset)) = p_ideal(rho)
and then convoluted with a gaussian with width of sigma
where sigma, offset and beta are given below under "Error estimate."
[Plot from resolve_107.log, before density modification: p(rho) from 0.0 to 0.03 on the vertical axis against normalized rho from -2 to 3 (0 = mean of solvent region) on the horizontal axis, showing the observed (o) and model (x) distributions. The character layout of the plot was lost in this copy of the page.]
After density modification, the curve is more ideal, with a very strong
positive skew:
[Plot from resolve_107.log, after density modification: same axes as the previous plot, p(rho) from 0.0 to 0.03 against normalized rho from -2 to 3 (0 = mean of solvent region), showing the observed (o) and model (x) distributions. The character layout of the plot was lost in this copy of the page.]
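The skew used to describe these density distributions can be illustrated with a few lines of Python. The sketch below uses the standard third standardized moment as the definition of skew; this is an assumption for illustration, and RESOLVE may normalize the statistic differently. The density values are made up to mimic the two situations described in the text: many points slightly below the mean with a few very high ones, and a symmetric distribution.

```python
def skew(values):
    """Third standardized moment: <(v - mean)^3> / <(v - mean)^2>^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    second = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / second ** 1.5


# Made-up "protein-like" map: most grid points slightly negative (between
# atoms), a few points with very high density (at atoms).
protein_like = [-0.5] * 90 + [4.5] * 10

# Made-up symmetric distribution, as expected for a map that is mostly noise.
noise_like = [-1.0, 0.0, 1.0] * 30

print(round(skew(protein_like), 2))  # 2.67
print(round(skew(noise_like), 2))    # 0.0
```

The long tail to the right is what produces the positive value, which is why a larger skew is taken as a sign of a better map.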
The key statistic from this RESOLVE density modification is the R-factor
for comparison of observed structure factor amplitudes (FP) with those
calculated from the density modification procedure (FC).
In this rh-dehalogenase MIR phasing the R-factor is very low:
Overall R-factor for FC vs FP: 0.254 for 8313 reflections

An acceptable value is anything below 0.35; below 0.30 is good. The R-factors for all the solutions considered at this stage were just about the same: #105 (R=0.254), #107 (R=0.254), #103 (R=0.257) and all were used for initial model-building as well.

Generation of FreeR flags

The AutoSol Wizard will create a set of free R flags indicating which reflections are not to be used in refinement. By default 5% of reflections (up to a maximum of 2000) are reserved for this test set. If you want to supply a reflection file hires.mtz that has higher resolution than the data used to solve the structure, or has a test set already marked, then you can do this with the keyword input_refinement_file=hires.mtz. The files to be used for model-building and refinement are listed in the AutoSol log file.

Model-building with RESOLVE

The AutoSol Wizard by default uses a very quick method to build just the secondary structure of your macromolecule. This is controlled by the keyword helices_strands_only=True. The Wizard will guess from your sequence file whether the structure is protein or RNA or DNA (but you can tell it if you want with chain_type=PROTEIN). If the quick model-building does not build a satisfactory model (if the correlation of map and model is less than acceptable_secondary_structure_cc=0.35), then model-building is tried again with the standard build procedure, essentially the same as one cycle of model-building with the AutoBuild Wizard (see the web page Automated Model Building and Rebuilding with AutoBuild), except that if you specify thoroughness=quick (this example uses thoroughness thorough instead), the model-building is done less comprehensively to speed things up.
In this case the secondary-structure-only model-building using solution #107 produces an initial model with 162 residues built and side chains assigned to 52, and which has a model-map correlation of 0.42:

Model with helices and strands is in Build_2.pdb
Log for helices and strands is in Build_2.log
Model aligned with AutoSol_run_1_/TEMP0/coords.pdb is in resolve_compare.pdb
Final file: AutoSol_run_1_/TEMP0/Build_2.pdb
Log file: Build_2.log copied to Build_2.log
Model 2: Residues built=162 placed=52 Chains=18 Model-map CC=0.42
This is new best model with cc = 0.42
Getting R for model: Build_2.pdb
Model: AutoSol_run_1_/TEMP0/refine_2.pdb R/Rfree=0.53/0.54

This is quite an adequate secondary-structure-only model. It is just a preliminary model, but it is good enough to tell that the structure is solved. For full model-building you will want to go on and use the AutoBuild Wizard (see the web page Automated Model Building and Rebuilding with AutoBuild).

The AutoSol_summary.dat summary file

A quick summary of the results of your AutoSol run is in the AutoSol_summary.dat file in your output directory. This file lists the key files that were produced in your run of AutoSol (all these are in the output directory) and some of the key statistics for the run, including the scores for the heavy-atom substructure and the model-building and refinement statistics. These statistics are listed for all the solutions obtained, with the highest-scoring solutions first.
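If you want to pull a few of these statistics into a script, a minimal sketch follows. It assumes the summary contains lines shaped like `Solution # 107 BAYES-CC: 68.4 +/- 12.4` and `R/R-free: 0.53/0.54`, as in the excerpt from this tutorial. The function `parse_summary` and its regular expressions are hypothetical and are not part of PHENIX.

```python
import re

# Patterns modeled on the summary excerpt in this tutorial (an assumption).
SOLUTION_RE = re.compile(
    r"Solution\s*#\s*(\d+)\s+BAYES-CC:\s*([\d.]+)\s*\+/-\s*([\d.]+)"
)
R_RE = re.compile(r"R/R-free:\s*([\d.]+)/([\d.]+)")


def parse_summary(text):
    """Return one dict per 'Solution # N BAYES-CC: ...' entry, in file order."""
    solutions = []
    for line in text.splitlines():
        solution_match = SOLUTION_RE.search(line)
        if solution_match:
            solutions.append({
                "solution": int(solution_match.group(1)),
                "bayes_cc": float(solution_match.group(2)),
                "bayes_cc_sigma": float(solution_match.group(3)),
            })
        r_match = R_RE.search(line)
        if r_match and solutions:
            # Attach R/R-free to the most recently seen solution.
            solutions[-1]["r_work"] = float(r_match.group(1))
            solutions[-1]["r_free"] = float(r_match.group(2))
    return solutions


# A few lines copied from the tutorial's excerpt, used as sample input.
sample = """\
Solution # 107 BAYES-CC: 68.4 +/- 12.4 Dataset #0 FOM: 0.64
Model in: refine_2.pdb
R/R-free: 0.53/0.54
"""
print(parse_summary(sample))
```

To use it on a real run, pass in the contents of the AutoSol_summary.dat file from your output directory; if the layout differs from the excerpt, the patterns will need adjusting.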
Here is part of the summary for this rh-dehalogenase MIR dataset:

-----------CURRENT SOLUTIONS FOR RUN 1 : -------------------
*** FILES ARE IN THE DIRECTORY: AutoSol_run_1_ ****
Solution # 107 BAYES-CC: 68.4 +/- 12.4 Dataset #0 FOM: 0.64 ----------------
Solution 107 based on MIR phasing starting from solutions 4 (dataset #2) and 48 (dataset #1)
This solution is a composite of solutions: 4 48
(Already used for Phasing at resol of 2.8) Refined Sites: 4
NCS information in: AutoSol_107.ncs_spec
Experimental phases in: solve_107.mtz
Experimental phases plus FreeR_flags for refinement in: exptl_fobs_phases_freeR_flags_107.mtz
Density-modified phases in: resolve_107.mtz
HA sites (PDB format) in: ha_107.pdb_formatted.pdb
Sequence file in: sequence.dat
Model in: refine_2.pdb
Residues built: 162
Side-chains built: 52
Chains: 18
Overall model-map correlation: 0.42
R/R-free: 0.53/0.54
Phasing logfile in: solve_107.prt
Density modification logfile in: resolve_107.log (R=0.25)
Build logfile in: Build_2.log
Score type: SKEW CORR_RMS
Raw scores: 0.39 0.94
BAYES-CC: 66.76 63.84
Refined heavy atom sites (fractional):
deriv 1
xyz 0.722 0.280 0.421
xyz 0.188 0.157 0.439
xyz 0.737 0.252 0.416
xyz 0.711 0.296 0.470
deriv 2
xyz 0.712 0.283 0.434
xyz 0.209 0.187 0.464
xyz 0.361 0.338 0.482
xyz 0.369 0.404 0.322

How do I know if I have a good solution?

Here are some of the things to look for to tell if you have obtained a correct solution:
What to do next

Once you have run AutoSol and have obtained a good solution and model, the next thing to do is to run the AutoBuild Wizard. If you run it in the same directory where you ran AutoSol, the AutoBuild Wizard will pick up where the AutoSol Wizard left off and carry out iterative model-building, density modification and refinement to improve your model and map. See the web page Automated Model Building and Rebuilding with AutoBuild for details on how to run AutoBuild.

If you do not obtain a good solution, then it's not time to give up yet. There are a number of standard things to try that may improve the structure determination. Here are a few that you should always try:
Additional information

For details about the AutoSol Wizard, see Automated structure solution with AutoSol. For help on running Wizards, see Running a Wizard from a GUI, the command-line, or a script.