Frequently asked questions for phenix.refine

Note: questions specific to the GUI can be found in its documentation.

General

How can I make phenix.refine run faster?

There are currently two options for this:

The OpenMP parallelization is not particularly efficient, since the FFTs take less than half of the typical runtime; a speedup of 40% is usually the maximum. The process-level parallelization with 'nproc' is most useful when restraint weight optimization is enabled, since these procedures can be run as multiple separate processes. In these circumstances a speedup of 4-5x is possible. However, a run using default parameters will not significantly benefit from setting 'nproc'.
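As a rough illustration of why the OpenMP gain is limited, the sketch below applies Amdahl's law, which is a general model of parallel speedup and not something taken from the phenix.refine code. The parallel fraction of 0.4 is an assumed value chosen only to be consistent with "less than half of the typical runtime"; the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime scales with workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Assumption for illustration: 40% of the runtime (the FFTs) can be parallelized.
for n_workers in (2, 4, 8):
    print(n_workers, round(amdahl_speedup(0.4, n_workers), 2))
```

Under this assumption the speedup stays well below 2x however many threads are added, because the serial 60% of the run is unaffected.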

There are several limitations to these options:

I set the number of processors to 8 - why is phenix.refine still only using a single processor?

As noted above, the standard parallelization primarily affects weight optimization. If this is not performed, most of the program will run serially.

What type of experimental data should I use for refinement?

Either amplitudes (F) or intensities (I) may be used in refinement (in any file format), but intensities will be used preferentially. Both anomalous and non-anomalous data are supported; there does not appear to be any particular benefit to model quality using one or the other with the default strategy. However, anomalous data may be used to refine anomalous scattering factors for heavy atoms, and anomalous difference map coefficients will automatically be created in the output MTZ file. For these reasons, anomalous data are recommended, but in general the R-factors will be similar.

Should I use the MTZ file output by phenix.refine as input for the next round of refinement?

The only time this is necessary is when you refined against a dataset that did not include R-free flags, and let phenix.refine generate a new test set. In this case, you should use the file ending in "_data.mtz" for all future rounds of refinement. You do not need to update the input file in each round, as the actual raw data (and R-free flags) are not modified.

I ran AutoSol to get a partial model that I now want to refine. Which data file should I give as input: the original .sca file from HKL2000, or the file overall_best_refine_data.mtz from AutoSol?

Always use the MTZ file output by AutoSol. This contains a new set of R-free flags that have been used to refine the model; starting over with the .sca file will result in a new set of flags being generated, which biases R-free.

Why does phenix.refine not use all data in refinement?

Reflections with abnormal values tend to reduce the performance of the refinement engine. These are identified based on several criteria (see Read 1999 for details) and filtered out at the beginning of each macro-cycle. You can prevent this by setting xray_data.remove_outliers=False.

How many macro-cycles should I run?

We recommend at least five to ensure convergence, but in some cases (especially poorly refined structures) considerably more may be required for optimal results. (The default is three macro-cycles due to speed considerations.)

How should I decide what resolution limit to use for my data?

This is a contentious subject, and not settled, although it is widely agreed that throwing away usable data simply to reduce the R-factors is not an acceptable practice. Traditional criteria for resolution limits include (among others) truncating the data at the resolution where the mean I/sigmaI falls below 2, but this may exclude valuable data. See the documentation on unmerged data (especially the cited references) for more details. In general it may be useful to include additional, weaker data in refinement in the final stages, since the weighting performed by maximum-likelihood refinement prevents these from degrading the model, and may improve it in some cases. (Note that more data will also cause phenix.refine to take longer to run, which often imposes a practical limit on resolution.)
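To make the traditional criterion concrete, here is a minimal sketch of an I/sigmaI cutoff applied to per-shell statistics. The shell values are invented for illustration and the function name is hypothetical; this is not how phenix.refine or any data-processing program selects a limit.

```python
def traditional_cutoff(shells, min_i_over_sigma=2.0):
    """Return d_min of the last shell, going from low to high resolution,
    before mean I/sigmaI first drops below the threshold (None if no shell passes)."""
    cutoff = None
    for d_min, mean_i_over_sigma in shells:
        if mean_i_over_sigma < min_i_over_sigma:
            break
        cutoff = d_min
    return cutoff


# Hypothetical shells: (d_min in Angstrom, mean I/sigmaI), ordered low to high resolution.
shells = [(3.0, 18.5), (2.4, 9.1), (2.1, 4.2), (1.9, 2.3), (1.8, 1.4), (1.7, 0.8)]
print(traditional_cutoff(shells))
```

With these numbers the rule stops at 1.9 Angstrom and discards the two outer shells, which is exactly the kind of weak but possibly useful data discussed above.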

Is it okay to refine against data that have been modified by anisotropic scaling or ellipsoidal truncation?

There is no technical reason why this is impossible. However, modified data should be avoided for several reasons:

Optimization methods

When should I use simulated annealing?

Simulated annealing (SA) is most useful early in refinement, when the model is far from convergence. Manually built models, or MR solutions involving significant local conformational changes, are common inputs where SA can improve over simple gradient-driven refinement. It is generally less helpful later in refinement, and/or at high resolution.

When should I use rigid-body refinement?

This typically only needs to be performed once after molecular replacement, unless you dock in additional domains later. Continuing to use rigid-body refinement in later runs will not improve your structure, and only adds to the runtime.

What happened to the old fix_rotamers option?

As of version 1.8.3, the real-space refinement strategy now incorporates both global minimization and local fitting; the latter is similar to fix_rotamers but much faster and incorporates backbone flexibility. Note that it will not run at very high or low resolution, or when explicit hydrogen atoms are present.

When is the ordered solvent method useful?

The low-resolution limit for this method is approximately 2.8Å, depending on data quality. At atomic resolution (beyond approximately 1.2Å), it is useful early in refinement when no waters have been placed yet, but as the structure becomes more complete it may remove weaker, partial-occupancy waters, and it is unable to handle static disorder (alternate conformations).

Targets and restraints

When should I optimize the geometry and/or B-factor restraint weights?

This may be beneficial if the automatic weighting does not pick a good scale for the X-ray and restraint terms; this will often be recognizable by higher-than-expected bond and angle RMSDs. In general it rarely hurts to optimize the weights, and often results in a significantly better refinement, but it is several times slower than ordinary refinement unless you have a highly parallel system. However, we strongly recommend weight optimization in the final round of refinement, where it becomes essential to prevent overfitting.

How can I set the target weights manually?

Our usual response is: don't do this manually, use the automatic optimization instead. Although this takes significantly longer to run, in practice most users will spend an equivalent amount of time manually adjusting the weights by trial-and-error. If you are certain you need to have manual control, the parameters fix_wxc (for geometry restraints) and fix_wxu (for B-factor restraints) will set the weights.

Why doesn't phenix.refine use weight optimization even though I selected this option?

Because the weight optimization takes into account (and tries to limit) the difference between R-work and R-free, it will be automatically disabled if these statistics are already very close (or if R-free is lower than R-work), as often happens at the start of refinement from an MR solution.

My resolution is X Angstroms; what should RMS(bonds) and RMS(angles) be?

This is somewhat controversial, but absolute upper limits for a well-refined protein structure at high resolution are typically 0.02 for RMS(bonds) and 2.0 for RMS(angles); usually they will be significantly lower. As resolution decreases the acceptable deviation from geometry restraints also decreases, so at 3.5 Angstrom, more appropriate values would be 0.01 and 1.0. We recommend using the POLYGON tool in the validation summary to judge your structure relative to others at similar resolutions.
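For readers unsure what these numbers measure, the following sketch computes an RMS deviation, assuming it is the plain root-mean-square of (model value minus ideal value) over all restraints of one type. The bond lengths are invented and the function name is hypothetical; phenix.refine reports these values itself.

```python
import math


def rms_deviation(observed, ideal):
    """Root-mean-square deviation between model values and their restraint targets."""
    if len(observed) != len(ideal) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(obs - target) ** 2 for obs, target in zip(observed, ideal)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical bond lengths in Angstrom and their ideal (target) values.
observed_bonds = [1.53, 1.34, 1.45, 1.23]
ideal_bonds = [1.52, 1.33, 1.46, 1.24]
print(round(rms_deviation(observed_bonds, ideal_bonds), 3))
```

Every bond here is off by 0.01 Angstrom, so the RMS(bonds) is 0.01, comfortably inside the 0.02 upper limit quoted above.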

Why does my output model have very poor geometry (RMS(bonds) and RMS(angles))?

This usually means that the automatic X-ray/geometry weighting did not work properly; this can sometimes happen if the starting model also has poor geometry. Optimizing the weight (optimize_xyz_weight=True, or equivalent GUI control in the "Refinement settings" tab) will usually fix this problem.

The RMS(bonds) and RMS(angles) in the output model are too low after weight optimization - how can I make them higher?

This is a common misconception about geometry deviations, based partly on anecdotal experience with other refinement programs. The target RMSDs come from looking at very accurate high-resolution small molecule structures, so they reflect the real variation that should occur in geometry. At lower resolution, even though you know that the bond lengths and bond angles must vary as much as they do in high-resolution structures, there isn’t enough experimental data to tell in which direction they should deviate from the expected values. So it is reasonable for the refined RMSDs to be lower than the targets.

(Anecdotally, we have found that phenix.refine often refines to much tighter RMSDs with similar R-frees to other programs. This may reflect different approaches used in the geometry restraints, X-ray target, or optimization methods, but it should not be cause for concern.)

I have experimental phases for this structure, but the initial maps were poor. Should I still use the MLHL target?

The experimental phases used to restrain refinement describe a bimodal probability distribution for every phase angle, rather than the single values used to generate a map. In most cases the additional restraints will not hurt refinement, and can often help.

Why is phenix.refine messing up my ligand geometry?

This often happens when the restraints were generated using ReadySet from a PDB file, and the ligand code is not recognizable in the Chemical Components Database. eLBOW will try to guess the molecular topology based on the coordinates alone, but this is imprecise and may not yield the desired result. For best results, restraints for non-standard ligands should be generated in eLBOW using a SMILES string or similar source of topology information.

What can I do to make my low-resolution structure better?

In general, if NCS is present in your structure, you should always use NCS restraints at low resolution; it is worth trying both the Cartesian (global) and torsion restraints to see which works best for your model. This alone usually helps with the geometry and overfitting, although it is rarely sufficient by itself. There are also several different types of restraint specifically designed to help with low-resolution refinement (consult the full phenix.refine manual page for details on each):

I have ions very close to water molecules/protein atoms, and phenix.refine keeps trying to move them apart. How can I prevent this?

Use phenix.ready_set or phenix.metal_coordination to generate custom bond (and optionally, bond angle) restraints, which will be output to a parameter file ending in ".edits". If you are using the PHENIX GUI, there is a toolbar button for ReadySet in the phenix.refine interface, which will automatically load the output files for use in phenix.refine.

I had previously generated custom restraints using ReadySet in the PHENIX GUI, but the atoms have changed. Now phenix.refine crashes because it can't find the atom selections. How do I remove the old custom restraints?

In the Utilities menu, select "Clear custom restraints."

What does "sigma" mean for geometry restraints, and what values are appropriate?

The sigma is the estimated standard deviation (e.s.d.) of the target value. For distance (bond) restraints, the sigma will be in Angstrom units; for bond angle and dihedral restraints it will be in degrees. A typical covalent bond will have a sigma of 0.02 or 0.03, and weaker "bond" restraints such as hydrogen bonds or metal coordination restraints will be looser with sigma between 0.05 and 0.1. For bond angle restraints, sigma is usually a few degrees; for dihedrals, up to 20-30 degrees is not uncommon. (See also the next question below.)

I want my ligand geometry to be absolutely perfect with no deviation from the target value(s). Can I just set the sigmas to zero or an extremely low value?

You cannot set the sigma to zero because the weight on the restraints is equal to 1/sigma^2. A very low value will not crash, but it will almost certainly confuse the minimizer and result in a sub-optimal structure, because those restraints will dominate the target and gradients, forcing the minimizer to take inappropriately large steps.
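The relationship stated above (weight = 1/sigma^2) can be tabulated with a few lines of Python. This is only a numerical illustration; the function name is hypothetical, and the sigma values are taken from the typical ranges quoted in the previous answer plus one deliberately extreme case.

```python
def restraint_weight(sigma: float) -> float:
    """Weight of a restraint with estimated standard deviation `sigma` (1 / sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive; zero would give an infinite weight")
    return 1.0 / sigma ** 2


# 0.1 and 0.02 are typical sigmas in Angstrom; 0.001 is an extreme value for contrast.
for sigma in (0.1, 0.02, 0.001):
    print(f"sigma={sigma}: weight={restraint_weight(sigma):.0f}")
```

Shrinking sigma from 0.02 to 0.001 multiplies the weight by 400, which is why such a restraint swamps the rest of the target.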

One part of my structure is particularly poor; how can I make the geometry restraints tighter for only those atoms?

The parameter scope refinement.geometry_restraints.edits.scale_restraints allows you to upweight the restraints for the specified atom selection, for any combination of bond lengths, bond angles, dihedral angles, and chirality. For instance:

refinement.geometry_restraints.edits.scale_restraints {
  atom_selection = "chain B"
  scale = 2.5
  apply_to = *bond *angle dihedral chirality
}

You may specify multiple such parameter blocks. The scale should be a relatively small number, typically less than 10 (you may also reduce the weight if you want). Note that since this affects the path of the minimizer, the overall geometry RMSDs (and the deviations for restraints which were not scaled) will likely change as a result.

Can I make phenix.refine restrain the planarity of RNA/DNA base pairs?

At present this requires specifying each base pair individually as a custom planarity restraint (part of the parameter scope refinement.geometry_restraints.edits), which may be excessively time-consuming for large structures. Automatic planarity restraints may be added in a future version.

Can I use a reference model to restrain ligand coordinates?

The reference model restraints are only intended to work with macromolecules. However, you may use separate harmonic restraints for any subset of atoms; these will tether the selected atoms to their initial coordinates. This can be effective when you already have good geometry and map fit for the restrained atoms; however, it does not allow for genuine conformational differences.

How do I stop simulated annealing from pushing certain atoms too far out of density?

The harmonic restraints are suitable for this purpose. This is especially useful when generating a simulated annealing omit map, where atoms may move to fill voids left by omitted scatterers.

Non-crystallographic symmetry (NCS)

When should I use non-crystallographic symmetry (NCS) restraints?

This will of course require that NCS is actually present in your crystal. An approximate cutoff for NCS restraints is 2.0 Angstrom - at higher resolution the data alone are usually sufficient, but at lower resolution additional restraints are usually necessary.

What is the difference between global and torsion NCS, and which one should I pick?

The global NCS restraints treat groups as rigid bodies, where all atoms in each group are expected to be related to the others by a single rotation and translation operation. This does not respect local deformations in the related molecules, which are common even at lower resolution. The torsion NCS restraints restrain dihedral angles instead, and allow them to be unrestrained if genuinely different. This option has been made the default if NCS restraints are activated, since it usually results in significantly better refinement, and rarely performs worse than using no restraints at all.

How do I specify the .ncs_spec file from AutoSol or phenix.find_ncs for use in refinement?

The .ncs_spec files containing rotation and translation matrices are only used in density modification and model building. For refinement, the NCS relationships are always given as atom selections, and in the case of the default torsion NCS restraints, the automatically detected restraint groups should be very accurate.

How does one define NCS groups manually in the Phenix GUI?

If you are using the torsion NCS restraints, this should not be necessary, but you can enter atom selections manually by selecting "Detect NCS groups" from the Utilities menu. If you have identified a structure where manual group assignment is necessary for refinement to behave properly at low resolution, this may indicate a bug in phenix.refine; please contact us directly by emailing "help@phenix-online.org".

For the global/Cartesian NCS restraints, manual selections often are necessary, and the same menu item will open a different selection window if the global option is selected. However, we encourage you to try the torsion NCS restraints first before deciding to use the global parameterization.

Both NCS restraint types make my structure worse - what should I do?

If you are refining a structure at high resolution (better than 2.0 Angstrom), this is not unexpected; NCS restraints are usually unnecessary in such cases. At lower resolution, this may indicate a bug; we encourage you to contact us directly.

What impact do NCS restraints have on the electron density maps?

The maps will only be affected to the extent that the NCS-related chains will be relatively similar, and therefore the map phases will reflect this similarity. The maps will not be directly modified using the NCS relationships, however. If you want phenix.refine to perform NCS averaging on the maps, this is available as an option for individual map coefficients.

Do NCS restraints apply to B-factors?

By default, no; our testing has indicated that this is rarely beneficial and in a significant fraction of structures it may actually yield worse results (due to real differences in the disorder present in NCS-related copies). If you want B-factors to be restrained between NCS groups, check the box labeled "Restrain NCS-related B-factors" in the "NCS options" window, or on the command line, specify ncs.restrain_b_factors=True.

B-factors/ADPs/TLS

When should I use TLS?

TLS refinement is generally valid at any resolution; at low resolution, it may be best to make each chain a single group, instead of trying to split them into smaller pieces. However, it is best to wait until near the end of refinement to add TLS; until then you should refine with isotropic ADPs only.

Can I use both TLS and anisotropic ADPs?

Yes, but not for the same atoms - since TLS is essentially constrained anisotropic refinement, the two methods are mutually exclusive.

Where is the switch for anisotropic vs. isotropic B-factors/ADPs?

phenix.refine does not have a single global switch for defining ADP parameterization; rather, when the "Individual ADPs" strategy is defined, the program uses several criteria to determine how atoms should be treated:

In the GUI, several common parameterizations are pre-defined in the dialog for entering ADP selections. Note that although it is possible to combine all of the different ADP refinement strategies in a single run, the atom selections for individual and grouped refinement may not overlap, nor may the selections for anisotropic ADPs and TLS groups.

When should I refine anisotropic ADPs instead of TLS groups?

There is no precise cutoff where you should turn on anisotropic ADPs, but these are approximate guidelines, assuming that the data are actually complete to the indicated resolution:

There may be circumstances where anisotropic refinement is permissible at slightly lower resolution, but 1.7 Angstrom is probably a lower limit. Exceptions may sometimes be made for metal ions, since they scatter very strongly. As always, you should use the drop in R-free to judge whether the change in parameterization was appropriate - a decrease of 0.5% (i.e. 0.005) or better indicates success.
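The R-free criterion above amounts to a one-line comparison. The sketch below assumes R-free values expressed as fractions (0.25 rather than 25%); the function name and the example numbers are hypothetical.

```python
def r_free_improved(r_free_before: float, r_free_after: float, min_drop: float = 0.005) -> bool:
    """True if R-free fell by at least `min_drop` after changing the parameterization."""
    return (r_free_before - r_free_after) >= min_drop


# Hypothetical runs: isotropic refinement first, then anisotropic ADPs.
print(r_free_improved(0.250, 0.240))
print(r_free_improved(0.250, 0.248))
```

A drop from 0.250 to 0.240 passes the test, while a drop to 0.248 does not and suggests the extra parameters are not justified.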

When should I refine grouped B-factors/ADPs instead of individual?

It is again difficult to give an exact rule, since it depends on several properties of the crystal including resolution, solvent content, presence of NCS, etc. In general, the higher the data-to-parameter ratio, the more likely individual ADPs are to work well. As an approximate example, consider these two hypothetical structures:

In this case, the latter structure can probably be refined with individual ADPs, while the former is more marginal. If in doubt, early rounds of refinement may be done with grouped ADPs, switching to individual as the structure nears convergence. In general, it is usually worth trying individual ADPs at some point; ultimately the effect on R-factors (primarily R-free, but also the gap between R-work and R-free) is the most important guideline.

Twinning

When should I use twinned refinement?

You should only do this if Xtriage indicates that the intensity statistics are abnormal and cannot be explained by picking the wrong space group. The estimated twin-law-specific twin fractions displayed in Xtriage should not be used to determine whether twinning is present.

I performed twin refinement and my R-free went down by 1%; does that mean my structure is twinned?

No, because R-factors calculated with and without twinning are not necessarily on the same scale. In phenix.model_vs_data, the structure is only considered twinned if application of a twin law reduces R-work by at least 2%. Note that if you specify twin_law=Auto, phenix.refine will use the same procedure to determine the twin law (if any).

But what if the reported twin fraction is nearly 50%?

This often means that your data are under-merged, i.e. the crystal symmetry is too low and the applied twin operator is substituting for a symmetry operation. If this is the case you should merge the data to higher symmetry and reduce the model to the new ASU. Of course it is also possible to have perfectly twinned crystals, but the high twin fraction output by phenix.refine does not by itself indicate twinning.

My data has multiple twin laws; can I use these in Phenix?

Currently we only support a single twin law; programs capable of refining tetartohedrally twinned structures are REFMAC and SHELXL.

What are the disadvantages of twinned refinement?

In Phenix specifically, twinned refinement uses a least-squares (LS) target instead of the more powerful maximum likelihood target used for conventional refinement. Additionally, twinned refinement makes no use of experimental phases (if available) as restraints. Some refinement protocols may not work with twinning, although as of April 2013 most of these have been fixed. More generally, the output map coefficients will have a significantly worse model bias problem than conventional maps; this effect increases as the twin fraction nears 0.5.

Using R-free

phenix.refine stops with this error: "R-free flags not compatible with F-obs array: missing flag for 100 F-obs selected for refinement". How do I fix this?

This means that the experimental data array contains reflections that are not present in the R-free flags array (this is required even if the flag is False). Use the reflection file editor to extend the existing R-free flags to cover all reflections in the file. Make sure that the checkbox labeled "Extend existing R-free array(s) to full resolution range" is checked.

phenix.refine stops with an error message about the model being refined against a different set of R-free flags. How can I fix this?

First, make sure that you have not actually generated a new set of R-free flags; once you have these flags for a given dataset, you should continue using them throughout the building and refinement process. The error message is intended to guard against this happening accidentally. If, however, you have collected new higher-resolution data and extended the old R-free flags, then the error message may be ignored. The R-free flag comparison is based on information stored in the REMARK records in the input PDB file, so if you edit the PDB file and remove the line containing the word "hexdigest", the refinement will be able to continue.
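If you prefer to script the edit rather than use a text editor, a minimal sketch follows. It simply drops every line containing the word "hexdigest", as described above. The function name is hypothetical, and the REMARK line in the example is a placeholder: the exact wording in a real file may differ.

```python
def strip_hexdigest_lines(pdb_text: str) -> str:
    """Return the PDB text with every line containing 'hexdigest' removed."""
    kept = [
        line
        for line in pdb_text.splitlines(keepends=True)
        if "hexdigest" not in line
    ]
    return "".join(kept)


# Placeholder header; only the presence of the word 'hexdigest' matters here.
example = (
    "REMARK placeholder line mentioning hexdigest 0123abcd\n"
    "CRYST1   50.000   60.000   70.000  90.00  90.00  90.00 P 21 21 21\n"
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00 20.00           N\n"
)
cleaned = strip_hexdigest_lines(example)
print(cleaned, end="")
```

Write the result to a new file rather than overwriting the original, and only do this when the flags really were extended from the old set.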

I have a model that was previously refined against a set of R-free flags that I don't have access to. How can I avoid biasing the R-free when I refine this model?

There are several methods for this, but the easiest is to reset the B-factors (using PDBTools or the "Modify start model" option in phenix.refine) and run simulated annealing on the coordinates. If you are especially worried about bias you can alternately randomize the coordinates and perform energy minimization, or build an entirely new model starting from the phases calculated from the original model. However, we usually find that the annealing is aggressive enough to remove any "memory" of the original R-free flags.

My resolution is X Angstroms, and my R/R-free are Y and Z. Am I done refining?

A partial answer can be obtained by looking at POLYGON, which plots histograms of statistics for PDB structures solved at similar resolutions, and compares these to the statistics for your output model. As a general rule, R-factors alone should not be used to decide if a structure is "done", but should be examined in combination with the validation report.

My resolution is X Angstroms, the structure is complete and well-validated, the maps look great, but my R and R-free are still really high. How can I make them lower?

There are several possible explanations for this:

The gap between R-work and R-free is very large - how can I fix this?

Overfitting during refinement is usually helped by adding more restraints, and/or tightening the standard geometry restraints. If the output geometry is already within reasonable limits (typically RMS(bonds) < 0.016 and RMS(angles) < 1.8), ideas to try include adding NCS restraints if NCS is present, secondary structure restraints, or reference model restraints (if a high-resolution structure is available). At lower resolutions (worse than 3.0A), it may also be prudent to try grouped ADP refinement, and if desperate, Ramachandran restraints. TLS refinement can often reduce overfitting across a wide range of resolutions. However, depending on the degree of overfitting, it may be necessary to perform extensive manual rebuilding first. (Note that if the large R/R-free gap suddenly appears after refinement of a model that was previously not overfit, this usually indicates incorrect parameterization of the refinement, e.g. using anisotropic ADPs at an inappropriate resolution.)
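For reference, the gap discussed here is the difference between two values of the standard crystallographic R-factor, R = sum(||Fobs| - |Fcalc||) / sum(|Fobs|), computed over the working and test reflections respectively. The sketch below assumes the calculated amplitudes are already on the same scale as the observed ones and ignores bulk-solvent correction; the amplitudes, flags, and function names are invented for illustration.

```python
def r_factor(f_obs, f_calc):
    """Standard R-factor; assumes f_calc is already scaled to f_obs."""
    numerator = sum(abs(abs(obs) - abs(calc)) for obs, calc in zip(f_obs, f_calc))
    denominator = sum(abs(obs) for obs in f_obs)
    return numerator / denominator


def r_work_and_r_free(f_obs, f_calc, free_flags):
    """Split reflections by the test-set flag and return (R-work, R-free)."""
    work_obs = [o for o, flag in zip(f_obs, free_flags) if not flag]
    work_calc = [c for c, flag in zip(f_calc, free_flags) if not flag]
    free_obs = [o for o, flag in zip(f_obs, free_flags) if flag]
    free_calc = [c for c, flag in zip(f_calc, free_flags) if flag]
    return r_factor(work_obs, work_calc), r_factor(free_obs, free_calc)


# Hypothetical amplitudes; the last two reflections belong to the test set.
f_obs = [100.0, 80.0, 60.0, 40.0, 50.0, 20.0]
f_calc = [90.0, 85.0, 66.0, 36.0, 40.0, 25.0]
free_flags = [False, False, False, False, True, True]

r_work, r_free = r_work_and_r_free(f_obs, f_calc, free_flags)
print(f"R-work={r_work:.3f}  R-free={r_free:.3f}  gap={r_free - r_work:.3f}")
```

A real data set has thousands of reflections, so the statistics are far less noisy than in this six-reflection toy case.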

I ran a round of refinement, rebuilt in Coot, and refined again. My previous R-free was 0.25, but the new refinement starts out at 0.35. Why is it so high?

The initial R-factors reported by phenix.refine are without bulk-solvent correction, which usually has a significant impact on R-factors. Once this step is performed (at the start of the first macrocycle), the R-free should drop immediately to approximately the expected value.

Why does phenix.refine give me a different R-factor than program X?

There are many explanations for this. Even without minimization, the bulk solvent and scaling methods alone may account for as much as a 1% difference or more in calculated R-factors. For refinement results, the differences in target functions, restraints, and minimizers may be significant. In some cases the explanation is as simple as running too few cycles of refinement for one or the other program. In general, if you find a case where phenix.refine performs significantly worse than another program, we encourage you to contact us at help@phenix-online.org.

Are R-free flagged reflections included in the maps used for real-space refinement?

No, this is almost guaranteed to bias R-free; these reflections are removed internally prior to map calculation. However, the output maps will include these reflections unless you explicitly request otherwise.

Interpreting results

I solved my structure by MR and refined, and the R-free is 45%. The maps are messy and I see a lot of difference density, but none around my molecule. Why isn't refinement working?

This frequently indicates that too few copies of the structure were placed by MR, and an additional chain needs to be added. Remember that the predictions of unit cell contents based on the Matthews coefficient (performed by Xtriage, for example) only provide an estimate, not an exact answer. At high resolution a solvent content of 40% or less is quite common.
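The Matthews estimate mentioned above can be reproduced by hand. The sketch below uses the usual definitions: Vm is the unit cell volume divided by the total protein mass in the cell, and the solvent fraction is approximated as 1 - 1.23/Vm for proteins. The cell volume, number of symmetry operators, and molecular weight are invented values, and the function names are hypothetical; this is not the Xtriage implementation.

```python
def matthews_coefficient(cell_volume, n_symops, mol_weight, n_copies):
    """Vm in cubic Angstrom per Dalton for `n_copies` molecules per asymmetric unit."""
    return cell_volume / (n_symops * n_copies * mol_weight)


def solvent_fraction(vm):
    """Common protein approximation: fraction of the cell occupied by solvent."""
    return 1.0 - 1.23 / vm


# Hypothetical crystal: 400,000 A^3 cell, 4 symmetry operators, 25 kDa protein.
for n_copies in (1, 2):
    vm = matthews_coefficient(400000.0, 4, 25000.0, n_copies)
    print(n_copies, round(vm, 2), round(solvent_fraction(vm), 3))
```

In this example one copy implies about 69% solvent and two copies about 38%; both are physically possible, so the estimate alone cannot tell you how many chains to place.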

Why am I seeing negative blobs in the difference (mFo-DFc) map in hydrophobic voids?

In previous versions of Phenix, the bulk solvent mask was often being extended to include these regions. This should no longer be a problem as of July 2013, but if you continue to see this effect, please contact us at help@phenix-online.org.

After coordinate and B-factor refinement the heavy atoms in my structure have negative mFo-DFc peaks. How do I get rid of these?

Refining the occupancies will often fix this problem. Alternately, if a significant amount of anomalous scattering is expected at the wavelength used for data collection, anomalous group refinement may also be helpful. We do not recommend setting the B-factor manually and turning off refinement for the problematic atoms.

After refinement the mFo-DFc map has positive density around correctly placed atoms that are already at full occupancy. Is my model missing something?

Usually this means that the initial B-factors of the input model were too high for refinement to converge. Typically the minimizer is very good at raising low B-factors to the correct value, but gets stuck in the opposite direction. The observed result could happen if you refine starting from a lower-resolution model, or if you build new residues in Coot and the default B-factors are well above what they should be (typically this only happens at atomic resolution). To fix the problem, you just need to reset the B-factors to a suitably low value. This can be done at the start of refinement by setting the parameter refinement.modify_start_model.adp.set_b_iso; in the GUI, this can be found in the Settings menu under "Modify start model" --> "Modify ADPs...".

Why does Phenix validation report a different number of Ramachandran outliers than Coot/Procheck/the PDB/other program?

The phi/psi distributions used in Phenix are the same as those in the MolProbity server (Chen et al. 2010), and are based on a curated set of 8000 high-resolution crystal structures. There are now six distributions for different residue classes (general, glycine, Ile/Val, pre-Pro, cis-Pro, and trans-Pro). These distributions are stored in 2-degree increments. Other programs generally use older and/or less precise distributions to score phi/psi angles, which frequently results in disagreements for residues which are on the border of allowed and outlier regions of the plot. We suggest that you rely primarily on the results in Phenix (or MolProbity), as the distributions we use are very accurate and based on the latest structural data.

How come I have a bunch of clashes with water molecules in the validation results after running solvent update?

phenix.refine is relatively aggressive in placing solvent atoms in unmodeled density. However, this may sometimes result in clashes if the density represents ions, unmodeled residues, or alternate conformations rather than solvent. For this reason, we recommend that the final round of refinement not include solvent update (regardless of resolution), after any clashing water atoms have been removed. (You should also attempt to model the observed density features if possible, although this is not always straightforward.)

What does it mean when a water atom lies in a positive mFo-DFc peak?

This indicates that the water is actually something heavier. If the density shape (for both 2mFo-DFc and mFo-DFc maps) is still relatively isolated and spherical, this suggests an ion of some sort (chloride, calcium, etc.). You can use the built-in ion identification in phenix.refine to try to model it, or make the assignment manually based on examination of the local environment. We also recommend looking at the anomalous difference map, as many ions will have significant anomalous scattering at typical data collection wavelengths. Alternately, if the density is more extended, this indicates that a larger ligand is bound.

I still have some mysterious mFo-DFc blobs that I can't identify; what should I do with them?

One option is to use the ligand identification tool to try to place various common small molecules in the density; we recommend being very conservative in filtering the results. It is very helpful to check the conditions used for purification, crystallization, and cryoprotection, since these frequently result in buffer components being visible in the maps. However, if you are unable to positively identify blobs, we recommend leaving them empty (and perhaps also noting this in the deposition remarks or publication). We do not recommend filling them with water molecules, as this misrepresents the true identity of the blobs, nor do we recommend using "unknown" atoms as placeholders (especially since these are incompatible with most tools in Phenix).

Hydrogens

When should I refine with hydrogens?

This is largely a matter of personal preference. Using explicit riding hydrogen atoms can improve geometry at any resolution; at higher resolutions (approximately 2 Å or better) they will generally improve R-free as well. At atomic resolution (1.5 Å or better) they should always be part of the final model. Note that unless you have true subatomic resolution (0.9 Å or better), the hydrogens should always be refined as "riding", meaning that their coordinates are defined by the heavy atoms, not individually refined against X-ray data.

How can I tell phenix.refine to add hydrogens to my model?

The command-line program does not add hydrogens; this is performed by a separate program, phenix.ready_set. However, in the GUI there is an option in the "Refinement settings" tab to add hydrogens, which simply runs ReadySet internally immediately before starting phenix.refine.

What about water molecules?

Although phenix.ready_set includes an option to add hydrogens to waters, we do not recommend this unless you have exceptionally high resolution and/or neutron data.

Why are my hydrogen atoms added by PHENIX exploding when I run real-space refinement in Coot?

Versions of Coot prior to 0.6.2 used a version of the CCP4 monomer library with hydrogen atoms named according to the PDB format version 2 standard; PHENIX can recognize these, but defaults to PDB v.3. To reconcile the different conventions, you can download the newer version of the monomer library (currently available here) and set the environment variable COOT_REFMAC_LIB_DIR to point to the directory in which you unpack it. However, newer versions of Coot do not appear to have this problem.

I refined with riding hydrogens - why are they present in the output model?

phenix.refine will always output all atoms that are part of the atomic model, regardless of how they were parameterized during refinement. Use of the "riding" model does not necessarily guarantee that the hydrogen positions and B-factors are reproducible, as they differ between various programs (or program versions).

Why can't PHENIX automatically remove hydrogens from the output PDB file?

We strongly discourage removing any atoms used in refinement from the model, as it makes reproducing the published R-factors very difficult and eliminates essential information about how the structure was refined.

How are X-H (hydrogen to heavy atom) bond lengths defined, and why are these different than the ones in molecular mechanics programs?

Since X-rays are diffracted by electrons, the default hydrogen positions are defined as the centers of the electron clouds, not the atomic nuclei. In most cases this results in the bond lengths being 0.1Å shorter than they would be if the distances to the nuclei were given. At high resolution, this will more accurately model the X-ray data. The VDW radii used for nonbonded restraints will be larger to compensate for the shorter bonds.

If you are performing neutron refinement, the nuclear positions will be used instead, resulting in longer bond lengths and smaller VDW radii. The MolProbity validation should handle both scenarios appropriately.

Miscellaneous

How can I model a charged atom?

The charge occupies columns 79-80 at the end of each ATOM or HETATM record, immediately following the element symbol. The format is the magnitude of the charge (a single digit) followed by its sign, for example "1-" or "2+". You can edit the PDB file manually to add this, but we recommend using phenix.pdbtools:

phenix.pdbtools model.pdb charge_selection="element Mn" charge=2

This is also available in the GUI under "Model tools". The effect of setting the charge will be to use modified scattering factors for X-ray refinement, which can be helpful if you notice difference density appearing at ion sites. Note that it will have no effect on the geometry, since phenix.refine does not take electrostatics into account.
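If you do want to edit the file by hand or in a script, the column layout described above can be sketched in a few lines of Python. This is only an illustration of the fixed-column format, not what phenix.pdbtools does internally; the function name set_pdb_charge and the example manganese record are hypothetical.

```python
def set_pdb_charge(line: str, charge: int) -> str:
    """Return an ATOM/HETATM record with the charge written to columns 79-80.

    Hypothetical helper for illustration; other record types are returned
    unchanged.
    """
    if not line.startswith(("ATOM", "HETATM")):
        return line
    if not -9 <= charge <= 9:
        raise ValueError("the charge field only has room for a single digit")
    # Pad to the full 80-column width so the slices below are well defined.
    padded = line.rstrip("\n").ljust(80)
    if charge == 0:
        field = "  "  # a neutral atom leaves the field blank
    else:
        field = f"{abs(charge)}{'+' if charge > 0 else '-'}"
    # Columns 79-80 (1-based) are string indices 78 and 79 (0-based).
    return padded[:78] + field + padded[80:]


# Build an example HETATM record with every field in its standard column.
record = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2}"
).format("HETATM", 2001, "MN", "", "MN", "A", 501, "",
         12.345, 23.456, 34.567, 1.00, 20.00, "MN")

charged = set_pdb_charge(record, 2)
print(charged)
```

The element symbol stays in columns 77-78 and only the last two columns change, which is why the record is padded to 80 characters before the charge is written.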

I can't see density for an arginine sidechain beyond the C-gamma atom. How should I model it?

Opinion in the crystallography community differs on the proper approach to disordered sidechains, with significant support for both of the following methods voiced on the PHENIX and CCP4 mailing lists:

  • Delete all atoms not visible in density, but leave the residue name alone. This is arguably the most conservative approach, as it avoids modeling any features not supported by the data, and it is consistent with the treatment of missing loops. The main disadvantage is aesthetic, since it is more difficult to visualize and interpret the biological effects of a structure with missing sidechains. Anecdotal evidence suggests that some non-crystallographers may be confused by this.
  • Pick an appropriate rotamer, and let the B-factors rise to account for disorder. This avoids truncated sidechains that may be mistaken for other residues, and is more realistic when interpreting surface electrostatics. The atomic B-factors and coordinates are actually refined against the data, however weak. It is potentially dangerous because it implies a greater level of confidence in these positions than is justified by the data. Additionally, the ADP restraints will keep the B-factors of nearby atoms similar (within some tolerance), which is normally essential for stable refinement but may artificially lower the B-factors of disordered sidechains.

A third approach, setting the occupancy of missing atoms to zero but leaving them in the model, is strongly disfavored, as the resulting positions and B-factors are entirely theoretical (but not immediately obvious as such).

Running phenix.model_vs_data (or the validation GUI) results in a slightly different R-factor than reported in the PDB header by phenix.refine. Shouldn't these be the same?

phenix.refine and phenix.model_vs_data use the same code to perform the bulk solvent correction and scaling, so they should report approximately the same R-factors given identical inputs. The discrepancy arises when taking a PDB file from refinement and running it back through phenix.model_vs_data. Because of the limited precision of the PDB format (three digits after the decimal point for coordinates, two digits for B-factors and occupancy), the atomic properties recorded in the PDB file will not be exactly the same as their actual refined values. In practice the difference in R-factors is statistically insignificant, however.
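The size of the rounding effect can be illustrated with a short Python sketch that writes a set of values using the PDB field widths and reads them back. The "refined" numbers below are made up for the example, and the function name pdb_round_trip is hypothetical.

```python
def pdb_round_trip(x, y, z, b, occ):
    """Format values with the PDB field widths, then parse them back.

    Coordinates use 8.3f, occupancy and B-factor use 6.2f, as in ATOM records.
    Returns the values in the same order as the arguments.
    """
    text = f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}"
    x2, y2, z2 = float(text[0:8]), float(text[8:16]), float(text[16:24])
    occ2, b2 = float(text[24:30]), float(text[30:36])
    return (x2, y2, z2, b2, occ2)


# Made-up "in-memory" values: x, y, z, B-factor, occupancy.
refined = (12.34567, -3.14159, 101.00049, 27.18282, 0.66667)
written = pdb_round_trip(*refined)
errors = [abs(r - w) for r, w in zip(refined, written)]

for label, r, w in zip(("x", "y", "z", "B", "occ"), refined, written):
    print(f"{label:>3}: refined {r:.5f}  written {w}")
```

The coordinate error can never exceed 0.0005 Å, and the B-factor and occupancy errors can never exceed 0.005, which is consistent with the R-factor differences being very small.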

When should I perform anomalous group refinement?

This is most useful when you have a large number of strong anomalous scatterers, where map artifacts are common. In such cases a more precise modeling of atomic scattering may improve the R-factors as well. It is generally not necessary for routine cases, but it may be advantageous for identifying weak anomalous scatterers (such as ions from the crystallization condition) when used to calculate an anomalous log-likelihood gradient (LLG) or residual map at the end of refinement.

How does phenix.refine deal with atoms on special positions?

There are two ways to handle atoms on special positions (e.g. symmetry axes):

  • If the occupancy is set to the value expected for the special position (e.g. 0.5 for an atom on a two-fold axis) or less, the coordinate position will be refined, and it will not interact with its symmetry mates.
  • If the occupancy is set to 1, the atom will be constrained to stay at the special position, and the occupancy will be corrected internally when calculating structure factors.

Note that partial-occupancy atoms on special positions will have their occupancies refined under the default settings; you can disable this by instructing phenix.refine to remove a specific atom selection from occupancy refinement (the keyword refinement.refine.occupancies.remove_selection, or in the GUI, edit the atom selections for the Occupancy strategy).

I have sidechains in multiple conformations interacting across a symmetry axis, where the 'A' conformations clash with each other and are moved out of density by refinement. How can I tell phenix.refine to ignore this interaction?

This is not uncommon (a classic example is Tyr378 in PDB ID 1GWE), but it is not automatically handled right now and the PDB format does not provide a means to label the atoms in such a way that clashes are avoided. The solution in phenix.refine is a parameter defining an atom selection for which nonbonded restraints should be disabled:

custom_nonbonded_symmetry_exclusions = "chain A and resseq 378"

How can I extract the isotropic B-factor equivalent from a structure refined with TLS or anisotropic atoms?

You don't need any extra steps; the B-factor column in ATOM records in the PDB (or mmCIF) file will already be the total B-factor.

Why does phenix.refine output ANISOU records for individual atoms even though I only performed isotropic and TLS refinement?

TLS refinement is essentially constrained anisotropic refinement, so the individual atoms are anisotropic (just not independent); the ANISOU records simply make this explicit, since they have a standard format and are recognized by a variety of programs, unlike TLS information in the PDB header.
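If you want to cross-check the B-factor column against an ANISOU record, the standard relation is B_eq = 8π² × (U11 + U22 + U33) / 3, where the ANISOU record stores each U value multiplied by 10^4. The Python sketch below applies that relation to a single record; the function name and the example record are hypothetical, and this is not code from Phenix.

```python
import math


def b_eq_from_anisou(line: str) -> float:
    """Equivalent isotropic B-factor (in square Angstroms) from one ANISOU record."""
    if not line.startswith("ANISOU"):
        raise ValueError("not an ANISOU record")
    # U11, U22 and U33 occupy columns 29-35, 36-42 and 43-49 (1-based),
    # stored as integers scaled by 10^4.
    u11 = int(line[28:35])
    u22 = int(line[35:42])
    u33 = int(line[42:49])
    u_eq = (u11 + u22 + u33) / 3.0 * 1.0e-4
    return 8.0 * math.pi ** 2 * u_eq


# Example record built with every field in its standard column.
anisou = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1} "
    "{:>7}{:>7}{:>7}{:>7}{:>7}{:>7}"
).format("ANISOU", 1, " CA ", "", "ALA", "A", 1, "",
         2000, 2533, 3066, 100, -50, 25)

b_eq = b_eq_from_anisou(anisou)
print(f"B_eq = {b_eq:.2f}")
```

Only the diagonal elements enter the equivalent isotropic value; the off-diagonal terms describe the orientation of the ellipsoid and do not change its mean-square displacement.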

Why doesn't the PDB header report the bulk solvent parameters K_sol and B_sol?

Newer versions of Phenix use an improved bulk-solvent correction and scaling procedure with an entirely different parameterization, which we find performs better (Afonine et al. 2013; see also Uson et al. (1999) Acta Cryst. D55, 1158–1167). K_sol and B_sol are not parameters of this procedure, so there are no values to report.

What is the difference between the various scattering tables? When should I use something other than the default?

If you are refining a neutron structure, you should of course use the neutron scattering table. The other tables are all X-ray-specific. The default, n_gaussian, is the best choice, as it uses a dynamically determined number of Gaussians to approximate the tabulated form factors to the required accuracy. it1992, which is commonly used in other programs, uses four Gaussians plus a constant, taken from the 1992 edition of International Tables. wk1995, from Waasmaier & Kirfel (1995), uses five Gaussians and is more accurate (but slower) than it1992.
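To make the "Gaussians plus a constant" description concrete, the sketch below evaluates a form factor of the form f(s) = Σ a_i exp(-b_i s²) + c, with s = sin(θ)/λ = 1/(2d). The carbon coefficients are quoted here as the commonly tabulated four-Gaussian values and should be checked against the published tables before any real use; the function names are hypothetical and this is not how Phenix stores or evaluates its tables.

```python
import math

# Four-Gaussian-plus-constant coefficients for neutral carbon.
# Assumption: quoted for illustration only; verify against the tables.
CARBON_A = (2.3100, 1.0200, 1.5886, 0.8650)
CARBON_B = (20.8439, 10.2075, 0.5687, 51.6512)
CARBON_C = 0.2156


def stol_from_d(d_spacing: float) -> float:
    """sin(theta)/lambda for a given resolution d (Bragg's law: s = 1/(2d))."""
    return 1.0 / (2.0 * d_spacing)


def form_factor(stol: float, a, b, c: float = 0.0) -> float:
    """Evaluate f(s) = sum_i a_i * exp(-b_i * s^2) + c."""
    s2 = stol * stol
    return sum(ai * math.exp(-bi * s2) for ai, bi in zip(a, b)) + c


f_zero = form_factor(0.0, CARBON_A, CARBON_B, CARBON_C)
f_2A = form_factor(stol_from_d(2.0), CARBON_A, CARBON_B, CARBON_C)
print(f"f(0) = {f_zero:.3f}, f at 2 A = {f_2A:.3f}")
```

At s = 0 the sum reduces to the sum of the a_i plus c, which should be close to the number of electrons (6 for carbon); adding more Gaussians, as the other tables do, mainly improves the fit at higher scattering angles.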

Why doesn't phenix.refine use a constant number of resolution bins/why doesn't phenix.refine divide reflections evenly into resolution bins?

Binning reflections evenly is inappropriate for the bulk solvent correction and scaling method used in phenix.refine, which instead uses logarithmic binning. The logic is explained in Afonine et al. (2013):

"This scheme allows the higher resolution bins to contain more reflections than the lower resolution bins and more detailed binning at low resolution without increasing the total number of bins. An additional reason for using logarithmic binning is that the dependence of the scales on resolution is approximately exponential (see previous sections), which makes the variation of scale factors more uniform between bins when a logarithmic binning algorithm is used."

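The effect described in the quotation can be sketched in Python. This is a minimal illustration assuming bins that are equally spaced in ln(1/d); it is not the binning code used in phenix.refine, which may differ in its details, and the function names and resolution limits are hypothetical.

```python
import math


def log_bin_edges(d_max: float, d_min: float, n_bins: int):
    """Resolution limits (in Angstroms) equally spaced in ln(1/d).

    The list runs from d_max (low resolution) down to d_min (high resolution).
    """
    lo, hi = math.log(1.0 / d_max), math.log(1.0 / d_min)
    step = (hi - lo) / n_bins
    return [1.0 / math.exp(lo + i * step) for i in range(n_bins + 1)]


def assign_bin(d: float, edges) -> int:
    """Index of the bin containing resolution d (0 is the lowest-resolution bin)."""
    n_bins = len(edges) - 1
    for i in range(n_bins):
        if d > edges[i + 1]:
            return i
    return n_bins - 1  # d at (or beyond) the high-resolution limit


edges = log_bin_edges(d_max=50.0, d_min=2.0, n_bins=5)

# The number of reflections in a shell scales with its reciprocal-space
# volume, i.e. with the difference of 1/d^3 between its two limits.
relative_counts = [
    1.0 / edges[i + 1] ** 3 - 1.0 / edges[i] ** 3 for i in range(len(edges) - 1)
]

for i, count in enumerate(relative_counts):
    print(f"bin {i}: {edges[i]:6.2f} - {edges[i + 1]:5.2f} A, "
          f"relative count {count:.5f}")
```

With these limits the low-resolution bins are narrow in reciprocal space and hold few reflections, while each successive bin holds several times more than the one before it, without any increase in the total number of bins.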
Citations

How should I cite phenix.refine?

Either (Afonine et al. 2012) or (Adams et al. 2010) is suitable; we recommend the latter if you used additional components of Phenix. If you used the integrated MolProbity validation in the GUI, you should also cite (Chen et al. 2010) and/or (Echols et al. 2012).

How should I specify the refinement program in my PDB deposition?

If the PDB or mmCIF file you are depositing was output by phenix.refine, the refinement program (including version number) is already specified in the header. Otherwise, "PHENIX (version 1.8.4)" is suitable (replaced with the actual version number). However, note that if you used multiple refinement programs (for example REFMAC and PHENIX) during the course of structure determination, only the last will usually be named in the header, so we suggest that you edit this information during deposition to complete the list.

What are some references for the underlying methods used in phenix.refine?

For technical background, the most thorough source is (Afonine et al. 2012), which contains all of the relevant citations. We recommend that everyone who uses phenix.refine read this paper at some point even if you are not concerned with theory, as it provides a more detailed explanation for the methods and motivations behind the program than this documentation.

Where can I read more about the principles of refinement in general?

Bernhard Rupp's "BioMolecular Crystallography" is the most modern and complete reference, and includes a detailed explanation of the maximum likelihood methods used in phenix.refine and many other programs.

References