Analyzing unmerged data in Phenix


Introduction
Xtriage and phenix.merging_statistics
phenix.cc_star
phenix.table_one
Details for individual statistics
References

Introduction

Most of the programs in Phenix assume that the input data, regardless of type or format, are already both scaled and merged to contain only unique reflections. The major exception is AutoSol, which can accept certain unmerged files (for a more limited set of formats, primarily Scalepack) and performs its own local scaling. In other situations, some programs will automatically merge the data themselves if necessary, discarding any information about the variance of individual observations.

Several utilities in Phenix use unmerged intensities to calculate data quality metrics. These include Xtriage, which will display standard merging statistics; the standalone version of this routine, phenix.merging_statistics; a separate utility for calculating CC* (Karplus & Diederichs 2012), phenix.cc_star; and the Table 1 program. Their use as it relates to unmerged data is described below. Note that, due to implementation details, small numerical differences (up to a few tenths of a percent) may occur between the statistics originally reported in the processing logfiles and the numbers reported by Phenix.

Important: we have provided a list of statistics with pointers to the relevant publications in the "Details" section, but this document does not attempt to explain the mathematics or logic behind the statistics calculated here. We strongly recommend that you read the listed references; the publication(s) associated with the data processing software used may also be useful (see for example Evans (2011)).

If you want to experiment with these programs and do not have any data of your own to use, the Phenix installation includes an unmerged data file in the p9-sad example:

$PHENIX/examples/p9-sad/p9_se_w2.sca

The format does not contain complete symmetry information (and the space group is specified incorrectly), but you can use the symmetry information in p9.pdb in the same directory.
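As a starting point for experimenting with this file, the Python sketch below assembles a phenix.merging_statistics command line (the program is described in the next section) and runs it only if Phenix is found on the PATH. The n_bins and high_resolution parameters are named in this document; the symmetry_file keyword and the helper name build_command are assumptions made for illustration, so check the keyword against the parameter listing of your installed version before relying on it.

```python
import os
import shutil
import subprocess


def build_command(data_file, symmetry_file, n_bins=10, high_resolution=None):
    """Assemble a phenix.merging_statistics command line as an argument list."""
    cmd = [
        "phenix.merging_statistics",
        data_file,
        # ASSUMPTION: keyword name for supplying symmetry from another file.
        "symmetry_file=%s" % symmetry_file,
        "n_bins=%d" % n_bins,
    ]
    if high_resolution is not None:
        cmd.append("high_resolution=%g" % high_resolution)
    return cmd


if __name__ == "__main__":
    # $PHENIX is expected to point at the Phenix installation directory.
    example_dir = os.path.join(os.environ.get("PHENIX", ""), "examples", "p9-sad")
    cmd = build_command(
        os.path.join(example_dir, "p9_se_w2.sca"),
        os.path.join(example_dir, "p9.pdb"),
    )
    print(" ".join(cmd))
    # Only attempt to run the program when it is actually installed.
    if shutil.which(cmd[0]):
        subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if the installation path contains spaces.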

Xtriage and phenix.merging_statistics

phenix.merging_statistics is a simple program to calculate the standard statistics output by most data processing software, such as mean(I/sigma), redundancy, R-merge, R-meas (Diederichs & Karplus 1997), R-pim (Weiss 2001), CC1/2, and others. The input may be in any format, including unmerged XDS, unmerged Scalepack, multi-batch MTZ, or SHELX (the latter will only work on the command line, however). The number of resolution bins is set to 10 by default, but is adjustable by the n_bins parameter. All data will be used unless the high_resolution or low_resolution parameters are set; the program does not attempt to identify an appropriate high-resolution cutoff on its own. Both a command-line tool and a GUI are available. The output in either case is a table of statistics by resolution bin:

Resolution: 28.53 - 1.75
Observations: 145453
Unique reflections: 21557
Redundancy: 6.7
Completeness: 99.92%
Mean intensity: 18672.0
Mean I/sigma(I): 14.9
R-merge: 0.073
R-meas:  0.079
R-pim:   0.030

Redundancies (non-anomalous):
  1 : 50
  2 : 163
  3 : 597
  4 : 1766
  5 : 1869
  6 : 3731
  7 : 5512
  8 : 5723
  9 : 1526
  10 : 490
  11 : 74
  12 : 53
  13 : 3

  Statistics by resolution bin:
 d_max  d_min   #obs  #uniq   mult.  %comp       <I>  <I/sI>  r_mrg r_meas  r_pim  cc1/2
 28.53   3.77  15699   2254    6.96  99.87   78997.8    23.4  0.061  0.066  0.025  0.997
  3.77   2.99  15703   2182    7.20  99.95   47400.1    23.1  0.061  0.066  0.024  0.997
  2.99   2.61  15641   2172    7.20 100.00   17930.9    21.1  0.074  0.080  0.030  0.996
  2.61   2.37  15309   2138    7.16 100.00   10520.1    18.6  0.090  0.097  0.036  0.995
  2.37   2.21  15044   2146    7.01  99.95    9103.8    17.2  0.093  0.101  0.038  0.995
  2.20   2.07  14571   2145    6.79 100.00    6560.2    13.5  0.108  0.117  0.045  0.993
  2.07   1.97  13973   2135    6.54 100.00    5016.1    10.8  0.121  0.131  0.051  0.992
  1.97   1.89  13540   2141    6.32 100.00    3620.6     8.6  0.145  0.158  0.062  0.984
  1.88   1.81  13010   2104    6.18  99.95    2070.5     6.8  0.197  0.215  0.085  0.980
  1.81   1.75  12963   2140    6.06  99.49    1477.4     5.6  0.247  0.270  0.108  0.970
 28.53   1.75 145453  21557    6.75  99.92   18672.0    14.9  0.073  0.079  0.030  0.998
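The redundancy histogram and the summary lines in the output above are consistent with one another, which makes a quick sanity check when reading such a log. The Python sketch below is not part of Phenix, and its names are illustrative. It recomputes the number of unique reflections, the number of observations, and the mean multiplicity from the histogram alone.

```python
# Histogram copied from the example output above:
# multiplicity -> number of unique reflections measured that many times.
redundancy_counts = {
    1: 50, 2: 163, 3: 597, 4: 1766, 5: 1869, 6: 3731, 7: 5512,
    8: 5723, 9: 1526, 10: 490, 11: 74, 12: 53, 13: 3,
}


def summarize(counts):
    """Return (unique reflections, total observations, mean multiplicity)."""
    n_unique = sum(counts.values())
    n_obs = sum(mult * n for mult, n in counts.items())
    return n_unique, n_obs, n_obs / n_unique


n_unique, n_obs, mean_mult = summarize(redundancy_counts)
print(n_unique, n_obs, round(mean_mult, 2))  # 21557 145453 6.75
```

The totals match the "Unique reflections" and "Observations" lines, and the ratio matches the overall multiplicity in the last row of the table.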

In the GUI version, plots are also available.

Xtriage (Zwart et al. 2005) will also calculate these statistics if provided with unmerged intensities; in the GUI, these will appear in an extra tab before the data quality analyses.

phenix.cc_star

This is an implementation of the method outlined in Karplus & Diederichs (2012) for determining the optimal resolution cutoff, as an alternative to traditional criteria such as limits on R-merge or I/sigma. It is essentially a combination of phenix.merging_statistics with the recalculation of R-factors for the model. The statistic CC* is an estimate of the "true" CC of the data under examination to the (unknown) true intensities. This places it on the same scale as CC(work) and CC(free) for the model and data, and the relationship between these statistics can be used as a guide for truncating the data.
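Karplus & Diederichs (2012) derive CC* analytically from CC1/2 as CC* = sqrt(2 CC1/2 / (1 + CC1/2)). The short Python sketch below illustrates that relationship only; it is not the phenix.cc_star code, and the function name is chosen here for illustration. The input value is the CC1/2 of the highest-resolution bin in the example output of the previous section.

```python
from math import sqrt


def cc_star(cc_half):
    """Estimate CC* from CC1/2 using the Karplus & Diederichs (2012) relation."""
    if not 0.0 <= cc_half <= 1.0:
        # A negative CC1/2 would put a negative number under the square root.
        raise ValueError("CC1/2 must lie between 0 and 1 for CC* to be defined")
    return sqrt(2.0 * cc_half / (1.0 + cc_half))


# Highest-resolution bin (1.81 - 1.75) of the example table: CC1/2 = 0.970.
print(round(cc_star(0.970), 3))  # 0.992
```

Because CC* is always at least as large as CC1/2 in this range, a shell with a modest CC1/2 can still correspond to a fairly high CC*.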

Required inputs are unmerged intensities, merged data (either amplitudes or intensities; if the latter, French & Wilson treatment will be performed), R-free flags, and either a pre-calculated F(model) array, which is output by programs like phenix.refine, or the current refined model. The results should be essentially the same in either case, but supplying F(model) directly will run faster. The statistics overlap with those reported by the merging program, but with the addition of R-factors and CCs for model/data agreement.

phenix.table_one

This utility generates the standard Table 1 of crystallographic statistics found in most structure publications. If unmerged data are included along with the merged data and R-free flags, the intensity merging statistics will be calculated from these instead of parsing the data processing logfiles. We recommend that you use the unmerged data if available, as they are more reliably parsed than logfiles.

Details for individual statistics

Note that the programs listed above display slightly different subsets of these statistics.

Mean(I/sigmaI): the signal-to-noise ratio. Traditionally the data are often cut off at the resolution where this statistic declines to 2, but this may exclude genuine measurements.

Redundancy or Multiplicity: the average number of observations of each reflection.

R-merge or R-symm: the simple merging R-factor for the multiple observations. Ideally this should be as low as possible, but for reasons discussed in the listed references, using it as a resolution cutoff is problematic. (Note that it will always increase significantly in the high-resolution shells.)

R-meas or R-rim: an alternative to R-merge which is intended to be independent of redundancy. See Diederichs & Karplus (1997) for explanation.

R-pim: another alternative to R-merge, intended to show the precision of the measurements. See Weiss (2001).

CC1/2: the correlation of one half of the observations (chosen at random, but with approximately even distribution for each reflection) to the other half. A similar statistic is used in electron microscopy; see Karplus & Diederichs (2012) for more information.

CC*: a modification of CC1/2, intended to show the true correlation of the observed data to the unknown "true" intensities. Used in conjunction with CC(work) and CC(free). See Karplus & Diederichs (2012).

CC(work) and CC(free): correlation of the observed intensities to the model-based intensities, for the work and test sets respectively. Should be on the same scale as CC*.
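For readers who find code easier to follow than prose, the merging R-factors and CC1/2 described above can be sketched in a few lines of Python. This is a minimal illustration using the textbook definitions with unweighted means, not the Phenix implementation: skipping singly measured reflections, assigning half-sets by a simple shuffle, and generating synthetic data are all assumptions made here, and every name is illustrative.

```python
import random
from math import sqrt


def merging_r_factors(observations):
    """observations maps a unique reflection to its list of scaled intensities.
    Returns (R-merge, R-meas, R-pim)."""
    num_merge = num_meas = num_pim = denom = 0.0
    for intensities in observations.values():
        n = len(intensities)
        if n < 2:
            continue  # assumption: singly measured reflections are skipped
        mean_i = sum(intensities) / n
        abs_dev = sum(abs(i - mean_i) for i in intensities)
        num_merge += abs_dev
        num_meas += sqrt(n / (n - 1)) * abs_dev   # multiplicity-corrected
        num_pim += sqrt(1 / (n - 1)) * abs_dev    # precision of the merged mean
        denom += sum(intensities)
    return num_merge / denom, num_meas / denom, num_pim / denom


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cc_half(observations, seed=0):
    """Split each reflection's observations randomly into two halves,
    average each half, and correlate the two sets of half-means."""
    rng = random.Random(seed)
    half1, half2 = [], []
    for intensities in observations.values():
        if len(intensities) < 2:
            continue
        shuffled = list(intensities)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half1.append(sum(shuffled[:mid]) / mid)
        half2.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    return pearson(half1, half2)


# Synthetic data: 200 reflections, 6 noisy observations each (sigma = 100).
rng = random.Random(1)
toy = {}
for idx in range(200):
    true_i = rng.expovariate(1 / 1000.0)
    toy[idx] = [rng.gauss(true_i, 100.0) for _ in range(6)]

r_merge, r_meas, r_pim = merging_r_factors(toy)
print("R-merge %.3f  R-meas %.3f  R-pim %.3f  CC1/2 %.3f"
      % (r_merge, r_meas, r_pim, cc_half(toy)))
```

With more than two observations per reflection, R-pim comes out below R-merge and R-meas above it, which is the same ordering seen in the example table earlier in this document.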

References

Diederichs K, Karplus PA. Improved R-factors for diffraction data analysis in macromolecular crystallography. Nature Struct. Biol. (1997) 4:269-275.

Evans PR. An introduction to data reduction: space-group determination, scaling and intensity statistics. Acta Cryst. (2011) D67:282-292.

Karplus PA, Diederichs K. Linking crystallographic model and data quality. Science (2012) 336:1030-1033.

Weiss MS. Global indicators of data quality. J. Appl. Cryst. (2001) 34:130-135.

Zwart PH, Grosse-Kunstleve RW, Adams PD. Xtriage and Fest: automatic assessment of X-ray data and substructure structure factor estimation. CCP4 Newsletter (2005) Winter, Contribution 7.