[phenixbb] Appropriate number of reflections for FreeR

Sun Nov 3 02:31:22 PST 2013

When we were developing the ML refinement targets for CNS, we decided that about 1000 reflections was usually enough to get a reasonable sigmaA curve, which (as Pavel points out) will have a great influence on the quality of the refinement. How many reflections you need actually depends on how good the model is: the likelihood function used to determine the sigmaA values is pretty flat when the true values of sigmaA in the curve are close to zero (i.e. a very poor model), but gets much more sensitive as the true values of sigmaA rise. This implies that you should actually be able to reduce the number of cross-validation reflections towards the end of refinement, if you don't have a lot to spare (e.g. at relatively low resolution). If we accept that 1000 is probably enough, then it makes sense that more than 2000 is probably a waste.

The other argument for how many you need would be in terms of the precision of the R-free value you get, which is of course the other purpose of the cross-validation data.  I don't remember the numbers from Ian Tickle's papers off the top of my head, but 1000-2000 is probably enough to get a number that you can base decisions on.  
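
To give a feel for how the precision argument scales, here is a minimal Python sketch. It assumes, purely for illustration, that the relative uncertainty of R-free falls off as 1/sqrt(N) with the number N of test reflections. That scaling and the function name are assumptions of this sketch, not numbers taken from Ian Tickle's papers.

```python
import math


def relative_rfree_uncertainty(n_free):
    """Relative uncertainty of R-free under an ASSUMED 1/sqrt(N) scaling."""
    if n_free <= 0:
        raise ValueError("n_free must be positive")
    return 1.0 / math.sqrt(n_free)


# Compare a few test-set sizes; R-free = 0.25 is an arbitrary example value.
for n_free in (500, 1000, 2000, 10000):
    rel = relative_rfree_uncertainty(n_free)
    print(f"{n_free:6d} test reflections: ~{100 * rel:.1f}% relative, "
          f"~{0.25 * rel:.4f} absolute at R-free = 0.25")
```

Under that assumption, doubling the test set from 1000 to 2000 reflections only tightens the estimate by a factor of about 1.4, which is the sense in which returns diminish.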

Pavel has expressed a different opinion, and this is probably based on a difference between the way we implemented the ML targets in CNS and what Pavel has done in phenix.refine.  In CNS, the sigmaA curve was restrained to be smooth, so there didn't have to be enough reflections in each resolution bin to independently estimate a value for sigmaA in that bin, whereas the resolution shells are treated independently in phenix.refine (which uses alpha and beta parameters, but this is ultimately equivalent to estimating sigmaA).  I would agree that, even if the model is pretty good so that the true values of sigmaA are significantly greater than zero, you wouldn't want to have fewer than 50 reflections in a resolution bin, so limiting yourself to a fixed number of reflections would limit you to a fixed number of bins, and would thus limit how finely the variation of quality with resolution can be modeled.
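
To put rough numbers on that last point: with a floor of 50 test reflections per bin, a fixed-size test set caps the number of resolution bins. A minimal Python sketch follows; the function name and the equal-population binning are assumptions made for illustration, not how CNS or phenix.refine actually bin the data.

```python
def max_resolution_bins(n_free, min_per_bin=50):
    """Largest number of equally populated bins holding >= min_per_bin each."""
    if min_per_bin <= 0:
        raise ValueError("min_per_bin must be positive")
    return n_free // min_per_bin


# How many bins a given test-set size can support at 50 reflections per bin.
for n_free in (500, 1000, 2000):
    print(n_free, "test reflections ->", max_resolution_bins(n_free), "bins")
```

So a cap of 2000 test reflections allows at most 40 bins at this floor, however many reflections the full data set contains.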

There's probably room for someone to do a useful study on the quality of sigmaA curves that looks at a variety of resolution limits and unit cell sizes, because some more sophisticated rules of thumb than these two extremes (10% of the data vs a total of 2000 reflections) might well be better, at least for methods that apply some smoothness restraint or functional form to the curve.
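
For reference, the two extremes can be compared with a few lines of Python. This is only a sketch of the rule as described in this thread (10% of the data or 2000 reflections, whichever is smaller); the function name and parameters are hypothetical and this is not the actual Phenix code.

```python
def free_set_size(n_total, fraction=0.10, cap=2000):
    """Test-set size under a 'fraction of the data, capped' rule (illustrative)."""
    return min(round(n_total * fraction), cap)


# Show where the cap takes over from the 10% rule for a few data-set sizes.
for n_total in (5000, 20000, 100000):
    n_free = free_set_size(n_total)
    print(f"{n_total:7d} reflections -> {n_free:5d} test reflections "
          f"({100 * n_free / n_total:.0f}% of the data)")
```

With 100,000 reflections the cap gives a 2% test set, which is the case Mark describes below; for data sets of up to 20,000 reflections the 10% rule applies unchanged.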

Best wishes,

Randy Read

-----
Randy J. Read
Department of Haematology, University of Cambridge
Cambridge Institute for Medical Research    Tel: +44 1223 336500
Wellcome Trust/MRC Building                         Fax: +44 1223 336827
Hills Road                                                            E-mail: rjr27 at cam.ac.uk
Cambridge CB2 0XY, U.K.                               www-structmed.cimr.cam.ac.uk

On 1 Nov 2013, at 23:38, Mark van Raaij <mjvanraaij at cnb.csic.es> wrote:

> the limit of 2000 reflections I guess is just because it would be a waste to "throw away" more reflections from refinement, once the statistical minimum for calculating a reliable Rfree has been met. I.e. if you have 100,000 reflections, it would be a waste to use 5 or 10% of the reflections instead of just 2%. I'd rather use as many reflections as possible for refinement.
> 
> On 31 Oct 2013, at 20:21, Pavel Afonine wrote:
> 
>> Hi Joe,
>> 
>> flags should be selected such that there are enough of them in each relatively thin resolution shell (thin enough so ML target parameters can be considered constant in each such shell). The lower end is about 50 reflections per shell. All in all this usually translates into about 10% overall.
>> 
>> Yes, there is a limit parameter set to 2000 by default. I don't know what the rationale for having it is; maybe someone can explain.
>> 
>> Pavel
>> 
>> On 10/31/13 12:10 PM, Joseph Noel wrote:
>>> Hi All. I think I have asked this before but forgot. Old age. What is
>>> the appropriate number / percentage of reflections to flag for a
>>> statistically appropriate Free-R calculation? If I am correct, the
>>> reflection file editor in Phenix chooses by default either 10% of the
>>> measured reflections or 2000 whichever comes first.
>>> ______________________________________________________________________________________
>>> Joseph P. Noel, Ph.D.
>>> Arthur and Julie Woodrow Chair
>>> Investigator, Howard Hughes Medical Institute
>>> Professor, The Jack H. Skirball Center for Chemical Biology and Proteomics
>>> The Salk Institute for Biological Studies
>>> 10010 North Torrey Pines Road
>>> La Jolla, CA  92037 USA
>>> 
>>> Phone: (858) 453-4100 extension 1442
>>> Cell: (858) 349-4700
>>> Fax: (858) 597-0855
>>> E-mail: noel at salk.edu
>>> 
>>> Publications & Citations:
>>> http://scholar.google.com/citations?user=xiL1lscAAAAJ
>>> 
>>> Homepage Salk: http://www.salk.edu/faculty/noel.html
>>> Homepage HHMI: http://hhmi.org/research/investigators/noel.html
>>> ______________________________________________________________________________________
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> phenixbb mailing list
>>> phenixbb at phenix-online.org
>>> http://phenix-online.org/mailman/listinfo/phenixbb
>>> 