[phenixbb] Cross-validation when test set is miniscule

Fri Dec 19 08:38:09 PST 2014

Hi Derek,

Choosing 5% for the free set is not dogma. I always use 10%, and that's 
what CNS did for years. In your case this gives about 230 reflections. 
Not a whole lot, but better than the roughly 100 you get at 5%.
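
For reference, the test-set sizes can be checked with a couple of lines 
of Python. This is only a sketch of the arithmetic, taking the figure 
of 2300 reflections from the quoted message below; the function name 
is made up for illustration.

```python
def free_set_size(n_reflections, percent):
    """Number of reflections in a test set holding `percent` % of the data."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return n_reflections * percent // 100

n_total = 2300  # total reflections, as quoted in the original message
size_5 = free_set_size(n_total, 5)
size_10 = free_set_size(n_total, 10)
print(size_5, size_10)
```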

You can generate several (say 10-50) different test sets and 
independently refine the model against each of them (from the very 
beginning). Then note the differences (in the model and in the 
R-factors). Those differences estimate the uncertainty that comes from 
the choice of test set.
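
The bookkeeping for this can be sketched in plain Python. This is a 
minimal illustration and not a Phenix procedure: the function names, 
the seed and the use of reflection indices 0..N-1 are all assumptions 
made for the example, and the refinements themselves still have to be 
run separately for each test set.

```python
import random
import statistics

def make_test_sets(n_reflections, fraction=0.10, n_sets=10, seed=0):
    """Return n_sets independently drawn random test sets.

    Each test set is a set of reflection indices in range(n_reflections).
    The sets are drawn independently, so they may overlap each other.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    n_free = round(n_reflections * fraction)
    return [set(rng.sample(range(n_reflections), n_free))
            for _ in range(n_sets)]

def spread(values):
    """Mean and sample standard deviation of one statistic (e.g. Rfree)
    collected from the independent refinements."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical usage: 10 test sets of 10% each for 2300 reflections.
test_sets = make_test_sets(2300, fraction=0.10, n_sets=10)
```

After refining once per test set, collect the resulting Rfree values in 
a list and pass it to spread(); the standard deviation gives a rough 
idea of how much Rfree depends on the choice of test set alone.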

I realize it may be tedious to do 10-50 refinements for each model 
parametrization and refinement strategy that you want to test. In that 
case I would simply narrow the choices down to the most reasonable ones 
given the resolution and model quality:

- Use individual B-factor refinement. With the type of restraints we 
have, this is OK in most cases. Switch to group B refinement only if 
you have strong reasons to believe that individual B refinement isn't 
good for your case.
- Use torsion NCS.
- Use Ramachandran plot restraints only to keep (preserve) good 
conformations during refinement, not to fix bad ones (outliers). That 
is: in case of an outlier, fix it manually first, then refine with 
Ramachandran restraints so that it does not become an outlier again.
- If you have a higher resolution good model, you can use it as a 
reference model, if needed.

In the future we will investigate ideas recently published in Acta D 
that suggest ways to overcome the problem of test sets that are too 
small.

Pavel

On 12/19/14 3:18 AM, Derek Logan wrote:
> Hi everyone,
>
> Right now we have one of those very difficult Rfree situations where 
> it's impossible to generate a single meaningful Rfree set. Since we're 
> in a bit of a hurry with this structure it would be good if someone 
> could point me in the right direction. We have crystals with 1542 
> non-H atoms in the asymmetric unit that diffract to only 3.6 Å in P65, 
> which gives us a whopping 2300 reflections in total. 5% of this is 
> only about 100 reflections. Luckily the protein is only a single point 
> mutation of a wild type that has been solved to much better 
> resolution, so we know what it should look like and I simply want to 
> investigate the effect of different levels of conservatism in the 
> refinement, e.g. NCS in xyz and B, group B-factors, reference model, 
> Ramachandran restraints etc. However since the quality criterion for 
> this is Rfree I'm not able to do this.
>
> I believe the correct approach is k-fold statistical cross-validation, 
> but can someone remind me of the correct way to do this? I've done a 
> bit of Googling without finding anything very helpful.
>
> Thanks
> Derek
> ________________________________________________________________________
> Derek Logan       tel: +46 46 222 1443
> Associate Professor mob: +46 76 8585 707
> Dept. of Biochemistry and Structural Biology www.cmps.lu.se 
> <http://www.cmps.lu.se>
> Centre for Molecular Protein Science  www.maxlab.lu.se/crystal
> Lund University, Box 124, 221 00 Lund, Sweden           www.saromics.com
>
> _______________________________________________
> phenixbb mailing list
> phenixbb at phenix-online.org
> http://phenix-online.org/mailman/listinfo/phenixbb
