Python-based Hierarchical ENvironment for Integrated Xtallography
Sequence assignment and linkage of neighboring segments with assign_sequence
With phenix.assign_sequence you can now carry out an improved sequence assignment of a model that you have already built. Once the sequence has been assigned, the method uses the sequence and proximity to identify chains that should be connected, and it connects those that have the appropriate relationships using the new loop libraries available in phenix.fit_loops. As a result, you may be able to obtain a more complete model, with more chains assigned to sequence, than previously.
assign_sequence is a command-line tool for reanalyzing the resolve sequence assignment for a model and a map. The reanalysis takes into account non-crystallographic symmetry (NCS), the exclusion of sequence by previously assigned regions, and the requirement for plausible distances and geometries between the ends of fragments with assigned sequences. Additionally, assign_sequence uses the fit_loops loop library to connect segments that are separated by a short loop.
Note: assign_sequence is designed to be used after resolve model-building in which residues that are not assigned to sequence are given residue numbers higher than any residue in the input sequence file. If you input a model not built by resolve or in phenix, or if you would like to completely redo the sequence assignment for your model, be sure to set "allow_fixed_segments=False".
How assign_sequence works:
The starting point for assign_sequence is a set of segments of structure read in from the input model. assign_sequence then uses resolve to calculate the compatibility of each possible side chain with each residue in each segment. Then assign_sequence tests out possible combinations of alignments of all the segments in the input model and chooses the set of alignments that is most compatible with the density map, the number of NCS copies, and with the geometries and distances between ends of the segments.
assign_sequence uses the side-chain-to-map compatibility matrix calculated by resolve to assess the relative probabilities of each possible side chain at each position in the input model. Segments that are positively assigned to a sequence by resolve are (by default) maintained and used as anchors for further sequence assignment. Every other segment has a relative probability associated with each possible alignment of the segment to the input sequence. The score for each alignment is the logarithm of this probability (essentially a log-likelihood, or LL, score).
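The alignment score described above can be illustrated with a small sketch. This is not the assign_sequence implementation. It assumes that the probability of an alignment is the product of independent per-position side-chain probabilities, so that the LL score is a sum of logarithms. The function name and the dictionary layout of the probability matrix are hypothetical.

```python
import math

def alignment_log_likelihoods(side_chain_probs, sequence):
    """Return {offset: LL} for every placement of one segment on the sequence.

    side_chain_probs[i][aa] is a hypothetical relative probability that
    position i of the segment carries side chain aa (one-letter code).
    Assumption: positions are independent, so the LL is a sum of logs.
    """
    n = len(side_chain_probs)
    scores = {}
    for offset in range(len(sequence) - n + 1):
        ll = 0.0
        for i, probs in enumerate(side_chain_probs):
            p = probs.get(sequence[offset + i], 0.0)
            if p <= 0.0:
                # A side chain with zero probability rules out this alignment.
                ll = float("-inf")
                break
            ll += math.log(p)
        scores[offset] = ll
    return scores

# Toy two-residue segment scored against a four-residue sequence.
toy_probs = [
    {"A": 0.7, "G": 0.2, "L": 0.1},
    {"A": 0.1, "G": 0.1, "L": 0.8},
]
toy_scores = alignment_log_likelihoods(toy_probs, "GALA")
best_offset = max(toy_scores, key=toy_scores.get)
```

In this toy case the highest-scoring placement is the one that aligns the segment with the residues "AL" of the sequence.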
Any pair of segments with some assignment of sequence to each segment has an additional score corresponding to the plausibility of a connection of the expected length existing between the segments. If the distance between ends is greater than can be bridged by the number of residues separating them, then the connection is not possible. If the connection is possible, it is scored based on the best density fit (CC) of a loop from the fit_loops loop library. This additional score is normally 10*CC.
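A minimal sketch of this connection test follows. The 10*CC weighting comes from the text above. The maximum span of 3.8 Angstrom per residue (a typical CA-CA spacing) and the count of one extra link beyond the number of gap residues are assumptions made for illustration, not values taken from assign_sequence, and the function name is hypothetical.

```python
import math

# Assumed maximum distance spanned per residue (typical CA-CA spacing).
# This value is an illustration, not a parameter read from assign_sequence.
MAX_SPAN_PER_RESIDUE = 3.8  # Angstrom

def connection_score(end_a, end_b, n_gap_residues, loop_cc, cc_weight=10.0):
    """Score a possible connection between two segment ends.

    end_a, end_b   -- (x, y, z) coordinates of the two ends to be joined
    n_gap_residues -- residues missing between the ends, from the sequence
    loop_cc        -- best density correlation of a candidate loop
    Returns None when the gap cannot be bridged, else cc_weight * loop_cc.
    """
    distance = math.dist(end_a, end_b)
    # n missing residues give n + 1 links between the two existing ends.
    max_bridge = (n_gap_residues + 1) * MAX_SPAN_PER_RESIDUE
    if distance > max_bridge:
        return None
    return cc_weight * loop_cc

possible = connection_score((0.0, 0.0, 0.0), (10.0, 0.0, 0.0), 2, 0.6)
impossible = connection_score((0.0, 0.0, 0.0), (20.0, 0.0, 0.0), 2, 0.6)
```

Returning None for an impossible connection keeps it distinct from a possible connection with a poor density fit, which would receive a low but finite score.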
Generating sequence alignments and connectivities
assign_sequence starts with the segments with the most convincing assignments of sequence. Often these are those with sequence positively assigned by resolve; otherwise they are those with the highest-probability assignments. This yields a starting arrangement (sequence assignment for a set of segments). Then each possible sequence assignment of each unassigned segment is tested for compatibility with the existing arrangement and the one that is most compatible (based on the connections that would result, duplication of sequence, and sequence-map matching) is added to the arrangement. Optionally many arrangements can be built up in parallel, but often a very good one can be found simply by taking the top one at each step. This process is repeated until no additional segments can be added to the arrangement to yield an increase in log-likelihood score of (by default) 2 or greater.
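The step-by-step build-up described above, in its simplest form (keeping only the top arrangement at each step), can be sketched as a greedy loop. This is an illustration under simplifying assumptions, not the program's code: the function names, the placement tuples, and the toy gain function (which only rejects overlapping sequence ranges) are all hypothetical. The default threshold of 2 matches the default score gain quoted in the text.

```python
def build_arrangement(start, candidates, score_gain, min_gain=2.0):
    """Greedily extend an arrangement of (segment_id, placement) pairs.

    start      -- list of (segment_id, placement) already accepted
    candidates -- {segment_id: [placement, ...]} for unassigned segments
    score_gain -- callable(arrangement, segment_id, placement) -> LL gain
    Stops when no remaining placement gains at least min_gain.
    """
    arrangement = list(start)
    remaining = dict(candidates)
    while remaining:
        best = None
        for seg_id, placements in remaining.items():
            for placement in placements:
                gain = score_gain(arrangement, seg_id, placement)
                if best is None or gain > best[0]:
                    best = (gain, seg_id, placement)
        if best is None or best[0] < min_gain:
            break
        arrangement.append((best[1], best[2]))
        del remaining[best[1]]
    return arrangement

def toy_gain(arrangement, seg_id, placement):
    """Hypothetical gain: the placement's LL, or -inf if it reuses sequence."""
    first, last, ll = placement
    for _, (other_first, other_last, _) in arrangement:
        if first <= other_last and other_first <= last:
            return float("-inf")
    return ll

# Placements are (first_residue, last_residue, LL) tuples.
start = [("seg1", (1, 20, 15.0))]
candidates = {
    "seg2": [(15, 30, 9.0), (25, 40, 6.0)],
    "seg3": [(45, 50, 1.5)],
}
result = build_arrangement(start, candidates, toy_gain)
```

Here seg2 is added at residues 25-40 because its higher-scoring placement overlaps seg1, and seg3 is left out because its gain of 1.5 is below the threshold.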
assign_sequence builds up a set of possible sequence assignments and connectivities that depends on the expected number of copies in the asymmetric unit of the crystal. If there is only one copy of the molecule in the crystal, then no residues in the sequence can be used more than once in sequence assignment. If there are N copies, then a residue can be used up to N times. If there are multiple copies, then each molecule must be self-consistent, with plausible distances and geometries relating each segment to the next.
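The limit on how often a residue of the sequence can be used is easy to express as a counting check. The sketch below covers only that counting rule and ignores the requirement that each molecule be self-consistent. The function name and the (first, last) placement tuples are hypothetical.

```python
from collections import Counter

def residue_usage_ok(placements, ncs_copies=1):
    """Check that no sequence residue is used more than ncs_copies times.

    placements -- iterable of (first_residue, last_residue) ranges, one per
                  segment, in sequence numbering (inclusive on both ends)
    """
    usage = Counter()
    for first, last in placements:
        usage.update(range(first, last + 1))
    return all(count <= ncs_copies for count in usage.values())

overlapping = [(1, 10), (5, 15)]
one_copy_ok = residue_usage_ok(overlapping, ncs_copies=1)
two_copies_ok = residue_usage_ok(overlapping, ncs_copies=2)
```

With one copy in the asymmetric unit the two overlapping placements conflict; with two copies they are allowed, since each could belong to a different molecule.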
Once a final arrangement is found, including NCS if applicable, all segments that are separated by short loops (typically 0-3 residues) are connected using loops from the fit_loops loop library. This yields longer segments of structure with sequences fully assigned. The resulting model then has side chains added to match the newly-assigned sequence and is written out.
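As a rough illustration of the selection step, the sketch below lists which neighboring segments are close enough in sequence to be joined. It assumes all segments belong to the same copy of the molecule and are described by hypothetical (first, last) residue ranges. The default limit of 3 mirrors the max_loop_lib_gap keyword. The actual loop fitting by fit_loops is not modeled.

```python
def joinable_gaps(segments, max_loop_lib_gap=3):
    """Return (segment, next_segment, gap) for gaps the loop library covers.

    segments -- list of (first_residue, last_residue) ranges for assigned
                segments of one molecule (inclusive residue numbers)
    gap      -- number of residues missing between the two segments
    """
    ordered = sorted(segments)
    pairs = []
    for (f1, l1), (f2, l2) in zip(ordered, ordered[1:]):
        gap = f2 - l1 - 1
        if 0 <= gap <= max_loop_lib_gap:
            pairs.append(((f1, l1), (f2, l2), gap))
    return pairs

joins = joinable_gaps([(1, 10), (12, 20), (30, 40), (21, 25)])
```

In this example the segments ending at residues 10 and 20 are joined to their successors (gaps of 1 and 0 residues), while the 4-residue gap before residue 30 is too long for the loop library.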
Output files from assign_sequence
assign_sequence.pdb: A PDB file with your input model assigned to sequence (to the extent possible). Residues not assigned to sequence will be given a chain ID higher than those assigned, and they will be given residue numbers higher than any residue number in the sequence file.
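The residue-numbering convention makes it straightforward to separate assigned from unassigned residues in the output file. The following sketch assumes that the sequence is numbered from 1, so that the highest residue number in the sequence file equals the sequence length, and it reads the chain ID and residue number from the standard fixed columns of PDB ATOM records. The helper names are hypothetical.

```python
def split_by_assignment(pdb_lines, sequence_length):
    """Split residues of a model into assigned and unassigned sets.

    Assumption: the sequence is numbered from 1, so any residue number
    above sequence_length marks a residue not assigned to sequence.
    Returns two sets of (chain_id, residue_number) tuples.
    """
    assigned, unassigned = set(), set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]        # PDB column 22
        res_seq = int(line[22:26])  # PDB columns 23-26
        target = assigned if res_seq <= sequence_length else unassigned
        target.add((chain_id, res_seq))
    return assigned, unassigned

def toy_atom_line(serial, chain_id, res_seq):
    """Build the leading columns of an ATOM record (for this example only)."""
    return "ATOM  %5d  CA  ALA %s%4d" % (serial, chain_id, res_seq)

toy_lines = [
    toy_atom_line(1, "A", 12),
    toy_atom_line(2, "A", 13),
    toy_atom_line(3, "B", 101),
    toy_atom_line(4, "B", 102),
]
assigned, unassigned = split_by_assignment(toy_lines, sequence_length=100)
```

To apply this to a real file, pass the lines of assign_sequence.pdb (for example from open("assign_sequence.pdb")) together with the length of your sequence.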
Standard run of assign_sequence:
phenix.assign_sequence map_coeffs.mtz coords.pdb sequence.dat
If you want (or need) to specify the column names from your mtz file, you will need to tell assign_sequence what FP and PHIB (and optionally FOM) are, in this format:
phenix.assign_sequence map_coeffs.mtz coords.pdb \
  labin="FP=2FOFCWT PHIB=PH2FOFCWT" sequence.dat
Specific limitations and problems:
List of all assign_sequence keywords
-------------------------------------------------------------------------------
Legend: black bold - scope names
        black - parameter names
        red - parameter values
        blue - parameter help
        blue bold - scope help
Parameter values:
  * means selected parameter (where multiple choices are available)
  False is No
  True is Yes
  None means not provided, not predefined, or left up to the program
  "%3d" is a Python style formatting descriptor
-------------------------------------------------------------------------------
assign_sequence
  input_files
    seq_file= None  File with 1-letter code sequence of molecule. Chains separated by blank line or greater-than sign
    pdb_in= None  Optional starting PDB file (ends will be extended if present)
    mtz_in= None  MTZ file with coefficients for a map
    labin= ""  Labin line for MTZ file with map coefficients. This is optional if assign_sequence can guess the correct coefficients for FP PHI and FOM. Otherwise specify: LABIN FP=myFP PHIB=myPHI FOM=myFOM where myFP is your column label for FP
    prob_file= None  File with sequence probability information from resolve
  output_files
    pdb_out= assign_sequence.pdb  Output PDB file
    log= None  Output log file
    params_out= assign_sequence_params.eff  Parameters file to rerun assign_sequence
  assignment
    ncs_resolution= None  Resolution for NCS identification
    find_ncs= False  Try to find NCS in chains after sequence assignment
    range_to_keep= 4.0  Keep solutions with score within range_to_keep of the maximum
    convincing_score= 2.  Score gain required to keep a sequence assignment
    max_indiv_tries_per_level= None  Number of sequence assignments to consider for each segment (quick default = 1, otherwise 3)
    max_total_tries_per_level= None  Number of sequence arrangements to consider for all additional segments (quick default = 1, otherwise 6)
    max_placements= 100  Number of placements of any segment to consider
    max_final_placements= 10  Number of final arrangements to consider
    max_write= 1  Number of arrangements to write out
    check_ncs_with_offset= True  Check to verify that segments that seem to show NCS are actually different if offset by 1 residue. If protein is just helices then you might need to try check=False
    list_only_complete= False  Only include complete arrangements; ignore those that have arrangements of some segments that are subsequently removed as incompatible
    allow_fixed_segments= True  If True, then input segments with sequences assigned, as identified by sequence numbers less than or equal to the longest segment in the sequence file, are kept fixed.
    replace_side_chains= True  At the end of sequence assignment identify side-chain rotamers and replace existing side-chains
    replace_direct_joins= False  Use fit-loops to rebuild all junctions that are joined flush
    short_segment_length= 12  Definition of a short segment
    max_unassigned_short_segments= 10  Maximum number of segments short_segment_length or fewer residues that are not assigned to sequence to consider in connections. Keeping too many can make the analysis take a very long time.
  directories
    temp_dir= "temp_dir"  Optional temporary work directory
  crystal_info
    resolution= 0.  high-resolution limit for map calculation
    solvent_fraction= 0.5  solvent fraction
    chain_type= *PROTEIN DNA RNA  Chain type (for identifying main-chain and side-chain atoms)
    ncs_copies= 1  NCS copies
  model_building
    dist_max= 20.  Maximum distance ends can be apart to consider for linking
    max_loop_lib_gap= 3  Maximum number of residues in working loop library (This must match loop libraries that are available)
  control
    verbose= False  Verbose output
    quick= True  Run quickly
    raise_sorry= False  Raise sorry if problems
    debug= False  Debugging output
    dry_run= False  Just read in and check parameter names
    resolve_command_list= None  You can supply any resolve command here NOTE: for command-line usage you need to enclose the whole set of commands in double quotes (") and each individual command in single quotes (') like this: resolve_command_list="'no_build' 'b_overall 23' "