phenix_logo
Python-based Hierarchical ENvironment for Integrated Xtallography
Documentation Home
 

Sequence assignment and linkage of neighboring segments with assign_sequence

Author(s)
Purpose
Usage
How assign_sequence works:
Output files from assign_sequence
Examples
Standard run of assign_sequence:
Possible Problems
Specific limitations and problems:
Literature
Additional information
List of all assign_sequence keywords

Author(s)

  • assign_sequence: Tom Terwilliger

Purpose

You can now carry out an improved sequence assignment of a model that you have already built with phenix.assign_sequence. Further, once the sequence has been assigned, this method will use the sequence and proximity to identify chains that should be connected, and it will connect those that have the appropriate relationships using the new loop libraries available in phenix.fit_loops. The result is that you may be able to obtain a more complete model with more chains assigned to sequence than previously.

assign_sequence is a command line tool for reanalyzing resolve sequence assignment for a model and a map including the non-crystallographic symmetry, exclusion of sequence by previously-assigned regions, and requirement for plausible distances and geometries between ends of fragments with assigned sequences. Additionally assign_sequence will use the fit_loops loop library to connect segments that are separated by a short loop.

Note: assign_sequence is designed to be used after resolve model-building in which residues that are not assigned to sequence are given residue numbers higher than any residue in the input sequence file. If you input a model not built by resolve or in phenix, or if you would like to completely redo the sequence assignment for your model, be sure to set "allow_fixed_segments=False".

NOTE: assign_sequence is normally called from phenix.phase_and_build but you can run it interactively if you want.

Usage

How assign_sequence works:

The starting point for assign_sequence is a set of segments of structure read in from the input model. assign_sequence then uses resolve to calculate the compatibility of each possible side chain with each residue in each segment. Then assign_sequence tests out possible combinations of alignments of all the segments in the input model and chooses the set of alignments that is most compatible with the density map, the number of NCS copies, and with the geometries and distances between ends of the segments.

Sequence probabilities:

assign_sequence uses the side-chain to map compatibility matrix calculated by resolve to assess the relative probabilities of each possible side chain at each position in the input model. Segments that are positively assigned to a sequence by resolve are (by default) maintained and used as anchors for further sequence assignment. All other segments have a relative probability associated with each possible alignment of the segment to the input sequence. The score for each alignment is the logarithm of this probability (essentially a log-likelihood LL score).

Connection scores:

Any pair of segments with some assignment of sequence to each segment has an additional score corresponding to the plausibility of a connection of the expected length existing between the segments. If the distance between ends is greater than can be bridged by the number of residues separating them, then the connection is not possible. If the connection is possible, it is scored based on the best density fit (CC) of a loop from the fit_loops loop library. This additional score is normally 10*CC.

Generating sequence alignments and connectivities

assign_sequence starts with the segments with the most convincing assignments of sequence. Often these are those with sequence positively assigned by resolve; otherwise they are those with the highest-probability assignments. This yields a starting arrangement (sequence assignment for a set of segments). Then each possible sequence assignment of each unassigned segment is tested for compatibility with the existing arrangement and the one that is most compatible (based on the connections that would result, duplication of sequence, and sequence-map matching) is added to the arrangement. Optionally many arrangements can be built up in parallel, but often a very good one can be found simply by taking the top one at each step. This process is repeated until no additional segments can be added to the arrangement to yield an increase in log-likelihood score of (by default) 2 or greater.

NCS copies:

assign_sequence builds up a set of possible sequence assignments and connectivities that depends on the expected number of copies in the asymmetric unit of the crystal. If there is only one copy of the molecule in the crystal, then no residues in the sequence can be used more than once in sequence assignment. If there are N copies, then a residue can be used up to N times. If there are multiple copies, then each molecule must be self-consistent, with plausible distances and geometries relating each segment to the next.

Connecting segments:

Once a final arrangement is found, including NCS if applicable, all segments that are separated by short loops (typically 0-3 residues) are connected using loops from the fit_loops loop library. This yields longer segments of structure with sequences fully assigned. The resulting model then has side chains added to match the newly-assigned sequence and is written out.




Output files from assign_sequence

assign_sequence.pdb: A PDB file with your input model assigned to sequence (to the extent possible). Residues not assigned to sequence will be given a chain ID higher than those assigned, and they will be given residue numbers higher than any residue number in the sequence file.

Examples

Standard run of assign_sequence:

Running assign_sequence is easy. From the command-line you can type:

phenix.assign_sequence map_coeffs.mtz coords.pdb sequence.dat
If you want (or need) to specify the column names from your mtz file, you will need to tell assign_sequence what FP and PHIB (and optionally FOM) are, in this format:
phenix.assign_sequence map_coeffs.mtz coords.pdb \
labin="FP=2FOFCWT PHIB=PH2FOFCWT" sequence.dat

Possible Problems

Specific limitations and problems:

Literature

Additional information

List of all assign_sequence keywords

------------------------------------------------------------------------------- 
Legend: black bold - scope names
        black - parameter names
        red - parameter values
        blue - parameter help
        blue bold - scope help
        Parameter values:
          * means selected parameter (where multiple choices are available)
          False is No
          True is Yes
          None means not provided, not predefined, or left up to the program
          "%3d" is a Python style formatting descriptor
------------------------------------------------------------------------------- 
assign_sequence
   input_files
      seq_file= None File with 1-letter code sequence of molecule. Chains
                separated by blank line or greater-than sign
      pdb_in= None Optional starting PDB file (ends will be extended if
              present)
      mtz_in= None MTZ file with coefficients for a map
      map_coeff_labels= None If map coefficients cannot be identified
                        automatically from your MTZ file, you can specify the
                        label or labels for them. (Please separate labels with
                        blank space, MTZ columns grouped together separated by
                        commas with no blanks.) You can specify:
                        map_coeff_labels (e.g., FWT,PHIFWT) amplitudes and
                        phases (e.g., FP,SIGFP PHIB) or amplitudes, phases,
                        weights (e.g., FP,SIGFP PHIB FOM)
      labin= "" For backward compatibility only. Use instead map_coeff_labels.
             Labin line for MTZ file with map coefficients. This is optional
             if assign_sequence can guess the correct coefficients for FP PHIB
             . Otherwise specify: LABIN FP=myFP PHIB=myPHI where myFP is your
             column label for FP. NOTE: myFP and myPHI must be adjacent in the
             mtz file.
      prob_file= None File with sequence probability information from resolve
      linkage_file= None File with linkage information from combine_models
                    Must have been run with the same input pdb file and the
                    same value of min_segment_length
      loop_dict_file= None File with loop information from previous run as
                      pickle file
      pair_object_dict_file= None pair object dict as pickle file
      checked_connections_file= None checked connections dict as pickle file
      density_removed_mtz_in= None MTZ file with density_removed coefficients
                              for a map
      comparison_model= None Comparison model (normally just for testing)
   output_files
      pdb_out= assign_sequence.pdb Output PDB file
      log= None Output log file
      params_out= assign_sequence_params.eff Parameters file to rerun
                  assign_sequence
      output_loop_dict_file= None loop dict as pickle file
      output_pair_object_dict_file= None pair object dict as pickle file
      output_checked_connections_file= None checked connections dict as pickle
                                       file
   assignment
      ncs_resolution= None Resolution for NCS identification
      find_ncs= False Try to find NCS in chains after sequence assignment
      range_to_keep= 4.0 Keep solutions with score within range_to_keep of the
                     maximum
      max_keep= 10 Number of possibilities to keep in optimization
      max_write= 6 Number of possibilities to write out at end
      linkage_score= 10. Score for creating a link between segments
      max_linkage_score= 12. Maximum linkage score attainable
      loop_score= 10. Score for a loop is loop_score*(1.+loop_cc)
      length_mismatch_penalty= 0 Decrease in linkage score if linkage is not
                               correct length
      depth_to_keep= 8.0 In full optimization solutions will be kept with
                     score depth_to_keep + max_linkage_score below the best
      maximum_length_mismatch= 3 Maximum length mismatch in linkages
      min_confidence= 0.9999 Sets required confidence level in a placement of
                      a segment to keep the best one.
      convincing_score= 2. Score gain required to keep a sequence assignment
      starting_convincing_score= 10. Score gain required to keep an initial
                                 sequence assignment
      minimum_length= 4 Minimum length of a segment to place
      min_segment_length= 5 Segments shorter than this will be ignored on
                          read-in.
      max_levels= None Number of segments to consider in building a complete
                  sequence assignment (quick default = 6, otherwise 20)
      max_indiv_tries_per_level= None Number of sequence assignments to
                                 consider for each segment (quick default = 1,
                                 otherwise 3)
      max_total_tries_per_level= None Number of sequence arrangements to
                                 consider for all additional segments (quick
                                 default = 1, otherwise 6)
      max_placements= 100 Number of placements of any segment to consider
      max_final_placements= 20 Number of final arrangements to consider
      check_ncs_with_offset= True Check to verify that segments that seem to
                             show NCS are actually different if offset by 1
                             residue. If protein is just helices then you
                             might need to try check=False
      list_only_complete= False Only include complete arrangements; ignore
                          those that have arrangements of some segments that
                          are subsequently removed as incompatible
      allow_fixed_segments= True If True, then input segments with sequences
                            assigned are kept fixed. If fix_known=False, then
                            assigned segments are identified by sequence
                            numbers less than or equal to the longest segment
                            in the sequence file.
      fix_known= False If True, then all input segments with except those
                 marked as segid=UNK will be considered fixed. (Instead of
                 identifying them based on sequence number in input file).
                 Forces allow_fixed_segments to be True.
      keep_connectivity= False This is very useful if you expect your model to
                         have the same connectivity as your template. If True,
                         then input segments are kept in the order found in
                         the input PDB and kept assigned to the original
                         chains, but their assignments may change otherwise
                         (allowing insertions/deletions). NOTE: If True, then
                         allow_fixed_segments=False unless fix_known=True
      include_chain_u_in_keep_connectivity= False If True, chains with chainID
                                            of 'U' are included in analysis
                                            with keep_connectivity=True
      optimize_arrangements= True Try to optimize arrangements at the end,
                             including removal of uncertain segments
      use_connectivity_in_optimize= True If True and keep_connectivity is True
                                    and optimize_arrangements is True , then
                                    connectivity will be used in chain
                                    optimization
      remove_uncertain_segments= True If True, then remove uncertain segments.
                                 NOTE: at very end of iteration, if any, then
                                 the removal of uncertain segments is
                                 determined by
                                 iteration.remove_uncertain_at_end.
      optimize_sequence_alignment= True If True, then try to align fragments
                                   to template sequence
      minimize_alignment_changes= False If True, then try to keep sequence
                                  numbers matched to original as much as
                                  possible.
      allow_longer_connections= True If a connection segment is available, try
                                with lower scoring also n+1,n+2,n+3
                                connections. This may be a good idea because
                                loops are often built short by one or a few
                                residues.
      replace_side_chains= True At the end of sequence assignment identify
                           side-chain rotamers and replace existing
                           side-chains
      replace_direct_joins= False Use fit-loops to rebuild all junctions that
                            are joined flush
      short_segment_length= 12 Definition of a short segment
      max_loop_length= 8 Maximum length of loop to try to fit
      n_random_loop= 500 Number of loop versions to build
      max_unassigned_short_segments= 20 Maximum number of segments
                                     short_segment_length or fewer residues
                                     that are not assigned to sequence to
                                     consider in connections. Keeping too many
                                     can make the analysis take a very long
                                     time.
      compare_only= False Just compare input model to comparison model
      reset_sequence= False Adjust sequence numbering and sequence of segments
                      in pdb_in based on sequence numbers of matching
                      positions in comparison_model. Cannot be used with
                      compare_only.
      make_unique= None Make segments of input model unique (no overlapping)
                   residues. Normally keep this at None. Used in iteration of
                   assign_sequence.
   iteration
      iterative_assignment= False You can iteratively assign sequence and fit
                            loops This can improve the assignment, but will
                            take longer
      cycles= 3 Cycles of iteration
      start_step= *assign_sequence fit_loops insert_loops get_connections You
                  can decide where to start in the cycle
      end_step= *None assign_sequence fit_loops insert_loops get_connections 
               You can specify last step to carry out in iteration
      skip_step= None You can specify steps to skip in iteration
      assign_sequence_file= None File partially assigned to sequence (output
                            of assign_sequence)
      loops_file= None File with loops to insert
      assign_sequence_insertion_file= None File partially assigned to sequence
                                      with insertions
      connections_file_list= None list of files with possible connections to
                             consider
      only_assign_sequence_on_last_cycle= True Just run assign_sequence step
                                          on last iteration cycle
      build_outside_model= True Build outside model in get_connections
      build_outside_model_once= True Run build outside model a maximum of one
                                time (Run once if build_outside_model=True,
                                never if build_outside_model=False)
      remove_uncertain_at_end= False If True, then remove uncertain segments
                               at very end of iteration (in addition to
                               removing them along the way if
                               remove_uncertain_segments=True )
      connect_all_segments= True Connect all segments sequentially in
                            get_connections
      fill_in_gaps= True After optimize and keep_connectivity, try to fill in
                    gaps
      standard_gap_score= 0. Score for filling default gap
      decrease_tries_with_levels= False Reduce number of tries each level of
                                  search
   directories
      temp_dir= "temp_dir" Optional temporary work directory
      output_dir= "" Optional output directory
   crystal_info
      resolution= 0. high-resolution limit for map calculation
      solvent_fraction= 0.5 solvent fraction
      chain_type= *PROTEIN DNA RNA Chain type (for identifying main-chain and
                  side-chain atoms)
      ncs_copies= 1 NCS copies
   model_building
      dist_max= 20. Maximum distance ends can be apart to consider for linking
      max_loop_lib_gap= 3 Maximum number of residues in working loop library
                        (This must match loop libraries that are available)
   control
      verbose= False Verbose output
      quick= True Run quickly
      raise_sorry= False Raise sorry if problems
      debug= False Debugging output
      dry_run= False Just read in and check parameter names
      nproc= 1 You can specify the number of processors to use
      max_wait_time= 1.0 You can specify the length of time (seconds) to wait
                     when looking for a file. If you have a cluster where jobs
                     do not start right away you may need a longer time to
                     wait. The symptom of too short a wait time is 'File not
                     found'
      wait_between_submit_time= 1.0 You can specify the length of time
                                (seconds) to wait between each job that is
                                submitted when running sub-processes. This can
                                be helpful on NFS-mounted systems when running
                                with multiple processors to avoid file
                                conflicts. The symptom of too short a
                                wait_between_submit_time is File exists:....
      resolve_command_list= None You can supply any resolve command here NOTE:
                            for command-line usage you need to enclose the
                            whole set of commands in double quotes (")
                            and each individual command in single quotes (')
                            like this: resolve_command_list="'no_build'
                            'b_overall 23' "
      background= None run jobs in background or not (if nproc is greater than
                  1) Usually set automatically. If run_command is sh or csh,
                  True
      run_command= "sh " Command for running jobs (e.g., sh or qsub )
   non_user_params
      print_citations= True Print citation information at end of run