data_utils

Utilities for dataset generation and tokenization

Type aliases

SeqRecord: tuple[str, str]
SeqRecords: list[SeqRecord]
GroupwiseSeqRecords: dict[str, SeqRecords]
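
As a rough illustration, the aliases above can be written as plain Python type aliases. The example data (species names, headers, sequences) is hypothetical and only shows the expected shape of each type.

```python
# Type aliases as listed above; a record is a (header, sequence) pair.
SeqRecord = tuple[str, str]
SeqRecords = list[SeqRecord]
GroupwiseSeqRecords = dict[str, SeqRecords]

# Hypothetical example data: two groups (species), each with its own records.
example: GroupwiseSeqRecords = {
    "species_a": [("seq1|species_a", "MKV-A"), ("seq2|species_a", "MRV-A")],
    "species_b": [("seq3|species_b", "MKI-A")],
}
```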


get_single_and_paired_seqs

 get_single_and_paired_seqs (data_group_by_group_x:dict[str,list[tuple[str,str]]],
                             data_group_by_group_y:dict[str,list[tuple[str,str]]],
                             group_names:Optional[collections.abc.Sequence[str]]=None)

Extract single and paired sequences from two groupwise collections of sequence records. The paired sequences are returned as a list of dictionaries whose keys are the concatenated sequences and whose values are the number of times each pair appears in the concatenated MSA.
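
The pair-counting idea can be sketched as follows. This is not the library's implementation: the helper name `count_paired_seqs_sketch` is hypothetical, and the sketch assumes that record i in one collection is paired with record i in the other, and that one dictionary is produced per group.

```python
from collections import Counter

def count_paired_seqs_sketch(groups_x, groups_y):
    """Hypothetical helper: count concatenated (x + y) sequences per group."""
    counts_by_group = []
    for name in groups_x:
        # Assumption: the i-th record of x is paired with the i-th record of y.
        pairs = zip(groups_x[name], groups_y[name])
        counts_by_group.append(
            Counter(seq_x + seq_y for (_, seq_x), (_, seq_y) in pairs)
        )
    return counts_by_group

# Hypothetical data: the same pair occurs twice in group "a".
x = {"a": [("h1", "AC"), ("h2", "AC")]}
y = {"a": [("k1", "GG"), ("k2", "GG")]}
paired_counts = count_paired_seqs_sketch(x, y)
```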



create_groupwise_seq_records

 create_groupwise_seq_records (seq_records:list[tuple[str,str]],
                               group_name_func:<built-in function callable>,
                               remove_groups_with_one_seq:bool=True)

Group records of the form (header, sequence) in a collection by group name (e.g. species name), extracted from header information using group_name_func.
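
A minimal sketch of this grouping, assuming headers of the hypothetical form `id|species`. The helper name `group_records_sketch` and the header format are illustrative assumptions, not the library's code.

```python
from collections import defaultdict

def group_records_sketch(seq_records, group_name_func, remove_groups_with_one_seq=True):
    """Hypothetical helper: group (header, sequence) records by group name."""
    grouped = defaultdict(list)
    for header, seq in seq_records:
        grouped[group_name_func(header)].append((header, seq))
    if remove_groups_with_one_seq:
        # Drop groups that contain a single sequence.
        return {name: recs for name, recs in grouped.items() if len(recs) > 1}
    return dict(grouped)

# Hypothetical records; the group name is taken as the text after the last "|".
records = [("p1|human", "MKV"), ("p2|human", "MRV"), ("p3|mouse", "MKI")]
by_species = group_records_sketch(records, lambda header: header.split("|")[-1])
```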



remove_groups_not_in_both

 remove_groups_not_in_both (data_group_by_group_x:dict[str,list[tuple[str,str]]],
                            data_group_by_group_y:dict[str,list[tuple[str,str]]])

Remove groups that are not present in both input collections.
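
One way to express this filtering is sketched below. The helper name `keep_shared_groups_sketch` is hypothetical, and returning two new dictionaries (rather than modifying the inputs in place) is an assumption.

```python
def keep_shared_groups_sketch(groups_x, groups_y):
    """Hypothetical helper: keep only the groups present in both collections."""
    shared = groups_x.keys() & groups_y.keys()
    # Preserve the group order of the first collection.
    new_x = {name: recs for name, recs in groups_x.items() if name in shared}
    new_y = {name: groups_y[name] for name in new_x}
    return new_x, new_y

# Hypothetical data: only "human" appears in both collections.
x = {"human": [("a1", "MK")], "mouse": [("a2", "MR")]}
y = {"human": [("b1", "TT")], "yeast": [("b2", "TA")]}
shared_x, shared_y = keep_shared_groups_sketch(x, y)
```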



pad_msas_with_dummy_sequences

 pad_msas_with_dummy_sequences (data_group_by_group_x:dict[str,list[tuple[str,str]]],
                                data_group_by_group_y:dict[str,list[tuple[str,str]]],
                                dummy_symbol:str='-')

Pad MSAs with dummy sequences so that all groups/species contain the same number of sequences.
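
The padding can be illustrated with the sketch below, under several assumptions that the description does not confirm: within each group, the smaller of the two collections is padded up to the size of the larger; a dummy sequence is `dummy_symbol` repeated to the alignment length; both inputs share the same non-empty groups. The helper name `pad_groups_sketch` and the `"dummy"` header are hypothetical.

```python
def pad_groups_sketch(groups_x, groups_y, dummy_symbol="-"):
    """Hypothetical helper: equalize per-group sequence counts with dummy rows."""
    padded_x, padded_y = {}, {}
    for name in groups_x:
        recs_x, recs_y = list(groups_x[name]), list(groups_y[name])
        target = max(len(recs_x), len(recs_y))
        for recs in (recs_x, recs_y):
            # Assumption: all sequences in an MSA share the same aligned length.
            length = len(recs[0][1])
            recs.extend([("dummy", dummy_symbol * length)] * (target - len(recs)))
        padded_x[name], padded_y[name] = recs_x, recs_y
    return padded_x, padded_y

# Hypothetical data: group "a" has two x sequences but only one y sequence.
x = {"a": [("h1", "AC"), ("h2", "AG")]}
y = {"a": [("k1", "TTT")]}
padded_x, padded_y = pad_groups_sketch(x, y)
```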



one_hot_encode_msa

 one_hot_encode_msa (seq_records:list[tuple[str,str]],
                     aa_to_int:Optional[dict[str,int]]=None,
                     device:Optional[torch.device]=None)

Given a list of records of the form (header, sequence), assumed to be a parsed MSA, tokenize each sequence and one-hot encode each token. Return a 3D tensor representing the one-hot encoded MSA.
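
The tokenize-then-one-hot step can be sketched without dependencies, using nested lists in place of a torch tensor. The helper name `one_hot_encode_sketch` is hypothetical, and building a default alphabet from the characters present in the input is an assumption; the actual function's default `aa_to_int` mapping is not shown in this page.

```python
def one_hot_encode_sketch(seq_records, aa_to_int=None):
    """Hypothetical helper: one-hot encode an MSA as nested lists
    with shape (num_sequences, alignment_length, num_tokens)."""
    if aa_to_int is None:
        # Assumption: derive the alphabet from the input when none is given.
        alphabet = sorted({ch for _, seq in seq_records for ch in seq})
        aa_to_int = {ch: i for i, ch in enumerate(alphabet)}
    n_tokens = max(aa_to_int.values()) + 1
    encoded = []
    for _, seq in seq_records:
        rows = []
        for ch in seq:
            row = [0] * n_tokens
            row[aa_to_int[ch]] = 1  # set the position of this token
            rows.append(row)
        encoded.append(rows)
    return encoded

# Hypothetical two-sequence MSA over a three-symbol alphabet.
msa = [("s1", "AC-"), ("s2", "A-C")]
one_hot = one_hot_encode_sketch(msa, aa_to_int={"-": 0, "A": 1, "C": 2})
```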



compute_num_correct_pairings

 compute_num_correct_pairings (hard_perms_by_group:list[numpy.ndarray],
                               compare_to_identity_permutation:bool,
                               single_and_paired_seqs:Optional[dict[str,list]]=None)

Compute the total number of correct pairings. ‘Correct’ means that they are present in the original paired MSAs, assumed to be the ground truth.

If compare_to_identity_permutation is True, then the correct pairings are assumed to be given by the identity permutation, and the x_seqs, y_seqs, and xy_seqs arguments are ignored.
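
For the identity-permutation case, counting correct pairings amounts to counting fixed points. The sketch below assumes each permutation is a 1D sequence of indices (plain lists here, though 1D integer arrays would behave the same way); the helper name `count_identity_matches_sketch` is hypothetical.

```python
def count_identity_matches_sketch(hard_perms_by_group):
    """Hypothetical helper: count positions i where perm[i] == i, over all groups."""
    return sum(
        sum(1 for i, j in enumerate(perm) if i == j)
        for perm in hard_perms_by_group
    )

# Hypothetical permutations for three groups:
# all three correct, none correct, and one correct (the middle element).
perms = [[0, 1, 2], [1, 0], [2, 1, 0]]
num_correct = count_identity_matches_sketch(perms)
```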



compute_comparable_group_idxs

 compute_comparable_group_idxs (group_sizes_arr:numpy.ndarray,
                                max_size_ratio:int, max_group_size:int)