data_utils
Type aliases
SeqRecord: tuple[str, str]
SeqRecords: list[SeqRecord]
GroupwiseSeqRecords: dict[str, SeqRecords]
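The aliases nest into one another: a record is a pair of strings, a collection of records is a list, and a groupwise collection maps a name to such a list. The sketch below writes them as plain assignments; that the module defines them exactly this way is an assumption, and the example headers and sequences are made up for illustration.

```python
# Sketch of the three aliases as plain assignments (Python 3.9+).
SeqRecord = tuple[str, str]                  # (header, sequence)
SeqRecords = list[SeqRecord]                 # e.g. one parsed MSA
GroupwiseSeqRecords = dict[str, SeqRecords]  # group name -> records

# Made-up example data, only to show the nesting.
example: GroupwiseSeqRecords = {
    "species_A": [("seq1", "MKV"), ("seq2", "MKL")],
    "species_B": [("seq3", "MRV")],
}

header, sequence = example["species_A"][0]
```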
get_single_and_paired_seqs
get_single_and_paired_seqs (data_group_by_group_x:dict[str,list[tuple[str,str]]], data_group_by_group_y:dict[str,list[tuple[str,str]]], group_names:Optional[collections.abc.Sequence[str]]=None)
Get the single and paired sequences from two collections of sequence records grouped by group name. The paired sequences are returned as a list of dictionaries, where the keys are the concatenated sequences and the values are the number of times that pair appears in the concatenated MSA.
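The pair-counting idea can be sketched with collections.Counter. This is not the library's implementation: the helper name count_pairs_per_group is hypothetical, and the sketch assumes that both collections share the same group names, that record i of x is paired with record i of y within each group, and that one dictionary is produced per group.

```python
from collections import Counter

def count_pairs_per_group(x_by_group, y_by_group):
    """Hypothetical helper: count concatenated x+y sequences in each group."""
    paired = []
    for name in x_by_group:
        x_seqs = [seq for _, seq in x_by_group[name]]
        y_seqs = [seq for _, seq in y_by_group[name]]
        # Assumption: rows are paired by position within the group.
        counts = Counter(x + y for x, y in zip(x_seqs, y_seqs))
        paired.append(dict(counts))
    return paired

# Made-up input data.
x = {"A": [("a1", "MK"), ("a2", "MK")], "B": [("b1", "AA")]}
y = {"A": [("c1", "LV"), ("c2", "LV")], "B": [("d1", "CC")]}
pair_counts = count_pairs_per_group(x, y)
```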
create_groupwise_seq_records
create_groupwise_seq_records (seq_records:list[tuple[str,str]], group_name_func:<built-in function callable>, remove_groups_with_one_seq:bool=True)
Group records of the form (header, sequence) in a collection by group name (e.g. species name), extracted from header information using group_name_func.
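A minimal sketch of this kind of grouping, assuming group_name_func takes the header string and returns the group name, and assuming remove_groups_with_one_seq drops groups that contain exactly one record (inferred from the parameter name). The helper name group_records and the "id|species" header layout are hypothetical.

```python
def group_records(seq_records, group_name_func, remove_groups_with_one_seq=True):
    """Hypothetical helper: bucket (header, sequence) records by group name."""
    grouped = {}
    for header, seq in seq_records:
        grouped.setdefault(group_name_func(header), []).append((header, seq))
    if remove_groups_with_one_seq:
        # Assumption: a group with a single record is discarded.
        grouped = {name: recs for name, recs in grouped.items() if len(recs) > 1}
    return grouped

# Made-up records with a hypothetical "id|species" header layout.
records = [("s1|human", "MK"), ("s2|human", "ML"), ("s3|mouse", "MV")]
species_of = lambda header: header.split("|")[1]

filtered = group_records(records, species_of)
unfiltered = group_records(records, species_of, remove_groups_with_one_seq=False)
```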
remove_groups_not_in_both
remove_groups_not_in_both (data_group_by_group_x:dict[str,list[tuple[str,str]]], data_group_by_group_y:dict[str,list[tuple[str,str]]])
Remove groups that are not present in both input collections.
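One way to express this filtering is a set intersection over the dictionary keys. The sketch below uses a hypothetical helper keep_shared_groups and assumes new dictionaries are returned rather than the inputs being modified in place; the documented function may behave differently in that respect.

```python
def keep_shared_groups(x_by_group, y_by_group):
    """Hypothetical helper: keep only group names present in both inputs."""
    shared = x_by_group.keys() & y_by_group.keys()
    # Keep the key order of x for both outputs.
    x_out = {name: recs for name, recs in x_by_group.items() if name in shared}
    y_out = {name: y_by_group[name] for name in x_out}
    return x_out, y_out

# Made-up input data.
x = {"human": [("h1", "MK")], "mouse": [("m1", "ML")]}
y = {"mouse": [("m2", "AC")], "yeast": [("y1", "AG")]}
x_shared, y_shared = keep_shared_groups(x, y)
```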
pad_msas_with_dummy_sequences
pad_msas_with_dummy_sequences (data_group_by_group_x:dict[str,list[tuple[str,str]]], data_group_by_group_y:dict[str,list[tuple[str,str]]], dummy_symbol:str='-')
Pad MSAs with dummy sequences so that all groups/species contain the same number of sequences.
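The sketch below shows one possible reading of this padding: within each group, the shorter of the two collections is extended with all-dummy sequences until both hold equally many records. If the intent is instead to equalize sizes across groups, the target would be a global maximum. The helper name pad_groups_pairwise and the "dummy" header are hypothetical, and the sketch assumes both inputs share the same non-empty groups and that all sequences within one MSA have the same length.

```python
def pad_groups_pairwise(x_by_group, y_by_group, dummy_symbol="-"):
    """Hypothetical helper: equalize record counts of x and y within each group."""
    def first_seq_length(by_group):
        first_group = next(iter(by_group.values()))
        return len(first_group[0][1])

    def pad(records, target, seq_length):
        missing = target - len(records)  # never negative, target is the max
        return list(records) + [("dummy", dummy_symbol * seq_length)] * missing

    x_len, y_len = first_seq_length(x_by_group), first_seq_length(y_by_group)
    x_out, y_out = {}, {}
    for name in x_by_group:
        target = max(len(x_by_group[name]), len(y_by_group[name]))
        x_out[name] = pad(x_by_group[name], target, x_len)
        y_out[name] = pad(y_by_group[name], target, y_len)
    return x_out, y_out

# Made-up input data.
x = {"A": [("a1", "MKV"), ("a2", "MKL")], "B": [("b1", "MRV")]}
y = {"A": [("c1", "AC")], "B": [("d1", "AG"), ("d2", "AT")]}
x_padded, y_padded = pad_groups_pairwise(x, y)
```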
one_hot_encode_msa
one_hot_encode_msa (seq_records:list[tuple[str,str]], aa_to_int:Optional[dict[str,int]]=None, device:Optional[torch.device]=None)
Given a list of records of the form (header, sequence), assumed to be a parsed MSA, tokenize each sequence and one-hot encode each token. Return a 3D tensor representing the one-hot encoded MSA.
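The tokenize-then-encode step can be illustrated without any dependencies. The sketch below builds nested lists instead of a torch tensor and omits the device argument; the helper name one_hot_encode, the default alphabet (gap plus 20 amino acids), and the (num_sequences, sequence_length, alphabet_size) layout are all assumptions made for illustration.

```python
def one_hot_encode(seq_records, aa_to_int=None):
    """Hypothetical helper: nested-list one-hot encoding of an MSA."""
    if aa_to_int is None:
        # Assumed default alphabet: gap symbol plus the 20 standard amino acids.
        aa_to_int = {aa: i for i, aa in enumerate("-ACDEFGHIKLMNPQRSTVWY")}
    num_tokens = max(aa_to_int.values()) + 1
    encoded = []
    for _, seq in seq_records:
        rows = []
        for symbol in seq:
            row = [0] * num_tokens
            row[aa_to_int[symbol]] = 1  # raises KeyError on unknown symbols
            rows.append(row)
        encoded.append(rows)
    return encoded

# Made-up two-sequence MSA of length 2.
msa = [("h1", "A-"), ("h2", "CA")]
encoded = one_hot_encode(msa)
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
```

With torch available, wrapping the nested lists in torch.tensor would give a 3D tensor of the same shape.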
compute_num_correct_pairings
compute_num_correct_pairings (hard_perms_by_group:list[numpy.ndarray], compare_to_identity_permutation:bool, single_and_paired_seqs:Optional[dict[str,list]]=None)
Compute the total number of correct pairings. ‘Correct’ means that they are present in the original paired MSAs, assumed to be the ground truth.
If compare_to_identity_permutation is True, then the correct pairings are assumed to be given by the identity permutation, and the x_seqs, y_seqs, and xy_seqs arguments are ignored.
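For the identity-permutation case only, the count reduces to the number of fixed points across all groups. The sketch below assumes each entry of hard_perms_by_group is a 1D sequence of indices in which position i holds the index paired with item i; the helper name count_identity_matches is hypothetical, and plain lists stand in for the numpy arrays.

```python
def count_identity_matches(hard_perms_by_group):
    """Hypothetical helper: count positions i where perm[i] == i, over all groups."""
    return sum(
        sum(1 for i, j in enumerate(perm) if int(j) == i)
        for perm in hard_perms_by_group
    )

# Made-up permutations: all correct, none correct, one correct.
perms = [[0, 1, 2], [1, 0], [0, 2, 1]]
num_correct = count_identity_matches(perms)
```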
compute_comparable_group_idxs
compute_comparable_group_idxs (group_sizes_arr:numpy.ndarray, max_size_ratio:int, max_group_size:int)