spuco.utils

Utility classes and functions.

Trainer

class Trainer(trainset: Dataset, model: Module, batch_size: int, optimizer: Optimizer, lr_scheduler: _LRScheduler | None = None, max_grad_norm: float | None = None, criterion: Module = CrossEntropyLoss(), forward_pass: Callable[[Any], Tuple[Tensor, Tensor, Tensor]] | None = None, sampler: Sampler | None = None, device: device = device(type='cpu'), verbose: bool = False)

Bases: object

__init__(trainset: Dataset, model: Module, batch_size: int, optimizer: Optimizer, lr_scheduler: _LRScheduler | None = None, max_grad_norm: float | None = None, criterion: Module = CrossEntropyLoss(), forward_pass: Callable[[Any], Tuple[Tensor, Tensor, Tensor]] | None = None, sampler: Sampler | None = None, device: device = device(type='cpu'), verbose: bool = False) → None

Initializes an instance of the Trainer class.

Parameters:
  • trainset (torch.utils.data.Dataset) – The training set.

  • model (torch.nn.Module) – The PyTorch model to train.

  • batch_size (int) – The batch size to use during training.

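  • optimizer (torch.optim.Optimizer) – The optimizer to use for training.

  • lr_scheduler (torch.optim.lr_scheduler._LRScheduler, optional) – The learning rate scheduler to use during training. Default is None.

  • max_grad_norm (float, optional) – The maximum gradient norm, used for gradient clipping. Default is None.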

  • criterion (torch.nn.Module, optional) – The loss function to use during training. Default is nn.CrossEntropyLoss().

  • forward_pass (Callable[[Any], Tuple[torch.Tensor, torch.Tensor, torch.Tensor]], optional) – The forward pass function to use during training. Default is None.

  • sampler (torch.utils.data.Sampler, optional) – The sampler to use for creating batches. Default is None.

  • device (torch.device, optional) – The device to use for computations. Default is torch.device("cpu").

  • verbose (bool, optional) – Whether to print training progress. Default is False.

train(num_epochs: int)

Trains the model for the given number of epochs.

Parameters:

num_epochs (int) – The number of epochs to train for.

train_epoch(epoch: int) → None

Trains the PyTorch model for one epoch.

Parameters:

epoch (int) – The number of the epoch being trained (used only for logging).

static compute_accuracy(outputs: Tensor, labels: Tensor) → float

Computes the accuracy of the PyTorch model.

Parameters:
  • outputs (torch.Tensor) – The predicted outputs of the model.

  • labels (torch.Tensor) – The ground truth labels.

Returns:

The accuracy of the model.

Return type:

float
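The usual way to turn model outputs into an accuracy is to take the highest-scoring class per example and compare it with the label. A minimal sketch of that idea follows, using plain Python lists instead of tensors. The name compute_accuracy_sketch is hypothetical, and returning a fraction in [0, 1] is an assumption: the reference above does not say whether the library returns a fraction or a percentage.

```python
from typing import Sequence


def compute_accuracy_sketch(
    outputs: Sequence[Sequence[float]], labels: Sequence[int]
) -> float:
    """Fraction of rows whose highest-scoring class equals the label.

    Hypothetical stand-in for Trainer.compute_accuracy; the scale of the
    return value (fraction vs. percentage) is an assumption.
    """
    if len(outputs) != len(labels):
        raise ValueError("outputs and labels must have the same length")
    if not labels:
        return 0.0
    correct = 0
    for scores, label in zip(outputs, labels):
        # Index of the largest score plays the role of argmax over classes.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Four examples, two classes; the third example is misclassified.
outputs = [[2.0, 0.1], [0.3, 1.5], [0.9, 0.2], [0.1, 0.4]]
labels = [0, 1, 1, 1]
accuracy = compute_accuracy_sketch(outputs, labels)
```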

get_trainset_outputs()

Gets the outputs of the model on the training set.

Custom Indices Sampler

class CustomIndicesSampler(indices: List[int], shuffle: bool = False)

Bases: Sampler[int]

Samples from the specified indices. Pass any precomputed list of indices (for example upsampled, downsampled, or group-balanced indices) to this class. By default, the indices are not shuffled.

__init__(indices: List[int], shuffle: bool = False)

Samples elements from the specified indices.

Parameters:
  • indices (list[int]) – The list of indices to sample from.

  • shuffle (bool, optional) – Whether to shuffle the indices. Default is False.
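To illustrate the behavior described above, here is a minimal sketch of an index sampler that yields a fixed list of indices, optionally in shuffled order. IndicesSamplerSketch is a hypothetical stand-in written without PyTorch; the real class derives from Sampler[int], and its internals are not shown in this reference.

```python
import random
from typing import Iterator, List


class IndicesSamplerSketch:
    """Hypothetical stand-in: yields the given indices, optionally shuffled."""

    def __init__(self, indices: List[int], shuffle: bool = False) -> None:
        self.indices = list(indices)
        self.shuffle = shuffle

    def __iter__(self) -> Iterator[int]:
        # Copy so that shuffling never mutates the stored index list.
        order = list(self.indices)
        if self.shuffle:
            random.shuffle(order)
        return iter(order)

    def __len__(self) -> int:
        return len(self.indices)


# Index 2 is repeated, which upsamples that example.
upsampled = [0, 1, 2, 2, 2, 5]
fixed_order = list(IndicesSamplerSketch(upsampled))
shuffled_order = list(IndicesSamplerSketch(upsampled, shuffle=True))
```

Repeating an index in the list is what makes upsampling possible: the sampler itself does not deduplicate or rebalance anything.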

Exemplar Clustering (K-Medoids)

cluster_by_exemplars(similarity_matrix, num_exemplars, verbose=False) → Dict[int, List[int]]

Returns a dictionary mapping exemplar index to a list of indices.

Parameters:
  • similarity_matrix (numpy.ndarray) – The similarity matrix.

  • num_exemplars (int) – The number of exemplars to select.

  • verbose (bool, optional) – Whether to print progress information.

Returns:

A dictionary mapping exemplar index to a list of indices.

Return type:

dict[int, list[int]]

closest_exemplar(sample_index, exemplar_indices, similarity_matrix)

Finds the closest exemplar to a given sample index.

Parameters:
  • sample_index (int) – The index of the sample.

  • exemplar_indices (list[int]) – The indices of the exemplars.

  • similarity_matrix (numpy.ndarray) – The similarity matrix.

Returns:

The index of the closest exemplar and the similarity score.

Return type:

tuple[int, float]
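The relationship between the two functions above can be sketched as follows: once exemplars are chosen, each sample is assigned to the exemplar it is most similar to. This sketch uses nested lists in place of a numpy.ndarray and assumes that a higher similarity value means "closer". The function names are hypothetical, and the way cluster_by_exemplars selects its exemplars is not covered here.

```python
from typing import Dict, List, Sequence, Tuple


def closest_exemplar_sketch(
    sample_index: int,
    exemplar_indices: Sequence[int],
    similarity_matrix: Sequence[Sequence[float]],
) -> Tuple[int, float]:
    """Return (exemplar index, similarity) for the most similar exemplar."""
    row = similarity_matrix[sample_index]
    best = max(exemplar_indices, key=lambda exemplar: row[exemplar])
    return best, row[best]


def assign_to_exemplars(
    exemplar_indices: Sequence[int],
    similarity_matrix: Sequence[Sequence[float]],
) -> Dict[int, List[int]]:
    """Group every sample index under its most similar exemplar."""
    clusters: Dict[int, List[int]] = {e: [] for e in exemplar_indices}
    for sample_index in range(len(similarity_matrix)):
        exemplar, _ = closest_exemplar_sketch(
            sample_index, exemplar_indices, similarity_matrix
        )
        clusters[exemplar].append(sample_index)
    return clusters


# Samples 0 and 1 resemble each other, as do samples 2 and 3.
similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters = assign_to_exemplars([0, 2], similarity)
```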

Miscellaneous Functions

convert_labels_to_partition(labels: List[int]) → Dict[int, List[int]]

Converts a list of labels into a partition dictionary.

Parameters:

labels (List[int]) – List of labels.

Returns:

Partition dictionary mapping labels to their corresponding indices.

Return type:

Dict[int, List[int]]

convert_partition_to_labels(partition: Dict[int, List[int]]) → List[int]

Converts a partition dictionary into a list of labels.

Parameters:

partition (Dict[int, List[int]]) – Partition dictionary mapping labels to their corresponding indices.

Returns:

List of labels.

Return type:

List[int]
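The two conversions are inverses of each other, which a small round trip makes concrete. The sketch below uses hypothetical function names and assumes that the partition covers every index from 0 to n - 1 exactly once; it is not the library's implementation.

```python
from collections import defaultdict
from typing import Dict, List


def labels_to_partition(labels: List[int]) -> Dict[int, List[int]]:
    """Map each label to the list of positions where it occurs."""
    partition: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        partition[label].append(index)
    return dict(partition)


def partition_to_labels(partition: Dict[int, List[int]]) -> List[int]:
    """Rebuild the label list; assumes indices 0..n-1 each appear once."""
    size = sum(len(indices) for indices in partition.values())
    labels = [-1] * size  # -1 marks positions not yet filled
    for label, indices in partition.items():
        for index in indices:
            labels[index] = label
    return labels


labels = [0, 1, 0, 2, 1]
partition = labels_to_partition(labels)
restored = partition_to_labels(partition)
```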

label_examples(unlabled_dataloader: DataLoader, model: Module, device: device)

Labels examples using a trained model.

Parameters:
  • unlabeled_dataloader (torch.utils.data.DataLoader) – Dataloader containing unlabeled examples.

  • model (torch.nn.Module) – Trained model for labeling examples.

  • device (torch.device) – Device to use for computations.

Returns:

List of predicted labels.

Return type:

List[int]

pairwise_similarity(Z1: tensor, Z2: tensor, block_size: int = 1024)

Computes pairwise similarity between two sets of embeddings.

Parameters:
  • Z1 (torch.tensor) – Tensor containing the first set of embeddings.

  • Z2 (torch.tensor) – Tensor containing the second set of embeddings.

  • block_size (int) – Size of the blocks for computing similarity. Default is 1024.

Returns:

Pairwise similarity matrix.

Return type:

np.array
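The block_size parameter suggests that the similarity matrix is computed tile by tile rather than in one large operation. The sketch below shows that tiling pattern in plain Python. It assumes cosine similarity as the measure, which this reference does not state, and the function names are hypothetical. In pure Python the blocking only illustrates the traversal order; with tensors, it bounds the size of each intermediate result.

```python
import math
from typing import List, Sequence


def _normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each row to unit length; all-zero rows are left unchanged."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        normalized.append([x / norm for x in row])
    return normalized


def blockwise_cosine_similarity(
    Z1: Sequence[Sequence[float]],
    Z2: Sequence[Sequence[float]],
    block_size: int = 1024,
) -> List[List[float]]:
    """Fill a len(Z1) x len(Z2) matrix one block_size x block_size tile at a time."""
    A, B = _normalize(Z1), _normalize(Z2)
    result = [[0.0] * len(B) for _ in A]
    for row_start in range(0, len(A), block_size):
        for col_start in range(0, len(B), block_size):
            # Each (row_start, col_start) pair identifies one tile.
            for i in range(row_start, min(row_start + block_size, len(A))):
                for j in range(col_start, min(col_start + block_size, len(B))):
                    result[i][j] = sum(a * b for a, b in zip(A[i], B[j]))
    return result


Z1 = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
Z2 = [[2.0, 0.0], [0.0, 1.0]]
sim = blockwise_cosine_similarity(Z1, Z2, block_size=2)
```

The result does not depend on block_size; only the order in which entries are filled changes.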

get_group_ratios(indices: List[int], group_partition: Dict[Tuple[int, int], List[int]])

Returns the ratio of each group found in the given indices.

Parameters:
  • indices (List[int]) – The indices for which to compute group ratios.

  • group_partition (Dict[Tuple[int, int], List[int]]) – Dictionary mapping each group to the list of indices that belong to it.
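As a rough illustration, the sketch below computes, for each group, the share of the given indices that fall into that group. This is only one plausible reading: the reference does not specify the denominator, and the library may instead divide by the size of each group. The name group_ratios_sketch is hypothetical.

```python
from typing import Dict, List, Tuple

Group = Tuple[int, int]


def group_ratios_sketch(
    indices: List[int], group_partition: Dict[Group, List[int]]
) -> Dict[Group, float]:
    """Share of `indices` falling in each group (assumed interpretation)."""
    if not indices:
        return {group: 0.0 for group in group_partition}
    ratios: Dict[Group, float] = {}
    for group, members in group_partition.items():
        member_set = set(members)  # set lookup keeps the count linear
        hits = sum(1 for index in indices if index in member_set)
        ratios[group] = hits / len(indices)
    return ratios


group_partition = {
    (0, 0): [0, 1, 2],
    (0, 1): [3],
    (1, 0): [4],
    (1, 1): [5, 6, 7],
}
ratios = group_ratios_sketch([0, 1, 3, 5], group_partition)
```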