SpuCo Datasets

The package currently provides the following datasets:

  • SpuCoMNIST: A controllable synthetic dataset that explores real-world data properties (spurious feature: colored background, core feature: MNIST digit)

  • SpuCoAnimals: A large-scale vision dataset curated from ImageNet to explore real-world spurious correlations (spurious feature: background, core feature: animal)

SpuCoMNIST

SpuCoMNIST lets you control the following properties:

  • number of classes and size of classes: digits can be grouped to form classes of SpuCoMNIST (e.g. [0,1] [2,3] [4,5] [6,7] [8,9])

  • spurious feature magnitude: large magnitude = easier to learn, small magnitude = harder to learn

  • spurious feature variance: low variance = easier to learn, high variance = harder to learn

  • label noise: proportion of examples with incorrect labels

  • feature noise: proportion of examples with a corrupted core feature (i.e., the digit)
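
To make the class grouping and label noise options concrete, here is a minimal sketch in plain Python. It does not use the package's API: the names DIGIT_GROUPS, digit_to_class, and add_label_noise are hypothetical, and the noise procedure (reassigning a fixed fraction of examples to a different class chosen uniformly at random) is only one assumed way to produce a given proportion of incorrect labels.

```python
import random

# Digit groups from the example above: each inner list forms one class.
DIGIT_GROUPS = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]


def digit_to_class(digit, groups=DIGIT_GROUPS):
    """Return the index of the group (class) that contains `digit`."""
    for class_idx, group in enumerate(groups):
        if digit in group:
            return class_idx
    raise ValueError(f"digit {digit} is not in any group")


def add_label_noise(labels, noise_proportion, num_classes, seed=0):
    """Return a copy of `labels` in which a fraction of entries is incorrect.

    Assumption: round(noise_proportion * len(labels)) examples are picked at
    random and each is reassigned to a class different from its true one.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    num_noisy = round(noise_proportion * len(labels))
    for i in rng.sample(range(len(labels)), num_noisy):
        wrong_classes = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(wrong_classes)
    return noisy


digits = [0, 3, 4, 7, 9, 1, 2, 5, 6, 8]
labels = [digit_to_class(d) for d in digits]
noisy_labels = add_label_noise(
    labels, noise_proportion=0.2, num_classes=len(DIGIT_GROUPS)
)
```

With five two-digit groups, the ten digits map to five classes, and a noise proportion of 0.2 changes the labels of two of the ten examples in this sketch.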

Examples from SpuCoMNIST

SpuCoAnimals

Next, we introduce SpuCoAnimals, a large-scale vision dataset curated from ImageNet with two realistic spurious correlations.

SpuCoAnimals has 4 classes:

  • landbirds

  • waterbirds

  • small dog breeds

  • big dog breeds

Waterbirds and landbirds are spuriously correlated with water and land backgrounds, respectively. Small dogs and big dogs are spuriously correlated with indoor and outdoor backgrounds, respectively.
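
The pairing between classes and backgrounds described above can be written down as a small lookup table. This is an illustrative sketch only: SPURIOUS_BACKGROUND and follows_spurious_correlation are hypothetical names, and the class and background strings are chosen for readability rather than taken from the package.

```python
# Class -> background it is spuriously correlated with, as described above.
# The string names are illustrative, not identifiers from the package.
SPURIOUS_BACKGROUND = {
    "landbirds": "land",
    "waterbirds": "water",
    "small dog breeds": "indoor",
    "big dog breeds": "outdoor",
}


def follows_spurious_correlation(animal_class, background):
    """True if `background` is the one spuriously correlated with the class."""
    return SPURIOUS_BACKGROUND[animal_class] == background


examples = [
    ("waterbirds", "water"),         # background agrees with the class
    ("waterbirds", "land"),          # background conflicts with the class
    ("small dog breeds", "indoor"),  # agrees
    ("big dog breeds", "indoor"),    # conflicts
]
aligned = [follows_spurious_correlation(c, b) for c, b in examples]
```

Examples for which the check returns False are the ones a classifier relying on the background alone would be expected to get wrong.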

Examples from SpuCoAnimals

SpuCoDogs

A subset of SpuCoAnimals containing only the dog classes.

SpuCoBirds

A subset of SpuCoAnimals containing only the bird classes.
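
Conceptually, SpuCoDogs and SpuCoBirds are class-based filters over SpuCoAnimals. The sketch below shows that idea on a toy list of (class, background) pairs. It is not how the package builds these datasets: subset_by_class, DOG_CLASSES, and BIRD_CLASSES are hypothetical names used only for illustration.

```python
# Illustrative class names; not identifiers from the package.
BIRD_CLASSES = {"landbirds", "waterbirds"}
DOG_CLASSES = {"small dog breeds", "big dog breeds"}


def subset_by_class(examples, keep_classes):
    """Keep only the (class, background) pairs whose class is in `keep_classes`."""
    return [ex for ex in examples if ex[0] in keep_classes]


animals = [
    ("landbirds", "land"),
    ("small dog breeds", "indoor"),
    ("waterbirds", "water"),
    ("big dog breeds", "outdoor"),
]
dogs = subset_by_class(animals, DOG_CLASSES)
birds = subset_by_class(animals, BIRD_CLASSES)
```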