fresco.data_loaders package
Submodules
fresco.data_loaders.data_utils module
Module for loading pre-generated data.
- class fresco.data_loaders.data_utils.AddNoise(unk_token, max_pad_len, vocab_size, switch_rate, seed=None)[source]
Bases: object
Optional transform object for the PathReports dataset.
This transform adds a random amount of padding at the front of the document using unk_token to reduce HiSAN overfitting. It also randomly replaces words with randomly selected other words to reduce overfitting.
- Parameters:
unk_token (int) – Integer mapping for unknown tokens.
max_pad_len (int) – Maximum amount of padding at the front of the document.
vocab_size (int) – Size of the vocabulary matrix or the maximum integer value to use when randomly replacing word tokens.
switch_rate (float, default: 0.1) – Fraction of words to randomly replace with random tokens.
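The two kinds of noise described above can be sketched in plain Python. This is a minimal illustration, not the fresco source: the class name AddNoiseSketch is hypothetical, each token is assumed to be replaced independently with probability switch_rate, and replacement tokens are assumed to be drawn from 0 to vocab_size - 1.

```python
import random


class AddNoiseSketch:
    """Illustrative stand-in for the transform described above (not the fresco source)."""

    def __init__(self, unk_token, max_pad_len, vocab_size, switch_rate, seed=None):
        self.unk_token = unk_token
        self.max_pad_len = max_pad_len
        self.vocab_size = vocab_size
        self.switch_rate = switch_rate
        # A private generator keeps the noise reproducible when a seed is given.
        self.rng = random.Random(seed)

    def __call__(self, doc):
        # Assumption: each token is swapped independently with probability switch_rate.
        noisy = [
            self.rng.randrange(self.vocab_size)
            if self.rng.random() < self.switch_rate
            else tok
            for tok in doc
        ]
        # Prepend between 0 and max_pad_len copies of the unknown token.
        pad_len = self.rng.randint(0, self.max_pad_len)
        return [self.unk_token] * pad_len + noisy


transform = AddNoiseSketch(unk_token=1, max_pad_len=5, vocab_size=100, switch_rate=0.1, seed=0)
doc = [10, 11, 12, 13, 14, 15, 16, 17]
noisy_doc = transform(doc)
```

The output is never shorter than the input and at most max_pad_len tokens longer; any extra tokens at the front are copies of unk_token.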
- class fresco.data_loaders.data_utils.DataHandler(data_source: str, model_args: dict, clc_flag: bool = False)[source]
Bases: object
Class for loading data.
- data_source
Defines how data was generated. Needs to be ‘pre-generated’, ‘pipeline’, or ‘official’ (not implemented).
- Type:
str
- model_args
Keywords necessary to load data and to build and train a model.
- Type:
dict
Note: The current implementation supports only one fold; generating more than one fold of data is not yet possible.
- convert_y()[source]
Add task unknown labels to Y and map values to integers for inference.
- Post-condition:
The data frame holding the outputs (the Y labels) is modified in place by this function. String values are mapped to ints for inference, e.g., C50 -> 48 for the site task.
- Note: If you load data separately from creating the torch DataLoaders,
call this function when you need ints rather than strings.
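The label mapping described above can be illustrated with a small sketch. The label vocabulary, the name of the unknown class, and the resulting integer ids are all hypothetical; the real values come from the data files, and the actual method works on a data frame rather than a list.

```python
# Hypothetical label vocabulary for a "site" task; real mappings come from the data files.
site_labels = ["C18", "C34", "C50"]
unknown_label = "unknown"  # assumed name for the task's unknown class

# Build string -> int ids, reserving the last index for the unknown class.
label_to_id = {label: idx for idx, label in enumerate(site_labels)}
label_to_id[unknown_label] = len(site_labels)

y_site = ["C50", "C18", "C99", "C34"]  # "C99" is not in the vocabulary above

# Map in place, as the post-condition describes; unseen labels fall back to the unknown id.
for i, label in enumerate(y_site):
    y_site[i] = label_to_id.get(label, label_to_id[unknown_label])
```

After the loop, y_site holds only integers, and the unseen label "C99" shares an id with the unknown class.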
- inference_loader(reproducible: bool = True, seed: int | None = None, batch_size: int = 128) dict [source]
Create torch DataLoader class for inference from a trained model.
Returns a dictionary of PyTorch DataLoaders (test) for inference.
- Parameters:
reproducible (bool) – Set all random number seeds.
seed (int) – Random number generator seed.
batch_size (int) – Batch size for inference.
- load_folds(fold: int = 0, subset_frac: float | None = None)[source]
Load data for each fold in the dataset.
- Parameters:
fold (int) – Integer number of the fold to load.
subset_frac (float) – Float value for proportion to load.
Pre-condition: __init__ called and model_args is not None.
Post-condition: Class attributes populated.
Note: Case level context model will load fold 0 by default. See run_clc.py line 225.
- load_from_saved(fold: int, subset_frac: float | None = None) dict [source]
Load data files.
- Parameters:
fold (int) – Fold number. Should always be 0 for now.
subset_frac (float) – Proportion of data to load.
- Post-condition:
Modifies self.splits in-place.
- load_weights(fold)[source]
Loads class weights from pickle file or dict in model_args file.
- Parameters:
fold (int) – Data fold to be loaded
- make_grouped_cases(doc_embeds, clc_args, device)[source]
Create GroupedCases class for torch DataLoaders.
- make_torch_dataloaders(switch_rate: float, reproducible: bool = False, seed: int | None = None) dict [source]
Create torch DataLoader classes for training module.
Returns a dictionary of PyTorch DataLoaders (train, val, test) for the training module.
- Parameters:
unk_tok (int) – Token to convert unknown words to.
vocab_size (int) – Number of words in the vocabulary.
switch_rate (float) – Proportion of words in each document to randomly flip.
- class fresco.data_loaders.data_utils.GroupedCases(doc_embeds, Y, tasks, metadata, device, exclude_single=True, shuffle_data_order=True)[source]
Bases: Dataset
Class for grouping cases for clc.
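As a rough picture of what grouping cases might involve, the sketch below collects per-document embeddings under a shared case identifier. Everything here is an assumption made for illustration: the function group_cases and the case_ids argument are hypothetical, exclude_single is assumed to drop cases that contain only one document, and shuffle_data_order is assumed to randomize the order of the cases. The real class is a torch Dataset and may behave differently.

```python
import random
from collections import defaultdict


def group_cases(doc_embeds, case_ids, exclude_single=True, shuffle_data_order=True, seed=None):
    """Group per-document embeddings by a (hypothetical) case identifier."""
    groups = defaultdict(list)
    for embed, case_id in zip(doc_embeds, case_ids):
        groups[case_id].append(embed)
    cases = list(groups.values())
    if exclude_single:
        # Assumed meaning: drop cases that contain only one document.
        cases = [docs for docs in cases if len(docs) > 1]
    if shuffle_data_order:
        # Assumed meaning: randomize the order in which cases are served.
        random.Random(seed).shuffle(cases)
    return cases


doc_embeds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
case_ids = ["a", "b", "a", "c"]
cases = group_cases(doc_embeds, case_ids, shuffle_data_order=False)
all_cases = group_cases(doc_embeds, case_ids, exclude_single=False, shuffle_data_order=False)
```

With these inputs, only case "a" has more than one document, so cases contains a single group while all_cases keeps all three.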
- class fresco.data_loaders.data_utils.LabelDict(data, missing)[source]
Bases: dict
Create dict subclass for correct behaviour when mapping task unks.
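One common way to get this behaviour from a dict subclass is Python's __missing__ hook, sketched below. This is not the fresco implementation: the class name LabelDictSketch is hypothetical, and missing is assumed to be the key of the unknown label whose value should be returned for any label not in the mapping.

```python
class LabelDictSketch(dict):
    """dict subclass that returns a fallback value for unseen keys (illustrative only)."""

    def __init__(self, data, missing):
        super().__init__(data)
        # Assumption: `missing` is the key of the unknown label.
        self.missing = missing

    def __missing__(self, key):
        # Called by dict.__getitem__ when key is absent; nothing is inserted.
        if self.missing not in self:
            raise KeyError(key)
        return self[self.missing]


labels = LabelDictSketch({"C18": 0, "C50": 1, "unknown": 2}, missing="unknown")
known = labels["C50"]
fallback = labels["C99"]
```

A subclass of this kind is convenient with pandas, because Series.map uses __missing__ on dict subclasses to supply a default instead of producing NaN for unmapped values.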