fresco.data_loaders package

Submodules

fresco.data_loaders.data_utils module

Module for loading pre-generated data.

class fresco.data_loaders.data_utils.AddNoise(unk_token, max_pad_len, vocab_size, switch_rate, seed=None)[source]

Bases: object

Optional transform object for the PathReports dataset.

This transform prepends a random amount of unk_token padding to the document to reduce HiSAN overfitting. It also replaces randomly chosen words with other randomly selected words, again to reduce overfitting.

Parameters:

unk_token (int) – Integer mapping for unknown tokens.
max_pad_len (int) – Maximum amount of padding at the front of the document.
vocab_size (int) – Size of the vocabulary matrix or the maximum integer value to use when randomly replacing word tokens.
switch_rate (float, default: 0.1) – Percentage of words to randomly replace with random tokens.
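
As a rough illustration of the two steps described above, the following sketch applies them to a document given as a list of integer tokens. It is not the fresco implementation: the class name AddNoiseSketch, the use of Python's random module, and the order of the two steps are assumptions made for this example only.

```python
import random


class AddNoiseSketch:
    """Hypothetical stand-in for AddNoise; not the fresco implementation."""

    def __init__(self, unk_token, max_pad_len, vocab_size, switch_rate, seed=None):
        self.unk_token = unk_token
        self.max_pad_len = max_pad_len
        self.vocab_size = vocab_size
        self.switch_rate = switch_rate
        # A private generator keeps the transform reproducible for a given seed.
        self.rng = random.Random(seed)

    def __call__(self, doc):
        # Replace each token with a random vocabulary id with probability switch_rate.
        noisy = [
            self.rng.randrange(self.vocab_size)
            if self.rng.random() < self.switch_rate
            else tok
            for tok in doc
        ]
        # Prepend between 0 and max_pad_len copies of unk_token (assumed inclusive).
        pad_len = self.rng.randint(0, self.max_pad_len)
        return [self.unk_token] * pad_len + noisy


transform = AddNoiseSketch(unk_token=1, max_pad_len=5, vocab_size=100, switch_rate=0.1, seed=0)
doc = [10, 11, 12, 13, 14]
noisy_doc = transform(doc)
```

In this sketch the output is always between len(doc) and len(doc) + max_pad_len tokens long, and any padding sits at the front of the document.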

class fresco.data_loaders.data_utils.DataHandler(data_source: str, model_args: dict, clc_flag: bool = False)[source]

Bases: object

Class for loading data.

data_source

Defines how data was generated. Needs to be ‘pre-generated’, ‘pipeline’, or ‘official’ (not implemented).

Type: str

model_args

Keywords necessary to load data and to build and train a model.

Type: dict

Note: The present implementation is only for 1 fold. We presently cannot generate more than one fold of data.

convert_y()[source]

Add task unknown labels to Y and map values to integers for inference.

Post-condition: The data frame holding the outputs (the ys) is modified in place by this function. String values are mapped to ints for inference, e.g., C50 -> 48 for the site task.

Note: If you load data separately from creating torch dataloaders, call this function if you want ints rather than strings.

get_vocab()[source]: Get the vocab and word embeddings from tokenized data.

inference_loader(reproducible: bool = True, seed: int | None = None, batch_size: int = 128) → dict[source]

Create torch DataLoader class for inference from a trained model.

Returns a dictionary of PyTorch DataLoaders (test) for inference.

Parameters:

reproducible (bool) – Set all random number seeds.
seed (int) – Random number generator seed.
batch_size (int) – Batch size for inference.

load_folds(fold: int = 0, subset_frac: float | None = None)[source]

Load data for each fold in the dataset.

Parameters:

fold (int) – Integer number of the fold to load.
subset_frac (float) – Float value for proportion to load.

Pre-condition: __init__ called and model_args is not None.

Post-condition: Class attributes populated.

Note: Case level context model will load fold 0 by default. See run_clc.py line 225.

load_from_saved(fold: int, subset_frac: float | None = None) → dict[source]

Load data files.

Parameters:

fold (int) – Fold number. Should always be 0 for now.
subset_frac (float) – Proportion of data to load.

Post-condition: Modifies self.splits in-place.

load_weights(fold)[source]

Loads class weights from pickle file or dict in model_args file.

Parameters:

fold (int) – Data fold to be loaded.

make_grouped_cases(doc_embeds, clc_args, device)[source]

Create GroupedCases class for torch DataLoaders.

make_torch_dataloaders(switch_rate: float, reproducible: bool = False, seed: int | None = None) → dict[source]

Create torch DataLoader classes for training module.

Returns a dictionary of PyTorch DataLoaders (train, val, test) for the training module.

Parameters:

unk_tok (int) – Token to convert unknown words to.
vocab_size (int) – Number of words in the vocabulary.
switch_rate (float) – Proportion of words in each document to randomly flip.

static seed_worker(worker_id)[source]

Set random seed for everything.

class fresco.data_loaders.data_utils.GroupedCases(doc_embeds, Y, tasks, metadata, device, exclude_single=True, shuffle_data_order=True)[source]

Bases: Dataset

Class for grouping cases for the case-level context (clc) model.

class fresco.data_loaders.data_utils.LabelDict(data, missing)[source]

Bases: dict

Create dict subclass for correct behaviour when mapping task unks.
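
One way to get this kind of fallback behaviour is a dict subclass that defines __missing__. The sketch below is an assumption about how such a mapping could work, not the fresco source: the class name LabelDictSketch, the reading of missing as the key of the unknown label, and every label value except C50 -> 48 are hypothetical.

```python
class LabelDictSketch(dict):
    """Hypothetical dict subclass; unseen labels resolve to the unknown label."""

    def __init__(self, data, missing):
        super().__init__(data)
        # Assumption: `missing` is the key whose value stands for "unknown".
        self.missing = missing

    def __missing__(self, key):
        # dict.__getitem__ calls this hook when `key` is absent.
        return self[self.missing]


# Only C50 -> 48 comes from the documentation; the "unk" entry is made up.
site_labels = LabelDictSketch({"C50": 48, "unk": 0}, missing="unk")

known = site_labels["C50"]    # label present in the mapping
unseen = site_labels["C99"]   # label absent, falls back to the "unk" value
```

Note that __missing__ is only consulted for subscript access (site_labels[key]); dict.get and the in operator do not trigger it, so the mapping itself is left unchanged by lookups of unseen labels.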

class fresco.data_loaders.data_utils.PathReports(X, Y, tasks, label_encoders, max_len=3000, transform=None, multilabel=False)[source]

Bases: Dataset

fresco.data_loaders.data_utils.word2int(tok, vocab)[source]

Map words to tokens for random embeddings.

If a word doesn’t exist in the training/val split, it is mapped to ‘unk’, the unknown token.
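
The fallback described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the fresco code: it assumes tok is a list of word strings and vocab is a dict from word to integer that contains an 'unk' entry, and the function name word2int_sketch and the sample vocabulary are hypothetical.

```python
def word2int_sketch(tok, vocab):
    """Map each word to its integer id, using the 'unk' id for unseen words."""
    unk = vocab["unk"]  # assumption: the vocabulary has an 'unk' entry
    return [vocab.get(word, unk) for word in tok]


# Hypothetical vocabulary built from the training/val split.
vocab = {"unk": 0, "tumor": 1, "breast": 2}
ids = word2int_sketch(["breast", "tumor", "margin"], vocab)
```

Here "margin" does not appear in the vocabulary, so it receives the same id as 'unk'.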

Module contents