Data Module¶
The physioex.data module provides the API to read the data from disk once the raw datasets have been processed by the Preprocess module. It consists of two classes:
physioex.data.PhysioExDataset, which serializes the processed on-disk version of a dataset into a PyTorch Dataset.
physioex.data.PhysioExDataModule, which turns the datasets into PyTorch DataLoaders ready for training.
Example of Usage¶
PhysioExDataset¶
The PhysioExDataset class is handled automatically by the PhysioExDataModule class when you use it for training or testing. In most cases you don't need to interact with the PhysioExDataset class directly.
The class is useful, however, when you need to visualize your data or extract some samples to feed to Explainable AI algorithms.
In these cases you need to instantiate a PhysioExDataset:
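A minimal sketch of the instantiation, using only the constructor parameters documented further down this page. The dataset name "hmc" and the helper names load_dataset and first_sample are placeholders chosen for illustration, not part of the library; replace the name with a dataset you have actually preprocessed.

```python
def load_dataset(data_folder, datasets=("hmc",)):
    """Build a PhysioExDataset from data that was already preprocessed.

    "hmc" is only a placeholder dataset name (an assumption).
    """
    # Imported lazily so this file can be imported without physioex installed.
    from physioex.data import PhysioExDataset

    return PhysioExDataset(
        datasets=list(datasets),
        data_folder=data_folder,
        preprocessing="raw",        # documented default
        selected_channels=["EEG"],  # documented default
        sequence_length=21,         # documented default
    )


def first_sample(dataset):
    """Return the (input, target) pair stored at index 0."""
    signal, label = dataset[0]
    return signal, label
```

Calling first_sample(load_dataset("/your/data/path")) should then give one input sequence and its target, which is usually enough to inspect shapes before plotting or running an explainability method.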
Then you can use a Python plotting library to visualize the data.
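As an illustration, the sketch below plots one sequence with matplotlib. It assumes that the sample was loaded with raw preprocessing and that the input is laid out as signal[epoch][channel][sample]; that layout is an assumption, so check the shape of your own samples first. The function names are hypothetical.

```python
def channel_trace(signal, channel):
    """Concatenate all epochs of one channel into a flat list of floats."""
    return [float(x) for epoch in signal for x in epoch[channel]]


def plot_sequence(signal, channel_names, out_path="sequence.png"):
    """Save a figure with one subplot per channel and return its path."""
    # Imported lazily; matplotlib is only needed when plotting.
    import matplotlib

    matplotlib.use("Agg")  # render to a file, no display required
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(len(channel_names), 1, sharex=True, squeeze=False)
    for ch, name in enumerate(channel_names):
        axes[ch][0].plot(channel_trace(signal, ch))
        axes[ch][0].set_ylabel(name)
    axes[-1][0].set_xlabel("sample")
    fig.savefig(out_path)
    plt.close(fig)
    return out_path
```

For a dataset built with selected_channels=["EEG"], a call such as plot_sequence(signal, ["EEG"]) would write a single-row figure.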
PhysioExDataModule¶
The PhysioExDataModule class is designed to transform datasets into PyTorch DataLoaders ready for training. It handles batching, shuffling, and splitting the data into training, validation, and test sets.
To use the PhysioExDataModule, instantiate it with the required parameters:
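A possible instantiation, restricted to the constructor parameters listed in the Documentation section of this page. The dataset name "hmc" and the wrapper function build_datamodule are assumptions made for the example.

```python
import os


def build_datamodule(data_folder, datasets=("hmc",), batch_size=32, folds=-1):
    """Create a PhysioExDataModule with the documented default settings.

    "hmc" is only a placeholder dataset name (an assumption).
    """
    # Imported lazily so this file can be imported without physioex installed.
    from physioex.data import PhysioExDataModule

    return PhysioExDataModule(
        datasets=list(datasets),
        batch_size=batch_size,
        preprocessing="raw",         # documented default
        selected_channels=["EEG"],   # documented default
        sequence_length=21,          # documented default
        folds=folds,                 # -1 is the documented default
        data_folder=data_folder,
        num_workers=os.cpu_count(),  # documented default
    )
```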
PhysioEx is built on pytorch_lightning for model training and testing, so you can use PhysioExDataModule in combination with pl.Trainer.
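The combination with pl.Trainer could look like the sketch below. The model is assumed to be any LightningModule compatible with your data, and train_and_test is a hypothetical helper name; only standard Trainer calls (fit and test with the datamodule keyword) are used.

```python
def train_and_test(model, datamodule, max_epochs=10):
    """Fit a LightningModule on the datamodule, then evaluate it on the test set."""
    # Imported lazily so this file can be imported without pytorch_lightning installed.
    import pytorch_lightning as pl

    trainer = pl.Trainer(max_epochs=max_epochs)
    trainer.fit(model, datamodule=datamodule)
    # trainer.test returns a list with one metrics dictionary per test dataloader.
    return trainer.test(model, datamodule=datamodule)
```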
Documentation¶
PhysioExDataset
Bases: Dataset
A PyTorch Dataset class for handling physiological data from multiple datasets.
Attributes:
Name | Type | Description |
---|---|---|
datasets | List[str] | List of dataset names. |
L | int | Sequence length. |
channels_index | List[int] | Indices of selected channels. |
readers | List[DataReader] | List of DataReader objects for each dataset. |
tables | List[DataFrame] | List of data tables for each dataset. |
dataset_idx | ndarray | Array indicating the dataset index for each sample. |
target_transform | Callable | Optional transform to be applied to the target. |
len | int | Total number of samples across all datasets. |
Methods:
Name | Description |
---|---|
__len__ | Returns the total number of samples. |
split | Splits the data into train, validation, and test sets. |
get_num_folds | Returns the minimum number of folds across all datasets. |
__getitem__ | Returns the input and target for a given index. |
get_sets | Returns the indices for the train, validation, and test sets. |
__init__()¶
Initializes the PhysioExDataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | List[str] | List of dataset names. | required |
data_folder | str | Path to the folder containing the data. | required |
preprocessing | str | Type of preprocessing to apply. | "raw" |
selected_channels | List[str] | List of selected channels. | ["EEG"] |
sequence_length | int | Length of the sequence. | 21 |
target_transform | Callable | Optional transform to be applied to the target. | None |
hpc | bool | Flag indicating whether to use high-performance computing. | False |
indexed_channels | List[str] | List of indexed channels. If you used a custom Preprocessor and saved your signal channels in a different order, provide the correct order here. Otherwise, ignore this parameter. | ["EEG", "EOG", "EMG", "ECG"] |
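To illustrate the target_transform parameter, here is a small hypothetical transform that keeps only the label of the middle epoch of a sequence. It assumes the target is an indexable sequence with one label per epoch; that is an assumption about the target layout, not something this page states.

```python
def central_label(target):
    """Keep only the label of the middle epoch of a sequence of labels."""
    return target[len(target) // 2]
```

Under that assumption, the function could be passed as target_transform=central_label when building the dataset.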
__getitem__(idx)¶
Returns the input and target sequence for a given index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | Index of the sample to retrieve. | required |
Returns:
Type | Description |
---|---|
tuple | Input and target for the given index. |
__len__()¶
Returns the total number of sequences of epochs across all the datasets.
Returns:
Type | Description |
---|---|
int | Total number of sequences. |
split(fold=-1, dataset_idx=-1)¶
Splits the data into train, validation, and test sets. The behavior depends on both arguments:
If fold is -1 and dataset_idx is -1: a random fold is selected for each dataset.
If fold is -1 and dataset_idx is not -1: a random fold is selected for the chosen dataset.
If fold is not -1 and dataset_idx is -1: the given fold is selected for each dataset.
If fold is not -1 and dataset_idx is not -1: the given fold is selected for the chosen dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fold | int | Fold number to use for splitting. | -1 |
dataset_idx | int | Index of the dataset to split. | -1 |
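The four cases can be mimicked in plain Python. This is a sketch of the selection rule only, not the library's implementation: choose_folds is a hypothetical function, folds are assumed to be numbered from 0, and datasets that are not targeted are assumed to keep their current fold.

```python
import random


def choose_folds(num_folds, fold=-1, dataset_idx=-1, current=None, rng=random):
    """Return the fold assigned to each dataset after a split-like call.

    num_folds: number of folds available per dataset, e.g. [10, 5]
    current:   folds currently assigned, kept for datasets that are not targeted
    """
    current = list(current) if current is not None else [0] * len(num_folds)
    # dataset_idx == -1 targets every dataset, otherwise only the selected one.
    targets = range(len(num_folds)) if dataset_idx == -1 else [dataset_idx]
    for i in targets:
        # fold == -1 draws a random fold, otherwise the given fold is used.
        current[i] = rng.randrange(num_folds[i]) if fold == -1 else fold
    return current
```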
get_num_folds()¶
Returns the minimum number of folds across all datasets.
Returns:
Type | Description |
---|---|
int | Minimum number of folds. |
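Because the minimum is taken across datasets, every fold index below it is valid for all of them, which makes a simple cross-validation loop possible. The sketch below assumes folds are numbered from 0 and works with any object exposing get_num_folds() and split(fold); iterate_folds is a hypothetical helper.

```python
def iterate_folds(dataset):
    """Activate each fold on the dataset in turn and yield its index."""
    for fold in range(dataset.get_num_folds()):
        dataset.split(fold)  # select this fold for every dataset
        yield fold
```

A training script could then run one training and evaluation pass inside a `for fold in iterate_folds(dataset):` loop.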
get_sets()¶
Returns the indices for the train, validation, and test sets.
Returns:
Type | Description |
---|---|
tuple | Indices for the train, validation, and test sets. |
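One use of these indices is a quick sanity check that the three sets do not overlap. The helper below is hypothetical and assumes the returned tuple holds three iterables of integer indices, ordered as train, validation, test.

```python
def check_split(train_idx, valid_idx, test_idx):
    """Return True when the three index sets do not share any sample."""
    train = set(map(int, train_idx))
    valid = set(map(int, valid_idx))
    test = set(map(int, test_idx))
    return not (train & valid or train & test or valid & test)
```

With a dataset instance, this could be called as check_split(*dataset.get_sets()).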
PhysioExDataModule
Bases: LightningDataModule
A PyTorch Lightning DataModule for handling physiological data from multiple datasets.
Attributes:
Name | Type | Description |
---|---|---|
datasets_id | List[str] | List of dataset names. |
num_workers | int | Number of workers for data loading. |
dataset | PhysioExDataset | The dataset object. |
batch_size | int | Batch size for the DataLoader. |
hpc | bool | Flag indicating whether to use high-performance computing. |
train_dataset | Union[PhysioExDataset, Subset] | Training dataset. |
valid_dataset | Union[PhysioExDataset, Subset] | Validation dataset. |
test_dataset | Union[PhysioExDataset, Subset] | Test dataset. |
train_sampler | Union[SubsetRandomSampler, Subset] | Sampler for the training dataset. |
valid_sampler | Union[SubsetRandomSampler, Subset] | Sampler for the validation dataset. |
test_sampler | Union[SubsetRandomSampler, Subset] | Sampler for the test dataset. |
Methods:
Name | Description |
---|---|
setup | Sets up the datasets for different stages. |
train_dataloader | Returns the DataLoader for the training dataset. |
val_dataloader | Returns the DataLoader for the validation dataset. |
test_dataloader | Returns the DataLoader for the test dataset. |
__init__()¶
Initializes the PhysioExDataModule.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datasets | List[str] | List of dataset names. | required |
batch_size | int | Batch size for the DataLoader. | 32 |
preprocessing | str | Type of preprocessing to apply. | "raw" |
selected_channels | List[str] | List of selected channels. | ["EEG"] |
sequence_length | int | Length of the sequence. | 21 |
target_transform | Callable | Optional transform to be applied to the target. | None |
folds | Union[int, List[int]] | Fold number(s) for splitting the data. | -1 |
data_folder | str | Path to the folder containing the data. | None |
num_nodes | int | Number of nodes for distributed training. | 1 |
num_workers | int | Number of workers for data loading. | os.cpu_count() |
train_dataloader()¶
Returns the DataLoader for the training dataset.
Returns:
Type | Description |
---|---|
DataLoader | DataLoader for the training dataset. |
test_dataloader()¶
Returns the DataLoader for the test dataset.
Returns:
Type | Description |
---|---|
DataLoader | DataLoader for the test dataset. |