Data Module¶

The physioex.data module provides the API to read the data from the disk once the raw datasets have been processed by the Preprocess module. It consists of two classes:

physioex.data.PhysioExDataset, which serializes the processed, on-disk version of the dataset into a PyTorch Dataset.
physioex.data.PhysioExDataModule, which transforms the datasets into PyTorch DataLoaders ready for training.

Example of Usage¶

PhysioExDataset¶

The PhysioExDataset class is handled automatically by the PhysioExDataModule class when you use it for training or testing. In most cases you don't need to interact with the PhysioExDataset class directly.

The class is instead really helpful when you need to visualize your data, or when you need some samples of your data as input to Explainable AI algorithms.

In these cases you need to instantiate a PhysioExDataset:

from physioex.data import PhysioExDataset

data = PhysioExDataset(
    datasets = ["hmc"], # you can read different datasets merged together in this way
    preprocessing = "raw",  
    selected_channels = ["EEG", "EOG", "EMG"],     
    data_folder = "/your/data/path/",
)

# you can now access any sequence of epochs in the dataset
signal, label = data[0]

signal.shape # will be [21 (default sequence length), 3, 3000]
label.shape # will be [21]

You can then use a Python plotting library to visualize the data.

Example

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# repeat each epoch label over the 3000 samples of its epoch
hypnogram = np.ones((21, 3000)) * label.numpy().reshape(-1, 1)

# one row per trace: the hypnogram plus the three channels
fig, ax = plt.subplots(4, 1, figsize = (21, 8), sharex="col", sharey="row")

# flatten the 21 epochs into one continuous time axis
hypnogram = hypnogram.reshape( -1 )
signals = signal.numpy().transpose(1, 0, 2).reshape(3, -1)

# first the hypnogram
sns.lineplot( x = range(3000*21), y = hypnogram, ax = ax[0], color = "blue")
# then the channels:
sns.lineplot( x = range(3000*21), y = signals[ 0], ax = ax[1], color = "red")
sns.lineplot( x = range(3000*21), y = signals[ 1], ax = ax[2], color = "green")
sns.lineplot( x = range(3000*21), y = signals[ 2], ax = ax[3], color = "purple")

# check the examples notebook "visualize_data.ipynb" to see how to customize the plot properly

plt.tight_layout()
plt.show()
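
The reshaping steps in the plotting code can be checked without PhysioEx installed. The sketch below uses random NumPy arrays as stand-ins for `signal` and `label`; the shapes follow the example above, while the label range 0 to 4 is only a placeholder assumption, not something taken from the library.

```python
import numpy as np

# Synthetic stand-ins for one dataset item (shapes assumed from the example above).
seq_len, n_channels, n_samples = 21, 3, 3000
signal = np.random.randn(seq_len, n_channels, n_samples)
label = np.random.randint(0, 5, size=seq_len)  # placeholder label values

# Repeat each epoch label over the samples of its epoch, flattened in time.
hypnogram = np.repeat(label, n_samples)

# Put channels first, then concatenate the epochs along the time axis.
signals = signal.transpose(1, 0, 2).reshape(n_channels, -1)

print(hypnogram.shape)  # (63000,)
print(signals.shape)    # (3, 63000)
```

Each row of `signals` is one channel with the 21 epochs laid end to end, which is why it can share an x axis with the flattened hypnogram.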


PhysioExDataModule¶

The PhysioExDataModule class is designed to transform datasets into PyTorch DataLoaders ready for training. It handles the batching, shuffling, and splitting of the data into training, validation, and test sets.

To use the PhysioExDataModule, you need to instantiate it with the required parameters:

from physioex.data import PhysioExDataModule

datamodule = PhysioExDataModule(
    datasets=["hmc", "mass"],  # list of datasets to be used
    batch_size=64,             # batch size for the DataLoader
    preprocessing="raw",       # preprocessing method
    selected_channels=["EEG", "EOG", "EMG"],  # channels to be selected
    sequence_length=21,        # length of the sequence
    data_folder="/your/data/path/",  # path to the data folder
)

# get the DataLoaders
train_loader = datamodule.train_dataloader()
val_loader = datamodule.val_dataloader()
test_loader = datamodule.test_dataloader()

PhysioEx is built on pytorch_lightning for model training and testing, so you can use PhysioExDataModule together with pl.Trainer:

from pytorch_lightning import Trainer

# placeholder: replace with your own LightningModule
model = SomePytorchModel()

trainer = Trainer(
    devices="auto",
    max_epochs=10,
    deterministic=True,
)

# setup the model in training mode if needed
model = model.train()
# Start training
trainer.fit(model, datamodule=datamodule)
results = trainer.test( model, datamodule = datamodule)

Documentation¶

PhysioExDataset

Bases: Dataset

A PyTorch Dataset class for handling physiological data from multiple datasets.

Attributes:

Name	Type	Description
`datasets`	`List[str]`	List of dataset names.
`L`	`int`	Sequence length.
`channels_index`	`List[int]`	Indices of selected channels.
`readers`	`List[DataReader]`	List of DataReader objects for each dataset.
`tables`	`List[DataFrame]`	List of data tables for each dataset.
`dataset_idx`	`ndarray`	Array indicating the dataset index for each sample.
`target_transform`	`Callable`	Optional transform to be applied to the target.
`len`	`int`	Total number of samples across all datasets.

Methods:

Name	Description
`__len__`	Returns the total number of samples.
`split`	Splits the data into train, validation, and test sets.
`get_num_folds`	Returns the minimum number of folds across all datasets.
`__getitem__`	Returns the input and target for a given index.
`get_sets`	Returns the indices for the train, validation, and test sets.

`__init__()` ¶

Initializes the PhysioExDataset.

Parameters:

Name	Type	Description	Default
`datasets`	`List[str]`	List of dataset names.	required
`data_folder`	`str`	Path to the folder containing the data.	required
`preprocessing`	`str`	Type of preprocessing to apply.	`"raw"`
`selected_channels`	`List[str]`	List of selected channels.	`["EEG"]`
`sequence_length`	`int`	Length of the sequence.	`21`
`target_transform`	`Callable`	Optional transform to be applied to the target.	`None`
`hpc`	`bool`	Flag indicating whether to use high-performance computing.	`False`
`indexed_channels`	`List[str]`	List of indexed channels. If you used a custom Preprocessor and saved your signal channels in a different order, provide the correct order here. Otherwise ignore this parameter.	`["EEG", "EOG", "EMG", "ECG"]`
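
As an illustration of the `target_transform` parameter, the sketch below defines a callable that reduces a sequence of epoch labels to the label of the central epoch. The helper name `central_label` is hypothetical, and the sketch assumes the transform receives the whole label sequence of one item; the documentation above does not confirm this.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def central_label(labels: Sequence[T]) -> T:
    """Return the label of the middle epoch of a sequence (hypothetical helper)."""
    return labels[len(labels) // 2]


# A plain list stands in for the label sequence of one item.
labels = [0, 0, 1, 2, 2, 2, 3]
print(central_label(labels))  # 2
```

Under that assumption, the callable could be passed as `target_transform=central_label` when the dataset is created.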

`__getitem__(idx)` ¶

Returns the input and target sequence for a given index.

Parameters:

Name	Type	Description	Default
`idx`	`int`	Index of the sample to retrieve.	required

Returns:

Name	Type	Description
`tuple`		Input and target for the given index.

`__len__()` ¶

Returns the total number of sequences of epochs across all the datasets.

Returns:

Name	Type	Description
`int`		Total number of sequences.

`split(fold=-1, dataset_idx=-1)` ¶

Splits the data into train, validation, and test sets.

If fold is -1 and dataset_idx is -1: set the split to a random fold for each dataset.
If fold is -1 and dataset_idx is not -1: set the split to a random fold for the selected dataset.
If fold is not -1 and dataset_idx is -1: set the split to the selected fold for each dataset.
If fold is not -1 and dataset_idx is not -1: set the split to the selected fold for the selected dataset.

Parameters:

Name	Type	Description	Default
`fold`	`int`	Fold number to use for splitting. Defaults to -1.	`-1`
`dataset_idx`	`int`	Index of the dataset to split. Defaults to -1.	`-1`
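
The four combinations of `fold` and `dataset_idx` described for `split` can be sketched in plain Python. The function `resolve_folds` and its `num_folds` argument are hypothetical names used only for illustration; this is not the library's implementation, just one way to read the rules.

```python
import random
from typing import List, Optional


def resolve_folds(
    num_folds: List[int],
    fold: int = -1,
    dataset_idx: int = -1,
    rng: Optional[random.Random] = None,
) -> List[Optional[int]]:
    """Return the fold chosen per dataset, or None where the split is untouched."""
    rng = rng or random.Random()
    chosen: List[Optional[int]] = []
    for i, n in enumerate(num_folds):
        if dataset_idx != -1 and i != dataset_idx:
            chosen.append(None)  # this dataset is not affected
        elif fold == -1:
            chosen.append(rng.randrange(n))  # random fold for this dataset
        else:
            chosen.append(fold)  # explicitly selected fold
    return chosen


print(resolve_folds([10, 5], fold=2))                 # [2, 2]
print(resolve_folds([10, 5], fold=3, dataset_idx=0))  # [3, None]
```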

`get_num_folds()` ¶

Returns the minimum number of folds across all datasets.

Returns:

Name	Type	Description
`int`		Minimum number of folds.
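
A typical use of the minimum fold count is to bound a cross-validation loop, since presumably only folds that exist in every dataset can be selected for all of them at once. The sketch below uses made-up fold counts; with a real dataset the bound would come from `get_num_folds()`.

```python
# Hypothetical fold counts per dataset; real values depend on the preprocessed data.
folds_per_dataset = {"hmc": 10, "mass": 5}

# The smallest count is the number of folds shared by all datasets.
num_folds = min(folds_per_dataset.values())

for fold in range(num_folds):
    # With a real PhysioExDataset one would call dataset.split(fold=fold) here.
    print(f"running fold {fold} of {num_folds}")
```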

`get_sets()` ¶

Returns the indices for the train, validation, and test sets.

Returns:

Name	Type	Description
`tuple`		Indices for the train, validation, and test sets.
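
One simple use of the returned indices is a sanity check that the three sets do not overlap. The helper `check_disjoint` below is hypothetical and works on any iterables of integers. The unpacking order shown in the comment (train, validation, test) is an assumption based on the description above.

```python
from typing import Iterable


def check_disjoint(train: Iterable[int], valid: Iterable[int], test: Iterable[int]) -> bool:
    """Return True when no index appears in more than one set."""
    train_s, valid_s, test_s = set(train), set(valid), set(test)
    return (
        train_s.isdisjoint(valid_s)
        and train_s.isdisjoint(test_s)
        and valid_s.isdisjoint(test_s)
    )


# With a real dataset (assumed order): train_idx, valid_idx, test_idx = data.get_sets()
train_idx, valid_idx, test_idx = [0, 1, 2, 3], [4, 5], [6, 7]
print(check_disjoint(train_idx, valid_idx, test_idx))  # True
```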

PhysioExDataModule

Bases: LightningDataModule

A PyTorch Lightning DataModule for handling physiological data from multiple datasets.

Attributes:

Name	Type	Description
`datasets_id`	`List[str]`	List of dataset names.
`num_workers`	`int`	Number of workers for data loading.
`dataset`	`PhysioExDataset`	The dataset object.
`batch_size`	`int`	Batch size for the DataLoader.
`hpc`	`bool`	Flag indicating whether to use high-performance computing.
`train_dataset`	`Union[PhysioExDataset, Subset]`	Training dataset.
`valid_dataset`	`Union[PhysioExDataset, Subset]`	Validation dataset.
`test_dataset`	`Union[PhysioExDataset, Subset]`	Test dataset.
`train_sampler`	`Union[SubsetRandomSampler, Subset]`	Sampler for the training dataset.
`valid_sampler`	`Union[SubsetRandomSampler, Subset]`	Sampler for the validation dataset.
`test_sampler`	`Union[SubsetRandomSampler, Subset]`	Sampler for the test dataset.

Methods:

Name	Description
`setup`	Sets up the datasets for different stages.
`train_dataloader`	Returns the DataLoader for the training dataset.
`val_dataloader`	Returns the DataLoader for the validation dataset.
`test_dataloader`	Returns the DataLoader for the test dataset.

`__init__()` ¶

Initializes the PhysioExDataModule.

Parameters:

Name	Type	Description	Default
`datasets`	`List[str]`	List of dataset names.	required
`batch_size`	`int`	Batch size for the DataLoader.	`32`
`preprocessing`	`str`	Type of preprocessing to apply.	`"raw"`
`selected_channels`	`List[str]`	List of selected channels.	`["EEG"]`
`sequence_length`	`int`	Length of the sequence.	`21`
`target_transform`	`Callable`	Optional transform to be applied to the target.	`None`
`folds`	`Union[int, List[int]]`	Fold number(s) for splitting the data.	`-1`
`data_folder`	`str`	Path to the folder containing the data.	`None`
`num_nodes`	`int`	Number of nodes for distributed training.	`1`
`num_workers`	`int`	Number of workers for data loading.	`os.cpu_count()`

`train_dataloader()` ¶

Returns the DataLoader for the training dataset.

Returns:

Name	Type	Description
`DataLoader`		DataLoader for the training dataset.

`test_dataloader()` ¶

Returns the DataLoader for the test dataset.

Returns:

Name	Type	Description
`DataLoader`		DataLoader for the test dataset.
