
Data Module

The physioex.data module provides the API for reading data from disk once the raw datasets have been processed by the Preprocess module. It consists of two classes:

  • physioex.data.PhysioExDataset, which exposes the processed, on-disk version of one or more datasets as a PyTorch Dataset.
  • physioex.data.PhysioExDataModule, which turns those datasets into PyTorch DataLoaders ready for training.

Example of Usage

PhysioExDataset

The PhysioExDataset class is handled automatically by the PhysioExDataModule class when you use it for training or testing, so in most cases you don't need to interact with PhysioExDataset directly.

It is useful, however, when you need to visualize your data or extract samples to feed to Explainable AI algorithms.

In these cases you need to instantiate a PhysioExDataset:

from physioex.data import PhysioExDataset

data = PhysioExDataset(
    datasets = ["hmc"], # you can read different datasets merged together in this way
    preprocessing = "raw",  
    selected_channels = ["EEG", "EOG", "EMG"],     
    data_folder = "/your/data/path/",
)

# you can now access any sequence of epochs in the dataset
signal, label = data[0]

signal.shape # will be [21 (default sequence length), 3, 3000]
label.shape # will be [21]

Then you can use a Python plotting library to visualize the data.

Example

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# repeat each epoch label 3000 times so the hypnogram has one value per sample
hypnogram = np.ones((21, 3000)) * label.numpy().reshape(-1, 1)

# one row per plot: the hypnogram on top, then one row per channel
fig, ax = plt.subplots(4, 1, figsize = (21, 8), sharex="col", sharey="row")

hypnogram = hypnogram.reshape( -1 )
# lay the 21 epochs of each channel end to end: [21, 3, 3000] -> [3, 63000]
signals = signal.numpy().transpose(1, 0, 2).reshape(3, -1)

# first the hypnogram:
sns.lineplot( x = range(3000*21), y = hypnogram, ax = ax[0], color = "blue")
# then the channels:
sns.lineplot( x = range(3000*21), y = signals[ 0], ax = ax[1], color = "red")
sns.lineplot( x = range(3000*21), y = signals[ 1], ax = ax[2], color = "green")
sns.lineplot( x = range(3000*21), y = signals[ 2], ax = ax[3], color = "purple")

# check the examples notebook "visualize_data.ipynb" to see how to customize the plot properly

plt.tight_layout()
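
The reshaping used in the plotting example can be checked on synthetic data. The sketch below assumes only NumPy and uses a random array in place of `signal` and made-up stage ids in place of `label` (the five classes are an illustrative assumption). It shows that `transpose(1, 0, 2).reshape(3, -1)` lays the 21 epochs of each channel end to end, and that the broadcast hypnogram is the same as repeating each label 3000 times.

```python
import numpy as np

# Synthetic stand-ins for one item of the dataset (not real recordings).
seq_len, n_channels, n_samples = 21, 3, 3000
rng = np.random.default_rng(0)
signal = rng.standard_normal((seq_len, n_channels, n_samples))
label = rng.integers(0, 5, size=seq_len)  # hypothetical stage ids 0..4

# Channel-major layout: each row is one channel, epochs concatenated in time.
signals = signal.transpose(1, 0, 2).reshape(n_channels, -1)

# One label per sample, so the hypnogram can share the x axis with the signals.
hypnogram = (np.ones((seq_len, n_samples)) * label.reshape(-1, 1)).reshape(-1)

print(signals.shape)    # (3, 63000)
print(hypnogram.shape)  # (63000,)
```

With this layout, samples `3000 * e` to `3000 * (e + 1)` of row `c` are epoch `e` of channel `c`.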


PhysioExDataModule

The PhysioExDataModule class is designed to transform datasets into PyTorch DataLoaders ready for training. It handles the batching, shuffling, and splitting of the data into training, validation, and test sets.

To use the PhysioExDataModule, you need to instantiate it with the required parameters:

from physioex.data import PhysioExDataModule

datamodule = PhysioExDataModule(
    datasets=["hmc", "mass"],  # list of datasets to be used
    batch_size=64,             # batch size for the DataLoader
    preprocessing="raw",       # preprocessing method
    selected_channels=["EEG", "EOG", "EMG"],  # channels to be selected
    sequence_length=21,        # length of the sequence
    data_folder="/your/data/path/",  # path to the data folder
)

# get the DataLoaders
train_loader = datamodule.train_dataloader()
val_loader = datamodule.val_dataloader()
test_loader = datamodule.test_dataloader()
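
To see what such loaders yield without having processed data on disk, the sketch below builds a plain PyTorch `DataLoader` over synthetic tensors with the per-item shapes shown earlier (`[21, 3, 3000]` for the signal, `[21]` for the labels). `TensorDataset`, the sizes and the five label classes are illustrative assumptions and not part of PhysioEx; the point is only that batching adds a leading batch dimension.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a dataset of sequences: 8 sequences of 21 epochs,
# 3 channels, 3000 samples per epoch (values are random, not real data).
signals = torch.randn(8, 21, 3, 3000)
labels = torch.randint(0, 5, (8, 21))  # hypothetical stage ids 0..4

loader = DataLoader(TensorDataset(signals, labels), batch_size=4, shuffle=True)

# Each batch stacks items along a new leading dimension.
batch_signals, batch_labels = next(iter(loader))
print(batch_signals.shape)  # torch.Size([4, 21, 3, 3000])
print(batch_labels.shape)   # torch.Size([4, 21])
```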

PhysioEx is built on pytorch_lightning for model training and testing, so you can use PhysioExDataModule together with pl.Trainer:

from pytorch_lightning import Trainer

model = SomePytorchModel()  # placeholder: replace with your own LightningModule

trainer = Trainer(
    devices="auto",
    max_epochs=10,
    deterministic=True,
)

# setup the model in training mode if needed
model = model.train()
# Start training
trainer.fit(model, datamodule=datamodule)
results = trainer.test( model, datamodule = datamodule)

Documentation

PhysioExDataset

Bases: Dataset

A PyTorch Dataset class for handling physiological data from multiple datasets.

Attributes:

Name Type Description
datasets List[str]

List of dataset names.

L int

Sequence length.

channels_index List[int]

Indices of selected channels.

readers List[DataReader]

List of DataReader objects for each dataset.

tables List[DataFrame]

List of data tables for each dataset.

dataset_idx ndarray

Array indicating the dataset index for each sample.

target_transform Callable

Optional transform to be applied to the target.

len int

Total number of samples across all datasets.
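
One plausible way `channels_index` relates to the constructor arguments is a name-to-position lookup: each entry of `selected_channels` is located in `indexed_channels`, the order in which channels were stored on disk. The helper below is a hypothetical sketch of that idea, not the library's code.

```python
from typing import List

def channel_indices(selected: List[str], indexed: List[str]) -> List[int]:
    """Map channel names to their positions in the stored channel order."""
    missing = [name for name in selected if name not in indexed]
    if missing:
        raise ValueError(f"unknown channels: {missing}")
    return [indexed.index(name) for name in selected]

stored_order = ["EEG", "EOG", "EMG", "ECG"]  # default order quoted in the docs
print(channel_indices(["EEG", "EMG"], stored_order))  # [0, 2]
```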

Methods:

Name Description
__len__

Returns the total number of samples.

split

Splits the data into train, validation, and test sets.

get_num_folds

Returns the minimum number of folds across all datasets.

__getitem__

Returns the input and target for a given index.

get_sets

Returns the indices for the train, validation, and test sets.

__init__()

Initializes the PhysioExDataset.

Parameters:

Name Type Description Default
datasets List[str]

List of dataset names.

required
data_folder str

Path to the folder containing the data.

required
preprocessing str

Type of preprocessing to apply. Defaults to "raw".

"raw"
selected_channels List[str]

List of selected channels. Defaults to ["EEG"].

["EEG"]
sequence_length int

Length of the sequence. Defaults to 21.

21
target_transform Callable

Optional transform to be applied to the target. Defaults to None.

None
hpc bool

Flag indicating whether to use high-performance computing. Defaults to False.

False
indexed_channels List[str]

List of indexed channels. Defaults to ["EEG", "EOG", "EMG", "ECG"]. If you used a custom Preprocessor and saved your signal channels in a different order, provide the correct order here. Otherwise, ignore this parameter.

["EEG", "EOG", "EMG", "ECG"]
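
`target_transform` can be any callable that maps the label sequence to the target the model should learn. As an illustration, the hypothetical function below keeps only the label of the central epoch, which turns a sequence-to-sequence target into a sequence-to-epoch one. It relies only on `len` and indexing, so it should work for a list, a NumPy array or a tensor alike. Whether the library passes anything besides the labels to the transform is not stated here, so treat the single-argument signature as an assumption.

```python
def central_label(labels):
    """Return the label of the middle epoch of a sequence."""
    return labels[len(labels) // 2]

# Toy label sequence of length 21 (values are made up).
toy_labels = [0] * 10 + [2] + [1] * 10
print(central_label(toy_labels))  # 2
```

Under that assumption it would be passed as `target_transform=central_label` when constructing the dataset.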

__getitem__(idx)

Returns the input and target sequence for a given index.

Parameters:

Name Type Description Default
idx int

Index of the sample to retrieve.

required

Returns:

Name Type Description
tuple

Input and target for the given index.

__len__()

Returns the total number of sequences of epochs across all the datasets.

Returns:

Name Type Description
int

Total number of sequences.
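
The length counts sequences, not individual epochs. A common way to build such sequences, assumed here for illustration only, is a sliding window of `sequence_length` consecutive epochs with stride 1 inside each recording, so a recording with `n` epochs contributes `n - sequence_length + 1` sequences. The recording names and sizes below are made up.

```python
from typing import Dict

def count_sequences(epochs_per_recording: Dict[str, int], sequence_length: int = 21) -> int:
    """Total number of stride-1 windows over all recordings."""
    total = 0
    for n_epochs in epochs_per_recording.values():
        # Recordings shorter than one window contribute nothing.
        total += max(n_epochs - sequence_length + 1, 0)
    return total

recordings = {"subject_a": 900, "subject_b": 1000, "subject_c": 15}
print(count_sequences(recordings))  # 1860
```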

split(fold=-1, dataset_idx=-1)

Splits the data into train, validation, and test sets.

The behaviour depends on the two arguments:

  • if fold is -1 and dataset_idx is -1: set the split to a random fold for each dataset
  • if fold is -1 and dataset_idx is not -1: set the split to a random fold for the selected dataset
  • if fold is not -1 and dataset_idx is -1: set the split to the selected fold for each dataset
  • if fold is not -1 and dataset_idx is not -1: set the split to the selected fold for the selected dataset

Parameters:

Name Type Description Default
fold int

Fold number to use for splitting. Defaults to -1.

-1
dataset_idx int

Index of the dataset to split. Defaults to -1.

-1
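
The four cases of `split` can be summarised as a small decision function. The sketch below is a hypothetical restatement of those rules, not the library's implementation: it takes the number of folds available for each dataset and returns the fold chosen for each dataset that gets split, leaving the others out of the result.

```python
import random
from typing import Dict, List, Optional

def choose_folds(num_folds: List[int], fold: int = -1, dataset_idx: int = -1,
                 rng: Optional[random.Random] = None) -> Dict[int, int]:
    """Return {dataset index: fold} following the four documented cases."""
    rng = rng or random.Random()
    # dataset_idx == -1 means "every dataset", otherwise only the selected one.
    targets = range(len(num_folds)) if dataset_idx == -1 else [dataset_idx]
    chosen = {}
    for i in targets:
        # fold == -1 means "pick a random fold", otherwise use the given fold.
        chosen[i] = rng.randrange(num_folds[i]) if fold == -1 else fold
    return chosen

folds_available = [5, 10]  # made-up fold counts for two datasets
print(choose_folds(folds_available, fold=2))                 # {0: 2, 1: 2}
print(choose_folds(folds_available, fold=3, dataset_idx=1))  # {1: 3}
```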

get_num_folds()

Returns the minimum number of folds across all datasets.

Returns:

Name Type Description
int

Minimum number of folds.

get_sets()

Returns the indices for the train, validation, and test sets.

Returns:

Name Type Description
tuple

Indices for the train, validation, and test sets.
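
Together, `get_num_folds`, `split` and `get_sets` support a cross-validation loop. The sketch below runs that loop against `FakeDataset`, a hypothetical stand-in that only mimics the three method names so the example runs without data on disk. Its contiguous-block fold assignment is invented and says nothing about how PhysioEx actually assigns samples to folds.

```python
from typing import List, Tuple

class FakeDataset:
    """Hypothetical stand-in exposing split / get_num_folds / get_sets."""

    def __init__(self, n_samples: int = 10, n_folds: int = 5):
        self.n_samples, self.n_folds, self.fold = n_samples, n_folds, 0

    def get_num_folds(self) -> int:
        return self.n_folds

    def split(self, fold: int = -1, dataset_idx: int = -1) -> None:
        self.fold = 0 if fold == -1 else fold

    def get_sets(self) -> Tuple[List[int], List[int], List[int]]:
        size = self.n_samples // self.n_folds
        test = list(range(self.fold * size, (self.fold + 1) * size))
        nxt = (self.fold + 1) % self.n_folds
        valid = list(range(nxt * size, (nxt + 1) * size))
        train = [i for i in range(self.n_samples) if i not in test + valid]
        return train, valid, test

data = FakeDataset()
for fold in range(data.get_num_folds()):
    data.split(fold=fold)  # select the fold, then read the resulting indices
    train_idx, valid_idx, test_idx = data.get_sets()
    print(fold, len(train_idx), len(valid_idx), len(test_idx))
```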


PhysioExDataModule

Bases: LightningDataModule

A PyTorch Lightning DataModule for handling physiological data from multiple datasets.

Attributes:

Name Type Description
datasets_id List[str]

List of dataset names.

num_workers int

Number of workers for data loading.

dataset PhysioExDataset

The dataset object.

batch_size int

Batch size for the DataLoader.

hpc bool

Flag indicating whether to use high-performance computing.

train_dataset Union[PhysioExDataset, Subset]

Training dataset.

valid_dataset Union[PhysioExDataset, Subset]

Validation dataset.

test_dataset Union[PhysioExDataset, Subset]

Test dataset.

train_sampler Union[SubsetRandomSampler, Subset]

Sampler for the training dataset.

valid_sampler Union[SubsetRandomSampler, Subset]

Sampler for the validation dataset.

test_sampler Union[SubsetRandomSampler, Subset]

Sampler for the test dataset.

Methods:

Name Description
setup

Sets up the datasets for different stages.

train_dataloader

Returns the DataLoader for the training dataset.

val_dataloader

Returns the DataLoader for the validation dataset.

test_dataloader

Returns the DataLoader for the test dataset.

__init__()

Initializes the PhysioExDataModule.

Parameters:

Name Type Description Default
datasets List[str]

List of dataset names.

required
batch_size int

Batch size for the DataLoader. Defaults to 32.

32
preprocessing str

Type of preprocessing to apply. Defaults to "raw".

"raw"
selected_channels List[str]

List of selected channels. Defaults to ["EEG"].

["EEG"]
sequence_length int

Length of the sequence. Defaults to 21.

21
target_transform Callable

Optional transform to be applied to the target. Defaults to None.

None
folds Union[int, List[int]]

Fold number(s) for splitting the data. Defaults to -1.

-1
data_folder str

Path to the folder containing the data. Defaults to None.

None
num_nodes int

Number of nodes for distributed training. Defaults to 1.

1
num_workers int

Number of workers for data loading. Defaults to os.cpu_count().

os.cpu_count()
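
The `folds` parameter accepts either a single integer or a list. One plausible reading, assumed here and not confirmed by this page, is that a single integer applies to every dataset while a list gives one fold per dataset. The hypothetical helper below turns either form into the `(fold, dataset_idx)` pairs that could then be passed to `PhysioExDataset.split`.

```python
from typing import List, Tuple, Union

def plan_splits(folds: Union[int, List[int]], n_datasets: int) -> List[Tuple[int, int]]:
    """Return the (fold, dataset_idx) pairs that would be passed to split()."""
    if isinstance(folds, int):
        # A single fold applies to every dataset (dataset_idx = -1).
        return [(folds, -1)]
    if len(folds) != n_datasets:
        raise ValueError("expected one fold per dataset")
    return [(fold, idx) for idx, fold in enumerate(folds)]

print(plan_splits(-1, n_datasets=2))      # [(-1, -1)]
print(plan_splits([0, 3], n_datasets=2))  # [(0, 0), (3, 1)]
```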

train_dataloader()

Returns the DataLoader for the training dataset.

Returns:

Name Type Description
DataLoader

DataLoader for the training dataset.

test_dataloader()

Returns the DataLoader for the test dataset.

Returns:

Name Type Description
DataLoader

DataLoader for the test dataset.