Preprocess Module

The preprocess module implements a standard API for importing custom or benchmark sleep staging datasets into PhysioEx and serializing them into a PhysioExDataset. This functionality is provided by the physioex.preprocess.preprocessor.Preprocessor class.

The Preprocessor class is designed to facilitate the preprocessing of physiological datasets. This class provides methods for downloading datasets, reading subject records, applying preprocessing functions, and organizing the data into a structured format suitable for machine learning tasks.

Example Usage

To preprocess a dataset using a preprocessing function:

# Preprocessor is an abstract class, so you need an implementation of it.
# This can be your own Preprocessor or one of those available in PhysioEx
from physioex.preprocess.hmc import HMCPreprocessor  

# Define the preprocessing functions: each one is just a Callable applied to each signal
from physioex.preprocess.utils.signal import xsleepnet_preprocessing

# Initialize Preprocessor
preprocessor = HMCPreprocessor(
    preprocessors_name = ["xsleepnet"], # the name of the preprocessor
    preprocessors = [xsleepnet_preprocessing], # the callable preprocessing method
    preprocessor_shape = [[4, 29, 129]], # the output shape of the signal after preprocessing;
                                         # the first element (4) is the number of
                                         # channels available in the dataset. In HMC there are 4.
    data_folder = "/your/data/path/"
)

# Run preprocessing

# at this point you can read the dataset from the disk using PhysioExDataset

from import PhysioExDataset

data = PhysioExDataset(
    datasets = ["hmc"],
    preprocessing = "xsleepnet",    # can also be "raw", because the Preprocessor
                                    # always saves the raw data as well
    selected_channels = ["EEG", "EOG", "EMG", "ECG"], # in case you want to read
                                                      # all the available channels
    data_folder = "/your/data/path/",
)

# you can now access any sequence of epochs in the dataset

signal, label = data[0]

signal.shape # will be [21 (default sequence length), 4, 29, 129]
label.shape # will be [21]


If you want to use the standard PhysioEx implementation of the preprocessor, which saves the data both in the raw format and in the format proposed by XSleepNet, you can use the CLI tool. Once the library is installed, just type:

preprocess --dataset hmc --data_folder "/your/data/path/"

The list of available datasets is:

Note that for the HMC and DCSM datasets the library will download the dataset automatically if it is not already available in /your/data/path/.


Preprocessing script for preparing datasets.

This script allows you to preprocess datasets for training and testing models.


$ preprocess [PARAMS]

You can use the preprocess -h (or preprocess --help) command to access the command documentation.


Parameters:

--dataset (str): The name of the dataset to preprocess. Defaults to "hmc". Note: the dataset name should be one of the supported datasets (e.g., "hmc", "mass", "shhs", "mesa", "mros", "dcsm"). If a custom dataset is used, use the preprocessor argument.

--data_folder (str): The absolute path of the directory where the PhysioEx datasets are stored; if None, the home directory is used. Defaults to None. Note: provide the path to the directory containing the datasets.

--preprocessor (str): The name of the preprocessor in case of a custom Preprocessor. Defaults to None. Note: the preprocessor should extend physioex.preprocess.preprocessor:Preprocessor and be passed as a string in the format

--config (str): The path to the .yaml configuration file containing the options used to preprocess the dataset. Defaults to None. Note: the configuration file can override command line arguments. You can also specify the preprocessor_kwargs in the configuration file.
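As a rough illustration of how the options above fit together, the sketch below mirrors them with Python's argparse. It is not PhysioEx's actual parser: the function name build_parser is hypothetical, and only the option names and defaults are taken from the descriptions above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: mirrors the documented options, not PhysioEx's own code.
    parser = argparse.ArgumentParser(prog="preprocess")
    parser.add_argument("--dataset", type=str, default="hmc")
    parser.add_argument("--data_folder", type=str, default=None)
    parser.add_argument("--preprocessor", type=str, default=None)
    parser.add_argument("--config", type=str, default=None)
    return parser


# Parse the same arguments used in the command-line examples of this page.
args = build_parser().parse_args(
    ["--dataset", "mass", "--data_folder", "/path/to/datasets"]
)
print(args.dataset, args.data_folder, args.preprocessor, args.config)
```

Options that are not passed keep the documented defaults, so calling the parser with no arguments selects the "hmc" dataset.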


$ preprocess --dataset mass --data_folder /path/to/datasets
This command preprocesses the mass dataset using the MASSPreprocessor preprocessor.

For HMC and DCSM datasets, PhysioEx will automatically download the datasets. The other datasets need to be obtained first; most of them are easily accessible from

The SHHS and MASS datasets need to be further processed after download with the script in:


Once you obtain the mat/ folder using these processing scripts, place it into data_folder/dataset_name/mat/ and run the preprocess command.

The command can use a .yaml configuration file to specify the preprocessor_kwargs:

    dataset: null
    data_folder : /path/to/your/data
    preprocessor : physioex.preprocess.hmc:HMCPreprocessor # can be also your custom preprocessor
        # signal_shape: [4, 3000]
            - "your_preprocessor"
            - "xsleepnet"
            - physioex.preprocess.utils.signal:xsleepnet_preprocessing
            - [4, 3000]
            - [4, 3000]            
            - [4, 29, 129]
  • Ensure that the datasets are properly formatted and stored in the specified data folder using the preprocess script.
  • The configuration file, if provided, should be in YAML format and contain valid key-value pairs for the script options.
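The preprocessor entry in the configuration above is written as a module:Class string. One common way to turn such a string into a Python object is importlib. The sketch below shows the idea with a hypothetical helper, resolve_target, applied to a standard-library class so that it runs without PhysioEx installed. Whether PhysioEx resolves the string in exactly this way is an assumption.

```python
import importlib


def resolve_target(target: str):
    # Hypothetical helper: split "package.module:Name" and import the named attribute.
    module_name, _, attr_name = target.partition(":")
    if not module_name or not attr_name:
        raise ValueError(f"expected 'module:Name', got {target!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Standard-library stand-in for a string such as
# "physioex.preprocess.hmc:HMCPreprocessor"
cls = resolve_target("collections:OrderedDict")
print(cls.__name__)
```

Keeping the target as a string lets the configuration file name a custom class without the command itself having to import it in advance.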

Extending the Preprocessor Class

To build your own preprocessor, you should extend the Preprocessor class.

For instance, let's consider how we extended the Preprocessor class to preprocess the HMC dataset:

import os
from typing import List, Tuple

import numpy as np

from physioex.preprocess.preprocessor import Preprocessor
from physioex.preprocess.utils.signal import xsleepnet_preprocessing

class HMCPreprocessor(Preprocessor):
    def __init__(self, 
            preprocessors_name: List[str] = ["xsleepnet"],
            preprocessors = [xsleepnet_preprocessing],
            preprocessor_shape = [[4, 29, 129]],
            data_folder: str = None
        ):
        # calls the Preprocessor constructor, required at the end of your custom setup
        super().__init__(
            dataset_name="hmc",     # this is the name of the dataset PhysioEx will use 
                                    # as PhysioExDataset( datasets=[dataset_name] )
            signal_shape=[4, 3000], # PhysioEx reads sleep epochs of 30 seconds sampled at 100Hz. 
                                    # 4 is the total number of channels available in the dataset
            preprocessors_name=preprocessors_name,
            preprocessors=preprocessors,
            preprocessors_shape=preprocessor_shape,
            data_folder=data_folder,
        )

    def download_dataset(self) -> None: 
        # download the dataset into the data_folder/download/
        # extract the zip 

        download_dir = os.path.join(self.dataset_folder, "download")

        if not os.path.exists(download_dir):
            os.makedirs(download_dir, exist_ok=True)

            zip_file = os.path.join(self.dataset_folder, "")

            if not os.path.exists(zip_file):

            extract_large_zip(zip_file, download_dir)

    def get_subjects_records(self) -> List[str]:
        # read the RECORDS file in the extracted directory and return the list of available records
        subjects_dir = os.path.join(

        records_file = os.path.join(subjects_dir, "RECORDS")

        with open(records_file, "r") as file:
            records = file.readlines()

        records = [record.rstrip("\n") for record in records]

        return records

    def read_subject_record(self, record: str) -> Tuple[np.array, np.array]:
        # read each RECORD ( which is an .edf file ) and return its signal and labels
        return read_edf(

The following pseudocode sketches a possible implementation of the read_edf method using pyedflib:

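# an example of a read_edf method
# Note: if you use a dataset from NSRR you can directly use 
# physioex.preprocess.utils.sleepdata:process_sleepdata_file
# if you want to read different channels you can choose them here
import numpy as np
import pyedflib
from scipy.signal import filtfilt, firwin, resample

from physioex.preprocess.utils.signal import bandpass_filter
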
stages_map = [  # used to map each stage in the annotation file
                # to an identifier ( the index of the list )
    "Sleep stage W",
    "Sleep stage N1",
    "Sleep stage N2",
    "Sleep stage N3",
    "Sleep stage R",
]

fs = 256    # sampling frequency of the signal being read;
            # it can be read directly from the .edf file if you have
            # different sampling frequencies for different channels

# the channels you want to read and preprocess in your dataset

def read_edf(file_path):

    # read the annotations
    # Note: typically record and annotation files have the same name.
    #       a best practice could be to store them as:
    #       filepath_stages.edf
    #       filepath_signal.edf
    #       in the same directory, 
    #       if this is not the case in your dataset consider returning a tuple
    #       in the get_subjects_records method : 
    #       file_path = ( signal_path, annotation_path )

    stages_path = file_path + "_stages.edf" 
    signal_path = file_path + "_signal.edf"

    f = pyedflib.EdfReader(stages_path)
    _, _, annt = f.readAnnotations()

    # convert the annotations from string to the index of stages_map
    stages = []
    for a in annt:
        if a in stages_map:
            stages.append(stages_map.index(a))

    # convert it to a numpy array 
    stages = np.reshape(np.array(stages).astype(int), (-1))

    # read the signal
    f = pyedflib.EdfReader(signal_path)
    buffer = []
    for indx, modality in enumerate(AVAILABLE_CHANNELS):

        signal = f.readSignal( modality ).reshape( -1 )

        # filtering
        # band-pass the signal between 0.3 and 40 Hz
        # you can use physioex.preprocess.utils.signal:bandpass_filter
        if modality != "EMG":
            signal = bandpass_filter(signal, 0.3, 40, fs)
        else:
            # for the EMG signal, high-pass filter at 10 Hz
            b_band = firwin(101, 10, pass_zero=False, fs=fs)
            signal = filtfilt(b_band, 1, signal)

        # resampling
        # 100 Hz * 30 sec * num_epochs ( stages.shape[0] )
        # you can use scipy.signal.resample
        signal = resample(signal, num=30 * 100 * stages.shape[0])

        # windowing
        signal = signal.reshape(-1, 3000)
        buffer.append(signal)

    buffer = np.array(buffer) # shape is len(AVAILABLE_CHANNELS), num_epochs, 3000
    signal = np.transpose(buffer, (1, 0, 2)) #  num_epochs, len(AVAILABLE_CHANNELS), 3000
    del buffer

    # now you should check if Wake is the biggest class for your subject
    count_stage = np.bincount(stages)
    if count_stage[0] > max(count_stage[1:]):  # Wake is the biggest class
        second_largest = max(count_stage[1:])

        W_ind = stages == 0  # W indices
        last_evening_W_index = np.where(np.diff(W_ind) != 0)[0][0] + 1
        if stages[0] == 0:  # only true if the first epoch is W
            num_evening_W = last_evening_W_index
        else:
            num_evening_W = 0

        first_morning_W_index = np.where(np.diff(W_ind) != 0)[0][-1] + 1
        num_morning_W = len(stages) - first_morning_W_index + 1

        nb_pre_post_sleep_wake_eps = num_evening_W + num_morning_W
        if nb_pre_post_sleep_wake_eps > second_largest:
            total_W_to_remove = nb_pre_post_sleep_wake_eps - second_largest
            if num_evening_W > total_W_to_remove:
                stages = stages[total_W_to_remove:]
                signal = signal[total_W_to_remove:]
            else:
                evening_W_to_remove = num_evening_W
                morning_W_to_remove = total_W_to_remove - evening_W_to_remove
                stages = stages[evening_W_to_remove : len(stages) - morning_W_to_remove]
                signal = signal[evening_W_to_remove : len(signal) - morning_W_to_remove]

    return signal, stages
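The resampling and windowing steps can be tried in isolation on synthetic data. The sketch below uses NumPy only, so linear interpolation (np.interp) stands in for scipy.signal.resample and the result will differ numerically from it. The function name to_epochs and its arguments are hypothetical.

```python
import numpy as np


def to_epochs(signal, fs, num_epochs, target_fs=100, epoch_sec=30):
    """Resample a 1-D signal to target_fs and cut it into fixed-length epochs."""
    n_out = target_fs * epoch_sec * num_epochs
    duration = len(signal) / fs  # seconds covered by the input
    t_in = np.arange(len(signal)) / fs
    t_out = np.linspace(0, duration, n_out, endpoint=False)
    # Linear interpolation onto the new time grid (stand-in for scipy.signal.resample).
    resampled = np.interp(t_out, t_in, signal)
    return resampled.reshape(num_epochs, target_fs * epoch_sec)


fs = 256
num_epochs = 4
# Synthetic 1 Hz sine wave covering exactly num_epochs epochs of 30 seconds.
raw = np.sin(2 * np.pi * 1.0 * np.arange(fs * 30 * num_epochs) / fs)
epochs = to_epochs(raw, fs, num_epochs)
print(epochs.shape)
```

Stacking one such array per channel and transposing, as in the pseudocode above, gives the [num_epochs, n_channels, 3000] layout.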


The methods that the user needs to reimplement to extend the Preprocessor class are:



Initializes the Preprocessor class.


Parameters:

dataset_name (str): The name of the dataset to be processed.

signal_shape (List[int]): A list containing two elements representing the number of channels and the number of timestamps in the signal.

preprocessors_name (List[str]): A list of names for the preprocessing functions.

preprocessors (List[Callable]): A list of callable preprocessing functions to be applied to the signals.

preprocessors_shape (List[List[int]]): A list of shapes corresponding to the output of each preprocessing function.

data_folder (str): The folder where the dataset is stored. If None, the default data folder is used.



Reads a subject's record and returns a tuple containing the signal and labels.

(Required) Method should be provided by the user.


Parameters:

record (str): The path to the subject's record.



Returns:

Tuple[np.array, np.array]: A tuple containing the signal and labels with shapes [n_windows, n_channels, n_timestamps] and [n_windows], respectively. If the record should be skipped, the function should return None, None.
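The return contract above can be checked before a record is saved. The helper below, check_record, is a hypothetical illustration written with NumPy and is not part of PhysioEx. It assumes signal_shape is the [n_channels, n_timestamps] pair passed to the constructor.

```python
from typing import List, Optional

import numpy as np


def check_record(
    signal: Optional[np.ndarray],
    labels: Optional[np.ndarray],
    signal_shape: List[int],
) -> bool:
    """Return True if the record is usable, False if it should be skipped."""
    # (None, None) is the documented way to skip a record.
    if signal is None or labels is None:
        return False
    n_channels, n_timestamps = signal_shape
    if signal.ndim != 3 or signal.shape[1:] != (n_channels, n_timestamps):
        raise ValueError(f"unexpected signal shape {signal.shape}")
    if labels.shape != (signal.shape[0],):
        raise ValueError(f"unexpected labels shape {labels.shape}")
    return True


# 10 epochs, 4 channels, 30 seconds at 100 Hz
signal = np.zeros((10, 4, 3000))
labels = np.zeros(10, dtype=int)
print(check_record(signal, labels, [4, 3000]))
print(check_record(None, None, [4, 3000]))
```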


Downloads the dataset if it is not already present on disk.

(Optional) Method to be implemented by the user.


Customizes the dataset table before saving it.

(Optional) Method to be provided by the user.


Parameters:

table (DataFrame): The dataset table to be customized.



Returns:

pd.DataFrame: The customized dataset table.


Returns the train, validation, and test subjects.

(Optional) Method to be provided by the user. By default, the method splits the subjects randomly with 70% for training, 15% for validation, and 15% for testing.


Returns:

Tuple[List, List, List]: A tuple containing the train, validation, and test subjects.
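A custom split only needs to return three lists of subjects. The sketch below reproduces a random 70/15/15 split with the standard library. The function name split_subjects, the seed argument, and the rounding rule are assumptions made for illustration, not the library's implementation.

```python
import random
from typing import List, Sequence, Tuple


def split_subjects(
    subjects: Sequence[int], seed: int = 42
) -> Tuple[List[int], List[int], List[int]]:
    # Shuffle a copy so the caller's sequence is left untouched.
    shuffled = list(subjects)
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 70 // 100
    n_valid = len(shuffled) * 15 // 100

    train = shuffled[:n_train]
    valid = shuffled[n_train : n_train + n_valid]
    test = shuffled[n_train + n_valid :]  # the remainder goes to the test set
    return train, valid, test


train, valid, test = split_subjects(range(100))
print(len(train), len(valid), len(test))
```

Splitting by subject rather than by epoch keeps all the epochs of one subject in the same set.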