Preprocess Module
The preprocess module implements a standard API for importing custom or benchmark sleep staging datasets into PhysioEx and serializing them into a PhysioExDataset. This functionality is provided by the physioex.preprocess.preprocessor.Preprocessor class.
The Preprocessor class facilitates the preprocessing of physiological datasets. It provides methods for downloading datasets, reading subject records, applying preprocessing functions, and organizing the data into a structured format suitable for machine learning tasks.
Example Usage
If you want to preprocess a dataset using a preprocessing function
CLI
If you want to use the standard PhysioEx implementation of the preprocessor, which saves the data both in the raw format and in the format proposed by XSleepNet, you can use the CLI tool. Once the library is installed, just type:
The available datasets are:
- SHHS (Sleep Heart Health Study)
- MASS (Montreal Archive of Sleep Studies)
- MESA (Multi-Ethnic Study of Atherosclerosis)
- MrOS (The Osteoporotic Fractures in Men Study)
- HMC (Haaglanden Medisch Centrum)
- DCSM (Danish Center for Sleep Medicine)
Note that for the HMC and DCSM datasets, the library will download the dataset into /your/data/path/ if it is not already available there.
preprocess
Preprocessing script for preparing datasets.
This script allows you to preprocess datasets for training and testing models.
Usage
$ preprocess [PARAMS]
You can use the preprocess -h or preprocess --help command to access the command documentation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
--dataset | str | The name of the dataset to preprocess. Defaults to "hmc". Note: The dataset name should be one of the supported datasets (e.g., "hmc", "mass", "shhs", "mesa", "mros", "dcsm"). If a custom dataset is used use the | required |
--data_folder | str | The absolute path of the directory where the PhysioEx datasets are stored; if None, the home directory is used. Defaults to None. Note: Provide the path to the directory containing the datasets. | required |
--preprocessor | str | The name of the preprocessor in case of a custom Preprocessor. Defaults to None. Note: The preprocessor should extend | required |
--config | str | The path to the .yaml configuration file that stores the options used to preprocess the dataset. Defaults to None. Note: The configuration file can override command line arguments. You can also specify the preprocessor_kwargs in the configuration file. | required |
Example
mass dataset using the MASSPreprocessor preprocessor.
For the HMC and DCSM datasets, PhysioEx will automatically download the data. The other datasets need to be obtained first; most of them are easily accessible from sleepdata.org.
The SHHS and MASS datasets need to be further processed after download with the script in:
Once you have obtained the mat/ folder using these processing scripts, place it into data_folder/dataset_name/mat/ and run the preprocess command.
The command can use a .yaml configuration file to specify the preprocessor_kwargs:
Notes
- Ensure that the datasets are properly formatted and stored in the specified data folder using the preprocess script.
- The configuration file, if provided, should be in YAML format and contain valid key-value pairs for the script options.
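The note that the configuration file can override command line arguments can be illustrated with a minimal sketch. This is not PhysioEx's actual implementation: the flag names come from the parameter table above, while `merge_options`, the `example_option` key, and the in-memory `config` dictionary (standing in for an already parsed .yaml file) are hypothetical names introduced only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the parameter table; defaults follow the descriptions there.
    parser = argparse.ArgumentParser(prog="preprocess")
    parser.add_argument("--dataset", type=str, default="hmc")
    parser.add_argument("--data_folder", type=str, default=None)
    parser.add_argument("--preprocessor", type=str, default=None)
    parser.add_argument("--config", type=str, default=None)
    return parser


def merge_options(cli_args: argparse.Namespace, config: dict) -> dict:
    """Hypothetical helper: values from the config take precedence over the CLI."""
    options = vars(cli_args).copy()
    options.setdefault("preprocessor_kwargs", {})
    options.update(config)  # config keys overwrite command line values
    return options


# Command line as it might be typed by the user.
cli_args = build_parser().parse_args(
    ["--dataset", "hmc", "--data_folder", "/your/data/path/"]
)

# Stand-in for the parsed .yaml file; "example_option" is a made-up key.
config = {"dataset": "mass", "preprocessor_kwargs": {"example_option": 1}}

options = merge_options(cli_args, config)
print(options["dataset"], options["data_folder"])
```

In this sketch the dataset named in the configuration replaces the one given on the command line, while options absent from the configuration keep their command line values.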
Extending the Preprocessor Class
To build your own preprocessor, you should extend the Preprocessor class.
For instance, let's consider how we extended the Preprocessor class to preprocess the HMC dataset.
Here is the pseudocode for a possible implementation of the read_edf method using pyedflib:
Documentation
The methods that the user needs to reimplement to extend the Preprocessor class are:
Preprocessor
__init__()
Initializes the Preprocessor class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_name | str | The name of the dataset to be processed. | required |
signal_shape | List[int] | A list containing two elements representing the number of channels and the number of timestamps in the signal. | required |
preprocessors_name | List[str] | A list of names for the preprocessing functions. | required |
preprocessors | List[Callable] | A list of callable preprocessing functions to be applied to the signals. | required |
preprocessors_shape | List[List[int]] | A list of shapes corresponding to the output of each preprocessing function. | required |
data_folder | str | The folder where the dataset is stored. If None, the default data folder is used. | required |
read_subject_record()
Reads a subject's record and returns a tuple containing the signal and labels.
(Required) Method should be provided by the user.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record | str | The path to the subject's record. | required |
Returns:
Type | Description |
---|---|
Tuple[np.array, np.array] | A tuple containing the signal and labels with shapes [n_windows, n_channels, n_timestamps] and [n_windows], respectively. If the record should be skipped, the function should return None, None. |
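The return contract above can be sketched as follows. This is only an illustration under stated assumptions, not the library's code: `window_record` is a hypothetical helper, the input is assumed to be a continuous recording of shape [n_channels, n_samples] with one label per epoch, and the 30-second epoch length is an assumed default.

```python
from typing import Optional, Sequence, Tuple

import numpy as np


def window_record(
    signal: np.ndarray,
    stages: Sequence[int],
    fs: int,
    epoch_seconds: int = 30,
) -> Tuple[Optional[np.ndarray], Optional[np.ndarray]]:
    """Hypothetical helper: cut a [n_channels, n_samples] recording into epochs.

    Returns (windows, labels) with shapes [n_windows, n_channels, n_timestamps]
    and [n_windows], or (None, None) when no complete epoch is available.
    """
    samples_per_epoch = fs * epoch_seconds
    # Keep only epochs that have both a full window of samples and a label.
    n_windows = min(signal.shape[1] // samples_per_epoch, len(stages))
    if n_windows == 0:
        return None, None  # signals that the record should be skipped

    n_channels = signal.shape[0]
    trimmed = signal[:, : n_windows * samples_per_epoch]
    # [n_channels, n_windows, n_timestamps] -> [n_windows, n_channels, n_timestamps]
    windows = trimmed.reshape(n_channels, n_windows, samples_per_epoch)
    windows = windows.transpose(1, 0, 2)
    labels = np.asarray(stages[:n_windows])
    return windows, labels


# Toy recording: 2 channels, 100 samples, 10 Hz, 3-second epochs for brevity.
signal = np.arange(200, dtype=float).reshape(2, 100)
windows, labels = window_record(signal, [0, 1, 2, 3], fs=10, epoch_seconds=3)
print(windows.shape, labels.shape)
```

A read_subject_record implementation would typically load the signal and annotations from the record path first, then apply a step of this kind before returning.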
download_dataset()
Downloads the dataset if it is not already present on disk.
(Optional) Method to be implemented by the user.
customize_table()
Customizes the dataset table before saving it.
(Optional) Method to be provided by the user.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | DataFrame | The dataset table to be customized. | required |
Returns:
Type | Description |
---|---|
pd.DataFrame | The customized dataset table. |
get_sets()
Returns the train, validation, and test subjects.
(Optional) Method to be provided by the user. By default, the method splits the subjects randomly with 70% for training, 15% for validation, and 15% for testing.
Returns:
Type | Description |
---|---|
Tuple[List, List, List] | A tuple containing the train, validation, and test subjects. |
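The default 70/15/15 random split described for get_sets() could look roughly like the sketch below. It is an assumption-laden illustration rather than the library's implementation: `split_subjects` and its `seed` argument are hypothetical, and how PhysioEx rounds the split sizes or seeds its random generator is not stated in this page.

```python
import random
from typing import List, Sequence, Tuple


def split_subjects(
    subjects: Sequence[int], seed: int = 42
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical helper: shuffle subjects and split them 70% / 15% / 15%."""
    shuffled = list(subjects)
    # A local Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)

    # Integer arithmetic avoids floating point rounding surprises.
    n_train = len(shuffled) * 70 // 100
    n_valid = len(shuffled) * 15 // 100

    train = shuffled[:n_train]
    valid = shuffled[n_train : n_train + n_valid]
    test = shuffled[n_train + n_valid :]  # the remainder goes to the test set
    return train, valid, test


train, valid, test = split_subjects(range(100))
print(len(train), len(valid), len(test))
```

Overriding get_sets() is useful when a dataset ships with predefined splits; in that case the method would return those subject lists instead of a random partition.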