Data Preparation#
The DataManager
object is in charge of handling data preparation and splitting.
Import DataManager#
from pyshred import DataManager
Initialize Data Manager#
The DataManager
object takes in lags
(int), train_size
(float), val_size
(float), and test_size
(float). Lags
represents the length of each sensor sequence being fed into the sequence model. train_size
, val_size
, and test_size
are the proportions used to split the data into train, validation, and test datasets. These fields three arguments must sum up to 1.0.
Example: Each input will be a sequence of 52 sensor measurement timesteps.
manager = DataManager(
lags=52, # build lagged sequences of 52 sensor measurement
train_size=0.8, # use 80% of the data for training
val_size=0.1, # use 10% of the data for validation
test_size=0.1 # use 10% of the data for testing
)
Add Data#
The add_data
method takes in:
data
: takes in a file path (.npy or .npz), a numpy array, or a torch.Tensor. The data should be array-like with time on the first axis (axis 0) followed by the spatial axes.id
: a str to identify the dataset by. Each dataset must have a uniqueid
.random (optional)
: number of randomly placed sensorsstationary (optional)
: a list of stationary sensor locations. Each stationary sensor location should be a tuple representing its spatial coordinates.mobile (optional)
: a list of mobile sensor locations. Each mobile sensor location should be a list representing its spatial coordinates at each timestep. The length of each mobile sensor location list should be equivalent to the size of the length of the first axis (axis 0) indata
.measurements (optional)
: a 2D NumPy array (T, s) of raw sensor measurementscompress (optional)
: the number of SVD components/POD modes to keep fordata
. Set toFalse
orNone
for no compression.
Example:
manager.add_data(
data="sst_data.npy", # path to sea-surface temperature data of shape (1400, 180, 360)
id="SST", # dataset unique identifier set as "SST"
random=3, # use three randomly placed sensors
compress=50, # use 50 POD modes
)
To add additional datasets (e.g. different fields), call add_data
for each additional dataset.
Train/Validation/Test Split#
The prepare
method generates a train, validation, and test dataset.
Example:
train_dataset, val_dataset, test_dataset= manager.prepare()
Useful/Common Attributes#
sensor_summary_df
: a dataframe on sensor information with sensors as rows and sensor features/descriptors as columns.sensor_measurements_df
: a dataframe of raw sensor measurments, with time as rows and sensors as columns.train_sensor_measurements
: an array of sensor measurements (T,s) used for trainingval_sensor_measurements
: an array of sensor measurements (T,s) used for validationtest_sensor_measurements
: an array of sensor measurements (T,s) used for testing