Skip to content

Evaluation & Metrics

foreblocks.evaluation.ModelEvaluator wraps a trained Trainer and provides batched inference, rolling cross-validation, loss curve plotting, and a human-readable training summary.

Setup

ModelEvaluator takes a Trainer instance directly — it reuses the model, device, and training history stored on it.

from foreblocks.training import Trainer
from foreblocks.evaluation import ModelEvaluator

trainer = Trainer(model, ...)
trainer.fit(train_loader, val_loader, epochs=50)

evaluator = ModelEvaluator(trainer)

Batched prediction

import torch

X_test = torch.randn(200, 96, 7)   # (samples, input_len, channels)
preds = evaluator.predict(X_test, batch_size=256, use_amp=True)
# preds: (200, horizon, channels)

use_amp=True enables automatic mixed precision on CUDA; it is silently ignored on CPU.

Point metrics

y_test = torch.randn(200, 24, 7)   # (samples, horizon, channels)
metrics = evaluator.compute_metrics(X_test, y_test)

# {'mse': ..., 'rmse': ..., 'mae': ..., 'mape': ...}
print(metrics)
Metric Formula
MSE mean((ŷ − y)²)
RMSE √MSE
MAE mean(
MAPE mean(

Rolling cross-validation

cross_validation slides a window of size horizon across the dataset and evaluates the already-trained model on each window.

cv = evaluator.cross_validation(
    X=X_test,
    y=y_test,
    n_windows=10,
    horizon=24,
    step_size=None,   # defaults to horizon (non-overlapping)
    batch_size=256,
)

print(cv['overall'])          # aggregate metrics
print(cv['window_metrics'])   # list of per-window dicts
preds_all = cv['predictions'] # concatenated predictions tensor

Return dict keys:

Key Type Description
overall dict Aggregate MAE / RMSE / MAPE / MSE over all windows
window_metrics list[dict] Per-window metrics including start_idx / end_idx
predictions Tensor Concatenated predictions across all windows
targets Tensor Concatenated targets across all windows
n_windows int Number of windows actually evaluated
total_points int Total sample count

Model is not retrained per fold

This is a walk-forward evaluation of a fixed model, not k-fold retraining. Use it to assess generalisation across temporal shifts, not for model selection.

Plots

All plotting methods require matplotlib:

pip install foreblocks[plotting]

Cross-validation results

fig = evaluator.plot_cv_results(cv, figsize=(15, 8))
fig.savefig("cv_results.png")

Produces a 2×2 grid: per-window MAE, RMSE, and MAPE curves with overall means, plus a text summary box.

Learning curves

fig = evaluator.plot_learning_curves(figsize=(15, 5))

Three subplots: train/val loss, learning rate schedule, and (if using distillation) task vs. distillation loss components.

Training summary

evaluator.print_summary()

Prints epoch count, final and best validation loss, and model size (parameter count and memory footprint in MB if the model exposes get_model_size()).