Transformer Guide
ForeBlocks ships a flexible encoder-decoder transformer stack centered on TransformerEncoder and TransformerDecoder.
The current implementation supports:
- multiple attention backends and per-layer attention routing
- encoder and decoder patching
- CT-PatchTST-style encoder tokenization
- paper-style Attention Residuals
- GateSkip
- Mixture-of-Depths (MoD)
- mHC residual stream mixing
- MoE feedforward blocks
- gradient checkpointing and shared-layer reuse
Related docs:
Import
from foreblocks import TransformerEncoder, TransformerDecoder

Mental model
The safest path is:
- start with a dense encoder-decoder transformer
- patch the encoder if the source sequence is long
- keep the decoder timestep-level
- verify full-sequence and autoregressive inference
- only then add MoE, GateSkip, MoD, or mHC
For most time-series setups, the best default is:
patch_encoder=True, patch_decoder=False, attention_mode="standard", norm_strategy="pre_norm", custom_norm="rms"
Baseline encoder
encoder = TransformerEncoder(
input_size=8,
d_model=256,
nhead=8,
num_layers=4,
dim_feedforward=1024,
dropout=0.1,
attention_mode="standard",
norm_strategy="pre_norm",
custom_norm="rms",
patch_encoder=True,
patch_len=16,
patch_stride=8,
)

Baseline decoder
decoder = TransformerDecoder(
input_size=1,
output_size=1,
d_model=256,
nhead=8,
num_layers=4,
patch_decoder=False,
informer_like=False,
)

Constructor groups
Backbone
d_model, nhead, num_layers, dim_feedforward, dropout, max_seq_len
Normalization and FFN
- norm_strategy: pre_norm, post_norm, or sandwich_norm
- custom_norm: rms, layer, and other norm-factory variants
- use_final_norm
- use_swiglu
Attention selection
att_type, attention_mode, freq_modes
Supported attention_mode values currently include:
standard, linear, sype, hybrid, kimi, hybrid_kimi, kimi_3to1, gated_delta, hybrid_gdn, gdn_3to1
Important behavior:
- if attention_mode="standard" but att_type is a routed type such as linear, sype, kimi, or gated_delta, the model promotes attention_mode automatically
Patching
patch_encoder, patch_decoder, patch_len, patch_stride, patch_pad_end
Efficiency
use_gradient_checkpointing, share_layers
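As a sketch, these groups can be combined in a single constructor call. The values below are illustrative only, and the att_type/attention_mode pairing shows the promotion rule described under Attention selection:

from foreblocks import TransformerEncoder

# Sketch only: combines the backbone, normalization/FFN, attention, and
# efficiency options listed above; values are illustrative.
encoder = TransformerEncoder(
    input_size=8,
    d_model=256,
    nhead=8,
    num_layers=4,
    dim_feedforward=1024,
    dropout=0.1,
    max_seq_len=2048,
    norm_strategy="pre_norm",
    custom_norm="rms",
    use_final_norm=True,
    use_swiglu=True,
    att_type="linear",            # routed type: attention_mode="standard" would be promoted
    attention_mode="linear",
    use_gradient_checkpointing=True,
    share_layers=False,
)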
Advanced modules
- Attention Residuals: use_attention_residual, attn_residual_type, attention_residual_block_size
- GateSkip: use_gateskip, gate_budget, gate_lambda
- MoD: use_mod, mod_mode, mod_lambda, mod_budget_scheduler
- mHC: use_mhc, mhc_n_streams, mhc_sinkhorn_iters, mhc_collapse
- MoE: use_moe, num_experts, top_k, moe_aux_lambda
Recommended patching strategy
The recommended pattern for forecasting is:
patch_encoder=True, patch_decoder=False
Why:
- the encoder benefits from shorter token sequences
- the decoder stays easier to reason about
- autoregressive decoding stays compatible with forward_one_step(...)
patch_decoder=True is supported for full-sequence decoding, but it is not compatible with KV-cached incremental decoding.
When the encoder is patched, the memory sequence length becomes the patch-token length, not the raw timestep length. The decoder validates that memory_key_padding_mask matches the actual memory length, so patched and unpatched masks cannot be mixed silently.
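As a rough sketch of the length arithmetic (assuming sliding-window patching; the exact tail handling depends on patch_pad_end):

import math

seq_len, patch_len, patch_stride = 100, 16, 8

# number of patch tokens without tail padding
n_patches = (seq_len - patch_len) // patch_stride + 1                   # 11

# with pad-to-end behavior, the tail is padded so the last timesteps are covered
n_patches_padded = math.ceil((seq_len - patch_len) / patch_stride) + 1  # 12

# any memory_key_padding_mask handed to the decoder must have this
# patch-token length, not the raw seq_len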
CT-PatchTST encoder mode
The encoder also supports a channel-token PatchTST-style path:
encoder = TransformerEncoder(
input_size=8,
ct_patchtst=True,
ct_patch_len=16,
ct_patch_stride=8,
ct_patch_pad_end=True,
ct_patch_fuse="linear", # or "mean"
d_model=256,
)

This path:
- patchifies across time per channel
- embeds each channel patch
- fuses channels into transformer tokens
Use it when timestep-level tokenization is too expensive for long multivariate histories.
Inference modes
Full-sequence encoder-decoder
This is the default path for standard sequence-to-sequence forecasting:
memory = encoder(src, src_key_padding_mask=src_kpm)
out = decoder(
tgt,
memory,
memory_key_padding_mask=src_kpm_or_patchified_memory_kpm,
)

Informer-like decoding
model_type="informer-like" changes defaults so that:
- encoder time encoding is enabled
- decoder informer-like behavior is enabled
- decoder prompt masking follows label_len
Typical setup:
decoder = TransformerDecoder(
input_size=1,
output_size=1,
model_type="informer-like",
label_len=12,
d_model=256,
nhead=8,
num_layers=4,
)

label_len controls how much of the decoder input is treated as observed prompt.
Important behavior:
- set label_len explicitly for true Informer-style masking
- when label_len <= 0, the implementation now skips the automatic Informer padding mask instead of masking the whole decoder input
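For context, a common Informer-style convention is to build the decoder input by concatenating the last label_len observed target steps with zero placeholders for the horizon. The sketch below illustrates that convention only; it reuses the decoder and memory from the snippets above and is not a required preprocessing step:

import torch

label_len, pred_len = 12, 24
batch = 32

history = torch.randn(batch, 48, 1)            # observed target history, (batch, time, output_size)

prompt = history[:, -label_len:, :]            # last label_len steps act as the observed prompt
placeholder = torch.zeros(batch, pred_len, 1)  # zero-filled horizon
dec_input = torch.cat([prompt, placeholder], dim=1)  # (batch, label_len + pred_len, 1)

out = decoder(dec_input, memory)               # decoder built with model_type="informer-like"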
Autoregressive decoding
forward_one_step(...) is intended for KV-cached autoregressive decoding.
step_out, state = decoder.forward_one_step(tgt_prefix, memory)
step_out, state = decoder.forward_one_step(
next_token,
memory,
incremental_state=state,
memory_key_padding_mask=memory_kpm,
)

Recommended usage:
- first call: pass the available prefix
- later calls: pass either the growing prefix or only the newest token
- once cache exists, the implementation consumes only the newest step
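A rough autoregressive loop following these rules might look like the sketch below. Shapes are illustrative, it assumes the unpatched encoder/decoder built earlier, and any higher-level helpers the library provides may be preferable:

import torch

horizon = 24
src = torch.randn(32, 96, 8)                 # (batch, src_len, input_size)
memory = encoder(src)

tgt_prefix = torch.zeros(32, 1, 1)           # seed step, e.g. the last observed target value
state = None
preds = []

for _ in range(horizon):
    step_out, state = decoder.forward_one_step(
        tgt_prefix, memory, incremental_state=state
    )
    next_token = step_out[:, -1:, :]         # newest predicted step
    preds.append(next_token)
    tgt_prefix = next_token                  # once the cache exists, only the newest step is needed

forecast = torch.cat(preds, dim=1)           # (batch, horizon, output_size)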
Current constraints:
- requires patch_decoder=False
- does not support use_mod=True
- does not support use_mhc=True
Active-position masks for time series
Both GateSkip and MoD operate over active positions. The public runtime input is:
- encoder: gateskip_active_mask
- decoder: gateskip_active_mask
For time series, the intended meaning is:
- True: this timestep or token participates in budgeting or routing
- False: inactive position such as padding or masked-out region
Default behavior:
- encoder: active positions are derived from src_key_padding_mask when available
- decoder: active positions are derived from the user-provided target padding mask
- the auto-generated Informer forecast mask is intentionally not treated as inactivity for GateSkip or MoD
With patching enabled, the active mask is patchified too, so routing stays aligned with patch tokens.
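A sketch of constructing such a mask, assuming gateskip_active_mask is accepted as a forward keyword argument with shape (batch, seq_len) and True marking participating positions:

import torch

batch, src_len = 32, 96
src = torch.randn(batch, src_len, 8)

# derive activity from padding: True = real timestep, False = padding
src_kpm = torch.zeros(batch, src_len, dtype=torch.bool)        # True would mean "padded"
active_all_valid = ~src_kpm

# or restrict budgeting/routing to the most recent window only
active_recent_only = torch.zeros(batch, src_len, dtype=torch.bool)
active_recent_only[:, -48:] = True

memory = encoder(src, src_key_padding_mask=src_kpm, gateskip_active_mask=active_recent_only)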
Attention Residuals
The transformer now implements paper-style Attention Residuals rather than the older local residual trick.
Controls:
- use_attention_residual
- attn_residual_type: full or block
- attention_residual_block_size
Behavior:
- full: aggregates over the running layer history
- block: aggregates over block summaries
Notes:
- this is enabled by default
- it replaces the normal residual path for the affected blocks
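A configuration sketch (the block size value is illustrative):

from foreblocks import TransformerEncoder

encoder = TransformerEncoder(
    input_size=8,
    d_model=256,
    nhead=8,
    num_layers=6,
    use_attention_residual=True,
    attn_residual_type="block",        # or "full"
    attention_residual_block_size=2,   # illustrative: aggregate block summaries every 2 layers
)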
Current compatibility rules:
- not compatible with use_gateskip=True
- not compatible with use_mhc=True
- not compatible with use_mod=True
If you want GateSkip, MoD, or mHC, disable Attention Residuals explicitly:
use_attention_residual=False

GateSkip
GateSkip applies residual gating at the sublayer level.
Controls:
use_gateskip, gate_budget, gate_lambda
For time series, GateSkip budgets over valid positions rather than LM-style EOS handling.
Recommendation:
- keep it off until the dense baseline is stable
- when using it, pass an explicit gateskip_active_mask if you want forecast-only gating rather than all valid positions
Current compatibility rules:
- not wired together with Attention Residuals
- not wired together with MoD
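A configuration sketch (the budget and lambda values are illustrative):

from foreblocks import TransformerEncoder

encoder = TransformerEncoder(
    input_size=8,
    d_model=256,
    nhead=8,
    num_layers=4,
    use_attention_residual=False,  # required: GateSkip is not wired with Attention Residuals
    use_gateskip=True,
    gate_budget=0.5,               # illustrative sublayer budget
    gate_lambda=0.01,              # illustrative regularization weight
)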
Mixture-of-Depths
The transformer supports paper-style MoD token routing.
Controls:
use_mod, mod_mode, mod_lambda, mod_budget_scheduler
Current behavior:
- only mod_mode="token" is supported
- routing is top-k over active positions
- packed routed tokens are processed and scattered back
For time series:
- routing is timestep or patch-token routing
- default active positions are all valid positions
- if you want forecast-only routing, provide an explicit gateskip_active_mask
Current compatibility rules:
- not compatible with Attention Residuals
- not compatible with GateSkip
- not compatible with mHC
- not supported in forward_one_step(...)
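A configuration sketch (the lambda value is illustrative):

from foreblocks import TransformerEncoder

encoder = TransformerEncoder(
    input_size=8,
    d_model=256,
    nhead=8,
    num_layers=6,
    use_attention_residual=False,  # MoD is not compatible with Attention Residuals
    use_mod=True,
    mod_mode="token",              # currently the only supported mode
    mod_lambda=0.01,               # illustrative routing-loss weight
)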
mHC residual streams
mHC adds multiple residual streams and dynamic hyper-connections between them.
Controls:
- use_mhc
- mhc_n_streams
- mhc_sinkhorn_iters
- mhc_collapse: first or mean
Current behavior:
- paper-style stream init is (x, 0, ..., 0)
- stream read/write and residual mixing are token-wise and input-dependent
- mhc_collapse="first" is the safest default
Current compatibility rules:
- not compatible with Attention Residuals
- not compatible with MoD
- not supported in decoder KV-cached autoregressive decoding
Use it as a research feature rather than a first production default.
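A configuration sketch (stream count and Sinkhorn iterations are illustrative):

from foreblocks import TransformerEncoder

encoder = TransformerEncoder(
    input_size=8,
    d_model=256,
    nhead=8,
    num_layers=4,
    use_attention_residual=False,  # mHC is not compatible with Attention Residuals
    use_mhc=True,
    mhc_n_streams=4,               # illustrative stream count
    mhc_sinkhorn_iters=3,          # illustrative
    mhc_collapse="first",          # safest default per the notes above
)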
MoE in transformer layers
MoE is enabled at the feedforward block level:
encoder = TransformerEncoder(
input_size=8,
d_model=256,
nhead=8,
num_layers=4,
use_moe=True,
num_experts=8,
top_k=2,
moe_aux_lambda=1.0,
)

See the dedicated guide for routing and auxiliary-loss details:
Integration with ForecastingModel
from foreblocks import ForecastingModel, TransformerEncoder, TransformerDecoder
encoder = TransformerEncoder(
input_size=8,
d_model=128,
nhead=4,
num_layers=3,
patch_encoder=True,
patch_len=16,
patch_stride=8,
)
decoder = TransformerDecoder(
input_size=1,
output_size=1,
d_model=128,
nhead=4,
num_layers=3,
patch_decoder=False,
informer_like=False,
)
model = ForecastingModel(
encoder=encoder,
decoder=decoder,
forecasting_strategy="transformer_seq2seq",
model_type="transformer",
target_len=24,
output_size=1,
)

Recommended tuning order
- Get a plain encoder-decoder transformer running with standard attention.
- Enable encoder patching if source sequence length is large.
- Verify full-sequence and autoregressive inference.
- Explore attention_mode variants.
- Add MoE only after the dense baseline is stable.
- Add GateSkip, MoD, or mHC only for targeted experiments.
Troubleshooting
- Sequence length exceeds max_seq_len: increase max_seq_len or enable patching.
- Decoder/memory mask mismatch: the encoder may be patched while the memory padding mask was not patchified consistently.
- patch_decoder=True with KV caching: unsupported; keep decoder patching off for autoregressive decoding.
- forward_one_step(...) errors with MoD or mHC: those features are intentionally disabled in the incremental path.
- Attention Residuals with GateSkip, MoD, or mHC: unsupported in the current implementation.
- Informer-like mode behaving like plain decoding: set label_len explicitly.
- OOM: reduce d_model, num_layers, dim_feedforward, or enable gradient checkpointing.