Transformer Guide¶
ForeBlocks ships a broad transformer stack centered on TransformerEncoder and TransformerDecoder.
The implementation supports:
- multiple self-attention kernels
- per-layer attention routing
- encoder/decoder patching
- CT-PatchTST-style encoder tokenization
- MoE feedforward layers
- dynamic layer skipping
- mHC residual stream mixing
- gradient checkpointing and shared-layer reuse
Import¶
from foreblocks import TransformerEncoder, TransformerDecoder
Mental model¶
The current stack has three levels of control:
- backbone dimensions and normalization
- tokenization and attention-routing behavior
- optional advanced modules such as MoE, mHC, and layer skipping
The safest baseline is:
- patch the encoder
- keep the decoder timestep-level
- use pre_norm
- start with standard attention
- leave MoE, mHC, and dynamic skipping off until the basic path works
Baseline encoder¶
encoder = TransformerEncoder(
    input_size=8,
    d_model=256,
    nhead=8,
    num_layers=4,
    dim_feedforward=1024,
    dropout=0.1,
    att_type="standard",
    attention_mode="standard",
    norm_strategy="pre_norm",
    custom_norm="rms",
    patch_encoder=True,
    patch_len=16,
    patch_stride=8,
)
Baseline decoder¶
decoder = TransformerDecoder(
    input_size=1,
    output_size=1,
    d_model=256,
    nhead=8,
    num_layers=4,
    label_len=12,
    informer_like=False,
    patch_decoder=False,
)
Core constructor groups¶
Dimensions and depth¶
- d_model
- nhead
- num_layers
- dim_feedforward
- dropout
- max_seq_len
Normalization and residual behavior¶
- norm_strategy: pre_norm, post_norm, or sandwich_norm
- custom_norm: rms, layer, and other norm-factory variants
- use_final_norm
- use_swiglu
Attention selection¶
- att_type: the base attention family for standard layers
- attention_mode: how attention types are assigned across layers
Supported routed modes in the current implementation include:
- standard
- linear
- sype
- hybrid
- kimi
- hybrid_kimi
- kimi_3to1
- gated_delta
- hybrid_gdn
- gdn_3to1
Important behavior:
- if attention_mode="standard" but att_type is a routed type such as linear, sype, kimi, or gated_delta, the model promotes attention_mode automatically so the requested path is actually used
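The promotion rule can be pictured with a small stand-alone sketch. The helper name resolve_attention_mode, the ROUTED_TYPES set, and the choice to promote to the mode with the same name as att_type are all assumptions made for illustration; the library's internal logic may differ.

```python
# Hypothetical illustration of the promotion rule; not ForeBlocks code.
# The set below only lists the routed types named in this guide.
ROUTED_TYPES = {"linear", "sype", "kimi", "gated_delta"}


def resolve_attention_mode(att_type, attention_mode="standard"):
    """Return the attention_mode assumed to take effect after promotion."""
    if attention_mode == "standard" and att_type in ROUTED_TYPES:
        # Assumption: the mode is promoted to the routed mode of the same name.
        return att_type
    # An explicit non-standard mode, or a standard att_type, is left untouched.
    return attention_mode


print(resolve_attention_mode("kimi"))              # promoted
print(resolve_attention_mode("standard"))          # unchanged
print(resolve_attention_mode("linear", "hybrid"))  # explicit mode wins
```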
Patching and tokenization¶
- patch_encoder
- patch_decoder
- patch_len
- patch_stride
- patch_pad_end
Encoder patching is the recommended default. The encoder returns patch-memory tokens without unpatching, and the decoder cross-attends to that patch memory.
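To get a feel for how much patching shortens the memory sequence, the sketch below counts sliding-window patches. The function name num_patch_tokens is hypothetical, and the end-padding convention (pad the tail so a trailing partial window still becomes a patch) is an assumption about what patch_pad_end does.

```python
import math


def num_patch_tokens(seq_len, patch_len, patch_stride, pad_end=True):
    """Count sliding-window patches over a sequence of seq_len timesteps."""
    if seq_len < patch_len:
        # Assumption: a short sequence yields one padded patch, or none.
        return 1 if pad_end else 0
    span = seq_len - patch_len
    if pad_end:
        # Assumed convention: pad the tail so the last partial window counts.
        return math.ceil(span / patch_stride) + 1
    # Without padding, a trailing partial window is dropped.
    return span // patch_stride + 1


print(num_patch_tokens(96, 16, 8))                  # 11 tokens instead of 96
print(num_patch_tokens(100, 16, 8, pad_end=False))  # 11
print(num_patch_tokens(100, 16, 8, pad_end=True))   # 12
```

With patch_len=16 and patch_stride=8, the decoder cross-attends to roughly one eighth as many memory tokens as timesteps.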
Efficiency and parameter sharing¶
- use_gradient_checkpointing
- share_layers
Advanced modules¶
- use_moe, num_experts, top_k, moe_aux_lambda
- use_layer_skipping, layer_skip_mode, layer_skip_temperature, layer_skip_lambda
- use_mhc, mhc_n_streams, mhc_sinkhorn_iters, mhc_temperature, mhc_collapse
Attention routing patterns¶
attention_mode controls how attention kernels are assigned across the layer stack.
Common choices:
- standard: all layers use standard attention
- linear: all layers use linear attention
- hybrid: early layers use linear attention, final layer uses standard attention
- hybrid_kimi: early layers use Kimi attention, final layer uses standard attention
- kimi_3to1: three Kimi layers followed by one standard layer in repeating groups
- hybrid_gdn: early layers use Gated DeltaNet, final layer uses standard attention
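The patterns listed above can be written out as a per-layer assignment. This is a minimal sketch of the described behavior only: layer_attention_types is a hypothetical helper, and the per-layer labels (for example gated_delta for Gated DeltaNet layers) are assumed names, not values read from the library.

```python
def layer_attention_types(attention_mode, num_layers):
    """Return one attention label per layer for the modes described above."""
    if attention_mode in ("standard", "linear"):
        return [attention_mode] * num_layers

    # Hybrid modes: a cheaper kernel in early layers, standard attention last.
    early_kernel = {
        "hybrid": "linear",
        "hybrid_kimi": "kimi",
        "hybrid_gdn": "gated_delta",
    }
    if attention_mode in early_kernel:
        return [early_kernel[attention_mode]] * (num_layers - 1) + ["standard"]

    if attention_mode == "kimi_3to1":
        # Every fourth layer is standard; the three before it are Kimi.
        return [
            "standard" if (i + 1) % 4 == 0 else "kimi"
            for i in range(num_layers)
        ]

    raise ValueError(f"mode not covered by this sketch: {attention_mode}")


print(layer_attention_types("hybrid", 4))
print(layer_attention_types("kimi_3to1", 8))
```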
Use these routed modes when:
- sequence length is large
- you want cheaper early layers with a stronger final layer
- you are experimenting with linear or state-space-like attention variants
Patching strategy¶
The current implementation is explicit about patching behavior.
Recommended pattern¶
- patch_encoder=True
- patch_decoder=False
Why:
- the encoder benefits from shorter token sequences
- the decoder remains easier to reason about
- autoregressive decoding remains compatible with forward_one_step(...)
Decoder patching caveat¶
patch_decoder=True is supported for non-incremental decoding, but it is not compatible with KV-cached incremental decoding.
Memory mask alignment¶
When the encoder is patched, the memory sequence length changes from timestep length to patch-token length. The implementation validates that memory_key_padding_mask matches the actual memory length, so patched and unpatched masks cannot be mixed accidentally.
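One way to keep a padding mask aligned with patched memory is to patchify the mask with the same window settings as the encoder. The sketch below is illustrative: both helper names are hypothetical, it assumes no end padding, and it assumes a patch counts as padded only when every timestep inside it is padded.

```python
def patchify_padding_mask(mask, patch_len, patch_stride):
    """Convert a per-timestep mask (True = padded) into a per-patch mask."""
    patch_mask = []
    start = 0
    while start + patch_len <= len(mask):
        window = mask[start:start + patch_len]
        # Assumption: a patch is masked only if all of its timesteps are padded.
        patch_mask.append(all(window))
        start += patch_stride
    return patch_mask


def check_memory_mask(memory_len, memory_key_padding_mask):
    """Mimic the length validation described above."""
    if len(memory_key_padding_mask) != memory_len:
        raise ValueError(
            f"mask length {len(memory_key_padding_mask)} != memory length {memory_len}"
        )


timestep_mask = [False] * 16 + [True] * 16   # 32 timesteps, last 16 padded
patch_mask = patchify_padding_mask(timestep_mask, patch_len=16, patch_stride=8)
print(patch_mask)                            # [False, False, True]

check_memory_mask(len(patch_mask), patch_mask)   # aligned, passes
try:
    check_memory_mask(len(patch_mask), timestep_mask)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True                 # unpatched mask is rejected
```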
CT-PatchTST encoder mode¶
The encoder also supports a channel-token PatchTST-style path:
encoder = TransformerEncoder(
    input_size=8,
    ct_patchtst=True,
    ct_patch_len=16,
    ct_patch_stride=8,
    ct_patch_pad_end=True,
    ct_patch_fuse="linear",  # or "mean"
    d_model=256,
)
This path:
- patchifies across time per channel
- embeds each channel-patch
- fuses channels into transformer tokens
Use it when long input sequences make timestep-level tokenization too expensive.
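The three steps can be traced on plain lists to see the resulting token count. This is a shape-level sketch under stated assumptions: ct_patch_tokens is a hypothetical helper, the learned per-patch embedding is replaced by the identity, fusion uses a channel mean (the "mean" option), and no end padding is applied.

```python
def ct_patch_tokens(series, patch_len, patch_stride):
    """series: list of T rows, each holding one value per channel."""
    num_steps = len(series)
    num_channels = len(series[0])
    tokens = []
    for start in range(0, num_steps - patch_len + 1, patch_stride):
        # Step 1: patchify across time, separately for each channel.
        channel_patches = [
            [series[t][c] for t in range(start, start + patch_len)]
            for c in range(num_channels)
        ]
        # Step 2: embedding would go here; the identity stands in for it.
        # Step 3: fuse channels into a single token by averaging.
        fused = [
            sum(patch[i] for patch in channel_patches) / num_channels
            for i in range(patch_len)
        ]
        tokens.append(fused)
    return tokens


# 32 timesteps, 2 channels: channel 0 is t, channel 1 is 2t.
series = [[float(t), float(2 * t)] for t in range(32)]
tokens = ct_patch_tokens(series, patch_len=16, patch_stride=8)
print(len(tokens), len(tokens[0]))   # 3 tokens instead of 32 timesteps
```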
Informer-like mode¶
model_type="informer-like" changes defaults in the current implementation:
- encoder time encoding is enabled
- decoder informer-like behavior is enabled
- decoder prompt masking behavior follows label_len
Typical decoder setup:
decoder = TransformerDecoder(
    input_size=1,
    output_size=1,
    model_type="informer-like",
    label_len=12,
    d_model=256,
    nhead=8,
    num_layers=4,
)
Decoder behavior and constraints¶
Prompting¶
The decoder consumes:
- tgt: decoder prompt sequence
- memory: encoder output sequence
label_len controls how much prompt is treated as observed context in informer-like decoding.
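For intuition, an Informer-style prompt is usually the last label_len observed values followed by placeholders for the forecast horizon. The sketch below follows that general convention; build_decoder_prompt is a hypothetical helper, and whether ForeBlocks expects the caller to assemble tgt this way is an assumption.

```python
def build_decoder_prompt(history, label_len, target_len, placeholder=0.0):
    """Concatenate observed context with placeholders for the horizon."""
    if label_len > len(history):
        raise ValueError("label_len exceeds the available history")
    # Guard label_len=0: history[-0:] would return the whole list.
    context = list(history[-label_len:]) if label_len > 0 else []
    return context + [placeholder] * target_len


history = [float(t) for t in range(48)]
tgt = build_decoder_prompt(history, label_len=12, target_len=24)
print(len(tgt))    # 36 = 12 observed + 24 placeholders
print(tgt[:12])    # the last 12 observed values
```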
Incremental decoding¶
forward_one_step(...) is intended for autoregressive decoding with KV caching.
Current constraints:
- requires patch_decoder=False
- does not support dynamic layer skipping
- mHC is not supported with incremental decoder state
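These constraints can be checked up front, before a decoding loop is written around forward_one_step(...). The helper below is a hypothetical pre-flight check that restates the three constraints above; it does not call into the library.

```python
def incremental_decoding_issues(patch_decoder=False,
                                use_layer_skipping=False,
                                use_mhc=False):
    """Return the reasons a configuration cannot use incremental decoding."""
    issues = []
    if patch_decoder:
        issues.append("patch_decoder=True is incompatible with KV-cached decoding")
    if use_layer_skipping:
        issues.append("dynamic layer skipping is not supported incrementally")
    if use_mhc:
        issues.append("mHC is not supported with incremental decoder state")
    return issues


baseline_issues = incremental_decoding_issues()
research_issues = incremental_decoding_issues(patch_decoder=True, use_mhc=True)
print(baseline_issues)    # empty list: the baseline decoder qualifies
for issue in research_issues:
    print("-", issue)
```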
MTP targets¶
The decoder supports optional multi-token prediction targets for MoE FFNs in decoder layers. That is an advanced path and should only be enabled when you intentionally want auxiliary decoder-horizon supervision inside the FFN block.
Dynamic layer skipping¶
The transformer base supports MoD-style layer skipping.
Key controls:
- use_layer_skipping
- layer_skip_mode: seq or token
- layer_skip_temperature
- layer_skip_hard
- layer_skip_lambda
Current behavior:
- seq mode can skip whole layers and save compute
- token mode is behavioral mixing, not a true compute-saving token-pruning path
Recommendation:
- leave this off until you have a stable baseline
- prefer seq mode first if you want actual compute savings
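The difference between the two modes comes down to whether the layer still runs. The toy sketch below makes that visible with a call counter; the function names, the fixed 0.5 threshold, and the hand-set gate values are assumptions for illustration, whereas in practice the gates are learned.

```python
class CountingLayer:
    """Toy layer that doubles its input and counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return [2.0 * v for v in x]


def run_layer_seq_mode(x, layer, gate, threshold=0.5):
    # Whole-sequence decision: a low gate bypasses the layer entirely.
    if gate < threshold:
        return x
    return layer(x)


def run_layer_token_mode(x, layer, gates):
    # The layer always runs; gates only blend its output with the input.
    y = layer(x)
    return [g * yi + (1.0 - g) * xi for g, xi, yi in zip(gates, x, y)]


x = [1.0, 2.0, 3.0]

seq_layer = CountingLayer()
seq_out = run_layer_seq_mode(x, seq_layer, gate=0.2)
print(seq_out, seq_layer.calls)      # input passes through, layer never ran

token_layer = CountingLayer()
token_out = run_layer_token_mode(x, token_layer, gates=[1.0, 0.0, 0.5])
print(token_out, token_layer.calls)  # blended output, layer ran once
```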
mHC residual streams¶
mHC adds multiple residual streams internally and mixes them with a Sinkhorn-constrained residual mixer.
Key controls:
- use_mhc
- mhc_n_streams
- mhc_sinkhorn_iters
- mhc_temperature
- mhc_collapse: first or mean
Use it for research exploration, not as a first-line production default. It changes the residual dynamics substantially and has more runtime constraints than the plain transformer path.
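As background, Sinkhorn normalization turns a matrix of logits into an approximately doubly stochastic mixing matrix by alternately normalizing rows and columns. The sketch below shows that generic procedure on plain lists; the mapping of iters and temperature to mhc_sinkhorn_iters and mhc_temperature, and the two collapse modes as written here, are assumptions rather than the library's code.

```python
import math


def sinkhorn(logits, iters=20, temperature=1.0):
    """Alternate row and column normalization of exp(logits / temperature)."""
    n = len(logits)
    mat = [[math.exp(v / temperature) for v in row] for row in logits]
    for _ in range(iters):
        for i in range(n):
            row_sum = sum(mat[i])
            mat[i] = [v / row_sum for v in mat[i]]
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


def mix_streams(mixer, streams):
    """Each output stream is a weighted combination of the input streams."""
    width = len(streams[0])
    return [
        [sum(w * s[k] for w, s in zip(row, streams)) for k in range(width)]
        for row in mixer
    ]


def collapse(streams, mode="mean"):
    """Reduce the streams back to one: take the first or average them."""
    if mode == "first":
        return list(streams[0])
    if mode == "mean":
        return [sum(s[k] for s in streams) / len(streams)
                for k in range(len(streams[0]))]
    raise ValueError(f"unknown collapse mode: {mode}")


mixer = sinkhorn([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]], iters=50)
streams = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
mixed = mix_streams(mixer, streams)
print([round(sum(row), 6) for row in mixer])   # row sums close to 1
print(collapse(mixed, "mean"))
```

Because the columns of the mixer sum to one, the mean over mixed streams matches the mean over the original streams, which is one reason a doubly stochastic constraint keeps the residual signal from drifting in scale.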
MoE in transformer layers¶
MoE is enabled at the feedforward block level through the transformer constructors:
encoder = TransformerEncoder(
    input_size=8,
    d_model=256,
    nhead=8,
    num_layers=4,
    use_moe=True,
    num_experts=8,
    top_k=2,
    moe_aux_lambda=1.0,
)
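For readers new to MoE, num_experts and top_k refer to a router that scores every expert per token and keeps only the best few. The sketch below shows generic top-k routing for a single token; top_k_route is a hypothetical helper, and renormalizing the kept weights so they sum to one is a common convention assumed here, not a statement about the ForeBlocks router.

```python
import math


def top_k_route(router_logits, top_k=2):
    """Return {expert_index: weight} for the top_k highest-scoring experts."""
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Assumed convention: renormalize the kept weights to sum to one.
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}


# One token, eight experts, two of them active.
weights = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], top_k=2)
print(sorted(weights))   # experts 1 and 4 are selected
```

Only the selected experts run for that token, so compute per token scales with top_k rather than num_experts.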
See the dedicated MoE guide for routing and auxiliary-loss details.
Integration with ForecastingModel¶
from foreblocks import ForecastingModel, TransformerEncoder, TransformerDecoder

encoder = TransformerEncoder(
    input_size=8,
    d_model=128,
    nhead=4,
    num_layers=3,
    patch_encoder=True,
    patch_len=16,
    patch_stride=8,
)
decoder = TransformerDecoder(
    input_size=1,
    output_size=1,
    d_model=128,
    nhead=4,
    num_layers=3,
    patch_decoder=False,
)
model = ForecastingModel(
    encoder=encoder,
    decoder=decoder,
    forecasting_strategy="transformer_seq2seq",
    model_type="transformer",
    target_len=24,
    output_size=1,
)
Recommended tuning order¶
- Get a plain encoder-decoder transformer running with standard attention.
- Enable encoder patching if sequence length is large.
- Explore attention_mode variants.
- Add MoE only after the dense baseline is stable.
- Add layer skipping or mHC only for targeted experiments.
Troubleshooting¶
- Sequence length exceeds max_seq_len: increase max_seq_len or enable patching.
- Decoder/memory mask mismatch: this often means the encoder is patched but the memory padding mask was not patchified consistently.
- patch_decoder=True with KV caching: unsupported; keep decoder patching off for autoregressive decoding.
- forward_one_step(...) errors with layer skipping or mHC: those features are intentionally disabled in the incremental path.
- OOM: reduce d_model, num_layers, dim_feedforward, or enable gradient checkpointing.
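When deciding which knob to turn for an OOM, a rough weight count shows how strongly each dimension matters. The estimate below is a back-of-the-envelope sketch for a dense layer with standard attention (four d_model x d_model projections) and a two-matrix feedforward block; it ignores biases, norms, embeddings, SwiGLU, and MoE, and the function names are hypothetical.

```python
def approx_layer_params(d_model, dim_feedforward):
    """Approximate weights in one dense transformer layer."""
    attention = 4 * d_model * d_model          # Q, K, V, and output projections
    feedforward = 2 * d_model * dim_feedforward  # up and down projections
    return attention + feedforward


def approx_stack_params(d_model, dim_feedforward, num_layers):
    """Approximate weights across the whole layer stack."""
    return num_layers * approx_layer_params(d_model, dim_feedforward)


baseline = approx_stack_params(d_model=256, dim_feedforward=1024, num_layers=4)
halved = approx_stack_params(d_model=128, dim_feedforward=512, num_layers=4)
print(baseline)            # 3145728
print(halved)              # 786432
print(baseline // halved)  # 4
```

Halving the width cuts the weight count to a quarter, while halving num_layers only halves it. This covers weights only; activation memory also grows with the number of tokens, which is what patching and gradient checkpointing reduce.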