Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering

Read the Full Academic Paper

For the complete mathematical proofs, systems architecture, and detailed experimental evaluation, download our formal publication.

Research Overview

Deploying deep neural networks on edge platforms relies heavily on static INT8 quantization to reduce memory usage and inference latency. The standard ONNX Runtime (ORT) ecosystem provides post-training quantization utilities that insert QuantizeLinear (Q) and DequantizeLinear (DQ) nodes into the computation graph. However, because these tools evaluate each node in isolation, they place the DequantizeLinear node directly between a Convolution and its subsequent non-linear activation (e.g., ReLU), inadvertently breaking hardware-level kernel fusion.

This paper presents Kenosis, a proprietary Rust graph-optimization engine that implements Fusion-Aware QDQ Placement. By proving and leveraging the positive-homogeneous properties of standard activation layers, Kenosis commutes dequantization nodes past activation boundaries. This preserves the contiguous Conv-Activation pattern, allowing backend execution engines to trigger fused kernels natively on stock runtime engines. Our benchmarks across three classifier architectures and one object detector demonstrate latency speedups of up to 2.42× over FP32 baselines, with high output fidelity validated on a 1,000-image validation set for each architecture.

The Bottleneck: Broken Kernel Fusion

Hardware execution providers maximize throughput by minimizing memory traffic. Kernel fusion collapses multiple sequential operations (such as Convolution and ReLU) into a single kernel, avoiding intermediate writes to main memory. In a standard FP32 vision model, a Convolution and its activation form a contiguous pattern:

FP32 baseline: Conv → ReLU (fused by the execution provider into a single memory cycle)

When the standard ORT Python quantizer converts this to INT8, it evaluates the Convolution in isolation, wrapping it in QDQ nodes and severing the contiguous path:

ORT Python output: Quantize → Conv → Dequantize → ReLU (Dequantize severs Conv-ReLU contiguity, breaking fusion)

Because the Dequantize node is placed before ReLU, kernel fusion breaks. The engine is forced to write intermediate results to memory and reload them in FP32 to apply the activation, thrashing the cache and neutralizing the acceleration benefits of 8-bit integer math.

The Commutativity Solution

To restore kernel fusion, the Dequantize node must be shifted *after* the activation. Kenosis accomplishes this via graph rewriting. We prove that this reordering is numerically exact by exploiting the mathematical properties of positively homogeneous activations. The general dequantization formula, DQ(x) = (x - z) * scale, reduces to multiplication by a positive scalar when the zero-point is $z=0$:

DQ(x) = x * scale   (where scale > 0)

An activation $f$ is positively homogeneous if $f(\alpha x) = \alpha f(x)$ for all $\alpha > 0$. For any such activation:

f(DQ(x)) = f(x * scale) = f(x) * scale = DQ(f(x))

This commutativity guarantees that the dequantization node can be moved past the activation without altering downstream values. It holds algebraically for **ReLU** and **LeakyReLU**. **Clip(0, M)** is not positively homogeneous for a fixed bound $M$, so the swap is exact only when the bound is rescaled into the quantized domain: Clip(x * scale, 0, M) = Clip(x, 0, M / scale) * scale. For non-homogeneous activations like **Sigmoid** or **Tanh**, Kenosis places the QDQ wrapper on the activation output, preserving graph validity while maximizing other downstream fusion blocks.
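The identity above can be checked numerically. The following is a minimal sketch in plain Python, not Kenosis code: the scale `0.05`, the LeakyReLU slope `0.1`, and the Clip bound `6.0` are illustrative assumptions, and the helper names are hypothetical. It compares f(DQ(q)) with DQ(f(q)) for every INT8 value q. Clip with a fixed bound fails the check unless the bound on the integer side is divided by the scale.

```python
import math

def dequantize(q, scale):
    """DequantizeLinear with zero-point 0: multiply by a positive scale."""
    return q * scale

def relu(x):
    return max(x, 0.0)

def leaky_relu(x, slope=0.1):  # slope is an assumed example value
    return x if x >= 0 else slope * x

def clip(x, lo, hi):
    return min(max(x, lo), hi)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def commutes(f_after_dq, f_before_dq, scale, values):
    """True if f_after_dq(DQ(q)) matches DQ(f_before_dq(q)) for every q."""
    return all(
        math.isclose(
            f_after_dq(dequantize(q, scale)),
            dequantize(f_before_dq(q), scale),
            rel_tol=1e-9,
            abs_tol=1e-12,
        )
        for q in values
    )

SCALE = 0.05                    # assumed scale; any value > 0 works
INT8_VALUES = range(-128, 128)  # every value an INT8 tensor can hold
M = 6.0                         # assumed Clip upper bound

def clip_fixed(x):
    return clip(x, 0.0, M)

def clip_rescaled(q):
    # Bound expressed in the quantized domain: M / scale.
    return clip(q, 0.0, M / SCALE)

print("ReLU:", commutes(relu, relu, SCALE, INT8_VALUES))
print("LeakyReLU:", commutes(leaky_relu, leaky_relu, SCALE, INT8_VALUES))
print("Clip, fixed bound:", commutes(clip_fixed, clip_fixed, SCALE, INT8_VALUES))
print("Clip, rescaled bound:", commutes(clip_fixed, clip_rescaled, SCALE, INT8_VALUES))
print("Sigmoid:", commutes(sigmoid, sigmoid, SCALE, INT8_VALUES))
```

The ReLU, LeakyReLU, and rescaled-Clip checks pass, while the fixed-bound Clip and Sigmoid checks fail, which matches the homogeneity argument.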

Kenosis output: Quantize → Conv → ReLU → Dequantize (Conv-ReLU contiguity preserved, QLinearConv fusion achieved)
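The reordering can be pictured as a pattern rewrite over the operator sequence. The sketch below is a deliberately simplified illustration, not the Kenosis implementation: it models the graph as a flat list of ONNX op-type names, mirroring the simplified diagrams above, whereas a real QDQ graph also carries dequantize nodes on weights and inputs. The function name `defer_dequantize` and the set `HOMOGENEOUS_ACTIVATIONS` are hypothetical.

```python
# Activations assumed safe to swap with a zero-point-0 DequantizeLinear.
HOMOGENEOUS_ACTIVATIONS = {"Relu", "LeakyRelu"}

def defer_dequantize(ops):
    """Move DequantizeLinear past a homogeneous activation that follows a Conv.

    `ops` is a linear chain of ONNX op-type names. A new list is returned;
    the input list is left unchanged.
    """
    out = list(ops)
    for i in range(len(out) - 2):
        if (
            out[i] == "Conv"
            and out[i + 1] == "DequantizeLinear"
            and out[i + 2] in HOMOGENEOUS_ACTIVATIONS
        ):
            # Swap so the activation sits directly after the Conv.
            out[i + 1], out[i + 2] = out[i + 2], out[i + 1]
    return out

naive = ["QuantizeLinear", "Conv", "DequantizeLinear", "Relu"]
print(defer_dequantize(naive))

# A non-homogeneous activation is left where it is.
sigmoid_chain = ["QuantizeLinear", "Conv", "DequantizeLinear", "Sigmoid"]
print(defer_dequantize(sigmoid_chain))
```

In this toy model the first chain becomes Quantize, Conv, Relu, Dequantize, while the Sigmoid chain is returned unchanged.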

The Kenosis Optimization Pipeline

Kenosis implements a native Rust-based, topologically aware compiler pipeline consisting of seven optimization stages:

  1. Self-calibration: Generates deterministic synthetic calibration data and collects activation ranges via stock ORT. Fully supports multi-input models and NLP attention masks.
  2. Weight quantization: Symmetric INT8 per-tensor or per-channel weight quantization computed in f64.
  3. INT32 bias quantization: Automatically aligns bias scales to `activation_scale × weight_scale` with $z=0$.
  4. Nudged activation quantization: UINT8 asymmetric quantization with zero-point adjustment to ensure exact float 0.0 representation.
  5. Fusion-aware QDQ placement: Detects eligible Conv-Activation pairs and defers output dequantization past the activation.
  6. Non-vision tensor protection: Graph flood-fill traces secondary metadata inputs (e.g., scale factors) and excludes them from quantization.
  7. Model output protection: Preserves FP32 precision on direct graph outputs to protect classifier probabilities and detection coordinates.
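The arithmetic behind stages 2 to 4 can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the Kenosis source: the function names are hypothetical, weights are clamped to [-127, 127] (one common symmetric convention; the original does not state the range), only the per-tensor case is shown, and Python's `round` (round-half-to-even) stands in for the rounding mode. Python floats are 64-bit, in line with the f64 computation mentioned in stage 2.

```python
def symmetric_int8_scale(weights):
    """Per-tensor symmetric scale: the largest |w| maps to 127."""
    max_abs = max(abs(w) for w in weights)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize_weights(weights, scale):
    """Symmetric INT8 quantization with zero-point 0 (assumed clamp to +/-127)."""
    return [max(-127, min(127, round(w / scale))) for w in weights]

def quantize_bias(bias, activation_scale, weight_scale):
    """INT32 bias with scale = activation_scale * weight_scale, zero-point 0."""
    bias_scale = activation_scale * weight_scale
    return [round(b / bias_scale) for b in bias], bias_scale

def nudged_uint8_params(rmin, rmax):
    """Asymmetric UINT8 parameters whose zero-point represents 0.0 exactly."""
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)  # the range must contain 0.0
    scale = (rmax - rmin) / 255.0
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = max(0, min(255, round(-rmin / scale)))
    return scale, zero_point

def quantize_u8(x, scale, zero_point):
    return max(0, min(255, round(x / scale) + zero_point))

def dequantize_u8(q, scale, zero_point):
    return (q - zero_point) * scale

weights = [0.5, -1.27, 0.0]
w_scale = symmetric_int8_scale(weights)
print(quantize_weights(weights, w_scale))

a_scale, a_zp = nudged_uint8_params(-1.0, 3.0)
# Float 0.0 survives a quantize/dequantize round trip exactly.
print(dequantize_u8(quantize_u8(0.0, a_scale, a_zp), a_scale, a_zp))
```

Because the zero-point is an integer, the value 0.0 maps to it and back without error, which is the property stage 4 is described as guaranteeing.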

Benchmark Results

We evaluate Kenosis-optimized models against FP32 baselines on the stock ONNX Runtime 1.24 CPU execution provider (Intel i5-13420H, single-threaded execution). Classifiers are evaluated on a 1,000-image ImageNet-1K validation set; the object detector is evaluated on 1,000 images from MS COCO val2017.

| Architecture | FP32 Latency | Kenosis INT8 Latency | Speedup | Top-1 Agreement | Size Reduction |
|---|---|---|---|---|---|
| ResNet50 v2 | 67.73 ms | 28.04 ms | 2.42× | 94.8% | 97.7 MB → 30.6 MB (3.2×) |
| MobileNetV2 | 6.96 ms | 5.22 ms | 1.33× | 89.9% | 13.3 MB → 7.1 MB (1.9×) |
| EfficientNet-Lite4 | 27.41 ms | 19.39 ms | 1.41× | 83.1% | 49.5 MB → 16.5 MB (3.0×) |
| PP-YOLOE+ Small | 43.61 ms | 24.03 ms | 1.81× | 98.7%* | 30.4 MB → 7.8 MB (3.9×) |

*Note: PP-YOLOE+ Small reports bounding-box prediction agreement instead of top-1 agreement. Latency numbers correspond to per-tensor weight quantization.
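The speedup and size-reduction columns follow directly from the reported latencies and model sizes. As a quick sanity check, the snippet below recomputes them from the figures in the table; the `RESULTS` dictionary and helper name are illustrative only.

```python
# (FP32 ms, INT8 ms, FP32 MB, INT8 MB), copied from the table above.
RESULTS = {
    "ResNet50 v2": (67.73, 28.04, 97.7, 30.6),
    "MobileNetV2": (6.96, 5.22, 13.3, 7.1),
    "EfficientNet-Lite4": (27.41, 19.39, 49.5, 16.5),
    "PP-YOLOE+ Small": (43.61, 24.03, 30.4, 7.8),
}

def ratio(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after

for name, (fp32_ms, int8_ms, fp32_mb, int8_mb) in RESULTS.items():
    print(
        f"{name}: {ratio(fp32_ms, int8_ms):.2f}x faster, "
        f"{ratio(fp32_mb, int8_mb):.1f}x smaller"
    )
```

A speedup of 1.33× corresponds to a 25% reduction in latency, since 1 - 1/1.33 ≈ 0.25.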

Controlled Ablation: Isolating the Placement Contribution

A direct comparison with the standard ORT quantizer is uncontrolled, because the two pipelines differ along several dimensions (such as calibration data strategy, scale selection heuristics, bias alignment, and classifier-head protection). To isolate the effect of placement, we run a controlled ablation in which all calibration and quantization parameters are held constant and only the topological position of the QDQ nodes differs.

In this setup, naive QDQ placement both breaks kernel fusion and wastes 8-bit precision on negative pre-activation values that the activation subsequently discards. On EfficientNet-Lite4, naive placement drops predictive agreement to 62.8% (per-tensor), while fusion-aware placement achieves 83.1% with the same quantized weights. On MobileNetV2, naive placement produces a model that is 12% slower than FP32 (7.77 ms vs. 6.96 ms), while fusion-aware placement cuts latency by 25% (5.22 ms vs. 6.96 ms, a 1.33× speedup). Placement alone therefore turns a regression into an acceleration.
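The precision argument can be made concrete with a small calculation. In the sketch below, the activation ranges are hypothetical and chosen only for illustration, and the agreement metric is an assumed definition (the fraction of inputs where the INT8 and FP32 models pick the same top class); the original does not spell out how agreement is computed.

```python
import math

def uint8_step(rmin, rmax):
    """Width of one quantization step when [rmin, rmax] maps onto 256 levels."""
    return (rmax - rmin) / 255.0

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def top1_agreement(reference_outputs, test_outputs):
    """Assumed metric: share of samples whose top class matches the reference."""
    matches = sum(
        argmax(ref) == argmax(out)
        for ref, out in zip(reference_outputs, test_outputs)
    )
    return matches / len(reference_outputs)

# Hypothetical ranges: a symmetric pre-activation range vs. its post-ReLU range.
pre_activation_step = uint8_step(-4.0, 4.0)   # DQ placed before ReLU
post_activation_step = uint8_step(0.0, 4.0)   # DQ deferred past ReLU
print(pre_activation_step / post_activation_step)

# Toy outputs for two samples: the models agree on the first one only.
fp32_outputs = [[0.1, 0.9], [0.8, 0.2]]
int8_outputs = [[0.2, 0.7], [0.3, 0.6]]
print(top1_agreement(fp32_outputs, int8_outputs))
```

With a symmetric pre-activation range, half of the 256 levels encode values that ReLU zeroes out, so the step over the surviving range is twice as coarse as it would be when quantizing the post-activation range.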

Conclusion

By shifting from localized, node-by-node QDQ injection to topologically aware graph reordering, Kenosis achieves native kernel fusion on stock execution engines without requiring custom runtime modifications. This unlocks the true acceleration potential of 8-bit integer math on edge devices, maintaining high predictive fidelity across diverse vision architectures.

Cite This Work

If you use Kenosis or reference this research in your work, please cite the paper as follows:

Coomler, C. (2026). Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering. Zenodo. https://doi.org/10.5281/zenodo.20657989

BibTeX Citation
@article{coomler2026fusion,
  title={Fusion-Aware QDQ Placement: Achieving Native Kernel Fusion in ONNX via Graph Reordering},
  author={Coomler, Cory},
  year={2026},
  publisher={Zenodo},
  doi={10.5281/zenodo.20657989},
  url={https://doi.org/10.5281/zenodo.20657989}
}