When Point Metrics Mislead

Structure-Aware Evaluation Reveals Conditional Ranking Shifts in Time Series Anomaly Detection

Youngmin Ko

Abstract

Time series anomaly detection is commonly reported with point-wise metrics such as AUC-ROC, while many benchmark anomalies are sustained temporal segments. This project evaluates when point-level and segment-aware metrics induce different pairwise model rankings, and provides a lightweight reproducibility artifact for validating the reported ranking-shift observations.

Key Findings

  • Ranking flips: Pairwise orderings differ between AUC-ROC and Aff-F1 in 14/60 deep-model comparisons, and in 44/126 comparisons when classical baselines are included.
  • Benchmark structure: Under the processed labels used in this artifact, four industrial benchmarks contain no short anomaly segments.
  • SAEScore usage: SAEScore is reported as a composite summary that exposes reporting regimes, not as a universal replacement for its constituent metrics.
  • TSB-AD-M audit scale: The TSB-AD-M replication audit covers 25 models, 180 multivariate series, and 4,498 recomputed model-series rows.
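The ranking-flip counts above follow from a simple pairwise definition: a model pair "flips" when one metric orders the pair one way and the other metric strictly reverses it. A minimal sketch of that counting logic (the function and the toy scores are illustrative, not the artifact's actual API):

```python
from itertools import combinations

def count_rank_flips(scores_a, scores_b):
    """Count model pairs whose ordering differs between two metrics.

    scores_a, scores_b: dicts mapping model name -> metric value
    (e.g. AUC-ROC and Aff-F1). Ties under either metric are not
    counted as flips.
    """
    flips, total = 0, 0
    for m1, m2 in combinations(sorted(scores_a), 2):
        d_a = scores_a[m1] - scores_a[m2]
        d_b = scores_b[m1] - scores_b[m2]
        total += 1
        if d_a * d_b < 0:  # strictly opposite orderings under the two metrics
            flips += 1
    return flips, total

# Toy example: A beats B on AUC-ROC but loses on Aff-F1.
auc = {"A": 0.90, "B": 0.85, "C": 0.80}
aff = {"A": 0.70, "B": 0.75, "C": 0.60}
print(count_rank_flips(auc, aff))  # -> (1, 3)
```

Applied to all pairs of deep models, this is the computation behind a figure such as 14/60.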

Figures

Figure 1. Anomaly-duration taxonomy


Paper Figure 1 shows the duration-stratified anomaly taxonomy across SWaT, MSL, SMAP, and WADI, computed from the processed label definitions retained in this artifact.

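The duration strata in Figure 1 are derived from contiguous runs of anomaly labels. A minimal sketch of how segment lengths can be extracted from a binary label sequence (illustrative helper, not the artifact's actual code):

```python
def segment_lengths(labels):
    """Return the lengths of contiguous anomaly runs in a binary label sequence."""
    lengths, run = [], 0
    for v in labels:
        if v:
            run += 1          # extend the current anomaly segment
        elif run:
            lengths.append(run)  # segment just ended
            run = 0
    if run:                   # trailing segment reaching the end of the series
        lengths.append(run)
    return lengths

print(segment_lengths([0, 1, 1, 0, 0, 1, 0, 1, 1, 1]))  # -> [2, 1, 3]
```

Binning these lengths (e.g. short versus sustained) yields the duration taxonomy; the finding that four industrial benchmarks contain no short segments is a statement about such a histogram under the processed labels.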

Figure 4. AUC-ROC versus Aff-F1


Paper Figure 4 compares AUC-ROC and Aff-F1, highlighting pairwise ranking discrepancies between point-level reporting and segment-aware evaluation across benchmark configurations.


Figure 5. TSB-AD-M replication audit


Paper Figure 5 presents the TSB-AD-M replication audit, showing taxonomy-weight distributions and the AUC-ROC versus SAEScore relationship across recomputed model-series rows.


Figure 13. Bootstrap rank comparison


Paper Figure 13 reports bootstrap-based rank comparisons between AUC-ROC and SAEScore, estimated under repeated resampling to characterize ranking uncertainty and stability.


Reproducibility Artifact

  • Derived summaries required to reproduce the reported evaluation tables and checks.
  • Validation scripts for rank-flip counting, alpha-stratified analysis, and bootstrap comparison.
  • Lightweight tests for consistency checks on retained artifact outputs.
  • No raw access-controlled datasets are redistributed; users must obtain upstream datasets independently.
Run the validation scripts from the repository root:

python scripts/validate_tab_rfr_counts.py
python scripts/compute_tsbad_alpha_stratified_rfr.py
python scripts/compute_rfr_bootstrap_ci.py --n-boot 100
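The core idea behind the bootstrap comparison: resample the recomputed rows with replacement, recompute the rank-flip rate on each resample, and take percentile bounds. A stdlib-only sketch under that assumption (the statistic and the toy rows are stand-ins, not the script's actual inputs):

```python
import random

def bootstrap_ci(rows, statistic, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a scalar statistic."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample the rows with replacement, same size as the original.
        resample = [rows[rng.randrange(len(rows))] for _ in rows]
        stats.append(statistic(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy rows: 1 marks a flipped pair, 0 an agreeing pair,
# mirroring the 14/60 deep-model figure reported above.
rows = [1] * 14 + [0] * 46
flip_rate = lambda rs: sum(rs) / len(rs)
low, high = bootstrap_ci(rows, flip_rate, n_boot=100)
print(flip_rate(rows), (low, high))
```

With `--n-boot 100` the script performs the same style of resampling over the recomputed model-series rows to characterize ranking uncertainty.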

Citation

@misc{ko2026pointmetricsmislead,
  title={When Point Metrics Mislead: Structure-Aware Evaluation Reveals Conditional Ranking Shifts in Time Series Anomaly Detection},
  author={Ko, Youngmin},
  year={2026},
  note={arXiv preprint, arXiv ID coming soon}
}