Reading Notes on the ISMIR 2021 Tutorial on Tempo, Beat, and Downbeat Estimation

Tutorial link: Tempo, Beat, and Downbeat Estimation


Automatic annotation + automatic correction + manual correction


See 音乐速度与节拍估计基本方法 (Basic Methods for Music Tempo and Beat Estimation) | WiZardWen



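F-measure

  • Beat estimation is evaluated as a binary classification problem: each estimated beat either matches an annotation or it does not.
  • A tolerance window of ±70 ms around each ground-truth annotation is typically used.
  • Benefit of F-measure with a tolerance window: it either i) absorbs natural human variation in tapping without punishing it, or ii) copes with cases such as arpeggiated chords, where it is difficult to mark a single beat location.
  • Computation: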

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$F\text{-}measure = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$$
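
The tolerance-window matching and the three formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `beat_f_measure`, the greedy one-to-one matching strategy, and the example beat times are assumptions of this sketch, and evaluation libraries may match beats to annotations differently.

```python
def beat_f_measure(reference, estimated, tolerance=0.07, beta=1.0):
    """Return (precision, recall, F-measure) for beat times in seconds."""
    reference = sorted(reference)
    estimated = sorted(estimated)
    matched = [False] * len(estimated)
    tp = 0
    for ref in reference:
        # Pick the closest still-unmatched estimate inside the window
        # (greedy matching: an assumption of this sketch).
        best, best_err = None, tolerance
        for j, est in enumerate(estimated):
            err = abs(est - ref)
            if not matched[j] and err <= best_err:
                best, best_err = j, err
        if best is not None:
            matched[best] = True
            tp += 1
    fp = len(estimated) - tp  # estimates with no annotation nearby
    fn = len(reference) - tp  # annotations that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Made-up example: three of four annotations are hit, two estimates are spurious.
p, r, f = beat_f_measure([0.5, 1.0, 1.5, 2.0], [0.52, 1.0, 1.6, 2.05, 2.5])
```

With `beta=1.0` precision and recall are weighted equally, which is the usual choice for beat tracking.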

Cemgil’s method

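  • Uses a Gaussian function of the timing error as the score, instead of a hard tolerance window.
  • F-measure and Cemgil are best used together: F-measure reflects whether the estimated beats are at the right metrical level and phase, while Cemgil reflects how precisely the beats are located in time.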
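
A minimal sketch of the Gaussian scoring described above may make the contrast with F-measure clearer. The function name `cemgil_accuracy`, the default standard deviation of 40 ms, and the normalisation by the mean number of beats are assumptions of this sketch (they follow the commonly used formulation, but the notes themselves do not state them).

```python
import math


def cemgil_accuracy(reference, estimated, sigma=0.04):
    """Gaussian-weighted beat accuracy for beat times in seconds."""
    if not reference or not estimated:
        return 0.0
    total = 0.0
    for ref in reference:
        # Timing error to the nearest estimated beat.
        err = min(abs(est - ref) for est in estimated)
        # A perfect hit scores 1; the score decays smoothly with the error.
        total += math.exp(-(err ** 2) / (2 * sigma ** 2))
    # Normalising by the mean beat count penalises extra and missing beats.
    return total / (0.5 * (len(reference) + len(estimated)))
```

A constant offset of 40 ms on every beat would still count as fully correct under a ±70 ms F-measure window, but it lowers this score noticeably, which is why the two measures complement each other.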

Continuity-based method

  • Beat i is considered accurate if it falls within its tolerance window and beat i-1 also falls within its respective tolerance window.
  • CMLc: “correct” (i.e., annotated) metrical level, scored by the longest single continuous segment.
  • CMLt: “correct” (i.e., annotated) metrical level, scored by the total of all continuous segments.
  • AMLc: “allowed” metrical levels, scored by the longest single continuous segment.
  • AMLt: “allowed” metrical levels, scored by the total of all continuous segments.
  • The allowed metrical levels are:
    • The same metrical level and “in-phase”
    • The same metrical level, but tapped on the “off-beat”
    • Twice the annotated metrical level
    • Half the annotated metrical level (two variants)
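
The continuity rule and the allowed metrical levels listed above can be sketched as follows. This is a deliberately simplified illustration, not the reference definition: the function names, the window of 17.5% of the local inter-annotation interval (`rel_tol`), and the choice to ignore extra estimated beats and inter-beat-interval consistency are all assumptions of this sketch.

```python
def continuity_scores(reference, estimated, rel_tol=0.175):
    """Return (CMLc, CMLt) under a simplified continuity rule."""
    n = len(reference)
    if n < 2 or not estimated:
        return 0.0, 0.0
    # Step 1: is each annotation hit by some estimate within its window?
    hit = []
    for i, ref in enumerate(reference):
        prev = reference[i - 1] if i > 0 else reference[0]
        interval = ref - prev if i > 0 else reference[1] - reference[0]
        hit.append(any(abs(est - ref) <= rel_tol * interval for est in estimated))
    # Step 2: beat i counts only if beat i-1 was hit as well.
    correct = [hit[0]] + [hit[i] and hit[i - 1] for i in range(1, n)]
    # Step 3: longest run of correct beats (c) and total correct beats (t).
    longest = current = 0
    for ok in correct:
        current = current + 1 if ok else 0
        longest = max(longest, current)
    return longest / n, sum(correct) / n


def metrical_variants(reference):
    """Annotation sequences for the allowed metrical levels."""
    mids = [(a + b) / 2 for a, b in zip(reference, reference[1:])]
    return {
        "in_phase": list(reference),
        "off_beat": mids,
        "double": sorted(list(reference) + mids),
        "half_odd": list(reference[0::2]),
        "half_even": list(reference[1::2]),
    }


def aml_scores(reference, estimated, rel_tol=0.175):
    """Return (AMLc, AMLt): best continuity scores over all allowed variants."""
    results = [continuity_scores(v, estimated, rel_tol)
               for v in metrical_variants(reference).values()]
    return max(c for c, _ in results), max(t for _, t in results)
```

For a tracker that taps at half the annotated tempo, `continuity_scores` is close to zero while `aml_scores` is high, which is exactly the distinction between the CML and AML variants.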


  • Choosing an evaluation method
  • Extending to tempo and downbeat

Theoretical Underpinnings

General pipeline commonly used for beat and/or downbeat tracking systems: feature extraction -> likelihood estimation -> post-processing.

Feature extraction

  • Three most explored categories of musically inspired features
    • Chroma (CH): reflects the harmonic content of the signal
    • Onset detection function (ODF) or spectral flux (SF): event-oriented indicators
    • Spectral coefficients or MFCCs: timbre-inspired features
  • More recently: combinations of logarithmic spectrograms with different resolutions
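
As an illustration of the event-oriented features listed above, here is a minimal spectral flux sketch using NumPy. The frame and hop sizes, the Hann window, the `log1p` compression, and the function name `spectral_flux` are assumptions chosen for this example rather than settings taken from the tutorial.

```python
import numpy as np


def spectral_flux(signal, frame_size=1024, hop_size=512):
    """Half-wave rectified spectral flux, one value per frame transition."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_size: i * hop_size + frame_size] * window
        for i in range(n_frames)
    ])
    # Log-compressed magnitude spectrogram, shape (n_frames, n_bins).
    log_mag = np.log1p(np.abs(np.fft.rfft(frames, axis=1)))
    # Keep only increases in energy between consecutive frames.
    diff = np.diff(log_mag, axis=0)
    return np.maximum(diff, 0.0).sum(axis=1)


# Synthetic example: half a second of silence, then a 440 Hz tone.
sr = 8000
t = np.arange(sr) / sr
tone = np.where(t >= 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
flux = spectral_flux(tone)
```

The flux stays at zero during the silence and peaks around the frames that contain the start of the tone, which is what makes it usable as an onset detection function.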

Likelihood estimation

  • Heuristic or traditional machine learning methods
    • Templates
    • GMM, k-means: recognize rhythm patterns
    • Limitation: they require assumptions about style or genre
  • Deep learning methods

Inference (post-processing)

  • Obtain the final downbeat sequence
  • Most used: Probabilistic graphical models (PGMs)
    • PGMs are a set of probabilistic models that express conditional dependencies between random variables as a graph.
    • Directed or Undirected
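
To make the inference step more concrete, here is a toy sketch that decodes beat positions with a hidden Markov model, one of the simplest directed PGMs. Everything specific here is an assumption of the sketch: the cyclic state space with a fixed beat period, the near-deterministic transitions, the observation model built from a made-up activation curve, and the function names. Models used in practice are considerably richer.

```python
import numpy as np


def viterbi(log_init, log_trans, log_obs):
    """Most likely state path; log_obs has shape (n_frames, n_states)."""
    n_frames, n_states = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((n_frames, n_states), dtype=int)
    for t in range(1, n_frames):
        scores = delta[:, None] + log_trans  # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    # Trace the best path backwards from the best final state.
    path = [int(delta.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


def decode_beats(activation, period):
    """Frames decoded as beats, assuming a fixed period in frames."""
    activation = np.asarray(activation, dtype=float)
    # State s is the position inside the beat period; state 0 is "on the beat".
    trans = np.full((period, period), 1e-9)
    for s in range(period):
        trans[s, (s + 1) % period] = 1.0
    trans /= trans.sum(axis=1, keepdims=True)
    # Beat state explains high activations, all other states explain low ones.
    obs = np.tile((1.0 - activation)[:, None], (1, period))
    obs[:, 0] = activation
    init = np.full(period, 1.0 / period)
    path = viterbi(np.log(init), np.log(trans), np.log(obs))
    return [t for t, s in enumerate(path) if s == 0]


# Made-up activation: clear peaks at frames 2 and 10, a weak one at frame 6.
beats = decode_beats([0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.3, 0.1, 0.1, 0.1, 0.9, 0.1], 4)
```

The point of the example is that the weak peak at frame 6 is still decoded as a beat, because the graph structure encodes the expectation that beats are evenly spaced.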

Pros and Cons of DL

  • Pros
    • Flexible and adaptable across tasks
    • Removes the stage of hand-crafted feature design
  • Cons
    • Dependence on annotated data
    • Bias in the data
    • Lack of interpretability

DNN architectures

  • Brief introductions to MLP, CNN, RNN, GRU, and bidirectional RNN
  • TCN:
    • Uses dilated convolutions which enable exponentially large receptive fields
    • Good at learning sequential/temporal structure
    • Handles more context while retaining the parallelisation property of CNNs
    • Can be trained more efficiently than RNNs, LSTMs, or GRUs
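
The claim about exponentially large receptive fields can be checked with a small NumPy sketch. The helper names, the kernel size of 3, and the dilation schedule 1, 2, 4, 8 are assumptions picked for illustration; the convolution here is a plain non-causal one with zero padding, without the residual connections or activations of a full TCN.

```python
import numpy as np


def receptive_field(kernel_size, dilations):
    """Receptive field (in frames) of stacked dilated 1-D convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


def dilated_conv1d(x, kernel, dilation):
    """Same-length dilated convolution with zero padding (odd kernel size)."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    padded = np.pad(x, pad)
    out = np.zeros(len(x))
    for i, w in enumerate(kernel):
        # Each kernel tap reads the input shifted by a multiple of the dilation.
        out += w * padded[i * dilation: i * dilation + len(x)]
    return out


# Push a single impulse through four layers with doubling dilation.
dilations = [1, 2, 4, 8]
x = np.zeros(101)
x[50] = 1.0
for d in dilations:
    x = dilated_conv1d(x, np.ones(3), d)
reached = int(np.count_nonzero(x))  # number of frames the impulse influences
```

Here `reached` equals `receptive_field(3, dilations)`: doubling the dilation in each layer roughly doubles the context per added layer, whereas undilated layers would only add `kernel_size - 1` frames each.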