AnuLaya

How Scoring Works

AnuLaya listens to your tabla playing through the microphone and evaluates three things for every bol: did you play? (presence), were you on time? (timing), and did you play the right stroke? (timbre). This page describes the signal processing behind each of those judgments.


The problem

Scoring live percussion practice is harder than it sounds. The app knows what you should play: the composition defines every bol and its position in time. But comparing what the microphone captures against what was expected means bridging two very different acoustic worlds. On one side is the clean, deterministic audio the app can synthesize for each bol; on the other, a live microphone capture colored by the room, the instrument, the ambient noise floor, and the device's input chain.

Any scoring method that compares these two signals directly — raw waveforms, absolute amplitudes, even raw spectrograms — will be dominated by this structural gap rather than by the musical content. The core challenge is finding a representation where a correctly played Na recorded through a phone microphone looks the same as a synthesized Na, while still looking different from a Ti or a Ge.

Why online scoring is harder than offline

An offline scorer can take its time: process the full recording, use heavy libraries, run multiple passes. Online scoring during live practice has none of those luxuries. Audio arrives in small buffers and must be analyzed as it streams in, feedback has to land within a beat or two to be useful, and the whole pipeline runs on the phone in a single pass.

Scoring pipeline: onset detection, matching, presence gate, timing, and stroke classification

From time domain to frequency domain

Audio analysis begins with a choice of representation. The time domain — the raw waveform — tells us when sound is loud or quiet, but not what frequencies are present. A Na and a Ge can have similar peak amplitudes despite sounding completely different.

The frequency domain, obtained via the Fast Fourier Transform (FFT), decomposes a short window of audio into its constituent frequencies. Now we can see that Na concentrates energy in the 800–3000 Hz range (a sharp ring) while Ge packs energy below 300 Hz (a bass thud). The tradeoff: a single FFT frame loses temporal precision.

The spectrogram bridges both by sliding the FFT window across the signal (the Short-Time Fourier Transform). The result is a 2D matrix — time on one axis, frequency on the other — that captures how the spectral content of a sound evolves. Nearly every stage of the scoring pipeline operates on some form of this representation.
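
A minimal sketch of this sliding-window idea in Python with NumPy; the frame length, hop size, and Hann window are illustrative choices, not the app's actual parameters:

  import numpy as np

  def stft_magnitude(signal, frame_len=1024, hop=256):
      """Magnitude spectrogram: |FFT| of overlapping, windowed frames."""
      window = np.hanning(frame_len)
      n_frames = 1 + (len(signal) - frame_len) // hop
      frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                         for i in range(n_frames)])
      # rfft keeps only the non-redundant half of the spectrum for real input
      return np.abs(np.fft.rfft(frames, axis=1))  # shape: (time, frequency)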


Mel scale and perceptual alignment

Human hearing is approximately logarithmic: the perceived interval between 100 Hz and 200 Hz feels the same as between 1000 Hz and 2000 Hz. A standard spectrogram spaces frequency bins linearly, which wastes resolution on high frequencies we barely distinguish.

The Mel spectrogram applies a perceptually motivated nonlinear mapping, grouping FFT bins into triangular filter banks spaced along the Mel scale. This reduces dimensionality (from thousands of FFT bins to 128 Mel bands) while aligning the representation with how we actually hear. A subsequent log transform aligns the magnitude axis with perceived loudness.

This Mel spectrogram is the shared foundation for both onset detection and stroke classification.
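
With librosa (listed in the references below), the same representation takes a few lines; the FFT size, hop length, and input filename here are hypothetical:

  import librosa

  y, sr = librosa.load("practice_take.wav", sr=None)  # hypothetical file
  mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                       hop_length=256, n_mels=128)
  log_mel = librosa.power_to_db(mel)  # log magnitude tracks perceived loudness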


Step 1: Onset detection — when did you hit?

Before evaluating what you played, the system must detect when you played. Tabla strokes are sharp percussive transients — a sudden burst of energy across many frequency bands.

The detector computes spectral flux: the frame-to-frame increase in energy across the Mel spectrogram. When a stroke lands, energy appears suddenly across many bands, producing a sharp spike in flux. During silence or sustained resonance, the spectrum changes slowly and flux stays near zero.

Onset times are extracted by peak-picking on the flux signal: a peak that exceeds a threshold and is separated from the previous onset by a minimum interval is declared a new stroke. The detected onset is then backtracked slightly to the point where energy first began rising, capturing the true attack moment rather than the peak.
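
A sketch of the flux and peak-picking logic described above; the threshold, minimum-gap, and backtracking details are illustrative stand-ins for the tuned production values:

  import numpy as np

  def detect_onsets(log_mel, threshold=2.0, min_gap=5):
      """log_mel: array of shape (n_mels, n_frames). Returns onset frame indices."""
      # Spectral flux: per-frame sum of positive energy increases across bands.
      rise = np.maximum(np.diff(log_mel, axis=1), 0.0)
      flux = rise.sum(axis=0)

      onsets, last_peak = [], -min_gap
      for t in range(1, len(flux) - 1):
          is_peak = flux[t] > flux[t - 1] and flux[t] >= flux[t + 1]
          if is_peak and flux[t] > threshold and t - last_peak >= min_gap:
              start = t
              while start > 0 and flux[start - 1] < flux[start]:
                  start -= 1          # backtrack to where the rise began
              onsets.append(start)
              last_peak = t
      return onsets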

Spectral flux signal with sharp peaks at each tabla stroke, marked as detected onsets

Step 2: Matching — which bol was that?

The app knows the full composition schedule: every expected bol and its precise time position (adjusted for tempo and latency). When an onset is detected, it needs to be matched to the nearest expected bol.

Greedy matching pairs each detected onset to the nearest unmatched expected bol within a search window, subject to a monotonicity constraint — each onset can match at most one bol, and matches must proceed forward in time. This prevents one loud hit from claiming credit for multiple bols.

If no onset falls near an expected bol, that bol is marked as missed. If multiple onsets cluster around a single expected bol (a sign of "machine-gunning" — rapid random hitting), the extra hits penalize the score.
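
A sketch of the greedy, monotonic matcher, assuming expected bol times are sorted; the search window is an illustrative parameter:

  def match_onsets(onsets, expected, window=0.15):
      """Pair onsets (seconds) with expected bol times. Returns
      (matches, missed bol indices, extra onsets)."""
      matches, extras = [], []
      next_bol = 0                      # monotonicity: never match backwards
      for onset in sorted(onsets):
          best, best_err = None, window
          for j in range(next_bol, len(expected)):
              if expected[j] > onset + window:
                  break                 # sorted: nothing later can match
              err = abs(onset - expected[j])
              if err <= best_err:
                  best, best_err = j, err
          if best is None:
              extras.append(onset)      # stray hit with no bol to claim
          else:
              matches.append((onset, best))
              next_bol = best + 1       # each bol matched at most once
      matched = {j for _, j in matches}
      missed = [j for j in range(len(expected)) if j not in matched]
      return matches, missed, extras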


Step 3: Presence — did you play at all?

Presence measures whether you produced sound when the composition expected it. It compares the energy in your audio against the expected energy for that bol's time slot.

Three cases arise:

Expected         You played        Result
Bol              Sound detected    Scored normally
Rest (silence)   Sound detected    Penalized: "played in silence"
Rest (silence)   Silence           Correct, full credit

Presence acts as a gate: if it falls below a threshold, timing and timbre scores are zeroed. You cannot earn credit for a stroke you didn't play, and silence during a rest is the correct behavior.
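
A minimal sketch of the gate and the rest cases from the table above, assuming presence is already a 0–1 measure; the threshold is illustrative:

  def gate_scores(presence, timing_score, timbre_score, threshold=0.3):
      """Presence gate: no detected sound means no timing or timbre credit."""
      if presence < threshold:
          return 0.0, 0.0
      return timing_score, timbre_score

  def rest_outcome(sound_detected):
      """During a rest, silence is rewarded and sound is penalized."""
      return "penalized" if sound_detected else "full_credit"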


Step 4: Timing — were you on beat?

Timing is the simplest and most intuitive dimension. It measures the offset between your detected onset and the expected bol position, normalized by the composition's pace:

  timing_error = |t_onset - t_expected| / min_bol_spacing

The normalization uses the composition's fastest bol spacing as the reference, so timing standards are consistent regardless of whether a particular bol happens to sit in a slow or fast passage.
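
In code, the timing dimension might look like the sketch below; the linear falloff from normalized error to a 0–1 score is an assumed shape, not the app's documented curve:

  def timing_score(onset_t, expected_t, min_bol_spacing):
      """Timing error in units of the composition's fastest bol spacing."""
      error = abs(onset_t - expected_t) / min_bol_spacing
      return max(0.0, 1.0 - error)   # assumed: linear falloff, zero at one slot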

Latency calibration

The total delay between "metronome plays" and "onset detected in your audio" includes audio output latency, your reaction time, acoustic propagation, and input latency. The system auto-calibrates during the first few bols of each session by measuring the systematic offset component and compensating for it. This means scoring works correctly whether you're using wired headphones (~10 ms output latency) or Bluetooth (~100–200 ms).
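
One plausible implementation of the calibration, assuming signed offsets (detected minus expected) are collected for the first few matched bols; taking the median as the systematic-offset estimate is an assumption:

  import statistics

  def calibrate_latency(signed_offsets_ms):
      """Median offset over the first few bols: robust to one bad stroke."""
      return statistics.median(signed_offsets_ms)

  # Later onsets are shifted by the estimate before timing is scored:
  # corrected_onset_ms = raw_onset_ms - calibrate_latency(first_bol_offsets)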


Step 5: Timbre — did you play the right stroke?

This is the hardest problem and the one that required the most iteration to solve.

The evolution

Attempt 1 — MFCCs and Dynamic Time Warping. The first approach used Mel-Frequency Cepstral Coefficients (MFCCs) — a 13-dimensional compression of the Mel spectrogram — and aligned user audio against a synthesized golden reference using Dynamic Time Warping (DTW). DTW finds the optimal time-stretching to align two sequences, which should reveal both timing offsets and timbre mismatches.

This worked beautifully on well-played compositions. But it failed catastrophically against "machine-gunning": if a user played any continuous stream of random tabla strokes at roughly the right tempo, DTW awarded near-perfect scores. The root cause was that MFCCs compress spectral information so aggressively that any two percussive tabla strokes produce similar MFCC vectors. DTW, faced with a dense stream of plausible-looking frames, simply walked along the diagonal of its cost matrix and reported excellent alignment.

Side-by-side DTW cost matrices: correct playing produces a meaningful warping path, while machine-gunning produces a straight diagonal that scores equally well

Attempt 2 — Z-normalized Mel spectral comparison (M2). To break the circularity of DTW cherry-picking flattering frame pairings, the timbre metric was decoupled from the warping path. Instead of 13 MFCCs, the system used the full 128-band Mel spectrogram, and each frame was Z-normalized independently — subtracting the mean and dividing by the standard deviation across Mel bands.

Z-normalization is the key insight. Cosine similarity (comparing vector direction while ignoring magnitude) handles multiplicative differences like overall volume. But microphone recordings also have additive distortions — an ambient noise floor that elevates all frequency bands uniformly. Z-normalization removes both multiplicative and additive effects, isolating the purely relative distribution of energy across frequency bands. After normalization, a Na recorded through a phone microphone produces the same spectral shape as a synthesized Na.
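
A sketch of the M2 comparison for one pair of Mel frames: Z-normalize each frame across its 128 bands, then compare the resulting shapes with cosine similarity:

  import numpy as np

  def znorm(frame, eps=1e-8):
      """Remove gain (multiplicative) and noise-floor (additive) differences,
      leaving only the relative energy distribution across Mel bands."""
      return (frame - frame.mean()) / (frame.std() + eps)

  def timbre_similarity(user_frame, ref_frame):
      a, b = znorm(user_frame), znorm(ref_frame)
      return float(np.dot(a, b) /
                   (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))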

This metric achieved a Cohen's d of 1.56 — the distributions of correct-bol scores and wrong-bol scores barely overlap, making threshold-based classification reliable. But it required synthesizing and holding a full golden reference in memory, making it expensive for real-time use.

Z-normalization removes noise floor and gain differences, making synthesis and microphone spectra comparable while preserving bol identity

Attempt 3 — Acoustic property classification. The current approach replaces spectral comparison against a golden reference with a lightweight stroke classifier. Rather than asking "does this sound like the synthesized version?", it asks "does this stroke have the acoustic properties we'd expect?"

Four acoustic classes

Tabla strokes fall into four classes defined by their physical production: damped strokes that decay almost immediately, resonant treble strokes that ring on the dahina, resonant bass strokes that hum on the baya, and strokes that combine treble and bass resonance.

Four acoustic classes of tabla strokes: Damped (fast decay), Resonant Treble (dahina ring), Resonant Bass (baya hum), and Both (treble + bass)

The classifier detects binary acoustic properties from a short window of Mel spectrogram around the detected onset: does sustained energy persist in the treble bands (the ring), and does it persist in the bass bands (the hum)? Damped strokes show neither; strokes in the "Both" class show both.

Scoring then checks whether the detected properties match what's expected for the target bol. A Na should have treble resonance; if you play a damped stroke instead, that's a "wrong bol." The check for Dha is deliberately permissive, accepting either treble or bass resonance, since Dha and Dhin have different spectral profiles despite both belonging to the "Both" class.
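
A sketch of the property check; the band boundaries, tail window, and sustain threshold are illustrative stand-ins for the hand-tuned values, and the bol table is a hypothetical subset:

  import numpy as np

  # (wants_treble, wants_bass) per bol: an illustrative subset.
  EXPECTED = {"Na": (True, False), "Ti": (False, False),
              "Ge": (False, True), "Dha": (True, True)}

  def detect_properties(log_mel_window, bass=slice(0, 20),
                        treble=slice(60, 128), sustain_db=-30.0, tail=10):
      """Does energy persist in each band group after the attack?"""
      post_attack = log_mel_window[:, -tail:]
      return (post_attack[treble].mean() > sustain_db,   # treble resonance
              post_attack[bass].mean() > sustain_db)     # bass resonance

  def timbre_ok(bol, has_treble, has_bass):
      wants_treble, wants_bass = EXPECTED[bol]
      if wants_treble and wants_bass:        # "Both" bols: permissive
          return has_treble or has_bass
      if wants_treble:
          return has_treble
      if wants_bass:
          return has_bass
      return not (has_treble or has_bass)    # damped: neither ring nor hum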

Why this works better than a 4-way classifier

A naive 4-class classifier (predict D/RT/RB/B, check against expected) fails because the "Both" class is acoustically heterogeneous — Dha is bass-dominant while Dhin is treble-dominant. A single class label can't capture this. The property-based approach sidesteps the problem: it checks for the presence of expected resonance rather than requiring a single classification decision.

Composition-informed scoring

A critical advantage: during practice, the app always knows what bol is expected at each position. The classifier doesn't need to identify strokes in isolation — it only needs to confirm or deny the expected class. Even modest raw classification accuracy becomes highly useful when the prior is known.


Putting it together

The final per-bol score combines timing (weighted 70%) and timbre (weighted 30%), gated by presence:

  1. Detect onsets via Mel spectral flux.
  2. Match each onset to the nearest expected bol.
  3. Gate on presence — no sound means no score.
  4. Measure timing offset from onset to expected position.
  5. Classify the stroke's acoustic properties against the expected bol.
  6. Penalize extra hits in the same bol slot.
  7. Aggregate into a per-bol score, per-cycle breakdown, and session summary.
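
In code, the combination might look like this; the 70/30 weights come from the text above, while the flat per-extra-hit deduction and the presence threshold are assumptions:

  def bol_score(presence, timing, timbre, extra_hits=0,
                presence_threshold=0.3, extra_penalty=0.2):
      """Combine the per-bol dimensions into one 0-1 score."""
      if presence < presence_threshold:
          return 0.0                        # presence gate
      score = 0.7 * timing + 0.3 * timbre   # weights from the text
      score -= extra_penalty * extra_hits   # machine-gunning deduction
      return max(0.0, min(1.0, score))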

Machine-gunning — the failure mode that drove much of this development — is now caught by multiple independent mechanisms: wrong-class classification, extra-onset penalties, and the presence gate. A high overall score requires playing the right strokes at approximately the right times.


What's next

The current property-based classifier uses hand-tuned thresholds. The next step is a small convolutional neural network trained on multi-resolution Mel spectrograms around each onset, producing continuous confidence scores rather than binary property decisions. The architecture is designed to be small enough (~15K parameters) to run on-device via CoreML with sub-millisecond inference.

Beyond that: personalized models that adapt to your specific instrument and playing style, and finer-grained bol identification beyond the four acoustic classes.


References

  1. R. M. Ananthanarayana, A. Bhattacharjee, and P. Rao, "Four-way classification of tabla strokes with transfer learning using western drums," Transactions of the International Society for Music Information Retrieval (TISMIR), vol. 6, no. 1, 2023.

  2. H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.

  3. S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.

  4. J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 1035–1047, 2005.

  5. B. McFee, C. Raffel, D. Liang, D. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference (SciPy), pp. 18–24, 2015.

  6. A. Kumar, S. Ashok, and N. Tiwari, "Indian music tabla bols classification using deep learning," Proceedings of Meetings on Acoustics, 2025.