Skip to content

memd — Math

From raw audio to knowledge graph, every step, every formula


Table of Contents

  1. Sound as Numbers
  2. Pre-emphasis Filter
  3. Framing and Windowing
  4. Fourier Transform — Decomposing Sound into Frequencies
  5. Power Spectrum
  6. Mel Scale — Matching Human Hearing
  7. Mel Filterbank
  8. Log Compression
  9. The Log-Mel Spectrogram — Full Pipeline
  10. Fixed-Point Arithmetic — Doing Math Without Decimals
  11. Neural Networks — Learning from Data
  12. Activation Functions
  13. Loss Function and Training
  14. Backpropagation — How Gradients Flow
  15. Convolution — Pattern Detection on Grids
  16. Depthwise Separable Convolution — Cheaper Convolution
  17. Pooling — Shrinking Feature Maps
  18. Softmax — Converting Scores to Probabilities
  19. The Full memd-net Architecture
  20. INT8 Quantisation — Shrinking the Model
  21. INT8 Inference Runtime — memd-rt
  22. The Memory Packet
  23. Cosine Similarity — Finding Similar Memories
  24. Knowledge Graph and Session Merging
  25. Full End-to-End Pipeline

1. Sound as Numbers

Sound is variation in air pressure over time. A microphone converts that pressure variation into electrical voltage. An Analog-to-Digital Converter (ADC) inside the ESP32-S3 samples that voltage at regular intervals and records each measurement as an integer.

Sample rate: 16,000 samples per second (16 kHz). This means one measurement every:

\[\Delta t = \frac{1}{16000} = 0.0000625 \text{ seconds} = 0.0625 \text{ ms}\]

Each sample is a 16-bit signed integer, range \(-32768\) to \(32767\).

One second of audio:

\[x = [x_0, x_1, x_2, \ldots, x_{15999}]\]

Loud sound → large magnitude numbers. Silence → values near zero.

What you get from the INMP441 microphone: The INMP441 communicates over I2S (Inter-IC Sound) — a serial bus protocol that streams audio samples continuously. The ESP32-S3 reads these samples into a buffer in RAM.

PSEUDOCODE: Reading audio from I2S

function read_audio_frame(duration_ms):
    num_samples = (duration_ms / 1000) × SAMPLE_RATE   // e.g. 2000ms → 32000
    buffer = allocate_array(num_samples, type=int16)

    i2s_read(port=I2S_PORT_0, destination=buffer, length=num_samples)

    return buffer

2. Pre-emphasis Filter

Speech and environmental audio naturally have more energy at low frequencies than high frequencies. This imbalance makes the neural network harder to train — it sees strong low-frequency signal and weak high-frequency signal, and learns to mostly ignore high frequencies.

Pre-emphasis is a high-pass filter that boosts high frequencies before any other processing. It amplifies rapid changes between adjacent samples (rapid change = high frequency content).

\[y[n] = x[n] - \alpha \cdot x[n-1], \quad \alpha = 0.97\]

Where:

  • \(x[n]\) = current sample
  • \(x[n-1]\) = previous sample
  • \(\alpha = 0.97\) = pre-emphasis coefficient (standard value)
  • \(y[n]\) = filtered output

Why \(\alpha = 0.97\)? It’s a well-established empirical choice for speech processing. Higher \(\alpha\) = stronger boost to high frequencies. Lower = gentler.

What this does mathematically: a flat signal where \(x[n] \approx x[n-1]\) produces \(y[n] \approx 0\) (suppressed). A rapidly oscillating signal where \(x[n]\) and \(x[n-1]\) differ a lot produces large \(y[n]\) (amplified). This is the definition of high-pass filtering.

Example: $\(x = [1000, 1020, 980, 1050, 300]\)$ $\(y[0] = 1000 \quad \text{(no previous sample, use as-is)}\)$ $\(y[1] = 1020 - 0.97 \times 1000 = 1020 - 970 = 50\)$ $\(y[2] = 980 - 0.97 \times 1020 = 980 - 989.4 = -9.4\)$ $\(y[3] = 1050 - 0.97 \times 980 = 1050 - 950.6 = 99.4\)$ $\(y[4] = 300 - 0.97 \times 1050 = 300 - 1018.5 = -718.5\)$

Note how the large jump from 1050 to 300 produces a huge output (\(-718.5\)) — the filter is sensitive to change.

PSEUDOCODE: Pre-emphasis filter

function pre_emphasis(x, alpha=0.97):
    y = allocate_array(length(x), type=float)
    y[0] = x[0]

    for n from 1 to length(x) - 1:
        y[n] = x[n] - alpha × x[n-1]

    return y

3. Framing and Windowing

3.1 Why frames?

A Fourier Transform (next section) assumes the signal is stationary — its frequency content doesn’t change during the analysis window. Real audio is not stationary. The word “hello” goes through completely different sounds in 500ms.

Solution: chop the signal into short overlapping chunks called frames. Each frame is short enough to be approximately stationary (25–32ms).

Frame parameters for memd:

  • Frame length: \(N = 512\) samples \(= 32\text{ms}\) at 16 kHz
  • Hop size (step between frames): \(H = 512\) samples for a non-overlapping 32×32 grid

Wait — for 1 second of audio with 32 frames and hop size H:

\[\text{num_frames} = \left\lfloor \frac{\text{total_samples} - N}{H} \right\rfloor + 1 = 32\]

Solving: \(H = \lfloor(16000 - 512) / 31\rfloor = \lfloor 496.4 \rfloor = 496\) samples \(\approx 31\text{ms}\) hop.

PSEUDOCODE: Framing

function frame_signal(x, frame_length=512, hop_size=496):
    frames = []
    start = 0

    while start + frame_length <= length(x):
        frame = x[start : start + frame_length]   // slice of 512 samples
        frames.append(frame)
        start += hop_size

    return frames   // list of 32 arrays, each 512 samples

3.2 The Hamming window

When you cut out a frame, the edges are abrupt discontinuities — the signal goes from some non-zero value to exactly zero at the boundary (because you’re cutting). In the frequency domain, an abrupt cut looks like a rectangular pulse, which creates high-frequency ringing artifacts called spectral leakage — fake frequencies that don’t exist in the original signal.

Fix: multiply the frame by a smooth function that tapers to zero at both edges. The Hamming window:

\[w[n] = 0.54 - 0.46 \cdot \cos!\left(\frac{2\pi n}{N - 1}\right), \quad n = 0, 1, \ldots, N-1\]

Where \(N\) is the frame length (512). The cosine function oscillates between -1 and +1:

  • At \(n=0\): \(\cos(0) = 1 \Rightarrow w[0] = 0.54 - 0.46 = 0.08 \approx 0\)
  • At \(n=N/2\): \(\cos(\pi) = -1 \Rightarrow w[N/2] = 0.54 + 0.46 = 1.0\)
  • At \(n=N-1\): \(\cos(2\pi) = 1 \Rightarrow w[N-1] = 0.08 \approx 0\)

A smooth bell peaking at 1.0 in the middle, fading to near-zero at the edges.

Apply the window:

\[x_{\text{windowed}}[n] = x_{\text{frame}}[n] \cdot w[n]\]
PSEUDOCODE: Hamming window

function compute_hamming_window(N=512):
    w = allocate_array(N, type=float)
    for n from 0 to N-1:
        w[n] = 0.54 - 0.46 × cos(2π × n / (N - 1))
    return w

function apply_window(frame, window):
    result = allocate_array(length(frame), type=float)
    for n from 0 to length(frame) - 1:
        result[n] = frame[n] × window[n]
    return result

4. Fourier Transform

4.1 The core idea

Any signal — no matter how complex — can be expressed as a sum of pure sine waves at different frequencies and amplitudes. The Fourier Transform finds those frequencies and their amplitudes.

A pure sine wave at frequency \(f\), amplitude \(A\), and phase \(\phi\):

\[s(t) = A \cdot \sin(2\pi f t + \phi)\]

The Discrete Fourier Transform (DFT) is the version for discrete samples. Given a frame of \(N\) samples \(x[0], x[1], \ldots, x[N-1]\), it computes \(N\) complex numbers \(X[0], X[1], \ldots, X[N-1]\):

\[X[k] = \sum_{n=0}^{N-1} x[n] \cdot e^{-j \frac{2\pi k n}{N}}, \quad k = 0, 1, \ldots, N-1\]

4.2 Breaking down the formula

Euler’s formula (fundamental identity of complex exponentials):

\[e^{j\theta} = \cos(\theta) + j \cdot \sin(\theta)\]

Therefore:

\[e^{-j \frac{2\pi k n}{N}} = \cos!\left(\frac{2\pi k n}{N}\right) - j \cdot \sin!\left(\frac{2\pi k n}{N}\right)\]

Substituting back:

\[X[k] = \sum_{n=0}^{N-1} x[n] \cdot \cos\!\left(\frac{2\pi k n}{N}\right) - j \sum_{n=0}^{N-1} x[n] \cdot \sin\!\left(\frac{2\pi k n}{N}\right)\]

Each \(X[k]\) is a complex number \(a + jb\) with:

  • Real part: \(a = \sum x[n] \cos(2\pi k n / N)\)
  • Imaginary part: \(b = -\sum x[n] \sin(2\pi k n / N)\)

What does \(X[k]\) mean? It measures how well your signal correlates with a pure sinusoid at frequency \(k\). If your signal contains energy at exactly frequency \(k\), this correlation is large. Otherwise it’s near zero.

Frequency of bin \(k\):

\[f_k = k \cdot \frac{f_s}{N} = k \cdot \frac{16000}{512} = k \cdot 31.25 \text{ Hz}\]

So bin \(k=0\) → 0 Hz (DC component), bin \(k=1\) → 31.25 Hz, bin \(k=32\) → 1000 Hz, bin \(k=160\) → 5000 Hz.

Symmetry: for a real-valued input signal, \(X[N-k] = X[k]^*\) (complex conjugate). The spectrum is symmetric around the midpoint. You only need the first \(N/2 + 1 = 257\) bins.

4.3 The Fast Fourier Transform (FFT)

The naive DFT computes each \(X[k]\) by summing \(N\) terms, for all \(N\) values of \(k\): total \(O(N^2)\) operations. For \(N=512\): \(262,144\) multiplications.

The FFT (Cooley-Tukey algorithm, 1965) exploits the symmetry of the complex exponentials to recursively split the computation:

\[X[k] = \underbrace{\sum_{n \text{ even}} x[n] \cdot e^{-j\frac{2\pi k n}{N}}}*{\text{even samples DFT}} + e^{-j\frac{2\pi k}{N}} \underbrace{\sum*{n \text{ odd}} x[n] \cdot e^{-j\frac{2\pi k n}{N}}}_{\text{odd samples DFT}}\]

Recursively split even and odd sub-problems until you reach size-1 DFTs (trivial). Total operations: \(O(N \log_2 N)\).

For \(N=512\): \(512 \times \log_2(512) = 512 \times 9 = 4,608\) operations. 57× faster than naive.

You use esp-dsp’s dsps_fft2r_fc32 — no need to implement this yourself.

PSEUDOCODE: FFT on a frame (using library)

function compute_fft(windowed_frame):
    // Zero-pad or trim to power of 2 if needed (512 is already 2^9)
    complex_output = allocate_complex_array(512)

    // Copy real input into complex array (imaginary parts = 0)
    for n from 0 to 511:
        complex_output[n].real = windowed_frame[n]
        complex_output[n].imag = 0.0

    // In-place FFT (library call)
    fft_library_call(complex_output, N=512)

    // Only keep first 257 bins (symmetry)
    return complex_output[0:257]

5. Power Spectrum

You don’t need phase information (the imaginary part encodes phase). You want energy — how much power at each frequency. For each complex output \(X[k] = a + jb\):

\[P[k] = |X[k]|^2 = a^2 + b^2 = \text{Re}(X[k])^2 + \text{Im}(X[k])^2\]

This is the power spectrum — a list of 257 non-negative real numbers, one per frequency bin.

Why squared? Power is proportional to amplitude squared (from physics — power \(\propto A^2\)). It also makes all values non-negative, which is necessary before the log step later.

PSEUDOCODE: Power spectrum

function power_spectrum(fft_output):
    P = allocate_array(257, type=float)
    for k from 0 to 256:
        a = fft_output[k].real
        b = fft_output[k].imag
        P[k] = a×a + b×b
    return P

6. Mel Scale

6.1 Why Mel?

The FFT gives you 257 frequency bins equally spaced in Hz. But human perception of pitch is logarithmic, not linear.

The perceptual difference between 100 Hz and 200 Hz (one octave) sounds the same as between 1000 Hz and 2000 Hz (also one octave, also \(2\times\) ratio). But in linear Hz, the first gap is 100 Hz wide and the second is 1000 Hz wide — 10× different.

Also, our ears have much finer frequency discrimination at low frequencies than high. We can easily hear the difference between 200 Hz and 220 Hz (a musical half-step). But we struggle to distinguish 8000 Hz from 8020 Hz.

If you feed raw linear-spaced frequency bins into a neural network, you’re wasting capacity. The network has to learn that bins 1–10 (low freq) matter a lot and bins 200–257 (high freq) matter less. Better: give it input that already reflects perception.

6.2 The Mel formula

The Mel scale maps Hertz to a perceptual frequency scale:

\[m = 2595 \cdot \log_{10}!\left(1 + \frac{f}{700}\right)\]

Inverse (Mel to Hz):

\[f = 700 \cdot \left(10^{m/2595} - 1\right)\]

The constant 700 is the “bend point” — below 700 Hz the scale is nearly linear; above 700 Hz it becomes increasingly compressed.

Worked examples:

Hz Mel
0 0
100 150
500 607
1000 999
2000 1332
4000 1785
8000 2840

Note: going from 1000 to 2000 Hz = 333 mel. Going from 4000 to 8000 Hz (4× larger gap in Hz) = only 1055 mel. The high-frequency range is compressed.

PSEUDOCODE: Mel conversion

function hz_to_mel(f_hz):
    return 2595 × log10(1 + f_hz / 700)

function mel_to_hz(m):
    return 700 × (10^(m / 2595) - 1)

7. Mel Filterbank

7.1 Creating triangular filters

You want to convert your 257-bin power spectrum into 32 mel-scale values. You create 32 triangular filters that cover the frequency range in mel-spaced intervals.

Step 1: Choose frequency range. For audio: \(f_{\min} = 0\) Hz, \(f_{\max} = 8000\) Hz.

Step 2: Convert to mel: $\(m_{\min} = \text{hz_to_mel}(0) = 0, \quad m_{\max} = \text{hz_to_mel}(8000) = 2840\)$

Step 3: Space \(M + 2 = 34\) points evenly in mel space (\(M=32\) filters needs 34 boundary points):

\[m_i = m_{\min} + i \cdot \frac{m_{\max} - m_{\min}}{M + 1}, \quad i = 0, 1, \ldots, 33\]

Step 4: Convert each mel point back to Hz, then to the nearest FFT bin:

\[f_i = \text{mel_to_hz}(m_i)$$ $$b_i = \left\lfloor \frac{f_i \cdot N}{f_s} + 0.5 \right\rfloor = \text{nearest FFT bin}\]

Step 5: Define filter \(m\) (for \(m = 0, 1, \ldots, 31\)) as a triangle between bins \(b_m\) and \(b_{m+2}\), peaking at \(b_{m+1}\):

\[H_m[k] = \begin{cases} 0 & k < b_m \\[4pt] \dfrac{k - b_m}{b_{m+1} - b_m} & b_m \leq k < b_{m+1} \\[6pt] \dfrac{b_{m+2} - k}{b_{m+2} - b_{m+1}} & b_{m+1} \leq k \leq b_{m+2} \\[4pt] 0 & k > b_{m+2} \end{cases}\]

Filter \(m\) rises linearly from 0 to 1 between bins \(b_m\) and \(b_{m+1}\), then falls linearly from 1 to 0 between \(b_{m+1}\) and \(b_{m+2}\). Outside that range: zero.

7.2 Applying the filterbank

For each mel band \(m\), sum the power spectrum weighted by the filter:

\[S[m] = \sum_{k=0}^{256} H_m[k] \cdot P[k], \quad m = 0, 1, \ldots, 31\]

Result: 32 non-negative numbers, one per mel band. Each represents the total energy in that perceptual frequency region.

PSEUDOCODE: Mel filterbank

function compute_mel_filterbank(f_min=0, f_max=8000, n_filters=32, N=512, fs=16000):
    // Step 1-2: mel boundaries
    m_min = hz_to_mel(f_min)
    m_max = hz_to_mel(f_max)

    // Step 3: evenly spaced mel points
    mel_points = linspace(m_min, m_max, n_filters + 2)   // 34 points

    // Step 4: convert to FFT bin indices
    hz_points = [mel_to_hz(m) for m in mel_points]
    bin_points = [floor(f × N / fs + 0.5) for f in hz_points]

    // Step 5: build filter matrix (32 × 257)
    filterbank = zeros(n_filters, 257)
    for m from 0 to n_filters - 1:
        left  = bin_points[m]
        center = bin_points[m + 1]
        right  = bin_points[m + 2]

        // Rising slope
        for k from left to center - 1:
            filterbank[m][k] = (k - left) / (center - left)

        // Falling slope
        for k from center to right:
            filterbank[m][k] = (right - k) / (right - center)

    return filterbank

function apply_filterbank(P, filterbank, n_filters=32):
    S = zeros(n_filters)
    for m from 0 to n_filters - 1:
        for k from 0 to 256:
            S[m] += filterbank[m][k] × P[k]
    return S

8. Log Compression

After the filterbank, \(S[m]\) values span a huge range — perhaps \(10^6\) for loud sounds and \(1\) for quiet ones. Neural networks struggle with such large input ranges.

Apply the natural logarithm (or \(\log_{10}\)):

\[L[m] = \log(S[m] + \epsilon), \quad \epsilon = 10^{-6}\]

The \(\epsilon\) prevents \(\log(0)\) which is \(-\infty\).

Effect: $\(\log(10^6) = 6 \cdot \log(10) \approx 13.8\)$ $\(\log(10^4) = 4 \cdot \log(10) \approx 9.2\)$ $\(\log(1) = 0\)$

A signal that was \(10^6 \times\) stronger than another is now only \(\approx 13.8\) units louder on the log scale. The dynamic range is compressed from \(10^6\) to \(\sim 14\) units.

This also matches human loudness perception. The decibel scale is logarithmic: \(\text{dB} = 10 \log_{10}(P)\).

PSEUDOCODE: Log compression

function log_compress(S, epsilon=1e-6):
    L = allocate_array(length(S))
    for m from 0 to length(S) - 1:
        L[m] = log(S[m] + epsilon)
    return L

9. Log-Mel Spectrogram — Full Pipeline

Run the entire pipeline on 1 second of audio, collecting 32 frames:

\[\text{spectrogram}[t][m] = \log!\left(\sum_{k=0}^{256} H_m[k] \cdot |X_t[k]|^2 + \epsilon\right)\]

Where \(X_t[k]\) is the Fourier Transform of frame \(t\).

Result: a \(32 \times 32\) matrix. Each row = one mel band. Each column = one time frame. Each cell = log energy in that frequency band at that time.

This matrix is your neural network’s input. It is treated as a grayscale image.

PSEUDOCODE: Full spectrogram pipeline

function compute_log_mel_spectrogram(raw_audio, n_mels=32, n_frames=32):
    // Pre-processing
    audio_f = pre_emphasis(raw_audio, alpha=0.97)

    // Precompute Hamming window and mel filterbank once
    hamming = compute_hamming_window(N=512)
    filterbank = compute_mel_filterbank(n_filters=n_mels)

    // Frame and process
    frames = frame_signal(audio_f, frame_length=512, hop_size=496)

    spectrogram = zeros(n_mels, n_frames)

    for t from 0 to n_frames - 1:
        windowed = apply_window(frames[t], hamming)
        fft_out  = compute_fft(windowed)
        power    = power_spectrum(fft_out)
        mel_S    = apply_filterbank(power, filterbank)
        log_S    = log_compress(mel_S)

        spectrogram[:, t] = log_S   // store column t

    return spectrogram   // shape: (32, 32)

9.1 Quantising the spectrogram to INT8

Before feeding to the network, normalise and quantise:

  1. Find min and max across all 1024 values
  2. Map linearly to \([-128, 127]\):
\[q[m][t] = \text{round}!\left(\frac{\text{spectrogram}[m][t] - \mu}{\sigma} \times \text{scale} + \text{zero_point}\right)\]

Where \(\mu\) and \(\sigma\) are mean and standard deviation computed on your training set (fixed constants burned into firmware — not computed at runtime).


10. Fixed-Point Arithmetic

10.1 Why no floats?

The ESP32-S3 has a hardware floating-point unit (FPU) for single-precision float32. However:

  • Float32 operations are slower than INT32
  • Each float is 4 bytes; INT8 is 1 byte (4× memory savings)
  • INT8 inference at 240 MHz gives acceptable latency (~30–50ms)

Fixed-point representation stores a decimal number as an integer scaled by a power of 2.

Q15 format: 1 sign bit + 15 fractional bits. Represents range \([-1.0, 1.0)\).

\[\text{float value} = \frac{\text{int16 value}}{2^{15}} = \frac{\text{int16 value}}{32768}\]
Float Q15 integer
1.0 32767
0.5 16384
0.25 8192
-0.3 -9830
0.97 31785

Multiplication of two Q15 numbers \(a\) and \(b\):

\[a \times b \text{ in real:} \quad \frac{a_int}{32768} \times \frac{b_int}{32768} = \frac{a_int \times b_int}{32768^2}\]

To get back to Q15: right-shift by 15:

\[\text{result_int} = (a_int \times b_int) >> 15\]

Must use INT32 intermediate to avoid overflow: \(32767 \times 32767 = 1,073,676,289\) which overflows INT16 but fits in INT32.

PSEUDOCODE: Fixed-point Q15 multiply

function q15_multiply(a, b):
    // a and b are int16 in Q15 format
    product = (int32)a × (int32)b    // intermediate in 32-bit
    result = product >> 15            // shift back to Q15
    return (int16)result              // truncate to 16-bit

10.2 The fixed-point log problem (L1.1)

\(\log()\) has no direct fixed-point formula. Common approximation:

For \(x \in [1, 2)\): the mantissa of a float \(x\) in this range, \(\log_2(x) \approx x - 1\) (linear approximation).

Better quadratic approximation:

\[\log_2(m) \approx -0.3445 \cdot m^2 + 2.0247 \cdot m - 1.6802, \quad m \in [1, 2)\]

For arbitrary \(x > 0\) (extract exponent \(e\) from float representation):

\[\log_2(x) = e + \log_2(m), \quad x = m \cdot 2^e, \quad m \in [1, 2)\]

The bias problem: the approximation error is not uniform across \(m \in [1,2)\). At \(m \approx 1\) (low-energy bins) the quadratic approximation is less accurate than at \(m \approx 1.5\) (high-energy bins). Low-energy mel bins (high frequencies) have values near 1.0 after normalisation — exactly where the error is worst. This creates a systematic per-bin offset between your C implementation and the Python reference.

Measuring and correcting the bias:

\[\text{bias}[m] = \frac{1}{T} \sum_{t=0}^{T-1} \left(\text{log_python}[m][t] - \text{log_C}[m][t]\right)\]

Compute this over a representative dataset. Store bias[32] as a constant array in firmware. Add it back after the fixed-point log:

\[L_{\text{corrected}}[m] = L_{\text{C}}[m] + \text{bias}[m]\]

This is your original finding — not documented in any tutorial.


11. Neural Networks

11.1 What a neuron computes

A single artificial neuron receives \(n\) inputs and produces one output:

\[\text{output} = f!\left(\sum_{i=0}^{n-1} w_i \cdot x_i + b\right) = f(\mathbf{w} \cdot \mathbf{x} + b)\]

Where:

  • \(\mathbf{x} = [x_0, x_1, \ldots, x_{n-1}]\) = input vector
  • \(\mathbf{w} = [w_0, w_1, \ldots, w_{n-1}]\) = weight vector (learned)
  • \(b\) = bias scalar (learned)
  • \(f\) = activation function (nonlinear, explained next section)
  • \(\mathbf{w} \cdot \mathbf{x} = \sum w_i x_i\) = dot product

The dot product \(\sum w_i x_i\) measures how similar the input is to the weight pattern. If the input matches the pattern encoded in the weights, the sum is large. If it doesn’t, the sum is near zero.

11.2 A layer of neurons

A layer has \(m\) neurons, each with its own weight vector. Represented as a matrix:

\[\mathbf{y} = f(\mathbf{W} \mathbf{x} + \mathbf{b})\]

Where:

  • \(\mathbf{x} \in \mathbb{R}^n\) = input vector
  • \(\mathbf{W} \in \mathbb{R}^{m \times n}\) = weight matrix (row \(i\) = weights of neuron \(i\))
  • \(\mathbf{b} \in \mathbb{R}^m\) = bias vector
  • \(\mathbf{y} \in \mathbb{R}^m\) = output vector
  • \(f\) = activation applied element-wise

11.3 Why depth?

Multiple layers allow the network to learn hierarchical features.

  • Layer 1 detects simple frequency bands
  • Layer 2 detects combinations of bands (e.g., the harmonic pattern of speech)
  • Layer 3 detects higher-level structure (e.g., “this is keyboard rhythm”)
  • Final layer combines everything into a class score

Without multiple layers, you have a single linear transform which can only draw straight-line decision boundaries. Speech vs keyboard cannot be separated by a straight line in frequency space. You need curves — that requires depth.


12. Activation Functions

Without nonlinear activation functions, a deep network collapses to a single linear layer. Proof:

\[\mathbf{y} = \mathbf{W}_2(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = \underbrace{(\mathbf{W}_2 \mathbf{W}*1)}*{\mathbf{W}’} \mathbf{x} + \underbrace{(\mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}*2)}*{\mathbf{b}’}\]

Two linear layers = one linear layer. The depth is wasted.

12.1 ReLU

Rectified Linear Unit:

\[\text{ReLU}(x) = \max(0, x) = \begin{cases} 0 & x < 0 \ x & x \geq 0 \end{cases}\]

Negative values are set to zero. Positive values pass through unchanged. This kink at zero is the nonlinearity that breaks the linear collapse.

Why does it work? It allows neurons to be “off” (output 0) for some inputs and “on” for others. Different neurons become active for different patterns. This creates a sparse, efficient representation.

12.2 ReLU6

\[\text{ReLU6}(x) = \min(\max(0, x), 6) = \begin{cases} 0 & x < 0 \ x & 0 \leq x \leq 6 \ 6 & x > 6 \end{cases}\]

Clips the output to \([0, 6]\). Used in MobileNet.

Why 6? Purely empirical — found to work well. The bound ensures activations never grow unboundedly during training, which matters for quantisation. If activations can be any positive number, INT8 quantisation would need a huge scale factor, losing precision. Clamping at 6 gives a known range.

For INT8 quantisation: knowing that activations \(\in [0, 6]\), the quantisation scale is:

\[S = \frac{6 - 0}{127 - 0} \approx 0.0472\]

Fixed, predictable, requires no per-activation calibration.


13. Loss Function and Training

13.1 Cross-entropy loss

You have a training dataset of audio clips, each labeled with the correct class (0–7). The network produces probability estimates \(\hat{y}[c]\) for each class \(c\). The true label is a one-hot vector \(y\) — all zeros except a 1 at the correct class.

Cross-entropy loss measures how wrong the predictions are:

\[\mathcal{L} = -\sum_{c=0}^{7} y[c] \cdot \log(\hat{y}[c])\]

Since \(y\) is one-hot (only one \(y[c] = 1\), all others are 0):

\[\mathcal{L} = -\log(\hat{y}[c^*])\]

Where \(c^*\) is the correct class. So the loss is simply the negative log of the probability assigned to the correct class.

Intuition:

  • Correct class predicted with probability 0.99: \(\mathcal{L} = -\log(0.99) = 0.01\) (small loss, good)
  • Correct class predicted with probability 0.50: \(\mathcal{L} = -\log(0.50) = 0.693\) (moderate)
  • Correct class predicted with probability 0.01: \(\mathcal{L} = -\log(0.01) = 4.61\) (large loss, bad)

The log function punishes confident wrong predictions very harshly.

13.2 Gradient descent

You want to minimise \(\mathcal{L}\) over all the weights \(\mathbf{W}\). The gradient \(\nabla_{\mathbf{W}} \mathcal{L}\) is a matrix of the same shape as \(\mathbf{W}\), where each entry is the partial derivative of the loss with respect to that weight:

\[\frac{\partial \mathcal{L}}{\partial w_{ij}} = \text{rate of change of loss if weight } w_{ij} \text{ is increased slightly}\]

The update rule (stochastic gradient descent):

\[\mathbf{W} \leftarrow \mathbf{W} - \eta \cdot \nabla_{\mathbf{W}} \mathcal{L}\]

Where \(\eta\) (learning rate) is a small positive constant, typically \(0.001\).

Why subtract the gradient? The gradient points in the direction that increases the loss most steeply. Subtracting it moves the weights in the direction that decreases the loss.

Batch gradient descent: don’t compute the gradient on one sample — compute it on a mini-batch of \(B\) samples (typically \(B = 32\) or \(64\)) and average the gradients. This reduces noise and uses GPU parallelism.

\[\mathbf{W} \leftarrow \mathbf{W} - \eta \cdot \frac{1}{B} \sum_{i=1}^{B} \nabla_{\mathbf{W}} \mathcal{L}_i\]

One epoch = one full pass over the entire training dataset. You train for many epochs (10–100).


14. Backpropagation

Computing \(\nabla_{\mathbf{W}} \mathcal{L}\) for every weight naively would require re-running the whole network for each weight. Backpropagation uses the chain rule to propagate gradients efficiently from output to input.

14.1 The chain rule

If \(z = f(y)\) and \(y = g(x)\), then:

\[\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}\]

For a vector function: if loss \(\mathcal{L}\) depends on \(\mathbf{y}\) which depends on \(\mathbf{x}\) through \(\mathbf{y} = f(\mathbf{x})\), then:

\[\frac{\partial \mathcal{L}}{\partial x_j} = \sum_i \frac{\partial \mathcal{L}}{\partial y_i} \cdot \frac{\partial y_i}{\partial x_j}\]

This is the Jacobian-vector product.

14.2 Backprop through a linear layer

Forward: \(\mathbf{y} = \mathbf{W} \mathbf{x} + \mathbf{b}\)

Given \(\dfrac{\partial \mathcal{L}}{\partial \mathbf{y}}\) (gradient flowing back from the next layer), compute:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \cdot \mathbf{x}^T\]
\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}} = \frac{\partial \mathcal{L}}{\partial \mathbf{y}}\]
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \mathbf{W}^T \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{y}} \quad \text{(passed to previous layer)}\]

14.3 Backprop through ReLU

Forward: \(y = \text{ReLU}(x) = \max(0, x)\)

\[\frac{\partial y}{\partial x} = \begin{cases} 1 & x > 0 \ 0 & x \leq 0 \end{cases}\]

Backprop: the gradient from the next layer is passed through unchanged if \(x > 0\), and killed (set to 0) if \(x \leq 0\). Neurons that were “off” during forward pass receive no gradient update — they don’t learn from this example.

14.4 PyTorch autograd

In PyTorch, you don’t implement backprop manually. Every tensor operation records itself in a computation graph. Calling .backward() on the loss traverses this graph in reverse, applying the chain rule at each operation.

PSEUDOCODE: Training loop (PyTorch style)

function train(model, dataset, epochs=50, lr=0.001):
    optimizer = AdamOptimizer(model.parameters(), lr=lr)

    for epoch from 1 to epochs:
        for (spectrogram, label) in dataset.batches(batch_size=32):

            // Forward pass
            logits = model(spectrogram)           // shape: (32, 8)
            loss   = cross_entropy(logits, label) // scalar

            // Backward pass
            optimizer.zero_gradients()
            loss.backward()                        // compute all ∂L/∂w
            optimizer.step()                       // w ← w - lr × ∂L/∂w

        print(f"Epoch {epoch}, Loss: {loss}")

15. Convolution

15.1 1D convolution intuition

Before 2D, consider 1D. You have a signal \(x\) and a filter \(h\) of length \(K\). The convolution at position \(n\):

\[(x * h)[n] = \sum_{k=0}^{K-1} h[k] \cdot x[n + k]\]

The filter slides across the signal. At each position, compute the dot product between the filter weights and the underlying signal values. If the signal has a pattern matching the filter, the output is large.

Example: filter \(h = [1, -2, 1]\) detects a “peak” (value higher than neighbors) because \(1 \times x_{\text{before}} - 2 \times x_{\text{center}} + 1 \times x_{\text{after}}\) is large when \(x_{\text{center}} \ll x_{\text{neighbors}}\) (actually a valley detector). This is an edge detector in 1D.

15.2 2D convolution

Your spectrogram is a 2D grid \(X\) of shape \(H \times W\) (32×32). A filter \(F\) is a small \(K \times K\) grid (e.g., \(3 \times 3\)). The 2D convolution at position \((i, j)\):

\[Y[i][j] = \sum_{k_i=0}^{K-1} \sum_{k_j=0}^{K-1} F[k_i][k_j] \cdot X[i + k_i][j + k_j]\]

The filter slides across both dimensions simultaneously. Output \(Y[i][j]\) is the dot product between the filter and the \(K \times K\) patch of \(X\) centered at \((i, j)\).

What filters learn:

  • A filter with large positive weights in the mid-frequency rows detects “speech formants”
  • A filter with alternating signs in the time dimension detects “rhythmic patterns”
  • A filter with large weights at a specific location detects “energy at a specific frequency”

The network learns these filters automatically from data.

15.3 Multiple input channels, multiple output channels

After the first layer, \(X\) is no longer a 2D grid but a 3D tensor of shape \(H \times W \times C\)\(C\) feature maps stacked, one per filter from the previous layer.

A filter now has shape \(K \times K \times C\) (covers all channels). The convolution for output channel \(m\):

\[Y[i][j][m] = \sum_{k_i=0}^{K-1} \sum_{k_j=0}^{K-1} \sum_{c=0}^{C-1} F_m[k_i][k_j][c] \cdot X[i + k_i][j + k_j][c] + b_m\]

With \(M\) output channels (filters), you repeat this for \(m = 0, 1, \ldots, M-1\).

Parameter count: each filter has \(K \times K \times C\) weights \(+\) 1 bias. With \(M\) filters: \(M \times (K^2 \times C + 1)\) total parameters.

For \(K=3, C=32, M=32\): \(32 \times (9 \times 32 + 1) = 32 \times 289 = 9,248\) parameters per layer.

Padding: to keep output the same size as input, add zeros around the border. With \(K=3\) and padding=1: output size = input size. Without padding: output shrinks by \(K-1=2\) in each dimension.

Stride: instead of sliding the filter one step at a time (stride=1), skip \(S\) steps (stride=\(S\)). Output size = \(\lfloor(H - K) / S\rfloor + 1\). Stride=2 halves the spatial dimensions, reducing computation.

PSEUDOCODE: 2D convolution (naive)

function conv2d(X, filters, bias, stride=1, padding=1):
    H, W, C_in = shape(X)
    K = filters.kernel_size      // e.g. 3
    M = filters.count            // number of output channels

    H_out = floor((H + 2×padding - K) / stride) + 1
    W_out = floor((W + 2×padding - K) / stride) + 1

    // Pad input with zeros
    X_padded = pad(X, padding)

    Y = zeros(H_out, W_out, M)

    for i from 0 to H_out - 1:
        for j from 0 to W_out - 1:
            for m from 0 to M - 1:
                sum = bias[m]
                for ki from 0 to K - 1:
                    for kj from 0 to K - 1:
                        for c from 0 to C_in - 1:
                            sum += filters[m][ki][kj][c] × X_padded[i×stride+ki][j×stride+kj][c]
                Y[i][j][m] = relu6(sum)

    return Y

16. Depthwise Separable Convolution

16.1 Motivation: operation count

Normal convolution on a \(32 \times 32 \times 32\) input with \(32\) output channels and \(3 \times 3\) filters, stride=1:

  • Operations per output position: \(3 \times 3 \times 32 = 288\) multiply-adds
  • Total output positions: \(32 \times 32 = 1024\)
  • Total operations: \(1024 \times 288 \times 32 = 9{,}437{,}184\) ≈ 9.4M

That’s just one layer. memd-net has several. Too expensive.

16.2 Depthwise convolution

Apply one \(K \times K\) filter per input channel, independently. No mixing between channels.

For channel \(c\), filter \(F_c\) of shape \(K \times K\):

\[\text{DW}[i][j][c] = \sum_{k_i=0}^{K-1} \sum_{k_j=0}^{K-1} F_c[k_i][k_j] \cdot X[i + k_i][j + k_j][c]\]

Output shape same as input: \(H \times W \times C\). Operations per output position: \(K^2 = 9\). Total: \(1024 \times 9 \times 32 = 294{,}912 \approx 295K\).

16.3 Pointwise convolution

A \(1 \times 1\) convolution — the filter has no spatial extent, just mixes channels.

For output channel \(m\):

\[\text{PW}[i][j][m] = \sum_{c=0}^{C-1} F_m[c] \cdot \text{DW}[i][j][c] + b_m\]

This is exactly a linear layer applied independently at every spatial position. It mixes the channel information from the depthwise step.

Operations per position: \(C = 32\). With \(M=32\) output channels: \(32 \times 32 = 1024\) per position. Total: \(1024 \times 1024 = 1{,}048{,}576 \approx 1M\).

Total depthwise separable: \(295K + 1M \approx 1.3M\) vs \(9.4M\) normal. 7.2× reduction.

Parameter reduction:

Normal: \(M \times K^2 \times C = 32 \times 9 \times 32 = 9{,}216\)

Depthwise separable: \(K^2 \times C + M \times C = 9 \times 32 + 32 \times 32 = 288 + 1024 = 1{,}312\)

7× fewer parameters. Same representational power (any normal convolution can be approximated by depthwise separable, just with more channels).

PSEUDOCODE: Depthwise separable convolution

function depthwise_conv(X, dw_filters):
    // One 3×3 filter per input channel
    H, W, C = shape(X)
    DW = zeros(H, W, C)
    for c from 0 to C-1:
        for i from 0 to H-1:
            for j from 0 to W-1:
                sum = 0
                for ki from 0 to 2:
                    for kj from 0 to 2:
                        sum += dw_filters[c][ki][kj] × X_padded[i+ki][j+kj][c]
                DW[i][j][c] = relu6(sum)
    return DW

function pointwise_conv(DW, pw_filters, bias):
    // 1×1 convolution: mix channels
    H, W, C = shape(DW)
    M = length(pw_filters)
    PW = zeros(H, W, M)
    for i from 0 to H-1:
        for j from 0 to W-1:
            for m from 0 to M-1:
                sum = bias[m]
                for c from 0 to C-1:
                    sum += pw_filters[m][c] × DW[i][j][c]
                PW[i][j][m] = relu6(sum)
    return PW

17. Pooling

17.1 Global Average Pooling

After several convolutional layers, you have a feature tensor of shape \(H \times W \times C\). You need to collapse this to a fixed-size vector for the final classification layer.

Global Average Pooling (GAP): for each channel, average all spatial values:

\[\text{GAP}[c] = \frac{1}{H \times W} \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} X[i][j][c]\]

Output shape: \(C\) (one number per channel). No learned parameters — just an average.

Why average instead of flatten? Flatten would give \(H \times W \times C\) values — still spatial. GAP removes spatial information and keeps only “how much of each feature is present overall.” This also makes the network position-invariant — it doesn’t matter where in the spectrogram the keyboard pattern appears; GAP produces the same output.

PSEUDOCODE: Global Average Pooling

function global_avg_pool(X):
    H, W, C = shape(X)
    out = zeros(C)
    for c from 0 to C-1:
        total = 0
        for i from 0 to H-1:
            for j from 0 to W-1:
                total += X[i][j][c]
        out[c] = total / (H × W)
    return out   // shape: C

18. Softmax

The final fully connected layer produces 8 raw scores (logits) \(z[0], \ldots, z[7]\), one per class. These are unconstrained real numbers. Convert to probabilities:

\[\hat{y}[c] = \frac{e^{z[c]}}{\sum_{k=0}^{7} e^{z[k]}}\]

Properties:

  • All \(\hat{y}[c] > 0\) (exponential is always positive)
  • \(\sum_{c=0}^{7} \hat{y}[c] = 1\) (sum equals 1, valid probability distribution)
  • The largest logit gets the largest probability
  • The differences are amplified: logit gap of 2 → probability ratio of \(e^2 \approx 7.4\times\)

Numerical stability: if logits are large (e.g., \(z = [1000, 1001, 999]\)), \(e^{1000}\) overflows. Fix: subtract the maximum before exponentiating (mathematically equivalent):

\[\hat{y}[c] = \frac{e^{z[c] - \max(\mathbf{z})}}{\sum_k e^{z[k] - \max(\mathbf{z})}}\]
PSEUDOCODE: Softmax

function softmax(z):
    max_z = max(z)
    exp_z = [exp(z[c] - max_z) for c in range(8)]
    sum_exp = sum(exp_z)
    return [e / sum_exp for e in exp_z]

19. The Full memd-net Architecture

Input: \(32 \times 32 \times 1\) (height × width × channels, 1 channel = grayscale spectrogram)

Layer                   Operation                    Output Shape      SRAM (approx)
───────────────────────────────────────────────────────────────────────────────────
Input                   -                            32×32×1           1 KB
Conv2d (stride=2)       3×3, 8 filters, relu6        16×16×8           2 KB
DW Sep Conv (stride=1)  3×3 DW + 1×1 PW, 16 ch       16×16×16          4 KB
DW Sep Conv (stride=2)  3×3 DW + 1×1 PW, 32 ch       8×8×32            2 KB
DW Sep Conv (stride=1)  3×3 DW + 1×1 PW, 32 ch       8×8×32            2 KB
DW Sep Conv (stride=2)  3×3 DW + 1×1 PW, 64 ch       4×4×64            4 KB
DW Sep Conv (stride=1)  3×3 DW + 1×1 PW, 64 ch       4×4×64            4 KB
Global Avg Pool         average over 4×4             64                 —
Linear (embedding)      64 → 32, no activation       32                 —
Linear (classifier)     32 → 8, no activation        8                  —
Softmax                 8 → 8 probabilities          8                  —
───────────────────────────────────────────────────────────────────────────────────
Total weights (INT8):   ~36 KB
Peak activation SRAM:   ~4 KB (the 4×4×64 layer)
Total SRAM needed:      ~40 KB (well within 512KB budget)
Inference time:         ~30–50ms at 240MHz

The 32-dim output of the embedding linear layer (before the classifier) is your memory embedding. You store this in the packet.

PSEUDOCODE: memd-net forward pass

function memd_net_forward(spectrogram_int8):
    x = spectrogram_int8                        // 32×32×1

    x = conv2d(x, conv1_weights, stride=2)      // 16×16×8
    x = depthwise_separable(x, dws1, stride=1)  // 16×16×16
    x = depthwise_separable(x, dws2, stride=2)  // 8×8×32
    x = depthwise_separable(x, dws3, stride=1)  // 8×8×32
    x = depthwise_separable(x, dws4, stride=2)  // 4×4×64
    x = depthwise_separable(x, dws5, stride=1)  // 4×4×64

    x = global_avg_pool(x)                      // 64

    embedding = linear(x, embed_weights)        // 32  ← stored in packet
    logits    = linear(embedding, cls_weights)  // 8
    probs     = softmax(logits)                 // 8 probabilities

    event_type  = argmax(probs)                 // 0–7
    confidence  = max(probs) × 255             // 0–255 (scaled)

    return event_type, confidence, embedding

20. INT8 Quantisation

20.1 Affine quantisation scheme

Map float values to INT8 integers using scale \(S\) and zero-point \(Z\):

\[r = S \cdot (q - Z)$$ $$q = \text{clip}\!\left(\left\lfloor \frac{r}{S} \right\rceil + Z, -128, 127\right)\]

Where \(\lfloor \cdot \rceil\) means round to nearest integer, and clip keeps the result in the INT8 range.

Computing \(S\) and \(Z\) from the float range \([r_{\min}, r_{\max}]\):

\[S = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}} = \frac{r_{\max} - r_{\min}}{127 - (-128)} = \frac{r_{\max} - r_{\min}}{255}\]
\[Z = \left\lfloor -\frac{r_{\min}}{S} \right\rceil - 128 = \text{clip}\!\left(\left\lfloor -\frac{r_{\min}}{S} \right\rceil - 128, -128, 127\right)\]

Example:

Float weights range \([-0.5, 0.8]\): $\(S = \frac{0.8 - (-0.5)}{255} = \frac{1.3}{255} \approx 0.00510\)$ $\(Z = \left\lfloor \frac{0.5}{0.00510} \right\rceil - 128 = \lfloor 98.0 \rfloor - 128 = 98 - 128 = -30\)$

Float weight \(0.3\): $\(q = \left\lfloor \frac{0.3}{0.00510} \right\rceil + (-30) = 59 - 30 = 29\)$

Dequantise back: \(r = 0.00510 \times (29 - (-30)) = 0.00510 \times 59 = 0.3009\) ✓ (close to 0.3, small rounding error)

20.2 Quantised convolution math

In a convolution layer, you compute:

\[y = \sum w_i x_i\]

In the quantised domain:

\[r_w = S_w(q_w - Z_w), \quad r_x = S_x(q_x - Z_x)\]
\[y = \sum r_{w_i} r_{x_i} = \sum S_w(q_{w_i} - Z_w) \cdot S_x(q_{x_i} - Z_x)\]
\[y = S_w S_x \sum (q_{w_i} - Z_w)(q_{x_i} - Z_x)\]

The inner sum is entirely integer arithmetic. Expand it:

\[\sum (q_{w_i} - Z_w)(q_{x_i} - Z_x) = \underbrace{\sum q_{w_i} q_{x_i}}*{\text{INT8 × INT8}} - Z_w \underbrace{\sum q*{x_i}}*{\text{INT32}} - Z_x \underbrace{\sum q*{w_i}}_{\text{precomputed}} + N \cdot Z_w Z_x\]
  • \(\sum q_{w_i} q_{w_i}\): each product is INT8 × INT8 → up to 16 bits. Sum of \(N=288\) terms: up to 16 + \(\log_2(288) \approx 16 + 8 = 24\) bits. Use INT32 accumulator.
  • \(\sum q_{w_i}\): precomputed at quantisation time (constant per filter).
  • The result is in INT32.

Requantisation (output to INT8):

The output \(y\) in float is:

\[r_y = S_w S_x \cdot \text{INT32_sum}\]

You want the output in INT8 with scale \(S_y\), zero-point \(Z_y\):

\[q_y = \text{round}!\left(\frac{r_y}{S_y}\right) + Z_y = \text{round}!\left(\frac{S_w S_x}{S_y} \cdot \text{INT32_sum}\right) + Z_y\]

The factor \(M_0 = S_w S_x / S_y\) is a small positive float. You convert it to a fixed-point integer + bit-shift:

\[M_0 \approx \frac{M_{\text{int}}}{2^n}\]

Where \(M_{\text{int}}\) is a 32-bit integer and \(n\) is a shift. Then:

\[q_y = \text{clip}\!\left(\left(\text{INT32_sum} \times M_{\text{int}}\right) \gg n + Z_y, -128, 127\right)\]

No floating-point at inference time. Only integer multiplications and bit-shifts.

20.3 Per-tensor vs per-channel

Per-tensor: one \((S_w, Z_w)\) pair for all weights in the entire layer.

Problem: if the layer has 32 output filters and their weight ranges differ greatly, a single scale fits the largest-range filter well but badly quantises small-range filters (they get rounded to very few distinct integer values, losing precision).

Per-channel: one \((S_w^{(m)}, Z_w^{(m)})\) pair per output filter \(m\).

Each filter gets its own scale matched to its own weight range. The requantisation scale \(M_0^{(m)} = S_w^{(m)} S_x / S_y\) is now different per output channel. Small overhead (32 multiplications vs 1), large accuracy gain.

Why this matters for audio: high-frequency mel filters detect energy at 4000–8000 Hz. These frequencies carry less energy in typical audio, so their detecting filters learn smaller-magnitude weights. Per-tensor quantisation, dominated by large-magnitude filters, assigns too coarse a scale to these small-weight filters. Per-channel fixes this.

PSEUDOCODE: Per-channel INT8 quantisation

function quantise_weights_per_channel(W_float, n_channels):
    // W_float shape: (n_channels, K, K, C_in)
    scales = zeros(n_channels)
    zpoints = zeros(n_channels, type=int8)
    W_int8 = zeros_like(W_float, type=int8)

    for m from 0 to n_channels - 1:
        w = W_float[m]                          // one filter: (K, K, C_in)
        r_min = min(w)
        r_max = max(w)
        S = (r_max - r_min) / 255.0
        Z = round(-r_min / S) - 128

        scales[m]  = S
        zpoints[m] = clip(Z, -128, 127)

        W_int8[m] = clip(round(w / S) + Z, -128, 127)

    return W_int8, scales, zpoints

21. INT8 Inference Runtime (memd-rt)

memd-rt is your custom C inference engine. No TFLite, no ESP-DL. You implement every operator.

21.1 Memory layout

Tensors are stored as flat arrays. A 4D tensor of shape \((N, H, W, C)\) in NHWC format (batch, height, width, channels):

\[\text{index}(n, h, w, c) = n \times (H \times W \times C) + h \times (W \times C) + w \times C + c\]

NHWC vs NCHW:

  • NHWC: channel is the innermost dimension → adjacent channels at same spatial location are contiguous in memory → efficient for depthwise conv (accessing all channels at one position)
  • NCHW: height/width innermost → efficient for some GPU operations

ESP32-S3 has no SIMD. NHWC is fine. Critically: your C layout must match how PyTorch exported the weights, or the convolutions will silently compute wrong results.

21.2 INT8 convolution loop with INT32 accumulation

PSEUDOCODE: INT8 Conv2D forward pass (memd-rt)

function conv2d_int8(input_int8, weights_int8, bias_int32,
                     Sx, Zx,          // input scale, zero-point
                     Sw[], Zw[],       // per-channel weight scale, zero-point
                     Sy, Zy,          // output scale, zero-point
                     stride, padding):

    H, W, C_in  = shape(input_int8)
    M, K, K, _  = shape(weights_int8)    // M output channels, K×K kernel

    H_out = (H + 2×padding - K) / stride + 1
    W_out = (W + 2×padding - K) / stride + 1

    output_int8 = zeros(H_out, W_out, M, type=int8)

    for m from 0 to M-1:
        M0 = Sx × Sw[m] / Sy          // requantisation scale (float, precomputed)
        // Convert M0 to fixed-point: M0 ≈ M0_int × 2^(-shift)
        shift, M0_int = float_to_fixedpoint(M0)

        for i from 0 to H_out-1:
            for j from 0 to W_out-1:

                acc = (int32)bias_int32[m]   // INT32 accumulator

                for ki from 0 to K-1:
                    for kj from 0 to K-1:
                        for c from 0 to C_in-1:

                            q_x = input_padded[i×stride+ki][j×stride+kj][c]
                            q_w = weights_int8[m][ki][kj][c]

                            // Subtract zero-points before multiply
                            acc += (int32)(q_w - Zw[m]) × (int32)(q_x - Zx)

                // Requantise INT32 accumulator to INT8
                q_y = (acc × M0_int) >> shift
                q_y = q_y + Zy
                output_int8[i][j][m] = clip(q_y, -128, 127)

    return output_int8

21.3 The weight file format (.memd)

You define a custom binary format for storing the model. Each section:

PSEUDOCODE: .memd file structure

File header:
    magic      : 4 bytes   "MEMD"
    version    : 2 bytes   0x0001
    n_layers   : 2 bytes   number of layers

For each layer:
    layer_type : 1 byte    (0=conv2d, 1=dw_conv, 2=pw_conv, 3=linear)
    n_channels : 4 bytes
    kernel_size: 1 byte
    stride     : 1 byte
    padding    : 1 byte

    // Per-channel quantisation parameters
    scales[]   : n_channels × 4 bytes   (float32)
    zpoints[]  : n_channels × 1 byte    (int8)

    // Weights
    weights[]  : n_channels × K × K × C_in bytes   (int8)
    bias[]     : n_channels × 4 bytes               (int32)

On the ESP32, you mmap or memcpy this file from flash into PSRAM at startup. The C runtime reads directly from this binary.


22. The Memory Packet

Every time memd-net fires, you produce a memory_packet_t struct. Fixed size: exactly 128 bytes.

typedef struct {
    uint32_t timestamp;       // 4 bytes  — unix epoch from NTP
    uint8_t  sensor_type;     // 1 byte   — 0x01 = audio
    uint8_t  event_type;      // 1 byte   — 0=speech,1=music,...,7=silence
    uint8_t  confidence;      // 1 byte   — 0–255, from softmax max
    uint8_t  reserved;        // 1 byte   — alignment padding
    int8_t   embedding[32];   // 32 bytes — INT8 quantised 32-dim embedding
    uint8_t  audio_snippet[88]; // 88 bytes — compressed audio thumbnail
} memory_packet_t;             // total: 4+1+1+1+1+32+88 = 128 bytes

Why exactly 128 bytes?

  • No fragmentation in flash (LittleFS manages fixed-size blocks trivially)
  • \(O(1)\) seek: packet \(n\) is at byte offset \(n \times 128\)
  • Trivially serialisable: you can memcpy the whole struct into a WiFi packet
  • Cache-aligned: 128 is a multiple of typical cache line sizes

Audio snippet (88 bytes): a very compressed thumbnail of the raw audio. Computed by heavily downsampling the 2-second capture (32000 samples → 88 samples). Not for high-fidelity replay — just for a rough verification that the packet captured what you think it did.

PSEUDOCODE: Packing a memory packet

function pack_memory_packet(event_type, confidence_float, embedding_int8,
                             audio_frame, timestamp):
    pkt = allocate(memory_packet_t)

    pkt.timestamp    = timestamp
    pkt.sensor_type  = 0x01
    pkt.event_type   = event_type
    pkt.confidence   = round(confidence_float × 255)
    pkt.reserved     = 0

    copy(embedding_int8, pkt.embedding, 32)

    // Downsample audio to 88 samples
    pkt.audio_snippet = downsample(audio_frame, target_length=88)

    return pkt

Writing to flash (LittleFS):

PSEUDOCODE: Flash storage

function append_packet(pkt):
    f = littlefs_open("/memd/log.bin", mode=APPEND)
    littlefs_write(f, bytes(pkt), 128)
    littlefs_close(f)

function read_packet(index):
    f = littlefs_open("/memd/log.bin", mode=READ)
    littlefs_seek(f, offset = index × 128)
    pkt = littlefs_read(f, 128)
    littlefs_close(f)
    return pkt

23. Cosine Similarity

The embedding is a point in 32-dimensional space. Similar sounds → nearby points. You retrieve similar memories by finding packets whose embeddings are close to a query embedding.

Dot product of two vectors \(\mathbf{a}, \mathbf{b} \in \mathbb{R}^{32}\):

\[\mathbf{a} \cdot \mathbf{b} = \sum_{i=0}^{31} a_i b_i\]

L2 norm (magnitude/length) of a vector:

\[|\mathbf{a}| = \sqrt{\sum_{i=0}^{31} a_i^2}\]

Cosine similarity:

\[\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}| \cdot |\mathbf{b}|}\]

Range: \([-1, 1]\).

  • \(\cos = 1\): vectors point in exact same direction → identical sound type
  • \(\cos = 0\): perpendicular → unrelated sounds
  • \(\cos = -1\): opposite directions

Why cosine and not Euclidean distance? Euclidean distance is sensitive to magnitude (vector length). A louder version of the same sound might have a larger-magnitude embedding, making it look “far” from the quieter version despite being the same event. Cosine similarity normalises out magnitude and only cares about direction.

Retrieval:

PSEUDOCODE: Top-k retrieval

function retrieve_similar(query_embedding, k=5):
    all_packets = load_all_packets()
    similarities = []

    for pkt in all_packets:
        stored_emb = pkt.embedding      // int8[32]

        // Dequantise to float for comparison
        stored_float = dequantise(stored_emb, scale=S_emb, zp=Z_emb)
        query_float  = dequantise(query_embedding, scale=S_emb, zp=Z_emb)

        sim = cosine_similarity(query_float, stored_float)
        similarities.append((sim, pkt))

    // Sort by similarity descending
    similarities.sort(key=first_element, descending=True)
    return similarities[:k]   // top k

For thousands of packets, this brute-force loop is fast enough (milliseconds in Python).


24. Knowledge Graph and Session Merging

24.1 SQLite schema

-- Individual events
CREATE TABLE events (
    id          INTEGER PRIMARY KEY,
    timestamp   INTEGER NOT NULL,
    event_type  INTEGER NOT NULL,
    confidence  INTEGER NOT NULL,
    embedding   BLOB NOT NULL        -- 32 int8 values as raw bytes
);

-- Merged sessions (consecutive same-type events)
CREATE TABLE sessions (
    id          INTEGER PRIMARY KEY,
    event_type  INTEGER NOT NULL,
    start_time  INTEGER NOT NULL,
    end_time    INTEGER NOT NULL,
    packet_count INTEGER NOT NULL
);

-- Co-occurrence edges between sessions
CREATE TABLE edges (
    session_a   INTEGER REFERENCES sessions(id),
    session_b   INTEGER REFERENCES sessions(id),
    co_count    INTEGER DEFAULT 0,
    last_seen   INTEGER,
    PRIMARY KEY (session_a, session_b)
);

24.2 Session merging heuristic

Raw packets arrive at roughly one every few seconds (whenever VAD triggers). Consecutive keyboard packets should be merged into one “keyboard session” rather than stored as hundreds of separate events.

Merging rule: if two packets of the same event type are within \(\Delta t_{\max} = 30\) seconds of each other, they belong to the same session.

This is a design choice with no objectively correct answer. Too small \(\Delta t_{\max}\) = over-splits one session. Too large = merges different sessions. The right value is application-specific.

PSEUDOCODE: Session builder

function build_sessions(packets, dt_max=30):
    sessions = []
    current_session = None

    for pkt in packets.sorted_by_timestamp():
        if current_session is None:
            current_session = new_session(pkt)

        elif (pkt.event_type == current_session.event_type and
              pkt.timestamp - current_session.end_time <= dt_max):
            // Extend current session
            current_session.end_time = pkt.timestamp
            current_session.packet_count += 1

        else:
            // Close current session, start new one
            sessions.append(current_session)
            current_session = new_session(pkt)

    if current_session is not None:
        sessions.append(current_session)

    return sessions

24.3 Edge weights — temporal co-occurrence

Two session types that tend to occur together (e.g., keyboard and music) should have a strong edge.

\[w(s_a, s_b) = w(s_a, s_b) + \frac{1}{1 + |t_a - t_b|}\]

Where \(t_a\) and \(t_b\) are the start times of the two sessions. Sessions close in time get a large increment (near 1). Sessions far apart get a tiny increment (near 0).

Over many days this accumulates into a meaningful co-occurrence graph. “What do I do when I’m coding?” → find strong edges to the “keyboard” session type.


25. Full End-to-End Pipeline

┌─────────────────────────────────────────────────────────────┐
│                      ESP32-S3                               │
│                                                             │
│  1. INMP441 mic → I2S buffer                                │
│     x[0..15999]  (16-bit PCM, 16kHz, 1 second)              │
│                                                             │
│  2. Pre-emphasis                                            │
│     y[n] = x[n] - 0.97 × x[n-1]                            │
│                                                             │
│  3. Energy VAD (RMS threshold)                              │
│     rms = sqrt(mean(y²))                                    │
│     if rms < threshold: discard, goto 1                     │
│                                                             │
│  4. Frame (512 samples, hop 496) + Hamming window           │
│     32 frames × 512 samples each                            │
│                                                             │
│  5. FFT each frame (esp-dsp)                                │
│     X_t[k] for k = 0..256   (complex, 32-bit float)         │
│                                                             │
│  6. Power spectrum                                          │
│     P_t[k] = |X_t[k]|² = Re² + Im²                         │
│                                                             │
│  7. Mel filterbank (32 triangular filters)                  │
│     S_t[m] = Σ_k H_m[k] × P_t[k]                           │
│                                                             │
│  8. Log compression + fixed-point bias correction           │
│     L_t[m] = log(S_t[m] + ε) + bias[m]                     │
│                                                             │
│  9. Stack into 32×32 INT8 spectrogram                       │
│     spec[m][t] = quantise(L_t[m])                           │
│                                                             │
│  10. memd-net forward pass (memd-rt, C)                     │
│      conv2d → dw_sep × 4 → gap → linear                     │
│      → event_type (0–7)                                     │
│      → confidence (0–255)                                   │
│      → embedding[32] (int8)                                 │
│                                                             │
│  11. OLED update                                            │
│      "keyboard 87%"                                         │
│                                                             │
│  12. Pack memory_packet_t (128 bytes)                       │
│      → append to LittleFS                                   │
│                                                             │
│  13. WiFi sync (when idle, I2S off)                         │
│      HTTP POST packets to Python server                     │
└────────────────────────────┬────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│                  Python Backend (laptop)                    │
│                                                             │
│  14. Receive packets                                        │
│      INSERT INTO events                                     │
│                                                             │
│  15. Session builder                                        │
│      merge consecutive same-type events within 30s          │
│      INSERT INTO sessions                                   │
│                                                             │
│  16. Edge weights                                           │
│      update edges table: w += 1/(1 + |t_a - t_b|)          │
│                                                             │
│  17. Retrieval                                              │
│      cosine_sim(query_emb, stored_emb) for all stored       │
│      return top-k                                           │
│                                                             │
│  18. (Optional) Ollama natural language query               │
│      "what did I spend most of yesterday doing?"            │
│      → query sessions by time range + most common type      │
└────────────────────────────────────────────────────────────┘

Summary of every formula

Step Formula
Pre-emphasis \(y[n] = x[n] - 0.97 \cdot x[n-1]\)
Hamming window \(w[n] = 0.54 - 0.46\cos!\left(\frac{2\pi n}{N-1}\right)\)
DFT \(X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j2\pi kn/N}\)
Power spectrum \(P[k] = \text{Re}(X[k])^2 + \text{Im}(X[k])^2\)
Hz to Mel \(m = 2595\log_{10}(1 + f/700)\)
Mel to Hz \(f = 700(10^{m/2595} - 1)\)
Mel filterbank \(S[m] = \sum_k H_m[k] \, P[k]\)
Log compression \(L[m] = \log(S[m] + \varepsilon)\)
Neuron \(y = f!\left(\sum w_i x_i + b\right)\)
ReLU6 \(y = \min(\max(0, x), 6)\)
Softmax \(\hat{y}[c] = e^{z[c]} / \sum_k e^{z[k]}\)
Cross-entropy loss \(\mathcal{L} = -\log(\hat{y}[c^*])\)
Gradient descent \(w \leftarrow w - \eta \, \partial\mathcal{L}/\partial w\)
2D convolution \(Y[i][j][m] = \sum_{k_i,k_j,c} F_m[k_i][k_j][c]\cdot X[i{+}k_i][j{+}k_j][c]\)
Global avg pool \(\text{GAP}[c] = \frac{1}{HW}\sum_{i,j} X[i][j][c]\)
INT8 quantise \(q = \text{clip}(\lfloor r/S \rceil + Z, -128, 127)\)
INT8 dequantise \(r = S(q - Z)\)
Cosine similarity $\cos(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{
Edge weight \(w(s_a, s_b) \mathrel{+}= \frac{1}{1 + \lvert t_a - t_b \rvert}\)

This document covers every step of the memd pipeline — from raw I2S samples to knowledge graph retrieval — with full mathematical derivations and pseudocode implementations.