memd — Math¶
From raw audio to knowledge graph, every step, every formula¶
Table of Contents¶
- Sound as Numbers
- Pre-emphasis Filter
- Framing and Windowing
- Fourier Transform — Decomposing Sound into Frequencies
- Power Spectrum
- Mel Scale — Matching Human Hearing
- Mel Filterbank
- Log Compression
- The Log-Mel Spectrogram — Full Pipeline
- Fixed-Point Arithmetic — Doing Math Without Decimals
- Neural Networks — Learning from Data
- Activation Functions
- Loss Function and Training
- Backpropagation — How Gradients Flow
- Convolution — Pattern Detection on Grids
- Depthwise Separable Convolution — Cheaper Convolution
- Pooling — Shrinking Feature Maps
- Softmax — Converting Scores to Probabilities
- The Full memd-net Architecture
- INT8 Quantisation — Shrinking the Model
- INT8 Inference Runtime — memd-rt
- The Memory Packet
- Cosine Similarity — Finding Similar Memories
- Knowledge Graph and Session Merging
- Full End-to-End Pipeline
1. Sound as Numbers¶
Sound is variation in air pressure over time. A microphone converts that pressure variation into electrical voltage. An Analog-to-Digital Converter (ADC) inside the ESP32-S3 samples that voltage at regular intervals and records each measurement as an integer.
Sample rate: 16,000 samples per second (16 kHz). This means one measurement every:
Each sample is a 16-bit signed integer, range \(-32768\) to \(32767\).
One second of audio:
Loud sound → large magnitude numbers. Silence → values near zero.
What you get from the INMP441 microphone: The INMP441 communicates over I2S (Inter-IC Sound) — a serial bus protocol that streams audio samples continuously. The ESP32-S3 reads these samples into a buffer in RAM.
PSEUDOCODE: Reading audio from I2S
function read_audio_frame(duration_ms):
num_samples = (duration_ms / 1000) × SAMPLE_RATE // e.g. 2000ms → 32000
buffer = allocate_array(num_samples, type=int16)
i2s_read(port=I2S_PORT_0, destination=buffer, length=num_samples)
return buffer
2. Pre-emphasis Filter¶
Speech and environmental audio naturally have more energy at low frequencies than high frequencies. This imbalance makes the neural network harder to train — it sees strong low-frequency signal and weak high-frequency signal, and learns to mostly ignore high frequencies.
Pre-emphasis is a high-pass filter that boosts high frequencies before any other processing. It amplifies rapid changes between adjacent samples (rapid change = high frequency content).
Where:
- \(x[n]\) = current sample
- \(x[n-1]\) = previous sample
- \(\alpha = 0.97\) = pre-emphasis coefficient (standard value)
- \(y[n]\) = filtered output
Why \(\alpha = 0.97\)? It’s a well-established empirical choice for speech processing. Higher \(\alpha\) = stronger boost to high frequencies. Lower = gentler.
What this does mathematically: a flat signal where \(x[n] \approx x[n-1]\) produces \(y[n] \approx 0\) (suppressed). A rapidly oscillating signal where \(x[n]\) and \(x[n-1]\) differ a lot produces large \(y[n]\) (amplified). This is the definition of high-pass filtering.
Example: $\(x = [1000, 1020, 980, 1050, 300]\)$ $\(y[0] = 1000 \quad \text{(no previous sample, use as-is)}\)$ $\(y[1] = 1020 - 0.97 \times 1000 = 1020 - 970 = 50\)$ $\(y[2] = 980 - 0.97 \times 1020 = 980 - 989.4 = -9.4\)$ $\(y[3] = 1050 - 0.97 \times 980 = 1050 - 950.6 = 99.4\)$ $\(y[4] = 300 - 0.97 \times 1050 = 300 - 1018.5 = -718.5\)$
Note how the large jump from 1050 to 300 produces a huge output (\(-718.5\)) — the filter is sensitive to change.
PSEUDOCODE: Pre-emphasis filter
function pre_emphasis(x, alpha=0.97):
y = allocate_array(length(x), type=float)
y[0] = x[0]
for n from 1 to length(x) - 1:
y[n] = x[n] - alpha × x[n-1]
return y
3. Framing and Windowing¶
3.1 Why frames?¶
A Fourier Transform (next section) assumes the signal is stationary — its frequency content doesn’t change during the analysis window. Real audio is not stationary. The word “hello” goes through completely different sounds in 500ms.
Solution: chop the signal into short overlapping chunks called frames. Each frame is short enough to be approximately stationary (25–32ms).
Frame parameters for memd:
- Frame length: \(N = 512\) samples \(= 32\text{ms}\) at 16 kHz
- Hop size (step between frames): \(H = 512\) samples for a non-overlapping 32×32 grid
Wait — for 1 second of audio with 32 frames and hop size H:
Solving: \(H = \lfloor(16000 - 512) / 31\rfloor = \lfloor 496.4 \rfloor = 496\) samples \(\approx 31\text{ms}\) hop.
PSEUDOCODE: Framing
function frame_signal(x, frame_length=512, hop_size=496):
frames = []
start = 0
while start + frame_length <= length(x):
frame = x[start : start + frame_length] // slice of 512 samples
frames.append(frame)
start += hop_size
return frames // list of 32 arrays, each 512 samples
3.2 The Hamming window¶
When you cut out a frame, the edges are abrupt discontinuities — the signal goes from some non-zero value to exactly zero at the boundary (because you’re cutting). In the frequency domain, an abrupt cut looks like a rectangular pulse, which creates high-frequency ringing artifacts called spectral leakage — fake frequencies that don’t exist in the original signal.
Fix: multiply the frame by a smooth function that tapers to zero at both edges. The Hamming window:
Where \(N\) is the frame length (512). The cosine function oscillates between -1 and +1:
- At \(n=0\): \(\cos(0) = 1 \Rightarrow w[0] = 0.54 - 0.46 = 0.08 \approx 0\)
- At \(n=N/2\): \(\cos(\pi) = -1 \Rightarrow w[N/2] = 0.54 + 0.46 = 1.0\)
- At \(n=N-1\): \(\cos(2\pi) = 1 \Rightarrow w[N-1] = 0.08 \approx 0\)
A smooth bell peaking at 1.0 in the middle, fading to near-zero at the edges.
Apply the window:
PSEUDOCODE: Hamming window
function compute_hamming_window(N=512):
w = allocate_array(N, type=float)
for n from 0 to N-1:
w[n] = 0.54 - 0.46 × cos(2π × n / (N - 1))
return w
function apply_window(frame, window):
result = allocate_array(length(frame), type=float)
for n from 0 to length(frame) - 1:
result[n] = frame[n] × window[n]
return result
4. Fourier Transform¶
4.1 The core idea¶
Any signal — no matter how complex — can be expressed as a sum of pure sine waves at different frequencies and amplitudes. The Fourier Transform finds those frequencies and their amplitudes.
A pure sine wave at frequency \(f\), amplitude \(A\), and phase \(\phi\):
The Discrete Fourier Transform (DFT) is the version for discrete samples. Given a frame of \(N\) samples \(x[0], x[1], \ldots, x[N-1]\), it computes \(N\) complex numbers \(X[0], X[1], \ldots, X[N-1]\):
4.2 Breaking down the formula¶
Euler’s formula (fundamental identity of complex exponentials):
Therefore:
Substituting back:
Each \(X[k]\) is a complex number \(a + jb\) with:
- Real part: \(a = \sum x[n] \cos(2\pi k n / N)\)
- Imaginary part: \(b = -\sum x[n] \sin(2\pi k n / N)\)
What does \(X[k]\) mean? It measures how well your signal correlates with a pure sinusoid at frequency \(k\). If your signal contains energy at exactly frequency \(k\), this correlation is large. Otherwise it’s near zero.
Frequency of bin \(k\):
So bin \(k=0\) → 0 Hz (DC component), bin \(k=1\) → 31.25 Hz, bin \(k=32\) → 1000 Hz, bin \(k=160\) → 5000 Hz.
Symmetry: for a real-valued input signal, \(X[N-k] = X[k]^*\) (complex conjugate). The spectrum is symmetric around the midpoint. You only need the first \(N/2 + 1 = 257\) bins.
4.3 The Fast Fourier Transform (FFT)¶
The naive DFT computes each \(X[k]\) by summing \(N\) terms, for all \(N\) values of \(k\): total \(O(N^2)\) operations. For \(N=512\): \(262,144\) multiplications.
The FFT (Cooley-Tukey algorithm, 1965) exploits the symmetry of the complex exponentials to recursively split the computation:
Recursively split even and odd sub-problems until you reach size-1 DFTs (trivial). Total operations: \(O(N \log_2 N)\).
For \(N=512\): \(512 \times \log_2(512) = 512 \times 9 = 4,608\) operations. 57× faster than naive.
You use esp-dsp’s dsps_fft2r_fc32 — no need to implement this yourself.
PSEUDOCODE: FFT on a frame (using library)
function compute_fft(windowed_frame):
// Zero-pad or trim to power of 2 if needed (512 is already 2^9)
complex_output = allocate_complex_array(512)
// Copy real input into complex array (imaginary parts = 0)
for n from 0 to 511:
complex_output[n].real = windowed_frame[n]
complex_output[n].imag = 0.0
// In-place FFT (library call)
fft_library_call(complex_output, N=512)
// Only keep first 257 bins (symmetry)
return complex_output[0:257]
5. Power Spectrum¶
You don’t need phase information (the imaginary part encodes phase). You want energy — how much power at each frequency. For each complex output \(X[k] = a + jb\):
This is the power spectrum — a list of 257 non-negative real numbers, one per frequency bin.
Why squared? Power is proportional to amplitude squared (from physics — power \(\propto A^2\)). It also makes all values non-negative, which is necessary before the log step later.
PSEUDOCODE: Power spectrum
function power_spectrum(fft_output):
P = allocate_array(257, type=float)
for k from 0 to 256:
a = fft_output[k].real
b = fft_output[k].imag
P[k] = a×a + b×b
return P
6. Mel Scale¶
6.1 Why Mel?¶
The FFT gives you 257 frequency bins equally spaced in Hz. But human perception of pitch is logarithmic, not linear.
The perceptual difference between 100 Hz and 200 Hz (one octave) sounds the same as between 1000 Hz and 2000 Hz (also one octave, also \(2\times\) ratio). But in linear Hz, the first gap is 100 Hz wide and the second is 1000 Hz wide — 10× different.
Also, our ears have much finer frequency discrimination at low frequencies than high. We can easily hear the difference between 200 Hz and 220 Hz (a musical half-step). But we struggle to distinguish 8000 Hz from 8020 Hz.
If you feed raw linear-spaced frequency bins into a neural network, you’re wasting capacity. The network has to learn that bins 1–10 (low freq) matter a lot and bins 200–257 (high freq) matter less. Better: give it input that already reflects perception.
6.2 The Mel formula¶
The Mel scale maps Hertz to a perceptual frequency scale:
Inverse (Mel to Hz):
The constant 700 is the “bend point” — below 700 Hz the scale is nearly linear; above 700 Hz it becomes increasingly compressed.
Worked examples:
| Hz | Mel |
|---|---|
| 0 | 0 |
| 100 | 150 |
| 500 | 607 |
| 1000 | 999 |
| 2000 | 1332 |
| 4000 | 1785 |
| 8000 | 2840 |
Note: going from 1000 to 2000 Hz = 333 mel. Going from 4000 to 8000 Hz (4× larger gap in Hz) = only 1055 mel. The high-frequency range is compressed.
PSEUDOCODE: Mel conversion
function hz_to_mel(f_hz):
return 2595 × log10(1 + f_hz / 700)
function mel_to_hz(m):
return 700 × (10^(m / 2595) - 1)
7. Mel Filterbank¶
7.1 Creating triangular filters¶
You want to convert your 257-bin power spectrum into 32 mel-scale values. You create 32 triangular filters that cover the frequency range in mel-spaced intervals.
Step 1: Choose frequency range. For audio: \(f_{\min} = 0\) Hz, \(f_{\max} = 8000\) Hz.
Step 2: Convert to mel: $\(m_{\min} = \text{hz_to_mel}(0) = 0, \quad m_{\max} = \text{hz_to_mel}(8000) = 2840\)$
Step 3: Space \(M + 2 = 34\) points evenly in mel space (\(M=32\) filters needs 34 boundary points):
Step 4: Convert each mel point back to Hz, then to the nearest FFT bin:
Step 5: Define filter \(m\) (for \(m = 0, 1, \ldots, 31\)) as a triangle between bins \(b_m\) and \(b_{m+2}\), peaking at \(b_{m+1}\):
Filter \(m\) rises linearly from 0 to 1 between bins \(b_m\) and \(b_{m+1}\), then falls linearly from 1 to 0 between \(b_{m+1}\) and \(b_{m+2}\). Outside that range: zero.
7.2 Applying the filterbank¶
For each mel band \(m\), sum the power spectrum weighted by the filter:
Result: 32 non-negative numbers, one per mel band. Each represents the total energy in that perceptual frequency region.
PSEUDOCODE: Mel filterbank
function compute_mel_filterbank(f_min=0, f_max=8000, n_filters=32, N=512, fs=16000):
// Step 1-2: mel boundaries
m_min = hz_to_mel(f_min)
m_max = hz_to_mel(f_max)
// Step 3: evenly spaced mel points
mel_points = linspace(m_min, m_max, n_filters + 2) // 34 points
// Step 4: convert to FFT bin indices
hz_points = [mel_to_hz(m) for m in mel_points]
bin_points = [floor(f × N / fs + 0.5) for f in hz_points]
// Step 5: build filter matrix (32 × 257)
filterbank = zeros(n_filters, 257)
for m from 0 to n_filters - 1:
left = bin_points[m]
center = bin_points[m + 1]
right = bin_points[m + 2]
// Rising slope
for k from left to center - 1:
filterbank[m][k] = (k - left) / (center - left)
// Falling slope
for k from center to right:
filterbank[m][k] = (right - k) / (right - center)
return filterbank
function apply_filterbank(P, filterbank, n_filters=32):
S = zeros(n_filters)
for m from 0 to n_filters - 1:
for k from 0 to 256:
S[m] += filterbank[m][k] × P[k]
return S
8. Log Compression¶
After the filterbank, \(S[m]\) values span a huge range — perhaps \(10^6\) for loud sounds and \(1\) for quiet ones. Neural networks struggle with such large input ranges.
Apply the natural logarithm (or \(\log_{10}\)):
The \(\epsilon\) prevents \(\log(0)\) which is \(-\infty\).
Effect: $\(\log(10^6) = 6 \cdot \log(10) \approx 13.8\)$ $\(\log(10^4) = 4 \cdot \log(10) \approx 9.2\)$ $\(\log(1) = 0\)$
A signal that was \(10^6 \times\) stronger than another is now only \(\approx 13.8\) units louder on the log scale. The dynamic range is compressed from \(10^6\) to \(\sim 14\) units.
This also matches human loudness perception. The decibel scale is logarithmic: \(\text{dB} = 10 \log_{10}(P)\).
PSEUDOCODE: Log compression
function log_compress(S, epsilon=1e-6):
L = allocate_array(length(S))
for m from 0 to length(S) - 1:
L[m] = log(S[m] + epsilon)
return L
9. Log-Mel Spectrogram — Full Pipeline¶
Run the entire pipeline on 1 second of audio, collecting 32 frames:
Where \(X_t[k]\) is the Fourier Transform of frame \(t\).
Result: a \(32 \times 32\) matrix. Each row = one mel band. Each column = one time frame. Each cell = log energy in that frequency band at that time.
This matrix is your neural network’s input. It is treated as a grayscale image.
PSEUDOCODE: Full spectrogram pipeline
function compute_log_mel_spectrogram(raw_audio, n_mels=32, n_frames=32):
// Pre-processing
audio_f = pre_emphasis(raw_audio, alpha=0.97)
// Precompute Hamming window and mel filterbank once
hamming = compute_hamming_window(N=512)
filterbank = compute_mel_filterbank(n_filters=n_mels)
// Frame and process
frames = frame_signal(audio_f, frame_length=512, hop_size=496)
spectrogram = zeros(n_mels, n_frames)
for t from 0 to n_frames - 1:
windowed = apply_window(frames[t], hamming)
fft_out = compute_fft(windowed)
power = power_spectrum(fft_out)
mel_S = apply_filterbank(power, filterbank)
log_S = log_compress(mel_S)
spectrogram[:, t] = log_S // store column t
return spectrogram // shape: (32, 32)
9.1 Quantising the spectrogram to INT8¶
Before feeding to the network, normalise and quantise:
- Find min and max across all 1024 values
- Map linearly to \([-128, 127]\):
Where \(\mu\) and \(\sigma\) are mean and standard deviation computed on your training set (fixed constants burned into firmware — not computed at runtime).
10. Fixed-Point Arithmetic¶
10.1 Why no floats?¶
The ESP32-S3 has a hardware floating-point unit (FPU) for single-precision float32. However:
- Float32 operations are slower than INT32
- Each float is 4 bytes; INT8 is 1 byte (4× memory savings)
- INT8 inference at 240 MHz gives acceptable latency (~30–50ms)
Fixed-point representation stores a decimal number as an integer scaled by a power of 2.
Q15 format: 1 sign bit + 15 fractional bits. Represents range \([-1.0, 1.0)\).
| Float | Q15 integer |
|---|---|
| 1.0 | 32767 |
| 0.5 | 16384 |
| 0.25 | 8192 |
| -0.3 | -9830 |
| 0.97 | 31785 |
Multiplication of two Q15 numbers \(a\) and \(b\):
To get back to Q15: right-shift by 15:
Must use INT32 intermediate to avoid overflow: \(32767 \times 32767 = 1,073,676,289\) which overflows INT16 but fits in INT32.
PSEUDOCODE: Fixed-point Q15 multiply
function q15_multiply(a, b):
// a and b are int16 in Q15 format
product = (int32)a × (int32)b // intermediate in 32-bit
result = product >> 15 // shift back to Q15
return (int16)result // truncate to 16-bit
10.2 The fixed-point log problem (L1.1)¶
\(\log()\) has no direct fixed-point formula. Common approximation:
For \(x \in [1, 2)\): the mantissa of a float \(x\) in this range, \(\log_2(x) \approx x - 1\) (linear approximation).
Better quadratic approximation:
For arbitrary \(x > 0\) (extract exponent \(e\) from float representation):
The bias problem: the approximation error is not uniform across \(m \in [1,2)\). At \(m \approx 1\) (low-energy bins) the quadratic approximation is less accurate than at \(m \approx 1.5\) (high-energy bins). Low-energy mel bins (high frequencies) have values near 1.0 after normalisation — exactly where the error is worst. This creates a systematic per-bin offset between your C implementation and the Python reference.
Measuring and correcting the bias:
Compute this over a representative dataset. Store bias[32] as a constant array in firmware. Add it back after the fixed-point log:
This is your original finding — not documented in any tutorial.
11. Neural Networks¶
11.1 What a neuron computes¶
A single artificial neuron receives \(n\) inputs and produces one output:
Where:
- \(\mathbf{x} = [x_0, x_1, \ldots, x_{n-1}]\) = input vector
- \(\mathbf{w} = [w_0, w_1, \ldots, w_{n-1}]\) = weight vector (learned)
- \(b\) = bias scalar (learned)
- \(f\) = activation function (nonlinear, explained next section)
- \(\mathbf{w} \cdot \mathbf{x} = \sum w_i x_i\) = dot product
The dot product \(\sum w_i x_i\) measures how similar the input is to the weight pattern. If the input matches the pattern encoded in the weights, the sum is large. If it doesn’t, the sum is near zero.
11.2 A layer of neurons¶
A layer has \(m\) neurons, each with its own weight vector. Represented as a matrix:
Where:
- \(\mathbf{x} \in \mathbb{R}^n\) = input vector
- \(\mathbf{W} \in \mathbb{R}^{m \times n}\) = weight matrix (row \(i\) = weights of neuron \(i\))
- \(\mathbf{b} \in \mathbb{R}^m\) = bias vector
- \(\mathbf{y} \in \mathbb{R}^m\) = output vector
- \(f\) = activation applied element-wise
11.3 Why depth?¶
Multiple layers allow the network to learn hierarchical features.
- Layer 1 detects simple frequency bands
- Layer 2 detects combinations of bands (e.g., the harmonic pattern of speech)
- Layer 3 detects higher-level structure (e.g., “this is keyboard rhythm”)
- Final layer combines everything into a class score
Without multiple layers, you have a single linear transform which can only draw straight-line decision boundaries. Speech vs keyboard cannot be separated by a straight line in frequency space. You need curves — that requires depth.
12. Activation Functions¶
Without nonlinear activation functions, a deep network collapses to a single linear layer. Proof:
Two linear layers = one linear layer. The depth is wasted.
12.1 ReLU¶
Rectified Linear Unit:
Negative values are set to zero. Positive values pass through unchanged. This kink at zero is the nonlinearity that breaks the linear collapse.
Why does it work? It allows neurons to be “off” (output 0) for some inputs and “on” for others. Different neurons become active for different patterns. This creates a sparse, efficient representation.
12.2 ReLU6¶
Clips the output to \([0, 6]\). Used in MobileNet.
Why 6? Purely empirical — found to work well. The bound ensures activations never grow unboundedly during training, which matters for quantisation. If activations can be any positive number, INT8 quantisation would need a huge scale factor, losing precision. Clamping at 6 gives a known range.
For INT8 quantisation: knowing that activations \(\in [0, 6]\), the quantisation scale is:
Fixed, predictable, requires no per-activation calibration.
13. Loss Function and Training¶
13.1 Cross-entropy loss¶
You have a training dataset of audio clips, each labeled with the correct class (0–7). The network produces probability estimates \(\hat{y}[c]\) for each class \(c\). The true label is a one-hot vector \(y\) — all zeros except a 1 at the correct class.
Cross-entropy loss measures how wrong the predictions are:
Since \(y\) is one-hot (only one \(y[c] = 1\), all others are 0):
Where \(c^*\) is the correct class. So the loss is simply the negative log of the probability assigned to the correct class.
Intuition:
- Correct class predicted with probability 0.99: \(\mathcal{L} = -\log(0.99) = 0.01\) (small loss, good)
- Correct class predicted with probability 0.50: \(\mathcal{L} = -\log(0.50) = 0.693\) (moderate)
- Correct class predicted with probability 0.01: \(\mathcal{L} = -\log(0.01) = 4.61\) (large loss, bad)
The log function punishes confident wrong predictions very harshly.
13.2 Gradient descent¶
You want to minimise \(\mathcal{L}\) over all the weights \(\mathbf{W}\). The gradient \(\nabla_{\mathbf{W}} \mathcal{L}\) is a matrix of the same shape as \(\mathbf{W}\), where each entry is the partial derivative of the loss with respect to that weight:
The update rule (stochastic gradient descent):
Where \(\eta\) (learning rate) is a small positive constant, typically \(0.001\).
Why subtract the gradient? The gradient points in the direction that increases the loss most steeply. Subtracting it moves the weights in the direction that decreases the loss.
Batch gradient descent: don’t compute the gradient on one sample — compute it on a mini-batch of \(B\) samples (typically \(B = 32\) or \(64\)) and average the gradients. This reduces noise and uses GPU parallelism.
One epoch = one full pass over the entire training dataset. You train for many epochs (10–100).
14. Backpropagation¶
Computing \(\nabla_{\mathbf{W}} \mathcal{L}\) for every weight naively would require re-running the whole network for each weight. Backpropagation uses the chain rule to propagate gradients efficiently from output to input.
14.1 The chain rule¶
If \(z = f(y)\) and \(y = g(x)\), then:
For a vector function: if loss \(\mathcal{L}\) depends on \(\mathbf{y}\) which depends on \(\mathbf{x}\) through \(\mathbf{y} = f(\mathbf{x})\), then:
This is the Jacobian-vector product.
14.2 Backprop through a linear layer¶
Forward: \(\mathbf{y} = \mathbf{W} \mathbf{x} + \mathbf{b}\)
Given \(\dfrac{\partial \mathcal{L}}{\partial \mathbf{y}}\) (gradient flowing back from the next layer), compute:
14.3 Backprop through ReLU¶
Forward: \(y = \text{ReLU}(x) = \max(0, x)\)
Backprop: the gradient from the next layer is passed through unchanged if \(x > 0\), and killed (set to 0) if \(x \leq 0\). Neurons that were “off” during forward pass receive no gradient update — they don’t learn from this example.
14.4 PyTorch autograd¶
In PyTorch, you don’t implement backprop manually. Every tensor operation records itself in a computation graph. Calling .backward() on the loss traverses this graph in reverse, applying the chain rule at each operation.
PSEUDOCODE: Training loop (PyTorch style)
function train(model, dataset, epochs=50, lr=0.001):
optimizer = AdamOptimizer(model.parameters(), lr=lr)
for epoch from 1 to epochs:
for (spectrogram, label) in dataset.batches(batch_size=32):
// Forward pass
logits = model(spectrogram) // shape: (32, 8)
loss = cross_entropy(logits, label) // scalar
// Backward pass
optimizer.zero_gradients()
loss.backward() // compute all ∂L/∂w
optimizer.step() // w ← w - lr × ∂L/∂w
print(f"Epoch {epoch}, Loss: {loss}")
15. Convolution¶
15.1 1D convolution intuition¶
Before 2D, consider 1D. You have a signal \(x\) and a filter \(h\) of length \(K\). The convolution at position \(n\):
The filter slides across the signal. At each position, compute the dot product between the filter weights and the underlying signal values. If the signal has a pattern matching the filter, the output is large.
Example: filter \(h = [1, -2, 1]\) detects a “peak” (value higher than neighbors) because \(1 \times x_{\text{before}} - 2 \times x_{\text{center}} + 1 \times x_{\text{after}}\) is large when \(x_{\text{center}} \ll x_{\text{neighbors}}\) (actually a valley detector). This is an edge detector in 1D.
15.2 2D convolution¶
Your spectrogram is a 2D grid \(X\) of shape \(H \times W\) (32×32). A filter \(F\) is a small \(K \times K\) grid (e.g., \(3 \times 3\)). The 2D convolution at position \((i, j)\):
The filter slides across both dimensions simultaneously. Output \(Y[i][j]\) is the dot product between the filter and the \(K \times K\) patch of \(X\) centered at \((i, j)\).
What filters learn:
- A filter with large positive weights in the mid-frequency rows detects “speech formants”
- A filter with alternating signs in the time dimension detects “rhythmic patterns”
- A filter with large weights at a specific location detects “energy at a specific frequency”
The network learns these filters automatically from data.
15.3 Multiple input channels, multiple output channels¶
After the first layer, \(X\) is no longer a 2D grid but a 3D tensor of shape \(H \times W \times C\) — \(C\) feature maps stacked, one per filter from the previous layer.
A filter now has shape \(K \times K \times C\) (covers all channels). The convolution for output channel \(m\):
With \(M\) output channels (filters), you repeat this for \(m = 0, 1, \ldots, M-1\).
Parameter count: each filter has \(K \times K \times C\) weights \(+\) 1 bias. With \(M\) filters: \(M \times (K^2 \times C + 1)\) total parameters.
For \(K=3, C=32, M=32\): \(32 \times (9 \times 32 + 1) = 32 \times 289 = 9,248\) parameters per layer.
Padding: to keep output the same size as input, add zeros around the border. With \(K=3\) and padding=1: output size = input size. Without padding: output shrinks by \(K-1=2\) in each dimension.
Stride: instead of sliding the filter one step at a time (stride=1), skip \(S\) steps (stride=\(S\)). Output size = \(\lfloor(H - K) / S\rfloor + 1\). Stride=2 halves the spatial dimensions, reducing computation.
PSEUDOCODE: 2D convolution (naive)
function conv2d(X, filters, bias, stride=1, padding=1):
H, W, C_in = shape(X)
K = filters.kernel_size // e.g. 3
M = filters.count // number of output channels
H_out = floor((H + 2×padding - K) / stride) + 1
W_out = floor((W + 2×padding - K) / stride) + 1
// Pad input with zeros
X_padded = pad(X, padding)
Y = zeros(H_out, W_out, M)
for i from 0 to H_out - 1:
for j from 0 to W_out - 1:
for m from 0 to M - 1:
sum = bias[m]
for ki from 0 to K - 1:
for kj from 0 to K - 1:
for c from 0 to C_in - 1:
sum += filters[m][ki][kj][c] × X_padded[i×stride+ki][j×stride+kj][c]
Y[i][j][m] = relu6(sum)
return Y
16. Depthwise Separable Convolution¶
16.1 Motivation: operation count¶
Normal convolution on a \(32 \times 32 \times 32\) input with \(32\) output channels and \(3 \times 3\) filters, stride=1:
- Operations per output position: \(3 \times 3 \times 32 = 288\) multiply-adds
- Total output positions: \(32 \times 32 = 1024\)
- Total operations: \(1024 \times 288 \times 32 = 9{,}437{,}184\) ≈ 9.4M
That’s just one layer. memd-net has several. Too expensive.
16.2 Depthwise convolution¶
Apply one \(K \times K\) filter per input channel, independently. No mixing between channels.
For channel \(c\), filter \(F_c\) of shape \(K \times K\):
Output shape same as input: \(H \times W \times C\). Operations per output position: \(K^2 = 9\). Total: \(1024 \times 9 \times 32 = 294{,}912 \approx 295K\).
16.3 Pointwise convolution¶
A \(1 \times 1\) convolution — the filter has no spatial extent, just mixes channels.
For output channel \(m\):
This is exactly a linear layer applied independently at every spatial position. It mixes the channel information from the depthwise step.
Operations per position: \(C = 32\). With \(M=32\) output channels: \(32 \times 32 = 1024\) per position. Total: \(1024 \times 1024 = 1{,}048{,}576 \approx 1M\).
Total depthwise separable: \(295K + 1M \approx 1.3M\) vs \(9.4M\) normal. 7.2× reduction.
Parameter reduction:
Normal: \(M \times K^2 \times C = 32 \times 9 \times 32 = 9{,}216\)
Depthwise separable: \(K^2 \times C + M \times C = 9 \times 32 + 32 \times 32 = 288 + 1024 = 1{,}312\)
7× fewer parameters. Same representational power (any normal convolution can be approximated by depthwise separable, just with more channels).
PSEUDOCODE: Depthwise separable convolution
function depthwise_conv(X, dw_filters):
// One 3×3 filter per input channel
H, W, C = shape(X)
DW = zeros(H, W, C)
for c from 0 to C-1:
for i from 0 to H-1:
for j from 0 to W-1:
sum = 0
for ki from 0 to 2:
for kj from 0 to 2:
sum += dw_filters[c][ki][kj] × X_padded[i+ki][j+kj][c]
DW[i][j][c] = relu6(sum)
return DW
function pointwise_conv(DW, pw_filters, bias):
// 1×1 convolution: mix channels
H, W, C = shape(DW)
M = length(pw_filters)
PW = zeros(H, W, M)
for i from 0 to H-1:
for j from 0 to W-1:
for m from 0 to M-1:
sum = bias[m]
for c from 0 to C-1:
sum += pw_filters[m][c] × DW[i][j][c]
PW[i][j][m] = relu6(sum)
return PW
17. Pooling¶
17.1 Global Average Pooling¶
After several convolutional layers, you have a feature tensor of shape \(H \times W \times C\). You need to collapse this to a fixed-size vector for the final classification layer.
Global Average Pooling (GAP): for each channel, average all spatial values:
Output shape: \(C\) (one number per channel). No learned parameters — just an average.
Why average instead of flatten? Flatten would give \(H \times W \times C\) values — still spatial. GAP removes spatial information and keeps only “how much of each feature is present overall.” This also makes the network position-invariant — it doesn’t matter where in the spectrogram the keyboard pattern appears; GAP produces the same output.
PSEUDOCODE: Global Average Pooling
function global_avg_pool(X):
H, W, C = shape(X)
out = zeros(C)
for c from 0 to C-1:
total = 0
for i from 0 to H-1:
for j from 0 to W-1:
total += X[i][j][c]
out[c] = total / (H × W)
return out // shape: C
18. Softmax¶
The final fully connected layer produces 8 raw scores (logits) \(z[0], \ldots, z[7]\), one per class. These are unconstrained real numbers. Convert to probabilities:
Properties:
- All \(\hat{y}[c] > 0\) (exponential is always positive)
- \(\sum_{c=0}^{7} \hat{y}[c] = 1\) (sum equals 1, valid probability distribution)
- The largest logit gets the largest probability
- The differences are amplified: logit gap of 2 → probability ratio of \(e^2 \approx 7.4\times\)
Numerical stability: if logits are large (e.g., \(z = [1000, 1001, 999]\)), \(e^{1000}\) overflows. Fix: subtract the maximum before exponentiating (mathematically equivalent):
PSEUDOCODE: Softmax
function softmax(z):
max_z = max(z)
exp_z = [exp(z[c] - max_z) for c in range(8)]
sum_exp = sum(exp_z)
return [e / sum_exp for e in exp_z]
19. The Full memd-net Architecture¶
Input: \(32 \times 32 \times 1\) (height × width × channels, 1 channel = grayscale spectrogram)
Layer Operation Output Shape SRAM (approx)
───────────────────────────────────────────────────────────────────────────────────
Input - 32×32×1 1 KB
Conv2d (stride=2) 3×3, 8 filters, relu6 16×16×8 2 KB
DW Sep Conv (stride=1) 3×3 DW + 1×1 PW, 16 ch 16×16×16 4 KB
DW Sep Conv (stride=2) 3×3 DW + 1×1 PW, 32 ch 8×8×32 2 KB
DW Sep Conv (stride=1) 3×3 DW + 1×1 PW, 32 ch 8×8×32 2 KB
DW Sep Conv (stride=2) 3×3 DW + 1×1 PW, 64 ch 4×4×64 4 KB
DW Sep Conv (stride=1) 3×3 DW + 1×1 PW, 64 ch 4×4×64 4 KB
Global Avg Pool average over 4×4 64 —
Linear (embedding) 64 → 32, no activation 32 —
Linear (classifier) 32 → 8, no activation 8 —
Softmax 8 → 8 probabilities 8 —
───────────────────────────────────────────────────────────────────────────────────
Total weights (INT8): ~36 KB
Peak activation SRAM: ~4 KB (the 4×4×64 layer)
Total SRAM needed: ~40 KB (well within 512KB budget)
Inference time: ~30–50ms at 240MHz
The 32-dim output of the embedding linear layer (before the classifier) is your memory embedding. You store this in the packet.
PSEUDOCODE: memd-net forward pass
function memd_net_forward(spectrogram_int8):
x = spectrogram_int8 // 32×32×1
x = conv2d(x, conv1_weights, stride=2) // 16×16×8
x = depthwise_separable(x, dws1, stride=1) // 16×16×16
x = depthwise_separable(x, dws2, stride=2) // 8×8×32
x = depthwise_separable(x, dws3, stride=1) // 8×8×32
x = depthwise_separable(x, dws4, stride=2) // 4×4×64
x = depthwise_separable(x, dws5, stride=1) // 4×4×64
x = global_avg_pool(x) // 64
embedding = linear(x, embed_weights) // 32 ← stored in packet
logits = linear(embedding, cls_weights) // 8
probs = softmax(logits) // 8 probabilities
event_type = argmax(probs) // 0–7
confidence = max(probs) × 255 // 0–255 (scaled)
return event_type, confidence, embedding
20. INT8 Quantisation¶
20.1 Affine quantisation scheme¶
Map float values to INT8 integers using scale \(S\) and zero-point \(Z\):
Where \(\lfloor \cdot \rceil\) means round to nearest integer, and clip keeps the result in the INT8 range.
Computing \(S\) and \(Z\) from the float range \([r_{\min}, r_{\max}]\):
Example:
Float weights range \([-0.5, 0.8]\): $\(S = \frac{0.8 - (-0.5)}{255} = \frac{1.3}{255} \approx 0.00510\)$ $\(Z = \left\lfloor \frac{0.5}{0.00510} \right\rceil - 128 = \lfloor 98.0 \rfloor - 128 = 98 - 128 = -30\)$
Float weight \(0.3\): $\(q = \left\lfloor \frac{0.3}{0.00510} \right\rceil + (-30) = 59 - 30 = 29\)$
Dequantise back: \(r = 0.00510 \times (29 - (-30)) = 0.00510 \times 59 = 0.3009\) ✓ (close to 0.3, small rounding error)
20.2 Quantised convolution math¶
In a convolution layer, you compute:
In the quantised domain:
The inner sum is entirely integer arithmetic. Expand it:
- \(\sum q_{w_i} q_{w_i}\): each product is INT8 × INT8 → up to 16 bits. Sum of \(N=288\) terms: up to 16 + \(\log_2(288) \approx 16 + 8 = 24\) bits. Use INT32 accumulator.
- \(\sum q_{w_i}\): precomputed at quantisation time (constant per filter).
- The result is in INT32.
Requantisation (output to INT8):
The output \(y\) in float is:
You want the output in INT8 with scale \(S_y\), zero-point \(Z_y\):
The factor \(M_0 = S_w S_x / S_y\) is a small positive float. You convert it to a fixed-point integer + bit-shift:
Where \(M_{\text{int}}\) is a 32-bit integer and \(n\) is a shift. Then:
No floating-point at inference time. Only integer multiplications and bit-shifts.
20.3 Per-tensor vs per-channel¶
Per-tensor: one \((S_w, Z_w)\) pair for all weights in the entire layer.
Problem: if the layer has 32 output filters and their weight ranges differ greatly, a single scale fits the largest-range filter well but badly quantises small-range filters (they get rounded to very few distinct integer values, losing precision).
Per-channel: one \((S_w^{(m)}, Z_w^{(m)})\) pair per output filter \(m\).
Each filter gets its own scale matched to its own weight range. The requantisation scale \(M_0^{(m)} = S_w^{(m)} S_x / S_y\) is now different per output channel. Small overhead (32 multiplications vs 1), large accuracy gain.
Why this matters for audio: high-frequency mel filters detect energy at 4000–8000 Hz. These frequencies carry less energy in typical audio, so their detecting filters learn smaller-magnitude weights. Per-tensor quantisation, dominated by large-magnitude filters, assigns too coarse a scale to these small-weight filters. Per-channel fixes this.
PSEUDOCODE: Per-channel INT8 quantisation
function quantise_weights_per_channel(W_float, n_channels):
// W_float shape: (n_channels, K, K, C_in)
scales = zeros(n_channels)
zpoints = zeros(n_channels, type=int8)
W_int8 = zeros_like(W_float, type=int8)
for m from 0 to n_channels - 1:
w = W_float[m] // one filter: (K, K, C_in)
r_min = min(w)
r_max = max(w)
S = (r_max - r_min) / 255.0
Z = round(-r_min / S) - 128
scales[m] = S
zpoints[m] = clip(Z, -128, 127)
W_int8[m] = clip(round(w / S) + Z, -128, 127)
return W_int8, scales, zpoints
21. INT8 Inference Runtime (memd-rt)¶
memd-rt is your custom C inference engine. No TFLite, no ESP-DL. You implement every operator.
21.1 Memory layout¶
Tensors are stored as flat arrays. A 4D tensor of shape \((N, H, W, C)\) in NHWC format (batch, height, width, channels):
NHWC vs NCHW:
- NHWC: channel is the innermost dimension → adjacent channels at same spatial location are contiguous in memory → efficient for depthwise conv (accessing all channels at one position)
- NCHW: height/width innermost → efficient for some GPU operations
ESP32-S3 has no SIMD. NHWC is fine. Critically: your C layout must match how PyTorch exported the weights, or the convolutions will silently compute wrong results.
21.2 INT8 convolution loop with INT32 accumulation¶
PSEUDOCODE: INT8 Conv2D forward pass (memd-rt)
function conv2d_int8(input_int8, weights_int8, bias_int32,
Sx, Zx, // input scale, zero-point
Sw[], Zw[], // per-channel weight scale, zero-point
Sy, Zy, // output scale, zero-point
stride, padding):
H, W, C_in = shape(input_int8)
M, K, K, _ = shape(weights_int8) // M output channels, K×K kernel
H_out = (H + 2×padding - K) / stride + 1
W_out = (W + 2×padding - K) / stride + 1
output_int8 = zeros(H_out, W_out, M, type=int8)
for m from 0 to M-1:
M0 = Sx × Sw[m] / Sy // requantisation scale (float, precomputed)
// Convert M0 to fixed-point: M0 ≈ M0_int × 2^(-shift)
shift, M0_int = float_to_fixedpoint(M0)
for i from 0 to H_out-1:
for j from 0 to W_out-1:
acc = (int32)bias_int32[m] // INT32 accumulator
for ki from 0 to K-1:
for kj from 0 to K-1:
for c from 0 to C_in-1:
q_x = input_padded[i×stride+ki][j×stride+kj][c]
q_w = weights_int8[m][ki][kj][c]
// Subtract zero-points before multiply
acc += (int32)(q_w - Zw[m]) × (int32)(q_x - Zx)
// Requantise INT32 accumulator to INT8
q_y = (acc × M0_int) >> shift
q_y = q_y + Zy
output_int8[i][j][m] = clip(q_y, -128, 127)
return output_int8
21.3 The weight file format (.memd)¶
You define a custom binary format for storing the model. Each section:
PSEUDOCODE: .memd file structure
File header:
magic : 4 bytes "MEMD"
version : 2 bytes 0x0001
n_layers : 2 bytes number of layers
For each layer:
layer_type : 1 byte (0=conv2d, 1=dw_conv, 2=pw_conv, 3=linear)
n_channels : 4 bytes
kernel_size: 1 byte
stride : 1 byte
padding : 1 byte
// Per-channel quantisation parameters
scales[] : n_channels × 4 bytes (float32)
zpoints[] : n_channels × 1 byte (int8)
// Weights
weights[] : n_channels × K × K × C_in bytes (int8)
bias[] : n_channels × 4 bytes (int32)
On the ESP32, you mmap or memcpy this file from flash into PSRAM at startup. The C runtime reads directly from this binary.
22. The Memory Packet¶
Every time memd-net fires, you produce a memory_packet_t struct. Fixed size: exactly 128 bytes.
typedef struct {
uint32_t timestamp; // 4 bytes — unix epoch from NTP
uint8_t sensor_type; // 1 byte — 0x01 = audio
uint8_t event_type; // 1 byte — 0=speech,1=music,...,7=silence
uint8_t confidence; // 1 byte — 0–255, from softmax max
uint8_t reserved; // 1 byte — alignment padding
int8_t embedding[32]; // 32 bytes — INT8 quantised 32-dim embedding
uint8_t audio_snippet[88]; // 88 bytes — compressed audio thumbnail
} memory_packet_t; // total: 4+1+1+1+1+32+88 = 128 bytes
Why exactly 128 bytes?
- No fragmentation in flash (LittleFS manages fixed-size blocks trivially)
- \(O(1)\) seek: packet \(n\) is at byte offset \(n \times 128\)
- Trivially serialisable: you can
memcpythe whole struct into a WiFi packet - Cache-aligned: 128 is a multiple of typical cache line sizes
Audio snippet (88 bytes): a very compressed thumbnail of the raw audio. Computed by heavily downsampling the 2-second capture (32000 samples → 88 samples). Not for high-fidelity replay — just for a rough verification that the packet captured what you think it did.
PSEUDOCODE: Packing a memory packet
function pack_memory_packet(event_type, confidence_float, embedding_int8,
audio_frame, timestamp):
pkt = allocate(memory_packet_t)
pkt.timestamp = timestamp
pkt.sensor_type = 0x01
pkt.event_type = event_type
pkt.confidence = round(confidence_float × 255)
pkt.reserved = 0
copy(embedding_int8, pkt.embedding, 32)
// Downsample audio to 88 samples
pkt.audio_snippet = downsample(audio_frame, target_length=88)
return pkt
Writing to flash (LittleFS):
PSEUDOCODE: Flash storage
function append_packet(pkt):
f = littlefs_open("/memd/log.bin", mode=APPEND)
littlefs_write(f, bytes(pkt), 128)
littlefs_close(f)
function read_packet(index):
f = littlefs_open("/memd/log.bin", mode=READ)
littlefs_seek(f, offset = index × 128)
pkt = littlefs_read(f, 128)
littlefs_close(f)
return pkt
23. Cosine Similarity¶
The embedding is a point in 32-dimensional space. Similar sounds → nearby points. You retrieve similar memories by finding packets whose embeddings are close to a query embedding.
Dot product of two vectors \(\mathbf{a}, \mathbf{b} \in \mathbb{R}^{32}\):
L2 norm (magnitude/length) of a vector:
Cosine similarity:
Range: \([-1, 1]\).
- \(\cos = 1\): vectors point in exact same direction → identical sound type
- \(\cos = 0\): perpendicular → unrelated sounds
- \(\cos = -1\): opposite directions
Why cosine and not Euclidean distance? Euclidean distance is sensitive to magnitude (vector length). A louder version of the same sound might have a larger-magnitude embedding, making it look “far” from the quieter version despite being the same event. Cosine similarity normalises out magnitude and only cares about direction.
Retrieval:
PSEUDOCODE: Top-k retrieval
function retrieve_similar(query_embedding, k=5):
all_packets = load_all_packets()
similarities = []
for pkt in all_packets:
stored_emb = pkt.embedding // int8[32]
// Dequantise to float for comparison
stored_float = dequantise(stored_emb, scale=S_emb, zp=Z_emb)
query_float = dequantise(query_embedding, scale=S_emb, zp=Z_emb)
sim = cosine_similarity(query_float, stored_float)
similarities.append((sim, pkt))
// Sort by similarity descending
similarities.sort(key=first_element, descending=True)
return similarities[:k] // top k
For thousands of packets, this brute-force loop is fast enough (milliseconds in Python).
24. Knowledge Graph and Session Merging¶
24.1 SQLite schema¶
-- Individual events
CREATE TABLE events (
id INTEGER PRIMARY KEY,
timestamp INTEGER NOT NULL,
event_type INTEGER NOT NULL,
confidence INTEGER NOT NULL,
embedding BLOB NOT NULL -- 32 int8 values as raw bytes
);
-- Merged sessions (consecutive same-type events)
CREATE TABLE sessions (
id INTEGER PRIMARY KEY,
event_type INTEGER NOT NULL,
start_time INTEGER NOT NULL,
end_time INTEGER NOT NULL,
packet_count INTEGER NOT NULL
);
-- Co-occurrence edges between sessions
CREATE TABLE edges (
session_a INTEGER REFERENCES sessions(id),
session_b INTEGER REFERENCES sessions(id),
co_count INTEGER DEFAULT 0,
last_seen INTEGER,
PRIMARY KEY (session_a, session_b)
);
24.2 Session merging heuristic¶
Raw packets arrive at roughly one every few seconds (whenever VAD triggers). Consecutive keyboard packets should be merged into one “keyboard session” rather than stored as hundreds of separate events.
Merging rule: if two packets of the same event type are within \(\Delta t_{\max} = 30\) seconds of each other, they belong to the same session.
This is a design choice with no objectively correct answer. Too small \(\Delta t_{\max}\) = over-splits one session. Too large = merges different sessions. The right value is application-specific.
PSEUDOCODE: Session builder
function build_sessions(packets, dt_max=30):
sessions = []
current_session = None
for pkt in packets.sorted_by_timestamp():
if current_session is None:
current_session = new_session(pkt)
elif (pkt.event_type == current_session.event_type and
pkt.timestamp - current_session.end_time <= dt_max):
// Extend current session
current_session.end_time = pkt.timestamp
current_session.packet_count += 1
else:
// Close current session, start new one
sessions.append(current_session)
current_session = new_session(pkt)
if current_session is not None:
sessions.append(current_session)
return sessions
24.3 Edge weights — temporal co-occurrence¶
Two session types that tend to occur together (e.g., keyboard and music) should have a strong edge.
Where \(t_a\) and \(t_b\) are the start times of the two sessions. Sessions close in time get a large increment (near 1). Sessions far apart get a tiny increment (near 0).
Over many days this accumulates into a meaningful co-occurrence graph. “What do I do when I’m coding?” → find strong edges to the “keyboard” session type.
25. Full End-to-End Pipeline¶
┌─────────────────────────────────────────────────────────────┐
│ ESP32-S3 │
│ │
│ 1. INMP441 mic → I2S buffer │
│ x[0..15999] (16-bit PCM, 16kHz, 1 second) │
│ │
│ 2. Pre-emphasis │
│ y[n] = x[n] - 0.97 × x[n-1] │
│ │
│ 3. Energy VAD (RMS threshold) │
│ rms = sqrt(mean(y²)) │
│ if rms < threshold: discard, goto 1 │
│ │
│ 4. Frame (512 samples, hop 496) + Hamming window │
│ 32 frames × 512 samples each │
│ │
│ 5. FFT each frame (esp-dsp) │
│ X_t[k] for k = 0..256 (complex, 32-bit float) │
│ │
│ 6. Power spectrum │
│ P_t[k] = |X_t[k]|² = Re² + Im² │
│ │
│ 7. Mel filterbank (32 triangular filters) │
│ S_t[m] = Σ_k H_m[k] × P_t[k] │
│ │
│ 8. Log compression + fixed-point bias correction │
│ L_t[m] = log(S_t[m] + ε) + bias[m] │
│ │
│ 9. Stack into 32×32 INT8 spectrogram │
│ spec[m][t] = quantise(L_t[m]) │
│ │
│ 10. memd-net forward pass (memd-rt, C) │
│ conv2d → dw_sep × 4 → gap → linear │
│ → event_type (0–7) │
│ → confidence (0–255) │
│ → embedding[32] (int8) │
│ │
│ 11. OLED update │
│ "keyboard 87%" │
│ │
│ 12. Pack memory_packet_t (128 bytes) │
│ → append to LittleFS │
│ │
│ 13. WiFi sync (when idle, I2S off) │
│ HTTP POST packets to Python server │
└────────────────────────────┬────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Python Backend (laptop) │
│ │
│ 14. Receive packets │
│ INSERT INTO events │
│ │
│ 15. Session builder │
│ merge consecutive same-type events within 30s │
│ INSERT INTO sessions │
│ │
│ 16. Edge weights │
│ update edges table: w += 1/(1 + |t_a - t_b|) │
│ │
│ 17. Retrieval │
│ cosine_sim(query_emb, stored_emb) for all stored │
│ return top-k │
│ │
│ 18. (Optional) Ollama natural language query │
│ "what did I spend most of yesterday doing?" │
│ → query sessions by time range + most common type │
└────────────────────────────────────────────────────────────┘
Summary of every formula¶
| Step | Formula |
|---|---|
| Pre-emphasis | \(y[n] = x[n] - 0.97 \cdot x[n-1]\) |
| Hamming window | \(w[n] = 0.54 - 0.46\cos!\left(\frac{2\pi n}{N-1}\right)\) |
| DFT | \(X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j2\pi kn/N}\) |
| Power spectrum | \(P[k] = \text{Re}(X[k])^2 + \text{Im}(X[k])^2\) |
| Hz to Mel | \(m = 2595\log_{10}(1 + f/700)\) |
| Mel to Hz | \(f = 700(10^{m/2595} - 1)\) |
| Mel filterbank | \(S[m] = \sum_k H_m[k] \, P[k]\) |
| Log compression | \(L[m] = \log(S[m] + \varepsilon)\) |
| Neuron | \(y = f!\left(\sum w_i x_i + b\right)\) |
| ReLU6 | \(y = \min(\max(0, x), 6)\) |
| Softmax | \(\hat{y}[c] = e^{z[c]} / \sum_k e^{z[k]}\) |
| Cross-entropy loss | \(\mathcal{L} = -\log(\hat{y}[c^*])\) |
| Gradient descent | \(w \leftarrow w - \eta \, \partial\mathcal{L}/\partial w\) |
| 2D convolution | \(Y[i][j][m] = \sum_{k_i,k_j,c} F_m[k_i][k_j][c]\cdot X[i{+}k_i][j{+}k_j][c]\) |
| Global avg pool | \(\text{GAP}[c] = \frac{1}{HW}\sum_{i,j} X[i][j][c]\) |
| INT8 quantise | \(q = \text{clip}(\lfloor r/S \rceil + Z, -128, 127)\) |
| INT8 dequantise | \(r = S(q - Z)\) |
| Cosine similarity | $\cos(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{ |
| Edge weight | \(w(s_a, s_b) \mathrel{+}= \frac{1}{1 + \lvert t_a - t_b \rvert}\) |
This document covers every step of the memd pipeline — from raw I2S samples to knowledge graph retrieval — with full mathematical derivations and pseudocode implementations.