memd — Checkpoints¶

ESP32-S3-N16R8 + INMP441 | June → November 2026¶

How To Use This¶

This document has two kinds of entries. Read both as you build.

Checkpoints — concrete milestones. Each one has a verification step and a social post. Don't skip them by rushing forward. The posts are half the project.

Limitations — known failure modes documented before you hit them. When you hit them (you will), don't spend days thinking something is broken. It's expected. Fix or document and move on.

Cross-reference with roadmap.md for stage context and references.md for what to read when you hit each phase.

PHASE 0 — Environment Setup¶

Target: Week 1-2, June¶

References: references.md → PHASE 0 sections¶

✅ Checkpoint 0.1 — Hello World¶

What: C program flashed via ESP-IDF printing "Hello World" to serial.
Verify: idf.py flash monitor runs. Text visible. Done within 2 hours of unboxing.
Post: Nothing. Internal only.

✅ Checkpoint 0.2 — INMP441 Capturing Audio¶

What: Raw I2S samples printing to the serial terminal. Numbers change when you make noise.
Verify: Tap mic → numbers spike. Silence → near-zero values. Blow on mic → sustained wave.
Post (tweet): "INMP441 I2S mic working on ESP32-S3. First raw audio samples captured for memd. Waveform reacts to clapping. Next: VAD to filter silence."

✅ Checkpoint 0.3 — Memory Budget Known¶

What: Boot program printing exact free SRAM and PSRAM.
Verify: heap_caps_get_free_size(MALLOC_CAP_INTERNAL) reports ~380-420KB. heap_caps_get_free_size(MALLOC_CAP_SPIRAM) reports ~8MB. Write these numbers down. They are your constraints for everything that follows.
Expected numbers for memd:

Internal SRAM free at boot:   ~400KB
PSRAM free at boot:           ~8MB
Target: model + activations   <180KB SRAM total
Actual headroom:              ~220KB spare

Post: Nothing. Reference number only.

✅ Checkpoint 0.4 — VAD Triggering Correctly¶

What: Energy-based VAD (RMS threshold on 100ms windows) triggering on sound, silent on silence.
Verify: Clap → triggers. Speech → triggers. Fan noise alone → no trigger (tune threshold). Sustained silence → no trigger.
Post (tweet): "Simple RMS-based VAD working for memd. Triggers on speech and events, ignores background noise. Threshold tuning is more art than science. Here's the tradeoff curve I found: [graph or numbers]"
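The VAD rule above is small enough to prototype off-device before writing the C version. This is a minimal sketch, assuming 16 kHz sampling, raw integer samples, and a placeholder threshold of 500; the names `rms` and `vad_triggers` are illustrative, not part of memd.

```python
import math

def rms(window):
    """Root-mean-square level of one window of integer samples."""
    if not window:
        return 0.0
    return math.sqrt(sum(s * s for s in window) / len(window))

def vad_triggers(samples, sample_rate=16000, window_ms=100, threshold=500.0):
    """Return one bool per full window: True where RMS exceeds the threshold."""
    win = sample_rate * window_ms // 1000  # 1600 samples at 16 kHz
    flags = []
    for start in range(0, len(samples) - win + 1, win):
        flags.append(rms(samples[start:start + win]) > threshold)
    return flags

# Synthetic check: 100 ms of near-silence, then 100 ms of a loud 440 Hz tone.
quiet = [3 if i % 2 else -3 for i in range(1600)]
loud = [int(8000 * math.sin(2 * math.pi * 440 * i / 16000)) for i in range(1600)]
flags = vad_triggers(quiet + loud)
```

The threshold is the only tunable here. Sweep it against recorded fan noise and speech to produce the tradeoff numbers for the tweet.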

✅ Checkpoint 0.5 — OLED Showing Status¶

What: SSD1306 OLED displaying "memd v0.1 — listening..." on boot, updating when VAD triggers.
Verify: Screen shows text. VAD trigger → screen changes. This is the first demo-able moment.
Post (tweet with photo): "memd has eyes now. SSD1306 OLED showing live status. [photo]"

⚠️ Known Limitations — Phase 0¶

L0.1 — VAD false positives in noisy environments
Severity: Medium
The energy threshold has no understanding of sound type. A loud fan, AC, or traffic triggers it the same way speech does, which creates false memories. Mitigation: add a "noise/silence" class to your CNN (Stage 3) and discard those packets in the backend. For Phase 0, tune the threshold and accept some false positives.

L0.2 — I2S and WiFi cannot run simultaneously
Severity: Medium (known fix)
ESP32-S3 I2S and WiFi share internal buses. Running both at once can cause audio dropouts and WiFi packet loss. The fix is architectural: never run WiFi and I2S at the same time. WiFi syncs only when audio capture is paused. Build this into memd from day one. Treat it as a constraint, not a bug.

L0.3 — INMP441 is omnidirectional, no directionality
Severity: Low
You cannot tell if speech is directed at you or coming from a TV across the room. Both classify as "speech." State this explicitly in Blog Post 2. It is a known hardware limitation, not a software failure.

PHASE 1 — Log-Mel Spectrogram in C¶

Target: Week 3-6, June-July¶

References: references.md → PHASE 1 section, Fayek tutorial¶

✅ Checkpoint 1.1 — Python Reference Implementation¶

What: Python script taking a WAV file and outputting a 32×32 log-mel spectrogram as a numpy array.
Verify: Plot with matplotlib. Speech shows formant structure (horizontal bands). Music shows harmonic lines. Silence is flat and dark. If it looks like noise, your filterbank is wrong.
Post: Nothing. Ground truth only.
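A wrong filterbank usually starts with wrong band edges, so it helps to check those in isolation first. The sketch below uses the HTK-style mel formula and assumes 32 bands from 0 Hz to 8 kHz (the Nyquist limit at 16 kHz); other mel variants exist, and the band range is an assumption rather than a memd setting.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels=32, f_min=0.0, f_max=8000.0):
    """Return n_mels + 2 edge frequencies in Hz, equally spaced on the mel scale.

    Triangular filter m spans edges[m] .. edges[m + 2] and peaks at edges[m + 1].
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

edges = mel_band_edges()
```

The edges should rise monotonically and the bands should widen toward high frequencies. If your C filterbank table does not show the same pattern, the bug is in the edge computation and not in the FFT.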

✅ Checkpoint 1.2 — FFT Correct via esp-dsp¶

What: dsps_fft2r_fc32 producing the correct power spectrum on a known test signal.
Verify: Feed a 1kHz sine wave at 16kHz sample rate. The FFT peak should land at bin 32, which implies a 512-point FFT (bin = f × N / fs = 1000 × 512 / 16000). If not, check bit-reversal order or windowing.
Post: Nothing. Internal validation.
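The expected bin follows from bin = f × N / fs. The sketch below checks it with a naive DFT in plain Python, assuming a 512-point transform (the length for which a 1kHz tone at 16kHz lands on bin 32); it is a cross-check for the on-device result, not a replacement for esp-dsp.

```python
import cmath
import math

def expected_bin(freq_hz, n_fft, sample_rate):
    """Bin index where a pure tone of freq_hz should peak."""
    return freq_hz * n_fft / sample_rate

def dft_peak_bin(signal):
    """Naive O(N^2) DFT; returns the largest-magnitude bin in 0..N/2."""
    n = len(signal)
    best_bin, best_mag = 0, -1.0
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(signal))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin

N_FFT = 512   # assumed FFT length
FS = 16000    # sample rate in Hz
tone = [math.sin(2 * math.pi * 1000 * i / FS) for i in range(N_FFT)]
peak = dft_peak_bin(tone)
```

If the device reports a different peak for the same tone, compare its FFT length against `N_FFT` before suspecting bit reversal.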

✅ Checkpoint 1.3 — Full C Mel Spectrogram Matching Python¶

What: Complete fixed-point C pipeline outputting a 32×32 INT8 matrix matching the Python reference within 2 INT8 units mean absolute error.
Verify: Same WAV → Python → array A. Same WAV → C via serial dump → array B. Compare element-wise. MAE < 2.0. If higher, your fixed-point log approximation or mel filterbank scaling is wrong.
Post (tweet thread, 6-8 tweets): "Implemented log-mel spectrogram from scratch in fixed-point C for memd on ESP32-S3. No DSP libraries (except FFT). Here's what I had to figure out that no tutorial covers: [thread]" Include: mel filterbank equation, why fixed-point log is hard, your approximation, error vs Python.
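The pass/fail rule for this checkpoint is a single number, so it is worth scripting. A minimal sketch, assuming both arrays have already been loaded as nested lists of integers; the toy data stands in for the real Python output and serial dump.

```python
def mean_abs_error(a, b):
    """Element-wise MAE between two equally shaped 2-D lists of INT8 values."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("shape mismatch")
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(ra) for ra in a)
    return total / count

# Toy 32x32 data: the "C" array is off by 1 in every other cell.
py_ref = [[(r + c) % 128 for c in range(32)] for r in range(32)]
c_dump = [[v + ((r + c) % 2) for c, v in enumerate(row)]
          for r, row in enumerate(py_ref)]

mae = mean_abs_error(py_ref, c_dump)
passed = mae < 2.0
```

Computing the same error per mel row as well as overall makes a frequency-dependent bias visible, which the overall number hides.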

⚠️ Known Limitations — Phase 1¶

L1.1 — Fixed-point log introduces systematic frequency bias
Severity: Medium (document it)
The fixed-point log approximation introduces a consistent error that varies across the mel filterbank. High-frequency mel bins (where energy is naturally lower) get scaled differently than low-frequency bins. This shows up as a constant offset in certain frequency bands.
Critical implication: if you train memd-net on Python-generated spectrograms but run inference on C-generated ones, the distribution mismatch degrades accuracy. You must generate training data using your C pipeline once it's stable (dump spectrograms over serial, collect as training set, retrain). Budget 1 week for this in Stage 3.
Blog angle: this is a real finding. "Why training on Python spectrograms and running on C spectrograms was a mistake" is a section of Blog Post 1.
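To see where a systematic bias can come from, here is one common integer log2 approximation modelled in Python: the integer part comes from the top set bit, the fraction from linear interpolation. It is an illustrative stand-in, not necessarily the approximation memd uses, and the Q8 format is an assumption.

```python
import math

FRAC_BITS = 8  # Q8 fixed point: one unit = 1/256

def fixed_log2_q8(x):
    """Approximate log2(x) for a positive integer x, result in Q8.

    Integer part: index of the top set bit.
    Fraction: linear interpolation, log2(1 + m) ~= m for m in [0, 1).
    """
    if x <= 0:
        raise ValueError("log2 needs x > 0")
    k = x.bit_length() - 1  # floor(log2(x))
    mantissa_q8 = ((x << FRAC_BITS) >> k) - (1 << FRAC_BITS)  # m in Q8, 0..255
    return (k << FRAC_BITS) + mantissa_q8

def worst_error_q8(lo, hi):
    """Largest absolute error in Q8 units over integers lo..hi-1."""
    return max(abs(fixed_log2_q8(x) - math.log2(x) * 256) for x in range(lo, hi))
```

This approximation is exact at powers of two and always underestimates in between, so the error depends on where each bin's energy falls. That dependence is the kind of input-dependent offset described above.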

L1.2 — 32×32 resolution loses short transients
Severity: Low
32 time frames over 1 second = ~31ms per frame. A single key click or brief knock spans 1-2 frames and may be poorly represented. memd-net will be weaker on very short sounds. Mitigation: overlapping windows (50% overlap). This doubles compute and improves short event detection. Implement it in Phase 3 if short sounds matter.

L1.3 — esp-dsp FFT uses float32 internally
Severity: Low
The FFT step uses dsps_fft2r_fc32, which runs in float32. You then convert to fixed-point for the filterbank. If latency becomes a problem, this is the bottleneck to profile first; esp-dsp also ships a 16-bit fixed-point variant (dsps_fft2r_sc16) that you could evaluate at that point. For normal use (classify once every few seconds), float32 is fine.

PHASE 2 — Custom C Inference Runtime (memd-rt)¶

Target: Week 6-12, July-August¶

References: references.md → PHASE 2 sections, GGML source, Darknet source¶

✅ Checkpoint 2.1 — Weight Exporter Working¶

What: Python script exporting trained memd-net weights to your binary format, .memd.
Verify: Load .memd back in Python, reconstruct weights, compare to the original PyTorch tensors. Float32: bit-identical. INT8: within quantisation rounding error.
Post: Nothing. Internal.
Note: Call the file format .memd. Small detail, but it means your binary format has a name and you can explain it. "I designed the .memd binary format for on-device model storage" sounds better than "I have a binary file."
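The round-trip check can be prototyped without PyTorch. The sketch below assumes a hypothetical .memd layout (4-byte magic, 16-bit version, 32-bit weight count, then little-endian float32 values); the real format is yours to define, and every field here is a placeholder.

```python
import struct

MAGIC = b"MEMD"       # hypothetical 4-byte magic
VERSION = 1           # hypothetical format version
HEADER_FMT = "<4sHI"  # magic, version, weight count; little-endian, no padding

def write_memd(weights):
    """Serialise a flat list of weights as float32 after a small header."""
    header = struct.pack(HEADER_FMT, MAGIC, VERSION, len(weights))
    return header + struct.pack("<%df" % len(weights), *weights)

def read_memd(blob):
    """Parse a blob produced by write_memd and return the weights."""
    magic, version, count = struct.unpack_from(HEADER_FMT, blob, 0)
    if magic != MAGIC or version != VERSION:
        raise ValueError("not a .memd v1 blob")
    offset = struct.calcsize(HEADER_FMT)
    return list(struct.unpack_from("<%df" % count, blob, offset))

original = [0.1, -1.5, 3.14159, 0.0]
blob = write_memd(original)
restored = read_memd(blob)
bit_identical = write_memd(restored) == blob  # compare bytes, not floats
```

Comparing re-serialised bytes avoids the float32 versus float64 trap: 0.1 read back from float32 does not equal Python's 0.1, but its float32 encoding is unchanged.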

✅ Checkpoint 2.2 — Float32 Runtime Matching PyTorch¶

What: memd-rt running a full forward pass on a test spectrogram, output matching PyTorch logits within 1e-5.
Why this is hard: NHWC vs NCHW memory layout bugs, padding errors, off-by-one errors in convolution loops. These are not obvious. Expect 2-5 days of debugging.
Verify: Same 32×32 spectrogram → PyTorch → logits A. Same spectrogram → memd-rt C → logits B. Max absolute error < 1e-5. If not, it's a layout or indexing bug.
Post (tweet): "memd-rt (my C inference runtime) finally matches PyTorch output. The bug that took longest: [specific thing, e.g. NHWC vs NCHW, padding]. Here's what it actually means and why it matters for correctness: [explanation]" This kind of tweet tends to get engagement because the debugging struggle is relatable. Be specific about the actual bug.
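Layout bugs are easier to catch once the two index formulas are written side by side. PyTorch tensors default to NCHW; whether memd-rt uses NHWC internally is an assumption here, and the helper names are illustrative.

```python
def nchw_offset(c, h, w, C, H, W):
    """Flat offset of element (c, h, w) in a channels-first tensor (batch of 1)."""
    return (c * H + h) * W + w

def nhwc_offset(c, h, w, C, H, W):
    """Flat offset of the same element in a channels-last tensor."""
    return (h * W + w) * C + c

def nchw_to_nhwc(flat, C, H, W):
    """Reorder a flat channels-first buffer into channels-last order."""
    out = [0] * (C * H * W)
    for c in range(C):
        for h in range(H):
            for w in range(W):
                out[nhwc_offset(c, h, w, C, H, W)] = flat[nchw_offset(c, h, w, C, H, W)]
    return out

C, H, W = 2, 2, 3
nchw = list(range(C * H * W))  # 0..11, all of channel 0 first
nhwc = nchw_to_nhwc(nchw, C, H, W)
```

In the channels-last result the two channel values for each pixel sit next to each other. If the exporter writes one layout and the C loops read the other, every logit is wrong even though no access goes out of bounds.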

✅ Checkpoint 2.3 — INT8 Quantisation Per-Tensor¶

What: Per-tensor PTQ working. Weights quantised, INT32 accumulation, correct rescaling. Accuracy within 8 percentage points of the float32 baseline.
Verify: Run 200 test clips through the float32 runtime and the INT8 runtime. Accuracy drop < 8 points. If > 10, your scale factors or rescaling math is wrong.
Post (tweet): "INT8 quantisation working in memd-rt. Per-tensor PTQ on memd-net. Float32 accuracy: X%. INT8 accuracy: Y%. Memory: [Z]KB weights. Here's the per-layer quantisation error:"
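The three steps named above (quantise, accumulate in integers, rescale) fit in a few lines. This is a sketch of symmetric per-tensor quantisation on a single dot product, with made-up values; memd-rt's actual scheme (zero points, rounding mode, fixed-point rescale) may differ.

```python
def scale_for(values):
    """Per-tensor scale so the largest magnitude maps to 127."""
    peak = max(abs(v) for v in values)
    return peak / 127.0 if peak > 0 else 1.0

def quantise(values, scale):
    """Symmetric INT8: round(v / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def int8_dot(weights, activations):
    """Dot product with INT8 operands, integer accumulation, one final rescale."""
    s_w, s_x = scale_for(weights), scale_for(activations)
    q_w, q_x = quantise(weights, s_w), quantise(activations, s_x)
    acc = sum(w * x for w, x in zip(q_w, q_x))  # integer accumulator (INT32 on device)
    return acc * s_w * s_x

weights = [0.50, -0.25, 0.10, 0.80]
activations = [1.0, 2.0, 3.0, 0.5]
exact = sum(w * x for w, x in zip(weights, activations))
approx = int8_dot(weights, activations)
```

If the INT8 runtime is far off, check the rescale first: the accumulator must be multiplied by the product of both scales, once, after the sum.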

✅ Checkpoint 2.4 — INT8 Quantisation Per-Channel¶

What: Per-channel PTQ, with one scale factor per output channel. Accuracy improvement over per-tensor, specifically in high-frequency mel bins.
Verify: Accuracy improves 2-5 percentage points over per-tensor. If not, your per-channel scale computation is wrong (check that you're computing the scale per output channel, not per tensor).
Post (tweet): "Per-channel vs per-tensor quantisation for audio on memd. The difference: [X]% accuracy. Why it matters specifically for mel spectrograms: high-frequency bins have lower energy → a single per-tensor scale leaves them too few quantisation levels → per-channel scales don't. Here's the distribution plot: [image]" This is your most technically interesting tweet before Blog Post 1. It's an audio-specific finding that is rarely written up cleanly.
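The effect is easy to reproduce on toy numbers. The sketch below compares the round-trip error of a small-magnitude channel under one shared scale versus its own scale; the weights are invented for illustration and are not from memd-net.

```python
def peak(values):
    """Largest absolute value in a list."""
    return max(abs(v) for v in values)

def quant_error(row, scale):
    """Mean absolute error after symmetric INT8 quantise/dequantise with one scale."""
    total = 0.0
    for v in row:
        q = max(-127, min(127, round(v / scale)))
        total += abs(v - q * scale)
    return total / len(row)

# Two output channels: one with large weights, one with small weights.
channels = [
    [0.90, -0.70, 0.50, -0.30],
    [0.004, -0.003, 0.002, -0.001],
]

per_tensor_scale = peak([v for row in channels for v in row]) / 127.0
per_channel_scales = [peak(row) / 127.0 for row in channels]

small_err_tensor = quant_error(channels[1], per_tensor_scale)
small_err_channel = quant_error(channels[1], per_channel_scales[1])
```

With the shared scale, most of the small channel rounds to zero. With its own scale, the same channel uses the full INT8 range.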

✅ Checkpoint 2.5 — Running on ESP32-S3¶

What: memd-rt cross-compiled and running on device, producing correct classifications on recorded test clips.
Verify: Play a speech recording near the mic → classifies "speech" with >80% confidence. Play keyboard sounds → "keyboard". Silence → "silence". Measure latency with esp_timer_get_time(). Expected: ~30-50ms inference latency. If >100ms, check for PSRAM accesses during inference (weights must be in SRAM).
Post (tweet): "memd-rt running on ESP32-S3. [X]ms inference latency. [Y]KB SRAM. Classifying 8 audio event types. Next: benchmark vs ESP-DL."

✅ Checkpoint 2.6 — Benchmark vs ESP-DL¶

What: Same model, same 200 test clips, through memd-rt and ESP-DL side by side.
Verify: Both run on device. Numbers recorded.
Expected result (and why it's fine):

Metric              memd-rt (yours)     ESP-DL
──────────────────────────────────────────────────────
Inference latency   ~40-80ms            ~15-30ms
Peak SRAM           ~180KB              ~160KB
Accuracy            X%                  X+2%
Binary footprint    small               large (full ESP-IDF dep)
Framework dep       none                ESP-IDF only
Portability         any C compiler      Espressif only

ESP-DL will likely be faster because it uses the ESP32-S3 PIE SIMD vector instructions. That gap is the point: explain why it is faster (PIE operates on 128-bit vectors, up to 16 INT8 values per instruction, versus one value per multiply in your scalar C loops), not just that it is.
Post: Blog Post 1 (see below).

📝 BLOG POST 1 — End of August¶

Title: "Building memd-rt: a custom audio CNN inference runtime in C for ESP32-S3. Here's what I learned."
Where to post: r/esp32, r/embedded
Tweet: link + benchmark table as image
Contents:
1. What memd is and why from scratch
2. The .memd binary format design
3. Matching PyTorch output: the bug that took the longest
4. INT8 quantisation math: per-tensor vs per-channel
5. The mel bin clipping finding (audio-specific, your contribution)
6. ESP32-S3 port: alignment, watchdog, I2S/WiFi scheduling
7. Benchmark table: memd-rt vs ESP-DL with explanation of gap (PIE SIMD)
8. Honest limitations section (link to this document)

⚠️ Known Limitations — Phase 2¶

L2.1 — memd-rt slower than ESP-DL
Severity: Low (expected)
ESP-DL uses PIE SIMD vector instructions. Your scalar C loops don't. Expect ESP-DL to be 2-4x faster. This is the expected result. Explain it, don't apologise for it. "Here's what PIE does that my code doesn't" is a blog section, not a failure.

L2.2 — INT8 accuracy drop is real
Severity: Low (expected)
Expect a 3-8% accuracy drop from float32 to INT8. Some classes will degrade more than others; spectrally diffuse sounds such as eating/chewing are likely to degrade the most. Document per-class accuracy, not just overall. The per-class breakdown is more interesting than the aggregate number.

L2.3 — Training/inference spectrogram mismatch
Severity: High (must fix)
If you train memd-net on Python spectrograms but run memd-rt on C spectrograms (which carry the fixed-point error from L1.1), accuracy degrades silently. You won't know why. Fix: after the C pipeline is stable in Phase 1, dump ~500 training spectrograms over serial, collect them as a dataset, and retrain or fine-tune memd-net on C-generated data. Budget 1 week. This is not optional.

L2.4 — Watchdog kills inference during debugging
Severity: Low (easy fix)
The default task watchdog timeout is 5 seconds. With printf debugging, inference can exceed this. Fix: call esp_task_wdt_reset() in the inference loop or increase the timeout in menuconfig. Not a real problem at production inference speeds.

L2.5 — No hardware MAC unit
Severity: Low
ESP32-S3 has no dedicated neural accelerator, and memd-rt's scalar C loops do not use the PIE vector multiply-accumulate instructions (see L2.1). INT8 multiplications therefore run on the general ALU, and latency is in the tens of milliseconds as a result. For memd's use case (classify once every few seconds when VAD triggers), this is completely acceptable.

PHASE 3 — memd Memory System + Backend¶

Target: Weeks 11-17, August-September¶

References: references.md → PHASE 3 section¶

✅ Checkpoint 3.1 — Memory Packets Writing to Flash¶

What: ESP32 writing correctly formatted memory_packet_t structs to LittleFS after each inference.
Verify: Write 10 packets, read them back, verify all fields match. Check the struct byte layout with sizeof(memory_packet_t) == 128.
Post: Nothing. Internal.
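The backend has to parse the same 128 bytes the device writes, so the layout is worth pinning down on both sides. The field list below is hypothetical (timestamp, class id, confidence, sequence number, 120-value INT8 embedding) and only illustrates how a fixed 128-byte packet can be checked from Python.

```python
import struct

# Hypothetical layout, little-endian with no padding:
# uint32 timestamp, uint8 class id, uint8 confidence (0-255),
# uint16 sequence number, int8[120] embedding.
PACKET_FORMAT = "<IBBH120b"
PACKET_SIZE = struct.calcsize(PACKET_FORMAT)

def pack_packet(timestamp, class_id, confidence, seq, embedding):
    """Build one fixed-size packet from its fields."""
    if len(embedding) != 120:
        raise ValueError("embedding must have 120 values")
    return struct.pack(PACKET_FORMAT, timestamp, class_id, confidence, seq, *embedding)

def unpack_packet(blob):
    """Decode one packet into a dict of named fields."""
    fields = struct.unpack(PACKET_FORMAT, blob)
    return {"timestamp": fields[0], "class_id": fields[1],
            "confidence": fields[2], "seq": fields[3],
            "embedding": list(fields[4:])}

sample = pack_packet(1780000000, 3, 212, 7, [i - 60 for i in range(120)])
decoded = unpack_packet(sample)
```

With this field order every member is naturally aligned, so a C struct with the same members would need no compiler padding. Any layout that breaks natural alignment needs a packed attribute or explicit padding bytes to keep sizeof at 128.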

✅ Checkpoint 3.2 — NTP Timestamp Sync¶

What: On WiFi connect, memd syncs time via NTP. All subsequent packets have correct Unix timestamps.
Verify: Check the timestamp against current time after sync. Should be within 1 second.
Post: Nothing. Infrastructure.
Note on L3.1: if WiFi is unavailable at boot, timestamps are wrong until sync. Either add a DS3231 RTC module (₹150, worth it) or accept the NTP dependency. Decide now and document the choice.

✅ Checkpoint 3.3 — WiFi Sync to Backend¶

What: ESP32 connecting to WiFi when idle (I2S off) and HTTP POSTing new packets to a Python server on the laptop.
Verify: Python server logs received packets. Count on laptop matches count in ESP32 flash. No packet corruption.
Post (tweet): "memd packets syncing from ESP32-S3 to laptop over WiFi. Running for 3 hours: [X] events captured. Event distribution: speech [X]%, keyboard [Y]%, music [Z]%. The distribution was [expected/surprising]: [finding]"

✅ Checkpoint 3.4 — Knowledge Graph Building¶

What: Python backend creating SQLite nodes and edges from incoming packets. Sessions forming. Co-occurrence edges appearing.
Verify: After 2-3 hours of use, run a graph visualisation with networkx. You should see clusters: keyboard sessions, music sessions, speech events. If it's a flat list, your edge formation logic is wrong.
Post (tweet with graph image): "memd knowledge graph after [X] hours of capture. Starting to see my day in edges. Keyboard + music co-occur strongly. Interesting finding: [something unexpected the data showed]"
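Edge formation can be tested on a handful of rows before any real data arrives. This sketch assumes a hypothetical two-table schema (`nodes`, `edges`) and a 30-second co-occurrence window; neither is taken from the memd backend.

```python
import sqlite3

def build_graph(events, window_s=30):
    """events: list of (timestamp, label) tuples.

    Stores each event as a node, then counts one co-occurrence per pair of
    events with different labels that fall within window_s seconds.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, ts INTEGER, label TEXT)")
    db.execute("CREATE TABLE edges (a TEXT, b TEXT, weight INTEGER)")
    db.executemany("INSERT INTO nodes (ts, label) VALUES (?, ?)", events)
    db.execute(
        "INSERT INTO edges "
        "SELECT x.label, y.label, COUNT(*) FROM nodes x JOIN nodes y "
        "ON x.label < y.label AND ABS(x.ts - y.ts) <= ? "
        "GROUP BY x.label, y.label",
        (window_s,),
    )
    return db

db = build_graph([(0, "keyboard"), (5, "music"), (20, "keyboard"), (300, "speech")])
edges = db.execute("SELECT a, b, weight FROM edges").fetchall()
```

An empty `edges` table after hours of capture is the "flat list" failure described above. The join condition is the first thing to inspect.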

✅ Checkpoint 3.5 — Cosine Retrieval Working¶

What: Given a query audio clip or time range, the Python backend returns the top-5 most similar past events.
Verify: Record keyboard typing → query → top 5 results should all be keyboard events. Record speech → top 5 speech events. Cross-class retrieval (speech query returning keyboard results) indicates embedding space collapse, so retrain.
Post: Nothing. Save for Blog Post 2.
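Brute-force retrieval needs nothing beyond a similarity function and a sort. A minimal sketch with three-dimensional toy embeddings; real memd embeddings will be longer, and the labels here are only test data.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, events, k=5):
    """events: list of (label, embedding). Returns the k best (score, label) pairs."""
    scored = [(cosine(query, emb), label) for label, emb in events]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

events = [
    ("keyboard", [0.9, 0.1, 0.0]),
    ("keyboard", [0.8, 0.2, 0.1]),
    ("speech",   [0.1, 0.9, 0.2]),
    ("speech",   [0.0, 1.0, 0.1]),
    ("music",    [0.1, 0.2, 0.9]),
]
results = top_k([1.0, 0.1, 0.0], events, k=2)
labels = [label for _, label in results]
```

The checkpoint's pass condition maps directly onto this: for a keyboard query, every label in the result should be "keyboard".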

✅ Checkpoint 3.6 — End-to-End Demo¶

What: Full memd pipeline running: mic → VAD → inference → packet → flash → WiFi → graph → retrieval → OLED + terminal result.
Verify: 5 different sound events. Retrieve each by similarity. All 5 correct. Measure total latency from sound to retrievable.
Post: Demo video first: 90 seconds, no narration. OLED showing live classification. Terminal showing query result. This is your most important post. Post to Twitter first, before Blog Post 2.

📝 BLOG POST 2 — End of September¶

Title: "memd: a privacy-first ambient memory device on ESP32-S3, from audio inference to knowledge graph, entirely local."
Where to post: r/MachineLearning, r/selfhosted, r/privacy, r/developersIndia, r/esp32
HN: "Show HN: memd, ambient memory device on ESP32-S3, custom audio runtime, local knowledge graph, no cloud". Post to HN Monday/Tuesday 9-11am US Eastern.
Contents:
1. The problem memd solves (life context, big tech's approach vs yours)
2. Full system architecture diagram
3. memory_packet_t format and why fixed-size matters
4. Knowledge graph schema and session merging heuristic
5. Why brute force cosine retrieval is fine at this scale
6. Power measurements: continuous vs VAD-triggered (tweetable number)
7. Privacy: what stays on device, what goes to laptop, what never leaves
8. Demo video embed
9. Honest limitations (reference this document)
10. What memd v2 would look like

⚠️ Known Limitations — Phase 3¶

L3.1 — Timestamps require NTP, lost on power cycle
Severity: Medium
ESP32-S3 has no battery-backed RTC. Every power cycle resets the system clock to the Unix epoch (time 0). NTP on the first WiFi connection fixes it, but packets captured before sync have wrong timestamps. Fix properly with a DS3231 RTC module (₹150, I2C, tiny). Decision: add it in Phase 3 or accept the NTP dependency. State your choice explicitly in the blog post.

L3.2 — Session merging heuristic is arbitrary
Severity: Low (this is the research question)
Your rule (gap < 30s = same session) is a design choice. It will sometimes merge two sessions that should be separate. It will sometimes split one session into two. There is no correct answer. This is actually a research question: what is the right temporal granularity for episodic memory formation? State it as such in Blog Post 2. Don't apologise for the heuristic. Explain why no perfect answer exists.
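The gap rule is short enough to write down and experiment with directly. A minimal sketch, assuming event timestamps in seconds; `merge_sessions` is an illustrative name, and the second call shows how much the result depends on the threshold.

```python
def merge_sessions(timestamps, max_gap_s=30):
    """Group event timestamps (seconds) into sessions.

    An event joins the current session if it arrives less than max_gap_s
    after the previous event; otherwise it starts a new session.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] < max_gap_s:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

events = [0, 10, 25, 100, 110, 200]
default_sessions = merge_sessions(events)
loose_sessions = merge_sessions(events, max_gap_s=80)
```

The same six events form three sessions at 30 seconds and two at 80 seconds. Plotting session count against the threshold on real data is one way to present the tradeoff in Blog Post 2.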

L3.3 — Out-of-distribution sounds produce unreliable embeddings
Severity: Medium
memd-net was trained on ESC-50 + Speech Commands. An unusual instrument, a specific voice, or an unfamiliar environment produces an embedding that lands in an unpredictable region of embedding space. Retrieval for OOD sounds is unreliable. State the training distribution explicitly in your README and blog post. This is scope, not a flaw.

L3.4 — Knowledge graph is shallow
Severity: Low (scope)
The SQLite graph captures co-occurrence and temporal proximity. It does not infer causality, semantics, or higher-order patterns. "Keyboard + music" co-occurring does not tell the system you work better with music. That requires an LLM layer (Ollama) on top. For v1, state clearly: the graph is a memory index, not a reasoning engine. v2 adds reasoning.

L3.5 — WiFi sync is unencrypted on LAN
Severity: Low (prototype)
The HTTP POST from ESP32 to laptop is plaintext on your local network. For a personal research prototype this is acceptable. For a product it would need HTTPS or BLE with encryption. State this in Blog Post 2. It shows you've thought about the threat model, which matters for the privacy positioning.

L3.6 — Single dominant event class per window
Severity: Medium (scope)
memd-net outputs one class per 2-second window (the dominant sound). Music + keyboard simultaneously = whichever is louder wins. Polyphonic event detection (multi-label classification) is a much harder problem. State it as a known limitation and future direction. "memd v2 would use multi-label output" is a natural blog post ending.

PHASE 4 — Research Positioning¶

Target: Month 4-5, October-November¶

References: references.md → PHASE 4, Landmark Papers sections¶

✅ Checkpoint 4.1 — 15-20 Papers Read With Notes¶

What: MCUNet, Jacob et al., Hello Edge, LEAF, MLPerf Tiny, MemGPT, Generative Agents, Sparks of AGI, Attention Is All You Need, and ~5-10 others read with notes.
Verify: You can discuss any of them conversationally without looking. You know which specific sections are most relevant to memd. You have a search-driven reading list from your own build problems.
Post: Nothing. But paper thread tweets count toward this.

✅ Checkpoint 4.2 — 5 Paper Threads Published on Twitter¶

What: 5 public Twitter threads on landmark papers. See references.md for specific thread angles. Minimum 5: Attention Is All You Need, MemGPT, Generative Agents, MCUNet, Whisper.
Verify: Posted, public, at least some engagement.
Post: These are the posts themselves.

✅ Checkpoint 4.3 — Cold Mails Tracked and Sent¶

What: 15-20 personalised emails to PhD students at IIIT-H CVIT, IISc DESE, IIT Madras, IIT Bombay.
Verify: Spreadsheet with: name, lab, paper referenced, date sent, response status.
When to start: Send 5 mails when Blog Post 1 is live. Scale to 20 when Blog Post 2 is live.
Do not mail: before you have something to show. A mail with no blog post is a generic mail.

Summary — All Limitations¶

ID    Phase    Limitation                               Severity
────────────────────────────────────────────────────────────────────
L0.1  Setup    VAD false positives in noise             Medium
L0.2  Setup    I2S + WiFi can't run simultaneously      Medium (known fix)
L0.3  Setup    No sound directionality                  Low
L1.1  DSP      Fixed-point log frequency bias           Medium (document it)
L1.2  DSP      32×32 loses short transients             Low
L1.3  DSP      esp-dsp FFT uses float32                 Low
L2.1  Runtime  memd-rt slower than ESP-DL               Low (expected, explain it)
L2.2  Runtime  INT8 accuracy drop 3-8%                  Low (expected, document it)
L2.3  Runtime  Train/inference spectrogram mismatch     HIGH — must fix
L2.4  Runtime  Watchdog during debug                    Low (easy fix)
L2.5  Runtime  No hardware MAC                          Low (latency acceptable)
L3.1  Memory   Timestamps need NTP                      Medium
L3.2  Memory   Session merging heuristic arbitrary      Low (research question)
L3.3  Memory   OOD sounds unreliable                    Medium
L3.4  Memory   Graph is shallow                         Low (scope)
L3.5  Memory   WiFi sync unencrypted                    Low (prototype)
L3.6  Memory   Single dominant class per window         Medium (scope)

The only High severity item is L2.3. Fix it. Everything else: document and explain.

Full Post Schedule¶

Jun Week 1    Tweet: "Starting drop year. Building memd." (announce)
Jun Week 2    Tweet: Checkpoint 0.2 (INMP441 working)
Jun Week 2    Tweet: Checkpoint 0.4 (VAD working) + OLED photo
Jun Week 5-6  Tweet: Checkpoint 1.3 thread (mel spectrogram in C — 6-8 tweets)
Jul Week 3-4  Tweet: Checkpoint 2.2 (float32 runtime matches PyTorch)
Aug Week 1    Tweet: Checkpoint 2.3 (INT8 accuracy numbers)
Aug Week 1    Tweet: Checkpoint 2.4 (per-channel vs per-tensor finding)
Aug Week 2    Tweet: Checkpoint 2.5 (running on device, latency)
Aug Week 4    BLOG POST 1 → r/esp32, r/embedded
Sep Week 1    Tweet: Checkpoint 3.3 (WiFi sync, event distribution)
Sep Week 2    Tweet: Checkpoint 3.4 (graph visualisation)
Sep Week 3    DEMO VIDEO → Twitter (most important post)
Sep Week 4    BLOG POST 2 → r/MachineLearning, r/selfhosted, r/privacy,
                            r/developersIndia, r/esp32
Sep Week 4    HN: Show HN post
Sep Week 4    Cold mailing begins (5 mails)
Month 4+      Paper threads: Attention, MemGPT, Generative Agents,
                             MCUNet, Whisper (one per week)
October       Cold mailing at full scale (15-20 total)
November      Applications open

Last updated: June 2026
Project: memd — ESP32-S3-N16R8 + INMP441 + Custom C Audio Inference + Knowledge Graph
Cross-reference: roadmap.md (what to build), references.md (what to read), prompt.md (full context)