memd: Roadmap

Ayush | June 2026 → November 2026 Applications

What Is memd

memd (pronounced "mem-dee", like a Unix daemon) is a privacy-first ambient memory device built on an ESP32-S3 microcontroller. It listens to your environment, classifies audio events using a custom INT8 inference runtime written in C from scratch, compresses observations into 128-byte memory packets, and syncs them to a local backend that builds a knowledge graph of your life. No cloud, no subscriptions, no data leaving hardware you own.

Think of it as /var/log for your life. A daemon that runs quietly, captures what matters, and makes it searchable later.

The One Sentence

"I built memd, a privacy-first ambient memory device on a $5 microcontroller, with a custom INT8 audio inference runtime in C, a tiny CNN trained on log-mel spectrograms, and a local knowledge graph backend. It continuously captures life events and makes them retrievable."

This goes in your SOP, cold mails, and Twitter bio. Every decision serves this sentence.

Why memd Exists

Big tech is racing to build always-on AI memory from the top down:

- OpenAI added memory to ChatGPT
- Google has Project Astra
- Limitless built a $99 wearable pendant
- Rewind built a Mac app that records your entire screen

They all start with powerful models and add memory as an afterthought. memd starts from the opposite end — a $5 microcontroller with 512KB SRAM — and asks: what is the minimum viable memory representation?

Few university labs appear to be working at this layer. That's the gap memd fills.

The research question:

"What is the minimum audio representation that preserves sufficient semantic information for long-term episodic memory retrieval on a severely constrained device?"

That question has a paper in it. memd is the experiment that answers it.

Why This Matters for Your Applications

You are not just building a project. You are building a research narrative that goes:

You built the whole stack yourself
    → custom C runtime (not TFLite, not Edge Impulse)
    → INT8 quantisation from scratch (not a wrapper)
    → knowledge graph and retrieval (not a demo)

You hit real research problems
    → mel spectrogram quantisation bias (audio-specific, undocumented)
    → memory deduplication heuristics (open research question)
    → minimum embedding dimensionality (testable hypothesis)

You documented everything publicly
    → 2 blog posts, demo video, Twitter threads
    → citable by cold mail targets
    → visible to labs before you mail them

This is what gets you into IIIT-H CVIT, IISc DESE, IIT Madras. Not the GitHub repo. The writeup.

Hardware

ESP32-S3-N16R8 devkit          ₹800-1000    Probots / Hubtronics
INMP441 I2S MEMS microphone    ₹200-300     robu.in (buy 2)
SSD1306 OLED 0.96" I2C 4-pin   ₹150-200     robu.in
LiPo battery 500mAh            ₹300-400     robu.in (add later)
TP4056 charging module         ₹50          robu.in (add later)
Jumper wires + breadboard      ₹200
──────────────────────────────────────────────
Phase 1 total (no battery)     ~₹1500-1700

Why N16R8 specifically: 16MB flash stores model weights + memd packet log. 8MB PSRAM buffers audio frames. Variants with less flash or no PSRAM leave too little headroom for this project.

Battery: Add after firmware is working. Don't debug power management and audio simultaneously. Run off USB first.

OLED: Shows real-time event classification on a small screen. The waveform → classification transition is your demo video hero shot. Worth the ₹200.

memd System Architecture

┌─────────────────────────────────────────────────────────┐
│                     ESP32-S3 (memd device)               │
│                                                          │
│  INMP441 mic → I2S buffer → VAD check (RMS threshold)   │
│                                  ↓ triggered             │
│                           2s audio window                │
│                                  ↓                       │
│                      log-mel spectrogram                 │
│                      (your fixed-point C, no library)    │
│                                  ↓                       │
│                      INT8 CNN inference                  │
│                      (your C runtime, no framework)      │
│                                  ↓                       │
│               event_type + confidence + 32-dim embedding │
│                                  ↓                       │
│               memory_packet_t → LittleFS (flash)         │
│                                  ↓                       │
│               SSD1306 OLED: event display                │
│                                  ↓                       │
│               WiFi sync to backend when idle             │
└────────────────────────────────┬─────────────────────────┘
                                 ↓
┌────────────────────────────────────────────────────────────┐
│                  memd backend (Python, your laptop)         │
│                                                            │
│  SQLite knowledge graph                                    │
│    nodes: event types, time blocks, sessions              │
│    edges: co-occurrence, temporal proximity               │
│                                                            │
│  Brute-force cosine retrieval on 32-dim embeddings        │
│  Time range queries: "what was happening at 3pm?"         │
│  Optional: Ollama (llama3.2) for natural language queries │
└────────────────────────────────────────────────────────────┘

The rule: ESP32 captures, compresses, classifies. It does not reason. The backend reasons. Never let memd try to be the brain.

memd Memory Packet Format

Each observation is exactly 128 bytes. Fixed size is a deliberate design choice — no fragmentation, O(1) seek in flash, trivially serialisable over WiFi.

typedef struct {
    uint32_t timestamp;         // unix time, 4 bytes (NTP synced on boot)
    uint8_t  sensor_type;       // 0x01 = audio (reserved for future sensors)
    uint8_t  event_type;        // 0-7: speech/music/keyboard/door/eating/traffic/tv/silence
    uint8_t  confidence;        // 0-255, scaled from softmax output
    uint8_t  reserved;          // padding for alignment
    int8_t   embedding[32];     // INT8 quantised 32-dim semantic embedding
    uint8_t  audio_snippet[88]; // compressed audio thumbnail
} memory_packet_t;              // exactly 128 bytes

This struct definition is what your blog post explains in detail. Every field is a design decision. Every design decision is a paragraph.
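A minimal sketch of what the fixed size buys you, assuming the struct is stored back to back in a flat append-only log. The helper names `packet_offset` and `confidence_from_softmax` are hypothetical and only illustrate the idea; the `static_assert` turns the "exactly 128 bytes" comment into a build-time check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t timestamp;         /* unix time */
    uint8_t  sensor_type;       /* 0x01 = audio */
    uint8_t  event_type;        /* 0-7 */
    uint8_t  confidence;        /* 0-255 */
    uint8_t  reserved;          /* padding for alignment */
    int8_t   embedding[32];     /* INT8 embedding */
    uint8_t  audio_snippet[88]; /* compressed audio thumbnail */
} memory_packet_t;

/* The build fails if the layout ever drifts away from 128 bytes. */
static_assert(sizeof(memory_packet_t) == 128,
              "memory_packet_t must be exactly 128 bytes");

/* Byte offset of packet number `index` in a flat log.
 * Constant time: no index structure is needed. (hypothetical helper) */
size_t packet_offset(size_t index) {
    return index * sizeof(memory_packet_t);
}

/* Map a softmax probability in [0, 1] to the 0-255 confidence byte,
 * rounding to nearest and clamping out-of-range input. (hypothetical helper) */
uint8_t confidence_from_softmax(float p) {
    if (p <= 0.0f) return 0;
    if (p >= 1.0f) return 255;
    return (uint8_t)(p * 255.0f + 0.5f);
}
```

Reading packet 3 from the log then means seeking to byte 384 and reading 128 bytes, with no parsing step in between.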

The Model: memd-net

A tiny MobileNet-style CNN trained to produce audio event classifications and embeddings. You design and train it yourself.

Input:          32×32 log-mel spectrogram (1 second audio)
Output:         event_type (8 classes) + 32-dim embedding
Weights INT8:   ~36KB
Peak SRAM:      ~3KB activations
Inference:      ~30-50ms on ESP32-S3 at 240MHz
Dataset:        ESC-50 + Google Speech Commands

8 event classes: speech, music, keyboard typing, door/knock, eating, traffic, TV audio, silence

These should cover most of a normal study/work day. Don't add more: a larger class count means a larger model, which means memory problems.

Stage 1: Setup (Weeks 1-3, June)

Hardware tasks:

- ESP-IDF installed, Hello World flashing → Checkpoint 0.1
- INMP441 wired, I2S audio samples printing to serial → Checkpoint 0.2
- SRAM and PSRAM sizes printed and recorded → Checkpoint 0.3
- Energy VAD triggering on speech, ignoring silence → Checkpoint 0.4
- OLED showing "memd ready" on boot

Reading (parallel, 1-2h daily):

- TinyML by Pete Warden and Daniel Situnayake, chapters 1-6 (free PDF)
- MCUNet paper, Lin et al. 2020 (your primary citation)
- Jacob et al. 2018 quantisation paper, sections 2-3
- Fayek speech processing tutorial (before any audio papers)

Goal: INMP441 capturing. VAD triggering. Memory budget known. You can explain log-mel spectrograms and INT8 quantisation verbally.
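The energy VAD from Checkpoint 0.4 can be sketched as below. This is an illustrative version, not the project's firmware: the function names and the idea of passing the threshold as an RMS amplitude are assumptions. It compares mean square energy against the squared threshold, so no square root is needed on the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mean squared amplitude of one frame of 16-bit samples.
 * The 64-bit accumulator cannot overflow for any realistic frame length. */
uint32_t frame_mean_square(const int16_t *samples, size_t n) {
    assert(samples != NULL && n > 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t s = samples[i];
        acc += (uint64_t)(s * s);   /* s*s is at most 2^30, fits in int32_t */
    }
    return (uint32_t)(acc / n);
}

/* Energy VAD: returns 1 if the frame's RMS exceeds rms_threshold.
 * Comparing squared values avoids computing a square root. */
int vad_is_active(const int16_t *samples, size_t n, uint16_t rms_threshold) {
    uint32_t threshold_sq = (uint32_t)rms_threshold * rms_threshold;
    return frame_mean_square(samples, n) > threshold_sq;
}
```

A sensible threshold depends on the microphone gain and the room, so it is worth logging `frame_mean_square` over serial for silence and speech before fixing a value.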

While hardware ships (order immediately): Set up ESP-IDF on your laptop. Read Beej's C guide, chapters 1-8. Download the ESC-50 dataset. This week is not wasted.

Stage 2: Log-Mel Spectrogram in C (Weeks 3-6, June-July)

This is what separates memd from "I used Edge Impulse."

Raw I2S samples (int16)
    ↓
Pre-emphasis filter  y[n] = x[n] - 0.97*x[n-1]
    ↓
Hamming window (fixed-point multiply)
    ↓
512-point FFT (esp-dsp library — don't write FFT from scratch)
    ↓
Power spectrum (magnitude squared)
    ↓
Mel filterbank (32 triangular filters, fixed-point C)
    ↓
Log compression (fixed-point approximation)
    ↓
32×32 INT8 spectrogram → into CNN
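The first two stages of the pipeline above can be sketched in Q15 fixed point as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the window table is assumed to be precomputed elsewhere (for example exported from the Python reference), and the rounding shift relies on arithmetic right shift of negative values, which GCC and Clang provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREEMPH_Q15 31785 /* round(0.97 * 32768) */

/* Clamp a 32-bit intermediate back into the int16 sample range. */
static int16_t saturate16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Q15 multiply with round-to-nearest.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
static int32_t q15_mul(int32_t sample, int32_t coeff_q15) {
    return (sample * coeff_q15 + (1 << 14)) >> 15;
}

/* In-place pre-emphasis y[n] = x[n] - 0.97 * x[n-1].
 * `prev` carries x[n-1] across frame boundaries; the return value is the
 * last raw sample, to be passed in as `prev` for the next frame. */
int16_t preemphasis_q15(int16_t *x, size_t n, int16_t prev) {
    assert(x != NULL);
    for (size_t i = 0; i < n; i++) {
        int16_t cur = x[i];
        x[i] = saturate16((int32_t)cur - q15_mul(prev, PREEMPH_Q15));
        prev = cur;
    }
    return prev;
}

/* Multiply a frame by a precomputed Q15 window table, in place. */
void apply_window_q15(int16_t *frame, const int16_t *window_q15, size_t n) {
    assert(frame != NULL && window_q15 != NULL);
    for (size_t i = 0; i < n; i++) {
        frame[i] = saturate16(q15_mul(frame[i], window_q15[i]));
    }
}
```

Saturating rather than wrapping matters here: pre-emphasis can produce values outside the int16 range when consecutive samples have opposite signs near full scale.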

Key checkpoints:

- Python reference implementation first → Checkpoint 1.1
- FFT producing correct spectrum on known test signal → Checkpoint 1.2
- Full C pipeline matching Python within 2 INT8 units MAE → Checkpoint 1.3
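The Checkpoint 1.3 comparison can be expressed as a small test helper. This sketch assumes both spectrograms are available as flat INT8 arrays of the same length (the Python reference dumped to a file and loaded on the host, for instance); `spectrogram_mae` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPEC_BINS (32 * 32) /* one 32x32 spectrogram */

/* Mean absolute error between a reference spectrogram and the C output,
 * measured in INT8 units. */
float spectrogram_mae(const int8_t *ref, const int8_t *out, size_t n) {
    assert(ref != NULL && out != NULL && n > 0);
    uint32_t total = 0;
    for (size_t i = 0; i < n; i++) {
        int diff = (int)ref[i] - (int)out[i];
        total += (uint32_t)(diff < 0 ? -diff : diff);
    }
    return (float)total / (float)n;
}
```

The checkpoint passes when `spectrogram_mae(ref, out, SPEC_BINS) <= 2.0f`. Logging the per-bin error alongside the mean also shows whether the error is concentrated in particular mel bins.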

The finding you'll document: fixed-point log approximation introduces systematic frequency bias. Low mel bins and high mel bins scale differently. This is not in any tutorial. This is your contribution. See L1.1 in checkpoints.md.
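For context, one common fixed-point log approximation looks like the sketch below. It is not necessarily the one memd will use. It takes the position of the highest set bit as the integer part of log2 and the next 8 bits as a linear estimate of the fractional part. The linear estimate underestimates the true log2 by up to about 0.086 (roughly 22 in Q8), and the error depends on where the value sits between powers of two, which is the kind of non-uniform error worth measuring per mel bin. The name `log2_q8` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LOG2_FRAC_BITS 8

/* The fractional bits are taken from just below the leading one of a
 * 32-bit value, so there must be room for them. */
static_assert(LOG2_FRAC_BITS <= 23, "too many fractional bits");

/* Approximate log2(x) in Q8 (result = log2(x) * 256).
 * Integer part: index of the highest set bit.
 * Fraction: the next 8 bits, used as a linear stand-in for log2(1 + f).
 * x == 0 has no logarithm; this sketch returns 0, so callers may want to
 * add a small floor to the power spectrum first. */
int32_t log2_q8(uint32_t x) {
    if (x == 0) return 0;
    int msb = 31;
    while ((x >> msb) == 0) msb--;
    uint32_t norm = x << (31 - msb);  /* leading one moved to bit 31 */
    uint32_t frac = (norm >> (31 - LOG2_FRAC_BITS)) & ((1u << LOG2_FRAC_BITS) - 1u);
    return (int32_t)(((uint32_t)msb << LOG2_FRAC_BITS) | frac);
}
```

For example, `log2_q8(3)` gives 384 (1.5 in Q8), while the true value of log2(3) is about 1.585, or 406 in Q8.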

Stage 3: memd-net Inference Runtime in C (Weeks 6-12, July-August)

The core of the project.

Operators to implement:

conv2d()            // first layer
depthwise_conv2d()  // bulk of computation
pointwise_conv2d()  // 1x1
relu6()             // activation
global_avg_pool()   // spatial collapse
linear()            // embedding + classification
softmax()           // output probabilities
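As a starting point for the float32 runtime, two of the operators above might look like this. The sketch assumes CHW layout, 3x3 kernels, stride 1, and zero padding so the output has the same size as the input. Those choices, and the exact signatures, are assumptions rather than the memd-net design.

```c
#include <assert.h>
#include <stddef.h>

/* Depthwise 3x3 convolution: each channel is filtered by its own kernel.
 * Layout: in/out are [channels][height][width], kernel is [channels][3][3].
 * Stride 1, zero padding ("same" output size). */
void depthwise_conv2d_3x3(const float *in, const float *kernel,
                          const float *bias, float *out,
                          int channels, int height, int width) {
    assert(in != NULL && kernel != NULL && bias != NULL && out != NULL);
    for (int c = 0; c < channels; c++) {
        const float *plane = in + (size_t)c * height * width;
        const float *k = kernel + (size_t)c * 9;
        float *dst = out + (size_t)c * height * width;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                float acc = bias[c];
                for (int ky = 0; ky < 3; ky++) {
                    for (int kx = 0; kx < 3; kx++) {
                        int iy = y + ky - 1;
                        int ix = x + kx - 1;
                        /* Taps outside the image read as zero. */
                        if (iy < 0 || iy >= height || ix < 0 || ix >= width)
                            continue;
                        acc += plane[iy * width + ix] * k[ky * 3 + kx];
                    }
                }
                dst[y * width + x] = acc;
            }
        }
    }
}

/* ReLU6 in place: clamp every activation to [0, 6]. */
void relu6(float *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (x[i] < 0.0f) x[i] = 0.0f;
        else if (x[i] > 6.0f) x[i] = 6.0f;
    }
}
```

Writing the float32 version this plainly first makes it easy to diff against PyTorch layer by layer before any quantisation is introduced.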

Order of work:

1. Train memd-net in PyTorch. Verify embedding quality.
2. Write Python weight exporter → your binary format (call it .memd) → Checkpoint 2.1
3. Float32 C runtime matching PyTorch output within 1e-5 → Checkpoint 2.2
4. INT8 PTQ per-tensor → Checkpoint 2.3
5. INT8 PTQ per-channel (fixes mel bin clipping) → Checkpoint 2.4
6. Cross-compile for ESP32-S3, fix alignment/watchdog issues → Checkpoint 2.5
7. Benchmark vs ESP-DL → Checkpoint 2.6 → Blog Post 1
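Step 4, per-tensor post-training quantisation, reduces to picking one scale for a whole tensor and rounding. The sketch below assumes symmetric quantisation with the scale taken from the largest absolute value, which is one common choice and not necessarily the calibration memd will settle on. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest absolute value in a tensor. */
float max_abs(const float *w, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = w[i] < 0.0f ? -w[i] : w[i];
        if (a > m) m = a;
    }
    return m;
}

/* One scale for the whole tensor: the largest magnitude maps to 127.
 * An all-zero tensor gets scale 1 to avoid dividing by zero later. */
float per_tensor_scale(const float *w, size_t n) {
    float m = max_abs(w, n);
    return m > 0.0f ? m / 127.0f : 1.0f;
}

/* Symmetric INT8 quantisation: round to nearest, clamp to [-127, 127]. */
int8_t quantize(float v, float scale) {
    assert(scale > 0.0f);
    float scaled = v / scale;
    if (scaled >= 127.0f) return 127;
    if (scaled <= -127.0f) return -127;
    return (int8_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Map an INT8 value back to float for error measurements. */
float dequantize(int8_t q, float scale) {
    return (float)q * scale;
}
```

Comparing `dequantize(quantize(v, s), s)` against `v` for every weight gives the quantisation error that Checkpoint 2.3 can report.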

The finding you'll document: per-tensor quantisation clips high-frequency mel bins due to non-uniform energy distribution. Per-channel fixes it. 2-5% accuracy improvement. Audio-specific, not documented elsewhere.
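The difference between the two schemes is easiest to see with a toy tensor in which one channel is much louder than the other. This sketch assumes symmetric max-abs scaling and treats a "channel" as a contiguous row; where the scales are actually applied in memd-net is a design decision this example does not settle. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest absolute value in a span of floats. */
static float max_abs(const float *w, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = w[i] < 0.0f ? -w[i] : w[i];
        if (a > m) m = a;
    }
    return m;
}

/* Symmetric INT8 quantisation: round to nearest, clamp to [-127, 127]. */
int8_t quantize(float v, float scale) {
    assert(scale > 0.0f);
    float scaled = v / scale;
    if (scaled >= 127.0f) return 127;
    if (scaled <= -127.0f) return -127;
    return (int8_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* One scale per channel. `w` holds `channels` rows of `per_channel`
 * values each; `scales` receives one entry per row. */
void per_channel_scales(const float *w, size_t channels, size_t per_channel,
                        float *scales) {
    for (size_t c = 0; c < channels; c++) {
        float m = max_abs(w + c * per_channel, per_channel);
        scales[c] = m > 0.0f ? m / 127.0f : 1.0f;
    }
}
```

With rows `{63.5, -63.5}` and `{0.5, -0.125}`, a shared scale of 0.5 (set by the loud row) maps the quiet row to `{1, 0}`, so almost all of its resolution is gone. With its own scale, the quiet row maps to `{127, -32}` and uses the full INT8 range.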

Stage 4: memd Memory System + Backend (Weeks 11-17, August-September)

ESP32 firmware:

1. INMP441 → I2S → VAD → 2s capture
2. Mel spectrogram
3. INT8 inference → event_type + embedding
4. Pack memory_packet_t
5. Append to LittleFS
6. OLED update
7. WiFi sync when idle

memd backend (Python):

# SQLite schema
events:   id, timestamp, event_type, confidence, embedding (blob)
sessions: id, event_type, start_time, end_time, packet_count
edges:    session_a, session_b, co_occurrence_count, last_seen

# Queries
cosine_search(query_embedding)     # find similar past events
time_query("3pm yesterday")        # what was happening then
event_query("keyboard")            # when did I last type
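The backend itself is Python, but the arithmetic behind `cosine_search` is language-independent, so it is sketched here in C to match the firmware examples. The sketch ranks by the signed square of the cosine, which orders candidates exactly as cosine similarity does while avoiding a square root. The flat database layout and the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMBED_DIM 32

/* Integer dot product of two INT8 embeddings. */
static int32_t dot_i8(const int8_t *a, const int8_t *b) {
    int32_t acc = 0;
    for (size_t i = 0; i < EMBED_DIM; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* sign(cos) * cos^2 for two embeddings, in [-1, 1].
 * Monotonic in the cosine, so it gives the same ranking.
 * A zero vector has no direction; it scores 0 against everything. */
double signed_sq_cosine(const int8_t *a, const int8_t *b) {
    double dot = (double)dot_i8(a, b);
    double na = (double)dot_i8(a, a);
    double nb = (double)dot_i8(b, b);
    if (na == 0.0 || nb == 0.0) return 0.0;
    double mag = dot < 0.0 ? -dot : dot;
    return dot * mag / (na * nb);
}

/* Brute-force search: index of the stored embedding most similar to
 * `query`, or -1 if the database is empty. `db` holds `count` embeddings
 * of EMBED_DIM values each, stored back to back. */
int nearest_embedding(const int8_t *query, const int8_t *db, size_t count) {
    assert(query != NULL);
    int best = -1;
    double best_score = -2.0; /* below any possible score */
    for (size_t i = 0; i < count; i++) {
        double score = signed_sq_cosine(query, db + i * EMBED_DIM);
        if (score > best_score) {
            best_score = score;
            best = (int)i;
        }
    }
    return best;
}
```

At 32 INT8 values per embedding, a full day of packets is small enough that a linear scan like this is a reasonable baseline before considering any index.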

Optional (adds 2-3 days, worth it): Ollama + llama3.2 for natural language queries. "What did I spend most of yesterday doing?" becomes a real answerable question.

Interesting problems to document:

- Memory deduplication: 400 keyboard packets in a day = 400 entries or one session? Your heuristic is a research question.
- Embedding drift: same keyboard, different distance from mic. How similar are the embeddings? Measure it.
- Power: continuous vs VAD-triggered current draw. Measure with USB power meter. This number will be dramatic.
- Privacy boundary: on-device inference, local backend, no API calls. Make this explicit everywhere.
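One possible deduplication heuristic, offered only as a baseline to measure against: merge consecutive packets of the same event type into a session whenever the gap between them is below a threshold. The sketch is in C for consistency with the firmware examples, although the same logic would sit in the Python backend. The struct mirrors the `sessions` table; `build_sessions` and `max_gap_s` are hypothetical names, and the input is assumed to be sorted by timestamp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t  event_type;
    uint32_t start_time;
    uint32_t end_time;
    uint32_t packet_count;
} session_t;

/* Collapse a time-sorted packet stream into sessions.
 * A packet joins the current session if it has the same event type and
 * arrives within max_gap_s seconds of the session's last packet;
 * otherwise it starts a new session.
 * Returns the number of sessions written (at most out_cap). */
size_t build_sessions(const uint32_t *timestamps, const uint8_t *event_types,
                      size_t n, uint32_t max_gap_s,
                      session_t *out, size_t out_cap) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        session_t *cur = count > 0 ? &out[count - 1] : NULL;
        /* Input must be sorted, or the unsigned gap below would wrap. */
        assert(cur == NULL || timestamps[i] >= cur->end_time);
        if (cur != NULL && cur->event_type == event_types[i] &&
            timestamps[i] - cur->end_time <= max_gap_s) {
            cur->end_time = timestamps[i];
            cur->packet_count++;
        } else {
            if (count == out_cap) break; /* output buffer full */
            out[count].event_type = event_types[i];
            out[count].start_time = timestamps[i];
            out[count].end_time = timestamps[i];
            out[count].packet_count = 1;
            count++;
        }
    }
    return count;
}
```

How the session count changes as `max_gap_s` is varied is itself a result worth plotting, since it shows how sensitive the "one session or 400 entries" answer is to the heuristic.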

Output: Demo video (90s, no narration) + Blog Post 2

Stage 5: Research Positioning (Month 4+, parallel to everything)

Labs to Target

IIIT-H CVIT + Qualcomm Edge AI Lab — Primary target. Qualcomm-funded, directly relevant work on edge AI systems. Sagnik is here. Faculty: Prof. Ramesh Loganathan (Qualcomm lab lead), Prof. C.V. Jawahar.

IISc Bangalore, DESE — Most hardware-aware institution in India. Hardware-software codesign, embedded systems.

IIT Madras, Embedded Systems group — ECE-aligned, real-time systems, IoT.

Microsoft Research India / TCS Research — Worth 1-2 mails each. Harder to get but real.

Cold Mail (send when Blog Post 1 is live)

Subject: Re: [specific paper title] — related work on constrained audio inference

Hi [name],

I read your paper "[title]" — your finding on [specific section]
directly matched a problem I hit building memd, an ambient memory
device on ESP32-S3. I'm running INT8 audio event classification
within 200KB SRAM. My approach to [specific problem]: [your solution],
giving [your numbers].

Writeup: [blog post link]. Code: [GitHub].

I'm a gap year student (applying for Fall 2027 admissions) looking
for a research internship starting [month].
15 minutes to discuss your work on [their area]?

Ayush
ahhyoushh.github.io | github.com/[handle]

Mail 2nd-3rd year PhD students first, not professors. Start with 5 mails when Blog Post 1 is live. Scale to 20 when Blog Post 2 is live.

Twitter: Start Day 1 of Drop Year

Don't wait until something is finished. Post process, not results.

Now:

- "Starting my gap year today. Building memd, an ambient memory device on ESP32-S3. No cloud. No frameworks. Custom C inference from scratch. Thread on why:"
- "Ordered ESP32-S3 N16R8 + INMP441 mic. While it ships: reading Beej's C guide and the MCUNet paper. First time writing serious C."

Each checkpoint: one tweet, specific, with numbers or screenshots. See checkpoints.md for exact post content at each milestone.

Paper threads (Month 4+): One thread per landmark paper. See references.md for thread angles on each paper.

Reddit (milestones only)

Blog Post 1 → r/esp32, r/embedded
Demo video  → r/esp32 (fastest response)
Blog Post 2 → r/MachineLearning, r/selfhosted, r/privacy, r/developersIndia

Hacker News (once, after Blog Post 2)

Show HN: memd, ambient memory device on ESP32-S3, custom audio runtime, no cloud

Post Monday/Tuesday 9-11am US Eastern.

Timeline

Jun Week 1    Hardware ordered. ESP-IDF setup. Beej + MCUNet reading.
              Tweet: "Starting drop year. Building memd."

Jun Week 2    Hardware arrives. INMP441 working. VAD triggering.
              Tweets: Checkpoint 0.2, 0.4

Jun Week 3-4  Python mel spectrogram reference. ESC-50 explored.
              PyTorch model training begins.

Jul Week 1-2  Log-mel in fixed-point C. Match Python reference.
              Tweet: Checkpoint 1.3 thread (mel filterbank finding)

Jul Week 3-4  Float32 C runtime. Match PyTorch output.
              Tweet: Checkpoint 2.2 (the bug that took longest)

Aug Week 1-2  INT8 quantisation per-tensor then per-channel.
              Tweets: Checkpoint 2.3, 2.4 (mel bin clipping finding)

Aug Week 3    ESP32-S3 port. Latency numbers.
              Tweet: Checkpoint 2.5

Aug Week 4    Benchmark vs ESP-DL. Write Blog Post 1.
              BLOG POST 1 → r/esp32, r/embedded

Sep Week 1-2  Memory packet system. LittleFS working.
              Python backend: SQLite graph, cosine retrieval.

Sep Week 3    End-to-end demo working. Record demo video.
              Tweet: demo video

Sep Week 4    Write Blog Post 2.
              BLOG POST 2 → Reddit (4 subs) + HN
              Cold mailing begins

Oct           Cold mailing in full swing (15-20 mails).
              Deep paper reading for specific lab mails.
              Paper threads on Twitter.

Nov           Applications open. SOP ready.
              SOP anchored on Blog Post 1 + Blog Post 2 + internship conversations.

SOP Paragraph (Draft: Fill In Findings Later)

"During my gap year I built memd, a privacy-first ambient memory device, from first principles. Running on an ESP32-S3 microcontroller with 512KB SRAM, memd continuously classifies audio events using a custom INT8 inference runtime I wrote in C — no ML frameworks — and builds a local knowledge graph of daily life events on a companion backend. The core research question I explored: what is the minimum audio representation that preserves sufficient semantic information for episodic memory retrieval under severe hardware constraint? Building this led me to [specific finding about mel quantisation / embedding dimensionality / deduplication], which connects to [lab's work] because [specific reason]. I want to pursue [direction] at [lab] because [their specific research gap you can contribute to]."

The One Rule

Write as you build. memd the firmware is 50% of the project. memd the writeup is the other 50%. Nobody will read your code. Everyone will read your blog post. Sagnik got Karpathy's attention because of the writeup, not the repo. Your constraint — 512KB SRAM, local-only, no cloud — is your "2KB chess engine." Make it the headline of everything you publish.