Objective and use case
What you’ll build: A VAD-gated recorder for an NVIDIA Jetson Xavier NX with a Seeed ReSpeaker USB Mic Array v2.0 that listens continuously, opens the gate only when speech rises above ambient noise, and writes timestamped WAV clips with configurable pre-roll (300 ms default) and post-roll (500 ms default). Runs in real time at 16 kHz with ~50-80 ms gate-on latency.
Why it matters / Use cases
- Meeting capture without dead air: Record only utterances during standups or lab huddles; a 1 h meeting with ~30% speech shrinks storage ~70% (~112 MB -> ~34 MB at 16 kHz, 16-bit mono; see the estimate sketch after this list), making review and search faster.
- Edge keyword-adjacent pipelines: Gate audio before cloud NLP so you upload only speech segments; 5-10x bandwidth reduction and improved privacy by discarding non-speech on device.
- Field audio logging: In workshops with ~70-80 dBA machine hum, adaptive thresholds and ReSpeaker beamforming capture operator notes ("Start Lot 42") while holding false triggers to <1/min.
- Smart door intercoms: Keep 2-10 s clips of visitor speech while ignoring traffic; downstream ASR/diarization runs fewer minutes of audio, cutting cost/time ~80%.
- Wildlife/environmental studies: Trigger on calls above ambient (e.g., frogs/birds at +6 dB SNR), producing labeled clips for later classification without hours of silence.
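The storage figures above come from simple arithmetic; here is a back-of-the-envelope check (a sketch with illustrative values, not measured output):

```python
# Back-of-the-envelope storage estimate for VAD-gated recording.
# Illustrative values; adjust speech_fraction to your environment.
sample_rate = 16_000        # Hz
bytes_per_sample = 2        # 16-bit PCM, mono
seconds = 3600              # 1 hour
speech_fraction = 0.30      # ~30% of the hour contains speech

raw_mb = sample_rate * bytes_per_sample * seconds / 1e6
gated_mb = raw_mb * speech_fraction
print(f"raw: {raw_mb:.0f} MB, gated: {gated_mb:.0f} MB, "
      f"reduction: {100 * (1 - speech_fraction):.0f}%")
# -> raw: 115 MB, gated: 35 MB, reduction: 70%
```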
Expected outcome
- A daemonized recorder ingesting mono PCM from the ReSpeaker via ALSA, resampled to 16 kHz; VAD via Silero running on the GPU through PyTorch, with WebRTC VAD as a lightweight CPU-only alternative (a minimal WebRTC sketch follows this list).
- Latency and buffering: gate-on ~50-80 ms; configurable pre-roll (300 ms default) and post-roll (500 ms default); minimum clip length 0.4 s (default); timestamped filenames. (A hard cap on clip length is a straightforward extension, not implemented in the provided script.)
- Performance on Xavier NX (indicative): WebRTC VAD on the CPU uses ~6-12% of one Carmel core (~2-3% total CPU) with 0% GPU; Silero running on the GPU via PyTorch uses roughly ~5-8% GPU and ~5-7% CPU; real-time factor 0.05-0.1.
- Storage efficiency: with 25-30% speech density, 1 h of raw audio (~112 MB at 16 kHz, 16-bit mono) drops to ~28-34 MB (70-75% reduction).
- Noise robustness: Optional ReSpeaker beamforming/NS/AGC and DOA focus reduce false positives in noisy spaces; VAD threshold and hangover tuned for 65-80 dBA environments.
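For the CPU-only WebRTC alternative mentioned above, a minimal sketch is shown below. It assumes the webrtcvad package (install with python3 -m pip install webrtcvad; it is not part of the dependency list later in this guide), and the frame size and aggressiveness level are illustrative:

```python
# Minimal WebRTC VAD check: classifies one 30 ms frame of 16-bit mono PCM.
import webrtcvad

SR = 16000
FRAME_SAMPLES = int(0.03 * SR)   # WebRTC VAD accepts 10, 20, or 30 ms frames
vad = webrtcvad.Vad(2)           # aggressiveness 0 (lenient) .. 3 (strict)

def is_speech(frame_bytes: bytes) -> bool:
    """frame_bytes: exactly FRAME_SAMPLES little-endian int16 samples."""
    return vad.is_speech(frame_bytes, SR)
```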
Audience: Edge audio/AI developers, embedded/robotics integrators, applied ML engineers; Level: Intermediate-Advanced
Architecture/flow: USB mic -> ALSA capture -> optional ReSpeaker beamforming/NS/AGC -> 16 kHz resample -> VAD (Silero on the GPU via PyTorch, or WebRTC on the CPU) -> ring buffer (pre-roll) -> gate state machine (hangover) -> segmenter -> timestamped WAV writer; metrics log RTF, gate latency, CPU/GPU%.
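To make the gate state machine above concrete, here is a minimal, dependency-free sketch of the hysteresis plus pre/post-roll logic that the full recorder implements later; vad_prob stands in for any per-frame VAD, and the frame counts are illustrative (10 x 30 ms frames for ~300 ms pre-roll, 17 for ~510 ms post-roll):

```python
# Sketch of the VAD gate: ring buffer for pre-roll, hysteresis thresholds,
# and a hangover counter for post-roll. Yields one list of frames per segment.
from collections import deque

def gate(frames, vad_prob, open_th=0.6, close_th=0.3,
         pre_roll_frames=10, post_roll_frames=17):
    pre = deque(maxlen=pre_roll_frames)   # pre-roll ring buffer
    segment, hangover, is_open = [], 0, False
    for f in frames:
        p = vad_prob(f)
        if not is_open:
            pre.append(f)
            if p >= open_th:              # gate opens: flush pre-roll (incl. this frame)
                is_open, hangover = True, post_roll_frames
                segment = list(pre)
        else:
            segment.append(f)
            if p < close_th:              # candidate end of speech
                hangover -= 1
                if hangover <= 0:         # post-roll elapsed: close the gate
                    yield segment
                    segment, is_open = [], False
                    pre.clear()
            else:
                hangover = post_roll_frames  # speech continues; reset post-roll
    if is_open and segment:               # flush a segment left open at end of stream
        yield segment
```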
Prerequisites
- Ubuntu (from JetPack/L4T) on an NVIDIA Jetson Xavier NX Developer Kit.
- Internet access for package/model downloads.
- Basic terminal skills; ability to edit/run Python scripts.
- A quiet moment for calibration (2–5 seconds of background noise only).
Materials
- Computing platform:
- EXACT model: NVIDIA Jetson Xavier NX Developer Kit
- Microphone:
- EXACT model: Seeed ReSpeaker USB Mic Array v2.0 (XMOS XVF-3000)
- Cables and power:
- USB-A to USB-C cable (for ReSpeaker)
- Jetson power supply (per NVIDIA spec)
- Optional (for validation):
- USB camera or CSI camera (not required for this project; used only for optional device sanity checks)
Setup/Connection
1) Physical connection (text-only description)
- Place the Xavier NX Developer Kit where ambient noise is representative.
- Plug the Seeed ReSpeaker USB Mic Array v2.0 into a USB 3.0/2.0 port on the Jetson using a USB-A to USB-C cable.
- On the mic array, ensure mic LEDs indicate normal power; avoid placing it too close to fans or vents to reduce wind noise.
2) Verify JetPack, model, and NVIDIA components
Run:
cat /etc/nv_tegra_release
jetson_release -v || true
uname -a
dpkg -l | grep -E 'nvidia|tensorrt'
Example expected snippet (yours may differ): L4T R35.x (JetPack 5.x), kernel version shows aarch64, packages include tensorrt and nvidia-l4t-*.
3) Power mode and clocks (performance and thermal note)
For the most accurate timing, switch to the highest-power mode available during tests (Xavier NX exposes 10 W/15 W and, on newer JetPack releases, 20 W modes rather than an AGX-style MAXN; see /etc/nvpmodel.conf for the exact list on your image). Watch thermals and revert after validation.
sudo nvpmodel -q
sudo nvpmodel -m 0 # Select a high-performance mode; mode numbering/names vary by image
sudo jetson_clocks # Maximize clocks (temporary)
To revert later:
sudo nvpmodel -m 2 # Typical balanced mode (varies by image)
sudo systemctl restart nvfancontrol || true
4) Detect the ReSpeaker microphone (ALSA)
List capture devices:
arecord -l
arecord -L | grep -i -E 'seeed|respeaker|xmos|mic'
Note the “card” and “device” identifiers. The ReSpeaker typically enumerates as a multi-channel UAC2 device; we capture it in mono.
Quick 5-second capture test (replace hw:CARD=ReSpeaker,DEV=0 with your card/device; if the hw: device rejects mono or 16 kHz, try plughw:CARD=ReSpeaker,DEV=0, which converts rate/format/channels in software):
arecord -D "hw:CARD=ReSpeaker,DEV=0" -f S16_LE -r 16000 -c 1 -d 5 /tmp/test_mic.wav
aplay /tmp/test_mic.wav
If 16 kHz is unsupported, record at 48 kHz and resample later in software; the Python recorder includes a --resample-from fallback, and a short offline resampling sketch follows.
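If the device only captures at 48 kHz, a quick offline resample of the test clip can be done with torchaudio (installed in the next step); paths are illustrative, and sox test48k.wav -r 16000 test16k.wav achieves the same with the sox package installed below:

```python
# Offline 48 kHz -> 16 kHz resample of a test capture (illustrative paths).
import torchaudio

wav, sr = torchaudio.load("/tmp/test_mic_48k.wav")   # wav shape: (channels, samples)
wav16 = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("/tmp/test_mic_16k.wav", wav16, 16000)
print(f"resampled {sr} Hz -> 16000 Hz, {wav16.shape[1]} samples")
```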
5) Install dependencies (Python, audio, PyTorch GPU)
Update and install system audio libs:
sudo apt-get update
sudo apt-get install -y python3-pip python3-dev libportaudio2 libsndfile1 sox alsa-utils
python3 -m pip install --upgrade pip
Install Jetson-compatible PyTorch/TorchAudio (JetPack 5.x example; wheel names and the index path change between JetPack releases, so check NVIDIA's "PyTorch for Jetson" documentation for the exact versions; on some releases torchaudio/torchvision must be built from source or installed from separate Jetson wheels):
# NVIDIA-provided wheels index for JetPack 5.x:
export PIP_EXTRA_INDEX_URL="https://developer.download.nvidia.com/compute/redist/jp/v51"
python3 -m pip install "torch==2.1.0+nv23.10" "torchaudio==2.1.0+nv23.10" "torchvision==0.16.0+nv23.10"
Install Python packages for I/O and VAD utilities:
python3 -m pip install sounddevice soundfile numpy scipy
Confirm CUDA is available in PyTorch:
python3 - << 'PY'
import torch
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Using:", torch.cuda.get_device_name(0))
PY
6) Optional camera sanity check (not required for this project)
CSI camera short test:
gst-launch-1.0 nvarguscamerasrc num-buffers=60 ! nvvidconv ! video/x-raw,format=I420 ! fakesink
USB camera short test (replace /dev/video0 if needed):
gst-launch-1.0 v4l2src device=/dev/video0 num-buffers=120 ! videoconvert ! video/x-raw ! fakesink
These are just to confirm the multimedia stack; the project itself only uses audio.
Full Code
We provide two Python programs:
- gpu_sanity.py: verifies GPU execution and times Silero VAD inference on random chunks.
- vad_recorder.py: the actual VAD noise-gated recorder with calibration, hysteresis, noise floor tracking, pre/post-roll, and file writing.
gpu_sanity.py (PyTorch GPU timing for Silero VAD)
```python
# gpu_sanity.py
import time

import torch


def main():
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    print(f"Device: {device}")

    # Load Silero VAD from Torch Hub (returns (model, utils))
    model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad', trust_repo=True)
    model.to(device)
    model.eval()

    # 30 ms at 16 kHz = 480 samples.
    # Note: some recent Silero VAD releases expect exactly 512 samples per call
    # at 16 kHz; if the model raises a chunk-size error, set frame_len = 512.
    sr = 16000
    frame_len = int(0.03 * sr)

    # Warm-up runs
    warmup = 50
    with torch.inference_mode():
        for _ in range(warmup):
            x = torch.randn(frame_len, dtype=torch.float32, device=device)
            _ = model(x, sr)

    iters = 500
    if device.type == 'cuda':
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.inference_mode():
        for _ in range(iters):
            x = torch.randn(frame_len, dtype=torch.float32, device=device)
            _ = model(x, sr)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    t1 = time.perf_counter()

    avg_ms = (t1 - t0) * 1000.0 / iters
    rtf = avg_ms / 30.0  # relative to a 30 ms frame
    print(f"Avg inference per 30 ms frame: {avg_ms:.3f} ms (RTF={rtf:.3f})")
    print("Goal: keep RTF << 1.0 for real-time streaming.")


if __name__ == "__main__":
    main()
```
vad_recorder.py (VAD noise-gated recorder)
```python
# vad_recorder.py
import os
import time
import math
import queue
import argparse
import threading
import datetime as dt
from collections import deque

import numpy as np
import sounddevice as sd
import soundfile as sf
import torch
import torchaudio


def dbfs(x):
    eps = 1e-12
    rms_val = np.sqrt(np.mean(x.astype(np.float64) ** 2) + eps)
    return 20 * math.log10(rms_val + eps)


def rms(x):
    return np.sqrt(np.mean(x.astype(np.float64) ** 2))


def timestamp():
    return dt.datetime.now().strftime("%Y%m%d_%H%M%S_%f")


def ensure_dir(p):
    os.makedirs(p, exist_ok=True)
    return p


class NoiseGateVADRecorder:
    def __init__(self,
                 device_name=None,
                 samplerate=16000,
                 channels=1,
                 frame_ms=30,
                 pre_roll_ms=300,
                 post_roll_ms=500,
                 vad_open=0.60,
                 vad_close=0.30,
                 noise_multiplier=2.5,
                 min_segment_ms=400,
                 out_dir="recordings",
                 resample_from=None):
        self.device_name = device_name
        self.samplerate = samplerate
        self.channels = channels
        # Note: some recent Silero VAD releases expect exactly 512 samples per
        # call at 16 kHz; if the model raises a chunk-size error, run with
        # --frame-ms 32 (512 samples at 16 kHz).
        self.frame_len = int(frame_ms * samplerate / 1000)
        self.pre_roll_frames = int(pre_roll_ms * samplerate / 1000)
        self.post_roll_frames = int(post_roll_ms * samplerate / 1000)
        self.min_segment_frames = int(min_segment_ms * samplerate / 1000)
        self.vad_open = vad_open
        self.vad_close = vad_close
        self.noise_multiplier = noise_multiplier
        self.out_dir = ensure_dir(out_dir)
        self.resample_from = resample_from  # e.g., 48000 if device doesn't do 16k

        # PyTorch VAD model (Silero)
        self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
        self.model, _ = torch.hub.load('snakers4/silero-vad', 'silero_vad', trust_repo=True)
        self.model.to(self.device)
        self.model.eval()

        # Optional resampler (GPU accelerated if CUDA)
        if self.resample_from and self.resample_from != self.samplerate:
            self.resampler = torchaudio.transforms.Resample(
                orig_freq=self.resample_from,
                new_freq=self.samplerate).to(self.device)
        else:
            self.resampler = None

        # Streaming state
        self.q = queue.Queue(maxsize=100)
        self.stop_flag = threading.Event()
        self.triggered = False
        self.post_countdown = 0
        self.noise_ewma = 1e-6   # initialize with tiny value
        self.noise_alpha = 0.95  # slow EWMA for noise floor
        self.pre_roll = deque(maxlen=self.pre_roll_frames)  # holds individual samples
        self.active_buffer = []
        self.total_frames_in_segment = 0

        # Telemetry
        self.chunks_processed = 0
        self.avg_ms = 0.0  # running total of inference ms (averaged at log time)

    def _audio_callback(self, indata, frames, time_info, status):
        if status:
            print(f"[ALSA] {status}")
        # Keep only the first channel
        data = indata[:, 0].copy()
        self.q.put(data)

    def _open_stream(self):
        # Attempt the desired samplerate; fall back if needed
        if self.resample_from:
            sr_device = self.resample_from
        else:
            sr_device = self.samplerate
        if self.resample_from is None:
            blocksize = self.frame_len
        else:
            blocksize = int(self.frame_len * sr_device / self.samplerate)
        stream = sd.InputStream(
            device=self.device_name,
            channels=self.channels,
            samplerate=sr_device,
            dtype='float32',
            blocksize=blocksize,
            callback=self._audio_callback)
        return stream, sr_device

    def calibrate_noise(self, seconds=2.0):
        print("Calibration: please stay quiet...")
        samples_needed = int(seconds * (self.resample_from or self.samplerate))
        buf = []
        collected = 0
        with sd.InputStream(device=self.device_name,
                            channels=self.channels,
                            samplerate=(self.resample_from or self.samplerate),
                            dtype='float32',
                            blocksize=1024) as s:
            while collected < samples_needed:
                data, _ = s.read(1024)
                buf.append(data[:, 0])
                collected += len(data)
        arr = np.concatenate(buf, axis=0)
        # Resample if needed
        if self.resample_from and self.resample_from != self.samplerate:
            with torch.inference_mode():
                x = torch.from_numpy(arr).to(self.device).float()
                x = x.unsqueeze(0)
                x = self.resampler(x).squeeze(0)
                arr = x.cpu().numpy()
        base_rms = rms(arr)
        self.noise_ewma = base_rms
        print(f"Calibration done: noise floor RMS={base_rms:.6f} ({dbfs(arr):.1f} dBFS)")

    def should_open(self, speech_prob, frame):
        # Noise gate check
        current_rms = rms(frame)
        self.noise_ewma = self.noise_alpha * self.noise_ewma + (1 - self.noise_alpha) * current_rms
        noise_gate = current_rms > (self.noise_multiplier * self.noise_ewma)
        return (speech_prob >= self.vad_open) or noise_gate

    def should_close(self, speech_prob):
        return speech_prob < self.vad_close

    def run(self):
        print(f"Using device: {self.device}, model on: {self.device.type}")
        stream, sr_device = self._open_stream()
        print(f"Audio device SR={sr_device} Hz; target SR={self.samplerate} Hz")
        with stream, torch.inference_mode():
            segment_writer = None
            file_path = None
            last_log = time.time()
            while not self.stop_flag.is_set():
                try:
                    raw = self.q.get(timeout=0.5)
                except queue.Empty:
                    continue

                # Resample if needed
                if self.resample_from and self.resample_from != self.samplerate:
                    x = torch.from_numpy(raw).to(self.device).float()
                    x = x.unsqueeze(0)
                    x = self.resampler(x).squeeze(0).contiguous()
                    chunk = x.cpu().numpy()
                else:
                    chunk = raw

                # Ensure frame length equals configured frame_len (pad/trim if needed)
                if len(chunk) < self.frame_len:
                    chunk = np.pad(chunk, (0, self.frame_len - len(chunk)))
                elif len(chunk) > self.frame_len:
                    chunk = chunk[:self.frame_len]

                t0 = time.perf_counter()
                # GPU VAD inference
                x_t = torch.from_numpy(chunk).to(self.device).float()
                prob = float(self.model(x_t, self.samplerate).item())
                t1 = time.perf_counter()

                # Telemetry update
                self.chunks_processed += 1
                self.avg_ms += (t1 - t0) * 1000.0

                if not self.triggered:
                    # Accumulate pre-roll
                    self.pre_roll.extend(chunk)
                    if self.should_open(prob, chunk):
                        self.triggered = True
                        self.post_countdown = self.post_roll_frames
                        self.total_frames_in_segment = 0
                        # Open file
                        file_path = os.path.join(self.out_dir, f"{timestamp()}_segment.wav")
                        segment_writer = sf.SoundFile(file_path, mode='w',
                                                      samplerate=self.samplerate,
                                                      channels=1, subtype='PCM_16')
                        # Write pre-roll (includes the chunk that triggered the gate)
                        if len(self.pre_roll) > 0:
                            segment_writer.write(np.array(self.pre_roll, dtype=np.float32))
                            self.total_frames_in_segment += len(self.pre_roll)
                        print(f"[OPEN] {file_path} (prob={prob:.2f}, noise_rms={self.noise_ewma:.6f})")
                else:
                    # Write current chunk
                    segment_writer.write(chunk.astype(np.float32))
                    self.total_frames_in_segment += len(chunk)
                    # Check if we should close now or extend post-roll
                    if self.should_close(prob):
                        self.post_countdown -= len(chunk)
                        if self.post_countdown <= 0:
                            if self.total_frames_in_segment >= self.min_segment_frames:
                                print(f"[CLOSE] {file_path} len={self.total_frames_in_segment/self.samplerate:.2f}s")
                                segment_writer.close()
                            else:
                                print(f"[DROP] too short: {self.total_frames_in_segment/self.samplerate:.2f}s -> {file_path}")
                                segment_writer.close()
                                os.remove(file_path)
                            # Reset state
                            segment_writer = None
                            file_path = None
                            self.triggered = False
                            self.post_countdown = 0
                            self.active_buffer = []
                            self.total_frames_in_segment = 0
                            self.pre_roll.clear()
                    else:
                        # Speech continues; reset post-roll
                        self.post_countdown = self.post_roll_frames

                # Periodic log
                if time.time() - last_log >= 1.0:
                    avg_ms = self.avg_ms / max(1, self.chunks_processed)
                    rtf = avg_ms / (1000.0 * self.frame_len / self.samplerate)
                    print(f"[STAT] chunks={self.chunks_processed}, avg_ms={avg_ms:.2f}, RTF={rtf:.3f}, "
                          f"prob={prob:.2f}, noise_rms={self.noise_ewma:.6f}, "
                          f"state={'ON' if self.triggered else 'OFF'}")
                    last_log = time.time()

    def stop(self):
        self.stop_flag.set()


def list_devices():
    print(sd.query_devices())


def parse_args():
    ap = argparse.ArgumentParser(description="VAD noise-gated recorder (Jetson Xavier NX + ReSpeaker)")
    ap.add_argument("--device-name", type=str, default=None, help="ALSA/PortAudio device name or index")
    ap.add_argument("--samplerate", type=int, default=16000, help="Target sample rate (Hz)")
    ap.add_argument("--channels", type=int, default=1, help="Channels (mono recommended)")
    ap.add_argument("--frame-ms", type=int, default=30, help="Frame duration in ms")
    ap.add_argument("--pre-roll-ms", type=int, default=300, help="Audio to keep before trigger (ms)")
    ap.add_argument("--post-roll-ms", type=int, default=500, help="Audio to keep after speech stops (ms)")
    ap.add_argument("--vad-open", type=float, default=0.60, help="Open threshold (speech prob)")
    ap.add_argument("--vad-close", type=float, default=0.30, help="Close threshold (speech prob)")
    ap.add_argument("--noise-mult", type=float, default=2.5, help="Noise gate multiplier over EWMA RMS")
    ap.add_argument("--min-segment-ms", type=int, default=400, help="Drop segments shorter than this")
    ap.add_argument("--out", type=str, default="recordings", help="Output directory")
    ap.add_argument("--resample-from", type=int, default=None,
                    help="If device cannot do 16k, set the device SR (e.g., 48000) to resample on GPU")
    ap.add_argument("--list-devices", action="store_true", help="List audio devices and exit")
    ap.add_argument("--calib-sec", type=float, default=2.0, help="Calibration seconds (silence)")
    return ap.parse_args()


if __name__ == "__main__":
    args = parse_args()
    if args.list_devices:
        list_devices()
        raise SystemExit(0)
    rec = NoiseGateVADRecorder(
        device_name=args.device_name,
        samplerate=args.samplerate,
        channels=args.channels,
        frame_ms=args.frame_ms,
        pre_roll_ms=args.pre_roll_ms,
        post_roll_ms=args.post_roll_ms,
        vad_open=args.vad_open,
        vad_close=args.vad_close,
        noise_multiplier=args.noise_mult,
        min_segment_ms=args.min_segment_ms,
        out_dir=args.out,
        resample_from=args.resample_from
    )
    rec.calibrate_noise(seconds=args.calib_sec)
    try:
        rec.run()
    except KeyboardInterrupt:
        print("Stopping...")
    finally:
        rec.stop()
```
Build/Flash/Run commands
No firmware flashing is required. Use these commands to prepare and run the project.
1) Clone or create a working folder and place scripts:
mkdir -p ~/jetson_vad
cd ~/jetson_vad
nano gpu_sanity.py # paste the content
nano vad_recorder.py # paste the content
2) Verify GPU path (PyTorch GPU inference):
python3 gpu_sanity.py
# Example output:
# Device: cuda:0
# Avg inference per 30 ms frame: 0.45 ms (RTF=0.015)
3) Identify the ReSpeaker device name/index:
python3 - << 'PY'
import sounddevice as sd
for i, dev in enumerate(sd.query_devices()):
    print(i, dev['name'], 'max input ch:', dev['max_input_channels'])
PY
4) Start tegrastats in a separate terminal for live metrics:
sudo tegrastats --interval 1000 | tee /tmp/tegrastats_vad.log
5) Run the VAD recorder. Replace --device-name with the name or index you saw (e.g., "ReSpeaker USB Mic Array (UAC2.0)" or 2). If the device does not support 16 kHz, specify --resample-from 48000.
cd ~/jetson_vad
python3 vad_recorder.py \
--device-name "ReSpeaker USB Mic Array (UAC2.0)" \
--samplerate 16000 \
--channels 1 \
--frame-ms 30 \
--pre-roll-ms 300 \
--post-roll-ms 500 \
--vad-open 0.60 \
--vad-close 0.30 \
--noise-mult 2.5 \
--min-segment-ms 400 \
--out recordings \
--calib-sec 3.0
Example with GPU resampling from 48 kHz:
python3 vad_recorder.py --device-name "ReSpeaker USB Mic Array (UAC2.0)" --resample-from 48000
6) Stop tegrastats when done (Ctrl+C). Revert power settings if changed:
sudo nvpmodel -m 2
Step-by-step Validation
Follow this sequence to verify correctness and measure performance.
1) Device recognition and rates
- Confirm the ReSpeaker enumerates via arecord -l and that mono capture is supported.
- If 16 kHz mono is not supported directly, plan to use --resample-from 48000.
2) PyTorch GPU availability
- Run gpu_sanity.py.
- Success criteria: CUDA available = True, the device shows the NVIDIA GPU name, and average inference per 30 ms frame is well below 30 ms (RTF < 0.1 recommended on Xavier NX).
3) Calibration
- Start vad_recorder.py; you’ll see “Calibration: please stay quiet…”.
- Keep the room quiet for the specified seconds.
- Success criteria: the printed noise floor RMS is stable (e.g., ~0.001–0.01 for typical rooms) and dBFS is roughly −50 to −35 dBFS (depends on gain).
4) Live stats during run
- Observe the periodic [STAT] lines:
  - avg_ms: average GPU inference time per chunk (ms).
  - RTF: avg_ms divided by the frame duration in ms. Must be <1.0; typically ~0.01–0.2.
  - prob: last speech probability (0–1).
  - noise_rms: EWMA noise floor estimate.
  - state: ON when the gate is open (recording), OFF when closed.
Example expected log snippet:
- [STAT] chunks=120, avg_ms=0.55, RTF=0.018, prob=0.12, noise_rms=0.003210, state=OFF
- [OPEN] recordings/20241108_141512_123456_segment.wav (prob=0.77, noise_rms=0.003250)
- [STAT] chunks=240, avg_ms=0.58, RTF=0.019, prob=0.82, noise_rms=0.003280, state=ON
- [CLOSE] recordings/20241108_141514_654321_segment.wav len=1.12s
5) Tegrastats metrics
- While running, sample outputs in /tmp/tegrastats_vad.log:
  - Look for “GR3D_FREQ” (GPU), “RAM x/yMB”, “EMC_FREQ”.
- Success criteria:
  - GPU <25% usage on average for this workload (varies).
  - CPU load moderate; no memory pressure.
6) File validation
- Check the output folder:
```bash
ls -lh recordings
soxi recordings/*.wav | sed -n '1,12p'
```
- Success criteria:
- Files exist with sizes >0.
- soxi shows Sample Rate 16000, Channels 1, and durations consistent with your speaking time plus pre/post-roll (an equivalent soundfile-based check follows this step).
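As an alternative to soxi, a small soundfile-based check (the package is already installed for the recorder) reports the same fields:

```python
# Verify sample rate, channels, and duration of the recorded segments.
import glob
import soundfile as sf

for path in sorted(glob.glob("recordings/*.wav")):
    info = sf.info(path)
    print(f"{path}: {info.samplerate} Hz, {info.channels} ch, "
          f"{info.frames / info.samplerate:.2f} s")
```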
7) Audio quality and gating behavior
- Play a sample:
```bash
for f in recordings/*.wav; do echo ">>> $f"; sox "$f" -n stat 2>&1 | grep -E 'RMS.*amplitude|Maximum amplitude|Length'; done
aplay recordings/$(ls recordings | head -n 1)
```
- Success criteria:
- Speech clearly captured; leading/trailing silence trimmed except for pre/post-roll.
- No perceptible clipping; “Maximum amplitude” in sox stat near but below 1.0.
8) Optional negative test
- Keep silent or play constant fan noise; ensure no segments are created (or very few).
- If false triggers occur, increase --noise-mult or raise --vad-open slightly.
Quantitative metrics to report
- From gpu_sanity.py:
- Average ms per 30 ms frame and RTF.
- From vad_recorder.py:
- avg_ms, RTF, and final number of WAV segments created over a 5-minute run.
- From tegrastats:
- Average GR3D_FREQ utilization and max RAM usage (a log-parsing sketch follows this list).
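A rough sketch for pulling those numbers out of the tegrastats log captured earlier; the line format differs slightly across L4T releases, so treat the regular expressions as a starting point rather than a guaranteed parser:

```python
# Summarize GR3D_FREQ (GPU) utilization and peak RAM from a tegrastats log.
import re

gpu_pct, ram_mb = [], []
with open("/tmp/tegrastats_vad.log") as f:
    for line in f:
        m = re.search(r"GR3D_FREQ (\d+)%", line)
        if m:
            gpu_pct.append(int(m.group(1)))
        m = re.search(r"RAM (\d+)/(\d+)MB", line)
        if m:
            ram_mb.append(int(m.group(1)))

if gpu_pct:
    print(f"GR3D_FREQ avg: {sum(gpu_pct) / len(gpu_pct):.1f}% over {len(gpu_pct)} samples")
if ram_mb:
    print(f"RAM peak: {max(ram_mb)} MB")
```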
Troubleshooting
- ReSpeaker not listed in arecord -l
  - Try unplugging/replugging, or a different USB port.
  - Check dmesg | tail for USB errors.
  - Ensure your user is in the audio group: sudo usermod -aG audio $USER; then log out/in.
- 16 kHz not supported by the device
  - Use --resample-from 48000; the script performs GPU-accelerated resampling to 16 kHz. (A quick capability-check sketch follows this list.)
- No audio, or XRUN (overruns/underruns)
  - Reduce frame size variability: keep --frame-ms at 30 or 20.
  - Close other audio apps (PulseAudio can interfere). Optionally run: pasuspender -- python3 vad_recorder.py ...
  - If needed, reduce tegrastats overhead by sampling less often (a larger --interval value).
- Too many false positives (records noise)
  - Increase --noise-mult (e.g., 3.0 or 4.0).
  - Raise --vad-open (e.g., 0.7) and/or raise --vad-close (e.g., 0.4) for stronger hysteresis.
  - Extend calibration (--calib-sec 5) in your real environment.
- Missed words (gate opens late)
  - Increase --pre-roll-ms to 500–800 ms to capture initial consonants.
  - Lower --vad-open slightly (e.g., 0.55) if speech is soft.
- Files are too short, or too many fragments
  - Increase --post-roll-ms to 800–1200 ms to bridge brief pauses.
  - Increase --min-segment-ms to suppress very short clips.
- Jetson overheats or throttles
  - Do not leave jetson_clocks enabled for long unattended runs.
  - Provide airflow or a heatsink/fan kit.
  - Revert to a balanced power mode: sudo nvpmodel -m 2.
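For the "16 kHz not supported" case above, sounddevice can probe the device before you commit to --resample-from; this is a sketch, and the device name must match what --list-devices prints on your system:

```python
# Probe whether the capture device accepts 16 kHz mono float32 directly.
import sounddevice as sd

DEVICE = "ReSpeaker USB Mic Array (UAC2.0)"   # or a numeric index

try:
    sd.check_input_settings(device=DEVICE, samplerate=16000,
                            channels=1, dtype="float32")
    print("16 kHz mono capture OK; --resample-from is not needed")
except Exception as exc:
    print(f"16 kHz mono not supported directly ({exc}); use --resample-from 48000")
```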
Improvements
- Multi-channel beamforming: The ReSpeaker has multiple mics; integrate a beamformer (e.g., torchaudio or PyTorch-based MVDR) to enhance SNR before VAD.
- Keyword-conditioned gating: Use a wake-word detector in tandem; only save segments that follow the wake word.
- File management: Rotate older files, compress to FLAC (a conversion sketch follows this list), or upload metadata only for privacy.
- Metrics logger: Export stats to Prometheus or CSV, plotting RTF vs. ambient noise over time.
- Automatic gain control (AGC) tuning: Use the ReSpeaker’s XMOS DSP controls or software AGC for stable levels.
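A minimal sketch of the FLAC idea from the file-management item above, using soundfile (paths illustrative; FLAC is lossless and typically shrinks speech WAVs by roughly half):

```python
# Batch-convert finished WAV segments to FLAC and drop the originals.
import glob
import os

import soundfile as sf

for wav_path in glob.glob("recordings/*.wav"):
    data, sr = sf.read(wav_path)
    flac_path = wav_path[:-4] + ".flac"
    sf.write(flac_path, data, sr, format="FLAC")
    os.remove(wav_path)      # keep only the FLAC copy
    print(f"{wav_path} -> {flac_path}")
```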
Reference parameters table
| Parameter | Default | Range/Notes |
|---|---|---|
| Sample rate | 16000 | Use 16 kHz for Silero VAD; resample from 48 kHz if needed |
| Frame size (ms) | 30 | 20–40 ms typical; smaller is lower latency, more overhead |
| Pre-roll (ms) | 300 | 200–800 ms to keep initial phonemes |
| Post-roll (ms) | 500 | 300–1200 ms to bridge short pauses |
| VAD open threshold | 0.60 | 0.5–0.8; higher = stricter |
| VAD close threshold | 0.30 | 0.2–0.5; maintain hysteresis |
| Noise multiplier | 2.5 | 2–5x EWMA RMS; higher = fewer false opens |
| Min segment (ms) | 400 | Drop ultra-short segments |
Expected metrics and success criteria
- GPU-accelerated AI path: PyTorch GPU
- gpu_sanity.py shows CUDA True, RTF < 0.1 for 30 ms frames.
- Recorder run (5 minutes, quiet office + occasional speech)
- Segments created: 10–30
- Average segment length: 1–6 seconds
- VAD avg_ms: ~0.4–1.5 ms per chunk on Xavier NX
- tegrastats: GR3D_FREQ <25% avg; CPU overall <35%
- Disk space: <50 MB (depends on speech amount)
Cleanup and power revert
- Stop scripts (Ctrl+C).
- Stop tegrastats and revert power mode:
sudo nvpmodel -m 2
- Optional: restore default clocks by rebooting (or with sudo jetson_clocks --restore if you previously saved settings using sudo jetson_clocks --store), and restart automatic fan control:
sudo systemctl restart nvfancontrol || true
- Remove recordings if desired:
rm -rf recordings
Final Checklist
- Objective met: VAD noise-gated recorder runs on NVIDIA Jetson Xavier NX Developer Kit + Seeed ReSpeaker USB Mic Array v2.0 (XMOS XVF-3000).
- JetPack verified; power mode set and reverted.
- PyTorch GPU path confirmed with torch.cuda.is_available() and timed inference.
- Microphone detected, calibration completed; recording only when speech occurs.
- Code produced WAV segments with correct sample rate and durations.
- Quantitative metrics collected: avg_ms, RTF, GPU utilization from tegrastats.
- Troubleshooting applied if false triggers or missed words occurred.
- No GUI steps used; all commands and paths provided.
Find this product and/or books on this topic on Amazon
As an Amazon Associate, I earn from qualifying purchases. If you buy through this link, you help keep this project running.