A Python-based SIP relay server with advanced AI integration for intelligent voice interactions. Supports SIP signaling, RTP media handling, WebSocket-driven real-time control, and AI-powered audio transcription and response generation. Designed for call routing, audio streaming, SIP integration, and AI-assisted call center functionality.
- SIP Signaling - INVITE, ACK, BYE, CANCEL message handling
- RTP Media Streaming - Real-time audio using G.711 (PCMA/PCMU)
- Dynamic RTP Port Allocation - Automatic port management with pair allocation
- Multi-Session Management - Handle multiple concurrent calls
- Call Recording - Automatic timestamped WAV recordings of inbound audio
- Dual Operation Modes - Incoming call handling (server) and outgoing call initiation (client)
- Speech-to-Text - Faster-Whisper for local transcription (multi-language)
- Text-to-Speech - Piper TTS with multi-language voice models
- LLM Integration - Three backend options:
  - API Backend (remote HTTP LLM server)
  - Local Backend (Qwen3 model on GPU)
  - OpenAI Backend (GPT-4o-mini)
- Voice Activity Detection (VAD) - Silero VAD for speech boundary detection
- Language Detection - Automatic language identification (langid)
- WebSocket Interface - Bi-directional control and audio transmission
- Real-time Audio Streaming - Base64-encoded audio over WebSocket
- Environment-based Configuration - Centralized config with .env support
┌─────────────────────────────────────────────────────────────────┐
│ SIP Server v2 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ SIP Clients │◄───────►│ RelayServer │ │
│ │ (VoIP) │ UDP │ (SIP Signaling)│ │
│ └──────────────┘ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ SIPRTPSession │ │
│ │ Management │ │
│ └────────┬────────┘ │
│ │ │
│ ┌──────────────────────────┼──────────────────────┐ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌────────────────▼───────┐ ┌──────────▼──┐ │
│ │ RTPHandler │ │ VADHandler │ │ SIPParsers │ │
│ │ (Audio I/O) │ │ (Voice Activity Detect)│ │ (SIP/SDP) │ │
│ └──────┬──────┘ └────────────────────────┘ └─────────────┘ │
│ │ │
│ ┌──────▼──────────────────────────────────┐ │
│ │ WebSocket Server │ │
│ │ (Real-time control & audio feed) │ │
│ └──────┬──────────────────────────────────┘ │
│ │ │
│ ┌──────▼──────────────────────────────────┐ │
│ │ Call Center (AI Mode) │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ STT │ │ LLM │ │ TTS │ │ │
│ │ │(Whisper)│ │(Backend)│ │ (Piper) │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
| Component | File | Description |
|---|---|---|
| RelayServer | receive_server.py | Main SIP message handler and session orchestrator |
| SIPRTPSession | helper/sip_session.py | RTP session lifecycle, port allocation, audio handling |
| RTPHandler | helper/rtp_handler.py | Bidirectional RTP packet transmission and reception |
| VADHandler | helper/rtp_handler.py | Silero-based voice activity detection |
| SIPMessageParser | helper/sip_parsers.py | SIP/SDP message parsing and validation |
| WebSocketServer | helper/ws_helper.py | WebSocket communication for real-time control |
| CallCenter | call_center.py | AI pipeline: STT → LLM → TTS |
| LLM Backends | helper/llm_backends/ | Pluggable LLM backends (API, Local, OpenAI) |
| Config | config.py | Centralized configuration management |
SIP_server_v2/
├── main.py # Main entry point
├── receive_server.py # SIP relay server
├── call_center.py # AI call center implementation
├── config.py # Configuration management
│
├── helper/
│ ├── rtp_handler.py # RTP packet handling + VAD
│ ├── sip_session.py # Session management
│ ├── sip_parsers.py # SIP message parsing
│ ├── ws_helper.py # WebSocket server
│ ├── ws_command.py # WebSocket command helpers
│ ├── wav_handler.py # WAV file operations
│ ├── custom_sts_handler.py # Faster-Whisper STT + Piper TTS
│ ├── openai_sts_handler.py # OpenAI STT/TTS (legacy)
│ ├── PROMPT.py # System prompt for AI
│ └── llm_backends/
│ ├── llm_backend.py # Abstract base class
│ ├── api.py # Remote API backend
│ ├── local.py # Local Qwen3 backend
│ ├── openai.py # OpenAI API backend
│ └── models.py # Type definitions
│
├── model/
│ ├── sip_message.py # SIP/SDP message models
│ ├── rtp.py # RTP packet models
│ ├── ws_command.py # WebSocket command models
│ └── call_status.py # Call state enums
│
├── voices/ # Piper TTS voice models
│ ├── en/ # English voices
│ ├── zh/ # Chinese voices
│ └── .../ # Other languages
│
├── output/
│ ├── transcode/ # Greeting audio (greeting.wav)
│ ├── converted/ # Converted audio files
│ └── response/ # AI response audio
│
└── recording/ # Call recordings
- Python 3.12+
- CUDA 12.x (for GPU acceleration of ML models)
- FFmpeg (for audio conversion)
- Sufficient disk space for voice models (~500MB+)
Core dependencies (managed via pyproject.toml):
accelerate>=1.12.0 # GPU acceleration
bitsandbytes>=0.48.2 # Quantization
faster-whisper>=1.2.1 # Speech-to-text
huggingface-hub>=0.36.0 # Model downloading
jieba>=0.42.1 # Chinese tokenization
langid>=1.1.6 # Language detection
openai>=2.8.1 # OpenAI API client
piper-tts>=1.3.0 # Text-to-speech
pydantic>=2.12.4 # Data validation
pydub>=0.25.1 # Audio processing
python-dotenv>=1.2.1 # .env support
silero-vad>=6.2.0 # Voice activity detection
transformers>=4.51.3 # HuggingFace models
websockets>=15.0.1 # WebSocket protocol
git clone <repository-url>
cd SIP_server_v2
# Using uv (recommended)
uv sync
# Or using pip
pip install -e .

Create a .env file in the project root:
OPENAI_API_KEY=your_openai_api_key_here
# SIP Configuration
SIP_LOCAL_IP=192.168.1.101 # Your server's IP address
SIP_LOCAL_PORT=5062 # SIP listening port
SIP_TRANSFER_PORT=5060 # SIP transfer/relay port
SIP_SERVER_IP=192.168.1.170 # Remote SIP server IP
# WebSocket Configuration
WS_HOST=192.168.1.101
WS_PORT=8080
WS_URL=ws://192.168.1.101:8080
# RTP Configuration
RTP_PORT_START=31000 # Start of RTP port range
RTP_PORT_END=31010 # End of RTP port range
# Logging
LOG_LEVEL=INFO
SIP_LOG_FILE=sip_server.log
CALL_CENTER_LOG_FILE=call_center.log
# File Management
RECORDING_DIR=./recording
OUTPUT_DIR=./output
MAX_RECORDING_AGE_DAYS=7
# Performance Tuning
CALL_CENTER_BUFFER_SIZE=120 # Audio packets per utterance
WS_SEND_QUEUE_MAX=1000
WS_RECV_QUEUE_MAX=1000
RTP_SEND_QUEUE_MAX=500
RTP_RECV_QUEUE_MAX=500

Start the main server (SIP + WebSocket):
uv run receive_server.py

This initializes:
- SIP listener on SIP_LOCAL_IP:SIP_LOCAL_PORT
- WebSocket server on WS_HOST:WS_PORT
- Configuration validation
- Logging setup
In a separate terminal, start the AI call processing:
uv run call_center.py

The call center:
- Connects to the WebSocket server
- Receives RTP audio packets
- Buffers audio with VAD-based speech detection
- Transcribes speech using Faster-Whisper
- Generates responses using the configured LLM backend
- Converts responses to speech using Piper TTS
- Sends audio back through the call
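At a high level, each buffered utterance passes through the three stages above. The following is a minimal sketch of that STT → LLM → TTS flow, not the project's actual call_center.py: the LLM backend and Piper TTS calls are stubbed out as placeholder functions, while the Faster-Whisper usage follows that library's standard API.

```python
from faster_whisper import WhisperModel

stt = WhisperModel("small")  # model size and device are illustrative

def generate_reply(text: str) -> str:
    # Placeholder for the configured LLM backend's generate() call
    return f"You said: {text}"

def synthesize_reply(text: str) -> bytes:
    # Placeholder for Piper TTS synthesis (the project writes WAVs to output/response/)
    return text.encode()

def handle_utterance(wav_path: str) -> bytes:
    # 1. Speech-to-text: transcribe the buffered utterance
    segments, _info = stt.transcribe(wav_path)
    text = " ".join(segment.text.strip() for segment in segments)
    # 2. LLM: generate a response
    reply = generate_reply(text)
    # 3. Text-to-speech: return audio to send back over RTP
    return synthesize_reply(reply)
```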
SIP Client Server
│ │
│──── INVITE + SDP ────────►│
│ │ Parse SDP, allocate RTP ports
│◄─── 200 OK + SDP ─────────│
│ │
│──── ACK ─────────────────►│
│ │ Start RTP, play greeting.wav
│◄═══ RTP Audio ═══════════►│ Bidirectional audio
│ │ Record inbound audio
│──── BYE ─────────────────►│
│ │ Save recording, cleanup
│◄─── 200 OK ───────────────│
│ │
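The SDP bodies exchanged in this flow advertise G.711 audio. As a point of reference, a 200 OK answer from this server might carry an SDP body like the one below; the addresses and port are examples drawn from the configuration above, and the server's actual SDP may include additional attributes.

```
v=0
o=- 0 0 IN IP4 192.168.1.101
s=SIP Server v2
c=IN IP4 192.168.1.101
t=0 0
m=audio 31000 RTP/AVP 8 0
a=rtpmap:8 PCMA/8000
a=rtpmap:0 PCMU/8000
```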
WebSocket Client Server SIP Endpoint
│ │ │
│── CALL:{phone} ───────►│ │
│ │──── INVITE + SDP ────────►│
│ │◄─── 180 Ringing ──────────│
│◄─ RING_ANS:{phone} ────│ │
│ │◄─── 200 OK + SDP ─────────│
│◄─ CALL_ANS:{call_id} ──│ │
│ │──── ACK ─────────────────►│
│ │ │
│═══ RTP:{audio} ═══════►│◄═══ RTP Audio ═══════════►│
│ │ │
│── BYE:{call_id} ──────►│──── BYE ─────────────────►│
│ │◄─── 200 OK ───────────────│
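A minimal WebSocket client for this outgoing flow could look like the sketch below, using the websockets library (already a project dependency). The URL and phone number are placeholders; adjust them to your WS_URL setting.

```python
import asyncio
import websockets

async def place_call(phone: str) -> None:
    # Connect to the WebSocket server (see WS_URL in .env)
    async with websockets.connect("ws://192.168.1.101:8080") as ws:
        await ws.send(f"CALL:{phone}")
        # Wait for the server to report the call's outcome
        async for message in ws:
            if message.startswith("CALL_ANS:"):
                call_id = message.split(":", 1)[1]
                print(f"Call answered, call_id={call_id}")
                break
            if message.startswith("CALL_FAILED:"):
                print(f"Call failed: {message}")
                break

asyncio.run(place_call("1234567890"))
```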
| Command | Format | Description |
|---|---|---|
| CALL | CALL:{phone_number} | Initiate outgoing call |
| RTP | RTP:{call_id}##{base64_audio} | Send audio data |
| BYE | BYE:{call_id} | Terminate call |
| CALL_ANS | CALL_ANS:{call_id} | Answer incoming call |
| CALL_IGNORE | CALL_IGNORE:{call_id} | Ignore incoming call |
| HANGUP | HANGUP:{call_id} | Hang up call |
| Event | Format | Description |
|---|---|---|
| RING_ANS | RING_ANS:{phone}##{call_id} | Incoming call notification |
| CALL_ANS | CALL_ANS:{call_id} | Call answered |
| CALL_IGNORE | CALL_IGNORE:{call_id} | Call ignored |
| CALL_FAILED | CALL_FAILED:{status} {reason} | Call failed |
| BYE | BYE:{call_id} | Call terminated |
| RTP | RTP:{call_id}##{base64_audio} | Incoming audio data |
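Audio is carried in RTP commands as base64 text. A small sketch of building one such command from a single 20 ms G.711 frame (the call_id is a placeholder):

```python
import base64

def build_rtp_command(call_id: str, g711_frame: bytes) -> str:
    # One 20 ms G.711 frame (160 bytes) per command, base64-encoded
    payload = base64.b64encode(g711_frame).decode("ascii")
    return f"RTP:{call_id}##{payload}"

# Example: 160 bytes of PCMA silence (0xD5 is the A-law silence byte)
command = build_rtp_command("call-123", b"\xd5" * 160)
```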
| Codec | Payload Type | Description |
|---|---|---|
| PCMU | 0 | G.711 μ-law |
| PCMA | 8 | G.711 A-law |
- Sample Rate: 8000 Hz
- Channels: Mono
- Sample Width: 16-bit PCM
- Frame Duration: 20ms
- Samples per Frame: 160
- Bytes per RTP Packet: 160 bytes (encoded)
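The 160-byte packet size follows from these numbers: 8000 Hz × 0.020 s = 160 samples, and G.711 encodes one byte per sample. The sketch below illustrates the conversion from 16-bit PCM using the standard-library audioop module; note that audioop is deprecated in Python 3.12 and removed in 3.13, so treat this as illustration only.

```python
import audioop  # deprecated in 3.12, removed in 3.13; illustration only

SAMPLE_RATE = 8000
FRAME_MS = 20
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000  # 160 samples per 20 ms frame

# 20 ms of 16-bit PCM silence: 160 samples x 2 bytes = 320 bytes
pcm_frame = b"\x00\x00" * SAMPLES_PER_FRAME

pcmu_frame = audioop.lin2ulaw(pcm_frame, 2)  # G.711 mu-law: 1 byte per sample
pcma_frame = audioop.lin2alaw(pcm_frame, 2)  # G.711 A-law: 1 byte per sample
assert len(pcmu_frame) == len(pcma_frame) == 160
```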
Place greeting audio in:
./output/transcode/greeting.wav
This audio plays automatically when answering incoming calls.
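If your greeting source is not already in the call audio format listed above (8 kHz, mono, 16-bit WAV), it can be converted with pydub, which delegates to FFmpeg. The input filename below is a placeholder, and the assumption that greeting.wav should match the call audio format is inferred from the audio properties section.

```python
from pydub import AudioSegment

# Convert an arbitrary source file to 8 kHz / mono / 16-bit WAV
audio = AudioSegment.from_file("my_greeting_source.mp3")
audio = audio.set_frame_rate(8000).set_channels(1).set_sample_width(2)
audio.export("output/transcode/greeting.wav", format="wav")
```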
The system supports three LLM backends for AI response generation:
API Backend: connects to a remote LLM server via HTTP POST.
from helper.llm_backends.api import APIBackend
backend = APIBackend(
    base_url="http://localhost:8000",
    api_key="your-api-key"  # optional
)

Local Backend: runs the Qwen3-1.7B model locally on GPU.
from helper.llm_backends.local import LocalBackend
backend = LocalBackend()
# Automatically loads the model on first use

OpenAI Backend: uses OpenAI's GPT-4o-mini API.
from helper.llm_backends.openai import OpenAIBackend
backend = OpenAIBackend(
    api_key="your-openai-api-key",
    model="gpt-4o-mini"
)

The system uses Silero VAD for detecting speech boundaries:
- Purpose: Determine when a speaker starts/stops talking
- Threshold: Configurable sensitivity (default: 0.5)
- Frame Size: 512 samples at 16kHz
- Integration: Built into RTPHandler for real-time processing
VAD enables:
- Efficient audio buffering (only process complete utterances)
- Natural conversation flow (wait for speaker to finish)
- Reduced processing overhead (skip silence)
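For orientation, the snippet below shows the silero-vad package's streaming VADIterator with the same 512-sample / 16 kHz framing and 0.5 threshold mentioned above. It is a standalone sketch of the library, not the project's VADHandler, and the exact package API may differ between versions.

```python
import torch
from silero_vad import load_silero_vad, VADIterator

model = load_silero_vad()
vad = VADIterator(model, threshold=0.5, sampling_rate=16000)

# Feed consecutive 512-sample chunks of 16 kHz audio; the iterator returns a
# dict at a speech boundary ({'start': ...} or {'end': ...}) and None otherwise.
chunk = torch.zeros(512)  # silence, for illustration
event = vad(chunk, return_seconds=True)
if event is not None:
    print("speech boundary:", event)
```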
- Default Range: 31000-31010
- Allocation: Ports allocated in pairs (RTP + RTCP)
- Spacing: 4-port spacing between sessions
- Cleanup: Automatic release on session termination
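A simplified illustration of pair allocation with 4-port spacing is shown below; it is not the project's actual allocator, just a sketch of the scheme described above.

```python
RTP_PORT_START, RTP_PORT_END = 31000, 31010
allocated: set[int] = set()

def allocate_pair() -> tuple[int, int]:
    """Return an (RTP, RTCP) port pair, keeping a 4-port gap between sessions."""
    for rtp in range(RTP_PORT_START, RTP_PORT_END, 4):
        if rtp not in allocated:
            allocated.add(rtp)
            return rtp, rtp + 1  # RTCP conventionally uses the RTP port + 1
    raise RuntimeError("No free RTP ports in the configured range")

def release_pair(rtp: int) -> None:
    # Called on session termination
    allocated.discard(rtp)
```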
| Property | Value |
|---|---|
| Transport | UDP |
| Payload | G.711 (PCMA/PCMU) |
| Payload Size | 160 bytes |
| Sequence | 16-bit with wraparound |
| Timestamp | 32-bit, +160 per packet |
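These fields map directly onto the standard 12-byte RTP header (RFC 3550). A sketch of packing one PCMA packet follows; the SSRC value is arbitrary.

```python
import struct

def build_rtp_packet(payload_type: int, seq: int, timestamp: int,
                     ssrc: int, payload: bytes) -> bytes:
    # Byte 0: version=2, no padding/extension/CSRC -> 0x80
    # Byte 1: marker=0, payload type (0 = PCMU, 8 = PCMA)
    header = struct.pack(
        "!BBHII",
        0x80,
        payload_type & 0x7F,
        seq & 0xFFFF,            # 16-bit sequence number with wraparound
        timestamp & 0xFFFFFFFF,  # 32-bit timestamp, +160 per 20 ms packet
        ssrc,
    )
    return header + payload

packet = build_rtp_packet(8, 1, 160, 0x12345678, b"\xd5" * 160)  # one PCMA frame
```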
- SIP Server: sip_server.log
- Call Center: call_center.log

Log format:
[LEVEL] - TIMESTAMP - MESSAGE - FILE:LINE

Log levels:
- DEBUG - Detailed debugging information
- INFO - General operational messages
- WARNING - Warning conditions
- ERROR - Error conditions

Configure the level via the LOG_LEVEL environment variable.
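A logging configuration matching that format might look like the following; the exact setup used by config.py may differ.

```python
import logging

logging.basicConfig(
    level=logging.INFO,  # controlled by LOG_LEVEL in the project
    format="[%(levelname)s] - %(asctime)s - %(message)s - %(filename)s:%(lineno)d",
)
```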
| Service | Default Port | Protocol | Direction |
|---|---|---|---|
| SIP Receive | 5062 | UDP | Inbound |
| SIP Transfer | 5060 | UDP | Bidirectional |
| WebSocket | 8080 | TCP | Bidirectional |
| RTP Audio | 31000-31010 | UDP | Bidirectional |
socket.error: [Errno 98] Address already in use
Solution: Change port in .env or kill the process using the port:
lsof -i :5062
kill <PID>

No audio during calls:
- Check UDP firewall rules for RTP ports
- Verify greeting.wav exists in output/transcode/
- Confirm RTP ports are correctly allocated (check logs)
- Ensure NAT traversal is configured if behind NAT

Incoming calls not received:
- Verify SIP_LOCAL_IP matches your network interface
- Check that the firewall allows UDP on the SIP port
- Review SIP server routing rules
- Enable DEBUG logging for detailed traces

Outgoing calls failing:
- Confirm SIP_SERVER_IP is correct
- Check codec compatibility (PCMA/PCMU)
- Review SIP response codes in logs
- Verify SIP credentials if required
Missing API Key:
ValueError: OPENAI_API_KEY is required
Ensure your .env file contains a valid OPENAI_API_KEY.
Rate Limiting:
- Check OpenAI account quota
- Implement retry logic or reduce request frequency

Poor speech detection or transcription:
- Check microphone/audio input quality
- Adjust VAD sensitivity threshold
- Verify audio is 16kHz sample rate for VAD
- Review VAD logs for detection events

High response latency:
- Consider using the Local backend for lower latency
- Check network connectivity to API endpoints
- Monitor GPU utilization for local models
- Reduce CALL_CENTER_BUFFER_SIZE for faster processing

Recording issues:
- Verify RECORDING_DIR exists and is writable
- Check disk space availability
- Ensure proper cleanup of old recordings
- Full type hint coverage
- Pydantic models for data validation
- Match/case for message routing
- Structured logging throughout
- Create a new file in helper/llm_backends/
- Inherit from the LLMBackend base class
- Implement the generate() method
- Register it in call_center.py
from helper.llm_backends.llm_backend import LLMBackend
class MyBackend(LLMBackend):
    def generate(self, messages: list) -> str:
        # Your implementation
        pass

Managing a SIP/RTP session (the imports below are inferred from the project layout):

from pathlib import Path
from helper.sip_session import SIPRTPSession

session = SIPRTPSession(
    call_id="unique-call-id",
    remote_ip="192.168.1.100",
    remote_port=5060
)
# Start audio
session.start_rtp()
# Play audio file
session.play_audio(Path("greeting.wav"))
# Stop and cleanup
session.stop()

Using the RTP handler directly (import inferred from the project layout):

from helper.rtp_handler import RTPHandler

handler = RTPHandler(
    local_port=31000,
    remote_ip="192.168.1.100",
    remote_port=31002
)
# Send audio
handler.send_audio(audio_bytes)
# Receive with callback
handler.set_receive_callback(on_audio_received)
handler.start_receiving()

Building and parsing WebSocket commands:

from helper.ws_command import WSCommandHelper
# Parse incoming command
cmd_type, payload = WSCommandHelper.parse("CALL:1234567890")
# Build outgoing command
message = WSCommandHelper.build("RING_ANS", "1234567890##call-123")

Code by DHT@Matthew
Version: 0.2.0