A Python-based SIP relay server with advanced AI integration for intelligent voice interactions. Supports SIP signaling, RTP media handling, WebSocket-driven real-time control, and AI-powered audio transcription and response generation. Designed for call routing, audio streaming, SIP integration, and AI-assisted call center functionality.
- SIP Signaling - INVITE, ACK, BYE, CANCEL message handling
- RTP Media Streaming - Real-time audio using G.711 (PCMA/PCMU)
- Dynamic RTP Port Allocation - Automatic port management with pair allocation
- Multi-Session Management - Handle multiple concurrent calls
- Call Recording - Automatic timestamped WAV recordings of inbound audio
- Dual Operation Modes - Incoming call handling (server) and outgoing call initiation (client)
- Speech-to-Text - Faster-Whisper for local transcription (multi-language)
- Text-to-Speech - Piper TTS with multi-language voice models
- LLM Integration - Three backend options:
  - API Backend (remote HTTP LLM server)
  - Local Backend (Qwen3 model on GPU)
  - OpenAI Backend (GPT-4o-mini)
- Voice Activity Detection (VAD) - Silero VAD for speech boundary detection
- Language Detection - Automatic language identification (langid)
- WebSocket Interface - Bi-directional control and audio transmission
- Real-time Audio Streaming - Base64-encoded audio over WebSocket
- Environment-based Configuration - Centralized config with .env support
┌─────────────────────────────────────────────────────────────────┐
│ SIP Server v2 │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌─────────────────┐ │
│ │ SIP Clients │◄───────►│ RelayServer │ │
│ │ (VoIP) │ UDP │ (SIP Signaling)│ │
│ └──────────────┘ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ SIPRTPSession │ │
│ │ Management │ │
│ └────────┬────────┘ │
│ │ │
│ ┌──────────────────────────┼──────────────────────┐ │
│ │ │ │ │
│ ┌──────▼──────┐ ┌────────────────▼───────┐ ┌──────────▼──┐ │
│ │ RTPHandler │ │ VADHandler │ │ SIPParsers │ │
│ │ (Audio I/O) │ │ (Voice Activity Detect)│ │ (SIP/SDP) │ │
│ └──────┬──────┘ └────────────────────────┘ └─────────────┘ │
│ │ │
│ ┌──────▼──────────────────────────────────┐ │
│ │ WebSocket Server │ │
│ │ (Real-time control & audio feed) │ │
│ └──────┬──────────────────────────────────┘ │
│ │ │
│ ┌──────▼──────────────────────────────────┐ │
│ │ Call Center (AI Mode) │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ STT │ │ LLM │ │ TTS │ │ │
│ │ │(Whisper)│ │(Backend)│ │ (Piper) │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └─────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
| Component | File | Description |
|---|---|---|
| RelayServer | receive_server.py | Main SIP message handler and session orchestrator |
| SIPRTPSession | helper/sip_session.py | RTP session lifecycle, port allocation, audio handling |
| RTPHandler | helper/rtp_handler.py | Bidirectional RTP packet transmission and reception |
| VADHandler | helper/rtp_handler.py | Silero-based voice activity detection |
| SIPMessageParser | helper/sip_parsers.py | SIP/SDP message parsing and validation |
| WebSocketServer | helper/ws_helper.py | WebSocket communication for real-time control |
| CallCenter | call_center.py | AI pipeline: STT → LLM → TTS |
| LLM Backends | helper/llm_backends/ | Pluggable LLM backends (API, Local, OpenAI) |
| Config | config.py | Centralized configuration management |
SIP_server_v2/
├── main.py # Main entry point
├── receive_server.py # SIP relay server
├── call_center.py # AI call center implementation
├── config.py # Configuration management
│
├── helper/
│ ├── rtp_handler.py # RTP packet handling + VAD
│ ├── sip_session.py # Session management
│ ├── sip_parsers.py # SIP message parsing
│ ├── ws_helper.py # WebSocket server
│ ├── ws_command.py # WebSocket command helpers
│ ├── wav_handler.py # WAV file operations
│ ├── custom_sts_handler.py # Faster-Whisper STT + Piper TTS
│ ├── openai_sts_handler.py # OpenAI STT/TTS (legacy)
│ ├── PROMPT.py # System prompt for AI
│ └── llm_backends/
│ ├── llm_backend.py # Abstract base class
│ ├── api.py # Remote API backend
│ ├── local.py # Local Qwen3 backend
│ ├── openai.py # OpenAI API backend
│ └── models.py # Type definitions
│
├── model/
│ ├── sip_message.py # SIP/SDP message models
│ ├── rtp.py # RTP packet models
│ ├── ws_command.py # WebSocket command models
│ └── call_status.py # Call state enums
│
├── voices/ # Piper TTS voice models
│ ├── en/ # English voices
│ ├── zh/ # Chinese voices
│ └── .../ # Other languages
│
├── output/
│ ├── transcode/ # Greeting audio (greeting.wav)
│ ├── converted/ # Converted audio files
│ └── response/ # AI response audio
│
└── recording/ # Call recordings
- Python 3.12+
- CUDA 12.x (for GPU acceleration of ML models)
- FFmpeg (for audio conversion)
- Sufficient disk space for voice models (~500MB+)
Core dependencies (managed via pyproject.toml):
accelerate>=1.12.0 # GPU acceleration
bitsandbytes>=0.48.2 # Quantization
faster-whisper>=1.2.1 # Speech-to-text
huggingface-hub>=0.36.0 # Model downloading
jieba>=0.42.1 # Chinese tokenization
langid>=1.1.6 # Language detection
openai>=2.8.1 # OpenAI API client
piper-tts>=1.3.0 # Text-to-speech
pydantic>=2.12.4 # Data validation
pydub>=0.25.1 # Audio processing
python-dotenv>=1.2.1 # .env support
silero-vad>=6.2.0 # Voice activity detection
transformers>=4.51.3 # HuggingFace models
websockets>=15.0.1 # WebSocket protocol
git clone <repository-url>
cd SIP_server_v2
# Using uv (recommended)
uv sync
# Or using pip
pip install -e .

Create a .env file in the project root:
OPENAI_API_KEY=your_openai_api_key_here
# SIP Configuration
SIP_LOCAL_IP=192.168.1.101 # Your server's IP address
SIP_LOCAL_PORT=5062 # SIP listening port
SIP_TRANSFER_PORT=5060 # SIP transfer/relay port
SIP_SERVER_IP=192.168.1.170 # Remote SIP server IP
# WebSocket Configuration
WS_HOST=192.168.1.101
WS_PORT=8080
WS_URL=ws://192.168.1.101:8080
# RTP Configuration
RTP_PORT_START=31000 # Start of RTP port range
RTP_PORT_END=31010 # End of RTP port range
# Logging
LOG_LEVEL=INFO
SIP_LOG_FILE=sip_server.log
CALL_CENTER_LOG_FILE=call_center.log
# File Management
RECORDING_DIR=./recording
OUTPUT_DIR=./output
MAX_RECORDING_AGE_DAYS=7
# Performance Tuning
CALL_CENTER_BUFFER_SIZE=120 # Audio packets per utterance
WS_SEND_QUEUE_MAX=1000
WS_RECV_QUEUE_MAX=1000
RTP_SEND_QUEUE_MAX=500
RTP_RECV_QUEUE_MAX=500

Start the main server (SIP + WebSocket):
uv run receive_server.py

This initializes:
- SIP listener on SIP_LOCAL_IP:SIP_LOCAL_PORT
- WebSocket server on WS_HOST:WS_PORT
- Configuration validation
- Logging setup
In a separate terminal, start the AI call processing:
uv run call_center.py

The call center:
- Connects to the WebSocket server
- Receives RTP audio packets
- Buffers audio with VAD-based speech detection
- Transcribes speech using Faster-Whisper
- Generates responses using the configured LLM backend
- Converts responses to speech using Piper TTS
- Sends audio back through the call
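At a high level, each buffered utterance passes through the three stages above. The following is a minimal sketch of that STT → LLM → TTS flow, not the project's actual call_center.py: the LLM backend and Piper TTS calls are stubbed out as placeholder functions, while the Faster-Whisper usage follows that library's standard API.

```python
from faster_whisper import WhisperModel

stt = WhisperModel("small")  # model size and device are illustrative

def generate_reply(text: str) -> str:
    # Placeholder for the configured LLM backend's generate() call
    return f"You said: {text}"

def synthesize_reply(text: str) -> bytes:
    # Placeholder for Piper TTS synthesis (the project writes WAVs to output/response/)
    return text.encode()

def handle_utterance(wav_path: str) -> bytes:
    # 1. Speech-to-text: transcribe the buffered utterance
    segments, _info = stt.transcribe(wav_path)
    text = " ".join(segment.text.strip() for segment in segments)
    # 2. LLM: generate a response
    reply = generate_reply(text)
    # 3. Text-to-speech: return audio to send back over RTP
    return synthesize_reply(reply)
```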
SIP Client Server
│ │
│──── INVITE + SDP ────────►│
│ │ Parse SDP, allocate RTP ports
│◄─── 200 OK + SDP ─────────│
│ │
│──── ACK ─────────────────►│
│ │ Start RTP, play greeting.wav
│◄═══ RTP Audio ═══════════►│ Bidirectional audio
│ │ Record inbound audio
│──── BYE ─────────────────►│
│ │ Save recording, cleanup
│◄─── 200 OK ───────────────│
│ │
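The SDP bodies exchanged in this flow advertise G.711 audio. As a point of reference, a 200 OK answer from this server might carry an SDP body like the one below; the addresses and port are examples drawn from the configuration above, and the server's actual SDP may include additional attributes.

```
v=0
o=- 0 0 IN IP4 192.168.1.101
s=SIP Server v2
c=IN IP4 192.168.1.101
t=0 0
m=audio 31000 RTP/AVP 8 0
a=rtpmap:8 PCMA/8000
a=rtpmap:0 PCMU/8000
```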
WebSocket Client Server SIP Endpoint
│ │ │
│── CALL:{phone} ───────►│ │
│ │──── INVITE + SDP ────────►│
│ │◄─── 180 Ringing ──────────│
│◄─ RING_ANS:{phone} ────│ │
│ │◄─── 200 OK + SDP ─────────│
│◄─ CALL_ANS:{call_id} ──│ │
│ │──── ACK ─────────────────►│
│ │ │
│═══ RTP:{audio} ═══════►│◄═══ RTP Audio ═══════════►│
│ │ │
│── BYE:{call_id} ──────►│──── BYE ─────────────────►│
│ │◄─── 200 OK ───────────────│
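A minimal WebSocket client for this outgoing flow could look like the sketch below, using the websockets library (already a project dependency). The URL and phone number are placeholders; adjust them to your WS_URL setting.

```python
import asyncio
import websockets

async def place_call(phone: str) -> None:
    # Connect to the WebSocket server (see WS_URL in .env)
    async with websockets.connect("ws://192.168.1.101:8080") as ws:
        await ws.send(f"CALL:{phone}")
        # Wait for the server to report the call's outcome
        async for message in ws:
            if message.startswith("CALL_ANS:"):
                call_id = message.split(":", 1)[1]
                print(f"Call answered, call_id={call_id}")
                break
            if message.startswith("CALL_FAILED:"):
                print(f"Call failed: {message}")
                break

asyncio.run(place_call("1234567890"))
```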
| Command | Format | Description |
|---|---|---|
| CALL | CALL:{phone_number} | Initiate outgoing call |
| RTP | RTP:{call_id}##{base64_audio} | Send audio data |
| BYE | BYE:{call_id} | Terminate call |
| CALL_ANS | CALL_ANS:{call_id} | Answer incoming call |
| CALL_IGNORE | CALL_IGNORE:{call_id} | Ignore incoming call |
| HANGUP | HANGUP:{call_id} | Hang up call |
| Event | Format | Description |
|---|---|---|
| RING_ANS | RING_ANS:{phone}##{call_id} | Incoming call notification |
| CALL_ANS | CALL_ANS:{call_id} | Call answered |
| CALL_IGNORE | CALL_IGNORE:{call_id} | Call ignored |
| CALL_FAILED | CALL_FAILED:{status} {reason} | Call failed |
| BYE | BYE:{call_id} | Call terminated |
| RTP | RTP:{call_id}##{base64_audio} | Incoming audio data |
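Audio is carried in RTP commands as base64 text. A small sketch of building one such command from a single 20 ms G.711 frame (the call_id is a placeholder):

```python
import base64

def build_rtp_command(call_id: str, g711_frame: bytes) -> str:
    # One 20 ms G.711 frame (160 bytes) per command, base64-encoded
    payload = base64.b64encode(g711_frame).decode("ascii")
    return f"RTP:{call_id}##{payload}"

# Example: 160 bytes of PCMA silence (0xD5 is the A-law silence byte)
command = build_rtp_command("call-123", b"\xd5" * 160)
```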
| Codec | Payload Type | Description |
|---|---|---|
| PCMU | 0 | G.711 μ-law |
| PCMA | 8 | G.711 A-law |
- Sample Rate: 8000 Hz
- Channels: Mono
- Sample Width: 16-bit PCM
- Frame Duration: 20ms
- Samples per Frame: 160
- Bytes per RTP Packet: 160 bytes (encoded)
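The 160-byte packet size follows from these numbers: 8000 Hz × 0.020 s = 160 samples, and G.711 encodes one byte per sample. The sketch below illustrates the conversion from 16-bit PCM using the standard-library audioop module; note that audioop is deprecated in Python 3.12 and removed in 3.13, so treat this as illustration only.

```python
import audioop  # deprecated in 3.12, removed in 3.13; illustration only

SAMPLE_RATE = 8000
FRAME_MS = 20
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000  # 160 samples per 20 ms frame

# 20 ms of 16-bit PCM silence: 160 samples x 2 bytes = 320 bytes
pcm_frame = b"\x00\x00" * SAMPLES_PER_FRAME

pcmu_frame = audioop.lin2ulaw(pcm_frame, 2)  # G.711 mu-law: 1 byte per sample
pcma_frame = audioop.lin2alaw(pcm_frame, 2)  # G.711 A-law: 1 byte per sample
assert len(pcmu_frame) == len(pcma_frame) == 160
```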
Place greeting audio in:
./output/transcode/greeting.wav
This audio plays automatically when answering incoming calls.
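If your greeting source is not already in the call audio format listed above (8 kHz, mono, 16-bit WAV), it can be converted with pydub, which delegates to FFmpeg. The input filename below is a placeholder, and the assumption that greeting.wav should match the call audio format is inferred from the audio properties section.

```python
from pydub import AudioSegment

# Convert an arbitrary source file to 8 kHz / mono / 16-bit WAV
audio = AudioSegment.from_file("my_greeting_source.mp3")
audio = audio.set_frame_rate(8000).set_channels(1).set_sample_width(2)
audio.export("output/transcode/greeting.wav", format="wav")
```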
The system supports three LLM backends for AI response generation:
API Backend: connects to a remote LLM server via HTTP POST.
from helper.llm_backends.api import APIBackend
backend = APIBackend(
    base_url="http://localhost:8000",
    api_key="your-api-key"  # optional
)

Local Backend: runs the Qwen3-1.7B model locally on GPU.
from helper.llm_backends.local import LocalBackend
backend = LocalBackend()
# Automatically loads the model on first use

OpenAI Backend: uses OpenAI's GPT-4o-mini API.
from helper.llm_backends.openai import OpenAIBackend
backend = OpenAIBackend(
    api_key="your-openai-api-key",
    model="gpt-4o-mini"
)

The system uses Silero VAD for detecting speech boundaries:
- Purpose: Determine when a speaker starts/stops talking
- Threshold: Configurable sensitivity (default: 0.5)
- Frame Size: 512 samples at 16kHz
- Integration: Built into RTPHandler for real-time processing
VAD enables:
- Efficient audio buffering (only process complete utterances)
- Natural conversation flow (wait for speaker to finish)
- Reduced processing overhead (skip silence)
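For orientation, the snippet below shows the silero-vad package's streaming VADIterator with the same 512-sample / 16 kHz framing and 0.5 threshold mentioned above. It is a standalone sketch of the library, not the project's VADHandler, and the exact package API may differ between versions.

```python
import torch
from silero_vad import load_silero_vad, VADIterator

model = load_silero_vad()
vad = VADIterator(model, threshold=0.5, sampling_rate=16000)

# Feed consecutive 512-sample chunks of 16 kHz audio; the iterator returns a
# dict at a speech boundary ({'start': ...} or {'end': ...}) and None otherwise.
chunk = torch.zeros(512)  # silence, for illustration
event = vad(chunk, return_seconds=True)
if event is not None:
    print("speech boundary:", event)
```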
- Default Range: 31000-31010
- Allocation: Ports allocated in pairs (RTP + RTCP)
- Spacing: 4-port spacing between sessions
- Cleanup: Automatic release on session termination
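A simplified illustration of pair allocation with 4-port spacing is shown below; it is not the project's actual allocator, just a sketch of the scheme described above.

```python
RTP_PORT_START, RTP_PORT_END = 31000, 31010
allocated: set[int] = set()

def allocate_pair() -> tuple[int, int]:
    """Return an (RTP, RTCP) port pair, keeping a 4-port gap between sessions."""
    for rtp in range(RTP_PORT_START, RTP_PORT_END, 4):
        if rtp not in allocated:
            allocated.add(rtp)
            return rtp, rtp + 1  # RTCP conventionally uses the RTP port + 1
    raise RuntimeError("No free RTP ports in the configured range")

def release_pair(rtp: int) -> None:
    # Called on session termination
    allocated.discard(rtp)
```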
| Property | Value |
|---|---|
| Transport | UDP |
| Payload | G.711 (PCMA/PCMU) |
| Payload Size | 160 bytes |
| Sequence | 16-bit with wraparound |
| Timestamp | 32-bit, +160 per packet |
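These fields map directly onto the standard 12-byte RTP header (RFC 3550). A sketch of packing one PCMA packet follows; the SSRC value is arbitrary.

```python
import struct

def build_rtp_packet(payload_type: int, seq: int, timestamp: int,
                     ssrc: int, payload: bytes) -> bytes:
    # Byte 0: version=2, no padding/extension/CSRC -> 0x80
    # Byte 1: marker=0, payload type (0 = PCMU, 8 = PCMA)
    header = struct.pack(
        "!BBHII",
        0x80,
        payload_type & 0x7F,
        seq & 0xFFFF,            # 16-bit sequence number with wraparound
        timestamp & 0xFFFFFFFF,  # 32-bit timestamp, +160 per 20 ms packet
        ssrc,
    )
    return header + payload

packet = build_rtp_packet(8, 1, 160, 0x12345678, b"\xd5" * 160)  # one PCMA frame
```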
- SIP Server: sip_server.log
- Call Center: call_center.log

Log format:
[LEVEL] - TIMESTAMP - MESSAGE - FILE:LINE

Log levels:
- DEBUG - Detailed debugging information
- INFO - General operational messages
- WARNING - Warning conditions
- ERROR - Error conditions

Configure the level via the LOG_LEVEL environment variable.
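A logging configuration matching that format might look like the following; the exact setup used by config.py may differ.

```python
import logging

logging.basicConfig(
    level=logging.INFO,  # controlled by LOG_LEVEL in the project
    format="[%(levelname)s] - %(asctime)s - %(message)s - %(filename)s:%(lineno)d",
)
```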
| Service | Default Port | Protocol | Direction |
|---|---|---|---|
| SIP Receive | 5062 | UDP | Inbound |
| SIP Transfer | 5060 | UDP | Bidirectional |
| WebSocket | 8080 | TCP | Bidirectional |
| RTP Audio | 31000-31010 | UDP | Bidirectional |
socket.error: [Errno 98] Address already in use
Solution: Change port in .env or kill the process using the port:
lsof -i :5062
kill <PID>

No audio during calls:
- Check UDP firewall rules for RTP ports
- Verify greeting.wav exists in output/transcode/
- Confirm RTP ports are correctly allocated (check logs)
- Ensure NAT traversal is configured if behind NAT

Incoming calls not received:
- Verify SIP_LOCAL_IP matches your network interface
- Check that the firewall allows UDP on the SIP port
- Review SIP server routing rules
- Enable DEBUG logging for detailed traces

Outgoing calls failing:
- Confirm SIP_SERVER_IP is correct
- Check codec compatibility (PCMA/PCMU)
- Review SIP response codes in logs
- Verify SIP credentials if required
Missing API Key:
ValueError: OPENAI_API_KEY is required
Ensure your .env file contains a valid OPENAI_API_KEY.
Rate Limiting:
- Check OpenAI account quota
- Implement retry logic or reduce request frequency

Poor speech detection or transcription:
- Check microphone/audio input quality
- Adjust VAD sensitivity threshold
- Verify audio is 16kHz sample rate for VAD
- Review VAD logs for detection events

High response latency:
- Consider using the Local backend for lower latency
- Check network connectivity to API endpoints
- Monitor GPU utilization for local models
- Reduce CALL_CENTER_BUFFER_SIZE for faster processing

Recording issues:
- Verify RECORDING_DIR exists and is writable
- Check disk space availability
- Ensure proper cleanup of old recordings
- Full type hint coverage
- Pydantic models for data validation
- Match/case for message routing
- Structured logging throughout
- Create a new file in helper/llm_backends/
- Inherit from the LLMBackend base class
- Implement the generate() method
- Register it in call_center.py
from helper.llm_backends.llm_backend import LLMBackend
class MyBackend(LLMBackend):
    def generate(self, messages: list) -> str:
        # Your implementation
        pass

Managing a SIP/RTP session (the imports below are inferred from the project layout):

from pathlib import Path
from helper.sip_session import SIPRTPSession

session = SIPRTPSession(
    call_id="unique-call-id",
    remote_ip="192.168.1.100",
    remote_port=5060
)
# Start audio
session.start_rtp()
# Play audio file
session.play_audio(Path("greeting.wav"))
# Stop and cleanup
session.stop()

Using the RTP handler directly (import inferred from the project layout):

from helper.rtp_handler import RTPHandler

handler = RTPHandler(
    local_port=31000,
    remote_ip="192.168.1.100",
    remote_port=31002
)
# Send audio
handler.send_audio(audio_bytes)
# Receive with callback
handler.set_receive_callback(on_audio_received)
handler.start_receiving()

Building and parsing WebSocket commands:

from helper.ws_command import WSCommandHelper
# Parse incoming command
cmd_type, payload = WSCommandHelper.parse("CALL:1234567890")
# Build outgoing command
message = WSCommandHelper.build("RING_ANS", "1234567890##call-123")

Code by DHT@Matthew
Version: 0.2.0