Original article: https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/from-zero-to-hero-building-a-production-ready-sip-gateway-for-azure-voice-live/4473405?WT.mc_id=AI-MVP-5003172

Cover image source: Microsoft Tech Community / Microsoft Foundry Blog

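Introduction

Voice technology is reshaping how people interact with machines, making conversations with AI feel more natural than before. With the public beta release of the Voice Live API, developers now have the tools to build low-latency, multimodal voice experiences and open up new possibilities in their applications.

In the past, building a voice bot usually meant chaining several models together: an ASR (automatic speech recognition) model such as Whisper for transcription, a text model for reasoning, and a TTS (text-to-speech) model to generate the spoken output. This pipeline typically adds noticeable latency and loses detail such as emotional expression along the way.

The key change in the Voice Live API is that it consolidates these capabilities into a single API call. By opening a persistent WebSocket connection, developers can stream input and output audio directly, which significantly reduces latency and makes conversations feel more natural. The API also supports function calling, so a voice bot can perform actions during the conversation (for example, placing an order or looking up customer information).

This article, written from the perspective of a third-party technical blogger, walks through a complete solution published by the Microsoft team: a Python-based SIP gateway that connects traditional telephony/voice systems (SIP/RTP) to Azure's Voice Live real-time conversation API (WebSocket). With this gateway, any SIP endpoint (a desk phone, a softphone, or even a call arriving through the PSTN) can hold a natural voice conversation with AI.

After reading this article, you will understand:

The overall architecture of a production-grade SIP-to-WebSocket gateway
Audio transcoding and resampling strategies (for seamless media format conversion)
Real-world deployment topologies: integrating cloud telephony with on-premises enterprise systems (such as Avaya and Genesys)
Step-by-step configuration guides for local testing and enterprise production environments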

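Architecture Overview

High-Level Design

How the Voice Live API Works

Traditional voice assistants usually chain components together: ASR (transcription) + text model (reasoning) + TTS (synthesis). Every extra step adds latency and makes it easier to lose the emotion and tone of the conversation.

The Voice Live API folds these capabilities into one persistent WebSocket connection: audio is streamed in and audio responses are streamed back in real time, which significantly reduces end-to-end latency and makes the conversation feel more natural. In addition, function calling lets the voice bot trigger business actions during the conversation.

The gateway described here acts as a bidirectional media proxy between SIP/RTP (the telephony world) and the Azure Voice Live real-time WebSocket API: it translates the signaling protocol and converts the audio format, so that traditional VoIP infrastructure and modern conversational AI agents integrate seamlessly.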

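Key Design Principles

Asynchronous-first: built on asyncio with non-blocking I/O, aiming for low latency and high concurrency potential
Separation of concerns: SIP, media, and Voice Live integration are split into layered modules
Production-grade error handling: on audio queue underrun, degrade gracefully (inject silence frames) instead of producing glitches or dropping the call
Structured logging: structlog emits machine-parseable, context-rich logs for observability
Type safety: pydantic validates configuration, and mypy performs static checks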

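Core Components

Key roles of the SBC (session border controller):

Terminates carrier-side SIP and interworks with the Azure SIP/RTC endpoint
Normalizes SIP headers, strips unsupported options, and maps codecs
Provides security features: TLS signaling, SRTP media, topology hiding, ACLs
Optionally anchors media, for compliance recording, QoS smoothing, lawful intercept, and similar needs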

Project Structure

src/voicelive_sip_gateway/
├── config/
│   ├── __init__.py
│   └── settings.py          # Pydantic-based configuration
├── gateway/
│   └── main.py              # Application entry point & lifecycle
├── logging/
│   ├── __init__.py
│   └── setup.py             # Structlog configuration
├── media/
│   ├── __init__.py
│   ├── stream_bridge.py     # Bidirectional audio queue manager
│   └── transcode.py         # μ-law ↔ PCM16 + resampling
├── sip/
│   ├── __init__.py
│   ├── agent.py             # pjsua2 wrapper & call handling
│   ├── rtp.py               # RTP utilities
│   └── sdp.py               # SDP parsing/generation
└── voicelive/
    ├── __init__.py
    ├── client.py            # Azure Voice Live SDK wrapper
    └── events.py            # Event type mapping

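Call Flow

A customer dials your number
The Audiocodes SBC receives the call from the PSTN and forwards a SIP INVITE to Asterisk
Asterisk's routing logic sends the INVITE to voicelive-bot@gateway.example.com
The Voice Live Gateway, registered with Asterisk as a SIP endpoint, answers the call
RTP audio flows: Caller ↔ SBC ↔ Asterisk ↔ Gateway ↔ Azure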

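Audio Pipeline: From μ-law to PCM16

The Challenge

Traditional telephony commonly uses G.711 μ-law encoding at an 8 kHz sample rate (bandwidth-efficient). Azure Voice Live, in contrast, expects PCM16 (16-bit linear PCM) at 24 kHz. The gateway therefore has to perform two conversions on the live media path:

Codec conversion (μ-law ↔ PCM16)
Sample-rate conversion (8 kHz ↔ 24 kHz)

Both must add as little latency as possible.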
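
To make the codec step concrete, here is a minimal pure-Python sketch of G.711 μ-law expansion and compression for a single sample. It is illustrative only: in the gateway described in this article, pjsua2 performs this conversion internally, and the function names ulaw_decode and ulaw_encode are made up for this example rather than taken from the project.

```python
BIAS = 0x84   # 132, added to the magnitude before the segment search
CLIP = 32635  # largest magnitude that still fits in 15 bits after adding BIAS


def ulaw_decode(code: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample."""
    code = ~code & 0xFF            # mu-law bytes are transmitted bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07  # which logarithmic segment
    mantissa = code & 0x0F         # position inside the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def ulaw_encode(sample: int) -> int:
    """Compress one signed 16-bit linear sample to an 8-bit mu-law code."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment: the position of the highest set bit between 7 and 14.
    exponent = 7
    mask = 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF
```

Each μ-law byte carries one sample, which is why a 20 ms frame is 160 bytes on the wire but 320 bytes once expanded to PCM16.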

Audio Flow Diagram

Caller (SIP)                    Gateway                     Azure Voice Live
─────────────                   ───────                     ────────────────
    │                              │                               │
    │ RTP: μ-law 8kHz              │                               │
    ├──────────────────────────────►                               │
    │                              │                               │
    │                         ┌────▼────┐                          │
    │                         │ pjsua2  │ (decodes to PCM16 8kHz)  │
    │                         └────┬────┘                          │
    │                              │                               │
    │                         ┌────▼────────┐                      │
    │                         │ Resample    │ (8kHz → 24kHz)       │
    │                         │ PCM16       │                      │
    │                         └────┬────────┘                      │
    │                              │                               │
    │                              ├───────────────────────────────►
    │                              │   WebSocket: PCM16 24kHz      │
    │                              │                               │
    │                              │◄──────────────────────────────┤
    │                              │   Response: PCM16 24kHz       │
    │                              │                               │
    │                         ┌────▼────────┐                      │
    │                         │ Resample    │ (24kHz → 8kHz)       │
    │                         │ PCM16       │                      │
    │                         └────┬────────┘                      │
    │                              │                               │
    │                         ┌────▼────┐                          │
    │◄────────────────────────┤ pjsua2  │ (encodes to μ-law)       │
    │  RTP: μ-law 8kHz        └─────────┘                          │

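The following excerpts highlight a few key pieces of code that help explain the end-to-end audio flow.

Audio Stream Bridge: stream_bridge.py

AudioStreamBridge orchestrates the bidirectional audio flow with asyncio queues: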

class AudioStreamBridge:
    """Bidirectional audio pump between SIP (μ-law) and Voice Live (PCM16 24kHz)."""

    VOICELIVE_SAMPLE_RATE = 24000
    SIP_SAMPLE_RATE = 8000

    def __init__(self, settings: Settings):
        self._inbound_queue: asyncio.Queue[bytes] = asyncio.Queue()   # SIP → Voice Live
        self._outbound_queue: asyncio.Queue[bytes] = asyncio.Queue()  # Voice Live → SIP

Inbound path (Caller → AI):

async def _flush(self) -> None:
    """Process inbound audio: PCM16 8kHz from SIP → PCM16 24kHz to Voice Live."""
    while True:
        pcm16_8k = await self._inbound_queue.get()
        pcm16_24k = resample_pcm16(pcm16_8k, self.SIP_SAMPLE_RATE, self.VOICELIVE_SAMPLE_RATE)
        if self._voicelive_client:
            await self._voicelive_client.send_audio_chunk(pcm16_24k)
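
The resample_pcm16 helper called above is not shown in the article (the latency table later suggests the project relies on SciPy's polyphase resampler). As a rough stand-in that needs only the standard library, the sketch below resamples PCM16 by linear interpolation. The name resample_pcm16_linear is hypothetical, and linear interpolation applies no anti-aliasing filter, so treat it as a way to see the data shapes rather than as production-quality audio code.

```python
import array
import sys


def resample_pcm16_linear(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample little-endian PCM16 mono audio by linear interpolation."""
    src = array.array("h")
    src.frombytes(pcm)
    if sys.byteorder == "big":
        src.byteswap()  # PCM16 on the wire is little-endian
    if not src or src_rate == dst_rate:
        return pcm

    n_out = len(src) * dst_rate // src_rate
    out = array.array("h", bytes(2 * n_out))
    last = len(src) - 1
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional index into the source
        j = int(pos)
        frac = pos - j
        nxt = src[min(j + 1, last)]        # hold the final sample at the end
        out[i] = round(src[j] + (nxt - src[j]) * frac)

    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()
```

A 20 ms frame of 320 bytes at 8 kHz becomes 960 bytes at 24 kHz, and the same function maps it back to 320 bytes on the return path.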

Outbound path (AI → Caller):

async def emit_audio_to_sip(self, pcm_chunk: bytes) -> None:
    """Resample Voice Live audio down to 8 kHz PCM frames for SIP playback."""
    pcm_8k = resample_pcm16(pcm_chunk, self.VOICELIVE_SAMPLE_RATE, self.SIP_SAMPLE_RATE)

    # Split into 20ms frames (160 samples @ 8kHz = 320 bytes)
    frame_size_bytes = 320
    for offset in range(0, len(pcm_8k), frame_size_bytes):
        frame = pcm_8k[offset : offset + frame_size_bytes]
        if frame:
            await self._outbound_queue.put(frame)

Frame timing: VoIP commonly uses 20 ms frames (ptime=20). At 8 kHz: 8000 samples/sec × 0.020 sec = 160 samples = 320 bytes (PCM16).
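
The same arithmetic applies to every format in the pipeline. The small helper below (frame_bytes is a name chosen for this illustration) computes the frame size for a given sample rate, packet time, and sample width.

```python
def frame_bytes(sample_rate_hz: int, ptime_ms: int = 20, bytes_per_sample: int = 2) -> int:
    """Return the size in bytes of one audio frame of ptime_ms milliseconds."""
    samples = sample_rate_hz * ptime_ms // 1000
    return samples * bytes_per_sample


# PCM16 at 8 kHz (SIP side after decoding), 20 ms
sip_frame = frame_bytes(8000)
# PCM16 at 24 kHz (Voice Live side), 20 ms
voicelive_frame = frame_bytes(24000)
# G.711 mu-law on the wire uses one byte per sample
ulaw_frame = frame_bytes(8000, bytes_per_sample=1)
```

With these inputs the three values are 320, 960, and 160 bytes respectively.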

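Handling SIP Signaling with pjsua2

Why pjsua2?

pjproject is close to an industry standard for SIP/RTP and is used in many products, including Asterisk. The pjsua2 API provides:

A complete SIP stack (INVITE, ACK, BYE, REGISTER, and so on)
RTP/RTCP media handling
Multiple built-in codecs (G.711, G.722, Opus, and others)
NAT traversal (STUN/TURN/ICE)
A thread-safe C++ API with Python bindings

Custom Media Ports

To bridge the pjsua2 media pipeline to the asyncio queues, the gateway implements a custom AudioMediaPort subclass.

Receiving Caller Audio (Caller → Voice Live)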

def onFrameReceived(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it receives audio from caller (to Voice Live)."""
    if self._direction == "to_voicelive" and self._loop:
        if frame.type == pj.PJMEDIA_FRAME_TYPE_AUDIO and frame.buf:
            try:
                asyncio.run_coroutine_threadsafe(
                    self._bridge.enqueue_sip_audio(bytes(frame.buf)),
                    self._loop
                )
            except Exception as e:
                self._logger.warning("media.enqueue_failed", error=str(e))

Threading note: the pjsua2 event loop runs on a dedicated thread, so the example uses asyncio.run_coroutine_threadsafe() to hand audio data safely back to the main asyncio loop.
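
The hand-off pattern can be tried in isolation, without pjsua2. In the sketch below a plain threading.Thread stands in for the pjsua2 media thread; media_thread and collect_frames are hypothetical names for this example. The point is that asyncio.Queue is not thread-safe, so the foreign thread never touches the queue directly and instead schedules queue.put() on the loop.

```python
import asyncio
import threading

FRAME_BYTES = 320  # one 20 ms PCM16 frame at 8 kHz


def media_thread(loop: asyncio.AbstractEventLoop, queue: asyncio.Queue, n_frames: int) -> None:
    """Stand-in for the pjsua2 thread: produce frames and hand them to the loop."""
    for i in range(n_frames):
        frame = bytes([i]) * FRAME_BYTES
        # Schedule the coroutine on the loop's own thread; returns immediately.
        asyncio.run_coroutine_threadsafe(queue.put(frame), loop)


async def collect_frames(n_frames: int) -> list:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    worker = threading.Thread(target=media_thread, args=(loop, queue, n_frames))
    worker.start()
    frames = [await queue.get() for _ in range(n_frames)]
    # Join in a helper thread so the event loop itself is never blocked.
    await asyncio.to_thread(worker.join)
    return frames


frames = asyncio.run(collect_frames(3))
```

Because the callbacks are scheduled in FIFO order, the frames arrive on the asyncio side in the order the thread produced them.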

Sending Audio to the Caller (Voice Live → Caller)

def onFrameRequested(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it needs audio to send to caller (from Voice Live)."""
    if self._direction == "from_voicelive" and self._loop:
        try:
            future = asyncio.run_coroutine_threadsafe(
                self._bridge.dequeue_sip_audio_nonblocking(),
                self._loop
            )
            pcm_data = future.result(timeout=0.050)

            # Ensure exactly 320 bytes (160 samples @ 8kHz)
            if len(pcm_data) != 320:
                pcm_data = (pcm_data + b'\x00' * 320)[:320]

            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(pcm_data)
            frame.size = len(pcm_data)
        except Exception:
            # Return silence on timeout/error
            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(b'\x00' * 320)
            frame.size = 320

Graceful degradation: if the outbound queue is empty (the AI has not produced audio yet), the port injects a silence frame to avoid RTP gaps and choppy playback.
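
The article does not show dequeue_sip_audio_nonblocking, so the following is only a guess at the core idea: take a frame if one is ready, otherwise return silence, and always hand back exactly 320 bytes. The function name next_sip_frame is hypothetical.

```python
import asyncio

FRAME_BYTES = 320                  # 160 samples of PCM16 at 8 kHz
SILENCE = b"\x00" * FRAME_BYTES    # PCM16 silence is all zero bytes


def next_sip_frame(queue: asyncio.Queue) -> bytes:
    """Return the next outbound frame, or silence if none is queued."""
    try:
        data = queue.get_nowait()
    except asyncio.QueueEmpty:
        return SILENCE
    # Pad a short frame with silence and truncate an oversized one.
    return (data + SILENCE)[:FRAME_BYTES]
```

Returning silence immediately keeps the RTP stream evenly paced, whereas waiting for audio inside the media callback would stall the 20 ms frame clock.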

Call State Management

class GatewayCall(pj.Call):
    """Handles SIP call lifecycle and connects media bridge."""

    def onCallState(self, prm: pj.OnCallStateParam) -> None:
        ci = self.getInfo()
        self._logger.info("sip.call_state", remote_uri=ci.remoteUri, state=ci.stateText)

        if ci.state == pj.PJSIP_INV_STATE_DISCONNECTED:
            self._cleanup()
            self._account.current_call = None

    def onCallMediaState(self, prm: pj.OnCallMediaStateParam) -> None:
        ci = self.getInfo()
        for mi in ci.media:
            if mi.type == pj.PJMEDIA_TYPE_AUDIO and mi.status == pj.PJSUA_CALL_MEDIA_ACTIVE:
                media = self.getMedia(mi.index)
                aud_media = pj.AudioMedia.typecastFromMedia(media)

                # Create bidirectional media bridge
                self._to_voicelive_port = CustomAudioMediaPort(self._bridge, "to_voicelive", self._logger)
                self._from_voicelive_port = CustomAudioMediaPort(self._bridge, "from_voicelive", self._logger)

                # Connect: Caller → to_voicelive_port → Voice Live
                aud_media.startTransmit(self._to_voicelive_port)
                # Connect: Voice Live → from_voicelive_port → Caller
                self._from_voicelive_port.startTransmit(aud_media)

                # Start AI conversation with greeting
                asyncio.run_coroutine_threadsafe(
                    self._voicelive_client.request_response(interrupt=False),
                    self._loop
                )

Account Registration

In a production deployment alongside Asterisk, the gateway needs to register with the PBX like any standard SIP endpoint:

class SipAgent:
    def _run_pjsua_thread(self, loop: asyncio.AbstractEventLoop) -> None:
        self._ep = pj.Endpoint()
        self._ep.libCreate()
        self._ep.libInit(ep_cfg)
        self._ep.transportCreate(pj.PJSIP_TRANSPORT_UDP, transport_cfg)
        self._ep.libStart()

        self._account = GatewayAccount(self._logger, self._bridge, self._voicelive_client, loop)

        if self._settings.sip.register_with_server and self._settings.sip.server:
            acc_cfg = pj.AccountConfig()
            acc_cfg.idUri = f"sip:{self._settings.sip.user}@{self._settings.sip.server}"
            acc_cfg.regConfig.registrarUri = f"sip:{self._settings.sip.server}"

            # Digest authentication credentials
            cred = pj.AuthCredInfo()
            cred.scheme = "digest"
            cred.realm = self._settings.sip.auth_realm or "*"
            cred.username = self._settings.sip.auth_user
            cred.data = self._settings.sip.auth_password
            cred.dataType = pj.PJSIP_CRED_DATA_PLAIN_PASSWD
            acc_cfg.sipConfig.authCreds.append(cred)

            self._account.create(acc_cfg)
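
pjsua2 computes the digest response itself, but seeing the calculation helps explain why a wrong realm or username breaks registration. The sketch below follows the classic RFC 2617 MD5 digest formula without the qop extension; a real exchange with Asterisk may use qop=auth, which adds a nonce count and client nonce to the final hash. The function name and all sample values are invented for illustration.

```python
import hashlib


def _md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username: str, realm: str, password: str,
                    method: str, uri: str, nonce: str) -> str:
    """Compute an RFC 2617 digest response (MD5, no qop)."""
    ha1 = _md5_hex(f"{username}:{realm}:{password}")  # identity and secret
    ha2 = _md5_hex(f"{method}:{uri}")                 # the request being signed
    return _md5_hex(f"{ha1}:{nonce}:{ha2}")


# Hypothetical values; the nonce normally comes from the server's 401 challenge.
good = digest_response("voicelive-bot", "sip.example.com", "secret",
                       "REGISTER", "sip:asterisk.internal.example.com", "abc123")
wrong_realm = digest_response("voicelive-bot", "other.example.com", "secret",
                              "REGISTER", "sip:asterisk.internal.example.com", "abc123")
```

The realm is hashed together with the username and password, so the two results differ even though the password is the same. That is why the realm configured on the gateway has to match the one the PBX sends in its challenge (or be set to the "*" wildcard, as in the code above).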

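Voice Live Integration

Azure Voice Live Overview

Azure Voice Live is a real-time, bidirectional conversational AI service that combines the following capabilities:

GPT-4o Realtime Preview: an ultra-low-latency language model optimized for spoken conversation
Streaming speech recognition: continuous transcription with word-level timestamps
Neural TTS: more natural synthesized voices with emotional expression
Server-side VAD: automatic turn-taking through voice activity detection (VAD), with no extra prompting required

Client Implementation: client.py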

class VoiceLiveClient:
    """Manages lifecycle of an Azure Voice Live WebSocket connection."""

    async def connect(self) -> None:
        if self._settings.azure.api_key:
            credential = AzureKeyCredential(self._settings.azure.api_key)
        else:
            # Use AAD authentication (Managed Identity or Service Principal)
            self._aad_credential = DefaultAzureCredential()
            credential = await self._aad_credential.__aenter__()

        self._connection_cm = connect(
            endpoint=self._settings.azure.endpoint,
            credential=credential,
            model=self._settings.azure.model,
        )
        self._connection = await self._connection_cm.__aenter__()

        # Configure session parameters
        session = RequestSession(
            model="gpt-4o",
            modalities=[Modality.TEXT, Modality.AUDIO],
            instructions=self._settings.azure.instructions,
            input_audio_format=InputAudioFormat.PCM16,
            output_audio_format=OutputAudioFormat.PCM16,
            input_audio_transcription=AudioInputTranscriptionOptions(model="azure-speech"),
            turn_detection=ServerVad(
                threshold=0.5,
                prefix_padding_ms=200,
                silence_duration_ms=400
            ),
            voice=AzureStandardVoice(name=self._settings.azure.voice)
        )
        await self._connection.session.update(session=session)

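Key configuration points:

turn_detection=ServerVad(...): Azure decides when the user has stopped speaking and automatically triggers the AI response; no wake word or explicit prompt is needed
prefix_padding_ms=200: keeps the 200 ms of audio preceding detected speech, so the start of an utterance is not clipped
silence_duration_ms=400: the turn is considered finished after 400 ms of silence

Streaming response audio: AI-generated speech arrives in pieces as RESPONSE_AUDIO_DELTA events (base64-encoded PCM16 chunks). The gateway decodes each chunk and immediately pushes it through the audio bridge for low-latency playback.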
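
The decode step can be sketched as follows. This assumes the raw WebSocket message is a JSON object with a type of "response.audio.delta" and a base64 string in a delta field, which mirrors the realtime event format but is not confirmed by the article. When the SDK wrapper is used, the event object may already expose decoded bytes. The function name decode_audio_delta is hypothetical.

```python
import base64
import json


def decode_audio_delta(message: str) -> bytes:
    """Extract raw PCM16 bytes from an audio-delta event, or b'' for other events."""
    event = json.loads(message)
    if event.get("type") != "response.audio.delta":  # assumed event name
        return b""
    return base64.b64decode(event["delta"])          # assumed field name


# Build a fake event carrying four PCM16 samples to exercise the function.
pcm = b"\x01\x00\x02\x00\x03\x00\x04\x00"
fake_event = json.dumps({
    "type": "response.audio.delta",
    "delta": base64.b64encode(pcm).decode("ascii"),
})
decoded = decode_audio_delta(fake_event)
```

The decoded bytes are PCM16 at 24 kHz, ready to be passed to emit_audio_to_sip for downsampling and framing.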

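Production Deployment: Audiocodes + Asterisk

Why This Topology?

Component	Role	Benefits
Audiocodes SBC	Session border controller	- NAT/firewall traversal - Security (DoS protection, encryption) - Protocol normalization - Topology hiding - Media anchoring (optional transcoding)
Asterisk PBX	SIP server	- Call routing and IVR - User directory and authentication - Advanced call features such as transfer, hold, and conferencing - CDR/analytics - Integration with enterprise telephony systems
Voice Live Gateway	AI conversation endpoint	- Real-time AI conversation - Natural language understanding - Dynamic response generation - Multilingual support

Network Architecture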

                    Internet
                       │
                       │ SIP/RTP
                       ▼
         ┌─────────────────────────┐
         │   Audiocodes SBC        │
         │   Public IP: X.X.X.X    │
         │   Ports: 5060, 10000+   │
         └────────────┬────────────┘
                      │ Private Network
         ┌────────────▼────────────┐
         │   Asterisk PBX          │
         │   Internal: 10.0.1.10   │
         │   Port: 5060            │
         └────────────┬────────────┘
                      │
         ┌────────────▼────────────┐
         │  Voice Live Gateway     │
         │  Internal: 10.0.1.20    │
         │  Port: 5060             │
         │  ┌───────────────────┐  │
         │  │ Outbound HTTPS    │  │
         │  │ to Azure          │  │
         │  └─────────┬─────────┘  │
         └────────────┼────────────┘
                      │ Internet (HTTPS/WSS)
                      ▼
         ┌────────────────────────┐
         │  Azure Voice Live API  │
         │  *.cognitiveservices   │
         └────────────────────────┘

Audiocodes SBC Configuration

1. SIP Trunk to Asterisk (IP Group)

IP Group Settings:
- Name: Asterisk-Trunk
- Type: Server
- SIP Group Name: asterisk.internal.example.com
- Media Realm: Private
- Proxy Set: Asterisk-ProxySet
- Classification: Classify by Proxy Set
- SBC Operation Mode: SBC-Only
- Topology Location: Internal Network

2. Proxy Set for Asterisk

Proxy Set Name: Asterisk-ProxySet
Proxy Address: 10.0.1.10:5060
Transport Type: UDP
Load Balancing Method: Parking Lot

3. IP-to-IP Routing Rule

Rule Name: PSTN-to-Gateway
Source IP Group: PSTN-Trunk
Destination IP Group: Asterisk-Trunk
Call Trigger: Any
Destination Prefix Manipulation: None
Message Manipulation: None

4. Media Settings

Media Realm: Private
IPv4 Interface: LAN1 (10.0.1.1)
Media Security: None (or SRTP if required)
Codec Preference Order: G711Ulaw, G711Alaw
Transcoding: Disabled (pass-through)
RTP Port Range: 10000-20000

Asterisk Configuration

/etc/asterisk/pjsip.conf

;=====================================
; Transport Configuration
;=====================================
[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:5060
local_net=10.0.0.0/8

;=====================================
; Voice Live Gateway Endpoint
;=====================================
[voicelive-gateway]
type=endpoint
context=voicelive-routing
aors=voicelive-gateway
auth=voicelive-auth
disallow=all
allow=ulaw
allow=alaw
direct_media=no
force_rport=yes
rewrite_contact=yes
rtp_symmetric=yes
ice_support=no

[voicelive-gateway]
type=aor
contact=sip:10.0.1.20:5060
qualify_frequency=30

[voicelive-auth]
type=auth
auth_type=userpass
username=
password=
realm=sip.example.com

;=====================================
; SBC Trunk (for inbound calls)
;=====================================
[audiocodes-sbc]
type=endpoint
context=from-sbc
aors=audiocodes-sbc
disallow=all
allow=ulaw
allow=alaw

[audiocodes-sbc]
type=aor
contact=sip:10.0.1.1:5060

/etc/asterisk/extensions.conf

;=====================================
; Incoming calls from SBC
;=====================================
[from-sbc]
; Example: Route calls to 800-AI-VOICE to the gateway
exten => 8002486423,1,NoOp(Routing to Voice Live Gateway)
 same => n,Set(CALLERID(name)=AI Assistant)
 same => n,Dial(PJSIP/voicelive-bot@voicelive-gateway,30)
 same => n,Hangup()

; Default handler for unmatched numbers
exten => _X.,1,NoOp(Unrouted call: ${EXTEN})
 same => n,Playback(invalid)
 same => n,Hangup()

;=====================================
; Voice Live Gateway routing context
;=====================================
[voicelive-routing]
exten => _X.,1,NoOp(Call from gateway: ${CALLERID(num)})
 same => n,Hangup()

Gateway Configuration

Create /opt/voicelive-gateway/.env:

#=====================================
# SIP Configuration
#=====================================
SIP_SERVER=asterisk.internal.example.com
SIP_PORT=5060
SIP_USER=voicelive-bot@sip.example.com
AUTH_USER=voicelive-bot
AUTH_REALM=sip.example.com
AUTH_PASSWORD=
REGISTER_WITH_SIP_SERVER=true
DISPLAY_NAME=Voice Live Bot

#=====================================
# Network Configuration
#=====================================
SIP_LOCAL_ADDRESS=0.0.0.0
SIP_VIA_ADDR=10.0.1.20
MEDIA_ADDRESS=10.0.1.20
MEDIA_PORT=10000
MEDIA_PORT_COUNT=1000

#=====================================
# Azure Voice Live
#=====================================
AZURE_VOICELIVE_ENDPOINT=wss://xxxxxx.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=
VOICE_LIVE_MODEL=gpt-4o
VOICE_LIVE_VOICE=en-US-AvaNeural
VOICE_LIVE_INSTRUCTIONS=You are a helpful customer service assistant for Contoso Inc. Answer questions about account balances, order status, and general inquiries. Be friendly and concise.
VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS=200
VOICE_LIVE_PROACTIVE_GREETING_ENABLED=true
VOICE_LIVE_PROACTIVE_GREETING=Thank you for calling Contoso customer service. How can I help you today?

#=====================================
# Logging
#=====================================
LOG_LEVEL=INFO
VOICE_LIVE_LOG_FILE=/var/log/voicelive-gateway/gateway.log

Systemd Service (Linux)

Create /etc/systemd/system/voicelive-gateway.service:

[Unit]
Description=Voice Live SIP Gateway
After=network.target

[Service]
Type=simple
User=voicelive
Group=voicelive
WorkingDirectory=/opt/voicelive-gateway
EnvironmentFile=/opt/voicelive-gateway/.env
ExecStart=/opt/voicelive-gateway/.venv/bin/python3 -m voicelive_sip_gateway.gateway.main
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/log/voicelive-gateway

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable voicelive-gateway
sudo systemctl start voicelive-gateway
sudo systemctl status voicelive-gateway

Configuration Guide

Environment Variable Reference

Variable	Description	Example	Required
Azure Voice Live
AZURE_VOICELIVE_ENDPOINT	WebSocket endpoint URL	wss://myresource.cognitiveservices.azure.com/openai/realtime	✅
AZURE_VOICELIVE_API_KEY	API key (or use AAD)	abc123…	✅*
VOICE_LIVE_MODEL	Model identifier	gpt-4o	❌ (default: gpt-4o)
VOICE_LIVE_VOICE	TTS voice name	en-US-AvaNeural	❌
VOICE_LIVE_INSTRUCTIONS	System prompt	You are a helpful assistant	❌
VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS	Max tokens per response	200	❌
VOICE_LIVE_PROACTIVE_GREETING_ENABLED	Enable greeting on connect	true	❌
VOICE_LIVE_PROACTIVE_GREETING	Greeting message	Hello! How can I help?	❌
SIP Configuration
SIP_SERVER	Asterisk/PBX hostname	asterisk.example.com	✅**
SIP_PORT	SIP port	5060	❌ (default: 5060)
SIP_USER	SIP user URI	bot@sip.example.com	✅**
AUTH_USER	Auth username	bot	✅**
AUTH_REALM	Auth realm	sip.example.com	✅**
AUTH_PASSWORD	Auth password	securepass	✅**
REGISTER_WITH_SIP_SERVER	Enable registration	true	❌ (default: false)
DISPLAY_NAME	Caller ID name	Voice Live Bot	❌
Network Settings
SIP_LOCAL_ADDRESS	Local bind address	0.0.0.0	❌ (default: 127.0.0.1)
SIP_VIA_ADDR	Via header IP	10.0.1.20	❌
MEDIA_ADDRESS	RTP bind address	10.0.1.20	❌
MEDIA_PORT	RTP port range start	10000	❌
MEDIA_PORT_COUNT	RTP port range count	1000	❌
Logging
LOG_LEVEL	Log verbosity	INFO	❌ (default: INFO)
VOICE_LIVE_LOG_FILE	Log file path	logs/gateway.log	❌

*Either AZURE_VOICELIVE_API_KEY or AAD environment variables (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET) are required.

**Required only when REGISTER_WITH_SIP_SERVER=true.
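
The conditional requirement in the second footnote is the kind of rule worth enforcing at startup. The project does this with pydantic; the standard-library sketch below only illustrates the rule for a handful of the SIP variables, using the defaults from the table. SipSettings and load_sip_settings are illustrative names, not the project's actual classes.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_WHEN_REGISTERING = (
    "SIP_SERVER", "SIP_USER", "AUTH_USER", "AUTH_REALM", "AUTH_PASSWORD",
)


@dataclass(frozen=True)
class SipSettings:
    server: Optional[str]
    port: int
    user: Optional[str]
    register_with_server: bool
    local_address: str


def load_sip_settings(env: Mapping[str, str]) -> SipSettings:
    """Build SIP settings from an environment-like mapping."""
    register = env.get("REGISTER_WITH_SIP_SERVER", "false").strip().lower() == "true"
    if register:
        # Empty strings count as missing, matching an unfilled .env entry.
        missing = [name for name in REQUIRED_WHEN_REGISTERING if not env.get(name)]
        if missing:
            raise ValueError("missing required SIP variables: " + ", ".join(missing))
    return SipSettings(
        server=env.get("SIP_SERVER"),
        port=int(env.get("SIP_PORT", "5060")),
        user=env.get("SIP_USER"),
        register_with_server=register,
        local_address=env.get("SIP_LOCAL_ADDRESS", "127.0.0.1"),
    )
```

In practice you would call load_sip_settings(os.environ) once at startup, so that a missing credential fails fast instead of surfacing later as a silent registration failure.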

Local Testing Setup

For a quick validation without any SBC/PBX infrastructure, you can test locally with the steps below.

Prerequisites

Install pjproject and its Python bindings:

git clone https://github.com/pjsip/pjproject.git
cd pjproject
./configure CFLAGS="-O2 -DNDEBUG" && make dep && make
cd pjsip-apps/src/swig
make
export PYTHONPATH=$PWD:$PYTHONPATH

Install the gateway dependencies:

cd /path/to/azure-voicelive-sip-python
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Install PortAudio (optional, for playback through the local speakers):

# macOS
brew install portaudio
# Ubuntu/Debian
sudo apt-get install portaudio19-dev

Local Configuration

Create .env:

# No SIP server - direct connection
REGISTER_WITH_SIP_SERVER=false
SIP_LOCAL_ADDRESS=127.0.0.1
SIP_VIA_ADDR=127.0.0.1
MEDIA_ADDRESS=127.0.0.1

# Azure Voice Live
AZURE_VOICELIVE_ENDPOINT=wss://your-resource.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=your-api-key

# Logging
LOG_LEVEL=DEBUG
VOICE_LIVE_LOG_FILE=logs/gateway.log

Running the Gateway

# Load environment variables
set -a && source .env && set +a
# Start gateway
make run
# Or manually:
PYTHONPATH=src python3 -m voicelive_sip_gateway.gateway.main

Expected output (example):

2025-11-27 10:15:32 [info ] voicelive.connected endpoint=wss://…
2025-11-27 10:15:32 [info ] sip.transport_created port=5060
2025-11-27 10:15:32 [info ] sip.agent_started address=127.0.0.1 port=5060

Connecting a SIP Softphone

Free SIP clients you can use:

Windows/macOS: MicroSIP, Zoiper
Linux: Linphone
iOS/Android: Linphone, Zoiper

Configure the softphone (example):

Account Settings:
- Username: test
- Domain: 127.0.0.1
- Port: 5060
- Transport: UDP
- Registration: Disabled (direct connection)

To call: sip:test@127.0.0.1:5060

Expected call flow:

Dial sip:test@127.0.0.1:5060
Gateway answers with 200 OK
Hear AI greeting: “Thank you for calling. How can I help you today?”
Speak your question
Hear AI-generated response

Latency Budget

Component	Typical Latency	Notes
Network (caller → gateway)	10-50ms	Depends on ISP, distance
SIP/RTP processing	<5ms	pjproject is highly optimized
Audio resampling	<2ms	scipy.signal.resample_poly is efficient
WebSocket (gateway → Azure)	20-80ms	Depends on region, network
Voice Live processing	200-500ms	STT + LLM inference + TTS
Total round-trip	250-650ms	Perceived as near real-time
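
As a quick sanity check, the rows can be summed directly. The sketch below treats the "<5ms" and "<2ms" entries as ranges starting at zero, which is an interpretation rather than something the table states.

```python
# (low, high) latency in milliseconds for each row of the table above
BUDGET_MS = {
    "network (caller -> gateway)": (10, 50),
    "SIP/RTP processing": (0, 5),
    "audio resampling": (0, 2),
    "WebSocket (gateway -> Azure)": (20, 80),
    "Voice Live processing": (200, 500),
}


def total_range(budget: dict) -> tuple:
    """Sum the lower and upper bounds of all components."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low_ms, high_ms = total_range(BUDGET_MS)
```

The components add up to 230 to 637 ms, so the 250-650 ms total in the table reads as a rounded figure. Voice Live processing dominates either way, which is why the network-side tuning that follows can trim only a small part of the total.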

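Optimization tips:

Deploy the gateway in the same Azure region as the Voice Live resource whenever possible
Enable expedited routing on the Audiocodes SBC (and turn off unnecessary media anchoring)
Minimize SIP hops: in simple scenarios, route SBC → Gateway directly (skipping Asterisk)
Monitor queue depth: log a warning when _inbound_queue or _outbound_queue exceeds 10 items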
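
The queue-depth tip is easy to wire up, and it is easier to act on when expressed as buffered time. The sketch below assumes every queue item is one 20 ms frame, which holds for the outbound queue shown earlier but is only an assumption for the inbound one. check_queue_depth is a hypothetical helper, and it uses the standard logging module rather than the project's structlog setup.

```python
import asyncio
import logging

FRAME_MS = 20    # assumed duration of one queued frame
MAX_DEPTH = 10   # threshold suggested in the tips above


def check_queue_depth(name: str, queue: asyncio.Queue, logger: logging.Logger) -> int:
    """Warn if the queue is too deep; return the backlog in milliseconds."""
    depth = queue.qsize()
    backlog_ms = depth * FRAME_MS
    if depth > MAX_DEPTH:
        logger.warning("%s depth=%d (~%d ms of buffered audio)", name, depth, backlog_ms)
    return backlog_ms
```

At the suggested threshold of 10 items the backlog is already about 200 ms, which is the same order of magnitude as the lower bound of the Voice Live processing time.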

Key Log Events

Event	Meaning	Action
sip.agent_started	Gateway listening	✅ Normal
sip.incoming_call	New call received	✅ Normal
sip.media_active	Audio bridge established	✅ Normal
voicelive.connected	WebSocket connected	✅ Normal
voicelive.audio_delta	AI audio chunk received	✅ Normal
voicelive.event_error	Voice Live API error	⚠️ Check API key, quota
media.enqueue_failed	Audio queue full	⚠️ CPU overload or slow network
sip.thread_error	pjsua crash	🔴 Restart gateway

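Common Issues

1. No audio from the caller

Symptom: the AI does not respond to what you say

Diagnosis:

# Check RTP packets arriving
sudo tcpdump -i any -n udp portrange 10000-11000

Resolution:

Confirm that MEDIA_ADDRESS matches an IP address at which the gateway is reachable
Check the firewall (allow UDP 10000-11000)
Confirm the Asterisk configuration: direct_media=no and rtp_symmetric=yes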

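2. Choppy audio

Symptom: the voice sounds robotic, with obvious packet loss

Diagnosis: check the queue depth in the logs

Resolution:

Add CPU resources
Lower VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS to shorten generation time
Check network jitter (target <30 ms)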
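
To put a number on jitter from a packet capture, one option is the interarrival jitter estimator defined in RFC 3550, which smooths the change in transit time between consecutive packets with a gain of 1/16. The sketch below works in milliseconds rather than RTP timestamp units and expects (arrival_time_ms, rtp_timestamp) pairs that you would extract yourself; the function name is made up for this example.

```python
def interarrival_jitter_ms(packets, clock_rate_hz: int = 8000) -> float:
    """Estimate RTP interarrival jitter (RFC 3550 style) in milliseconds.

    packets: iterable of (arrival_time_ms, rtp_timestamp) pairs in arrival order.
    """
    jitter = 0.0
    prev_transit = None
    for arrival_ms, rtp_ts in packets:
        # Relative transit time: arrival clock minus the sender's media clock.
        transit = arrival_ms - rtp_ts * 1000.0 / clock_rate_hz
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


# Perfectly paced 20 ms packets (160 timestamp ticks each at 8 kHz).
steady = [(i * 20, i * 160) for i in range(50)]
# The third packet arrives 16 ms late, then the stream recovers.
one_late = [(0, 0), (20, 160), (56, 320), (60, 480)]
```

A perfectly paced stream yields zero, while a single late packet moves the estimate only slightly, so a sustained value near the 30 ms target points to a persistent network problem rather than a one-off delay.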

3. SIP registration fails

Symptom: running pjsip show endpoints on Asterisk shows Unreachable

Diagnosis:

# On Asterisk
pjsip set logger on
# Watch for REGISTER attempts

Resolution:

Verify that AUTH_USER / AUTH_PASSWORD / AUTH_REALM match the Asterisk pjsip.conf
Check network connectivity: ping asterisk.example.com
Confirm REGISTER_WITH_SIP_SERVER=true

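Conclusion

This solution provides a reference implementation of a production-grade bridge: a stable, low-latency, bidirectional media channel between the telephony world and Azure's AI Voice Live service. The results highlighted in the article include:

✅ Real-time audio transcoding: <5 ms of added overhead
✅ Robust SIP stack: built on the industrial-grade pjproject
✅ Asynchronous architecture: room for higher concurrency
✅ Enterprise deployment topology: Audiocodes SBC + Asterisk PBX
✅ Observability: structured logging across the whole path

Next Steps

Possible enhancements:

Multi-call support: refactor so that a single instance handles multiple concurrent calls
DTMF: implement RFC 2833 (since superseded by RFC 4733) for IVR scenarios
Call transfer: support SIP REFER to hand the caller over to a human agent
Recording and compliance: write conversations to Azure Blob Storage
Metrics: export Prometheus metrics (duration, error rate, latency percentiles)
Kubernetes: use a Helm chart to autoscale the gateway
TLS/SRTP: encrypt SIP signaling and RTP media end to end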

References

Azure Voice Live Documentation: https://learn.microsoft.com/azure/ai-services/speech-service/voice-live?WT.mc_id=AI-MVP-5003172

Get the Code

The full source code is on GitHub:

https://github.com/monuminu/azure-voicelive-sip-python?WT.mc_id=AI-MVP-5003172
