原文链接:https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/from-zero-to-hero-building-a-production-ready-sip-gateway-for-azure-voice-live/4473405?WT.mc_id=AI-MVP-5003172

题图来源:Microsoft Tech Community / Microsoft Foundry Blog

引言

语音技术正在重塑人和机器的交互方式,让与 AI 的对话比以往更自然。随着 Voice Live API 的 public beta 发布,开发者拥有了构建低延迟、多模态语音体验的工具,在应用中可以做出很多新玩法。

过去,想做一个语音机器人,往往需要把多个模型“串起来”:比如用 ASR(自动语音识别)模型(像 Whisper)做转写、再用文本模型做推理、最后用 TTS(文本转语音)模型生成语音输出。这条链路通常会带来明显延迟,并且在情感表达等细节上会有损失。

Voice Live API 的关键变化在于:它把这些能力整合到一次 API 调用里。开发者通过建立持久的 WebSocket 连接,就能直接流式传输输入/输出音频,显著降低延迟并提升对话的自然度。同时,API 还支持 function calling,让语音机器人可以在对话过程中即时执行动作(例如下单、查询客户信息等)。

这篇文章从第三方技术博主视角,梳理 Microsoft 团队给出的一个完整方案:一个基于 Python 的 SIP 网关,用来把传统电话/语音系统(SIP/RTP)与 Azure 的 Voice Live 实时对话 API(WebSocket)连接起来。借助这个网关,任何 SIP 端点(座机、软电话、甚至经由 PSTN 转入的来电)都可以与 AI 进行自然的语音对话。

读完本文,你将理解:

  • 生产级 SIP-to-WebSocket 网关的整体架构设计
  • 音频转码与重采样策略(保证媒体格式无缝转换)
  • 真实部署拓扑:云上电话系统与本地企业系统(如 Avaya、Genesys)对接
  • 本地测试与企业生产环境的逐步配置指南

架构概览

高层设计

Voice Live API 的工作方式

传统语音助手往往需要链路式组合:ASR(转写)+ 文本模型(推理)+ TTS(合成)。多步骤不仅增加延迟,也更容易丢失对话的“情绪/语气”细节。

Voice Live API 通过一个持续的 WebSocket 连接把这些能力合在一起:可以实时流式输入音频并得到音频输出,从而显著降低端到端延迟,并增强对话的自然感。此外,function calling 让语音机器人可以在对话中主动触发业务动作。

本文的网关在 SIP/RTP(电话语音世界)与 Azure Voice Live 的 WebSocket 实时 API 之间充当 双向媒体代理(bidirectional media proxy):既翻译信令协议,也转换音频格式,从而让传统 VoIP 基础设施与现代 AI 对话代理实现“无感集成”。

关键设计原则

  1. 异步优先(Asynchronous-First):基于 asyncio 构建非阻塞 I/O,追求低延迟与高并发潜力
  2. 关注点分离(Separation of Concerns):SIP、媒体、Voice Live 集成分层模块化
  3. 生产级错误处理:音频队列下溢时,优雅降级(注入静音帧)而不是抖动/掉话
  4. 结构化日志:使用 structlog 输出可机器解析、带上下文的日志,便于可观测性
  5. 类型安全:pydantic 校验配置 + mypy 静态检查

核心组件

SBC(会话边界控制器)的关键作用:

  • 终止运营商侧 SIP,并向 Azure SIP/RTC 端点进行互通
  • 规范化 SIP Header、移除不支持的选项、并做编解码映射
  • 提供安全能力:TLS 信令、SRTP 媒体、拓扑隐藏、ACL
  • 可选做媒体锚定(media anchoring):用于合规录音、QoS 平滑、合法监听等

项目结构

src/voicelive_sip_gateway/
├── config/
│   ├── __init__.py
│   └── settings.py          # Pydantic-based configuration
├── gateway/
│   └── main.py              # Application entry point & lifecycle
├── logging/
│   ├── __init__.py
│   └── setup.py             # Structlog configuration
├── media/
│   ├── __init__.py
│   ├── stream_bridge.py     # Bidirectional audio queue manager
│   └── transcode.py         # μ-law ↔ PCM16 + resampling
├── sip/
│   ├── __init__.py
│   ├── agent.py             # pjsua2 wrapper & call handling
│   ├── rtp.py               # RTP utilities
│   └── sdp.py               # SDP parsing/generation
└── voicelive/
    ├── __init__.py
    ├── client.py            # Azure Voice Live SDK wrapper
    └── events.py            # Event type mapping

呼叫流程(Call Flow)

  1. 客户拨打你的号码
  2. Audiocodes SBC 从 PSTN 收到呼叫,并向 Asterisk 转发 SIP INVITE
  3. Asterisk 的路由逻辑把 INVITE 转到 voicelive-bot@gateway.example.com
  4. Voice Live Gateway 以 SIP 终端方式向 Asterisk 注册并接听
  5. RTP 音频流:Caller ↔ SBC ↔ Asterisk ↔ Gateway ↔ Azure

音频流水线:从 μ-law 到 PCM16

挑战

传统电话语音常用 G.711 μ-law 编码,采样率 8 kHz(带宽效率高)。而 Azure Voice Live 期望的是 PCM16(16-bit 线性 PCM),采样率 24 kHz。因此网关需要在实时链路中完成:

  • 编解码转换(μ-law ↔ PCM16)
  • 采样率转换(8 kHz ↔ 24 kHz)

并且要尽可能低延迟。

音频流示意

Caller (SIP)                    Gateway                     Azure Voice Live
─────────────                   ───────                     ────────────────
    │                              │                               │
    │ RTP: μ-law 8kHz              │                               │
    ├──────────────────────────────►                               │
    │                              │                               │
    │                         ┌────▼────┐                          │
    │                         │ pjsua2  │ (decodes to PCM16 8kHz)  │
    │                         └────┬────┘                          │
    │                              │                               │
    │                         ┌────▼────────┐                      │
    │                         │ Resample    │ (8kHz → 24kHz)       │
    │                         │ PCM16       │                      │
    │                         └────┬────────┘                      │
    │                              │                               │
    │                              ├───────────────────────────────►
    │                              │   WebSocket: PCM16 24kHz      │
    │                              │                               │
    │                              │◄──────────────────────────────┤
    │                              │   Response: PCM16 24kHz       │
    │                              │                               │
    │                         ┌────▼────────┐                      │
    │                         │ Resample    │ (24kHz → 8kHz)       │
    │                         │ PCM16       │                      │
    │                         └────┬────────┘                      │
    │                              │                               │
    │                         ┌────▼────┐                          │
    │◄────────────────────────┤ pjsua2  │ (encodes to μ-law)       │
    │  RTP: μ-law 8kHz        └─────────┘                          │

下面挑几个关键代码点,帮助理解端到端的音频流。

音频流桥接:stream_bridge.py

AudioStreamBridge 通过 asyncio 队列来编排双向音频流:

class AudioStreamBridge:
    """Bidirectional audio pump between SIP (μ-law) and Voice Live (PCM16 24kHz)."""

    VOICELIVE_SAMPLE_RATE = 24000
    SIP_SAMPLE_RATE = 8000

    def __init__(self, settings: Settings):
        self._inbound_queue: asyncio.Queue[bytes] = asyncio.Queue()   # SIP → Voice Live
        self._outbound_queue: asyncio.Queue[bytes] = asyncio.Queue()  # Voice Live → SIP

入站链路(Caller → AI)

async def _flush(self) -> None:
    """Process inbound audio: PCM16 8kHz from SIP → PCM16 24kHz to Voice Live."""
    while True:
        pcm16_8k = await self._inbound_queue.get()
        pcm16_24k = resample_pcm16(pcm16_8k, self.SIP_SAMPLE_RATE, self.VOICELIVE_SAMPLE_RATE)
        if self._voicelive_client:
            await self._voicelive_client.send_audio_chunk(pcm16_24k)

出站链路(AI → Caller)

async def emit_audio_to_sip(self, pcm_chunk: bytes) -> None:
    """Resample Voice Live audio down to 8 kHz PCM frames for SIP playback."""
    pcm_8k = resample_pcm16(pcm_chunk, self.VOICELIVE_SAMPLE_RATE, self.SIP_SAMPLE_RATE)

    # Split into 20ms frames (160 samples @ 8kHz = 320 bytes)
    frame_size_bytes = 320
    for offset in range(0, len(pcm_8k), frame_size_bytes):
        frame = pcm_8k[offset : offset + frame_size_bytes]
        if frame:
            await self._outbound_queue.put(frame)

帧时序(Frame Timing):VoIP 常用 20ms 帧(ptime=20)。在 8 kHz 下:8000 samples/sec × 0.020 sec = 160 samples = 320 bytes(PCM16)。

使用 pjsua2 处理 SIP 信令

为什么选 pjsua2?

pjproject 基本算是 SIP/RTP 的“工业标准”,很多商业产品(包括 Asterisk)都在用。pjsua2 提供的 API 具备:

  • 完整 SIP 协议栈(INVITE、ACK、BYE、REGISTER 等)
  • RTP/RTCP 媒体处理
  • 内置多种编解码(G.711、G.722、Opus 等)
  • NAT 穿透(STUN/TURN/ICE)
  • 线程安全的 C++ API + Python bindings

自定义媒体端口(Custom Media Ports)

为了把 pjsua2 的媒体管线与 asyncio 队列桥接起来,需要实现自定义的 AudioMediaPort 子类。

接收来电音频(Caller → Voice Live)

def onFrameReceived(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it receives audio from caller (to Voice Live)."""
    if self._direction == "to_voicelive" and self._loop:
        if frame.type == pj.PJMEDIA_FRAME_TYPE_AUDIO and frame.buf:
            try:
                asyncio.run_coroutine_threadsafe(
                    self._bridge.enqueue_sip_audio(bytes(frame.buf)),
                    self._loop
                )
            except Exception as e:
                self._logger.warning("media.enqueue_failed", error=str(e))

线程模型注意点:pjsua2 的事件循环运行在专用线程里,示例用 asyncio.run_coroutine_threadsafe() 把音频数据安全地投递回主 asyncio loop。

向来电方发送音频(Voice Live → Caller)

def onFrameRequested(self, frame: pj.MediaFrame) -> None:
    """Called by pjsua when it needs audio to send to caller (from Voice Live)."""
    if self._direction == "from_voicelive" and self._loop:
        try:
            future = asyncio.run_coroutine_threadsafe(
                self._bridge.dequeue_sip_audio_nonblocking(),
                self._loop
            )
            pcm_data = future.result(timeout=0.050)

            # Ensure exactly 320 bytes (160 samples @ 8kHz)
            if len(pcm_data) != 320:
                pcm_data = (pcm_data + b'\x00' * 320)[:320]

            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(pcm_data)
            frame.size = len(pcm_data)
        except Exception:
            # Return silence on timeout/error
            frame.type = pj.PJMEDIA_FRAME_TYPE_AUDIO
            frame.buf = pj.ByteVector(b'\x00' * 320)
            frame.size = 320

优雅降级(Graceful Degradation):如果出站队列为空(AI 还没生成音频),就注入静音帧,避免 RTP 抖动/断续。

呼叫状态管理(Call State Management)

class GatewayCall(pj.Call):
    """Handles SIP call lifecycle and connects media bridge."""

    def onCallState(self, prm: pj.OnCallStateParam) -> None:
        ci = self.getInfo()
        self._logger.info("sip.call_state", remote_uri=ci.remoteUri, state=ci.stateText)

        if ci.state == pj.PJSIP_INV_STATE_DISCONNECTED:
            self._cleanup()
            self._account.current_call = None

    def onCallMediaState(self, prm: pj.OnCallMediaStateParam) -> None:
        ci = self.getInfo()
        for mi in ci.media:
            if mi.type == pj.PJMEDIA_TYPE_AUDIO and mi.status == pj.PJSUA_CALL_MEDIA_ACTIVE:
                media = self.getMedia(mi.index)
                aud_media = pj.AudioMedia.typecastFromMedia(media)

                # Create bidirectional media bridge
                self._to_voicelive_port = CustomAudioMediaPort(self._bridge, "to_voicelive", self._logger)
                self._from_voicelive_port = CustomAudioMediaPort(self._bridge, "from_voicelive", self._logger)

                # Connect: Caller → to_voicelive_port → Voice Live
                aud_media.startTransmit(self._to_voicelive_port)
                # Connect: Voice Live → from_voicelive_port → Caller
                self._from_voicelive_port.startTransmit(aud_media)

                # Start AI conversation with greeting
                asyncio.run_coroutine_threadsafe(
                    self._voicelive_client.request_response(interrupt=False),
                    self._loop
                )

账户注册(Account Registration)

在与 Asterisk 组合的生产部署中,网关需要像标准 SIP 终端一样向 PBX 注册:

class SipAgent:
    def _run_pjsua_thread(self, loop: asyncio.AbstractEventLoop) -> None:
        self._ep = pj.Endpoint()
        self._ep.libCreate()
        self._ep.libInit(ep_cfg)
        self._ep.transportCreate(pj.PJSIP_TRANSPORT_UDP, transport_cfg)
        self._ep.libStart()

        self._account = GatewayAccount(self._logger, self._bridge, self._voicelive_client, loop)

        if self._settings.sip.register_with_server and self._settings.sip.server:
            acc_cfg = pj.AccountConfig()
            acc_cfg.idUri = f"sip:{self._settings.sip.user}@{self._settings.sip.server}"
            acc_cfg.regConfig.registrarUri = f"sip:{self._settings.sip.server}"

            # Digest authentication credentials
            cred = pj.AuthCredInfo()
            cred.scheme = "digest"
            cred.realm = self._settings.sip.auth_realm or "*"
            cred.username = self._settings.sip.auth_user
            cred.data = self._settings.sip.auth_password
            cred.dataType = pj.PJSIP_CRED_DATA_PLAIN_PASSWD
            acc_cfg.sipConfig.authCreds.append(cred)

            self._account.create(acc_cfg)

Voice Live 集成

Azure Voice Live 概览

Azure Voice Live 是一个 实时、双向的对话式 AI 服务,把下面能力整合到一起:

  • GPT-4o Realtime Preview:面向口语对话优化的超低延迟语言模型
  • 流式语音识别:连续转写,并提供词级时间戳
  • 神经 TTS:更自然的合成声音,并具备情感表达
  • 服务端 VAD:通过语音活动检测(VAD)做自动轮次切换,无需额外提示词

客户端实现:client.py

class VoiceLiveClient:
    """Manages lifecycle of an Azure Voice Live WebSocket connection."""

    async def connect(self) -> None:
        if self._settings.azure.api_key:
            credential = AzureKeyCredential(self._settings.azure.api_key)
        else:
            # Use AAD authentication (Managed Identity or Service Principal)
            self._aad_credential = DefaultAzureCredential()
            credential = await self._aad_credential.__aenter__()

        self._connection_cm = connect(
            endpoint=self._settings.azure.endpoint,
            credential=credential,
            model=self._settings.azure.model,
        )
        self._connection = await self._connection_cm.__aenter__()

        # Configure session parameters
        session = RequestSession(
            model="gpt-4o",
            modalities=[Modality.TEXT, Modality.AUDIO],
            instructions=self._settings.azure.instructions,
            input_audio_format=InputAudioFormat.PCM16,
            output_audio_format=OutputAudioFormat.PCM16,
            input_audio_transcription=AudioInputTranscriptionOptions(model="azure-speech"),
            turn_detection=ServerVad(
                threshold=0.5,
                prefix_padding_ms=200,
                silence_duration_ms=400
            ),
            voice=AzureStandardVoice(name=self._settings.azure.voice)
        )
        await self._connection.session.update(session=session)

关键配置点

  • turn_detection=ServerVad(...):由 Azure 判断用户何时停止说话,并自动触发 AI 生成响应;无需唤醒词或显式提示
  • prefix_padding_ms=200:保留语音检测前 200ms 音频,避免截断过硬
  • silence_duration_ms=400:检测到 400ms 静音后认定轮次结束

流式响应音频:AI 生成的语音会以 RESPONSE_AUDIO_DELTA 事件分片到达(base64 编码的 PCM16 chunks)。网关解码后立即通过音频桥接下发,实现低延迟播放。

生产部署:Audiocodes + Asterisk

为什么选择这种拓扑?

组件 角色 收益
Audiocodes SBC 会话边界控制器 - NAT/防火墙穿透 - 安全(DOS 防护、加密) - 协议规范化 - 拓扑隐藏 - 媒体锚定(可选转码)
Asterisk PBX SIP 服务器 - 呼叫路由与 IVR - 用户目录与鉴权 - 转接/保持/会议等高级呼叫能力 - CDR/分析 - 与企业电话系统集成
Voice Live Gateway AI 对话端点 - 实时 AI 对话 - 自然语言理解 - 动态响应生成 - 多语言支持

网络架构

                    Internet
                       │
                       │ SIP/RTP
                       ▼
         ┌─────────────────────────┐
         │   Audiocodes SBC        │
         │   Public IP: X.X.X.X    │
         │   Ports: 5060, 10000+   │
         └────────────┬────────────┘
                      │ Private Network
         ┌────────────▼────────────┐
         │   Asterisk PBX          │
         │   Internal: 10.0.1.10   │
         │   Port: 5060            │
         └────────────┬────────────┘
                      │
         ┌────────────▼────────────┐
         │  Voice Live Gateway     │
         │  Internal: 10.0.1.20    │
         │  Port: 5060             │
         │  ┌───────────────────┐  │
         │  │ Outbound HTTPS    │  │
         │  │ to Azure          │  │
         │  └─────────┬─────────┘  │
         └────────────┼────────────┘
                      │ Internet (HTTPS/WSS)
                      ▼
         ┌────────────────────────┐
         │  Azure Voice Live API  │
         │  *.cognitiveservices   │
         └────────────────────────┘

Audiocodes SBC 配置

1. 到 Asterisk 的 SIP Trunk(IP Group)

IP Group Settings:
- Name: Asterisk-Trunk
- Type: Server
- SIP Group Name: asterisk.internal.example.com
- Media Realm: Private
- Proxy Set: Asterisk-ProxySet
- Classification: Classify by Proxy Set
- SBC Operation Mode: SBC-Only
- Topology Location: Internal Network

2. Asterisk 的 Proxy Set

Proxy Set Name: Asterisk-ProxySet
Proxy Address: 10.0.1.10:5060
Transport Type: UDP
Load Balancing Method: Parking Lot

3. IP-to-IP 路由规则

Rule Name: PSTN-to-Gateway
Source IP Group: PSTN-Trunk
Destination IP Group: Asterisk-Trunk
Call Trigger: Any
Destination Prefix Manipulation: None
Message Manipulation: None

4. 媒体设置

Media Realm: Private
IPv4 Interface: LAN1 (10.0.1.1)
Media Security: None (or SRTP if required)
Codec Preference Order: G711Ulaw, G711Alaw
Transcoding: Disabled (pass-through)
RTP Port Range: 10000-20000

Asterisk 配置

/etc/asterisk/pjsip.conf

;=====================================
; Transport Configuration
;=====================================
[transport-udp]
type=transport
protocol=udp
bind=0.0.0.0:5060
local_net=10.0.0.0/8

;=====================================
; Voice Live Gateway Endpoint
;=====================================
[voicelive-gateway]
type=endpoint
context=voicelive-routing
aors=voicelive-gateway
auth=voicelive-auth
disallow=all
allow=ulaw
allow=alaw
direct_media=no
force_rport=yes
rewrite_contact=yes
rtp_symmetric=yes
ice_support=no

[voicelive-gateway]
type=aor
contact=sip:10.0.1.20:5060
qualify_frequency=30

[voicelive-auth]
type=auth
auth_type=userpass
username=
password=
realm=sip.example.com

;=====================================
; SBC Trunk (for inbound calls)
;=====================================
[audiocodes-sbc]
type=endpoint
context=from-sbc
aors=audiocodes-sbc
disallow=all
allow=ulaw
allow=alaw

[audiocodes-sbc]
type=aor
contact=sip:10.0.1.1:5060

/etc/asterisk/extensions.conf

;=====================================
; Incoming calls from SBC
;=====================================
[from-sbc]
; Example: Route calls to 800-AI-VOICE to the gateway
exten => 8002486423,1,NoOp(Routing to Voice Live Gateway)
 same => n,Set(CALLERID(name)=AI Assistant)
 same => n,Dial(PJSIP/voicelive-bot@voicelive-gateway,30)
 same => n,Hangup()

; Default handler for unmatched numbers
exten => _X.,1,NoOp(Unrouted call: ${EXTEN})
 same => n,Playback(invalid)
 same => n,Hangup()

;=====================================
; Voice Live Gateway routing context
;=====================================
[voicelive-routing]
exten => _X.,1,NoOp(Call from gateway: ${CALLERID(num)})
 same => n,Hangup()

网关配置

创建 /opt/voicelive-gateway/.env

#=====================================
# SIP Configuration
#=====================================
SIP_SERVER=asterisk.internal.example.com
SIP_PORT=5060
SIP_USER=voicelive-bot@sip.example.com
AUTH_USER=voicelive-bot
AUTH_REALM=sip.example.com
AUTH_PASSWORD=
REGISTER_WITH_SIP_SERVER=true
DISPLAY_NAME=Voice Live Bot

#=====================================
# Network Configuration
#=====================================
SIP_LOCAL_ADDRESS=0.0.0.0
SIP_VIA_ADDR=10.0.1.20
MEDIA_ADDRESS=10.0.1.20
MEDIA_PORT=10000
MEDIA_PORT_COUNT=1000

#=====================================
# Azure Voice Live
#=====================================
AZURE_VOICELIVE_ENDPOINT=wss://xxxxxx.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=
VOICE_LIVE_MODEL=gpt-4o
VOICE_LIVE_VOICE=en-US-AvaNeural
VOICE_LIVE_INSTRUCTIONS=You are a helpful customer service assistant for Contoso Inc. Answer questions about account balances, order status, and general inquiries. Be friendly and concise.
VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS=200
VOICE_LIVE_PROACTIVE_GREETING_ENABLED=true
VOICE_LIVE_PROACTIVE_GREETING=Thank you for calling Contoso customer service. How can I help you today?

#=====================================
# Logging
#=====================================
LOG_LEVEL=INFO
VOICE_LIVE_LOG_FILE=/var/log/voicelive-gateway/gateway.log

Systemd 服务(Linux)

创建 /etc/systemd/system/voicelive-gateway.service

[Unit]
Description=Voice Live SIP Gateway
After=network.target

[Service]
Type=simple
User=voicelive
Group=voicelive
WorkingDirectory=/opt/voicelive-gateway
EnvironmentFile=/opt/voicelive-gateway/.env
ExecStart=/opt/voicelive-gateway/.venv/bin/python3 -m voicelive_sip_gateway.gateway.main
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/log/voicelive-gateway

[Install]
WantedBy=multi-user.target

启用并启动:

sudo systemctl daemon-reload sudo systemctl enable voicelive-gateway sudo systemctl start voicelive-gateway sudo systemctl status voicelive-gateway

配置指南

环境变量参考

Variable Description Example Required
Azure Voice Live
AZURE_VOICELIVE_ENDPOINT WebSocket endpoint URL wss://myresource.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY API key (or use AAD) abc123… ✅*
VOICE_LIVE_MODEL Model identifier gpt-4o ❌ (default: gpt-4o)
VOICE_LIVE_VOICE TTS voice name en-US-AvaNeural
VOICE_LIVE_INSTRUCTIONS System prompt You are a helpful assistant
VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS Max tokens per response 200
VOICE_LIVE_PROACTIVE_GREETING_ENABLED Enable greeting on connect true
VOICE_LIVE_PROACTIVE_GREETING Greeting message Hello! How can I help?
SIP Configuration
SIP_SERVER Asterisk/PBX hostname asterisk.example.com ✅**
SIP_PORT SIP port 5060 ❌ (default: 5060)
SIP_USER SIP user URI bot@sip.example.com ✅**
AUTH_USER Auth username bot ✅**
AUTH_REALM Auth realm sip.example.com ✅**
AUTH_PASSWORD Auth password securepass ✅**
REGISTER_WITH_SIP_SERVER Enable registration true ❌ (default: false)
DISPLAY_NAME Caller ID name Voice Live Bot
Network Settings
SIP_LOCAL_ADDRESS Local bind address 0.0.0.0 ❌ (default: 127.0.0.1)
SIP_VIA_ADDR Via header IP 10.0.1.20
MEDIA_ADDRESS RTP bind address 10.0.1.20
MEDIA_PORT RTP port range start 10000
MEDIA_PORT_COUNT RTP port range count 1000
Logging
LOG_LEVEL Log verbosity INFO ❌ (default: INFO)
VOICE_LIVE_LOG_FILE Log file path logs/gateway.log

*Either AZURE_VOICELIVE_API_KEY or AAD environment variables (AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_CLIENT_SECRET) are required.

**Required only when REGISTER_WITH_SIP_SERVER=true.

本地测试搭建(Local Testing Setup)

如果你想不依赖 SBC/PBX 基础设施进行快速验证,可以按下面步骤做本地测试。

前置条件(Prerequisites)

  1. 安装 pjproject 及 Python bindings

git clone https://github.com/pjsip/pjproject.git?WT.mc_id=AI-MVP-5003172 cd pjproject ./configure CFLAGS="-O2 -DNDEBUG" && make dep && make cd pjsip-apps/src/swig make export PYTHONPATH=$PWD:$PYTHONPATH
  1. 安装网关依赖
cd /path/to/azure-voicelive-sip-python python3 -m venv .venv source .venv/bin/activate pip install -e .[dev]
  1. 安装 PortAudio(可选:用于本机扬声器播放):
# macOS brew install portaudio # Ubuntu/Debian sudo apt-get install portaudio19-dev

本地配置

创建 .env

# No SIP server - direct connection
REGISTER_WITH_SIP_SERVER=false
SIP_LOCAL_ADDRESS=127.0.0.1
SIP_VIA_ADDR=127.0.0.1
MEDIA_ADDRESS=127.0.0.1

# Azure Voice Live
AZURE_VOICELIVE_ENDPOINT=wss://your-resource.cognitiveservices.azure.com/openai/realtime
AZURE_VOICELIVE_API_KEY=your-api-key

# Logging
LOG_LEVEL=DEBUG
VOICE_LIVE_LOG_FILE=logs/gateway.log

运行网关

# Load environment variables set -a && source .env && set +a # Start gateway make run # Or manually: PYTHONPATH=src python3 -m voicelive_sip_gateway.gateway.main

期望输出(示例):

2025-11-27 10:15:32 [info ] voicelive.connected endpoint=wss://… 2025-11-27 10:15:32 [info ] sip.transport_created port=5060 2025-11-27 10:15:32 [info ] sip.agent_started address=127.0.0.1 port=5060

连接 SIP 软电话(Softphone)

可用的免费 SIP 客户端:

配置软电话(示例):

Account Settings: - Username: test - Domain: 127.0.0.1 - Port: 5060 - Transport: UDP - Registration: Disabled (direct connection) To call: sip:test@127.0.0.1:5060

预期呼叫流程

  1. Dial sip:test@127.0.0.1:5060
  2. Gateway answers with 200 OK
  3. Hear AI greeting: “Thank you for calling. How can I help you today?”
  4. Speak your question
  5. Hear AI-generated response

延迟预算(Latency Budget)

Component Typical Latency Notes
Network (caller → gateway) 10-50ms Depends on ISP, distance
SIP/RTP processing <5ms pjproject is highly optimized
Audio resampling <2ms scipy.resample_poly is efficient
WebSocket (gateway → Azure) 20-80ms Depends on region, network
Voice Live processing 200-500ms STT + LLM inference + TTS
Total round-trip 250-650ms Perceived as near real-time

优化建议

  1. 尽量把网关部署在与 Voice Live 资源相同的 Azure 区域
  2. 在 Audiocodes SBC 上启用 expedited routing(同时关闭不必要的 media anchoring)
  3. 尽量减少 SIP hop:简单场景可直接 SBC → Gateway(跳过 Asterisk)
  4. 监控队列深度:当 _inbound_queue_outbound_queue 超过 10 个 item 时输出警告

关键日志事件(Key Log Events)

Event Meaning Action
sip.agent_started Gateway listening ✅ Normal
sip.incoming_call New call received ✅ Normal
sip.media_active Audio bridge established ✅ Normal
voicelive.connected WebSocket connected ✅ Normal
voicelive.audio_delta AI audio chunk received ✅ Normal
voicelive.event_error Voice Live API error ⚠️ Check API key, quota
media.enqueue_failed Audio queue full ⚠️ CPU overload or slow network
sip.thread_error pjsua crash 🔴 Restart gateway

常见问题(Common Issues)

1. 听不到来电方音频

现象:AI 对你的讲话没有响应

诊断

Check RTP packets arriving sudo tcpdump -i any -n udp port 10000-11000

解决

  • 确认 MEDIA_ADDRESS 与网关可达 IP 一致
  • 检查防火墙(放行 UDP 10000-11000)
  • 确认 Asterisk 配置:direct_media=nortp_symmetric=yes

2. 音频断续/卡顿(Choppy audio)

现象:声音发“机械音”、有明显掉包

诊断:查看日志中的队列深度

解决

  • 增加 CPU 资源
  • 降低 VOICE_LIVE_MAX_RESPONSE_OUTPUT_TOKENS,减少生成时间
  • 检查网络抖动(目标 <30ms)

3. SIP 注册失败

现象:Asterisk 执行 pjsip show endpoints 显示 Unreachable

诊断

On Asterisk pjsip set logger on # Watch for REGISTER attempts

解决

  • 核对 AUTH_USER / AUTH_PASSWORD / AUTH_REALM 与 Asterisk pjsip.conf 一致
  • 检查网络连通性:ping asterisk.example.com
  • 确认 REGISTER_WITH_SIP_SERVER=true

结语

这套方案给出了一个“生产级桥接器”的参考实现:在电话语音世界与 Azure 的 AI Voice Live 服务之间建立稳定、低延迟的双向媒体通道。文章中强调的成果包括:

实时音频转码:额外开销 <5ms ✅ 健壮的 SIP 栈:基于工业级 pjproject ✅ 异步架构:具备更高并发潜力 ✅ 企业级部署拓扑:Audiocodes SBC + Asterisk PBX ✅ 可观测性:结构化日志贯穿全链路

下一步(Next Steps)

可以考虑的增强方向:

  1. 多通话支持:重构为单实例支持多个并发呼叫
  2. DTMF:实现 RFC 2833,用于 IVR 场景
  3. 呼叫转接:支持 SIP REFER,把来电转给人工坐席
  4. 录音与合规:将会话写入 Azure Blob Storage
  5. 指标体系:输出 Prometheus 指标(时长、错误率、延迟分位数)
  6. Kubernetes:用 Helm chart 做网关的自动扩缩
  7. TLS/SRTP:端到端加密 SIP 信令与 RTP 媒体

参考资源

获取代码(Get the Code)

完整源码在 GitHub:

https://github.com/monuminu/azure-voicelive-sip-python?WT.mc_id=AI-MVP-5003172