CosyVoice3 语音合成实战

Fun-CosyVoice3-0.5B-2512 is an open-source text-to-speech (TTS) model from Alibaba's Tongyi speech team. It offers practical features such as multilingual and multi-dialect synthesis, emotion control, and voice cloning, and its low deployment barrier makes it a good fit for beginners. This tutorial walks through the essentials: environment setup, the core features, and a simple application, so you can quickly master the basics of the model.

Prerequisites:
  1. Hardware: a GPU with at least 12 GB of VRAM is recommended; CPU inference also works without a GPU, but it is noticeably slower.
  2. Tooling: make sure Git (for cloning code and downloading models) and Anaconda (for creating an isolated environment) are installed.
  3. Course environment: server side is Ubuntu 24.04 + RTX 3060 12 GB + Python 3.10; client side is Windows/Mac + Python 3.10.

1. Using the Model

Model link: https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512

On Ubuntu 24.04:

  1. Create the virtual environment cosyvoice-env and install the required packages inside it
  2. Create the my-cosyvoice project directory and download the model files and the development package into it
# Create the project directory
mkdir my-cosyvoice && cd my-cosyvoice
# Download the development package
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# Download the model
git clone https://www.modelscope.cn/FunAudioLLM/Fun-CosyVoice3-0.5B-2512.git

# Create the virtual environment and install the development package's dependencies
conda create -n cosyvoice-env python=3.10
conda activate cosyvoice-env
pip install -r CosyVoice/requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
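
Optionally, run a quick sanity check to confirm that PyTorch is installed and can see the GPU (a minimal check; the torch version itself is pinned by requirements.txt):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"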

On Windows 11:

  1. Create a project in PyCharm: TTS Server
  2. Configure the project interpreter to point at the interpreter inside the cosyvoice-env environment on Ubuntu 24.04
  3. Write the Python code and upload it to Ubuntu 24.04 for execution

Note: if you are learning on a single machine, just run all of the steps above on your local computer.


Usage example:

# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')

# Disable all tqdm progress bars globally
import os
os.environ['TQDM_DISABLE'] = '1'

# Silence all logging output
import logging
logging.disable(logging.CRITICAL)

# Only let onnxruntime log errors and above (severity 3 = ERROR)
import onnxruntime as ort
ort.set_default_logger_severity(3)

# Make the CosyVoice package and its Matcha-TTS submodule importable
import sys
sys.path.append('CosyVoice')
sys.path.append('CosyVoice/third_party/Matcha-TTS')

from cosyvoice.cli.cosyvoice import CosyVoice3
import torchaudio

# Load the model
cosyvoice = CosyVoice3(model_dir='Fun-CosyVoice3-0.5B-2512', fp16=True)

# 1. Same-language voice cloning
def demo01():
    # Chinese reference audio in, Chinese speech out
    prompt_text = 'You are a helpful assistant.<|endofprompt|>如果你对某件事情有强烈的感觉,你应该发声并采取行动。这是我生活的哲学。'
    prompt_wav = 'audio/spk-1.wav'
    tts_text = '在这宁静的夜晚,我们可以沿着小路慢慢走,感受微风拂面的轻柔,与自然融为一体。'
    generator = cosyvoice.inference_zero_shot(tts_text=tts_text, prompt_text=prompt_text, prompt_wav=prompt_wav)
    # Fetch the synthesized speech
    output = next(generator)
    # Save the audio
    torchaudio.save('demo/demo01-1.wav', output['tts_speech'], cosyvoice.sample_rate)

    # English reference audio in, English speech out
    prompt_text = 'You are a helpful assistant.<|endofprompt|>Maintaining your ability to learn translates into increased marketability, improved career options and higher salaries.'
    prompt_wav = 'audio/spk-3.wav'
    tts_text = 'The device would work during the day as well, if you took steps to either block direct sunlight or point it away from the sun.'
    generator = cosyvoice.inference_zero_shot(tts_text=tts_text, prompt_text=prompt_text, prompt_wav=prompt_wav)
    torchaudio.save('demo/demo01-2.wav', next(generator)['tts_speech'], cosyvoice.sample_rate)


# 2. Cross-lingual voice cloning
def demo02():
    # Text + audio = precise cloning: for cross-lingual cloning, the generated speech follows
    # the pronunciation habits of the reference audio as closely as possible
    prompt_text = 'You are a helpful assistant.<|endofprompt|>Maintaining your ability to learn translates into increased marketability, improved career options and higher salaries.'
    prompt_wav = 'audio/spk-3.wav'
    tts_text = '窗前明月光,疑是地上霜,举头望明月,低头思故乡。'
    generator = cosyvoice.inference_zero_shot(tts_text=tts_text, prompt_text=prompt_text, prompt_wav=prompt_wav)
    torchaudio.save('demo/demo02-1.wav', next(generator)['tts_speech'], cosyvoice.sample_rate)

    # Audio only = loose cloning: the generated speech borrows the characteristics of the
    # reference audio, but still follows the natural pronunciation habits of the target language
    tts_text = 'You are a helpful assistant.<|endofprompt|>窗前明月光,疑是地上霜,举头望明月,低头思故乡'
    generator = cosyvoice.inference_cross_lingual(tts_text=tts_text, prompt_wav=prompt_wav)
    torchaudio.save('demo/demo02-2.wav', next(generator)['tts_speech'], cosyvoice.sample_rate)

    # Note: inference_cross_lingual is best suited to short texts.
    # inference_cross_lingual requires tts_text to be prefixed with You are a helpful assistant.<|endofprompt|>,
    # but with that prefix long texts are no longer split automatically: the whole text is fed to the model at once,
    # and an overly long input makes the generated speech suffer from mispronunciation, noise, and truncation.
    # Without the prefix, multiple audio segments are generated, but they differ noticeably from one another.
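

# A workaround sketch (not part of the original tutorial): split a long text into
# sentences ourselves, call inference_cross_lingual per sentence, and concatenate
# the resulting chunks. Assumes torch is importable (it ships with CosyVoice) and
# that every chunk comes back at the model's sample rate.
import re
import torch

def demo02_long():
    prompt_wav = 'audio/spk-3.wav'
    long_text = '窗前明月光,疑是地上霜。举头望明月,低头思故乡。'
    # Split after Chinese sentence-ending punctuation, keeping the delimiter
    sentences = [s for s in re.split(r'(?<=[。!?])', long_text) if s.strip()]
    pieces = []
    for sentence in sentences:
        generator = cosyvoice.inference_cross_lingual(
            tts_text='You are a helpful assistant.<|endofprompt|>' + sentence,
            prompt_wav=prompt_wav)
        pieces.append(next(generator)['tts_speech'])
    # Concatenate along the time axis (tensors are shaped [1, num_samples])
    torchaudio.save('demo/demo02-long.wav', torch.cat(pieces, dim=1), cosyvoice.sample_rate)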


# 3. More generation controls
def demo03():
    prompt_wav = 'audio/spk-1.wav'

    # =========== 1. Dialect ===========
    tts_text = '那座古老的城堡笼罩在神秘的雾气中,吸引着冒险者前去探索。'
    instruct_text = 'You are a helpful assistant. 请用粤语表达,并用合适的语速。<|endofprompt|>'
    generator = cosyvoice.inference_instruct2(tts_text=tts_text, instruct_text=instruct_text, prompt_wav=prompt_wav)
    torchaudio.save('demo/demo03-1.wav', next(generator)['tts_speech'], cosyvoice.sample_rate)

    # =========== 2. Speaking rate ===========
    tts_text = '那座古老的城堡笼罩在神秘的雾气中,吸引着冒险者前去探索。'
    instruct_text = 'You are a helpful assistant. 请用尽可能快的语速。<|endofprompt|>'
    generator = cosyvoice.inference_instruct2(tts_text=tts_text, instruct_text=instruct_text, prompt_wav=prompt_wav)
    torchaudio.save('demo/demo03-2.wav', next(generator)['tts_speech'], cosyvoice.sample_rate)

    # =========== 3. Emotion ===========
    tts_text = '这里一片荒凉,没有水,也没有生机,孤独感和无助让我心如刀割。'
    instruct_text = 'You are a helpful assistant. 请用非常生气的情绪说这句话。<|endofprompt|>'
    generator = cosyvoice.inference_instruct2(tts_text=tts_text, instruct_text=instruct_text, prompt_wav=prompt_wav)
    torchaudio.save('demo/demo03-3.wav', next(generator)['tts_speech'], cosyvoice.sample_rate)


# 4. Reference audio preprocessing
def demo04():
    prompt_text = 'You are a helpful assistant.<|endofprompt|>如果你对某件事情有强烈的感觉,你应该发声并采取行动。这是我生活的哲学。'
    prompt_wav = 'audio/spk-1.wav'
    cosyvoice.add_zero_shot_spk(prompt_text=prompt_text, prompt_wav=prompt_wav, zero_shot_spk_id='zh-spk')

    prompt_text = 'You are a helpful assistant.<|endofprompt|>Maintaining your ability to learn translates into increased marketability, improved career options and higher salaries.'
    prompt_wav = 'audio/spk-3.wav'
    cosyvoice.add_zero_shot_spk(prompt_text=prompt_text, prompt_wav=prompt_wav, zero_shot_spk_id='en-spk')

    cosyvoice.save_spkinfo()

    # Note: a long input text is split internally into multiple chunks, and each chunk is synthesized separately
    tts_text = '在技术领域与日常生活中,“组块” 一词都有着实用价值。它指代一个独立且易于处理的单元,既可以是人工智能模型中被拆分处理的数据组块、简化复杂逻辑的代码组块,也能是一段用于专注工作的时间组块。将庞大任务拆解为小的组块,能消解畏难情绪,把无从下手的压力转化为循序渐进的行动力。'
    generator = cosyvoice.inference_zero_shot(tts_text=tts_text, prompt_text='', prompt_wav='', zero_shot_spk_id='zh-spk')
    for idx, chunk in enumerate(generator, start=1):
        torchaudio.save(f'demo/demo04-{idx}.wav', chunk['tts_speech'], cosyvoice.sample_rate)

    # Note: inference_cross_lingual and inference_instruct2 also accept a zero_shot_spk_id parameter,
    # but the implementation has a BUG and raises an error on long texts
    tts_text = 'You are a helpful assistant.<|endofprompt|>你有没有抬头看过晴朗的天空,好奇它为什么是蓝色的?这其实和阳光与地球大气层的相互作用有关。太阳光看起来是白色的,但实际上它由多种颜色的光组成,每种颜色对应不同的波长。其中,蓝光的波长较短,容易被空气中的分子(如氮气和氧气)散射到各个方向。这种现象被称为“瑞利散射”。由于蓝光比红光、黄光等更容易散射,当我们望向天空时,眼睛接收到的就是这些被散射的蓝光,因此天空呈现出蓝色。'
    # generator = cosyvoice.inference_cross_lingual(tts_text=tts_text, prompt_wav='', zero_shot_spk_id='zh-spk')
    # for chunk in generator:
    #     pass

    tts_text = '你有没有抬头看过晴朗的天空,好奇它为什么是蓝色的?这其实和阳光与地球大气层的相互作用有关。太阳光看起来是白色的,但实际上它由多种颜色的光组成,每种颜色对应不同的波长。其中,蓝光的波长较短,容易被空气中的分子(如氮气和氧气)散射到各个方向。这种现象被称为“瑞利散射”。由于蓝光比红光、黄光等更容易散射,当我们望向天空时,眼睛接收到的就是这些被散射的蓝光,因此天空呈现出蓝色。'
    # instruct_text = 'You are a helpful assistant. 请用粤语表达,并用合适的语速。<|endofprompt|>'
    # generator = cosyvoice.inference_instruct2(tts_text=tts_text, instruct_text=instruct_text, prompt_wav='', zero_shot_spk_id='zh-spk')
    # for chunk in generator:
    #     pass


if __name__ == '__main__':
    demo01()
    demo02()
    demo03()
    demo04()

2. Application Example

2.1 Server

To deploy the model as a service, we will use two tools, FastAPI and Uvicorn:

  • FastAPI: a lightweight, efficient Python web framework built specifically for API services. It can quickly wrap model inference in callable endpoints, is easy to learn, and suits beginners deploying simple services.
  • Uvicorn: a high-performance web server for running FastAPI applications. It receives client requests and forwards them to FastAPI for handling; it is the usual companion for deploying FastAPI apps, simple to start and light on resources.

# Suppress DEBUG-level log output
import logging
logging.disable(logging.DEBUG)

# Set the onnxruntime log level to the highest (severity 4 = FATAL only)
import onnxruntime
onnxruntime.set_default_logger_severity(4)

# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')

# Disable all tqdm progress bars globally
import os
os.environ['TQDM_DISABLE'] = '1'

import sys
sys.path.append('CosyVoice')
sys.path.append('CosyVoice/third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice3

from fastapi import FastAPI
from fastapi import Form
from fastapi.responses import StreamingResponse
from fastapi.responses import JSONResponse

import uvicorn
import json

app = FastAPI()

# Load the model
cosyvoice = CosyVoice3(model_dir='Fun-CosyVoice3-0.5B-2512', fp16=True)

# Register the speaker voices listed in audio/speakers.json
speakers = json.load(open('audio/speakers.json', encoding='utf8'))
prefix = 'You are a helpful assistant.<|endofprompt|>'
available_speakers = []
for speaker in speakers:
    cosyvoice.add_zero_shot_spk(prefix + speaker['txt'], f'audio/{speaker["wav"]}', zero_shot_spk_id=speaker['sid'])
    available_speakers.append({'name': speaker['name'], 'sid': speaker['sid']})

# Stream the synthesized speech chunk by chunk as raw float32 PCM bytes
def text_to_speech(txt, spk):
    generator = cosyvoice.inference_zero_shot(txt, '', '', zero_shot_spk_id=spk)
    for chunk in generator:
        chunk = chunk['tts_speech'].numpy().squeeze()
        yield chunk.tobytes()

@app.post('/tts')
def get_audio(txt: str = Form(...), spk: str = Form(...)):
    return StreamingResponse(content=text_to_speech(txt, spk))

@app.get('/spk')
def get_speaker_list():
    return JSONResponse(content=available_speakers)

if __name__ == '__main__':
    uvicorn.run('server:app', host='0.0.0.0', port=8000)
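
The server reads its voice list from audio/speakers.json, which the tutorial does not show. Based on the fields the code accesses (name, sid, txt, wav), a hypothetical example might look like this:

[
    {"name": "Chinese voice", "sid": "zh-spk", "wav": "spk-1.wav", "txt": "如果你对某件事情有强烈的感觉,你应该发声并采取行动。这是我生活的哲学。"},
    {"name": "English voice", "sid": "en-spk", "wav": "spk-3.wav", "txt": "Maintaining your ability to learn translates into increased marketability, improved career options and higher salaries."}
]

With the server running (python server.py, matching the 'server:app' string passed to uvicorn.run), you can smoke-test it without the client. Note that /tts returns raw float32 PCM samples, not a WAV file:

curl http://localhost:8000/spk
curl -X POST http://localhost:8000/tts --data-urlencode 'txt=你好,世界' --data-urlencode 'spk=zh-spk' -o out.pcm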

2.2 Client

To make the TTS model usable without writing any code, we can turn to Gradio. Gradio is a lightweight Python visualization tool whose core purpose is to quickly build interactive web interfaces for machine learning applications. Without any complex front-end work, a few lines of Python produce a client interface with a text input box, a voice selector, and an audio player. It is handy both for debugging the model yourself and for letting non-technical users try it, which makes it an excellent choice for quickly demonstrating and shipping TTS functionality.

# Create the client environment and install its dependencies
conda create -n cosyvoice-client python=3.10
conda activate cosyvoice-client
pip install gradio numpy soundfile requests

Client code:
import warnings
warnings.filterwarnings('ignore')

import gradio as gr
import requests
import soundfile as sf
import numpy as np
import glob
from pathlib import Path
import json


# Fallback audio played when the TTS request fails
default, sample_rate = sf.read('audio/default.wav', dtype=np.float32)

def generate_audio(txt, spk):
    try:
        response = requests.post('http://192.168.5.20:8000/tts', data={'txt': txt, 'spk': spk}, timeout=60, stream=True)
        response.raise_for_status()
        # Reassemble the streamed raw float32 PCM bytes
        audio_bytes = bytearray()
        for data in response.iter_content(chunk_size=4096):
            audio_bytes.extend(data)
        audio = np.frombuffer(audio_bytes, dtype=np.float32)
        # 24000 must match the server model's sample rate (assumed 24 kHz here)
        return (24000, audio)
    except requests.RequestException:
        return (sample_rate, default)


def page_load():
    try:
        response = requests.get('http://192.168.5.20:8000/spk', timeout=10)
        response.raise_for_status()
        speakers = json.loads(response.content)  # parse the JSON body into a list of speaker dicts
        choices = [(speaker['name'], speaker['sid']) for speaker in speakers]
        return gr.update(choices=choices, value=choices[0][1])
    except requests.RequestException:
        return gr.update(choices=[('Network error', 'error')], value='error')


# Load the sample texts offered as quick-pick options
samples = {Path(fname).stem: Path(fname).read_text(encoding='utf8') for fname in glob.glob('example/*.txt')}

with gr.Blocks(title='TTS Speech Synthesis') as app:
    gr.Markdown('<center><h1> 🎙️ Text-to-Speech (TTS)</h1></center>')
    audio = gr.Audio(label='Playback', type='numpy', buttons=['download'], autoplay=True)

    with gr.Row():
        radio = gr.Radio(choices=list(samples.keys()), show_label=False, interactive=True, scale=9)
        speak = gr.Dropdown(show_label=False, scale=1)

    inputs = gr.TextArea(show_label=False)

    with gr.Row():
        clear = gr.Button('Clear', variant='secondary')
        submit = gr.Button('Generate', variant='primary')

    # Wire up the events: disable the Generate button while synthesis is in progress
    disable = lambda: gr.update(interactive=False)
    enable = lambda: gr.update(interactive=True)
    submit.click(fn=disable, outputs=submit).then(fn=generate_audio, inputs=[inputs, speak], outputs=audio).then(fn=enable, outputs=submit)

    clear.click(fn=lambda: '', outputs=inputs)
    radio.change(fn=lambda title: samples.get(title, ''), inputs=radio, outputs=inputs)

    app.load(fn=page_load, outputs=speak)

if __name__ == '__main__':
    css = '''
         .gradio-container {width: 1050px;margin: 0 auto;background-color:white;}
         .control-wrapper{display:none;}
     '''
    app.launch(theme=gr.themes.Citrus(), css=css)
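
To run the demo end to end, start the server first, then launch the client script and open the local URL that Gradio prints (http://127.0.0.1:7860 by default). The client also expects the fallback file audio/default.wav and one or more example/*.txt sample texts alongside it, and assumes the server is reachable at 192.168.5.20:8000; adjust these paths and the address to your own environment.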
