Phi-3 is a family of small language models developed by Microsoft Research. It performs well on a range of public benchmarks (for example, Phi-3-mini reaches 69% on MMLU) and supports context lengths of up to 128k tokens. Phi-3-mini (3.8B) was first released at the end of April 2024, and Phi-3-small, Phi-3-medium, and Phi-3-vision were unveiled at Microsoft Build in May.

QLoRA is an efficient fine-tuning technique that quantizes a pretrained language model to 4-bit and attaches small, trainable "low-rank adapters". It makes it possible to fine-tune models with up to 65 billion parameters on a single GPU, while matching the performance of full-precision fine-tuning and achieving strong results on language tasks.
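
To make this concrete, here is a minimal sketch of a QLoRA setup with the Hugging Face transformers and peft libraries: the frozen base model is loaded with 4-bit (NF4) quantization and a small LoRA adapter is attached, so only the adapter weights are trained. The checkpoint name and hyperparameters below are illustrative.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base model to 4-bit (NF4), computing in bf16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",   # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# Attach small, trainable low-rank adapters; only these weights are updated
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM", target_modules="all-linear")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of all parameters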

This article shows how to fine-tune an efficiently quantized LLM with QLoRA and Flash Attention on Azure ML.

Preparation

First, download the ultrachat dataset.

from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split='train_sft[:2%]')

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

We take a small slice of the dataset to create training and test examples. To train the model, the structured chat examples need to be converted into plain-text strings in an instruction format; the apply_chat_template function defined in the training script below takes a sample and returns a string in the required format. First, split the data and save it to disk.

dataset = dataset.train_test_split(test_size=0.2)
train_dataset = dataset['train']
train_dataset.to_json("data/train.jsonl")
test_dataset = dataset['test']
test_dataset.to_json("data/eval.jsonl")

The training and test splits are saved as JSON Lines files. Now add the Azure ML SDK imports.

# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component
from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes

Create the workspace client.

credential = DefaultAzureCredential()
workspace_ml_client = None
try:
    workspace_ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    subscription_id= "Enter your subscription_id"
    resource_group = "Enter your resource_group"
    workspace= "Enter your workspace name"
    workspace_ml_client = MLClient(credential, subscription_id, 
                                     resource_group, workspace)

Create a custom training environment.

from azure.ai.ml.entities import Environment, BuildContext
env_docker_image = Environment(
    image="mcr.microsoft.com/azureml/curated/acft-hf-nlp-gpu:latest",
    conda_file="environment/conda.yml",
    name="llm-training",
    description="Environment created for llm training.",
)
workspace_ml_client.environments.create_or_update(env_docker_image)

Let's take a look at the conda.yml.

name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip=24.0
  - pip:
    - bitsandbytes==0.43.1
    - transformers~=4.41
    - peft~=0.11
    - accelerate~=0.30
    - trl==0.8.6
    - einops==0.8.0
    - datasets==2.19.1
    - wandb==0.17.0
    - mlflow==2.13.0
    - azureml-mlflow==1.56.0 
    - torchvision==0.18.0

Training

We will use QLoRA, the approach proposed by Tim Dettmers et al. in the paper "QLoRA: Efficient Finetuning of Quantized LLMs". QLoRA is a technique that reduces the memory footprint of fine-tuning large language models without degrading performance.

At a high level, QLoRA works as follows:

  • Quantize the pretrained model to 4-bit precision and freeze its parameters.
  • Add small, trainable adapter layers (LoRA).
  • Fine-tune only the adapter layers, while the frozen, quantized model provides the context.

The complete training script, train.py, is shown below.
import os
#import mlflow
import argparse
import sys
import logging

import datasets
from datasets import load_dataset
from peft import LoraConfig
import torch
import transformers
from trl import SFTTrainer
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          TrainingArguments, BitsAndBytesConfig)

logger = logging.getLogger(__name__)


###################
# Hyper-parameters
###################
training_config = {
    "bf16": True,
    "do_eval": False,
    "learning_rate": 5.0e-06,
    "log_level": "info",
    "logging_steps": 20,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 4,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs":{"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.2,
    }

peft_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",
    "modules_to_save": None,
}
train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)

###############
# Setup logging
###############
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = train_conf.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()

# Log on each process a small summary
logger.warning(
    f"Process rank: {train_conf.local_rank}, device: {train_conf.device}, "
    + f"n_gpu: {train_conf.n_gpu}, distributed training: {bool(train_conf.local_rank != -1)}, "
    + f"16-bit training: {train_conf.fp16}"
)
logger.info(f"Training/evaluation parameters {train_conf}")
logger.info(f"PEFT parameters {peft_conf}")

################
# Model Loading
################
checkpoint_path = "microsoft/Phi-3-mini-4k-instruct"
# checkpoint_path = "microsoft/Phi-3-mini-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    # load the model with flash-attention support
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map=None
)
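# Note: as written, the base model is loaded in bf16. For a 4-bit QLoRA-style load,
# a quantization config could additionally be passed in (illustrative values):
# model_kwargs["quantization_config"] = BitsAndBytesConfig(
#     load_in_4bit=True, bnb_4bit_quant_type="nf4",
#     bnb_4bit_compute_dtype=torch.bfloat16)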
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
tokenizer.model_max_length = 2048
# use unk rather than eos token to prevent endless generation
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'

##################
# Data Processing
##################
def apply_chat_template(
    example,
    tokenizer,
):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False)
    return example



def main(args):
    train_dataset = load_dataset('json', data_files=args.train_file, split='train')
    test_dataset = load_dataset('json', data_files=args.eval_file, split='train')
    column_names = list(train_dataset.features)

    processed_train_dataset = train_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to train_sft",
    )

    processed_test_dataset = test_dataset.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer},
        num_proc=10,
        remove_columns=column_names,
        desc="Applying chat template to test_sft",
    )

    ###########
    # Training
    ###########
    trainer = SFTTrainer(
        model=model,
        args=train_conf,
        peft_config=peft_conf,
        train_dataset=processed_train_dataset,
        eval_dataset=processed_test_dataset,
        max_seq_length=2048,
        dataset_text_field="text",
        tokenizer=tokenizer,
        packing=True
    )
    train_result = trainer.train()
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()


    #############
    # Evaluation
    #############
    tokenizer.padding_side = 'left'
    metrics = trainer.evaluate()
    metrics["eval_samples"] = len(processed_test_dataset)
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)


    # ############
    # # Save model
    # ############
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, "model.pt"))

def parse_args():
    # setup argparse
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--train-file", type=str, help="Input data for training")
    parser.add_argument("--eval-file", type=str, help="Input data for eval")
    parser.add_argument("--model-dir", type=str, default="./", 
    help="output directory for model")
    parser.add_argument("--epochs", default=10, type=int, help="number of epochs")
    parser.add_argument(
        "--batch-size",
        default=16,
        type=int,
        help="mini batch size for each gpu/process",
    )
    parser.add_argument("--learning-rate", default=0.001, type=float, 
    help="learning rate")
    parser.add_argument("--momentum", default=0.9, type=float, 
    help="momentum")
    parser.add_argument(
        "--print-freq",
        default=200,
        type=int,
        help="frequency of printing training statistics",
    )

    # parse args
    args = parser.parse_args()

    # return args
    return args


# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()
    # call main function
    main(args)

Create a compute target for training.

from azure.ai.ml.entities import AmlCompute
# If you have a specific compute size to work with change it here. 
# By default we use the 1 x A100 compute from the above list

compute_cluster_size = "Standard_NC24ads_A100_v4"  # 1 x A100 (80GB)
# If you already have a gpu cluster, mention it here.
# Otherwise a new one with the name below will be created.
compute_cluster = "gpu-a100"
try:
    compute = workspace_ml_client.compute.get(compute_cluster)
    print("The compute cluster already exists! Reusing it for the current run")
except Exception as ex:
    print(
        f"Looks like the compute cluster doesn't exist. Creating a new one "
        f"with compute size {compute_cluster_size}!"
    )
    try:
        print("Attempt #1 - Trying to create a dedicated compute")
        compute = AmlCompute(
            name=compute_cluster,
            size=compute_cluster_size,
            tier="Dedicated",
            # For multi node training set this to an integer value more than 1
            max_instances=1,
        )
        workspace_ml_client.compute.begin_create_or_update(compute).wait()
    except Exception as e:
        print("Error")

Here are some useful tips:

  • You do not need a very high LoRA rank (for example, r=256); in practice a rank of 8 or 16 is a sufficient baseline.
  • If the training dataset is small, it is usually best to set lora_alpha equal to the rank; for small datasets, training with alpha at 2x or 4x the rank is often unstable.
  • Use a smaller learning rate with LoRA. Values such as 1e-3 or 2e-4 are not recommended; try starting from 8e-5 or 5e-5.
  • Rather than simply increasing the batch size, check whether you actually have enough GPU memory; with long contexts such as 8K you can easily hit OOM (out of memory). Gradient checkpointing and gradient accumulation give you the effect of a larger batch size, as in the sketch after this list.
  • If you are sensitive to batch size and memory, avoid Adam, including low-precision Adam: it needs extra GPU memory for its momentum estimates. SGD converges more slowly but does not use that extra memory.
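
As a rough illustration of these tips, a memory-conscious configuration might look like the following; the values are illustrative starting points rather than prescriptions.

from peft import LoraConfig
from transformers import TrainingArguments

# Modest rank, with lora_alpha equal to the rank (suits smaller datasets)
peft_conf = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05,
                       task_type="CAUSAL_LM", target_modules="all-linear")

train_conf = TrainingArguments(
    output_dir="./checkpoint_dir",
    learning_rate=5e-5,               # conservative LoRA learning rate
    per_device_train_batch_size=1,    # keep the per-step memory footprint small
    gradient_accumulation_steps=16,   # effective batch size of 16 without extra memory
    gradient_checkpointing=True,      # trade compute for activation memory
    optim="sgd",                      # avoids Adam's extra optimizer state
    bf16=True,
)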

Now run the training script above on the AML compute we just created.

from azure.ai.ml import command
from azure.ai.ml import Input
from azure.ai.ml.entities import ResourceConfiguration

job = command(
    inputs=dict(
        train_file=Input(
            type="uri_file",
            path="data/train.jsonl",
        ),
        eval_file=Input(
            type="uri_file",
            path="data/eval.jsonl",
        ),        
        epoch=1,
        batchsize=64,
        lr = 0.01,
        momentum = 0.9,
        prtfreq = 200,
        output = "./outputs"
    ),
    code="./src",  # local path where the code is stored
    compute = 'gpu-a100',
    command="""accelerate launch train.py --train-file ${{inputs.train_file}} 
    --eval-file ${{inputs.eval_file}} --epochs ${{inputs.epoch}} 
    --batch-size ${{inputs.batchsize}} --learning-rate ${{inputs.lr}} 
    --momentum ${{inputs.momentum}} --print-freq ${{inputs.prtfreq}} 
    --model-dir ${{inputs.output}}""",
    environment="azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/52",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 1,
    },
)
returned_job  = workspace_ml_client.jobs.create_or_update(job)
workspace_ml_client.jobs.stream(returned_job.name)

Check the outputs of the job.

# check if the `trained_model` output is available
job_name = returned_job.name
print("pipeline job outputs: ", workspace_ml_client.jobs.get(job_name).outputs)

Wrapping up

Once the model is fine-tuned, register it from the job in the workspace so that an endpoint can be created.

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

run_model = Model(
    path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/outputs/mlflow_model_folder",
    name="phi-3-finetuned",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
model = workspace_ml_client.models.create_or_update(run_model)

Next, create an endpoint.

from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    IdentityConfiguration,
    ManagedIdentityConfiguration,
)
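
# NOTE: endpoint_name and uai_id (the resource ID of a user-assigned managed identity,
# if you use one) are assumed to be defined earlier in the notebook.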

# Check if the endpoint already exists in the workspace
try:
    endpoint = workspace_ml_client.online_endpoints.get(endpoint_name)
    print("---Endpoint already exists---")
except Exception:
    # Create an online endpoint if it doesn't exist

    # Define the endpoint
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"Test endpoint for {model.name}",
        identity=IdentityConfiguration(
            type="user_assigned",
            user_assigned_identities=[ManagedIdentityConfiguration(resource_id=uai_id)],
        )
        if uai_id != ""
        else None,
    )

# Trigger the endpoint creation
try:
    workspace_ml_client.begin_create_or_update(endpoint).wait()
    print("\n---Endpoint created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Endpoint creation failed. Detailed Response:\n{err}"
    ) from err

Once the endpoint is created, we can move on to creating a deployment.

# Initialize deployment parameters

deployment_name = "phi3-deploy"
sku_name = "Standard_NCs_v3"

REQUEST_TIMEOUT_MS = 90000

deployment_env_vars = {
    "SUBSCRIPTION_ID": subscription_id,
    "RESOURCE_GROUP_NAME": resource_group,
    "UAI_CLIENT_ID": uai_client_id,
}

For inference, a different base image is used.

from azure.ai.ml.entities import Model, Environment
env = Environment(
    image='mcr.microsoft.com/azureml/curated/foundation-model-inference:latest',
    inference_config={
        "liveness_route": {"port": 5001, "path": "/"},
        "readiness_route": {"port": 5001, "path": "/"},
        "scoring_route": {"port": 5001, "path": "/score"},
    },
)

Now deploy the model.

from azure.ai.ml.entities import (
    OnlineRequestSettings,
    CodeConfiguration,
    ManagedOnlineDeployment,
    ProbeSettings,
    Environment
)

deployment = ManagedOnlineDeployment(
    name=deployment_name,
    endpoint_name=endpoint_name,
    model=model.id,
    instance_type=sku_name,
    instance_count=1,
    #code_configuration=code_configuration,
    environment = env,
    environment_variables=deployment_env_vars,
    request_settings=OnlineRequestSettings(request_timeout_ms=REQUEST_TIMEOUT_MS),
    liveness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
    readiness_probe=ProbeSettings(
        failure_threshold=30,
        success_threshold=1,
        period=100,
        initial_delay=500,
    ),
)

# Trigger the deployment creation
try:
    workspace_ml_client.begin_create_or_update(deployment).wait()
    print("\n---Deployment created successfully---\n")
except Exception as err:
    raise RuntimeError(
        f"Deployment creation failed. Detailed Response:\n{err}"
    ) from err
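
After the deployment succeeds, you can sanity-check the endpoint by invoking it. Below is a minimal sketch, assuming a local request file named sample_score.json (a hypothetical file containing the payload expected by the inference container).

# Invoke the endpoint with a sample request (sample_score.json is a hypothetical local file)
response = workspace_ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    deployment_name=deployment_name,
    request_file="./sample_score.json",
)
print(response)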

If you want to delete the endpoint, use the following code.

workspace_ml_client.online_deployments.begin_delete(name=deployment_name,
                                                    endpoint_name=endpoint_name).wait()
workspace_ml_client.online_endpoints.begin_delete(name=endpoint_name)

You can work through the code snippets above as they are, but for a more complete, polished version, see the azure-llm-fine-tuning repository on GitHub.

References