Face Keypoint Recognition on the BL808

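Overview

I. Project Introduction

For this event I chose Project 3: face keypoint recognition.

Goal: recognize face keypoints and apply simple effects, while getting familiar with operating the camera, the screen, and models on the BL808.

Requirements:

Display the camera image on the screen
Use the 68-point face keypoint detection model provided by Sipeed to detect the 68 keypoints and mark them
Based on the 68 keypoint positions, try some simple effects such as overlaying glasses, overlaying a mask, or face swapping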

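1. The model is MobileNetV2, chosen because it is relatively lightweight and easy to use

2. The model is built with TensorFlow and regresses the 68 face keypoints

3. The TFLite quantization tool quantizes the model to an 8-bit uint8 model, which Bouffalo Lab's official tool then converts into an NPU model for the BL808

4. Write the recognition logic and test the model's accuracy in the callback function

5. In the callback function, draw the keypoints on the image and overlay a glasses-style patch on the eye region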

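II. Hardware Overview

The board uses Bouffalo Lab's BL808 chip, which has a heterogeneous three-core design plus an NPU for neural-network computation.

The D0 core serves as the main core, with a maximum clock frequency of 480 MHz
The module has 64 MB of built-in DRAM
The chip has a dedicated NPU to accelerate neural-network computation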

Wireless support

Wi-Fi 802.11 b/g/n
Bluetooth 5.x (both BLE and BT modes)
Bluetooth/Wi-Fi coexistence

Codec support

MJPEG and H.264
1920x1080@30fps + 640x480@30fps

On-board features

A 1.69-inch 240x280 capacitive touchscreen
An OV2640 2-megapixel camera
An analog microphone, an LED, and a TF card slot
An on-board USB-to-UART debugger

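III. Design Approach

1. First, find a suitable dataset on Kaggle with 68 annotated face keypoints, then resize, crop, and normalize it and load it through a dataset class with a uniform encoding

Original dataset encoding

Dataset processing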

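import math
import os

import numpy as np
import pandas as pd
import tensorflow as tf
from PIL import Image


class Resize(object):
    # Resize the input image to the given size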

    def __init__(self, output_size):
        assert isinstance(output_size, (int, tuple))
        self.output_size = output_size

    def __call__(self, data):
        image = data[0]    # image
        key_pts = data[1]  # keypoint labels, flattened as [x0, y0, x1, y1, ...]
        image_copy = np.copy(image)
        key_pts_copy = np.copy(key_pts)
        h, w = image_copy.shape[:2]
        if isinstance(self.output_size, int):
            if h > w:
                new_h, new_w = self.output_size * h / w, self.output_size
            else:
                new_h, new_w = self.output_size, self.output_size * w / h
        else:
            new_h, new_w = self.output_size

        new_h, new_w = int(new_h), int(new_w)

        img = tf.image.resize(image_copy, (new_h, new_w))

        # scale the pts, too
        key_pts_copy[::2] = key_pts_copy[::2] * new_w / w
        key_pts_copy[1::2] = key_pts_copy[1::2] * new_h / h

        return np.array(img), np.array(key_pts_copy)


class RandomCrop(object):
    # Crop the input image at a random position

    def __init__(self, output_size):
        assert isinstance(output_size, (int, tuple))
        if isinstance(output_size, int):
            self.output_size = (output_size, output_size)
        else:
            assert len(output_size) == 2
            self.output_size = output_size

    def __call__(self, data):
        image = data[0]
        key_pts = data[1]

        image_copy = np.copy(image)
        key_pts_copy = np.copy(key_pts)

        h, w = image_copy.shape[:2]
        new_h, new_w = self.output_size

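        # np.random.randint excludes the upper bound, so add 1 to allow the
        # last valid offset (and to avoid an error when h == new_h or w == new_w)
        top = np.random.randint(0, h - new_h + 1)
        left = np.random.randint(0, w - new_w + 1)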

        image_copy = image_copy[top: top + new_h,
                                left: left + new_w]

        key_pts_copy[::2] = key_pts_copy[::2] - left
        key_pts_copy[1::2] = key_pts_copy[1::2] - top

        return np.array(image_copy), np.array(key_pts_copy)

class Normalize(object):
    def __init__(self,scale) -> None:
        self.scale = scale

    def __call__(self, data):
        image = data[0]   # image
        key_pts = data[1]  # keypoint labels
        
        image_copy = np.copy(image)
        key_pts_copy = np.copy(key_pts)
        
        key_pts_copy = key_pts_copy / self.scale
        return np.array(image_copy,dtype=np.uint8),np.array(key_pts_copy)

class FaceKeyPointsDatasets(tf.keras.utils.Sequence):
    def __init__(self, batch_size, mode='train', grayscale=True):
        self.mode = mode
        if self.mode == "train":
            self.csv_file = "data/training_frames_keypoints.csv"
            self.data_path = "data/training"
        elif self.mode == 'test':
            self.csv_file = "data/test_frames_keypoints.csv"
            self.data_path = "data/test"
        self.df = pd.read_csv(self.csv_file, encoding='utf-8')
        self.grayscale = grayscale
        self.batch_size = batch_size
        self.resize = Resize(256)
        self.crop = RandomCrop(240)
        self.gray = Normalize(scale=100)

    def __getitem__(self, index):
        image_list = []
        kpt_list = []
        for i in range(index*self.batch_size, (index+1)*self.batch_size):
            i = i % len(self.df)
            image_name = os.path.join(self.data_path, self.df.iloc[i, 0])
            # image = mpimg.imread(image_name)[:,:,0:3]
            image = Image.open(image_name).convert("RGB")
            image = np.array(image)
            kpt = self.df.iloc[i, 1:].values
            kpt = kpt.astype('float').reshape(-1)
            image, kpt = self.resize([image, kpt])
            image, kpt = self.crop([image, kpt])
            if self.grayscale:
                image, kpt = self.gray([image, kpt])
            image = np.array(image,dtype=np.uint8)
            kpt = np.array(kpt)
            image_list.append(image)
            kpt_list.append(kpt)
        return np.array(image_list), np.array(kpt_list)

    def __len__(self):
        return math.ceil(len(self.df) / float(self.batch_size))

After processing, the keypoint coordinates are divided by 100 and every image is 240x240
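To make the coordinate bookkeeping concrete, the sketch below follows a single keypoint through the same three steps (resize, crop, divide by 100) in plain Python. The image size, crop offset, and keypoint are made-up illustration values, and the helper names are hypothetical; only the arithmetic mirrors the Resize, RandomCrop, and Normalize classes.

```python
# Illustration only: the image size, crop offset, and keypoint are made-up values.

def resize_point(x, y, w, h, short_side=256):
    """Scale a point as Resize(256) scales the image (shorter side -> short_side)."""
    if h > w:
        new_h, new_w = int(short_side * h / w), short_side
    else:
        new_h, new_w = short_side, int(short_side * w / h)
    return x * new_w / w, y * new_h / h


def crop_point(x, y, left, top):
    """Shift a point by the crop offset, as RandomCrop does."""
    return x - left, y - top


def normalize_point(x, y, scale=100):
    """Divide coordinates by the label scale, as Normalize(scale=100) does."""
    return x / scale, y / scale


# A keypoint at (300, 150) in a 512x384 (width x height) image
x, y = resize_point(300, 150, w=512, h=384)   # the image becomes 341x256
x, y = crop_point(x, y, left=50, top=10)      # one possible random offset
x, y = normalize_point(x, y)
print(x, y)

# At inference time the model output is multiplied by 100 to get pixels back
pixel_x, pixel_y = x * 100, y * 100
```

Because the labels end up in the range 0 to 2.4 for a 240x240 crop, the regression targets stay small, and the deployed code has to multiply by the same factor of 100 to return to pixel coordinates.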


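2. Because the hardware resources are limited, the model has to be made as small as possible through both model design and model quantization

Model design:

This project uses MobileNetV2. Its depthwise separable convolutions greatly reduce the number of model parameters.

Keypoint prediction is modeled as a simple regression problem, trained with a Smooth L1 (Huber) loss and the Adam optimizer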

Backbone = tf.keras.applications.mobilenet_v2.MobileNetV2(input_shape=(240,240,3),
                                                        alpha=1.0,
                                                        include_top=False,
                                                        weights=None)
inputs = tf.keras.layers.Input(shape=(240,240,3))
x = Backbone(inputs)
x = tf.keras.layers.GlobalAvgPool2D()(x)
x = tf.keras.layers.Dense(1000,activation='relu')(x)
outputs = tf.keras.layers.Dense(68*2)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
# =====================================================================
def SmoothL1Loss(delta=1.0):
    def _SmoothL1Loss(y_true, y_pred):
        loss = tf.keras.losses.huber(y_true, y_pred, delta=delta)
        loss = tf.reduce_mean(loss)
        return loss
    return _SmoothL1Loss
# =====================================================================
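# batch_size, lr, epochs, LossHistory, and adjust_lr are defined elsewhere in the training script (not shown here)
train_datasets = FaceKeyPointsDatasets(batch_size=batch_size,mode='train',grayscale=True)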
net = model
callback = [
        tf.keras.callbacks.EarlyStopping(monitor='loss', patience=15, verbose=1),
        tf.keras.callbacks.ModelCheckpoint('weights/ep{epoch:03d}-loss{loss:.4f}.h5',monitor='loss',
                        save_weights_only=True, save_best_only=False, save_freq='epoch'),
        tf.keras.callbacks.TensorBoard(log_dir='./logs'),
        LossHistory('./logs'),
        tf.keras.callbacks.LearningRateScheduler(adjust_lr)
    ]
net.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                loss=SmoothL1Loss())
history = model.fit(
        x                      = train_datasets,
        workers                = 4,
        epochs                 = epochs,
        callbacks              = callback,
        steps_per_epoch        = len(train_datasets),
        verbose=1
    )
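For readers unfamiliar with the Huber (Smooth L1) loss used in the training code, here is a small pure-Python sketch of the same formula. It is only meant to show the arithmetic, not to replace tf.keras.losses.huber; the sample values are made up.

```python
def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err * err                    # quadratic region
        else:
            total += delta * err - 0.5 * delta * delta  # linear region
    return total / len(y_true)


# One small error (0.2) and one large error (2.0)
loss = huber([1.0, 2.0], [1.2, 4.0])
print(loss)
```

Since the labels were divided by 100, an error of 1.0 in these units corresponds to 100 pixels, so with delta=1.0 most keypoint errors on a 240x240 image fall in the quadratic region.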

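Model quantization:

After training produces the Keras h5 model, the TFLite quantization tool compresses the model parameters to 8 bits, and Bouffalo Lab's official NPU tool then parses the model and regenerates operators that fit the on-chip NPU. The model can then be flashed by drag and drop using the USB-drive flashing firmware on the E902 core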

converter = tf.lite.TFLiteConverter.from_keras_model(model=net)
converter._experimental_disable_per_channel = True
converter.optimizations = [tf.lite.Optimize.DEFAULT]
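# representative_data_gen is a generator of sample inputs used for calibration (defined elsewhere, not shown)
converter.representative_dataset = representative_data_gen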
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_quantized_model = converter.convert()
# Check the model size
quantized_model_size = len(tflite_quantized_model) / (1024*1024)
print(f'Quantized model size = {quantized_model_size}MBs')
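The 8-bit quantization above maps each real value to an integer through a scale and a zero point. The following pure-Python sketch shows that affine mapping and its inverse. The scale of 0.01 and zero point of 0 are made-up values for illustration; the real ones come from the converted model's tensor details.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Map a real value to uint8: q = round(x / scale) + zero_point, clamped."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a uint8 value back to a real value: x = (q - zero_point) * scale."""
    return (q - zero_point) * scale


scale, zero_point = 0.01, 0          # hypothetical output quantization parameters
q = quantize(1.27, scale, zero_point)
x = dequantize(q, scale, zero_point)
print(q, x)

# Values outside the representable range saturate at the uint8 limits
q_sat = quantize(3.0, scale, zero_point)
print(q_sat)
```

With labels divided by 100, an output scale of 0.01 would correspond to a resolution of about one pixel per quantization step, which is one way to reason about how much precision the uint8 output can carry.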

3. Validate the model on the test set

interpreter = tf.lite.Interpreter(model_path='face_240_240_uint8.tflite')
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

print("input:",input_details["quantization"])
print("output:",output_details["quantization"])

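# `image` is a 240x240x3 test image prepared beforehand (not shown here)
test_image = image
if input_details['dtype'] == np.uint8:
    # Quantize the input with the model's input scale and zero point
    input_scale,input_zero_point = input_details["quantization"]
    test_image = image / input_scale + input_zero_point
    print("input")
test_image = np.expand_dims(test_image,axis=0).astype(input_details["dtype"])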
interpreter.set_tensor(input_details["index"],test_image)
interpreter.invoke()
output = interpreter.get_tensor(output_details["index"])[0]
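# Cast to float first so that subtracting the zero point cannot wrap around in uint8
out = output.astype(np.float32)
if output_details['dtype'] == np.uint8:
    output_scale,output_zero_point = output_details["quantization"]
    out = (out - output_zero_point)*output_scale

# Undo the division by 100 applied to the labels during training
out = out * 100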

The test output looks good
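The write-up judges the result visually. One simple way to put a number on it, offered here as a suggestion rather than as the metric the author used, is the mean Euclidean distance between predicted and ground-truth keypoints in pixels. The function name and sample values are hypothetical.

```python
import math


def mean_keypoint_error(pred, truth):
    """Mean Euclidean distance between two flat [x0, y0, x1, y1, ...] lists."""
    assert len(pred) == len(truth) and len(pred) % 2 == 0
    n_points = len(pred) // 2
    total = 0.0
    for i in range(n_points):
        dx = pred[2 * i] - truth[2 * i]
        dy = pred[2 * i + 1] - truth[2 * i + 1]
        total += math.hypot(dx, dy)
    return total / n_points


# Two keypoints: the first is off by (3, 4) pixels, the second is exact
err = mean_keypoint_error([3.0, 4.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0])
print(err)
```

Computing this once for the float model and once for the quantized model on the same test images would show how much accuracy the quantization step costs.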

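4. Convert the model to the format required by the NPU

The NPU conversion requires an int8 TFLite model. During inference the NPU automatically offsets the data to uint8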
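The int8-to-uint8 offset can be illustrated with a short sketch. It assumes the usual convention that shifting both the quantized value and the zero point by 128 leaves the represented real value unchanged. How the NPU toolchain does this internally is not documented in this write-up, so treat this as an illustration of the idea only, with made-up numbers.

```python
def int8_to_uint8(q_int8, zero_point_int8):
    """Shift an int8 value and its zero point into the uint8 range."""
    return q_int8 + 128, zero_point_int8 + 128


def dequantize(q, scale, zero_point):
    """Real value represented by a quantized integer."""
    return (q - zero_point) * scale


scale = 0.01                      # hypothetical scale
q_i8, zp_i8 = -28, -128           # hypothetical int8 value and zero point
q_u8, zp_u8 = int8_to_uint8(q_i8, zp_i8)

# Both representations decode to the same real value
real_from_int8 = dequantize(q_i8, scale, zp_i8)
real_from_uint8 = dequantize(q_u8, scale, zp_u8)
print(q_u8, zp_u8, real_from_int8, real_from_uint8)
```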

Write a configuration file

username = face_240_240_int8
model = models/tflite/face_240_240_int8.tflite
data_type = int8

# version = 5
#[app_type] CLASSIFICATION, YOLO, SSD, KWS, CUSTOM, RETINA_FACE, RETINA_PERSON, FR_REGISTRATION, FR_EVALUATION, NONE
app_type = NONE

#[input data_type]: image, file
input_type = image
input_data = image/image.png

#[input format]: NHWC, NCHW
input_format = NHWC

# names = NONE

#[memory allocation]
patch_size = 65535
patch_num = 50

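Model conversion

Run the conversion script to obtain the model as a header file. The header file can then be converted with the script provided by Sipeed into a blai file that can be flashed by drag and drop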


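5. Deployment and verification

The C906 core captures and processes the data and feeds it into the NPU for computation. After each inference, the output has to be post-processed in the callback function. The main logic is to dequantize the output to recover the 68 coordinate pairs, mark those 68 points on the image, and show the result on the LCD.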

static void model_result_cb(model_result_t *result, void *arg)
{
    uint8_t *output = result->output;
    float output_scale = result->scale;
    int output_zero_point = result->zero_point;
    uint32_t *output_shape = result->output_shape;
    
    uint16_t out[136] = {0};
    float temp = 0;
    rgb565_frame_t *f = arg;

    printf("output shape %ux%ux%u\r\n", output_shape[0], output_shape[1], output_shape[2]);
    uint32_t output_size = output_shape[0] * output_shape[1] * output_shape[2];
    printf("[%u]output %d, %f:\r\n",output_size, output_zero_point, output_scale);
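    for (uint32_t i = 0; i < 136; i++) {
        /* Dequantize, then undo the division by 100 used for the training labels */
        temp = ((float)output[i] - output_zero_point)*output_scale;
        temp = temp*100;
        /* Converting a negative float to uint16_t is undefined, so clamp at 0 */
        if (temp < 0) temp = 0;
        out[i] = (uint16_t)temp;
    }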
    draw_rect_filled(f,240-out[26]+1,240-out[17]+1,120,50,0xffff);
    for (uint8_t j=0;j<68;j++)
    {
        draw_rect_filled(f,240-out[2*j]+1,240-out[2*j+1]+1,3,3,0x0000);
    }
}
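The callback above places a fixed 120x50 rectangle. As a possible refinement, the sketch below derives the overlay box from the eye landmarks instead, so that it follows the size of the face. It assumes the dataset uses the common iBUG 68-point ordering, in which indices 36 to 47 are the two eyes. That ordering, the helper names, and the test values are assumptions, and the mirroring done in the C code (240 - v + 1) is left out for clarity.

```python
def decode_keypoints(raw, scale, zero_point, label_scale=100):
    """Dequantize a flat uint8 output into a list of (x, y) pixel pairs."""
    vals = [(q - zero_point) * scale * label_scale for q in raw]
    return [(vals[2 * i], vals[2 * i + 1]) for i in range(len(vals) // 2)]


def eye_box(points, first=36, last=47, margin=5):
    """Bounding box (x, y, w, h) around the eye landmarks, padded by margin.

    The index range 36..47 assumes the iBUG 68-point convention.
    """
    eyes = points[first:last + 1]
    xs = [p[0] for p in eyes]
    ys = [p[1] for p in eyes]
    x0, y0 = min(xs) - margin, min(ys) - margin
    width = max(xs) - min(xs) + 2 * margin
    height = max(ys) - min(ys) + 2 * margin
    return x0, y0, width, height


# Fake model output: every point at (100, 100) except two eye corners
raw = [100] * 136
raw[2 * 36], raw[2 * 36 + 1] = 80, 90     # outer corner of one eye
raw[2 * 45], raw[2 * 45 + 1] = 160, 110   # outer corner of the other eye

points = decode_keypoints(raw, scale=0.01, zero_point=0)
box = eye_box(points)
print(box)
```

The resulting (x, y, w, h) tuple has the same shape as the arguments passed to draw_rect_filled in the callback, so the same idea could be ported to C once the landmark ordering of the trained model has been confirmed.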

IV. Results


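The quantized model performs reasonably well on the PC, but there is still a large gap once it is actually deployed. I think the cause may be a large mismatch between the dataset and real conditions, or overfitting during training. Hyperparameter tuning is still something of a black art to me, and at my current level I am not yet able to optimize the model further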

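V. Main Difficulties and Solutions

1. Collecting a large amount of face data and annotating 68 keypoints is hard for one person, so I trained on a ready-made face dataset from Kaggle, and as a result the model does not perform well in practice. My main remedies were to augment the images when generating the dataset, add more preprocessing, and train for more epochs

2. The chip is fairly new, so there is little documentation. I had to rely on a few READMEs on GitHub and on asking experienced members of the group chat to learn how to deploy the model to the NPU. The only approach was to keep asking and trying, and in the end the model was deployed

3. This was my first time training a model for a microcontroller. At first I used larger models such as ResNet50, but the USB-drive flashing method only allows about 7 MB at most. Only after learning more about this did I switch to MobileNetV2, a model optimized for this kind of use.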
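The size constraint in point 3 can be checked with a rough back-of-the-envelope estimate: at 8 bits per weight, the file size in bytes is roughly the parameter count. The sketch below does that arithmetic. The backbone and ResNet50 parameter counts are approximate published figures used here as assumptions, and the estimate ignores quantization metadata and file-format overhead.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def int8_size_mb(n_params):
    """Approximate size in MB at one byte per parameter."""
    return n_params / (1024 * 1024)


# Assumptions: approximate parameter counts, not measured from this project
MOBILENETV2_BACKBONE = 2_260_000   # alpha=1.0, include_top=False
RESNET50 = 25_600_000

# Head used in this project: 1280 pooled features -> Dense(1000) -> Dense(136)
head = dense_params(1280, 1000) + dense_params(1000, 136)
mobilenet_total = MOBILENETV2_BACKBONE + head

print(f"head parameters: {head}")
print(f"MobileNetV2 + head: about {int8_size_mb(mobilenet_total):.1f} MB")
print(f"ResNet50: about {int8_size_mb(RESNET50):.1f} MB")
```

Under these assumptions the MobileNetV2-based model fits within the roughly 7 MB limit while ResNet50 does not. The Dense(1000) layer accounts for a noticeable share of the total, so shrinking it would be one way to reduce the size further.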

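VI. Future Plans

1. I plan to keep studying TinyML to broaden what I know.

2. This was my first contact with a chip built on T-Head's RISC-V cores. I think RISC-V knowledge will be an essential skill in the future, so I intend to learn more about T-Head's RISC-V chips

3. I hope to take part in another project from this program when I get the chance. Every project I complete is a big step forward for me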