YOLT algorithm (YOLO algorithm steps): A Detailed Walkthrough of the YOLO Algorithm - 卡咪卡咪哈 (blog)

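Object Detection and Image Classification

Image recognition algorithms such as VGG, GoogLeNet, and ResNet are foundational in computer vision. Their main job is to determine which category the object in an image belongs to.

Object detection is similar, but a detector must not only recognize the objects in an image. It must also report each object's size and position, expressed as coordinates, as in the figure below:

[Figure 1: Image recognition vs. object detection]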

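Take three kinds of objects as an example: person, car, and motorcycle. An image recognition model outputs a list of three numbers, the probabilities that the object in the image is a person, a car, or a motorcycle. For the image above the output might be [0.001, 0.998, 0.001], so "car" is the class with the highest predicted probability.

For an object detection algorithm, the output looks more like this:

[Figure 2: Output of an object detection algorithm]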

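Where:

pc: 1 when an object is detected and 0 otherwise; when pc is 0 the remaining outputs can be ignored
bx: x coordinate of the object's center, given as a relative value measured from the top-left corner
by: y coordinate of the object's center, measured the same way
bh: height of the bounding box
bw: width of the bounding box

c1, c2, c3 are the probabilities that the object is a person, a car, or a motorcycle. With this output format we could locate objects by running a sliding window over the whole image, like moving a magnifying glass over every region to check whether it contains an object. The approach is simple and it works, but it is far too slow because the classifier has to run once for every window position, so it is not used in practice.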
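
To get a feel for why the sliding-window approach is so expensive, the sketch below simply enumerates window positions and counts them. The image size, window size, and stride are hypothetical values chosen for illustration, and the function name is not from the original post.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Yield (top, left, bottom, right) for every window position."""
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield (top, left, top + window, left + window)


# Hypothetical setup: a 608*608 image, one 64*64 window size, stride 8.
windows = list(sliding_windows(608, 608, window=64, stride=8))

# Each window would need its own classifier run.
print(len(windows))  # 4761
```

That count is for a single window size only. Covering objects of several sizes multiplies it again.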

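However, many of the computations in a sliding-window pass are repeated, and the convolutional structure of the network lets us share them.

[Figure 3: Replacing the sliding window with convolution]

As the figure shows, sliding a 14*14 window over a 16*16 image means running the network four times, each run consisting of a 5*5 convolution, a 2*2 max pool, another 5*5 convolution, and a 1*1 convolution. Feeding the whole 16*16 image through the same layers once produces all four results in a single pass, which is much more efficient. This is covered in detail in Andrew Ng's deeplearning.ai course. I will not work out exactly how much computation is saved here; the short version is that the saving is very large. Note that the window size here is fixed. Detecting objects of different sizes would need many windows of different sizes, and this is one of the key problems YOLO has to solve.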
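
The claim that one pass over the 16*16 image replaces four runs on 14*14 windows can be checked with simple shape arithmetic. The sketch below assumes no padding, stride-1 convolutions, and a stride-2 max pool, which matches the usual presentation of this example but is an assumption here; the helper names are illustrative.

```python
def conv_out(n, kernel, stride=1):
    """Output size along one axis for an unpadded convolution or pooling layer."""
    return (n - kernel) // stride + 1


def network_output_size(n):
    n = conv_out(n, 5)            # 5*5 convolution
    n = conv_out(n, 2, stride=2)  # 2*2 max pool
    n = conv_out(n, 5)            # 5*5 convolution
    n = conv_out(n, 1)            # 1*1 convolution
    return n


print(network_output_size(14))  # 1 -> a single prediction for one window
print(network_output_size(16))  # 2 -> a 2*2 map, i.e. four window positions at once
```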

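As for where object detection is used, the biggest application today is autonomous driving. A self-driving system has to detect pedestrians, vehicles, obstacles, traffic lights, lane markings and so on in real time, then fuse the data from its various sensors and hand it to the control center for decision making. Object detection serves as the eyes of the system.

Object detection frameworks fall into two groups. Two-stage frameworks such as R-CNN, Fast R-CNN, and R-FCN include a region-proposal extraction stage. Single-stage frameworks such as the YOLO family and SSD are end-to-end.

The rest of this post walks through the ideas behind YOLO.

YOLO stands for You Only Look Once, a name chosen to highlight how it differs from two-stage algorithms. The name suggests that YOLO is fast, and it is: as the comparison below shows, on the same hardware YOLO reaches 45 frames per second.

[Figure 4]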

Grid Cell

In YOLO the input image is divided into grid cells. Real applications use a 19*19 grid; to keep things easy to follow, a 3*3 grid is used here for now.

[Figure 5: The image divided into 3*3 grid cells]

The cars in the image are marked with red boxes.

Next, every grid cell in the image is described by the label below (its fields were explained earlier).

[Figure 6]

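How, then, is a grid cell associated with an object?

The car on the right is the easy case: it lies entirely inside the right-hand grid cell, so it belongs to that cell.

The truck in the middle has a bounding box that spans several grid cells. YOLO's rule is that an object belongs to whichever grid cell contains its center, so the truck belongs to the center cell.

[Figure 7]

The label of each grid cell is shown in the figure above. When there is no object, pc is 0 and the other values can be ignored. When pc is 1, bx and by give the position of the object's center relative to the grid cell. The cell's height and width are taken as 1, so bx and by are less than 1. bh and bw are the object's height and width relative to the grid cell and can be greater than 1. The trailing c1/c2/c3 are the predicted class probabilities.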
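
The center-assignment rule and the cell-relative encoding can be sketched as follows. This is a minimal illustration, assuming the box is given as a center and size in image-relative units (0 to 1); the function name and the sample numbers are made up for the example.

```python
def encode_box(cx, cy, w, h, grid_size):
    """Map an image-relative box to its grid cell and cell-relative label values.

    Returns (row, col, bx, by, bh, bw), with the cell side taken as 1.
    """
    # The object belongs to the cell that contains its center.
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    # Center position inside that cell, always in [0, 1).
    bx = cx * grid_size - col
    by = cy * grid_size - row
    # Size in units of one cell; this can exceed 1.
    bh = h * grid_size
    bw = w * grid_size
    return row, col, bx, by, bh, bw


# A wide object centered slightly below the middle of a 3*3 grid.
row, col, bx, by, bh, bw = encode_box(cx=0.5, cy=0.6, w=0.5, h=0.3, grid_size=3)
print(row, col)  # 1 1 -> the center cell
print(bw > 1)    # True: the box is wider than one cell
```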

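Only three classes are used here for convenience. Real applications use 80 classes, so the label contains c1/c2/c3…/c80.

One special case remains: two objects whose centers fall in the same grid cell. Handling it requires anchor boxes.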

Anchor Box

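Anchor boxes let YOLO detect several objects whose centers fall in the same grid cell.

[Figure 8]

In the figure above, the centers of both the person and the car fall in the middle grid cell. Anchor boxes add more dimensions to the label so that each object can be described by its own anchor box. To keep the explanation simple, only two anchor boxes are used here.

[Figure 9]

Each grid cell holds two anchor boxes, which means each grid cell can predict two objects. Why choose anchor boxes of different shapes? Intuitively, an object is compared against the anchor boxes to see which shape it resembles most, and it tends to be assigned to the anchor box whose shape it matches best. For example, anchor box 1 is shaped more like a pedestrian and anchor box 2 more like a car.

[Figure 10]

As shown, the grid cell in the middle of the image can be labeled this way. Another reason for doing this is that it lets the model specialize: some outputs are trained to detect wide objects such as cars, while others detect tall, thin objects such as pedestrians.

So how do we define and measure how closely an object's shape matches an anchor box?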

Intersection over Union

[Figure 11]

Intersection over Union (IoU) is defined as the ratio of the area of the intersection of two boxes to the area of their union. (x1, y1, x2, y2) are the coordinates of a box's top-left and bottom-right corners.

[Figure 12]

The code for computing IoU is shown below:

def iou(box1, box2):
    """Implement the intersection over union (IoU) between box1 and box2.

    Arguments:
    box1 -- first box, list object with coordinates (x1, y1, x2, y2)
    box2 -- second box, list object with coordinates (x1, y1, x2, y2)
    """
    # Corners (x1, y1, x2, y2) of the intersection of box1 and box2.
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    # Clamp at zero so that boxes which do not overlap give an intersection of 0
    # rather than a spurious positive product of two negative lengths.
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)

    # Union area, using Union(A, B) = A + B - Inter(A, B).
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area

    # Compute the IoU.
    return inter_area / union_area
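
Returning to the earlier question of which anchor box an object resembles most: one common way to compare shapes is to align both boxes at the same corner and take the IoU of their widths and heights only. The sketch below assumes that approach; the anchor sizes and function names are hypothetical and are not taken from the post.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes aligned at the same corner, so only (width, height) matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(object_wh, anchors):
    """Index of the anchor whose shape has the highest IoU with the object."""
    scores = [shape_iou(object_wh, anchor) for anchor in anchors]
    return max(range(len(anchors)), key=lambda i: scores[i])


# Hypothetical anchor shapes as (width, height) in grid-cell units.
anchors = [(0.5, 1.5),  # anchor 0: tall and thin, pedestrian-like
           (2.0, 1.0)]  # anchor 1: wide, car-like

print(best_anchor((0.6, 1.8), anchors))  # 0
print(best_anchor((2.4, 1.1), anchors))  # 1
```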

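With these basic concepts in place, let's look at how YOLO actually performs detection.

Assume a YOLO model has already been trained.

First, take the image to be examined and preprocess it so that its format matches what the model expects.

Second, run the model to get the prediction output. With a 19*19 grid, 5 anchor boxes, and 80 classes, the output has shape (1, 19, 19, 5, 80+5).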
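
The shape (1, 19, 19, 5, 80+5) is easier to picture with a small NumPy sketch. The array below is random stand-in data, not real model output, and the slice layout (xy, then wh, then confidence, then class scores) is an assumption based on the indices used in yolo_head later in this post.

```python
import numpy as np

grid, num_anchors, num_classes = 19, 5, 80

# Random stand-in for the raw last-layer output of one image: (1, 19, 19, 425).
rng = np.random.default_rng(0)
raw = rng.standard_normal((1, grid, grid, num_anchors * (num_classes + 5)))

# Split the last axis into one 85-value vector per anchor box.
out = raw.reshape(1, grid, grid, num_anchors, num_classes + 5)

box_xy = out[..., 0:2]          # center coordinates
box_wh = out[..., 2:4]          # width and height
box_confidence = out[..., 4:5]  # pc
box_class = out[..., 5:]        # c1 ... c80

print(out.shape)             # (1, 19, 19, 5, 85)
print(box_confidence.shape)  # (1, 19, 19, 5, 1)
print(box_class.shape)       # (1, 19, 19, 5, 80)
```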

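Third, post-process the output and filter out low-scoring values. The Pc value in the output is called confidence in the original paper and C is called probs. The score is confidence * probs, in other words the probability that the box contains an object of the given class.

The implementation is as follows: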

import tensorflow as tf
from keras import backend as K  # the listings in this post use the TensorFlow 1.x style Keras backend


def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold=.6):
    """Filters YOLO boxes by thresholding on object and class confidence.

    Arguments:
    box_confidence -- tensor of shape (19, 19, 5, 1)
    boxes -- tensor of shape (19, 19, 5, 4)
    box_class_probs -- tensor of shape (19, 19, 5, 80)
    threshold -- real value, if [highest class probability score < threshold],
                 then get rid of the corresponding box

    Returns:
    scores -- tensor of shape (None,), containing the class probability score for selected boxes
    boxes -- tensor of shape (None, 4), containing (b_x, b_y, b_h, b_w) coordinates of selected boxes
    classes -- tensor of shape (None,), containing the index of the class detected by the selected boxes

    Note: "None" is here because you don't know the exact number of selected boxes,
    as it depends on the threshold. For example, the actual output size of scores
    would be (10,) if there are 10 boxes.
    """
    # Step 1: Compute box scores, shape (19, 19, 5, 80).
    box_scores = box_confidence * box_class_probs

    # Step 2: For each box, find the best class and keep track of its score.
    box_classes = K.argmax(box_scores, axis=-1)
    # Highest score over the 80 classes for each box (not over the 5 anchor boxes).
    box_class_scores = K.max(box_scores, axis=-1)

    # Step 3: Create a filtering mask with the same shape as box_class_scores,
    # True for the boxes to keep (score >= threshold).
    filtering_mask = box_class_scores >= threshold

    # Step 4: Apply the mask to scores, boxes and classes.
    # filtering_mask holds True/False values; tf.boolean_mask drops the entries that
    # are False, so scores, boxes and classes keep only the positions that are True.
    scores = tf.boolean_mask(box_class_scores, filtering_mask)
    boxes = tf.boolean_mask(boxes, filtering_mask)
    classes = tf.boolean_mask(box_classes, filtering_mask)

    return scores, boxes, classes
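
The same filtering logic can be tried without a TensorFlow session. The following is a NumPy sketch of the four steps, not the post's implementation; the function name and the toy numbers (a 1*1 grid, 2 anchor boxes, 3 classes) are made up for illustration.

```python
import numpy as np


def filter_boxes_np(box_confidence, boxes, box_class_probs, threshold=0.6):
    """NumPy sketch of score filtering: keep boxes whose best class score >= threshold."""
    box_scores = box_confidence * box_class_probs   # broadcast over the class axis
    box_classes = np.argmax(box_scores, axis=-1)    # best class per box
    box_class_scores = np.max(box_scores, axis=-1)  # score of that class
    mask = box_class_scores >= threshold
    return box_class_scores[mask], boxes[mask], box_classes[mask]


conf = np.array([[[[0.9], [0.3]]]])                        # shape (1, 1, 2, 1)
probs = np.array([[[[0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]]])   # shape (1, 1, 2, 3)
boxes = np.array([[[[0.1, 0.1, 0.5, 0.5],
                    [0.2, 0.2, 0.9, 0.9]]]])               # shape (1, 1, 2, 4)

scores, kept, classes = filter_boxes_np(conf, boxes, probs)
print(classes)     # [1]  only the first box survives, as class 1
print(kept.shape)  # (1, 4)
```

The first box scores 0.9 * 0.8 = 0.72 and is kept; the second box's best score is 0.3 * 0.5 = 0.15 and is dropped.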

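Fourth, the same object may be predicted by several grid cells, which gives it several bounding boxes. We keep the prediction with the highest pc and filter out the others. To decide whether several bounding boxes are predicting the same object we use IoU. What remains is the class and position of each detected object in the image; after some further computation and conversion these become real coordinates on the image, which can then be drawn on the original picture.

The filtering in this fourth step is called non-max suppression (NMS).

The code is as follows: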

def yolo_non_max_suppression(scores, boxes, classes, max_boxes=10, iou_threshold=0.5):
    """
    Applies Non-max suppression (NMS) to set of boxes.

    Arguments:
    scores -- tensor of shape (None,), output of yolo_filter_boxes()
    boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been
             scaled to the image size (see later)
    classes -- tensor of shape (None,), output of yolo_filter_boxes()
    max_boxes -- integer, maximum number of predicted boxes you'd like
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

    Returns:
    scores -- tensor of shape (None,), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box

    Note: The "None" dimension of the output tensors is at most max_boxes.
    """
    # Tensor to be used in tf.image.non_max_suppression().
    max_boxes_tensor = K.variable(max_boxes, dtype='int32')
    # Initialize variable max_boxes_tensor.
    K.get_session().run(tf.variables_initializer([max_boxes_tensor]))

    # Use tf.image.non_max_suppression() to get the indices of the boxes to keep.
    # This is a TensorFlow library function used by many detection algorithms.
    nms_indices = tf.image.non_max_suppression(boxes, scores, max_boxes_tensor, iou_threshold)

    # Select only nms_indices from scores, boxes and classes.
    scores = tf.gather(scores, nms_indices)
    boxes = tf.gather(boxes, nms_indices)
    classes = tf.gather(classes, nms_indices)

    return scores, boxes, classes

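[Figure 13]

The figure above shows a concrete case: three red bounding boxes predict the pickup truck and two yellow bounding boxes predict the car. We want to keep only the one with the highest pc for each object, and IoU is how we decide which bounding boxes are predicting the same object.

tf.image.non_max_suppression works as follows. From all boxes it first selects the one with the highest score. It then computes the IoU between that box and every remaining box and deletes any box whose IoU exceeds iou_threshold. It then takes the highest-scoring box among those left and repeats, until no boxes remain or max_boxes_tensor boxes have been selected.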
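
The greedy procedure just described can be written out in a few lines of NumPy. This is a sketch of the idea, not TensorFlow's actual implementation; the helper names and sample boxes are made up, and boxes are assumed to be (x1, y1, x2, y2).

```python
import numpy as np


def iou_one_to_many(box, others):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    xi1 = np.maximum(box[0], others[:, 0])
    yi1 = np.maximum(box[1], others[:, 1])
    xi2 = np.minimum(box[2], others[:, 2])
    yi2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(xi2 - xi1, 0, None) * np.clip(yi2 - yi1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, max_boxes=10, iou_threshold=0.5):
    """Greedy non-max suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0 and len(keep) < max_boxes:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        # Drop every remaining box that overlaps the chosen one too much.
        order = rest[overlaps <= iou_threshold]
    return keep


boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],       # near-duplicate of the first box
                  [100, 100, 150, 150]],  # a separate object
                 dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```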

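The result after processing is shown below.

[Figure 14]

The whole prediction process is summarized in the following figure:

[Figure 15]

The illustrations of the YOLO detection process at deepsystems.io are very clear and are worth consulting.

The rest of this post goes through the detection code in detail.

The core detection code is as follows: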

import os
import scipy.misc
from matplotlib.pyplot import imshow

# yolo_model, input_image_shape, class_names, generate_colors and draw_boxes are
# used below but not defined in this listing; they are assumed to be set up elsewhere.


def predict(sess, image_file):
    """
    Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the predictions.

    Arguments:
    sess -- your tensorflow/Keras session containing the YOLO graph
    image_file -- name of an image stored in the "images" folder.

    Returns:
    out_scores -- tensor of shape (None,), scores of the predicted boxes
    out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
    out_classes -- tensor of shape (None,), class index of the predicted boxes

    Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes.
    """
    # Preprocess your image
    image, image_data = preprocess_image("images/" + image_file, model_image_size=(608, 608))

    # Run the session with the correct tensors and choose the correct placeholders in the feed_dict.
    # You'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})
    out_scores, out_boxes, out_classes = sess.run(
        [scores, boxes, classes],
        feed_dict={yolo_model.input: image_data,
                   input_image_shape: [image.size[1], image.size[0]],
                   K.learning_phase(): 0})

    # Print predictions info
    print('Found {} boxes for {}'.format(len(out_boxes), image_file))
    # Generate colors for drawing bounding boxes.
    colors = generate_colors(class_names)
    # Draw bounding boxes on the image file
    draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
    # Save the predicted bounding box on the image
    image.save(os.path.join("out", image_file), quality=90)
    # Display the results in the notebook
    output_image = scipy.misc.imread(os.path.join("out", image_file))
    imshow(output_image)

    return out_scores, out_boxes, out_classes

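The code block carries English comments. In short, it does the following:

Preprocess the image with the preprocess_image function
Call sess.run to obtain out_scores, out_boxes, and out_classes; the scores, boxes, and classes tensors it evaluates are the outputs of yolo_eval()
Draw the predicted boxes on the image with draw_boxes

Let's first look at what preprocess_image does. Every line of the function is commented.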

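import imghdr
import numpy as np
from PIL import Image


def preprocess_image(img_path, model_image_size):
    # Get the image type (not used further in this function).
    image_type = imghdr.what(img_path)
    image = Image.open(img_path)
    # Resize the image to the fixed 608*608 model input size.
    resized_image = image.resize(tuple(reversed(model_image_size)), Image.BICUBIC)
    # Read the pixel data into an array.
    image_data = np.array(resized_image, dtype='float32')
    # Divide by the maximum RGB value (255) so pixel values lie in [0, 1].
    image_data /= 255.
    # Add a batch dimension at the front.
    image_data = np.expand_dims(image_data, 0)
    return image, image_data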

Next, look at the yolo_eval function.

def yolo_eval(yolo_outputs,
              image_shape,
              max_boxes=10,
              score_threshold=.6,
              iou_threshold=.5):
    """
    Evaluate YOLO model on given input batch and return filtered boxes.

    Arguments:
    yolo_outputs -- the model output after it has been passed through yolo_head:
                    box_confidence, box_xy, box_wh, box_class_probs
    image_shape -- dimensions of the input image
    max_boxes -- maximum number of boxes predicted per image
    score_threshold -- minimum score a box needs in order to be kept
    iou_threshold -- IoU threshold used for non-max suppression
    """
    # Unpack box_confidence, box_xy, box_wh, box_class_probs.
    box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs
    # Convert the boxes from center/size form to corner coordinates.
    boxes = yolo_boxes_to_corners(box_xy, box_wh)
    # Drop boxes scoring below score_threshold (implemented earlier in this post).
    # The unpacking order matches what yolo_filter_boxes returns.
    scores, boxes, classes = yolo_filter_boxes(
        box_confidence, boxes, box_class_probs, threshold=score_threshold)

    # Scale boxes back to original image shape.
    height = image_shape[0]
    width = image_shape[1]
    # Build image_dims as a (1, 4) tensor [height, width, height, width]
    # so that it broadcasts over every box.
    image_dims = K.stack([height, width, height, width])
    image_dims = K.reshape(image_dims, [1, 4])

    boxes = boxes * image_dims

    # Perform non-max suppression with a threshold of iou_threshold
    # (yolo_non_max_suppression was covered above).
    scores, boxes, classes = yolo_non_max_suppression(
        scores, boxes, classes, max_boxes=max_boxes, iou_threshold=iou_threshold)

    # Same order that predict() expects: scores, boxes, classes.
    return scores, boxes, classes

What remains is how yolo_outputs is obtained, that is, the yolo_head function. yolo_boxes_to_corners itself is fairly simple, but pay attention to the order of the coordinate axes.

def yolo_boxes_to_corners(box_xy, box_wh):
    """Convert YOLO box predictions to bounding box corners."""
    box_mins = box_xy - (box_wh / 2.)
    box_maxes = box_xy + (box_wh / 2.)

    # Note: the returned order is [y, x, y, x].
    return K.concatenate([
        box_mins[..., 1:2],   # y_min
        box_mins[..., 0:1],   # x_min
        box_maxes[..., 1:2],  # y_max
        box_maxes[..., 0:1]   # x_max
    ])
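
To see the [y, x, y, x] ordering and the later scaling step with concrete numbers, here is a NumPy sketch of the same conversion. The function name, the sample box, and the 720*1280 image size are hypothetical.

```python
import numpy as np


def boxes_to_corners_np(box_xy, box_wh):
    """NumPy version of the center/size to corner conversion, in [y, x, y, x] order."""
    box_mins = box_xy - box_wh / 2.0
    box_maxes = box_xy + box_wh / 2.0
    return np.concatenate([box_mins[..., 1:2],    # y_min
                           box_mins[..., 0:1],    # x_min
                           box_maxes[..., 1:2],   # y_max
                           box_maxes[..., 0:1]],  # x_max
                          axis=-1)


box_xy = np.array([[0.5, 0.25]])  # center (x, y), relative to the image
box_wh = np.array([[0.2, 0.1]])   # (width, height), relative to the image

corners = boxes_to_corners_np(box_xy, box_wh)  # about [[0.2, 0.4, 0.3, 0.6]]

# Scale to pixels the same way yolo_eval does, for a hypothetical 720*1280 image.
height, width = 720.0, 1280.0
image_dims = np.array([[height, width, height, width]])
pixels = corners * image_dims  # about [[144, 512, 216, 768]]
print(pixels)
```

Because y comes first, the height multiplies the first and third entries and the width multiplies the second and fourth.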

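yolo_head involves a fair amount of computation, and both the values and the tensor shapes change several times along the way, so its details are worth understanding carefully.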
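
One part of yolo_head that is hard to read is the sequence of tile and transpose calls that builds the grid offset table. The NumPy sketch below is an illustration under the assumption of a square grid, not part of the original function: it builds the same table with np.meshgrid and checks it against a direct transcription of the tile/transpose steps. Both function names are made up here.

```python
import numpy as np


def grid_offsets(grid_h, grid_w):
    """Offset table of shape (1, grid_h, grid_w, 1, 2); entry [0, i, j, 0] is (j, i)."""
    cols, rows = np.meshgrid(np.arange(grid_w), np.arange(grid_h))
    return np.stack([cols, rows], axis=-1).reshape(1, grid_h, grid_w, 1, 2)


def grid_offsets_tiled(n):
    """NumPy transcription of the tile/transpose steps, for a square n*n grid."""
    height_index = np.tile(np.arange(n), n)
    width_index = np.tile(np.arange(n)[None, :], (n, 1))
    width_index = width_index.T.flatten()
    index = np.stack([height_index, width_index]).T
    return index.reshape(1, n, n, 1, 2)


offsets = grid_offsets(19, 19)
print(offsets.shape)        # (1, 19, 19, 1, 2)
print(offsets[0, 2, 5, 0])  # [5 2] -> (column, row) of the cell at row 2, column 5
print(np.array_equal(offsets, grid_offsets_tiled(19)))  # True
```

Each cell's entry is simply its own (column, row) index, repeated for every anchor box through broadcasting on the fourth axis.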

def yolo_head(feats, anchors, num_classes):
    """Convert final layer features to bounding box parameters.

    Parameters
    ----------
    feats : tensor, shape (?, 19, 19, 425)
        Final convolutional layer features. This is the last-layer output of
        yolo_model; it is reshaped below to (m, 19, 19, 5, 85).
    anchors : array-like
        Anchor box widths and heights (5 anchors).
    num_classes : int
        Number of target classes (80).

    Returns
    -------
    box_confidence : tensor, shape (?, ?, ?, 5, 1)
        Probability estimate for whether each box contains any object.
    box_xy : tensor, shape (?, ?, ?, 5, 2)
        x, y box predictions adjusted by spatial location in conv layer.
    box_wh : tensor, shape (?, ?, ?, 5, 2)
        w, h box predictions adjusted by anchors and conv spatial resolution.
        Together, box_xy and box_wh give the center and size of every bounding box.
    box_class_pred : tensor, shape (?, ?, ?, 5, 80)
        Probability distribution estimate for each box over class labels.
    """
    num_anchors = len(anchors)  # 5 anchors

    # Reshape to batch, height, width, num_anchors, box_params.
    anchors_tensor = K.reshape(K.variable(anchors), [1, 1, 1, num_anchors, 2])

    # Take dimensions 1 and 2 of feats, i.e. the 19*19 grid size.
    conv_dims = K.shape(feats)[1:3]  # a 2-element tensor: [19, 19]

    # In YOLO the height index is the inner most iteration. [0,1,2,3,4,5,...,18]
    conv_height_index = K.arange(0, stop=conv_dims[0])
    conv_width_index = K.arange(0, stop=conv_dims[1])
    # K.tile(x, n) repeats x along each dimension; x is a tensor and n is a list with
    # one entry per dimension of x. conv_height_index has a single dimension here.
    # The next lines look convoluted, but they simply build an offset table
    # of shape (1, 19, 19, 1, 2).
    conv_height_index = K.tile(conv_height_index, [conv_dims[1]])
    conv_width_index = K.tile(K.expand_dims(conv_width_index, 0), [conv_dims[0], 1])
    conv_width_index = K.flatten(K.transpose(conv_width_index))
    conv_index = K.transpose(K.stack([conv_height_index, conv_width_index]))
    conv_index = K.reshape(conv_index, [1, conv_dims[0], conv_dims[1], 1, 2])
    conv_index = K.cast(conv_index, K.dtype(feats))

    feats = K.reshape(feats, [-1, conv_dims[0], conv_dims[1], num_anchors, num_classes + 5])
    conv_dims = K.cast(K.reshape(conv_dims, [1, 1, 1, 1, 2]), K.dtype(feats))

    # Static generation of conv_index (this static version is easier to follow):
    # conv_index = np.array([_ for _ in np.ndindex(conv_width, conv_height)])
    # conv_index = conv_index[:, [1, 0]]  # swap columns for YOLO ordering.
    # conv_index = K.variable(
    #     conv_index.reshape(1, conv_height, conv_width, 1, 2))
    # feats = Reshape(
    #     (conv_dims[0], conv_dims[1], num_anchors, num_classes + 5))(feats)

    box_confidence = K.sigmoid(feats[..., 4:5])
    box_xy = K.sigmoid(feats[..., :2])