MS COCO is a dataset built by Microsoft that covers detection, segmentation, keypoints, and other tasks; it is one of the mainstream object detection datasets.
Compared with PASCAL VOC, the images in COCO are natural photos of everyday objects and scenes: the backgrounds are more complex, there are more objects per image, and the objects are smaller, so tasks on COCO are harder. Nowadays a detection model is usually judged by its results on COCO.
What exactly is a stuff category ❓
It should be emphasized that the COCO object detection dataset has 80 categories, but the category IDs run from 1 up to 90 (with gaps); Background is not included.
What is the difference between the 80 object classes and the 91 stuff classes ❓
The distribution of the number of instances per category is shown in the figure below (figure not reproduced here):
Take the instances_train2017 annotation file as an example:
import json
annotation_file = '/data/code/Det_Code/COCO_2017/annotations/instances_train2017.json'
with open(annotation_file, 'r') as f:
    dataset = json.load(f)
After loading, we get a dict with five keys: ['info', 'licenses', 'images', 'annotations', 'categories'].
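As a quick sanity check, you can inspect this dict directly (a small sketch continuing from the snippet above):

# Inspect the loaded annotation dict (continuing from the json.load above)
print(dataset.keys())                 # dict_keys(['info', 'licenses', 'images', 'annotations', 'categories'])
print(len(dataset['images']))         # number of images in train2017
print(len(dataset['annotations']))    # number of instance annotations
print(len(dataset['categories']))     # 80 object categories
print(dataset['annotations'][0])      # one instance annotation, like the example below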
annotations is a list whose elements correspond to instances rather than images. Taking the first instance as an example:
{'segmentation': [[239.97, 260.24, 222.04, 270.49, 199.84, 253.41, 213.5, 227.79, 259.62, 200.46, 274.13, 202.17, 277.55, 210.71, 249.37, 253.41, 237.41, 264.51, 242.54, 261.95, 228.87, 271.34]], 'area': 2765.1486500000005, 'iscrowd': 0, 'image_id': 558840, 'bbox': [199.84, 200.46, 77.71, 70.88], 'category_id': 58, 'id': 156}
The keys are: the instance's segmentation in polygon form (segmentation), the instance area (area), the iscrowd flag, the id of the image it belongs to (image_id), the bounding box given as top-left x, y plus width and height (bbox), the category id (category_id), and the annotation's own id.
categories is a list whose length equals the number of detection categories; every element is a dict describing one category, including the category id, the category name, and its supercategory. Note that there are actually only 80 object classes. For example, the last category:
79 {'supercategory': 'indoor', 'id': 90, 'name': 'toothbrush'}
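Because the 80 category ids are non-contiguous (1 to 90 with gaps), training code usually remaps them to contiguous labels. A minimal sketch using the dataset dict loaded above:

# Map sparse COCO category ids (1..90 with gaps) to contiguous labels 0..79
cat_ids = sorted(c['id'] for c in dataset['categories'])
cat_id_to_label = {cat_id: i for i, cat_id in enumerate(cat_ids)}
label_to_name = {cat_id_to_label[c['id']]: c['name'] for c in dataset['categories']}
print(len(cat_id_to_label))                     # 80
print(cat_id_to_label[90], label_to_name[79])   # 79 toothbrush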
images is a list whose length equals the number of images; every element is a dict with the information of one image, such as the file name, the image width, the image height, and so on.
{'license': 3, 'file_name': '000000391895.jpg', 'coco_url': 'http://images.cocodataset.org/train2017/000000391895.jpg', 'height': 360, 'width': 640, 'date_captured': '2013-11-14 11:18:45', 'flickr_url': 'http://farm9.staticflickr.com/8186/8119368305_4e622c8349_z.jpg', 'id': 391895}
The keys of interest are the file name (file_name), the height (height), and the width (width).
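The images and annotations lists are linked through image_id. Even without pycocotools you can group them yourself; a small sketch based on the dict loaded earlier:

from collections import defaultdict

# Build image_id -> image info and image_id -> list of instance annotations
imgs = {img['id']: img for img in dataset['images']}
img_to_anns = defaultdict(list)
for ann in dataset['annotations']:
    img_to_anns[ann['image_id']].append(ann)

img_id = dataset['images'][0]['id']
print(imgs[img_id]['file_name'], len(img_to_anns[img_id]))   # file name and its instance count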
The COCO team provides an official API for reading the MS COCO dataset; it implements data loading and other important functionality and can be installed with:
pip install pycocotools
Visualizing instance bounding boxes with pycocotools
import os
from pycocotools.coco import COCO
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt

json_path = "/data/code/Det_Code/COCO_2017/annotations/instances_val2017.json"  # path to the annotation file
img_path = "/data/code/Det_Code/COCO_2017/val2017"  # path to the image folder

# COCO is the class in the official API that loads the annotation file
coco = COCO(annotation_file=json_path)

# collect the ids of all images listed in the annotation file
ids = list(sorted(coco.imgs.keys()))
print("number of images: {}".format(len(ids)))

# get all coco class labels
coco_classes = dict([(v["id"], v["name"]) for k, v in coco.cats.items()])

# iterate over the first two images
for img_id in ids[:2]:
    # get the ids of all annotations belonging to this image
    ann_ids = coco.getAnnIds(imgIds=img_id)
    # load the annotations from those ids
    targets = coco.loadAnns(ann_ids)

    # get image file name
    path = coco.loadImgs(img_id)[0]['file_name']

    # read image
    img = Image.open(os.path.join(img_path, path)).convert('RGB')
    draw = ImageDraw.Draw(img)

    # draw box to image
    for target in targets:
        x, y, w, h = target["bbox"]
        x1, y1, x2, y2 = x, y, int(x + w), int(y + h)
        draw.rectangle((x1, y1, x2, y2))
        draw.text((x1, y1), coco_classes[target["category_id"]])

    # show image
    plt.imshow(img)
    plt.savefig('example_{}.png'.format(img_id), dpi=50)
    plt.show()
The code here follows a CSDN blog post; it calls quite a few functions of the API and deserves further study.
Visualizing instance segmentation masks with pycocotools
This can again be done by following the CSDN blog; its code is not reproduced here.
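As a minimal sketch (not the blog's code), pycocotools can convert a polygon annotation into a binary mask with coco.annToMask, which can then be overlaid on the image; the paths follow the bounding-box example above:

import os
import numpy as np
from PIL import Image
from pycocotools.coco import COCO
import matplotlib.pyplot as plt

json_path = "/data/code/Det_Code/COCO_2017/annotations/instances_val2017.json"
img_path = "/data/code/Det_Code/COCO_2017/val2017"
coco = COCO(annotation_file=json_path)

img_id = list(sorted(coco.imgs.keys()))[0]
img_info = coco.loadImgs(img_id)[0]
img = Image.open(os.path.join(img_path, img_info['file_name'])).convert('RGB')

# annToMask converts a polygon/RLE segmentation into an HxW 0/1 array
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
mask = np.zeros((img_info['height'], img_info['width']), dtype=np.uint8)
for ann in anns:
    mask = np.maximum(mask, coco.annToMask(ann) * ann['category_id'])

plt.imshow(img)
plt.imshow(mask, alpha=0.5)   # overlay all instance masks on the image
plt.savefig('mask_example_{}.png'.format(img_id), dpi=50)
plt.show()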
mAP
PASCAL VOC measures mAP at IoU = 0.5. COCO's main metric, AP, instead sweeps the IoU threshold from 0.5 to 0.95 in steps of 0.05, computes AP at each of the 10 thresholds, and reports their mean as the final AP. COCO's AP@0.5 therefore has the same meaning as PASCAL VOC's mAP, and AP@0.75 is the same metric with the IoU threshold raised to 0.75, which demands more accurate localization and thus gives lower values.
The higher the IoU threshold, the lower the AP, so the averaged AP ends up much smaller than AP@0.5. This is why only a handful of models exceed 50% AP on COCO: getting past 50% is genuinely hard.
Evaluation by object size
COCO also reports metrics for three object sizes (small, medium, large). Roughly 41% of the objects in COCO are small (area < 32×32), 34% are medium (32×32 < area < 96×96), and 24% are large (area > 96×96). AP on small objects is particularly hard to improve.
pycocotools also provides the code for the evaluation metrics described above.
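For reference, a minimal sketch of that evaluation flow (assuming your model's predictions have been written to a results json in the standard COCO results format; the file name here is just an example):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

ann_file = "/data/code/Det_Code/COCO_2017/annotations/instances_val2017.json"
res_file = "detections_val2017.json"   # hypothetical results file produced by a model

coco_gt = COCO(ann_file)
coco_dt = coco_gt.loadRes(res_file)    # load detections in COCO results format

coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
coco_eval.evaluate()
coco_eval.accumulate()
# summarize() prints AP averaged over IoU=0.50:0.95, AP@0.5, AP@0.75,
# and AP/AR broken down by small / medium / large objects
coco_eval.summarize()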
Here we take the data-loading code in DETR as an example. The COCO object detection dataset class CocoDetection is defined as:
class CocoDetection(torchvision.datasets.CocoDetection):
    def __init__(self, img_folder, ann_file, transforms, return_masks):
        super(CocoDetection, self).__init__(img_folder, ann_file)
        self._transforms = transforms
        self.prepare = ConvertCocoPolysToMask(return_masks)

    def __getitem__(self, idx):
        # img: the PIL.Image read from disk; target: all instance annotations of this image
        img, target = super(CocoDetection, self).__getitem__(idx)
        image_id = self.ids[idx]
        target = {'image_id': image_id, 'annotations': target}
        # process target further into a form convenient for training
        img, target = self.prepare(img, target)
        if self._transforms is not None:
            img, target = self._transforms(img, target)
        return img, target
As you can see, this class inherits most of its methods from torchvision.datasets.CocoDetection, which torchvision implements by wrapping the COCO API (i.e. pycocotools); it contains many functions for reading images, reading annotations, and so on. We will not dig into it here and only describe the inputs and outputs of the CocoDetection class as we actually use it.
To initialize CocoDetection you only need the image folder of the prepared COCO 2017 dataset, img_folder (for example the training images at /data/code/Det_Code/COCO_2017/train2017), the annotation file path ann_file (for example /data/code/Det_Code/COCO_2017/annotations/instances_train2017.json), transforms (the image augmentation/transform function), and return_masks (not needed for detection; it is related to segmentation).
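Under those assumptions, a rough usage sketch (transforms is left as None here so that only prepare runs; DETR itself passes the transform pipeline built by make_coco_transforms):

img_folder = "/data/code/Det_Code/COCO_2017/train2017"
ann_file = "/data/code/Det_Code/COCO_2017/annotations/instances_train2017.json"

# transforms=None: skip DETR's augmentation and only run ConvertCocoPolysToMask
dataset = CocoDetection(img_folder, ann_file, transforms=None, return_masks=False)
print(len(dataset))         # number of training images

img, target = dataset[0]    # img is a PIL.Image, target is the dict described below
print(img.size, target.keys())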
DETR's implementation of ConvertCocoPolysToMask (the prepare step above) is as follows:
def convert_coco_poly_to_mask(segmentations, height, width):
    masks = []
    for polygons in segmentations:
        # convert the polygon(s) to RLE and decode into a binary HxWxN array
        rles = coco_mask.frPyObjects(polygons, height, width)
        mask = coco_mask.decode(rles)
        if len(mask.shape) < 3:
            mask = mask[..., None]
        mask = torch.as_tensor(mask, dtype=torch.uint8)
        # merge the parts of one instance into a single HxW mask
        mask = mask.any(dim=2)
        masks.append(mask)
    if masks:
        masks = torch.stack(masks, dim=0)
    else:
        masks = torch.zeros((0, height, width), dtype=torch.uint8)
    return masks
class ConvertCocoPolysToMask(object):
    def __init__(self, return_masks=False):
        self.return_masks = return_masks

    def __call__(self, image, target):
        w, h = image.size

        image_id = target["image_id"]
        image_id = torch.tensor([image_id])

        anno = target["annotations"]

        # drop crowd annotations
        anno = [obj for obj in anno if 'iscrowd' not in obj or obj['iscrowd'] == 0]

        boxes = [obj["bbox"] for obj in anno]
        # guard against no boxes via resizing
        boxes = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
        # convert [x, y, w, h] to [x1, y1, x2, y2] and clip to the image
        boxes[:, 2:] += boxes[:, :2]
        boxes[:, 0::2].clamp_(min=0, max=w)
        boxes[:, 1::2].clamp_(min=0, max=h)

        classes = [obj["category_id"] for obj in anno]
        classes = torch.tensor(classes, dtype=torch.int64)

        if self.return_masks:
            segmentations = [obj["segmentation"] for obj in anno]
            masks = convert_coco_poly_to_mask(segmentations, h, w)

        keypoints = None
        if anno and "keypoints" in anno[0]:
            keypoints = [obj["keypoints"] for obj in anno]
            keypoints = torch.as_tensor(keypoints, dtype=torch.float32)
            num_keypoints = keypoints.shape[0]
            if num_keypoints:
                keypoints = keypoints.view(num_keypoints, -1, 3)

        # keep only boxes with positive width and height
        keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])
        boxes = boxes[keep]
        classes = classes[keep]
        if self.return_masks:
            masks = masks[keep]
        if keypoints is not None:
            keypoints = keypoints[keep]

        target = {}
        target["boxes"] = boxes
        target["labels"] = classes
        if self.return_masks:
            target["masks"] = masks
        target["image_id"] = image_id
        if keypoints is not None:
            target["keypoints"] = keypoints

        # for conversion to coco api
        area = torch.tensor([obj["area"] for obj in anno])
        iscrowd = torch.tensor([obj["iscrowd"] if "iscrowd" in obj else 0 for obj in anno])
        target["area"] = area[keep]
        target["iscrowd"] = iscrowd[keep]

        target["orig_size"] = torch.as_tensor([int(h), int(w)])
        target["size"] = torch.as_tensor([int(h), int(w)])

        return image, target
This looks rather involved, so again we only describe what the function does: the image is unchanged before and after processing; what changes is target. Before processing it looks as follows and contains the annotations of 8 instances:
[
{'segmentation': [[500.49, 473.53, 599.73, 419.6, 612.67, 375.37, 608.36, 354.88, 528.54, 269.66, 457.35, 201.71, 420.67, 187.69, 389.39, 192.0, 19.42, 360.27, 1.08, 389.39, 2.16, 427.15, 20.49, 473.53]], 'area': 120057.13925, 'iscrowd': 0, 'image_id': 9, 'bbox': [1.08, 187.69, 611.59, 285.84], 'category_id': 51, 'id': 1038967},
{'segmentation': [[357.03, 69.03, 311.73, 15.1, 550.11, 4.31, 631.01, 62.56, 629.93, 88.45, 595.42, 185.53, 513.44, 230.83, 488.63, 232.99, 437.93, 190.92, 429.3, 189.84, 434.7, 148.85, 410.97, 121.89, 359.19, 74.43, 358.11, 65.8]], 'area': 44434.751099999994, 'iscrowd': 0, 'image_id': 9, 'bbox': [311.73, 4.31, 319.28, 228.68], 'category_id': 51, 'id': 1039564},
{'segmentation': [[249.6, 348.99, 267.67, 311.72, 291.39, 294.78, 304.94, 294.78, 326.4, 283.48, 345.6, 273.32, 368.19, 269.93, 385.13, 268.8, 388.52, 257.51, 393.04, 250.73, 407.72, 240.56, 425.79, 230.4, 441.6, 229.27, 447.25, 237.18, 447.25, 256.38, 456.28, 254.12, 475.48, 263.15, 486.78, 271.06, 495.81, 264.28, 498.07, 257.51, 500.33, 255.25, 507.11, 259.76, 513.88, 266.54, 513.88, 273.32, 513.88, 276.71, 526.31, 276.71, 526.31, 286.87, 519.53, 291.39, 519.53, 297.04, 524.05, 306.07, 525.18, 315.11, 529.69, 329.79, 529.69, 337.69, 530.82, 348.99, 536.47, 339.95, 545.51, 350.12, 555.67, 360.28, 557.93, 380.61, 561.32, 394.16, 565.84, 413.36, 522.92, 441.6, 469.84, 468.71, 455.15, 474.35, 307.2, 474.35, 316.24, 464.19, 330.92, 438.21, 325.27, 399.81, 310.59, 378.35, 301.55, 371.58, 252.99, 350.12]], 'area': 49577.94434999999, 'iscrowd': 0, 'image_id': 9, 'bbox': [249.6, 229.27, 316.24, 245.08], 'category_id': 56, 'id': 1058555},
{'segmentation': [[434.48, 152.33, 433.51, 184.93, 425.44, 189.45, 376.7, 195.58, 266.94, 248.53, 179.78, 290.17, 51.62, 346.66, 16.43, 366.68, 1.9, 388.63, 0.0, 377.33, 0.0, 357.64, 0.0, 294.04, 22.56, 294.37, 56.14, 300.82, 83.58, 300.82, 109.08, 289.2, 175.26, 263.38, 216.9, 243.36, 326.34, 197.52, 387.03, 172.34, 381.54, 162.33, 380.89, 147.16, 380.89, 140.06, 370.89, 102.29, 330.86, 61.94, 318.91, 48.38, 298.57, 47.41, 287.28, 37.73, 259.51, 33.85, 240.14, 32.56, 240.14, 28.36, 247.57, 24.17, 271.46, 15.13, 282.11, 13.51, 296.96, 18.68, 336.34, 55.48, 391.55, 106.81, 432.87, 147.16], [62.46, 97.21, 130.25, 69.77, 161.25, 59.12, 183.52, 52.02, 180.94, 59.12, 170.93, 78.17, 170.28, 90.76, 157.05, 95.92, 130.25, 120.78, 119.92, 129.49, 102.17, 115.29, 64.72, 119.81, 0.0, 137.89, 0.0, 120.13, 0.0, 117.87]], 'area': 24292.781700000007, 'iscrowd': 0, 'image_id': 9, 'bbox': [0.0, 13.51, 434.48, 375.12], 'category_id': 51, 'id': 1534147},
{'segmentation': [[376.2, 61.55, 391.86, 46.35, 424.57, 40.36, 441.62, 43.59, 448.07, 50.04, 451.75, 63.86, 448.07, 68.93, 439.31, 70.31, 425.49, 73.53, 412.59, 75.38, 402.92, 84.13, 387.71, 86.89, 380.8, 70.77]], 'area': 2239.2924, 'iscrowd': 0, 'image_id': 9, 'bbox': [376.2, 40.36, 75.55, 46.53], 'category_id': 55, 'id': 1913551},
{'segmentation': [[473.92, 85.64, 469.58, 83.47, 465.78, 78.04, 466.87, 72.08, 472.84, 59.59, 478.26, 47.11, 496.71, 38.97, 514.62, 40.6, 521.13, 49.28, 523.85, 55.25, 520.05, 63.94, 501.06, 72.62, 482.6, 82.93]], 'area': 1658.8913000000007, 'iscrowd': 0, 'image_id': 9, 'bbox': [465.78, 38.97, 58.07, 46.67], 'category_id': 55, 'id': 1913746},
{'segmentation': [[385.7, 85.85, 407.12, 80.58, 419.31, 79.26, 426.56, 77.94, 435.45, 74.65, 442.7, 73.66, 449.95, 73.99, 456.87, 77.94, 463.46, 83.87, 467.74, 92.77, 469.39, 104.63, 469.72, 117.15, 469.39, 135.27, 468.73, 141.86, 466.09, 144.17, 449.29, 141.53, 437.1, 136.92, 430.18, 129.67]], 'area': 3609.3030499999995, 'iscrowd': 0, 'image_id': 9, 'bbox': [385.7, 73.66, 84.02, 70.51], 'category_id': 55, 'id': 1913856},
{'segmentation': [[458.81, 24.94, 437.61, 4.99, 391.48, 2.49, 364.05, 56.1, 377.77, 73.56, 377.77, 56.1, 392.73, 41.14, 403.95, 41.14, 420.16, 39.9, 435.12, 42.39, 442.6, 46.13, 455.06, 31.17]], 'area': 2975.276, 'iscrowd': 0, 'image_id': 9, 'bbox': [364.05, 2.49, 94.76, 71.07], 'category_id': 55, 'id': 1914001}]
After processing, the result is:
{
'boxes':
tensor([[ 1.0800, 187.6900, 612.6700, 473.5300],
[311.7300, 4.3100, 631.0100, 232.9900],
[249.6000, 229.2700, 565.8400, 474.3500],
[ 0.0000, 13.5100, 434.4800, 388.6300],
[376.2000, 40.3600, 451.7500, 86.8900],
[465.7800, 38.9700, 523.8500, 85.6400],
[385.7000, 73.6600, 469.7200, 144.1700],
[364.0500, 2.4900, 458.8100, 73.5600]]),
'labels':
tensor([51, 51, 56, 51, 55, 55, 55, 55]),
'image_id': tensor([9]),
'area':
tensor([120057.1406, 44434.7500, 49577.9453, 24292.7812, 2239.2925, 1658.8914, 3609.3030, 2975.2759]),
'iscrowd':
tensor([0, 0, 0, 0, 0, 0, 0, 0]),
'orig_size': tensor([480, 640]),
'size': tensor([480, 640])}
In DETR, every loaded image has a different size, so the images cannot be stacked directly into a batch; we therefore need to implement the DataLoader's collate_fn ourselves, as follows:
def collate_fn(batch):
    # batch is a list in which every element is a tuple (image, target), i.e.
    # [(image_1, target_1), (image_2, target_2), ...]; collate_fn regroups them
    # into [(image_1, image_2, ...), (target_1, target_2, ...)]
    batch = list(zip(*batch))
    batch[0] = nested_tensor_from_tensor_list(batch[0])
    return tuple(batch)
Note that nested_tensor_from_tensor_list pads images of different sizes so that they share a common size. How does it work ❓
It first finds the maximum width W and maximum height H over the batch and builds an all-zero tensor of shape bs x c x H x W; each image is then copied into it starting from the top-left corner, and the positions that are not covered stay 0, which is essentially image padding. Along with this bs x c x H x W tensor it also produces a mask of shape bs x H x W that is True at the padded positions (in DETR this mask is later used, for example by the transformer attention, to ignore the padding). The two are wrapped into a NestedTensor, a small class with two members: tensors and mask.
class NestedTensor(object):
    def __init__(self, tensors, mask: Optional[Tensor]):
        self.tensors = tensors
        self.mask = mask

    def to(self, device):
        # type: (Device) -> NestedTensor # noqa
        cast_tensor = self.tensors.to(device)
        mask = self.mask
        if mask is not None:
            assert mask is not None
            cast_mask = mask.to(device)
        else:
            cast_mask = None
        return NestedTensor(cast_tensor, cast_mask)

    def decompose(self):
        return self.tensors, self.mask

    def __repr__(self):
        return str(self.tensors)
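Based on the description above, here is a simplified sketch of what nested_tensor_from_tensor_list does (DETR's actual implementation also handles ONNX tracing and other details):

import torch

def nested_tensor_from_tensor_list_sketch(tensor_list):
    # tensor_list: list of CxHxW image tensors with different H and W
    max_h = max(img.shape[1] for img in tensor_list)
    max_w = max(img.shape[2] for img in tensor_list)
    b, c = len(tensor_list), tensor_list[0].shape[0]

    tensor = torch.zeros((b, c, max_h, max_w), dtype=tensor_list[0].dtype)
    mask = torch.ones((b, max_h, max_w), dtype=torch.bool)   # True = padded position
    for img, pad_img, m in zip(tensor_list, tensor, mask):
        h, w = img.shape[1], img.shape[2]
        pad_img[:, :h, :w].copy_(img)   # paste the image into the top-left corner
        m[:h, :w] = False               # real pixels are marked False in the mask
    return NestedTensor(tensor, mask)

The mask (True at padded pixels) is what later stages, such as the transformer attention in DETR, use to ignore the padding.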
Combining this with the collate_fn above, batch[0] is a NestedTensor instance and batch[1] is the annotation information target, a tuple with one dict per image. Note that by this point DETR's transforms have already converted the boxes to normalized (cx, cy, w, h) coordinates, which is why the values below lie in [0, 1]. With batch_size = 2 it looks like this:
(
{
'boxes':
tensor([[0.6278, 0.6666, 0.2496, 0.1161],
[0.4574, 0.4170, 0.3599, 0.7220],
[0.7177, 0.4324, 0.4917, 0.2580],
[0.5541, 0.6099, 0.6642, 0.2646],
[0.3204, 0.6822, 0.1735, 0.2127],
[0.6150, 0.6643, 0.0393, 0.1714],
[0.4473, 0.3286, 0.0334, 0.0536],
[0.5382, 0.4616, 0.0716, 0.0671]]),
'labels': tensor([18, 1, 1, 15, 27, 44, 84, 27]),
'image_id': tensor([151988]),
'area': tensor([10100.7021, 51015.5039, 29404.1680, 30322.9062, 13935.7568, 2856.5737, 512.4734, 1741.8429]),
'iscrowd': tensor([0, 0, 0, 0, 0, 0, 0, 0]),
'orig_size': tensor([480, 640]),
'size': tensor([608, 810])
},
{
'boxes': tensor([[0.2746, 0.5719, 0.5491, 0.4382]]),
'labels': tensor([6]),
'image_id': tensor([116848]),
'area': tensor([82879.5156]),
'iscrowd': tensor([0]),
'orig_size': tensor([421, 640]),
'size': tensor([512, 778])
}
)
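Finally, a sketch of how the pieces fit together in a DataLoader (assuming dataset is the CocoDetection instance built with DETR's transforms, so that the images are already tensors):

from torch.utils.data import DataLoader

# batch_size=2 reproduces the example above; collate_fn is the function defined earlier
data_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)

samples, targets = next(iter(data_loader))
imgs, mask = samples.decompose()    # samples is a NestedTensor
print(imgs.shape, mask.shape)       # e.g. torch.Size([2, 3, H, W]) and torch.Size([2, H, W])
print(len(targets))                 # 2 target dicts, one per image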
At this point, how the dataset is read and what the loaded results look like should be roughly clear.