model (str, optional) – Path to the config file or the model name
defined in metafile. For example, it could be
“rtmdet-s”, “rtmdet_s_8xb32-300e_coco” or
“configs/rtmdet/rtmdet_s_8xb32-300e_coco.py”.
If model is not specified, the user must provide
weights saved by MMEngine that contain the config string.
Defaults to None.
weights (str, optional) – Path to the checkpoint. If it is not specified
and model is a model name of metafile, the weights will be loaded
from metafile. Defaults to None.
device (str, optional) – Device to run inference. If None, the available
device will be automatically used. Defaults to None.
scope (str, optional) – The scope of the model. Defaults to mmdet.
palette (str) – Color palette used for visualization. The order of
priority is palette -> config -> checkpoint. Defaults to ‘none’.
show_progress (bool) – Control whether to display the progress
bar during the inference process. Defaults to True.
forward() and processed in postprocess().
If return_datasamples=False, it usually should be a
json-serializable dict containing only basic data elements such
as strings and numbers.
Customize your preprocess by overriding this method. Preprocess should
return an iterable object, of which each item will be used as the
input of model.test_step.
BaseInferencer.preprocess will return an iterable chunked data,
which will be used in __call__ like this:
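A minimal, self-contained sketch of this pattern is given below. ToyInferencer and its method bodies are hypothetical stand-ins written for illustration; they are not the MMEngine implementation.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


class ToyInferencer:
    """Hypothetical stand-in that only illustrates the call pattern."""

    def preprocess(self, inputs: Iterable[Any],
                   batch_size: int = 1) -> Iterator[List[Any]]:
        # Yield the inputs in chunks of at most ``batch_size`` items.
        it = iter(inputs)
        while True:
            chunk = list(islice(it, batch_size))
            if not chunk:
                return
            yield chunk

    def forward(self, batch: List[Any]) -> List[Any]:
        # Placeholder for the real forward pass (e.g. model.test_step).
        return [f"pred({item})" for item in batch]

    def __call__(self, inputs: Iterable[Any],
                 batch_size: int = 1) -> List[Any]:
        preds: List[Any] = []
        # Each chunk produced by preprocess is fed to forward in turn.
        for batch in self.preprocess(inputs, batch_size):
            preds.extend(self.forward(batch))
        return preds


results = ToyInferencer()(["a.jpg", "b.jpg", "c.jpg"], batch_size=2)
```

Returning a generator from preprocess keeps memory usage bounded, since only one chunk is materialized at a time.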
config (str, Path, or mmengine.Config) – Config file path,
Path, or the config object.
checkpoint (str, optional) – Checkpoint path. If left as None, the model
will not load any weights.
palette (str) – Color palette used for visualization. If palette
is stored in checkpoint, use checkpoint’s palette first, otherwise
use externally passed palette. Currently, supports ‘coco’, ‘voc’,
‘citys’ and ‘random’. Defaults to none.
device (str) – The device on which the model will be placed.
Defaults to cuda:0.
cfg_options (dict, optional) – Options to override some settings in
the used config.
In segmentation map annotation for ADE20K, 0 stands for background, which
is not included in 150 categories. The img_suffix is fixed to ‘.jpg’,
and seg_map_suffix is fixed to ‘.png’.
Load annotation file and set BaseDataset._fully_initialized to
True.
If lazy_init=False, full_init will be called during the
instantiation and self._fully_initialized will be set to True. If
obj._fully_initialized=False, the class method decorated by
force_full_init will call full_init automatically.
Several steps to initialize annotation:
load_data_list: Load annotations from annotation file.
load_proposals: Load proposals from proposal file, if
self.proposal_file is not None.
filter data information: Filter annotations according to
filter_cfg.
slice_data: Slice dataset according to self._indices
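The lazy initialization behavior described above can be sketched as follows. This is a simplified re-implementation under stated assumptions: ToyDataset is hypothetical, and the force_full_init decorator here only mimics the documented behavior rather than reproducing the library code.

```python
import functools


def force_full_init(method):
    """Run full_init() before the wrapped method if it has not run yet."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if not self._fully_initialized:
            self.full_init()
        return method(self, *args, **kwargs)
    return wrapper


class ToyDataset:
    """Hypothetical dataset that only holds a list of records."""

    def __init__(self, records, lazy_init=False):
        self._records = records
        self.data_list = []
        self._fully_initialized = False
        if not lazy_init:
            self.full_init()

    def full_init(self):
        if self._fully_initialized:
            return
        # Stand-in for: load_data_list, load_proposals,
        # filter data information, slice_data.
        self.data_list = list(self._records)
        self._fully_initialized = True

    @force_full_init
    def __len__(self):
        return len(self.data_list)


lazy = ToyDataset(["a", "b"], lazy_init=True)
was_initialized = lazy._fully_initialized  # nothing loaded yet
n = len(lazy)                              # triggers full_init()
```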
The proposals_list should be a dict[img_path: proposals]
with the same length as data_list. The proposals should be
a dict or InstanceData, which usually contains the following keys:
bboxes (np.ndarray): Has a shape (num_instances, 4),
with the last dimension arranged as (x1, y1, x2, y2).
scores (np.ndarray): Classification scores, has a shape
(num_instances, ).
The img/gt_semantic_seg pair of BaseSegDataset should have the same name
except for the suffix. A valid img/gt_semantic_seg filename pair should be like
xxx{img_suffix} and xxx{seg_map_suffix} (extension is also included
in the suffix). If split is given, then xxx is specified in txt file.
Otherwise, all files in img_dir/ and ann_dir will be loaded.
Please refer to docs/en/tutorials/new_dataset.md for more details.
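The filename pairing rule can be illustrated with a small sketch. The helper pair_img_and_seg_map and the example filenames are hypothetical; the real dataset class scans directories rather than taking lists of names.

```python
def pair_img_and_seg_map(img_names, seg_names,
                         img_suffix=".jpg", seg_map_suffix=".png"):
    """Pair files that share the same stem once the suffix is stripped."""

    def stem(name, suffix):
        # The suffix includes the extension, so strip all of it.
        return name[: len(name) - len(suffix)]

    seg_stems = {stem(n, seg_map_suffix)
                 for n in seg_names if n.endswith(seg_map_suffix)}
    pairs = []
    for name in sorted(img_names):
        if not name.endswith(img_suffix):
            continue
        xxx = stem(name, img_suffix)
        if xxx in seg_stems:
            pairs.append((xxx + img_suffix, xxx + seg_map_suffix))
    return pairs


pairs = pair_img_and_seg_map(
    ["a_leftImg8bit.png", "b_leftImg8bit.png", "notes.txt"],
    ["a_gtFine_labelTrainIds.png"],
    img_suffix="_leftImg8bit.png",
    seg_map_suffix="_gtFine_labelTrainIds.png",
)
```

Here only the image with stem "a" has a matching segmentation map, so it is the only pair returned.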
Parameters:
ann_file (str) – Annotation file path. Defaults to ‘’.
metainfo (dict, optional) – Meta information for dataset, such as
specify classes to load. Defaults to None.
data_root (str, optional) – The root directory for data_prefix and
ann_file. Defaults to None.
data_prefix (dict, optional) – Prefix for training data. Defaults to
dict(img_path=None, seg_map_path=None).
img_suffix (str) – Suffix of images. Default: ‘.jpg’
seg_map_suffix (str) – Suffix of segmentation maps. Default: ‘.png’
filter_cfg (dict, optional) – Config for filter data. Defaults to None.
indices (int or Sequence[int], optional) – Support using first few
data in annotation file to facilitate training/testing on a smaller
dataset. Defaults to None which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using
serialized objects, when enabled, data loader workers can use
shared RAM from master process instead of making a copy. Defaults
to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase.
Defaults to False.
lazy_init (bool, optional) – Whether to defer loading annotations
instead of loading them during instantiation. In some cases, such as
visualization, only the meta information of the dataset is needed, so
loading the annotation file is unnecessary. BaseDataset can skip
loading annotations to save time when lazy_init=True. Defaults to False.
use_label_map (bool, optional) – Whether to use label map.
Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles
used to get a valid image if BaseDataset.prepare_data gets a
None img. Defaults to 1000.
The label_map is a dictionary whose keys are the old label ids and
whose values are the new label ids. It is used for changing pixel
labels in load_annotations. label_map is not None if and only if the
old classes in cls.METAINFO are not equal to the new classes in
self._metainfo and neither of them is None.
Parameters:
new_classes (list, tuple, optional) – The new class names from
metainfo. Defaults to None.
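A minimal sketch of how such a label_map could be built is shown below. The function signature and the choice of 255 as the id assigned to dropped classes are assumptions made for illustration.

```python
def get_label_map(old_classes, new_classes, ignore_index=255):
    """Map old label ids to new ones; None means no remapping is needed."""
    if old_classes is None or new_classes is None:
        return None
    if list(old_classes) == list(new_classes):
        return None
    new_classes = list(new_classes)
    label_map = {}
    for old_id, name in enumerate(old_classes):
        if name in new_classes:
            label_map[old_id] = new_classes.index(name)
        else:
            # Assumption: classes that are dropped are marked as ignored.
            label_map[old_id] = ignore_index
    return label_map


label_map = get_label_map(("road", "car", "sky"), ("sky", "road"))
```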
Get data processed by self.pipeline. Note that idx is a
video index by default since the base element of a video dataset is a
video. However, in some cases, we need to specify both the video index
and frame index. For example, in training mode, we may want to sample
specific frames and all the frames must be sampled once in an epoch; in
test mode, we may want to output data of a single image rather than the
whole video to save memory.
seed (int, optional) – random seed used to shuffle the sampler.
This number should be identical across all
processes in the distributed group. Defaults to None.
num_sample_class (int) – The number of samples taken from each
per-label list. Defaults to 1.
When shuffle=True, this ensures all replicas use a different
random ordering for each epoch. Otherwise, the next iteration of this
sampler will yield the same ordering.
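The effect of the epoch on the ordering can be sketched with a toy sampler. ToyShuffleSampler is hypothetical, and deriving the generator seed as seed + epoch is an assumption chosen for the example.

```python
import random


class ToyShuffleSampler:
    """Hypothetical sampler whose shuffle depends on seed and epoch."""

    def __init__(self, size, seed=0, shuffle=True):
        self.size = size
        self.seed = seed
        self.shuffle = shuffle
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def __iter__(self):
        indices = list(range(self.size))
        if self.shuffle:
            # Same seed and epoch on every process gives the same order.
            random.Random(self.seed + self.epoch).shuffle(indices)
        return iter(indices)


sampler = ToyShuffleSampler(size=100, seed=0)
first = list(sampler)
repeat = list(sampler)      # set_epoch not called: same ordering
sampler.set_epoch(1)
second = list(sampler)      # new epoch: reshuffled
```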
ann_file (str) – Annotation file path. Defaults to ‘’.
metainfo (dict, optional) – Meta information for dataset, such as class
information. Defaults to None.
data_root (str, optional) – The root directory for data_prefix and
ann_file. Defaults to None.
data_prefix (dict, optional) – Prefix for training data. Defaults to
dict(img=None, ann=None, seg=None). The prefix seg, which is
for the panoptic segmentation map, must not be None.
filter_cfg (dict, optional) – Config for filter data. Defaults to None.
indices (int or Sequence[int], optional) – Support using first few
data in annotation file to facilitate training/testing on a smaller
dataset. Defaults to None which means using all data_infos.
serialize_data (bool, optional) – Whether to hold memory using
serialized objects, when enabled, data loader workers can use
shared RAM from master process instead of making a copy. Defaults
to True.
pipeline (list, optional) – Processing pipeline. Defaults to [].
test_mode (bool, optional) – test_mode=True means in test phase.
Defaults to False.
lazy_init (bool, optional) – Whether to defer loading annotations
instead of loading them during instantiation. In some cases, such as
visualization, only the meta information of the dataset is needed, so
loading the annotation file is unnecessary. BaseDataset can skip
loading annotations to save time when lazy_init=True. Defaults to False.
max_refetch (int, optional) – The maximum number of extra cycles
used to get a valid image if BaseDataset.prepare_data gets a
None img. Defaults to 1000.
Same as torch.utils.data.dataset.ConcatDataset, but with support for
lazy_init and get_dataset_source.
Note
ConcatDataset should not inherit from BaseDataset since
get_subset and get_subset_ could produce an ambiguous
sub-dataset that conflicts with the original dataset. If you want to use
a sub-dataset of ConcatDataset, you should set the indices
argument for the wrapped datasets, which inherit from BaseDataset.
Parameters:
datasets (Sequence[BaseDataset] or Sequence[dict]) – A list of datasets
which will be concatenated.
lazy_init (bool, optional) – Whether to load annotation during
instantiation. Defaults to False.
ignore_keys (List[str] or str) – Ignore the keys that can be
unequal in dataset.metainfo. Defaults to None.
New in version 0.3.0.
data_root (str) – The root directory for
data_prefix and ann_file.
ann_file (str) – Annotation file path.
extra_ann_file (str, optional) – The path of extra image metas
for CrowdHuman. It can be created by CrowdHumanDataset
automatically or by tools/misc/get_crowdhuman_id_hw.py
manually. Defaults to None.
Load annotations from the annotation file named self.ann_file.
If the annotation file does not follow the OpenMMLab 2.0 dataset format,
the subclass must override this method to load annotations. The meta
information in the annotation file will be overwritten by METAINFO
and the metainfo argument of the constructor.
Parse raw annotation to target format. The difference between this
function and the one in BaseVideoDataset is that the parsing here
adds visibility and mot_conf.
Parameters:
raw_data_info (dict) – Raw data information loaded from ann_file.
Suitable for training with data augmentations that mix multiple images,
such as mosaic and mixup. For the augmentation pipeline of mixed image data,
the get_indexes method needs to be provided to obtain the image
indexes, and you can set skip_flags to change how the pipeline runs.
The dynamic_scale parameter is also provided to dynamically change
the output image size.
Parameters:
dataset (CustomDataset) – The dataset to be mixed.
pipeline (Sequence[dict]) – Sequence of transform object or
config dict to be composed.
dynamic_scale (tuple[int], optional) – The image scale can be changed
dynamically. Defaults to None. It is deprecated.
skip_type_keys (list[str], optional) – Sequence of type strings of the
transforms to be skipped in the pipeline. Defaults to None.
max_refetch (int) – The maximum number of retry iterations for getting
valid results from the pipeline. If the number of iterations is
greater than max_refetch, but results is still None, then the
iteration is terminated and an error is raised. Defaults to 15.
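The retry behavior controlled by max_refetch can be sketched as below. The helper fetch_valid, its arguments, and the exact attempt count are assumptions for illustration; only the idea of retrying until the pipeline returns a non-None result comes from the description above.

```python
from itertools import islice


def fetch_valid(run_pipeline, index_stream, max_refetch=15):
    """Try at most ``max_refetch`` indices until the pipeline succeeds."""
    for idx in islice(index_stream, max_refetch):
        result = run_pipeline(idx)
        if result is not None:
            return result
    raise RuntimeError(
        "The pipeline kept returning None; check the data and the "
        "transforms configuration.")


# Hypothetical pipeline that rejects the first three samples.
result = fetch_valid(lambda i: None if i < 3 else {"idx": i},
                     iter(range(10)))
```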
Sampler that provides image-level sampling outputs for video datasets
in tracking tasks. It can be used in both distributed and
non-distributed environments.
If using the default sampler in PyTorch, the subsequent data receiver will
get one video, which is not desired in some cases:
(Take a non-distributed environment as an example)
1. In test mode, we want only one image to be fed into the data pipeline. This
is in consideration of memory usage since feeding the whole video commonly
requires a large amount of memory (>=20G on MOTChallenge17 dataset), which
is not available on some machines.
2. In training mode, we may want to make sure all the images in one video
are randomly sampled once in one epoch, and this cannot be guaranteed by
the default sampler in PyTorch.
Parameters:
dataset (Sized) – Dataset used for sampling.
seed (int, optional) – random seed used to shuffle the sampler. This
number should be identical across all processes in the distributed
group. Defaults to None.
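The idea of image-level sampling can be sketched by flattening videos into (video index, frame index) pairs. The function below is a hypothetical simplification for a non-distributed environment; splitting the pairs across ranks is omitted.

```python
import random


def image_level_indices(video_lengths, test_mode, seed=0):
    """Return one (video_idx, frame_idx) pair per image."""
    pairs = [(video_idx, frame_idx)
             for video_idx, num_frames in enumerate(video_lengths)
             for frame_idx in range(num_frames)]
    if not test_mode:
        # Training: every image is visited exactly once, in random order.
        random.Random(seed).shuffle(pairs)
    return pairs


test_pairs = image_level_indices([2, 3], test_mode=True)
train_pairs = image_level_indices([2, 3], test_mode=False)
```

In test mode the frames keep their temporal order, so one image at a time can be fed to the pipeline.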
The default data sampler for both distributed and non-distributed
environments.
It differs from the PyTorch DistributedSampler in the following ways:
1. This sampler supports non-distributed environments.
2. The round-up behaviors are slightly different.
If round_up=True, this sampler will add extra samples to make the
number of samples evenly divisible by the world size. This
behavior is the same as the DistributedSampler with
drop_last=False.
If round_up=False, this sampler won’t remove or add any samples,
while the DistributedSampler with drop_last=True will remove
tail samples.
Parameters:
dataset (Sized) – The dataset.
dataset_ratio (Sequence(int))
seed (int, optional) – Random seed used to shuffle the sampler if
shuffle=True. This number should be identical across all
processes in the distributed group. Defaults to None.
round_up (bool) – Whether to add extra samples to make the number of
samples evenly divisible by the world size. Defaults to True.
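The two round_up behaviors can be compared with a short sketch. rank_indices is a hypothetical helper that assumes a non-empty dataset and no shuffling; it illustrates the described behavior and is not the sampler's actual code.

```python
import math


def rank_indices(num_items, rank, world_size, round_up=True):
    """Indices seen by one rank, with or without rounding up."""
    indices = list(range(num_items))
    if round_up:
        num_samples = math.ceil(num_items / world_size)
        total_size = num_samples * world_size
        # Repeat indices from the start until total_size is reached.
        repeats = math.ceil(total_size / num_items)
        indices = (indices * repeats)[:total_size]
    else:
        total_size = num_items
    # Each rank takes every world_size-th index starting at its rank.
    return indices[rank:total_size:world_size]


padded = [rank_indices(5, r, 2, round_up=True) for r in range(2)]
unpadded = [rank_indices(5, r, 2, round_up=False) for r in range(2)]
```

With five samples and two processes, rounding up gives both ranks three samples by reusing index 0, while round_up=False leaves the second rank with two.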
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as input and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
policies (List[List[Union[dict, ConfigDict]]]) – The policies of auto augmentation. Each policy in policies
is a specific augmentation policy, and is composed of several
augmentations. When AutoAugment is called, a random policy in
policies will be selected to augment images.
Defaults to policy_v0().
prob (list[float], optional) – The probabilities associated
with each policy. The length should be equal to the policy
number and the sum should be 1. If not given, a uniform
distribution will be assumed. Defaults to None.
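How prob could drive the policy selection is sketched below. pick_policy is a hypothetical helper, and the two example policies are made up for the illustration; only the constraints on prob come from the parameter description.

```python
import random


def pick_policy(policies, prob=None, rng=random):
    """Select one policy, uniformly unless ``prob`` is given."""
    if prob is None:
        prob = [1.0 / len(policies)] * len(policies)
    if len(prob) != len(policies):
        raise ValueError("prob must have one entry per policy")
    if abs(sum(prob) - 1.0) > 1e-6:
        raise ValueError("prob must sum to 1")
    index = rng.choices(range(len(policies)), weights=prob, k=1)[0]
    return policies[index]


# Hypothetical policies, each a list of augmentation configs.
policies = [
    [dict(type="ShearX", prob=0.5)],
    [dict(type="Color", prob=0.5), dict(type="Rotate", prob=0.5)],
]
chosen = pick_policy(policies, prob=[0.0, 1.0])
```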
Adjust the brightness of the image. A magnitude=0 gives a black image,
whereas magnitude=1 gives the original image. The bboxes, masks and
segmentations are not modified.
Required Keys:
img
Modified Keys:
img
Parameters:
prob (float) – The probability for performing Brightness transformation.
Defaults to 1.0.
level (int, optional) – Should be in range [0, _MAX_LEVEL].
If level is None, it will be randomly generated from [0, _MAX_LEVEL].
Defaults to None.
min_mag (float) – The minimum magnitude for Brightness transformation.
Defaults to 0.1.
max_mag (float) – The maximum magnitude for Brightness transformation.
Defaults to 1.9.
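The relation between level, min_mag, max_mag and the resulting image can be sketched as follows. The value _MAX_LEVEL = 10, the linear mapping, and the pixel-wise scaling are assumptions made for this illustration.

```python
_MAX_LEVEL = 10  # assumption for this sketch


def level_to_mag(level, min_mag=0.1, max_mag=1.9):
    """Linearly map a level in [0, _MAX_LEVEL] to [min_mag, max_mag]."""
    if not 0 <= level <= _MAX_LEVEL:
        raise ValueError("level out of range")
    return level / _MAX_LEVEL * (max_mag - min_mag) + min_mag


def adjust_brightness(pixels, magnitude):
    """Scale pixel values: 0 gives black, 1 keeps the original."""
    return [min(255, max(0, round(p * magnitude))) for p in pixels]


black = adjust_brightness([0, 100, 200], 0)
same = adjust_brightness([0, 100, 200], 1)
brighter = adjust_brightness([0, 100, 200], 1.5)
```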
         mixup transform
+------------------------------+
| mixup image   |              |
|      +--------|--------+     |
|      |        |        |     |
|---------------+        |     |
|      |                 |     |
|      |      image      |     |
|      |                 |     |
|      |                 |     |
|      |-----------------+     |
|             pad              |
+------------------------------+
The cached mixup transform steps are as follows:
1. Append the results from the last transform into the cache.
2. Another random image is picked from the cache and embedded in
the top left patch (after padding and resizing).
3. The target of mixup transform is the weighted average of mixup
image and origin image.
Required Keys:
img
gt_bboxes (np.float32) (optional)
gt_bboxes_labels (np.int64) (optional)
gt_ignore_flags (bool) (optional)
mix_results (List[dict])
Modified Keys:
img
img_shape
gt_bboxes (optional)
gt_bboxes_labels (optional)
gt_ignore_flags (optional)
Parameters:
img_scale (Sequence[int]) – Image output size after mixup pipeline.
The shape order should be (width, height). Defaults to (640, 640).
ratio_range (Sequence[float]) – Scale ratio of mixup image.
Defaults to (0.5, 1.5).
flip_ratio (float) – Horizontal flip ratio of mixup image.
Defaults to 0.5.
pad_val (int) – Pad value. Defaults to 114.
max_iters (int) – The maximum number of iterations. If the number of
iterations is greater than max_iters, but gt_bbox is still
empty, then the iteration is terminated. Defaults to 15.
bbox_clip_border (bool, optional) – Whether to clip the objects outside
the border of the image. In some dataset like MOT17, the gt bboxes
are allowed to cross the border of images. Therefore, we don’t
need to clip the gt bboxes in these cases. Defaults to True.
max_cached_images (int) – The maximum length of the cache. The larger
the cache, the stronger the randomness of this transform. As a
rule of thumb, providing 10 caches for each image suffices for
randomness. Defaults to 20.
random_pop (bool) – Whether to randomly pop a result from the cache
when the cache is full. If set to False, use FIFO popping method.
Defaults to True.
prob (float) – Probability of applying this transformation.
Defaults to 1.0.
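The interplay of max_cached_images and random_pop can be sketched with a small cache class. ResultCache is a hypothetical stand-in; it only shows the eviction rule described in the parameters above.

```python
import random


class ResultCache:
    """Bounded cache with random or FIFO eviction."""

    def __init__(self, max_cached_images=20, random_pop=True, seed=None):
        self.max_cached_images = max_cached_images
        self.random_pop = random_pop
        self.items = []
        self._rng = random.Random(seed)

    def add(self, result):
        self.items.append(result)
        if len(self.items) > self.max_cached_images:
            if self.random_pop:
                index = self._rng.randint(0, len(self.items) - 1)
            else:
                index = 0  # FIFO: drop the oldest entry
            self.items.pop(index)


fifo = ResultCache(max_cached_images=3, random_pop=False)
rand = ResultCache(max_cached_images=3, random_pop=True, seed=0)
for i in range(5):
    fifo.add(i)
    rand.add(i)
```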
Cached mosaic transform will randomly select images from the cache
and combine them into one output image.
                   mosaic transform
                      center_x
           +------------------------------+
           |       pad        |  pad      |
           |      +-----------+           |
           |      |           |           |
           |      |  image1   |--------+  |
           |      |           |        |  |
           |      |           | image2 |  |
center_y   |----+-------------+-----------|
           |    |   cropped   |           |
           |pad |   image3    |  image4   |
           |    |             |           |
           +----|-------------+-----------+
                |             |
                +-------------+
The cached mosaic transform steps are as follows:
1. Append the results from the last transform into the cache.
2. Choose the mosaic center as the intersection of the 4 images.
3. Get the top left image according to the index, and randomly
sample another 3 images from the result cache.
4. A sub-image will be cropped if it is larger than the mosaic patch.
Required Keys:
img
gt_bboxes (np.float32) (optional)
gt_bboxes_labels (np.int64) (optional)
gt_ignore_flags (bool) (optional)
Modified Keys:
img
img_shape
gt_bboxes (optional)
gt_bboxes_labels (optional)
gt_ignore_flags (optional)
Parameters:
img_scale (Sequence[int]) – Image size before mosaic pipeline of single
image. The shape order should be (width, height).
Defaults to (640, 640).
center_ratio_range (Sequence[float]) – Center ratio range of mosaic
output. Defaults to (0.5, 1.5).
bbox_clip_border (bool, optional) – Whether to clip the objects outside
the border of the image. In some dataset like MOT17, the gt bboxes
are allowed to cross the border of images. Therefore, we don’t
need to clip the gt bboxes in these cases. Defaults to True.
pad_val (int) – Pad value. Defaults to 114.
prob (float) – Probability of applying this transformation.
Defaults to 1.0.
max_cached_images (int) – The maximum length of the cache. The larger
the cache, the stronger the randomness of this transform. As a
rule of thumb, providing 10 caches for each image suffices for
randomness. Defaults to 40.
random_pop (bool) – Whether to randomly pop a result from the cache
when the cache is full. If set to False, use FIFO popping method.
Defaults to True.
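How center_ratio_range and img_scale could determine the mosaic center is sketched below. The helper name and the assumption that the output canvas is twice img_scale in each dimension are illustrative.

```python
import random


def sample_mosaic_center(img_scale=(640, 640),
                         center_ratio_range=(0.5, 1.5), rng=None):
    """Sample (center_x, center_y) on a canvas of size 2 * img_scale."""
    rng = rng or random.Random()
    width, height = img_scale  # order is (width, height)
    center_x = int(rng.uniform(*center_ratio_range) * width)
    center_y = int(rng.uniform(*center_ratio_range) * height)
    return center_x, center_y


rng = random.Random(0)
centers = [sample_mosaic_center(rng=rng) for _ in range(100)]
```

With the defaults, both coordinates stay between 320 and 960, so the center never falls in the outer quarter of the assumed 1280 x 1280 canvas.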
Adjust the color balance of the image, in a manner similar to the
controls on a colour TV set. A magnitude=0 gives a black & white image,
whereas magnitude=1 gives the original image. The bboxes, masks and
segmentations are not modified.
Required Keys:
img
Modified Keys:
img
Parameters:
prob (float) – The probability for performing Color transformation.
Defaults to 1.0.
level (int, optional) – Should be in range [0, _MAX_LEVEL].
If level is None, it will be randomly generated from [0, _MAX_LEVEL].
Defaults to None.
min_mag (float) – The minimum magnitude for Color transformation.
Defaults to 0.1.
max_mag (float) – The maximum magnitude for Color transformation.
Defaults to 1.9.
Base class for color transformations. All color transformations need to
inherit from this base class. ColorTransform unifies the class
attributes and class functions of color transformations (Color, Brightness,
Contrast, Sharpness, Solarize, SolarizeAdd, Equalize, AutoContrast, Invert,
and Posterize), and only distort color channels, without impacting the
locations of the instances.
Required Keys:
img
Modified Keys:
img
Parameters:
prob (float) – The probability for performing the color
transformation and should be in range [0, 1]. Defaults to 1.0.
level (int, optional) – The level should be in range [0, _MAX_LEVEL].
If level is None, it will be randomly generated from [0, _MAX_LEVEL].
Defaults to None.
min_mag (float) – The minimum magnitude for color transformation.
Defaults to 0.1.
max_mag (float) – The maximum magnitude for color transformation.
Defaults to 1.9.
Control the contrast of the image. A magnitude=0 gives a gray image,
whereas magnitude=1 gives the original image. The bboxes, masks and
segmentations are not modified.
Required Keys:
img
Modified Keys:
img
Parameters:
prob (float) – The probability for performing Contrast transformation.
Defaults to 1.0.
level (int, optional) – Should be in range [0, _MAX_LEVEL].
If level is None, it will be randomly generated from [0, _MAX_LEVEL].
Defaults to None.
min_mag (float) – The minimum magnitude for Contrast transformation.
Defaults to 0.1.
max_mag (float) – The maximum magnitude for Contrast transformation.
Defaults to 1.9.
Simple Copy-Paste is a Strong Data Augmentation Method for Instance
Segmentation. The simple copy-paste transform steps are as follows:
1. The destination image is already resized with aspect ratio kept,
cropped and padded.
2. Randomly select a source image, which is also already resized
with aspect ratio kept, cropped and padded in a similar way
as the destination image.
3. Randomly select some objects from the source image.
4. Paste these source objects to the destination image directly,
since the source and destination images have the same size.
5. Update object masks of the destination image, since some original
objects may be occluded.
6. Generate bboxes from the updated destination masks,
filter out objects which are totally occluded, and adjust bboxes
which are partly occluded.
7. Append selected source bboxes, masks, and labels.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_bboxes_labels (np.int64) (optional)
gt_ignore_flags (bool) (optional)
gt_masks (BitmapMasks) (optional)
Modified Keys:
img
gt_bboxes (optional)
gt_bboxes_labels (optional)
gt_ignore_flags (optional)
gt_masks (optional)
Parameters:
max_num_pasted (int) – The maximum number of pasted objects.
Defaults to 100.
bbox_occluded_thr (int) – The threshold of occluded bbox.
Defaults to 10.
mask_occluded_thr (int) – The threshold of occluded mask.
Defaults to 300.
selected (bool) – Whether to select objects or not. If selected is False,
all objects of the source image will be pasted to the
destination image.
Defaults to True.
paste_by_box (bool) – Whether to use boxes as masks when masks are not
available.
Defaults to False.
Randomly drop some regions of image used in
Cutout.
Required Keys:
img
Modified Keys:
img
Parameters:
n_holes (int or tuple[int, int]) – Number of regions to be dropped.
If it is given as a tuple, the number of holes will be randomly
selected from the closed interval [n_holes[0], n_holes[1]].
cutout_shape (tuple[int, int] or list[tuple[int, int]], optional) – The candidate shape of dropped regions. It can be
tuple[int,int] to use a fixed cutout shape, or
list[tuple[int,int]] to randomly choose shape
from the list. Defaults to None.
cutout_ratio (tuple[float, float] or list[tuple[float, float]], optional) – The candidate ratio of dropped regions. It can be
tuple[float,float] to use a fixed ratio or
list[tuple[float,float]] to randomly choose ratio
from the list. Please note that cutout_shape and
cutout_ratio cannot be both given at the same time.
Defaults to None.
fill_in (tuple[float, float, float] or tuple[int, int, int]) – The value
of pixel to fill in the dropped regions. Defaults to (0, 0, 0).
This transform resizes the input image according to width and
height. Bboxes, masks, and seg map are then resized
with the same parameters.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
img_shape
gt_bboxes
gt_masks
gt_seg_map
Added Keys:
scale
scale_factor
keep_ratio
homography_matrix
Parameters:
width (int) – width for resizing.
height (int) – height for resizing.
Defaults to None.
pad_val (Number | dict[str, Number], optional) –
Padding value if
the pad_mode is “constant”. If it is a single number, the value
to pad the image is the number and to pad the semantic
segmentation map is 255. If it is a dict, it should have the
following keys:
img: The value to pad the image.
seg: The value to pad the semantic segmentation map.
Defaults to dict(img=0, seg=255).
keep_ratio (bool) – Whether to keep the aspect ratio when resizing the
image. Defaults to False.
clip_object_border (bool) – Whether to clip the objects
outside the border of the image. In some dataset like MOT17, the gt
bboxes are allowed to cross the border of images. Therefore, we
don’t need to clip the gt bboxes in these cases. Defaults to True.
backend (str) – Image resize backend, choices are ‘cv2’ and ‘pillow’.
These two backends generate slightly different results. Defaults
to ‘cv2’.
interpolation (str) – Interpolation method, accepted values are
“nearest”, “bilinear”, “bicubic”, “area”, “lanczos” for ‘cv2’
backend, “nearest”, “bilinear” for ‘pillow’ backend. Defaults
to ‘bilinear’.
Transform function to resize images, bounding boxes, semantic
segmentation map and keypoints.
Parameters:
results (dict) – Result dict from loading pipeline.
Returns:
Resized results. The ‘img’, ‘gt_bboxes’, ‘gt_seg_map’,
‘gt_keypoints’, ‘scale’, ‘scale_factor’, ‘img_shape’,
and ‘keep_ratio’ keys are updated in the result dict.
Base class for geometric transformations. All geometric transformations
need to inherit from this base class. GeomTransform unifies the class
attributes and class functions of geometric transformations (ShearX,
ShearY, Rotate, TranslateX, and TranslateY), and records the homography
matrix.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
gt_bboxes
gt_masks
gt_seg_map
Added Keys:
homography_matrix
Parameters:
prob (float) – The probability for performing the geometric
transformation and should be in range [0, 1]. Defaults to 1.0.
level (int, optional) – The level should be in range [0, _MAX_LEVEL].
If level is None, it will be randomly generated from [0, _MAX_LEVEL].
Defaults to None.
min_mag (float) – The minimum magnitude for geometric transformation.
Defaults to 0.0.
max_mag (float) – The maximum magnitude for geometric transformation.
Defaults to 1.0.
reversal_prob (float) – The probability that reverses the geometric
transformation magnitude. Should be in range [0,1].
Defaults to 0.5.
img_border_value (int | float | tuple) – The filled values for
image border. If float, the same fill value will be used for
all the three channels of image. If tuple, it should be 3 elements.
Defaults to 128.
mask_border_value (int) – The fill value used for masks. Defaults to 0.
seg_ignore_label (int) – The fill value used for segmentation map.
Note this value must equal ignore_label in semantic_head
of the corresponding config. Defaults to 255.
interpolation (str) – Interpolation method, accepted values are
“nearest”, “bilinear”, “bicubic”, “area”, “lanczos” for ‘cv2’
backend, “nearest”, “bilinear” for ‘pillow’ backend. Defaults
to ‘bilinear’.
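The interplay of level, min_mag, max_mag and reversal_prob can be sketched as follows. This is only an assumption-laden illustration: the value 10 for _MAX_LEVEL, the linear mapping from level to magnitude, and the helper names level_to_mag and signed_mag are not stated in the text above.

```python
import random

_MAX_LEVEL = 10  # assumed value; the text only names the constant


def level_to_mag(level, min_mag, max_mag):
    """Map a level in [0, _MAX_LEVEL] linearly onto [min_mag, max_mag]."""
    if level is None:
        # No level given: sample one from the allowed range.
        level = random.randint(0, _MAX_LEVEL)
    return level / _MAX_LEVEL * (max_mag - min_mag) + min_mag


def signed_mag(level, min_mag=0.0, max_mag=1.0, reversal_prob=0.5):
    """Reverse the sign of the magnitude with probability reversal_prob."""
    mag = level_to_mag(level, min_mag, max_mag)
    return -mag if random.random() < reversal_prob else mag


print(level_to_mag(5, 0.0, 1.0))
print(signed_mag(5))
```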
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
The dimension order of the input image is (H, W, C). The pipeline will convert
it to (C, H, W). If only 2 dimensions (H, W) are given, the output will be
(1, H, W).
Parameters:
keys (Sequence[str]) – Key of images to be converted to Tensor.
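The dimension reordering described above can be sketched with nested lists standing in for arrays. The helper hwc_to_chw is a hypothetical name used only for this example; the real transform operates on arrays and tensors.

```python
def hwc_to_chw(img):
    """Reorder a nested-list image from (H, W, C) to (C, H, W).

    A 2-D (H, W) input gets a leading channel axis, giving (1, H, W).
    """
    if not isinstance(img[0][0], (list, tuple)):
        # Only (H, W) was given: wrap it in a single channel.
        return [[list(row) for row in img]]
    channels = len(img[0][0])
    return [[[pixel[c] for pixel in row] for row in img] for c in range(channels)]


print(hwc_to_chw([[[1, 2, 3], [4, 5, 6]]]))
print(hwc_to_chw([[1, 2], [3, 4]]))
```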
Similar to LoadImageFromFile, but the image has already been loaded as
np.ndarray in results['img']. Can be used when loading images
from a webcam.
Required Keys:
img
Modified Keys:
img
img_path
img_shape
ori_shape
Parameters:
to_float32 (bool) – Whether to convert the loaded image to a float32
numpy array. If set to False, the loaded image is a uint8 array.
Defaults to False.
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
Load and process the instances and seg_map annotation provided
by dataset.
The annotation format is as follows:
{
    'instances': [
        {
            # List of 4 numbers representing the bounding box of the
            # instance, in (x1, y1, x2, y2) order.
            'bbox': [x1, y1, x2, y2],
            # Label of image classification.
            'bbox_label': 1,
            # Used in instance/panoptic segmentation. The segmentation mask
            # of the instance or the information of segments.
            # 1. If list[list[float]], it represents a list of polygons,
            # one for each connected component of the object. Each
            # list[float] is one simple polygon in the format of
            # [x1, y1, ..., xn, yn] (n >= 3). The Xs and Ys are absolute
            # coordinates in unit of pixels.
            # 2. If dict, it represents the per-pixel segmentation mask in
            # COCO's compressed RLE format. The dict should have keys
            # “size” and “counts”. Can be loaded by pycocotools
            'mask': list[list[float]] or dict,
        }
    ]
    # Filename of semantic or panoptic segmentation ground truth file.
    'seg_map_path': 'a/b/c'
}
After this module, the annotation has been changed to the format below:
{
    # In (x1, y1, x2, y2) order, float type. N is the number of bboxes
    # in an image
    'gt_bboxes': BaseBoxes(N, 4)
    # In int type.
    'gt_bboxes_labels': np.ndarray(N, )
    # In built-in class
    'gt_masks': PolygonMasks (H, W) or BitmapMasks (H, W)
    # In uint8 type.
    'gt_seg_map': np.ndarray (H, W)
    # in (x, y, v) order, float type.
}
Required Keys:
height
width
instances
bbox (optional)
bbox_label
mask (optional)
ignore_flag
seg_map_path (optional)
Added Keys:
gt_bboxes (BaseBoxes[torch.float32])
gt_bboxes_labels (np.int64)
gt_masks (BitmapMasks | PolygonMasks)
gt_seg_map (np.uint8)
gt_ignore_flags (bool)
Parameters:
with_bbox (bool) – Whether to parse and load the bbox annotation.
Defaults to True.
with_label (bool) – Whether to parse and load the label annotation.
Defaults to True.
with_mask (bool) – Whether to parse and load the mask annotation.
Default: False.
with_seg (bool) – Whether to parse and load the semantic segmentation
annotation. Defaults to False.
poly2mask (bool) – Whether to convert mask to bitmap. Default: True.
box_type (str) – The box type used to wrap the bboxes. If box_type
is None, gt_bboxes will keep being np.ndarray. Defaults to ‘hbox’.
reduce_zero_label (bool) – Whether to reduce all label values
by 1. Usually used for datasets where 0 is the background label.
Defaults to False.
ignore_index (int) – The label index to be ignored.
Valid only if reduce_zero_label is true. Defaults to 255.
imdecode_backend (str) – The image decoding backend type. The backend
argument for :func:mmcv.imfrombytes.
See :func:mmcv.imfrombytes for details.
Defaults to ‘cv2’.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend. Defaults to None.
with_bbox (bool) – Whether to load the pseudo bbox annotation.
Defaults to True.
with_label (bool) – Whether to load the pseudo label annotation.
Defaults to True.
with_mask (bool) – Whether to load the pseudo mask annotation.
Default: False.
with_seg (bool) – Whether to load the pseudo semantic segmentation
annotation. Defaults to False.
seg_ignore_label (int) – The fill value used for segmentation map.
Note this value must equal ignore_label in semantic_head
of the corresponding config. Defaults to 255.
Similar to LoadImageFromFile, but the image has already been loaded as
np.ndarray in results['img']. Can be used when loading images
from a webcam.
Required Keys:
img
Modified Keys:
img
img_path
img_shape
ori_shape
Parameters:
to_float32 (bool) – Whether to convert the loaded image to a float32
numpy array. If set to False, the loaded image is a uint8 array.
Defaults to False.
Load multi-channel images from a list of separate channel files.
Required Keys:
img_path
Modified Keys:
img
img_shape
ori_shape
Parameters:
to_float32 (bool) – Whether to convert the loaded image to a float32
numpy array. If set to False, the loaded image is a uint8 array.
Defaults to False.
color_type (str) – The flag argument for :func:mmcv.imfrombytes.
Defaults to ‘unchanged’.
imdecode_backend (str) – The image decoding backend type. The backend
argument for :func:mmcv.imfrombytes.
See :func:mmcv.imfrombytes for details.
Defaults to ‘cv2’.
file_client_args (dict) – Arguments to instantiate the
corresponding backend in mmdet <= 3.0.0rc6. Defaults to None.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend in mmdet >= 3.0.0rc7. Defaults to None.
{
    'instances': [
        {
            # List of 4 numbers representing the bounding box of the
            # instance, in (x1, y1, x2, y2) order.
            'bbox': [x1, y1, x2, y2],
            # Label of image classification.
            'bbox_label': 1,
        },
        ...
    ]
    'segments_info': [
        {
            # id = cls_id + instance_id * INSTANCE_OFFSET
            'id': int,
            # Contiguous category id defined in dataset.
            'category': int
            # Thing flag.
            'is_thing': bool
        },
        ...
    ]
    # Filename of semantic or panoptic segmentation ground truth file.
    'seg_map_path': 'a/b/c'
}
After this module, the annotation has been changed to the format below:
{
    # In (x1, y1, x2, y2) order, float type. N is the number of bboxes
    # in an image
    'gt_bboxes': BaseBoxes(N, 4)
    # In int type.
    'gt_bboxes_labels': np.ndarray(N, )
    # In built-in class
    'gt_masks': PolygonMasks (H, W) or BitmapMasks (H, W)
    # In uint8 type.
    'gt_seg_map': np.ndarray (H, W)
    # in (x, y, v) order, float type.
}
Required Keys:
height
width
instances
- bbox
- bbox_label
- ignore_flag
segments_info
- id
- category
- is_thing
seg_map_path
Added Keys:
gt_bboxes (BaseBoxes[torch.float32])
gt_bboxes_labels (np.int64)
gt_masks (BitmapMasks | PolygonMasks)
gt_seg_map (np.uint8)
gt_ignore_flags (bool)
Parameters:
with_bbox (bool) – Whether to parse and load the bbox annotation.
Defaults to True.
with_label (bool) – Whether to parse and load the label annotation.
Defaults to True.
with_mask (bool) – Whether to parse and load the mask annotation.
Defaults to True.
with_seg (bool) – Whether to parse and load the semantic segmentation
annotation. Defaults to False.
box_type (str) – The box mode used to wrap the bboxes.
imdecode_backend (str) – The image decoding backend type. The backend
argument for :func:mmcv.imfrombytes.
See :func:mmcv.imfrombytes for details.
Defaults to ‘cv2’.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend in mmdet >= 3.0.0rc7. Defaults to None.
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
Load and process the instances and seg_map annotation provided
by dataset. It must also load instances_ids, which are only used in
tracking tasks. The annotation format is as follows:
After this module, the annotation has been changed to the format below:
{
    # In (x1, y1, x2, y2) order, float type. N is the number of bboxes
    # in an image
    'gt_bboxes': np.ndarray(N, 4)
    # In int type.
    'gt_bboxes_labels': np.ndarray(N, )
    # In built-in class
    'gt_masks': PolygonMasks (H, W) or BitmapMasks (H, W)
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
mixup transform
+------------------------------+
| mixup image | |
| +--------|--------+ |
| | | | |
|---------------+ | |
| | | |
| | image | |
| | | |
| | | |
| |-----------------+ |
| pad |
+------------------------------+
The mixup transform steps are as follows:
1. Another random image is picked from the dataset and embedded in
the top-left patch (after padding and resizing).
2. The target of the mixup transform is the weighted average of the mixup
image and the original image.
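The weighted average in step 2 can be sketched on tiny nested-list images. The equal weight of 0.5 and the helper name mixup_blend are assumptions made for this example; the text above only states that a weighted average is taken.

```python
def mixup_blend(origin, mixup, weight=0.5):
    """Blend two equally sized images pixel by pixel.

    `weight` is applied to the original image and (1 - weight) to the
    mixup image. The default of 0.5 is an assumption for illustration.
    """
    return [
        [weight * o + (1.0 - weight) * m for o, m in zip(row_o, row_m)]
        for row_o, row_m in zip(origin, mixup)
    ]


print(mixup_blend([[0, 100]], [[200, 100]]))
```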
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_bboxes_labels (np.int64) (optional)
gt_ignore_flags (bool) (optional)
mix_results (List[dict])
Modified Keys:
img
img_shape
gt_bboxes (optional)
gt_bboxes_labels (optional)
gt_ignore_flags (optional)
Parameters:
img_scale (Sequence[int]) – Image output size after mixup pipeline.
The shape order should be (width, height). Defaults to (640, 640).
ratio_range (Sequence[float]) – Scale ratio of mixup image.
Defaults to (0.5, 1.5).
flip_ratio (float) – Horizontal flip ratio of mixup image.
Defaults to 0.5.
pad_val (int) – Pad value. Defaults to 114.
max_iters (int) – The maximum number of iterations. If the number of
iterations is greater than max_iters, but gt_bbox is still
empty, then the iteration is terminated. Defaults to 15.
bbox_clip_border (bool, optional) – Whether to clip the objects outside
the border of the image. In some datasets, such as MOT17, the gt bboxes
are allowed to cross the border of images, so the gt bboxes do not
need to be clipped in these cases. Defaults to True.
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
Given 4 images, mosaic transform combines them into
one output image. The output image is composed of the parts from each
sub-image.
mosaic transform
center_x
+------------------------------+
| pad | pad |
| +-----------+ |
| | | |
| | image1 |--------+ |
| | | | |
| | | image2 | |
center_y |----+-------------+-----------|
| | cropped | |
|pad | image3 | image4 |
| | | |
+----|-------------+-----------+
| |
+-------------+
The mosaic transform steps are as follows:
1. Choose the mosaic center as the intersections of 4 images
2. Get the left top image according to the index, and randomly
sample another 3 images from the custom dataset.
3. A sub-image is cropped if it is larger than its mosaic patch.
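A rough sketch of steps 1 and 3: sample a mosaic center, then compute where each of the four sub-images lands on the output canvas. It assumes the canvas is twice img_scale in each dimension (suggested by the diagram but not stated in the text), and the helper names sample_center and mosaic_paste_regions are hypothetical.

```python
import random


def sample_center(img_scale, center_ratio_range=(0.5, 1.5), rng=random):
    """Sample the mosaic center (x, y) from the ratio range and img_scale (w, h)."""
    cx = int(rng.uniform(*center_ratio_range) * img_scale[0])
    cy = int(rng.uniform(*center_ratio_range) * img_scale[1])
    return cx, cy


def mosaic_paste_regions(center_xy, img_wh, canvas_wh):
    """Return (x1, y1, x2, y2) paste regions for the four sub-images.

    Every region touches the center; a sub-image that does not fit
    between the center and the canvas border has to be cropped.
    """
    cx, cy = center_xy
    w, h = img_wh
    canvas_w, canvas_h = canvas_wh
    return {
        "top_left": (max(cx - w, 0), max(cy - h, 0), cx, cy),
        "top_right": (cx, max(cy - h, 0), min(cx + w, canvas_w), cy),
        "bottom_left": (max(cx - w, 0), cy, cx, min(cy + h, canvas_h)),
        "bottom_right": (cx, cy, min(cx + w, canvas_w), min(cy + h, canvas_h)),
    }


regions = mosaic_paste_regions((600, 700), (640, 640), (1280, 1280))
for name, box in regions.items():
    print(name, box)
```

In this example the top-left region is only 600 pixels wide, so a 640-pixel-wide sub-image would lose 40 columns when pasted there.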
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_bboxes_labels (np.int64) (optional)
gt_ignore_flags (bool) (optional)
mix_results (List[dict])
Modified Keys:
img
img_shape
gt_bboxes (optional)
gt_bboxes_labels (optional)
gt_ignore_flags (optional)
Parameters:
img_scale (Sequence[int]) – Image size before mosaic pipeline of single
image. The shape order should be (width, height).
Defaults to (640, 640).
center_ratio_range (Sequence[float]) – Center ratio range of mosaic
output. Defaults to (0.5, 1.5).
bbox_clip_border (bool, optional) – Whether to clip the objects outside
the border of the image. In some datasets, such as MOT17, the gt bboxes
are allowed to cross the border of images, so the gt bboxes do not
need to be clipped in these cases. Defaults to True.
pad_val (int) – Pad value. Defaults to 114.
prob (float) – Probability of applying this transformation.
Defaults to 1.0.
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
Generate multiple data-augmented versions of the same image.
MultiBranch needs to specify the branch names of all
pipelines of the dataset, perform corresponding data augmentation
for the current branch, and return None for other branches,
which ensures the consistency of return format across
different samples.
Parameters:
branch_field (list) – List of branch names.
branch_pipelines (dict) – Dict of different pipeline configs
to be composed.
Pack the inputs data for the detection / semantic segmentation /
panoptic segmentation.
The img_meta item is always populated. The contents of the
img_meta dictionary depend on meta_keys. By default this includes:
img_id: id of the image
img_path: path to the image file
ori_shape: original shape of the image as a tuple (h, w)
img_shape: shape of the image input to the network as a tuple (h, w). Note that images may be zero padded on the bottom/right if the batch tensor is larger than this shape.
scale_factor: a float indicating the preprocessing scale
flip: a boolean indicating if image flip transform was used
flip_direction: the flipping direction
Parameters:
meta_keys (Sequence[str], optional) – Meta keys to be converted to
mmcv.DataContainer and collected in data[img_metas].
Default: ('img_id','img_path','ori_shape','img_shape','scale_factor','flip','flip_direction')
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
Pack the inputs data for the ReID. The meta_info item is always
populated. The contents of the meta_info dictionary depend on
meta_keys. By default this includes:
img_path: path to the image file.
ori_shape: original shape of the image as a tuple (H, W).
img_shape: shape of the image input to the network as a tuple
(H, W). Note that images may be zero padded on the bottom/right
if the batch tensor is larger than this shape.
scale: scale of the image as a tuple (W, H).
scale_factor: a float indicating the pre-processing scale.
flip: a boolean indicating if image flip transform was used.
flip_direction: the flipping direction.
Parameters:
meta_keys (Sequence[str], optional) – The meta keys to be saved in the
metainfo of the packed data_sample.
Pack the inputs data for multi-object tracking and video instance
segmentation. All the image information is packed into inputs. All
the information except images is packed into data_samples. To keep
the original annotation and meta info, the instances key is added to the
meta keys.
Parameters:
meta_keys (Sequence[str]) – Meta keys to be collected in
data_sample.metainfo. Defaults to None.
default_meta_keys (tuple) – Default meta keys. Defaults to (‘img_id’,
‘img_path’, ‘ori_shape’, ‘img_shape’, ‘scale_factor’,
‘flip’, ‘flip_direction’, ‘frame_id’, ‘is_video_data’,
‘video_id’, ‘video_length’, ‘instances’).
There are three padding modes: (1) pad to a fixed size, (2) pad to the
minimum size that is divisible by some number, and (3) pad to square. Pad
to square and pad to the minimum size can also be used at the same time.
size_divisor (int, optional) – The divisor of padded size. Defaults to
None.
pad_to_square (bool) – Whether to pad the image into a square.
Currently only used for YOLOX. Defaults to False.
pad_val (Number | dict[str, Number], optional) – Padding value used when
padding_mode is “constant”. If it is a single number, the value
to pad the image is the number and to pad the semantic
segmentation map is 255. If it is a dict, it should have the
following keys:
img: The value to pad the image.
seg: The value to pad the semantic segmentation map.
Defaults to dict(img=0, seg=255).
padding_mode (str) –
Type of padding. Should be: constant, edge,
reflect or symmetric. Defaults to ‘constant’.
constant: pads with a constant value, this value is specified
with pad_val.
edge: pads with the last value at the edge of the image.
reflect: pads with reflection of image without repeating the last
value on the edge. For example, padding [1, 2, 3, 4] with 2
elements on both sides in reflect mode will result in
[3, 2, 1, 2, 3, 4, 3, 2].
symmetric: pads with reflection of image repeating the last value
on the edge. For example, padding [1, 2, 3, 4] with 2 elements on
both sides in symmetric mode will result in
[2, 1, 1, 2, 3, 4, 4, 3]
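The four padding modes can be reproduced on a 1-D list with a small sketch. The helper pad_1d is a hypothetical name for this example and only mirrors the behaviour described above; it is not the library implementation, which pads 2-D images.

```python
def pad_1d(values, pad, mode="constant", pad_val=0):
    """Pad a 1-D list with `pad` elements on both sides."""
    n = len(values)
    if mode == "constant":
        left = [pad_val] * pad
        right = [pad_val] * pad
    elif mode == "edge":
        left = [values[0]] * pad
        right = [values[-1]] * pad
    elif mode == "reflect":
        # Mirror around the edge element without repeating it (needs pad < n).
        left = [values[i] for i in range(pad, 0, -1)]
        right = [values[n - 2 - i] for i in range(pad)]
    elif mode == "symmetric":
        # Mirror including the edge element itself (needs pad <= n).
        left = [values[i] for i in range(pad - 1, -1, -1)]
        right = [values[n - 1 - i] for i in range(pad)]
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return left + list(values) + right


for mode in ("constant", "edge", "reflect", "symmetric"):
    print(mode, pad_1d([1, 2, 3, 4], 2, mode))
```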
Apply photometric distortions to the image sequentially; every transformation
is applied with a probability of 0.5. Random contrast is applied either
second (mode 0) or second to last (mode 1).
random brightness
random contrast (mode 0)
convert color from BGR to HSV
random saturation
random hue
convert color from HSV to BGR
random contrast (mode 1)
randomly swap channels
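The ordering above can be sketched as a function that builds the step list for a given contrast mode and then samples which steps actually run. The assumptions here are that the mode is chosen uniformly and that the two colour-space conversions always run; the names distortion_steps and sample_applied_steps are hypothetical.

```python
import random


def distortion_steps(contrast_mode):
    """Return the ordered step names for contrast mode 0 or 1."""
    steps = ["random brightness"]
    if contrast_mode == 0:
        steps.append("random contrast")
    steps += [
        "convert BGR to HSV",
        "random saturation",
        "random hue",
        "convert HSV to BGR",
    ]
    if contrast_mode == 1:
        steps.append("random contrast")
    steps.append("randomly swap channels")
    return steps


def sample_applied_steps(rng=random):
    """Pick a contrast mode, then keep each random step with probability 0.5."""
    mode = rng.randint(0, 1)  # assumed to be uniform
    applied = []
    for step in distortion_steps(mode):
        # Assumption: colour-space conversions are not random.
        if step.startswith("convert") or rng.random() < 0.5:
            applied.append(step)
    return applied


print(distortion_steps(0))
print(distortion_steps(1))
print(sample_applied_steps())
```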
Required Keys:
img (np.uint8)
Modified Keys:
img (np.float32)
Parameters:
brightness_delta (int) – delta of brightness.
contrast_range (sequence) – range of contrast.
saturation_range (sequence) – range of saturation.
hue_delta (int) – delta of hue.
swap_channels (bool) – Whether to randomly swap channels.
Defaults to True.
A transform wrapper that applies the wrapped transforms to both
gt_bboxes and proposals without requiring any extra code. It performs the
following steps:
1. Scatter the broadcasting targets to a list of inputs for the wrapped
transforms. The list has the type list[dict, dict]: the first item is
the original inputs, and the second is a copy in which gt_bboxes
has been overwritten by the proposals.
2. Apply self.transforms to both items with the same random parameters,
which are shared through a context manager. The output is also a
list[dict, dict].
3. Gather the outputs, updating the proposals in the first item of
the outputs with the gt_bboxes of the second item.
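The scatter, apply and gather steps can be sketched with plain dicts. Note the swapped-in technique: instead of the context manager mentioned above, this sketch shares random parameters by replaying the state of Python's random module. The functions random_shift and broadcast are hypothetical names for this example.

```python
import copy
import random


def random_shift(results):
    """Hypothetical transform: shift every box by one random x offset."""
    dx = random.randint(-5, 5)
    results["gt_bboxes"] = [
        [x1 + dx, y1, x2 + dx, y2] for x1, y1, x2, y2 in results["gt_bboxes"]
    ]
    return results


def broadcast(results, transforms):
    # 1. Scatter: the second copy carries the proposals in gt_bboxes.
    original = copy.deepcopy(results)
    as_proposals = copy.deepcopy(results)
    as_proposals["gt_bboxes"] = as_proposals["proposals"]

    # 2. Apply: restore the same RNG state before processing each copy,
    #    so both see identical random parameters.
    state = random.getstate()
    outputs = []
    for inputs in (original, as_proposals):
        random.setstate(state)
        for transform in transforms:
            inputs = transform(inputs)
        outputs.append(inputs)

    # 3. Gather: write the transformed proposals back into the first item.
    outputs[0]["proposals"] = outputs[1]["gt_bboxes"]
    return outputs[0]


sample = {"gt_bboxes": [[0, 0, 10, 10]], "proposals": [[2, 2, 8, 8]]}
print(broadcast(sample, [random_shift]))
```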
Parameters:
transforms (list, optional) – Sequence of transform
object or config dict to be wrapped. Defaults to [].
Note: The TransformBroadcaster in MMCV can achieve the same operation as
ProposalBroadcaster, but requires more complex parameters.
aug_space (List[List[Union[dict, ConfigDict]]]) – The augmentation space
of rand augmentation. Each augmentation transform in aug_space
is a specific transform, and is composed by several augmentations.
When RandAugment is called, a random transform in aug_space
will be selected to augment images. Defaults to aug_space.
aug_num (int) – Number of augmentations to apply sequentially.
Defaults to 2.
prob (list[float], optional) – The probabilities associated with
each augmentation. The length should be equal to the
augmentation space and the sum should be 1. If not given,
a uniform distribution will be assumed. Defaults to None.
This operation randomly generates an affine transform matrix that includes
rotation, translation, shear and scaling transforms.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_bboxes_labels (np.int64) (optional)
gt_ignore_flags (bool) (optional)
Modified Keys:
img
img_shape
gt_bboxes (optional)
gt_bboxes_labels (optional)
gt_ignore_flags (optional)
Parameters:
max_rotate_degree (float) – Maximum degrees of rotation transform.
Defaults to 10.
max_translate_ratio (float) – Maximum ratio of translation.
Defaults to 0.1.
scaling_ratio_range (tuple[float]) – Min and max ratio of
scaling transform. Defaults to (0.5, 1.5).
max_shear_degree (float) – Maximum degrees of shear
transform. Defaults to 2.
border (tuple[int]) – Distance from width and height sides of input
image to adjust output shape. Only used in mosaic dataset.
Defaults to (0, 0).
border_val (tuple[int]) – Border padding values of 3 channels.
Defaults to (114, 114, 114).
bbox_clip_border (bool, optional) – Whether to clip the objects outside
the border of the image. In some datasets, such as MOT17, the gt bboxes
are allowed to cross the border of images, so the gt bboxes do not
need to be clipped in these cases. Defaults to True.
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
Random center crop and random around padding for CornerNet.
This operation generates a randomly cropped image from the original image and
pads it simultaneously. Unlike RandomCrop, the output
shape may not strictly equal crop_size. A random value is chosen
from ratios, so the output shape can be larger or smaller than
crop_size. The padding operation also differs from Pad:
it pads around the image instead of only on the right and bottom.
The relation between output image (padding image) and original image:
output image
+----------------------------+
| padded area |
+------|----------------------------|----------+
| | cropped area | |
| | +---------------+ | |
| | | . center | | | original image
| | | range | | |
| | +---------------+ | |
+------|----------------------------|----------+
| padded area |
+----------------------------+
There are 5 main areas in the figure:
output image: output image of this operation, also called padding
image in following instruction.
original image: input image of this operation.
padded area: non-intersect area of output image and original image.
cropped area: the overlap of output image and original image.
center range: a smaller area from which the random center is chosen.
The center range is computed from border and the original image’s shape
so that the random center is not too close to the original image’s border.
This operation also acts differently in train and test mode; the
pipelines are summarized below.
Train pipeline:
Choose a random_ratio from ratios; the shape of the padding image
will be random_ratio*crop_size.
Choose a random_center in the center range.
Generate a padding image whose center matches the random_center.
Initialize the padding image with pixel values equal to mean.
Copy the cropped area to the padding image.
Refine annotations.
Test pipeline:
Compute the output shape according to test_pad_mode.
Generate a padding image whose center matches the original image
center.
Initialize the padding image with pixel values equal to mean.
Copy the cropped area to the padding image.
Required Keys:
img (np.float32)
img_shape (tuple)
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_bboxes_labels (np.int64) (optional)
gt_ignore_flags (bool) (optional)
Modified Keys:
img (np.float32)
img_shape (tuple)
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_bboxes_labels (np.int64) (optional)
gt_ignore_flags (bool) (optional)
Parameters:
crop_size (tuple, optional) – Expected size after crop; the final size is
computed according to ratio. Requires (width, height)
in train mode, and None in test mode.
ratios (tuple, optional) – Randomly select a ratio from the tuple and crop
the image to (crop_size[0] * ratio) * (crop_size[1] * ratio).
Only available in train mode. Defaults to (0.9, 1.0, 1.1).
border (int, optional) – Max distance from the center select area to the
image border. Only available in train mode. Defaults to 128.
mean (sequence, optional) – Mean values of 3 channels.
std (sequence, optional) – Std values of 3 channels.
to_rgb (bool, optional) – Whether to convert the image from BGR to RGB.
test_mode (bool) – Whether to run in test mode, which determines whether
random variables are involved in the transform.
In train mode, crop_size is fixed, and the center coords and ratio are
randomly selected from predefined lists. In test mode, crop_size
is the image’s original shape, and the center coords and ratio are fixed.
Defaults to False.
test_pad_mode (tuple, optional) –
padding method and padding shape
value, only available in test mode. Default is using
‘logical_or’ with 127 as padding shape value.
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
The absolute crop_size is sampled based on crop_type and
image_size, then the cropped results are generated.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_bboxes_labels (np.int64) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_ignore_flags (bool) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
img_shape
gt_bboxes (optional)
gt_bboxes_labels (optional)
gt_masks (optional)
gt_ignore_flags (optional)
gt_seg_map (optional)
gt_instances_ids (optional, only used in MOT/VIS)
Added Keys:
homography_matrix
Parameters:
crop_size (tuple) – The relative ratio or absolute pixels of
(width, height).
crop_type (str, optional) – One of “relative_range”, “relative”,
“absolute”, “absolute_range”. “relative” randomly crops
(h * crop_size[0], w * crop_size[1]) part from an input of size
(h, w). “relative_range” uniformly samples relative crop size from
range [crop_size[0], 1] and [crop_size[1], 1] for height and width
respectively. “absolute” crops from an input with absolute size
(crop_size[0], crop_size[1]). “absolute_range” uniformly samples
crop_h in range [crop_size[0], min(h, crop_size[1])] and crop_w
in range [crop_size[0], min(w, crop_size[1])].
Defaults to “absolute”.
allow_negative_crop (bool, optional) – Whether to allow a crop that does
not contain any bbox area. Defaults to False.
recompute_bbox (bool, optional) – Whether to re-compute the boxes based
on cropped instance masks. Defaults to False.
bbox_clip_border (bool, optional) – Whether to clip the objects outside
the border of the image. Defaults to True.
Note
If the image is smaller than the absolute crop size, return the
original image.
The keys for bboxes, labels and masks must be aligned. That is,
gt_bboxes corresponds to gt_labels and gt_masks, and
gt_bboxes_ignore corresponds to gt_labels_ignore and
gt_masks_ignore.
If the crop does not contain any gt-bbox region and
allow_negative_crop is set to False, skip this image.
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
n_patches (int or tuple[int, int]) – Number of regions to be dropped.
If it is given as a tuple, number of patches will be randomly
selected from the closed interval [n_patches[0],
n_patches[1]].
ratio (float or tuple[float, float]) – The ratio of erased regions.
It can be float to use a fixed ratio or tuple[float,float]
to randomly choose ratio from the interval.
squared (bool) – Whether to erase square region. Defaults to True.
bbox_erased_thr (float) – The threshold for the maximum area proportion
of the bbox to be erased. When the proportion of the area where the
bbox is erased is greater than the threshold, the bbox will be
removed. Defaults to 0.9.
img_border_value (int or float or tuple) – The filled values for
image border. If float, the same fill value will be used for
all the three channels of image. If tuple, it should be 3 elements.
Defaults to 128.
mask_border_value (int) – The fill value used for masks. Defaults to 0.
seg_ignore_label (int) – The fill value used for segmentation map.
Note this value must equal ignore_label in semantic_head
of the corresponding config. Defaults to 255.
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
Flip the image & bbox & mask & segmentation map. Added or Updated keys:
flip, flip_direction, img, gt_bboxes, and gt_seg_map. There are 3 flip
modes:
prob is float, direction is string: the image will be flipped in
the given direction with probability prob.
E.g., prob=0.5, direction='horizontal',
then image will be horizontally flipped with probability of 0.5.
prob is float, direction is list of string: the image will
be flipped in direction[i] with probability prob/len(direction).
E.g., prob=0.5, direction=['horizontal','vertical'],
then image will be horizontally flipped with probability of 0.25,
vertically with probability of 0.25.
prob is list of float, direction is list of string:
given len(prob)==len(direction), the image will
be flipped in direction[i] with probability prob[i].
E.g., prob=[0.3,0.5], direction=['horizontal','vertical'],
then image will be horizontally flipped with
probability of 0.3, vertically with probability of 0.5.
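The three modes reduce to one probability table over the directions plus a "no flip" outcome. The sketch below builds that table and samples from it; flip_distribution and sample_flip are hypothetical helper names, and None stands for "no flip".

```python
import random


def flip_distribution(prob, direction):
    """Return {direction: probability}, with None meaning 'no flip'."""
    directions = [direction] if isinstance(direction, str) else list(direction)
    if isinstance(prob, (list, tuple)):
        if len(prob) != len(directions):
            raise ValueError("prob and direction must have the same length")
        probs = list(prob)
    else:
        # A single float is split evenly across the listed directions.
        probs = [prob / len(directions)] * len(directions)
    dist = dict(zip(directions, probs))
    dist[None] = 1.0 - sum(probs)
    return dist


def sample_flip(prob, direction, rng=random):
    """Draw one outcome (a direction or None) from the table."""
    dist = flip_distribution(prob, direction)
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]


print(flip_distribution(0.5, "horizontal"))
print(flip_distribution(0.5, ["horizontal", "vertical"]))
print(sample_flip([0.3, 0.5], ["horizontal", "vertical"]))
```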
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
gt_bboxes
gt_masks
gt_seg_map
Added Keys:
flip
flip_direction
homography_matrix
Parameters:
prob (float | list[float], optional) – The flipping probability.
Defaults to None.
direction (str | list[str]) – The flipping direction.
If input is a list, its length must equal the length of prob. Each
element in prob indicates the flip probability of the
corresponding direction. Defaults to ‘horizontal’.
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
Shift the image and box given shift pixels and probability.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32])
gt_bboxes_labels (np.int64)
gt_ignore_flags (bool) (optional)
Modified Keys:
img
gt_bboxes
gt_bboxes_labels
gt_ignore_flags (bool) (optional)
Parameters:
prob (float) – Probability of shifts. Defaults to 0.5.
max_shift_px (int) – The max pixels for shifting. Defaults to 32.
filter_thr_px (int) – The width and height threshold for filtering.
The bbox and the rest of the targets below the width and
height threshold will be filtered. Defaults to 1.
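How shifting interacts with filter_thr_px can be sketched on plain box lists. This is a simplified illustration under the assumption that shifted boxes are clipped to the image before filtering; shift_boxes is a hypothetical helper, and the image pixels themselves are not handled here.

```python
def shift_boxes(bboxes, labels, dx, dy, img_w, img_h, filter_thr_px=1):
    """Shift (x1, y1, x2, y2) boxes, clip them, and drop tiny leftovers."""
    kept_boxes, kept_labels = [], []
    for (x1, y1, x2, y2), label in zip(bboxes, labels):
        # Translate, then clip to the image (clipping is an assumption).
        x1 = min(max(x1 + dx, 0), img_w)
        x2 = min(max(x2 + dx, 0), img_w)
        y1 = min(max(y1 + dy, 0), img_h)
        y2 = min(max(y2 + dy, 0), img_h)
        # Keep a box and its label only if both sides exceed the threshold.
        if (x2 - x1) > filter_thr_px and (y2 - y1) > filter_thr_px:
            kept_boxes.append([x1, y1, x2, y2])
            kept_labels.append(label)
    return kept_boxes, kept_labels


boxes, labels = shift_boxes(
    [[10, 10, 50, 50], [0, 0, 20, 20]], [1, 2], dx=-30, dy=0, img_w=100, img_h=100
)
print(boxes, labels)
```

The second box is pushed entirely out of the image, so it and its label are filtered together.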
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as the input, and can add new
items to the dict or modify existing items in the dict. The result
dict is returned at the end, which allows multiple transforms to be
concatenated into a pipeline.
This transform resizes the input image according to scale or
scale_factor. Bboxes, masks, and seg map are then resized
with the same scale factor.
If scale and scale_factor are both set, scale is used to
resize.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
img_shape
gt_bboxes
gt_masks
gt_seg_map
Added Keys:
scale
scale_factor
keep_ratio
homography_matrix
Parameters:
scale (int or tuple) – Image scales for resizing. Defaults to None.
scale_factor (float or tuple[float]) – Scale factors for resizing.
Defaults to None.
keep_ratio (bool) – Whether to keep the aspect ratio when resizing the
image. Defaults to False.
clip_object_border (bool) – Whether to clip the objects
outside the border of the image. In some datasets, such as MOT17, the gt
bboxes are allowed to cross the border of images, so the gt bboxes
do not need to be clipped in these cases. Defaults to True.
backend (str) – Image resize backend, choices are ‘cv2’ and ‘pillow’.
These two backends generates slightly different results. Defaults
to ‘cv2’.
interpolation (str) – Interpolation method, accepted values are
“nearest”, “bilinear”, “bicubic”, “area”, “lanczos” for ‘cv2’
backend, “nearest”, “bilinear” for ‘pillow’ backend. Defaults
to ‘bilinear’.
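The interplay of scale, scale_factor and keep_ratio can be sketched as a pure size computation. This is an illustration under stated assumptions, not the library code: scale is taken as a (width, height) tuple, keep_ratio fits the image inside the long and short edges of scale, and sizes are rounded half up.

```python
from typing import Optional, Tuple, Union


def _round(value: float) -> int:
    """Round half up, assumed here for converting sizes to integers."""
    return int(value + 0.5)


def resized_shape(h: int, w: int,
                  scale: Optional[Tuple[int, int]] = None,
                  scale_factor: Union[float, Tuple[float, float], None] = None,
                  keep_ratio: bool = False) -> Tuple[int, int]:
    """Return the target (height, width) for an image of size (h, w)."""
    if scale is None and scale_factor is None:
        raise ValueError("either scale or scale_factor must be set")
    if scale is None:
        if isinstance(scale_factor, tuple):
            factor_w, factor_h = scale_factor
        else:
            factor_w = factor_h = scale_factor
        return _round(h * factor_h), _round(w * factor_w)
    # scale takes priority when both arguments are given.
    target_w, target_h = scale
    if not keep_ratio:
        return target_h, target_w
    long_edge, short_edge = max(scale), min(scale)
    factor = min(long_edge / max(h, w), short_edge / min(h, w))
    return _round(h * factor), _round(w * factor)


kept = resized_shape(480, 640, scale=(1333, 800), keep_ratio=True)
stretched = resized_shape(480, 640, scale=(1333, 800))
halved = resized_shape(480, 640, scale_factor=0.5)
both = resized_shape(480, 640, scale=(1333, 800), scale_factor=0.5)
```

With keep_ratio=True the 480x640 image is limited by its short edge and ends up 800 pixels high; without it the image is stretched to exactly the requested size.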
Transform function to resize images, bounding boxes, semantic
segmentation map and keypoints.
Parameters:
results (dict) – Result dict from loading pipeline.
Returns:
Resized results, ‘img’, ‘gt_bboxes’, ‘gt_seg_map’,
‘gt_keypoints’, ‘scale’, ‘scale_factor’, ‘img_shape’,
and ‘keep_ratio’ keys are updated in result dict.
This transform attempts to scale the shorter edge to the given
scale, as long as the longer edge does not exceed max_size.
If max_size is reached, then downscale so that the longer
edge does not exceed max_size.
Required Keys:
img
gt_seg_map (optional)
Modified Keys:
img
img_shape
gt_seg_map (optional)
Added Keys:
scale
scale_factor
keep_ratio
Parameters:
scale (Union[int, Tuple[int, int]]) – The target short edge length.
If it is a tuple, the minimum value is used as the short edge length.
max_size (int) – The maximum allowed longest edge length.
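The shortest-edge rule described above can be sketched as follows. The function name is hypothetical and rounding half up is an assumption; the logic is the two-step rule from the description: scale the shorter edge to the target, then shrink again if the longer edge exceeds max_size.

```python
from typing import Tuple


def shortest_edge_shape(h: int, w: int, short_edge_length: int,
                        max_size: int) -> Tuple[int, int]:
    """Return the target (height, width) under the shortest-edge rule."""
    factor = short_edge_length / min(h, w)
    if h < w:
        new_h, new_w = float(short_edge_length), factor * w
    else:
        new_h, new_w = factor * h, float(short_edge_length)
    # If the longer edge is now too long, downscale both edges together.
    if max(new_h, new_w) > max_size:
        shrink = max_size / max(new_h, new_w)
        new_h, new_w = new_h * shrink, new_w * shrink
    return int(new_h + 0.5), int(new_w + 0.5)


regular = shortest_edge_shape(480, 640, short_edge_length=800, max_size=1333)
elongated = shortest_edge_shape(400, 1200, short_edge_length=800, max_size=1333)
```

For the elongated image the shorter edge never reaches 800, because the longer edge hits max_size first.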
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as input and can add new
items to the dict or modify existing ones. The result
dict is returned at the end, which allows multiple transforms
to be concatenated into a pipeline.
Rotate the images, bboxes, masks and segmentation map.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
gt_bboxes
gt_masks
gt_seg_map
Added Keys:
homography_matrix
Parameters:
prob (float) – The probability of performing the transformation.
Should be in range [0, 1]. Defaults to 1.0.
level (int, optional) – The level should be in range [0, _MAX_LEVEL].
If level is None, it will generate from [0, _MAX_LEVEL] randomly.
Defaults to None.
min_mag (float) – The minimum angle for rotation.
Defaults to 0.0.
max_mag (float) – The maximum angle for rotation.
Defaults to 30.0.
reversal_prob (float) – The probability that reverses the rotation
magnitude. Should be in range [0,1]. Defaults to 0.5.
img_border_value (int | float | tuple) – The filled values for
image border. If float, the same fill value will be used for
all the three channels of image. If tuple, it should be 3 elements.
Defaults to 128.
mask_border_value (int) – The fill value used for masks. Defaults to 0.
seg_ignore_label (int) – The fill value used for segmentation map.
Note this value must equal ignore_label in semantic_head
of the corresponding config. Defaults to 255.
interpolation (str) – Interpolation method, accepted values are
“nearest”, “bilinear”, “bicubic”, “area”, “lanczos” for ‘cv2’
backend, “nearest”, “bilinear” for ‘pillow’ backend. Defaults
to ‘bilinear’.
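How level, min_mag, max_mag and reversal_prob might combine into a signed magnitude is sketched below. The value _MAX_LEVEL = 10, the linear mapping, the rounding to one decimal and both function names are assumptions for illustration; the sign flip follows the documented meaning of reversal_prob.

```python
import random
from typing import Optional

_MAX_LEVEL = 10  # assumed value of the module constant


def level_to_mag(level: Optional[int], min_mag: float, max_mag: float,
                 rng=random) -> float:
    """Map a level in [0, _MAX_LEVEL] linearly onto [min_mag, max_mag]."""
    if level is None:
        # No level given: draw the magnitude uniformly from the range.
        return round(rng.random() * (max_mag - min_mag) + min_mag, 1)
    return round(level / _MAX_LEVEL * (max_mag - min_mag) + min_mag, 1)


def signed_mag(mag: float, reversal_prob: float, rng=random) -> float:
    """Negate the magnitude with probability reversal_prob."""
    return -mag if rng.random() < reversal_prob else mag


half = level_to_mag(5, 0.0, 30.0)
full = level_to_mag(10, 0.0, 30.0)
lowest = level_to_mag(0, 0.1, 1.9)
never_reversed = signed_mag(15.0, 0.0)
always_reversed = signed_mag(15.0, 1.0)
```

Under these assumptions level 5 with the default rotation range gives 15 degrees, and a reversal_prob of 0.5 makes clockwise and counter-clockwise rotation equally likely.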
Adjust image sharpness. A positive magnitude enhances the
sharpness and a negative magnitude makes the image blurry. A
magnitude of 0 gives the original image.
Required Keys:
img
Modified Keys:
img
Parameters:
prob (float) – The probability for performing Sharpness transformation.
Defaults to 1.0.
level (int, optional) – Should be in range [0,_MAX_LEVEL].
If level is None, it will generate from [0, _MAX_LEVEL] randomly.
Defaults to None.
min_mag (float) – The minimum magnitude for Sharpness transformation.
Defaults to 0.1.
max_mag (float) – The maximum magnitude for Sharpness transformation.
Defaults to 1.9.
Shear the images, bboxes, masks and segmentation map horizontally.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
gt_bboxes
gt_masks
gt_seg_map
Added Keys:
homography_matrix
Parameters:
prob (float) – The probability for performing Shear and should be in
range [0, 1]. Defaults to 1.0.
level (int, optional) – The level should be in range [0, _MAX_LEVEL].
If level is None, it will generate from [0, _MAX_LEVEL] randomly.
Defaults to None.
min_mag (float) – The minimum angle for the horizontal shear.
Defaults to 0.0.
max_mag (float) – The maximum angle for the horizontal shear.
Defaults to 30.0.
reversal_prob (float) – The probability that reverses the horizontal
shear magnitude. Should be in range [0,1]. Defaults to 0.5.
img_border_value (int | float | tuple) – The filled values for
image border. If float, the same fill value will be used for
all the three channels of image. If tuple, it should be 3 elements.
Defaults to 128.
mask_border_value (int) – The fill value used for masks. Defaults to 0.
seg_ignore_label (int) – The fill value used for segmentation map.
Note this value must equal ignore_label in semantic_head
of the corresponding config. Defaults to 255.
interpolation (str) – Interpolation method, accepted values are
“nearest”, “bilinear”, “bicubic”, “area”, “lanczos” for ‘cv2’
backend, “nearest”, “bilinear” for ‘pillow’ backend. Defaults
to ‘bilinear’.
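A sketch of the geometry behind a horizontal shear and the homography_matrix key, assuming the shear factor is the tangent of the angle and that a box is updated by transforming its four corners and taking the enclosing axis-aligned box. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def shear_x_matrix(angle_deg: float) -> Matrix:
    """3x3 homography for a horizontal shear (assumed factor: tan(angle))."""
    factor = math.tan(math.radians(angle_deg))
    return [[1.0, factor, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]


def apply_homography(matrix: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Apply a 3x3 homography to one point in homogeneous coordinates."""
    xh = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    yh = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    wh = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
    return xh / wh, yh / wh


def shear_box_x(box, angle_deg: float):
    """Shear the four corners and return the enclosing axis-aligned box."""
    x1, y1, x2, y2 = box
    matrix = shear_x_matrix(angle_deg)
    corners = [apply_homography(matrix, x, y)
               for x, y in ((x1, y1), (x2, y1), (x1, y2), (x2, y2))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    return min(xs), min(ys), max(xs), max(ys)


point = apply_homography(shear_x_matrix(45.0), 10.0, 20.0)
sheared = shear_box_x((0.0, 0.0, 10.0, 10.0), 45.0)
```

A vertical shear would place the factor in the second row of the matrix instead.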
Shear the images, bboxes, masks and segmentation map vertically.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
gt_bboxes
gt_masks
gt_seg_map
Added Keys:
homography_matrix
Parameters:
prob (float) – The probability for performing ShearY and should be in
range [0, 1]. Defaults to 1.0.
level (int, optional) – The level should be in range [0,_MAX_LEVEL].
If level is None, it will generate from [0, _MAX_LEVEL] randomly.
Defaults to None.
min_mag (float) – The minimum angle for the vertical shear.
Defaults to 0.0.
max_mag (float) – The maximum angle for the vertical shear.
Defaults to 30.0.
reversal_prob (float) – The probability that reverses the vertical
shear magnitude. Should be in range [0,1]. Defaults to 0.5.
img_border_value (int | float | tuple) – The filled values for
image border. If float, the same fill value will be used for
all the three channels of image. If tuple, it should be 3 elements.
Defaults to 128.
mask_border_value (int) – The fill value used for masks. Defaults to 0.
seg_ignore_label (int) – The fill value used for segmentation map.
Note this value must equal ignore_label in semantic_head
of the corresponding config. Defaults to 255.
interpolation (str) – Interpolation method, accepted values are
“nearest”, “bilinear”, “bicubic”, “area”, “lanczos” for ‘cv2’
backend, “nearest”, “bilinear” for ‘pillow’ backend. Defaults
to ‘bilinear’.
Translate the images, bboxes, masks and segmentation map horizontally.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
gt_bboxes
gt_masks
gt_seg_map
Added Keys:
homography_matrix
Parameters:
prob (float) – The probability of performing the transformation.
Should be in range [0, 1]. Defaults to 1.0.
level (int, optional) – The level should be in range [0, _MAX_LEVEL].
If level is None, it will generate from [0, _MAX_LEVEL] randomly.
Defaults to None.
min_mag (float) – The minimum pixel’s offset ratio for horizontal
translation. Defaults to 0.0.
max_mag (float) – The maximum pixel’s offset ratio for horizontal
translation. Defaults to 0.1.
reversal_prob (float) – The probability that reverses the horizontal
translation magnitude. Should be in range [0,1]. Defaults to 0.5.
img_border_value (int | float | tuple) – The filled values for
image border. If float, the same fill value will be used for
all the three channels of image. If tuple, it should be 3 elements.
Defaults to 128.
mask_border_value (int) – The fill value used for masks. Defaults to 0.
seg_ignore_label (int) – The fill value used for segmentation map.
Note this value must equal ignore_label in semantic_head
of the corresponding config. Defaults to 255.
interpolation (str) – Interpolation method, accepted values are
“nearest”, “bilinear”, “bicubic”, “area”, “lanczos” for ‘cv2’
backend, “nearest”, “bilinear” for ‘pillow’ backend. Defaults
to ‘bilinear’.
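Since min_mag and max_mag are offset ratios rather than pixels, a horizontal translation first converts the ratio into a pixel offset. The sketch below assumes the offset is the image width times the ratio, truncated to an integer, and that shifted boxes are clipped to the image; the function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def translate_x_offset(img_width: int, mag_ratio: float) -> int:
    """Convert an offset ratio into a pixel offset along the x axis."""
    return int(img_width * mag_ratio)


def translate_box_x(box: Box, offset: int, img_width: int) -> Box:
    """Shift a box horizontally and clip it to the image width."""
    x1, y1, x2, y2 = box
    x1 = min(max(x1 + offset, 0), img_width)
    x2 = min(max(x2 + offset, 0), img_width)
    return x1, y1, x2, y2


right = translate_x_offset(640, 0.1)
left = translate_x_offset(640, -0.1)
moved_right = translate_box_x((100, 10, 200, 50), right, 640)
moved_left = translate_box_x((30, 10, 100, 50), left, 640)
```

The vertical variant would use the image height and shift y1 and y2 instead.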
Translate the images, bboxes, masks and segmentation map vertically.
Required Keys:
img
gt_bboxes (BaseBoxes[torch.float32]) (optional)
gt_masks (BitmapMasks | PolygonMasks) (optional)
gt_seg_map (np.uint8) (optional)
Modified Keys:
img
gt_bboxes
gt_masks
gt_seg_map
Added Keys:
homography_matrix
Parameters:
prob (float) – The probability of performing the transformation.
Should be in range [0, 1]. Defaults to 1.0.
level (int, optional) – The level should be in range [0, _MAX_LEVEL].
If level is None, it will generate from [0, _MAX_LEVEL] randomly.
Defaults to None.
min_mag (float) – The minimum pixel’s offset ratio for vertical
translation. Defaults to 0.0.
max_mag (float) – The maximum pixel’s offset ratio for vertical
translation. Defaults to 0.1.
reversal_prob (float) – The probability that reverses the vertical
translation magnitude. Should be in range [0,1]. Defaults to 0.5.
img_border_value (int | float | tuple) – The filled values for
image border. If float, the same fill value will be used for
all the three channels of image. If tuple, it should be 3 elements.
Defaults to 128.
mask_border_value (int) – The fill value used for masks. Defaults to 0.
seg_ignore_label (int) – The fill value used for segmentation map.
Note this value must equal ignore_label in semantic_head
of the corresponding config. Defaults to 255.
interpolation (str) – Interpolation method, accepted values are
“nearest”, “bilinear”, “bicubic”, “area”, “lanczos” for ‘cv2’
backend, “nearest”, “bilinear” for ‘pillow’ backend. Defaults
to ‘bilinear’.
Apply HSV augmentation to image sequentially. It is referenced from
https://github.com/Megvii-BaseDetection/YOLOX/blob/main/yolox/data/data_augment.py#L21.
Required Keys:
img
Modified Keys:
img
Parameters:
hue_delta (int) – delta of hue. Defaults to 5.
saturation_delta (int) – delta of saturation. Defaults to 30.
value_delta (int) – delta of value. Defaults to 30.
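A per-pixel sketch of what the three deltas might control, assuming the 8-bit OpenCV HSV convention (hue in [0, 180), saturation and value in [0, 255]), gains drawn uniformly from [-delta, delta], hue wrapping around, and the other two channels being clipped. Function names are hypothetical and the real transform works on whole image arrays.

```python
import random
from typing import Tuple


def sample_hsv_gains(hue_delta: int = 5, saturation_delta: int = 30,
                     value_delta: int = 30, rng=random) -> Tuple[int, int, int]:
    """Draw one integer gain per channel from [-delta, delta]."""
    deltas = (hue_delta, saturation_delta, value_delta)
    return tuple(int(rng.uniform(-1, 1) * d) for d in deltas)


def apply_hsv_gains(pixel_hsv: Tuple[int, int, int],
                    gains: Tuple[int, int, int]) -> Tuple[int, int, int]:
    """Add gains to one HSV pixel: hue wraps at 180, S and V are clipped."""
    h, s, v = pixel_hsv
    dh, ds, dv = gains
    return ((h + dh) % 180,
            min(max(s + ds, 0), 255),
            min(max(v + dv, 0), 255))


shifted = apply_hsv_gains((178, 250, 10), (5, 30, -30))
gains = sample_hsv_gains(rng=random.Random(0))
```

Hue is circular, so it wraps; saturation and value are bounded, so they saturate at the ends of their range.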
The transform function. All subclasses of BaseTransform should
override this method.
This function takes the result dict as input and can add new
items to the dict or modify existing ones. The result
dict is returned at the end, which allows multiple transforms
to be concatenated into a pipeline.
Mean Teacher is an efficient semi-supervised learning method.
This method requires two models with exactly the same structure,
as the student model and the teacher model, respectively.
The student model updates the parameters through gradient descent,
and the teacher model updates the parameters through
exponential moving average of the student model.
Compared with the student model, the teacher model
is smoother and accumulates more knowledge.
Parameters:
momentum (float) –
The momentum used for updating teacher’s parameter.
interval (int) – Update teacher’s parameter every interval iteration.
Defaults to 1.
skip_buffers (bool) – Whether to skip the model buffers, such as
batchnorm running stats (running_mean, running_var). If True, the
EMA operation is not applied to them. Defaults to True.
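The teacher update can be sketched with plain dicts of floats standing in for model parameters. The convention assumed here is that momentum weights the student, so a small momentum makes the teacher change slowly; the function names and the 1-based iteration counting are illustrative assumptions.

```python
from typing import Dict, Iterable


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               momentum: float) -> Dict[str, float]:
    """teacher <- (1 - momentum) * teacher + momentum * student, in place."""
    for name, student_value in student.items():
        teacher[name] = (1.0 - momentum) * teacher[name] + momentum * student_value
    return teacher


def run_updates(teacher: Dict[str, float],
                student_states: Iterable[Dict[str, float]],
                momentum: float, interval: int = 1) -> Dict[str, float]:
    """Apply the EMA update only every `interval` iterations."""
    for iteration, student in enumerate(student_states, start=1):
        if iteration % interval == 0:
            ema_update(teacher, student, momentum)
    return teacher


single = ema_update({"w": 1.0}, {"w": 0.0}, momentum=0.25)
spaced = run_updates({"w": 0.0}, [{"w": 1.0}] * 4, momentum=0.5, interval=2)
```

With interval=2 only the second and fourth iterations update the teacher, which is why it ends at 0.75 rather than closer to 1.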
Calculate average precision (for single or multiple scales).
Parameters:
recalls (ndarray) – shape (num_scales, num_dets) or (num_dets, )
precisions (ndarray) – shape (num_scales, num_dets) or (num_dets, )
mode (str) – ‘area’ or ‘11points’, ‘area’ means calculating the area
under precision-recall curve, ‘11points’ means calculating
the average precision of recalls at [0, 0.1, …, 1]
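The two modes can be sketched for a single scale in plain Python. This is an illustration of the standard area and 11-point definitions, assuming recalls are sorted in ascending order; it is not the library function and omits the multi-scale array handling.

```python
from typing import Sequence


def average_precision(recalls: Sequence[float], precisions: Sequence[float],
                      mode: str = "area") -> float:
    """Average precision for one scale, in 'area' or '11points' mode."""
    if mode == "area":
        # Pad the curve, then make precision non-increasing from right to left.
        mrec = [0.0] + list(recalls) + [1.0]
        mpre = [0.0] + list(precisions) + [0.0]
        for i in range(len(mpre) - 2, -1, -1):
            mpre[i] = max(mpre[i], mpre[i + 1])
        # Sum the rectangles under the resulting step function.
        return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
                   for i in range(len(mrec) - 1))
    if mode == "11points":
        total = 0.0
        for k in range(11):
            threshold = k / 10
            candidates = [p for r, p in zip(recalls, precisions) if r >= threshold]
            total += max(candidates) if candidates else 0.0
        return total / 11
    raise ValueError("mode must be 'area' or '11points'")


ap_area = average_precision([0.5, 1.0], [1.0, 0.5], mode="area")
ap_11 = average_precision([0.5, 1.0], [1.0, 0.5], mode="11points")
```

The two modes generally give slightly different values for the same curve, which is why the evaluation mode must match the benchmark being reproduced.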
Calculate the ious between each bbox of bboxes1 and bboxes2.
Parameters:
bboxes1 (ndarray) – Shape (n, 4)
bboxes2 (ndarray) – Shape (k, 4)
mode (str) – IOU (intersection over union) or IOF (intersection
over foreground)
use_legacy_coordinate (bool) – Whether to use the coordinate system of
mmdet v1.x, which means width and height should be
calculated as ‘x2 - x1 + 1’ and ‘y2 - y1 + 1’ respectively.
Note when function is used in VOCDataset, it should be
True to align with the official implementation
http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCdevkit_18-May-2011.tar
Default: False.
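A pure-Python sketch of the overlap computation, showing how mode and use_legacy_coordinate change the result. It assumes boxes are (x1, y1, x2, y2) and that the foreground in IoF is the box from bboxes1; the eps guard against division by zero is an added assumption.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)


def bbox_overlaps(bboxes1: Sequence[Box], bboxes2: Sequence[Box],
                  mode: str = "iou", use_legacy_coordinate: bool = False,
                  eps: float = 1e-6) -> List[List[float]]:
    """Return an n x k nested list of overlaps between two box sets."""
    if mode not in ("iou", "iof"):
        raise ValueError("mode must be 'iou' or 'iof'")
    extra = 1.0 if use_legacy_coordinate else 0.0
    overlaps = []
    for b1 in bboxes1:
        area1 = (b1[2] - b1[0] + extra) * (b1[3] - b1[1] + extra)
        row = []
        for b2 in bboxes2:
            area2 = (b2[2] - b2[0] + extra) * (b2[3] - b2[1] + extra)
            inter_w = max(min(b1[2], b2[2]) - max(b1[0], b2[0]) + extra, 0.0)
            inter_h = max(min(b1[3], b2[3]) - max(b1[1], b2[1]) + extra, 0.0)
            inter = inter_w * inter_h
            denom = area1 + area2 - inter if mode == "iou" else area1
            row.append(inter / max(denom, eps))
        overlaps.append(row)
    return overlaps


a = [(0, 0, 10, 10)]
b = [(5, 5, 15, 15)]
iou = bbox_overlaps(a, b)
iof = bbox_overlaps(a, b, mode="iof")
legacy_iou = bbox_overlaps(a, b, use_legacy_coordinate=True)
```

The legacy convention treats coordinates as inclusive pixel indices, so every width and height grows by one and the IoU shifts slightly.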
det_results (list[list]) – [[cls1_det, cls2_det, …], …].
The outer list indicates images, and the inner list indicates
per-class detected bboxes.
annotations (list[dict]) –
Ground truth annotations where each item of
the list indicates an image. Keys of annotations are:
bboxes: numpy array of shape (n, 4)
labels: numpy array of shape (n, )
bboxes_ignore (optional): numpy array of shape (k, 4)
labels_ignore (optional): numpy array of shape (k, )
scale_ranges (list[tuple] | None) – Range of scales to be evaluated,
in the format [(min1, max1), (min2, max2), …]. A range of
(32, 64) means the area range between (32**2, 64**2).
Defaults to None.
iou_thr (float) – IoU threshold to be considered as matched.
Defaults to 0.5.
ioa_thr (float | None) – IoA threshold to be considered as matched,
which is only used in OpenImages evaluation. Defaults to None.
dataset (list[str] | str | None) – Dataset name or dataset classes,
there are minor differences in metrics for different datasets, e.g.
“voc”, “imagenet_det”, etc. Defaults to None.
logger (logging.Logger | str | None) – The way to print the mAP
summary. See mmengine.logging.print_log() for details.
Defaults to None.
tpfp_fn (callable | None) – The function used to determine true/
false positives. If None, tpfp_default() is used as default
unless dataset is ‘det’ or ‘vid’ (tpfp_imagenet() in this
case). If it is given as a function, then this function is used
to evaluate tp & fp. Default None.
nproc (int) – Processes used for computing TP and FP.
Defaults to 4.
use_legacy_coordinate (bool) – Whether to use the coordinate system of
mmdet v1.x, which means width and height should be
calculated as ‘x2 - x1 + 1’ and ‘y2 - y1 + 1’ respectively.
Defaults to False.
use_group_of (bool) – Whether to use group of when calculating TP and FP,
which is only used in OpenImages evaluation. Defaults to False.
eval_mode (str) – ‘area’ or ‘11points’, ‘area’ means calculating the
area under precision-recall curve, ‘11points’ means calculating
the average precision of recalls at [0, 0.1, …, 1],
PASCAL VOC2007 uses 11points as default evaluate mode, while
others are ‘area’. Defaults to ‘area’.
logger (logging.Logger | str | None) – The way to print the recall
summary. See mmengine.logging.print_log() for details.
Default: None.
use_legacy_coordinate (bool) – Whether to use the coordinate system
of mmdet v1.x, where 1 is added to both height and width,
which means w and h should be
computed as ‘x2 - x1 + 1’ and ‘y2 - y1 + 1’. Default: False.
evalInstanceLevelSemanticLabeling.evaluateImgLists. Support loading
groundtruth image from file backend.
Parameters:
prediction_list (list) – A list of prediction txt files.
groundtruth_list (list) – A list of groundtruth image files.
args – A global object setting in
Evaluate the metrics of Panoptic Segmentation with multithreading.
Same as the function with the same name in panopticapi.
Parameters:
matched_annotations_list (list) – The matched annotation list. Each
element is a tuple of annotations of the same image with the
format (gt_anns, pred_anns).
gt_folder (str) – The path of the ground truth images.
pred_folder (str) – The path of the prediction images.
categories (str) – The categories of the dataset.
backend_args (object) – The file client of the dataset. If None,
the backend will be set to local.
nproc (int) – Number of processes for panoptic quality computing.
Defaults to 32. When nproc exceeds the number of cpu cores,
the number of cpu cores is used.
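The nproc rule above amounts to clamping the requested worker count to the available cores and then dividing the per-image work among the workers. The sketch below uses only the standard library; both function names and the contiguous chunking are illustrative assumptions.

```python
import os
from typing import List, Optional, Sequence


def effective_nproc(requested: int = 32, cpu_count: Optional[int] = None) -> int:
    """Clamp the requested process count to the number of CPU cores."""
    available = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(requested, available))


def split_for_workers(items: Sequence, nproc: int) -> List[list]:
    """Split items into nproc contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), nproc)
    chunks, start = [], 0
    for worker in range(nproc):
        size = base + (1 if worker < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


capped = effective_nproc(32, cpu_count=8)
unchanged = effective_nproc(4, cpu_count=8)
chunks = split_for_workers(list(range(7)), 3)
```

Each chunk would then be handed to one worker process, and the per-chunk statistics summed at the end.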
The metric first processes each batch of data_samples and predictions,
and appends the processed results to the results list. Then it
collects all results together from all ranks if distributed training
is used. Finally, it computes the metrics of the entire dataset.
A subclass of BaseVideoMetric should assign a meaningful value
to the class attribute default_prefix. See the argument prefix for
details.
Save the generated captions and transform into coco format.
Calling COCO API for caption metrics.
Parameters:
ann_file (str) – Path to the COCO-format caption ground truth
json file, loaded for evaluation.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Should be modified according to the
retrieval_type for unambiguous results. Defaults to TR.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (Any) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from
the model.
outfile_prefix (str) – The prefix of txt and png files. The txt and
png files will be saved in a directory whose path is
“outfile_prefix.results/”.
seg_prefix (str, optional) – Path to the directory which contains the
cityscapes instance segmentation masks. It is required for
training and validation. It can be None when running inference on
the test dataset. Defaults to None.
format_only (bool) – Format the output results without performing
evaluation. It is useful when you want to format the result
to a specific format and submit it to the test server.
Defaults to False.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Defaults to None.
dump_matches (bool) – Whether to dump the matches.json file during
evaluation. Defaults to False.
file_client_args (dict, optional) – Arguments to instantiate the
corresponding backend in mmdet <= 3.0.0rc6. Defaults to None.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend. Defaults to None.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of data samples that
contain annotations and predictions.
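The process-then-compute pattern described above can be sketched with a small stand-alone class. CountingMetric is hypothetical and does not inherit from any library base class; the sample keys and the "prefix/name" form of the metric names are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


class CountingMetric:
    """Toy metric: compares the number of predicted and ground truth labels."""

    default_prefix = "toy"

    def __init__(self, prefix: Optional[str] = None) -> None:
        self.prefix = prefix or self.default_prefix
        self.results: List[Tuple[int, int]] = []

    def process(self, data_batch: Any, data_samples: Sequence[dict]) -> None:
        """Reduce each sample to what is needed later and store it."""
        for sample in data_samples:
            self.results.append(
                (len(sample["pred_labels"]), len(sample["gt_labels"])))

    def compute_metrics(self) -> Dict[str, float]:
        """Called once, after all batches have been processed."""
        num_pred = sum(pred for pred, _ in self.results)
        num_gt = sum(gt for _, gt in self.results)
        return {f"{self.prefix}/pred_per_gt": num_pred / max(num_gt, 1)}


metric = CountingMetric()
metric.process(None, [{"pred_labels": [1, 2, 3], "gt_labels": [1, 2]}])
metric.process(None, [{"pred_labels": [4], "gt_labels": [4, 5]}])
scores = metric.compute_metrics()
```

Keeping only a compact summary per sample in self.results keeps memory use low and makes the results cheap to gather across ranks.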
Evaluate AR, AP, and mAP for detection tasks including proposal/box
detection and instance segmentation. Please refer to
https://cocodataset.org/#detection-eval for more details.
Parameters:
ann_file (str, optional) – Path to the coco format annotation file.
If not specified, ground truth annotations from the dataset will
be converted to coco format. Defaults to None.
metric (str | List[str]) – Metrics to be evaluated. Valid metrics
include ‘bbox’, ‘segm’, ‘proposal’, and ‘proposal_fast’.
Defaults to ‘bbox’.
classwise (bool) – Whether to evaluate the metric class-wise.
Defaults to False.
proposal_nums (Sequence[int]) – Numbers of proposals to be evaluated.
Defaults to (100, 300, 1000).
iou_thrs (float | List[float], optional) – IoU threshold to compute AP
and AR. If not specified, IoUs from 0.5 to 0.95 will be used.
Defaults to None.
metric_items (List[str], optional) – Metric result names to be
recorded in the evaluation result. Defaults to None.
format_only (bool) – Format the output results without performing
evaluation. It is useful when you want to format the result
to a specific format and submit it to the test server.
Defaults to False.
outfile_prefix (str, optional) – The prefix of json files. It includes
the file path and the prefix of filename, e.g., “a/b/prefix”.
If not specified, a temp file will be created. Defaults to None.
file_client_args (dict, optional) – Arguments to instantiate the
corresponding backend in mmdet <= 3.0.0rc6. Defaults to None.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend. Defaults to None.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Defaults to None.
sort_categories (bool) – Whether to sort categories in annotations. Only
used for Objects365V1Dataset. Defaults to False.
use_mp_eval (bool) – Whether to use multi-processing evaluation
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of data samples that
contain annotations and predictions.
Dump the detection results to a COCO style json file.
There are 3 types of results: proposals, bbox predictions, mask
predictions, and they have different data types. This method will
automatically recognize the type, and dump them to json files.
Parameters:
results (Sequence[dict]) – Testing results of the
dataset.
outfile_prefix (str) – The filename prefix of the json files. If the
prefix is “somepath/xxx”, the json files will be named
“somepath/xxx.bbox.json”, “somepath/xxx.segm.json”,
“somepath/xxx.proposal.json”.
Returns:
Possible keys are “bbox”, “segm”, “proposal”, and
values are corresponding filenames.
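The file naming rule can be sketched as follows. Unlike the method described above, this hypothetical helper does not recognize the result type automatically; it assumes the results are already grouped by type and only illustrates how outfile_prefix maps to filenames and to the returned dict.

```python
import json
import os
import tempfile
from typing import Dict, List


def dump_results(results_by_type: Dict[str, List[dict]],
                 outfile_prefix: str) -> Dict[str, str]:
    """Write one json file per result type and return {type: filename}."""
    result_files = {}
    for kind in ("bbox", "segm", "proposal"):
        if kind in results_by_type:
            path = f"{outfile_prefix}.{kind}.json"
            with open(path, "w") as f:
                json.dump(results_by_type[kind], f)
            result_files[kind] = path
    return result_files


# Illustrative COCO-style entry; the values are made up for the example.
detection = {"image_id": 1, "bbox": [10.0, 20.0, 30.0, 40.0],
             "score": 0.9, "category_id": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    files = dump_results({"bbox": [detection], "segm": []},
                         os.path.join(tmp_dir, "xxx"))
    names = {kind: os.path.basename(path) for kind, path in files.items()}
    with open(files["bbox"]) as f:
        reloaded = json.load(f)
```

Only the types that are present produce a file, so the returned dict also tells the caller which evaluations can be run.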
Separated COCO and Occluded COCO are automatically generated subsets of
COCO val dataset, collecting separated objects and partially occluded
objects for a large variety of categories. In this way, we define
occlusion into two major categories: separated and partially occluded.
Separation: target object segmentation mask is separated into distinct
regions by the occluder.
Partial Occlusion: target object is partially occluded but the
segmentation mask is connected.
These two new scalable real-image datasets are to benchmark a model’s
capability to detect occluded objects of 80 common categories.
Please cite the paper if you use this dataset:
@article{zhan2022triocc,
title={A Tri-Layer Plugin to Improve Occluded Detection},
author={Zhan, Guanqi and Xie, Weidi and Zisserman, Andrew},
journal={British Machine Vision Conference},
year={2022}
}
Parameters:
occluded_ann (str) – Path to the occluded coco annotation file.
separated_ann (str) – Path to the separated coco annotation file.
score_thr (float) – Score threshold of the detection masks.
Defaults to 0.3.
iou_thr (float) – IoU threshold for the recall calculation.
Defaults to 0.75.
metric (str | List[str]) – Metrics to be evaluated. Valid metrics
include ‘bbox’, ‘segm’, ‘proposal’, and ‘proposal_fast’.
Defaults to ‘bbox’.
ann_file (str, optional) – Path to the coco format annotation file.
If not specified, ground truth annotations from the dataset will
be converted to coco format. Defaults to None.
seg_prefix (str, optional) – Path to the directory which contains the
coco panoptic segmentation mask. It should be specified when
evaluating. Defaults to None.
classwise (bool) – Whether to evaluate the metric class-wise.
Defaults to False.
outfile_prefix (str, optional) – The prefix of json files. It includes
the file path and the prefix of filename, e.g., “a/b/prefix”.
If not specified, a temp file will be created.
It should be specified when format_only is True. Defaults to None.
format_only (bool) – Format the output results without performing
evaluation. It is useful when you want to format the result
to a specific format and submit it to the test server.
Defaults to False.
nproc (int) – Number of processes for panoptic quality computing.
Defaults to 32. When nproc exceeds the number of cpu cores,
the number of cpu cores is used.
file_client_args (dict, optional) – Arguments to instantiate the
corresponding backend in mmdet <= 3.0.0rc6. Defaults to None.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend. Defaults to None.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Defaults to None.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of data samples that
contain annotations and predictions.
Evaluate AR, AP, and mAP for detection tasks including proposal/box
detection and instance segmentation. Please refer to
https://cocodataset.org/#detection-eval
for more details.
Evaluate Average Precision (AP), Miss Rate (MR) and Jaccard Index (JI)
for detection tasks.
Parameters:
ann_file (str) – Path to the annotation file.
metric (str | List[str]) – Metrics to be evaluated. Valid metrics
include ‘AP’, ‘MR’ and ‘JI’. Defaults to ‘AP’.
format_only (bool) – Format the output results without performing
evaluation. It is useful when you want to format the result
to a specific format and submit it to the test server.
Defaults to False.
outfile_prefix (str, optional) – The prefix of json files. It includes
the file path and the prefix of filename, e.g., “a/b/prefix”.
If not specified, a temp file will be created. Defaults to None.
file_client_args (dict, optional) – Arguments to instantiate the
corresponding backend in mmdet <= 3.0.0rc6. Defaults to None.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend. Defaults to None.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Defaults to None.
eval_mode (int) – Evaluation mode. Valid modes are
0 (body box only), 1 (head box only) and 2 (both).
Defaults to 0.
iou_thres (float) – IoU threshold. Defaults to 0.5.
compare_matching_method (str, optional) – Matching method used to compare
the detection results with the ground truth when computing ‘AP’
and ‘MR’. Valid methods are VOC and None (CALTECH). Defaults to
None.
mr_ref (str) – Reference points used to calculate MR. Valid
values are CALTECH_-2 and CALTECH_-4. Defaults to CALTECH_-2.
num_ji_process (int) – The number of processes used to evaluate JI.
Defaults to 10.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of data samples that
contain annotations and predictions.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (Any) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from
the model.
Dump the detection results to a COCO style json file.
There are 3 types of results: proposals, bbox predictions, mask
predictions, and they have different data types. This method will
automatically recognize the type, and dump them to json files.
Parameters:
results (Sequence[dict]) – Testing results of the
dataset.
Returns:
Possible keys are “bbox”, “segm”, “proposal”, and
values are corresponding filenames.
Dump model predictions to a pickle file for offline evaluation.
Different from DumpResults in MMEngine, it compresses instance
segmentation masks into RLE format.
Parameters:
out_file_path (str) – Path of the dumped file. Must end with ‘.pkl’
or ‘.pickle’.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
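To give a feel for why RLE shrinks instance masks before pickling, here is a sketch of the uncompressed COCO-style run-length scheme: pixels are read in column-major order and the counts alternate between runs of 0 and runs of 1, starting with 0. This is an assumption-laden illustration; the dumped files are expected to use the compressed string form produced by a dedicated library, which this sketch does not reproduce.

```python
from typing import Dict, List

Mask = List[List[int]]  # rows of 0/1 values


def rle_encode(mask: Mask) -> Dict[str, list]:
    """Encode a binary mask as alternating run lengths (column-major)."""
    h, w = len(mask), len(mask[0])
    flat = [mask[r][c] for c in range(w) for r in range(h)]
    counts, current, run = [], 0, 0
    for value in flat:
        if value == current:
            run += 1
        else:
            counts.append(run)  # close the run, possibly of length 0
            current, run = value, 1
    counts.append(run)
    return {"size": [h, w], "counts": counts}


def rle_decode(rle: Dict[str, list]) -> Mask:
    """Invert rle_encode."""
    h, w = rle["size"]
    flat, value = [], 0
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    return [[flat[c * h + r] for c in range(w)] for r in range(h)]


mask = [[0, 1], [1, 1]]
encoded = rle_encode(mask)
leading_one = rle_encode([[1, 0], [0, 0]])
restored = rle_decode(encoded)
```

A mask that starts with a foreground pixel gets a leading zero-length run, so the decoder can always assume the first count refers to background.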
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (Any) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from
the model.
output_dir (str) – The root directory for proposals_file.
Defaults to ‘’.
proposals_file (str) – Proposals file path. Defaults to ‘proposals.pkl’.
num_max_proposals (int, optional) – Maximum number of proposals to dump.
If not specified, all proposals will be dumped.
file_client_args (dict, optional) – Arguments to instantiate the
corresponding backend in mmdet <= 3.0.0rc6. Defaults to None.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend. Defaults to None.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Defaults to None.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of data samples that
contain annotations and predictions.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of data samples that
contain annotations and predictions.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples – A batch of data samples that
ann_file (str, optional) – Path to the coco format annotation file.
If not specified, ground truth annotations from the dataset will
be converted to coco format. Defaults to None.
metric (str | List[str]) – Metrics to be evaluated. Valid metrics
include ‘bbox’, ‘segm’, ‘proposal’, and ‘proposal_fast’.
Defaults to ‘bbox’.
classwise (bool) – Whether to evaluate the metric class-wise.
Defaults to False.
proposal_nums (Sequence[int]) – Numbers of proposals to be evaluated.
Defaults to (100, 300, 1000).
iou_thrs (float | List[float], optional) – IoU threshold to compute AP
and AR. If not specified, IoUs from 0.5 to 0.95 will be used.
Defaults to None.
metric_items (List[str], optional) – Metric result names to be
recorded in the evaluation result. Defaults to None.
format_only (bool) – Format the output results without performing
evaluation. It is useful when you want to format the result
to a specific format and submit it to the test server.
Defaults to False.
outfile_prefix (str, optional) – The prefix of json files. It includes
the file path and the prefix of filename, e.g., “a/b/prefix”.
If not specified, a temp file will be created. Defaults to None.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Defaults to None.
file_client_args (dict, optional) – Arguments to instantiate the
corresponding backend in mmdet <= 3.0.0rc6. Defaults to None.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend. Defaults to None.
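As a rough sketch of how the iou_thrs default could be resolved: the text only states the range 0.5 to 0.95, so the step of 0.05 (the usual COCO convention) and the helper name resolve_iou_thrs are assumptions of this example.

```python
from typing import List, Optional, Sequence, Union


def resolve_iou_thrs(
        iou_thrs: Optional[Union[float, Sequence[float]]] = None
) -> List[float]:
    """Return the list of IoU thresholds to evaluate (hypothetical helper)."""
    if iou_thrs is None:
        # 0.50, 0.55, ..., 0.95: ten thresholds, step 0.05 assumed.
        return [round(0.5 + 0.05 * i, 2) for i in range(10)]
    if isinstance(iou_thrs, (int, float)):
        # A single threshold is wrapped into a list.
        return [float(iou_thrs)]
    return [float(t) for t in iou_thrs]


default_thrs = resolve_iou_thrs()
single_thr = resolve_iou_thrs(0.75)
```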
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of data samples that
contain annotations and predictions.
metric (str | list[str]) – Metrics to be evaluated. Options are
‘HOTA’, ‘CLEAR’, ‘Identity’.
Defaults to [‘HOTA’, ‘CLEAR’, ‘Identity’].
outfile_prefix (str, optional) – Path to save the formatted results.
Defaults to None.
track_iou_thr (float) – IoU threshold for tracking evaluation.
Defaults to 0.5.
benchmark (str) – Benchmark to be evaluated. Defaults to ‘MOT17’.
format_only (bool) – If True, only format the results to the
official format without performing evaluation. Defaults to False.
postprocess_tracklet_cfg (List[dict], optional) – Configs for tracklet
postprocessing methods. InterpolateTracklets is supported.
Defaults to [].
- InterpolateTracklets:
min_num_frames (int, optional): The minimum length of a
track that will be interpolated. Defaults to 5.
max_num_frames (int, optional): The maximum disconnected
length in a track. Defaults to 20.
use_gsi (bool, optional): Whether to use the GSI
(Gaussian-smoothed interpolation) method. Defaults to False.
smooth_tau (int, optional): Smoothing parameter in GSI.
Defaults to 10.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Default: None
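The InterpolateTracklets options above can be illustrated with a small linear-interpolation sketch. It is not the library's code: the function name, the tuple layout of a detection, and the exact comparison rules for min_num_frames and max_num_frames are assumptions, and the GSI smoothing step is left out.

```python
from typing import List, Tuple

# One detection: (frame_id, x1, y1, x2, y2). This layout is an assumption.
Detection = Tuple[int, float, float, float, float]


def interpolate_track(track: List[Detection],
                      min_num_frames: int = 5,
                      max_num_frames: int = 20) -> List[Detection]:
    """Linearly fill missing frames inside one track."""
    track = sorted(track)
    # Assumption: tracks shorter than min_num_frames are left untouched.
    if len(track) < min_num_frames:
        return track
    filled = [track[0]]
    for left, right in zip(track, track[1:]):
        gap = right[0] - left[0]
        # Assumption: only bridge gaps shorter than max_num_frames.
        if 1 < gap < max_num_frames:
            for frame_id in range(left[0] + 1, right[0]):
                t = (frame_id - left[0]) / gap
                box = tuple(a + t * (b - a)
                            for a, b in zip(left[1:], right[1:]))
                filled.append((frame_id, *box))
        filled.append(right)
    return filled


# Frames 5, 6 and 7 are missing and get filled in.
track = [(f, float(f), 0.0, float(f) + 10.0, 10.0) for f in (1, 2, 3, 4, 8)]
dense = interpolate_track(track)
short = interpolate_track(track[:2])
```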
iou_thrs (float or List[float]) – IoU threshold. Defaults to 0.5.
ioa_thrs (float or List[float]) – IoA threshold. Defaults to 0.5.
scale_ranges (List[tuple], optional) – Scale ranges for evaluating
mAP. If not specified, all bounding boxes would be included in
evaluation. Defaults to None
use_group_of (bool) – Whether to consider group-of ground truth bboxes
during evaluation. Defaults to True.
get_supercategory (bool) – Whether to get parent class of the
current class. Default: True.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Defaults to None.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of data samples that
contain annotations and predictions.
metric (str | list[str]) – Metrics to be evaluated.
Default value is mAP.
metric_options (dict, optional) – Options for calculating metrics.
Allowed keys are ‘rank_list’ and ‘max_rank’. Defaults to None.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Default: None
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (Any) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from
the model.
iou_metrics (list[str] | str) – Metrics to be calculated. Options
include ‘mIoU’, ‘mDice’ and ‘mFscore’.
beta (int) – Determines the weight of recall in the combined score.
Default: 1.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
output_dir (str) – The directory for output prediction. Defaults to
None.
format_only (bool) – Only format the results for submission without
performing evaluation. It is useful when you want to save the result
to a specific format and submit it to the test server.
Defaults to False.
backend_args (dict, optional) – Arguments to instantiate the
corresponding backend. Defaults to None.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Defaults to None.
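To make the role of beta concrete, the per-class scores behind ‘mIoU’, ‘mDice’ and ‘mFscore’ can be sketched from pixel areas as below. The function name and its arguments are hypothetical, the standard F-beta formula is assumed, and the "m" variants would be the mean of these values over classes.

```python
from typing import Dict


def seg_scores(intersect: float, pred_area: float, label_area: float,
               beta: int = 1) -> Dict[str, float]:
    """Per-class IoU, Dice and F-score from pixel areas (non-zero assumed)."""
    union = pred_area + label_area - intersect
    iou = intersect / union
    dice = 2 * intersect / (pred_area + label_area)
    precision = intersect / pred_area
    recall = intersect / label_area
    # beta > 1 weights recall more heavily, beta < 1 favours precision.
    fscore = ((1 + beta ** 2) * precision * recall /
              (beta ** 2 * precision + recall))
    return {"IoU": iou, "Dice": dice, "Fscore": fscore}


scores_b1 = seg_scores(intersect=50, pred_area=100, label_area=50)
scores_b2 = seg_scores(intersect=50, pred_area=100, label_area=50, beta=2)
```

With beta equal to 1 the F-score coincides with Dice; a larger beta moves the score toward recall.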
iou_thrs (float or List[float]) – IoU threshold. Defaults to 0.5.
scale_ranges (List[tuple], optional) – Scale ranges for evaluating
mAP. If not specified, all bounding boxes would be included in
evaluation. Defaults to None.
metric (str | list[str]) – Metrics to be evaluated. Options are
‘mAP’ and ‘recall’. If a list is given, only the first entry in the list
is used for evaluation.
proposal_nums (Sequence[int]) – Proposal number used for evaluating
recalls, such as recall@100, recall@1000.
Default: (100, 300, 1000).
eval_mode (str) – ‘area’ or ‘11points’, ‘area’ means calculating the
area under precision-recall curve, ‘11points’ means calculating
the average precision of recalls at [0, 0.1, …, 1].
The PASCAL VOC2007 defaults to use ‘11points’, while PASCAL
VOC2012 defaults to use ‘area’.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Defaults to None.
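The difference between the two eval_mode settings can be sketched for a single class as follows. This is an illustrative implementation under the usual PASCAL VOC definitions, not the library's code; the function name and the list-based inputs are assumptions.

```python
from typing import List


def average_precision(recalls: List[float], precisions: List[float],
                      mode: str = "area") -> float:
    """AP for one class; recalls must be sorted in ascending order."""
    if mode == "area":
        # Add sentinels, then make precision monotonically non-increasing.
        rec = [0.0] + list(recalls) + [1.0]
        prec = [0.0] + list(precisions) + [0.0]
        for i in range(len(prec) - 2, -1, -1):
            prec[i] = max(prec[i], prec[i + 1])
        # Sum the rectangle areas between consecutive recall values.
        return sum((rec[i + 1] - rec[i]) * prec[i + 1]
                   for i in range(len(rec) - 1))
    if mode == "11points":
        total = 0.0
        for k in range(11):
            thr = k / 10
            # Best precision among points whose recall reaches the threshold.
            candidates = [p for r, p in zip(recalls, precisions) if r >= thr]
            total += max(candidates) if candidates else 0.0
        return total / 11
    raise ValueError(f"Unknown mode: {mode}")


ap_area = average_precision([0.5, 1.0], [1.0, 0.5], mode="area")
ap_11 = average_precision([0.5, 1.0], [1.0, 0.5], mode="11points")
```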
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (dict) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of data samples that
contain annotations and predictions.
metric (str | list[str]) – Metrics to be evaluated.
Default value is youtube_vis_ap.
metric_items (List[str], optional) – Metric result names to be
recorded in the evaluation result. Defaults to None.
outfile_prefix (str | None) – The prefix of json files. It includes
the file path and the prefix of filename, e.g., “a/b/prefix”.
If not specified, a temp file will be created. Defaults to None.
collect_device (str) – Device name used for collecting results from
different ranks during distributed training. Must be ‘cpu’ or
‘gpu’. Defaults to ‘cpu’.
prefix (str, optional) – The prefix that will be added in the metric
names to disambiguate homonymous metrics of different evaluators.
If prefix is not provided in the argument, self.default_prefix
will be used instead. Default: None
format_only (bool) – If True, only format the results to the
official format without performing evaluation. Defaults to False.
Process one batch of data samples and predictions. The processed
results should be stored in self.results, which will be used to
compute the metrics when all batches have been processed.
Parameters:
data_batch (Any) – A batch of data from the dataloader.
data_samples (Sequence[dict]) – A batch of outputs from
the model.
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
This has an effect only on certain modules. See the documentation of
particular modules for details of their behaviors in training/evaluation
mode, i.e., whether they are affected, e.g. Dropout, BatchNorm,
etc.
Parameters:
mode (bool) – whether to set training mode (True) or evaluation
mode (False). Default: True.
arch (str) – Architecture of CSPNeXt, from {P5, P6}.
Defaults to P5.
expand_ratio (float) – Ratio to adjust the number of channels of the
hidden layer. Defaults to 0.5.
deepen_factor (float) – Depth multiplier, multiply number of
blocks in CSP layer by this amount. Defaults to 1.0.
widen_factor (float) – Width multiplier, multiply number of
channels in each layer by this amount. Defaults to 1.0.
out_indices (Sequence[int]) – Output from which stages.
Defaults to (2, 3, 4).
frozen_stages (int) – Stages to be frozen (stop grad and set eval
mode). -1 means not freezing any parameters. Defaults to -1.
use_depthwise (bool) – Whether to use depthwise separable convolution.
Defaults to False.
arch_ovewrite (list) – Overwrite default arch settings.
Defaults to None.
spp_kernel_sizes (tuple[int]) – Sequence of kernel sizes of SPP
layers. Defaults to (5, 9, 13).
channel_attention (bool) – Whether to add channel attention in each
stage. Defaults to True.
conv_cfg (ConfigDict or dict, optional) – Config dict for
convolution layer. Defaults to None.
norm_cfg (ConfigDict or dict) – Dictionary to construct and
config norm layer. Defaults to dict(type=’BN’, requires_grad=True).
act_cfg (ConfigDict or dict) – Config dict for activation layer.
Defaults to dict(type=’SiLU’).
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only.
init_cfg (ConfigDict or dict or list[dict] or list[ConfigDict]) – Initialization config dict.
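The effect of widen_factor and deepen_factor can be sketched as simple multipliers on a stage's base settings. The helper name, the truncation and rounding rules, and the floor of one block per stage are assumptions of this sketch rather than documented behavior.

```python
from typing import Tuple


def scale_stage(base_channels: int, base_blocks: int,
                widen_factor: float = 1.0,
                deepen_factor: float = 1.0) -> Tuple[int, int]:
    """Scale one stage's channel count and block count (hypothetical)."""
    channels = int(base_channels * widen_factor)
    # Assumption: keep at least one block per stage.
    blocks = max(round(base_blocks * deepen_factor), 1)
    return channels, blocks


small = scale_stage(256, 6, widen_factor=0.5, deepen_factor=0.33)
default = scale_stage(64, 3)
```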
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
This has an effect only on certain modules. See the documentation of
particular modules for details of their behaviors in training/evaluation
mode, i.e., whether they are affected, e.g. Dropout, BatchNorm,
etc.
Parameters:
mode (bool) – whether to set training mode (True) or evaluation
mode (False). Default: True.
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only.
pretrained (str, optional) – model pretrained path. Default: None
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
In Darknet backbone, ConvLayer is usually followed by ResBlock. This
function builds that pair. The Conv layers always have 3x3 filters with
stride=2. The number of filters in the Conv layer is the same as the
out channels of the ResBlock.
Parameters:
in_channels (int) – The number of input channels.
out_channels (int) – The number of output channels.
res_repeat (int) – The number of ResBlocks.
conv_cfg (dict) – Config dict for convolution layer. Default: None.
norm_cfg (dict) – Dictionary to construct and config norm layer.
Default: dict(type=’BN’, requires_grad=True)
This has an effect only on certain modules. See the documentation of
particular modules for details of their behaviors in training/evaluation
mode, i.e., whether they are affected, e.g. Dropout, BatchNorm,
etc.
Parameters:
mode (bool) – whether to set training mode (True) or evaluation
mode (False). Default: True.
stage_with_sac (list) – Which stage to use sac. Default: (False, False,
False, False).
rfp_inplanes (int, optional) – The number of channels from RFP.
Default: None. If specified, an additional conv layer will be
added for rfp_feat. Otherwise, the structure is the same as
base class.
output_img (bool) – If True, the input image will be inserted into
the starting position of output. Default: False.
arch (str) – Architecture of efficientnet. Defaults to b0.
out_indices (Sequence[int]) – Output from which stages.
Defaults to (6, ).
frozen_stages (int) – Stages to be frozen (all param fixed).
Defaults to 0, which means not freezing any parameters.
conv_cfg (dict) – Config dict for convolution layer.
Defaults to None, which means using conv2d.
norm_cfg (dict) – Config dict for normalization layer.
Defaults to dict(type=’BN’).
act_cfg (dict) – Config dict for activation layer.
Defaults to dict(type=’Swish’).
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only. Defaults to False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some
memory while slowing down the training speed. Defaults to False.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
This has an effect only on certain modules. See the documentation of
particular modules for details of their behaviors in training/evaluation
mode, i.e., whether they are affected, e.g. Dropout, BatchNorm,
etc.
Parameters:
mode (bool) – whether to set training mode (True) or evaluation
mode (False). Default: True.
Detailed configuration for each stage of HRNet.
There must be 4 stages, and the configuration for each stage must have
5 keys:
num_modules (int): The number of HRModule in this stage.
num_branches (int): The number of branches in the HRModule.
block (str): The type of convolution block.
num_blocks (tuple): The number of blocks in each branch.
The length must be equal to num_branches.
num_channels (tuple): The number of channels in each branch.
The length must be equal to num_branches.
in_channels (int) – Number of input image channels. Default: 3.
conv_cfg (dict) – Dictionary to construct and config conv layer.
norm_cfg (dict) – Dictionary to construct and config norm layer.
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only. Default: True.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some
memory while slowing down the training speed. Default: False.
zero_init_residual (bool) – Whether to use zero init for last norm layer
in resblocks to let them behave as identity. Default: False.
multiscale_output (bool) – Whether to output multi-level features
produced by multiple branches. If False, only the first level
feature will be output. Default: True.
pretrained (str, optional) – Model pretrained path. Default: None.
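The constraints on the per-stage HRNet configuration (4 stages, 5 keys, tuple lengths equal to num_branches) can be checked with a small validator. The function name, the stage1 to stage4 key names and the concrete values below are illustrative assumptions.

```python
REQUIRED_KEYS = {"num_modules", "num_branches", "block",
                 "num_blocks", "num_channels"}


def check_hrnet_extra(extra: dict) -> None:
    """Raise ValueError if the stage configuration breaks the stated rules."""
    if len(extra) != 4:
        raise ValueError("There must be exactly 4 stages")
    for name, cfg in extra.items():
        missing = REQUIRED_KEYS - set(cfg)
        if missing:
            raise ValueError(f"{name} is missing keys: {sorted(missing)}")
        for key in ("num_blocks", "num_channels"):
            if len(cfg[key]) != cfg["num_branches"]:
                raise ValueError(
                    f"{name}: len({key}) must equal num_branches")


# Illustrative values only.
extra = dict(
    stage1=dict(num_modules=1, num_branches=1, block="BOTTLENECK",
                num_blocks=(4,), num_channels=(64,)),
    stage2=dict(num_modules=1, num_branches=2, block="BASIC",
                num_blocks=(4, 4), num_channels=(32, 64)),
    stage3=dict(num_modules=4, num_branches=3, block="BASIC",
                num_blocks=(4, 4, 4), num_channels=(32, 64, 128)),
    stage4=dict(num_modules=3, num_branches=4, block="BASIC",
                num_blocks=(4, 4, 4, 4), num_channels=(32, 64, 128, 256)),
)
check_hrnet_extra(extra)
```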
widen_factor (float) – Width multiplier, multiply number of
channels in each layer by this amount. Default: 1.0.
out_indices (Sequence[int], optional) – Output from which stages.
Default: (1, 2, 4, 7).
frozen_stages (int) – Stages to be frozen (all param fixed).
Default: -1, which means not freezing any parameters.
conv_cfg (dict, optional) – Config dict for convolution layer.
Default: None, which means using conv2d.
norm_cfg (dict) – Config dict for normalization layer.
Default: dict(type=’BN’).
act_cfg (dict) – Config dict for activation layer.
Default: dict(type=’ReLU6’).
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only. Default: False.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some
memory while slowing down the training speed. Default: False.
pretrained (str, optional) – model pretrained path. Default: None
use_abs_pos_embed (bool) – If True, add absolute position embedding to
the patch embedding. Default: True.
use_conv_ffn (bool) – If True, use Convolutional FFN to replace FFN.
Default: False.
act_cfg (dict) – The activation config for FFNs.
Default: dict(type=’GELU’).
norm_cfg (dict) – Config dict for normalization layer.
Default: dict(type=’LN’).
pretrained (str, optional) – model pretrained path. Default: None.
convert_weights (bool) – The flag indicates whether the
pre-trained model is from the original repo. We may need
to convert some keys to make it compatible.
Default: True.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
wm (float): quantization parameter to quantize the width
depth (int): depth of the backbone
group_w (int): width of group
bot_mul (float): bottleneck ratio, i.e. expansion of bottleneck.
strides (Sequence[int]) – Strides of the first block of each stage.
base_channels (int) – Base channels after stem layer.
in_channels (int) – Number of input image channels. Default: 3.
dilations (Sequence[int]) – Dilation of each stage.
out_indices (Sequence[int]) – Output from which stages.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two
layer is the 3x3 conv layer, otherwise the stride-two layer is
the first 1x1 conv layer.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means
not freezing any parameters.
norm_cfg (dict) – dictionary to construct and config norm layer.
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some
memory while slowing down the training speed.
zero_init_residual (bool) – whether to use zero init for last norm layer
in resblocks to let them behave as identity.
pretrained (str, optional) – model pretrained path. Default: None
base_width (int) – Basic width of each scale. Default: 26
depth (int) – Depth of res2net, from {50, 101, 152}.
in_channels (int) – Number of input image channels. Default: 3.
num_stages (int) – Res2net stages. Default: 4.
strides (Sequence[int]) – Strides of the first block of each stage.
dilations (Sequence[int]) – Dilation of each stage.
out_indices (Sequence[int]) – Output from which stages.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two
layer is the 3x3 conv layer, otherwise the stride-two layer is
the first 1x1 conv layer.
deep_stem (bool) – Replace 7x7 conv in input stem with three 3x3 convs.
avg_down (bool) – Use AvgPool instead of stride conv when
downsampling in the bottle2neck.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode).
-1 means not freezing any parameters.
norm_cfg (dict) – Dictionary to construct and config norm layer.
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only.
plugins (list[dict]) –
List of plugins for stages, each dict contains:
cfg (dict, required): Cfg dict to build plugin.
position (str, required): Position inside block to insert
plugin, options are ‘after_conv1’, ‘after_conv2’, ‘after_conv3’.
stages (tuple[bool], optional): Stages to apply plugin, length
should be same as ‘num_stages’.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some
memory while slowing down the training speed.
zero_init_residual (bool) – Whether to use zero init for last norm layer
in resblocks to let them behave as identity.
pretrained (str, optional) – model pretrained path. Default: None
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
in_channels (int) – Number of input image channels. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
groups (int) – Group of resnext.
base_width (int) – Base width of resnext.
strides (Sequence[int]) – Strides of the first block of each stage.
dilations (Sequence[int]) – Dilation of each stage.
out_indices (Sequence[int]) – Output from which stages.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two
layer is the 3x3 conv layer, otherwise the stride-two layer is
the first 1x1 conv layer.
frozen_stages (int) – Stages to be frozen (all param fixed). -1 means
not freezing any parameters.
norm_cfg (dict) – dictionary to construct and config norm layer.
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some
memory while slowing down the training speed.
zero_init_residual (bool) – whether to use zero init for last norm layer
in resblocks to let them behave as identity.
depth (int) – Depth of resnet, from {18, 34, 50, 101, 152}.
stem_channels (int | None) – Number of stem channels. If not specified,
it will be the same as base_channels. Default: None.
base_channels (int) – Number of base channels of res layer. Default: 64.
in_channels (int) – Number of input image channels. Default: 3.
num_stages (int) – Resnet stages. Default: 4.
strides (Sequence[int]) – Strides of the first block of each stage.
dilations (Sequence[int]) – Dilation of each stage.
out_indices (Sequence[int]) – Output from which stages.
style (str) – pytorch or caffe. If set to “pytorch”, the stride-two
layer is the 3x3 conv layer, otherwise the stride-two layer is
the first 1x1 conv layer.
deep_stem (bool) – Replace 7x7 conv in input stem with three 3x3 convs.
avg_down (bool) – Use AvgPool instead of stride conv when
downsampling in the bottleneck.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode).
-1 means not freezing any parameters.
norm_cfg (dict) – Dictionary to construct and config norm layer.
norm_eval (bool) – Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only.
plugins (list[dict]) –
List of plugins for stages, each dict contains:
cfg (dict, required): Cfg dict to build plugin.
position (str, required): Position inside block to insert
plugin, options are ‘after_conv1’, ‘after_conv2’, ‘after_conv3’.
stages (tuple[bool], optional): Stages to apply plugin, length
should be same as ‘num_stages’.
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some
memory while slowing down the training speed.
zero_init_residual (bool) – Whether to use zero init for last norm layer
in resblocks to let them behave as identity.
pretrained (str, optional) – model pretrained path. Default: None
Currently, context_block, empirical_attention_block and nonlocal_block
can be inserted into backbones such as ResNet/ResNeXt. They can be
inserted after conv1/conv2/conv3 of Bottleneck.
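A plugins list built from the cfg, position and stages keys might be resolved per stage as sketched below. The helper name, the plugin type names in cfg, and the rule that a missing stages mask applies the plugin to every stage are assumptions of this example.

```python
from typing import List


def plugins_for_stage(plugins: List[dict], stage_idx: int,
                      num_stages: int = 4) -> List[dict]:
    """Select the plugins that apply to one stage (hypothetical helper)."""
    selected = []
    for plugin in plugins:
        stages = plugin.get("stages")
        if stages is not None and len(stages) != num_stages:
            raise ValueError("len(stages) must equal num_stages")
        # Assumption: a missing mask means "apply to every stage".
        if stages is None or stages[stage_idx]:
            selected.append({"cfg": plugin["cfg"],
                             "position": plugin["position"]})
    return selected


# The type names inside cfg are illustrative placeholders.
plugins = [
    dict(cfg=dict(type="ContextBlock", ratio=1 / 16),
         stages=(False, True, True, True),
         position="after_conv3"),
    dict(cfg=dict(type="NonLocal2d"), position="after_conv2"),
]
stage0 = plugins_for_stage(plugins, 0)
stage1 = plugins_for_stage(plugins, 1)
```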
Compared with default ResNet(ResNetV1b), ResNetV1d replaces the 7x7 conv in
the input stem with three 3x3 convs. And in the downsampling block, a 2x2
avg_pool with stride 2 is added before conv, whose stride is changed to 1.
mlp_ratio (int) – Ratio of mlp hidden dim to embedding dim.
Default: 4.
depths (tuple[int]) – Depths of each Swin Transformer stage.
Default: (2, 2, 6, 2).
num_heads (tuple[int]) – Parallel attention heads of each Swin
Transformer stage. Default: (3, 6, 12, 24).
strides (tuple[int]) – The patch merging or patch embedding stride of
each Swin Transformer stage. (In swin, we set kernel size equal to
stride.) Default: (4, 2, 2, 2).
out_indices (tuple[int]) – Output from which stages.
Default: (0, 1, 2, 3).
qkv_bias (bool, optional) – If True, add a learnable bias to query, key,
value. Default: True
qk_scale (float | None, optional) – Override default qk scale of
head_dim ** -0.5 if set. Default: None.
patch_norm (bool) – Whether to add a norm layer for patch embed and patch
merging. Default: True.
use_abs_pos_embed (bool) – If True, add absolute position embedding to
the patch embedding. Default: False.
act_cfg (dict) – Config dict for activation layer.
Default: dict(type=’GELU’).
norm_cfg (dict) – Config dict for normalization layer at
output of backbone. Default: dict(type=’LN’).
with_cp (bool, optional) – Use checkpoint or not. Using checkpoint
will save some memory while slowing down the training speed.
Default: False.
pretrained (str, optional) – model pretrained path. Default: None.
convert_weights (bool) – The flag indicates whether the
pre-trained model is from the original repo. We may need
to convert some keys to make it compatible.
Default: False.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode).
Default: -1 (-1 means not freezing any parameters).
init_cfg (dict, optional) – The Config for initialization.
Defaults to None.
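The qk_scale default of head_dim ** -0.5 can be sketched as below. The helper name and the embed_dims argument are assumptions; head_dim is taken to be the embedding width divided by the number of heads.

```python
from typing import Optional


def attention_scale(embed_dims: int, num_heads: int,
                    qk_scale: Optional[float] = None) -> float:
    """Return the scaling applied to query-key products (hypothetical)."""
    head_dim = embed_dims // num_heads
    # Use the override when given, otherwise fall back to head_dim ** -0.5.
    return qk_scale if qk_scale is not None else head_dim ** -0.5


default_scale = attention_scale(embed_dims=96, num_heads=3)
custom_scale = attention_scale(embed_dims=96, num_heads=3, qk_scale=0.1)
```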
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
The stem layer, stage 1 and stage 2 in Trident ResNet are identical to
ResNet, while in stage 3, Trident BottleBlock is utilized to replace the
normal BottleBlock to yield trident output. Different branches share the
convolution weights but use different dilations to achieve multi-scale
output.
depth (int) – Depth of resnet, from {50, 101, 152}.
num_branch (int) – Number of branches in TridentNet.
test_branch_idx (int) – In inference, all 3 branches will be used
if test_branch_idx==-1, otherwise only branch with index
test_branch_idx will be used.
trident_dilations (tuple[int]) – Dilations of different trident branch.
len(trident_dilations) should be equal to num_branch.
Take semi-supervised object detection as an example. Assume that
the ratio of labeled data to unlabeled data in a batch is 1:2;
sup indicates the branch where the labeled data is augmented, and
unsup_teacher and unsup_student indicate the branches where
the unlabeled data is augmented by different pipelines.
The input format of multi-branch data is shown below:
The format of multi-branch data
after filtering None is shown below:
In order to reuse DetDataPreprocessor for the data
from different branches, the format of multi-branch data
grouped by branch is as below:
After preprocessing data from different branches,
the multi-branch data needs to be reformatted as:
Parameters:
data_preprocessor (ConfigDict or dict) – Config of
DetDataPreprocessor to process the input data.
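The filtering and regrouping steps described above can be sketched with plain Python containers. This is an illustration only: the inputs and data_samples key names, the helper name, and the use of strings in place of tensors and data samples are assumptions, with the 1:2 labeled-to-unlabeled ratio taken from the example in the text.

```python
from typing import Any, Dict, List


def group_by_branch(
        data: Dict[str, Dict[str, List[Any]]]
) -> Dict[str, Dict[str, List[Any]]]:
    """Drop None entries and regroup {field: {branch: items}} by branch."""
    grouped: Dict[str, Dict[str, List[Any]]] = {}
    for field, branches in data.items():
        for branch, items in branches.items():
            kept = [item for item in items if item is not None]
            grouped.setdefault(branch, {})[field] = kept
    return grouped


# Strings stand in for image tensors and data samples.
batch = {
    "inputs": {
        "sup": ["img_a", None, None],
        "unsup_teacher": [None, "img_b", "img_c"],
        "unsup_student": [None, "img_b2", "img_c2"],
    },
    "data_samples": {
        "sup": ["s_a", None, None],
        "unsup_teacher": [None, "s_b", "s_c"],
        "unsup_student": [None, "s_b2", "s_c2"],
    },
}
grouped = group_by_branch(batch)
```

After regrouping, each branch holds an ordinary inputs/data_samples pair, so a single-branch preprocessor could be applied to every branch in turn.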
Compared with mmengine.model.ImgDataPreprocessor:
It won’t do normalization if mean is not specified.
It does normalization and color space conversion after stacking batch.
It supports batch augmentations like mixup and cutmix.
It provides the data pre-processing as follows:
Collate and move data to the target device.
Pad inputs to the maximum size of current batch with defined
pad_value. The padded size can be made divisible by a defined
pad_size_divisor.
Stack inputs to batch_inputs.
Convert inputs from bgr to rgb if the shape of input is (3, H, W).
Normalize image with defined std and mean.
Do batch augmentations like Mixup and Cutmix during training.
Parameters:
mean (Sequence[Number], optional) – The pixel mean of R, G, B channels.
Defaults to None.
std (Sequence[Number], optional) – The pixel standard deviation of
R, G, B channels. Defaults to None.
pad_size_divisor (int) – The size of padded image should be
divisible by pad_size_divisor. Defaults to 1.
pad_value (Number) – The padded pixel value. Defaults to 0.
to_rgb (bool) – whether to convert image from BGR to RGB.
Defaults to False.
to_onehot (bool) – Whether to generate one-hot format gt-labels and set
to data samples. Defaults to False.
num_classes (int, optional) – The number of classes. Defaults to None.
batch_augments (dict, optional) – The batch augmentations settings,
including “augments” and “probs”. For more details, see
mmpretrain.models.RandomBatchAugment.
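The padding step, in which the batch is padded to its largest image and then rounded up to a multiple of pad_size_divisor, can be sketched as follows. The helper name and the (height, width) tuple layout are assumptions.

```python
import math
from typing import List, Tuple


def padded_shape(shapes: List[Tuple[int, int]],
                 pad_size_divisor: int = 1) -> Tuple[int, int]:
    """Target (height, width) after padding a batch (hypothetical helper)."""
    max_h = max(h for h, _ in shapes)
    max_w = max(w for _, w in shapes)
    # Round each side up to the next multiple of the divisor.
    pad_h = math.ceil(max_h / pad_size_divisor) * pad_size_divisor
    pad_w = math.ceil(max_w / pad_size_divisor) * pad_size_divisor
    return pad_h, pad_w


target_32 = padded_shape([(480, 640), (500, 600)], pad_size_divisor=32)
target_1 = padded_shape([(480, 640), (500, 600)])
```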
Accepts the data sampled by the dataloader, and preprocesses
it into the format of the model input. TrackDataPreprocessor
provides the tracking data pre-processing as follows:
Collate and move data to the target device.
Pad inputs to the maximum size of current batch with defined
pad_value. The padded size can be made divisible by a defined
pad_size_divisor.
Stack inputs to inputs.
Convert inputs from bgr to rgb if the shape of input is (1, 3, H, W).
Normalize image with defined std and mean.
Do batch augmentations during training.
Record the information of batch_input_shape and pad_shape.
Parameters:
mean (Sequence[Number], optional) – The pixel mean of R, G, B
channels. Defaults to None.
std (Sequence[Number], optional) – The pixel standard deviation of
R, G, B channels. Defaults to None.
pad_size_divisor (int) – The size of padded image should be
divisible by pad_size_divisor. Defaults to 1.
pad_value (Number) – The padded pixel value. Defaults to 0.
pad_mask (bool) – Whether to pad instance masks. Defaults to False.
mask_pad_value (int) – The padded pixel value for instance masks.
Defaults to 0.
bgr_to_rgb (bool) – Whether to convert image from BGR to RGB.
Defaults to False.
rgb_to_bgr (bool) – Whether to convert image from RGB to BGR.
Defaults to False.
use_det_processor (bool) – Whether to use DetDataPreprocessor
in the training phase. This is mainly for tracking models that are
fed a single image rather than a group of images in training.
Defaults to False.
boxtype2tensor (bool) – Whether to convert the BaseBoxes type of
The ATSS head structure is similar to FCOS; however, ATSS uses anchor boxes
and assigns labels by Adaptive Training Sample Selection instead of max-IoU.
Parameters:
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
pred_kernel_size (int) – Kernel size of nn.Conv2d
stacked_convs (int) – Number of stacking convs of the head.
conv_cfg (ConfigDict or dict, optional) – Config dict for
convolution layer. Defaults to None.
norm_cfg (ConfigDict or dict) – Config dict for normalization
layer. Defaults to dict(type='GN',num_groups=32,requires_grad=True).
reg_decoded_bbox (bool) – If true, the regression loss would be
applied directly on decoded bounding boxes, converting both
the predicted boxes and regression targets to absolute
coordinates format. Defaults to False. It should be True when
using IoULoss, GIoULoss, or DIoULoss in the bbox head.
loss_centerness (ConfigDict or dict) – Config of centerness loss.
Defaults to dict(type='CrossEntropyLoss',use_sigmoid=True,loss_weight=1.0).
init_cfg (ConfigDict or dict or list[dict] or list[ConfigDict]) – Initialization config dict.
This method is almost the same as AnchorHead.get_targets(). Besides
returning the targets as the parent method does, it also returns the
anchors as the first element of the returned tuple.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
centernesses (list[Tensor]) – Centerness for each scale
level with shape (N, num_anchors * 1, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of
gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Calculate the loss of a single scale level based on the features
extracted by the detection head.
Parameters:
cls_score (Tensor) – Box scores for each scale level,
with shape (N, num_anchors * num_classes, H, W).
bbox_pred (Tensor) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
anchors (Tensor) – Box reference for each scale level with shape
(N, num_total_anchors, 4).
labels (Tensor) – Labels of each anchor with shape
(N, num_total_anchors).
label_weights (Tensor) – Label weights of each anchor with shape
(N, num_total_anchors).
bbox_targets (Tensor) – BBox regression targets of each anchor with
shape (N, num_total_anchors, 4).
avg_factor (float) – Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the number
of positive priors.
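As a rough illustration of the role of avg_factor, the sketch below divides a weighted sum of per-prior losses by avg_factor. The helper name `average_loss` and the clamp to a minimum of 1.0 are assumptions made for this example, not the library's implementation.

```python
def average_loss(per_prior_losses, weights, avg_factor):
    """Weighted sum of per-prior losses divided by avg_factor."""
    total = sum(loss * w for loss, w in zip(per_prior_losses, weights))
    # Assumption: clamp to 1.0 so a batch without positives does not divide by zero.
    return total / max(avg_factor, 1.0)


losses = [0.5, 1.5, 2.0, 0.0]
weights = [1.0, 1.0, 1.0, 0.0]  # the last prior is ignored
num_pos = 3                     # e.g. number of positive priors with PseudoSampler
loss = average_loss(losses, weights, float(num_pos))
```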
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
centernesses (list[Tensor]) – Centerness for each scale
level with shape (N, num_anchors * 1, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Transform a batch of output features extracted from the head into
bbox results.
Note: When score_factors is not None, the cls_scores are
usually multiplied by it to obtain the real score used in NMS,
such as the centerness in FCOS or the IoU branch in ATSS.
Parameters:
cls_logits (list[Tensor]) – Classification scores for all
scale levels, each is a 4D-tensor, has shape
(batch_size, num_priors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for all
scale levels, each is a 4D-tensor, has shape
(batch_size, num_priors * 4, H, W).
score_factors (list[Tensor], optional) – Score factors for
all scale levels, each is a 4D-tensor with shape
(batch_size, num_priors * 1, H, W). Defaults to None.
batch_img_metas (list[dict], Optional) – Batch image meta info.
Defaults to None.
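The note on score_factors can be sketched as follows: the classification score is multiplied by the score factor before NMS. This is a minimal illustration on plain Python lists; the helper names and the assumption that both inputs are logits passed through a sigmoid are hypothetical and may differ per head.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse_scores(cls_logits, score_factor_logits=None):
    """Return the scores that would be fed to NMS for a set of priors."""
    scores = [sigmoid(v) for v in cls_logits]
    if score_factor_logits is None:
        # No extra branch (e.g. no centerness): use the class scores directly.
        return scores
    return [s * sigmoid(f) for s, f in zip(scores, score_factor_logits)]


fused = fuse_scores([0.0, 2.0], [0.0, 0.0])
```

A prior with a confident class score but a low score factor is ranked lower, which tends to suppress poorly localized boxes.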
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
feat_channels (int) – Number of hidden channels. Used in child classes.
stacked_convs (int) – Number of stacking convs of the head.
strides (Sequence[int] or Sequence[Tuple[int, int]]) – Downsample
factor of each feature map.
dcn_on_last_conv (bool) – If true, use dcn in the last layer of
towers. Defaults to False.
conv_bias (bool or str) – If specified as auto, it will be decided by
the norm_cfg. Bias of conv will be set as True if norm_cfg is
None, otherwise False. Default: “auto”.
loss_cls (ConfigDict or dict) – Config of classification loss.
loss_bbox (ConfigDict or dict) – Config of localization loss.
bbox_coder (ConfigDict or dict) – Config of bbox coder. Defaults to
‘DistancePointBBoxCoder’.
conv_cfg (ConfigDict or dict, Optional) – Config dict for
convolution layer. Defaults to None.
norm_cfg (ConfigDict or dict, Optional) – Config dict for
normalization layer. Defaults to None.
train_cfg (ConfigDict or dict, Optional) – Training config of
anchor-free head.
test_cfg (ConfigDict or dict, Optional) – Testing config of
anchor-free head.
init_cfg (ConfigDict or dict or list[ConfigDict or dict]) – Initialization config dict.
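The "auto" behavior of conv_bias described above can be sketched in a few lines. The helper name `resolve_conv_bias` is hypothetical; it only mirrors the rule stated in the parameter description (bias is enabled when norm_cfg is None).

```python
def resolve_conv_bias(conv_bias, norm_cfg):
    """Resolve conv_bias to a concrete bool, following the documented rule."""
    if conv_bias == "auto":
        # A normalization layer makes the conv bias redundant.
        return norm_cfg is None
    return bool(conv_bias)


with_norm = resolve_conv_bias("auto", dict(type="GN", num_groups=32))
without_norm = resolve_conv_bias("auto", None)
```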
aug_batch_feats (list[Tensor]) – the outer list indicates test-time
augmentations and inner Tensor should have a shape NxCxHxW,
which contains features for all images in the batch.
aug_batch_img_metas (list[list[dict]]) – the outer list indicates
test-time augs (multiscale, flip, etc.) and the inner list
indicates images in a batch. each dict has image information.
rescale (bool, optional) – Whether to rescale the results.
Defaults to False.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each is a 4D-tensor, the channel number is
num_points * num_classes.
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level, each is a 4D-tensor, the channel number is
num_points * 4.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
feat_channels (int) – Number of hidden channels. Used in child classes.
anchor_generator (dict) – Config dict for anchor generator
bbox_coder (dict) – Config of bounding box coder.
reg_decoded_bbox (bool) – If true, the regression loss would be
applied directly on decoded bounding boxes, converting both
the predicted boxes and regression targets to absolute
coordinates format. Default False. It should be True when
using IoULoss, GIoULoss, or DIoULoss in the bbox head.
loss_cls (dict) – Config of classification loss.
loss_bbox (dict) – Config of localization loss.
train_cfg (dict) – Training config of anchor head.
test_cfg (dict) – Testing config of anchor head.
init_cfg (dict or list[dict], optional) – Initialization config dict.
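To make the reg_decoded_bbox option more concrete, the sketch below decodes a predicted delta into absolute (x1, y1, x2, y2) coordinates, which is the form an IoU-based loss operates on. It assumes the common center/size delta encoding (dx, dy, dw, dh); the actual encoding depends on the configured bbox_coder, and the function name is hypothetical.

```python
import math


def decode_deltas(anchor, deltas):
    """Decode (dx, dy, dw, dh) relative to an (x1, y1, x2, y2) anchor."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center by a fraction of the anchor size, scale size by exp.
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


box = decode_deltas((10.0, 10.0, 30.0, 50.0), (0.5, 0.0, 0.0, 0.0))
```

With reg_decoded_bbox=False the loss would compare the raw deltas with encoded targets instead; with True, both sides are decoded boxes like `box` above.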
cls_score (Tensor): Classification scores for a single scale level; the number of channels is num_base_priors * num_classes.
bbox_pred (Tensor): Box energies / deltas for a single scale level; the number of channels is num_base_priors * 4.
Compute regression and classification targets for anchors in
multiple images.
Parameters:
anchor_list (list[list[Tensor]]) – Multi level anchors of each
image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_anchors, 4).
valid_flag_list (list[list[Tensor]]) – Multi level valid flags of
each image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_anchors, )
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
unmap_outputs (bool) – Whether to map outputs back to the original
set of anchors. Defaults to True.
return_sampling_results (bool) – Whether to return the sampling
results. Defaults to False.
Returns:
Usually returns a tuple containing learning targets.
labels_list (list[Tensor]): Labels of each level.
label_weights_list (list[Tensor]): Label weights of each
level.
bbox_targets_list (list[Tensor]): BBox targets of each level.
bbox_weights_list (list[Tensor]): BBox weights of each level.
avg_factor (int): Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the number
of positive priors.
additional_returns: This function enables user-defined returns from
self._get_targets_single. These returns are currently refined
to properties at each feature map (i.e. having HxW dimension).
The results will be concatenated after the end.
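The inputs above are organized per image and the returned lists per level, so a regrouping step is needed in between. The following sketch shows one way this could look on plain lists; the function name `regroup_by_level` is hypothetical and the code is an illustration, not the library's implementation.

```python
def regroup_by_level(per_image_targets, num_level_anchors):
    """Convert [image][all anchors] targets into [level][image][anchors]."""
    level_targets = []
    start = 0
    for num in num_level_anchors:
        end = start + num
        # Take the slice belonging to this level from every image.
        level_targets.append([img[start:end] for img in per_image_targets])
        start = end
    return level_targets


# Two images, five anchors each: three on level 0 and two on level 1.
labels = [[1, 0, 0, 2, 0], [0, 0, 3, 0, 0]]
per_level = regroup_by_level(labels, [3, 2])
```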
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Calculate the negative loss of all points in feature map.
Parameters:
cls_score (Tensor) – All category scores for each point on
the feature map. The shape is (num_points, num_class).
objectness (Tensor) – Foreground probability of all points
and is shape of (num_points, 1).
gt_instances (InstanceData) – Ground truth of instance
annotations. It should include bboxes and labels
attributes.
ious (Tensor) – Float tensor with shape of (num_points, num_gt).
Each value represents the IoU of pred_bbox and gt_bboxes.
inside_gt_bbox_mask (Tensor) – Tensor of bool type,
with shape of (num_points, num_gt), each
value is used to mark whether this point falls
within a certain gt.
Returns:
neg_loss (Tensor): The negative loss of all points in the feature map.
Compute regression targets and each point inside or outside gt_bbox
in multiple images.
Parameters:
points (list[Tensor]) – Points of all FPN levels, each with shape
(num_points, 2).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
Returns:
inside_gt_bbox_mask_list (list[Tensor]): Each tensor is of bool type with shape (num_points, num_gt); each value marks whether the point falls within a certain gt.
concat_lvl_bbox_targets (list[Tensor]): BBox targets of each level. Each tensor has shape (num_points, num_gt, 4).
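The two return values can be illustrated with a small sketch that computes, for every point and every gt box, a 4-value regression target and an inside flag. It assumes the targets are the distances from the point to the left, top, right and bottom sides of the box and that a point counts as inside only when all four distances are positive; both are assumptions of this example, and the function name is hypothetical.

```python
def point_targets(points, gt_bboxes):
    """Return (inside_mask, bbox_targets) with shapes
    (num_points, num_gt) and (num_points, num_gt, 4) as nested lists."""
    inside_mask, bbox_targets = [], []
    for x, y in points:
        mask_row, target_row = [], []
        for x1, y1, x2, y2 in gt_bboxes:
            ltrb = (x - x1, y - y1, x2 - x, y2 - y)
            target_row.append(ltrb)
            # Strictly inside the box when every distance is positive.
            mask_row.append(min(ltrb) > 0)
        inside_mask.append(mask_row)
        bbox_targets.append(target_row)
    return inside_mask, bbox_targets


inside, targets = point_targets(
    [(5.0, 5.0), (20.0, 5.0)], [(0.0, 0.0, 10.0, 10.0)])
```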
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each is a 4D-tensor, the channel number is
num_points * num_classes.
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level, each is a 4D-tensor, the channel number is
num_points * 4.
objectnesses (list[Tensor]) – objectness for each scale level, each
is a 4D-tensor, the channel number is num_points * 1.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
The CascadeRPNHead predicts more accurate region proposals, which are
required for two-stage detectors (such as Fast/Faster R-CNN). CascadeRPN
consists of a sequence of RPNStage to progressively improve the accuracy of
the detected proposals.
More details can be found in https://arxiv.org/abs/1909.06720.
Parameters:
num_stages (int) – number of CascadeRPN stages.
stages (list[ConfigDict or dict]) – list of configs to build
the stages.
train_cfg (list[ConfigDict or dict]) – list of configs at
training time for each stage.
test_cfg (ConfigDict or dict) – config at testing time.
init_cfg (ConfigDict or list[ConfigDict] or dict or list[dict]) – Initialization config dict.
center_heatmap_preds (list[Tensor]) – Center heatmap predictions for
all levels with shape (B, num_classes, H, W).
wh_preds (list[Tensor]) – Width/height predictions for all levels with
shape (B, 2, H, W).
offset_preds (list[Tensor]) – Offset predictions for all levels
with shape (B, 2, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Returns:
The returned value has the components below:
loss_center_heatmap (Tensor): loss of center heatmap.
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
regress_ranges (Sequence[Tuple[int, int]]) – Regress range of multiple
level points.
hm_min_radius (int) – Heatmap target minimum radius of cls branch.
Defaults to 4.
hm_min_overlap (float) – Heatmap target minimum overlap of cls branch.
Defaults to 0.8.
more_pos_thresh (float) – The filtering threshold when the cls branch
adds more positive samples. Defaults to 0.2.
more_pos_topk (int) – The maximum number of additional positive samples
added to each gt. Defaults to 9.
soft_weight_on_reg (bool) – Whether to use the soft target of the
cls branch as the soft weight of the bbox branch.
Defaults to False.
loss_cls (ConfigDict or dict) – Config of cls loss. Defaults to
dict(type='GaussianFocalLoss', loss_weight=1.0).
loss_bbox (ConfigDict or dict) – Config of bbox loss. Defaults to
dict(type='GIoULoss', loss_weight=2.0).
norm_cfg (ConfigDict or dict, optional) – Dictionary to construct
and configure the norm layer. Defaults to
dict(type='GN', num_groups=32, requires_grad=True).
train_cfg (ConfigDict or dict, optional) – Training config.
Unused in CenterNet. Reserved for compatibility with
SingleStageDetector.
test_cfg (ConfigDict or dict, optional) – Testing config
of CenterNet.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each is a 4D-tensor, the channel number is num_classes.
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level, each is a 4D-tensor, the channel number is 4.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Head of CentripetalNet: Pursuing High-quality Keypoint Pairs for Object
Detection.
CentripetalHead inherits from CornerHead. It removes the
embedding branch and adds guiding shift and centripetal shift branches.
More details can be found in the paper.
Parameters:
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
num_feat_levels (int) – Levels of feature from the previous module.
2 for HourglassNet-104 and 1 for HourglassNet-52. HourglassNet-104
outputs the final feature and intermediate supervision feature and
HourglassNet-52 only outputs the final feature. Defaults to 2.
corner_emb_channels (int) – Channel of embedding vector. Defaults to 1.
train_cfg (ConfigDict or dict, optional) – Training config.
Unused in CornerHead, but kept for compatibility with
SingleStageDetector.
test_cfg (ConfigDict or dict, optional) – Testing config of
CornerHead.
loss_heatmap (ConfigDict or dict) – Config of corner heatmap
loss. Defaults to GaussianFocalLoss.
loss_embedding (ConfigDict or dict) – Config of corner embedding
loss. Defaults to AssociativeEmbeddingLoss.
loss_offset (ConfigDict or dict) – Config of corner offset loss.
Defaults to SmoothL1Loss.
loss_guiding_shift (ConfigDict or dict) – Config of
guiding shift loss. Defaults to SmoothL1Loss.
loss_centripetal_shift (ConfigDict or dict) – Config of
centripetal shift loss. Defaults to SmoothL1Loss.
Transform a batch of output features extracted from the head into
bbox results.
Parameters:
tl_heats (list[Tensor]) – Top-left corner heatmaps for each level
with shape (N, num_classes, H, W).
br_heats (list[Tensor]) – Bottom-right corner heatmaps for each
level with shape (N, num_classes, H, W).
tl_offs (list[Tensor]) – Top-left corner offsets for each level
with shape (N, corner_offset_channels, H, W).
br_offs (list[Tensor]) – Bottom-right corner offsets for each level
with shape (N, corner_offset_channels, H, W).
tl_guiding_shifts (list[Tensor]) – Top-left guiding shifts for each
level with shape (N, guiding_shift_channels, H, W). Unused in
this function; the argument is kept because it is the raw output
from CentripetalHead.
br_guiding_shifts (list[Tensor]) – Bottom-right guiding shifts for
each level with shape (N, guiding_shift_channels, H, W).
Unused in this function; the argument is kept because it is the
raw output from CentripetalHead.
tl_centripetal_shifts (list[Tensor]) – Top-left centripetal shifts
for each level with shape (N, centripetal_shift_channels, H,
W).
br_centripetal_shifts (list[Tensor]) – Bottom-right centripetal
shifts for each level with shape (N,
centripetal_shift_channels, H, W).
batch_img_metas (list[dict], optional) – Batch image meta info.
Defaults to None.
rescale (bool) – If True, return boxes in original image space.
Defaults to False.
with_nms (bool) – If True, do nms before return boxes.
Defaults to True.
Returns:
Object detection results of each image
after post-processing. Each item usually contains the following keys.
scores (Tensor): Classification scores, with shape
(num_instances, ).
labels (Tensor): Labels of bboxes, with shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4);
the last dimension is arranged as (x1, y1, x2, y2).
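As an illustration of the rescale flag, the sketch below maps (x1, y1, x2, y2) boxes from the resized network input back to the original image space by dividing by a scale factor. It assumes the image meta information provides the scale factor as a (w_scale, h_scale) pair; the function name is hypothetical.

```python
def rescale_bboxes(bboxes, scale_factor):
    """Map boxes predicted on the resized image back to the original image."""
    w_scale, h_scale = scale_factor  # assumed order: (width scale, height scale)
    return [
        (x1 / w_scale, y1 / h_scale, x2 / w_scale, y2 / h_scale)
        for x1, y1, x2, y2 in bboxes
    ]


restored = rescale_bboxes([(20.0, 10.0, 60.0, 40.0)], (2.0, 0.5))
```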
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each is a 4D-tensor, the channel number is
num_points * num_classes.
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level, each is a 4D-tensor, the channel number is
num_points * 4.
centernesses (list[Tensor]) – centerness for each scale level, each
is a 4D-tensor, the channel number is num_points * 1.
param_preds (List[Tensor]) – param_pred for each scale level, each
is a 4D-tensor, the channel number is num_params.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Transform a batch of output features extracted from the head into
bbox results.
Note: When score_factors is not None, the cls_scores are
usually multiplied by it to obtain the real score used in NMS,
such as the centerness in FCOS or the IoU branch in ATSS.
Parameters:
cls_scores (list[Tensor]) – Classification scores for all
scale levels, each is a 4D-tensor, has shape
(batch_size, num_priors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for all
scale levels, each is a 4D-tensor, has shape
(batch_size, num_priors * 4, H, W).
score_factors (list[Tensor], optional) – Score factors for
all scale levels, each is a 4D-tensor with shape
(batch_size, num_priors * 1, H, W). Defaults to None.
param_preds (list[Tensor], optional) – Params for all scale
levels, each is a 4D-tensor with shape
(batch_size, num_priors * num_params, H, W).
batch_img_metas (list[dict], Optional) – Batch image meta info.
Defaults to None.
cfg (ConfigDict, optional) – Test / postprocessing
configuration, if None, test_cfg would be used.
Defaults to None.
rescale (bool) – If True, return boxes in original image space.
Defaults to False.
with_nms (bool) – If True, do nms before return boxes.
Defaults to True.
Returns:
Object detection results of each image
after post-processing. Each item usually contains the following keys.
scores (Tensor): Classification scores, with shape
(num_instances, ).
labels (Tensor): Labels of bboxes, with shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4);
the last dimension is arranged as (x1, y1, x2, y2).
Forward features from the upstream network to get prototypes and
linearly combine the prototypes, using mask coefficients, into
instance masks. Finally, crop the instance masks with the given bboxes.
Parameters:
x (Tuple[Tensor]) – Feature from the upstream network, which is
a 4D-tensor.
positive_infos (List[InstanceData]) – Positive information
calculated by the detection head.
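The prototype combination described above can be sketched on plain nested lists: each instance mask is a coefficient-weighted sum of the prototype maps, followed here by a sigmoid and a crop to the instance box. The function name, the use of a sigmoid, and the convention that the box is given in integer pixel coordinates with exclusive x2/y2 are assumptions of this example.

```python
import math


def assemble_mask(prototypes, coeffs, bbox):
    """Combine K prototype maps (each H x W) into one cropped instance mask."""
    height, width = len(prototypes[0]), len(prototypes[0][0])
    x1, y1, x2, y2 = bbox
    mask = []
    for r in range(height):
        row = []
        for c in range(width):
            # Linear combination of the prototypes at this pixel.
            logit = sum(k * p[r][c] for k, p in zip(coeffs, prototypes))
            prob = 1.0 / (1.0 + math.exp(-logit))
            inside = x1 <= c < x2 and y1 <= r < y2
            # Pixels outside the box are zeroed out (the crop step).
            row.append(prob if inside else 0.0)
        mask.append(row)
    return mask


protos = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
mask = assemble_mask(protos, [2.0, -2.0], (0, 0, 1, 2))
```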
hidden_states (Tensor) – Features from transformer decoder. If
return_intermediate_dec is True output has shape
(num_decoder_layers, bs, num_queries, dim), else has shape (1,
bs, num_queries, dim) which only contains the last layer
outputs.
references (Tensor) – References from transformer decoder, has
shape (bs, num_queries, 2).
Returns:
results of head containing the following tensor.
layers_cls_scores (Tensor): Outputs from the classification head,
shape (num_decoder_layers, bs, num_queries, cls_out_channels).
Note cls_out_channels should include background.
layers_bbox_preds (Tensor): Sigmoid outputs from the regression
head with normalized coordinate format (cx, cy, w, h), has shape
(num_decoder_layers, bs, num_queries, 4).
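Since layers_bbox_preds uses a normalized (cx, cy, w, h) format, a conversion is needed before the boxes can be drawn or evaluated in pixel coordinates. The sketch below shows that conversion for a single box; the function name is hypothetical and the image size is an illustrative value.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - 0.5 * w) * img_w,
        (cy - 0.5 * h) * img_h,
        (cx + 0.5 * w) * img_w,
        (cy + 0.5 * h) * img_h,
    )


abs_box = cxcywh_to_xyxy((0.5, 0.5, 0.5, 0.25), 640, 480)
```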
Perform forward propagation of the head, then calculate loss and
predictions from the features and data samples. Overridden because
img_metas are needed as inputs for bbox_head.
Parameters:
hidden_states (Tensor) – Features from the transformer decoder, has
shape (num_decoder_layers, bs, num_queries, dim).
references (Tensor) – References from the transformer decoder, has
shape (num_decoder_layers, bs, num_queries, 2).
batch_data_samples (list[DetDataSample]) – Each item contains
the meta information of each image and corresponding
annotations.
Returns:
The return value is a tuple containing:
losses (dict[str, Tensor]): A dictionary of loss components.
predictions (list[InstanceData]): Detection
results of each image after the post process.
Perform forward propagation of the detection head and predict
detection results on the features of the upstream network. Overridden
because img_metas are needed as inputs for bbox_head.
Parameters:
hidden_states (Tensor) – Features from the transformer decoder, has
shape (num_decoder_layers, bs, num_queries, dim).
references (Tensor) – References from the transformer decoder, has
shape (num_decoder_layers, bs, num_queries, 2).
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
rescale (bool, optional) – Whether to rescale the results.
Defaults to True.
Returns:
list[InstanceData]: Detection results of each image
after post-processing.
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
num_feat_levels (int) – Levels of feature from the previous module.
2 for HourglassNet-104 and 1 for HourglassNet-52, because
HourglassNet-104 outputs the final feature and an intermediate
supervision feature, while HourglassNet-52 only outputs the final
feature. Defaults to 2.
corner_emb_channels (int) – Channel of embedding vector. Defaults to 1.
train_cfg (ConfigDict or dict, optional) – Training config.
Unused in CornerHead, but kept for compatibility with
SingleStageDetector.
test_cfg (ConfigDict or dict, optional) – Testing config of
CornerHead.
loss_heatmap (ConfigDict or dict) – Config of corner heatmap
loss. Defaults to GaussianFocalLoss.
loss_embedding (ConfigDict or dict) – Config of corner embedding
loss. Defaults to AssociativeEmbeddingLoss.
loss_offset (ConfigDict or dict) – Config of corner offset loss.
Defaults to SmoothL1Loss.
init_cfg (ConfigDict or dict, optional) – the config to control
the initialization.
feats (tuple[Tensor]) – Features from the upstream network, each is
a 4D-tensor.
Returns:
Usually a tuple of corner heatmaps, offset heatmaps and
embedding heatmaps.
tl_heats (list[Tensor]): Top-left corner heatmaps for all
levels, each is a 4D-tensor, the channels number is
num_classes.
br_heats (list[Tensor]): Bottom-right corner heatmaps for all
levels, each is a 4D-tensor, the channels number is
num_classes.
tl_embs (list[Tensor] | list[None]): Top-left embedding
heatmaps for all levels, each is a 4D-tensor or None.
If not None, the channels number is corner_emb_channels.
br_embs (list[Tensor] | list[None]): Bottom-right embedding
heatmaps for all levels, each is a 4D-tensor or None.
If not None, the channels number is corner_emb_channels.
tl_offs (list[Tensor]): Top-left offset heatmaps for all
levels, each is a 4D-tensor. The channels number is
corner_offset_channels.
br_offs (list[Tensor]): Bottom-right offset heatmaps for all
levels, each is a 4D-tensor. The channels number is
corner_offset_channels.
hidden_states (Tensor) – Features from transformer decoder. If
return_intermediate_dec is True output has shape
(num_decoder_layers, bs, num_queries, dim), else has shape (1,
bs, num_queries, dim) which only contains the last layer
outputs.
references (Tensor) – References from transformer decoder. If
return_intermediate_dec is True output has shape
(num_decoder_layers, bs, num_queries, 2/4), else has shape (1,
bs, num_queries, 2/4)
which only contains the last layer reference.
Returns:
results of head containing the following tensor.
layers_cls_scores (Tensor): Outputs from the classification head,
shape (num_decoder_layers, bs, num_queries, cls_out_channels).
Note cls_out_channels should include background.
layers_bbox_preds (Tensor): Sigmoid outputs from the regression
head with normalized coordinate format (cx, cy, w, h), has shape
(num_decoder_layers, bs, num_queries, 4).
Perform forward propagation of the detection head and predict
detection results on the features of the upstream network. Overridden
because img_metas are needed as inputs for bbox_head.
Parameters:
hidden_states (Tensor) – Feature from the transformer decoder, has
shape (num_decoder_layers, bs, num_queries, dim).
references (Tensor) – references from the transformer decoder, has
shape (num_decoder_layers, bs, num_queries, 2/4).
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
rescale (bool, optional) – Whether to rescale the results.
Defaults to True.
Returns:
list[InstanceData]: Detection results of each image
after post-processing.
DDOD head decomposes conjunctions lying in most current one-stage
detectors via label assignment disentanglement, spatial feature
disentanglement, and pyramid supervision disentanglement.
Parameters:
num_classes (int) – Number of categories excluding the
background category.
in_channels (int) – Number of channels in the input feature map.
stacked_convs (int) – The number of stacked Conv. Defaults to 4.
conv_cfg (ConfigDict or dict, optional) – Config dict for
convolution layer. Defaults to None.
use_dcn (bool) – Whether to use DCN. When False, the head is the
same as ATSS. Defaults to True.
norm_cfg (ConfigDict or dict) – Normalization config of the DDOD head.
Defaults to dict(type='GN', num_groups=32, requires_grad=True).
loss_iou (ConfigDict or dict) – Config of IoU loss. Defaults to
dict(type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0).
This method is almost the same as AnchorHead.get_targets().
Besides returning the targets as the parent method does,
it also returns the anchors as the first element of the
returned tuple.
Parameters:
anchor_list (list[Tensor]) – Anchors of each image.
valid_flag_list (list[Tensor]) – Valid flags of each image.
num_level_anchors_list (list[Tensor]) – Number of anchors of each
scale level of all images.
cls_score_list (list[Tensor]) – Classification scores for all scale
levels, each is a 4D-tensor, the channels number is
num_base_priors * num_classes.
bbox_pred_list (list[Tensor]) – Box energies / deltas for all scale
levels, each is a 4D-tensor, the channels number is
num_base_priors * 4.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
unmap_outputs (bool) – Whether to map outputs back to the original
set of anchors.
This method is almost the same as AnchorHead.get_targets() when
is_cls_assigner is False. Besides returning the targets as the parent
method does, it also returns the anchors as the first element of the
returned tuple.
Parameters:
anchor_list (list[Tensor]) – Anchors of each image.
valid_flag_list (list[Tensor]) – Valid flags of each image.
num_level_anchors_list (list[Tensor]) – Number of anchors of each
scale level of all images.
cls_score_list (list[Tensor]) – Classification scores for all scale
levels, each is a 4D-tensor, the channels number is
num_base_priors * num_classes.
bbox_pred_list (list[Tensor]) – Box energies / deltas for all scale
levels, each is a 4D-tensor, the channels number is
num_base_priors * 4.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
unmap_outputs (bool) – Whether to map outputs back to the original
set of anchors.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
with shape (N, num_base_priors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_base_priors * 4, H, W).
iou_preds (list[Tensor]) – Score factors for all scale levels,
each is a 4D-tensor with shape (batch_size, 1, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
cls_score (Tensor) – Box scores for each scale level,
with shape (N, num_base_priors * num_classes, H, W).
labels (Tensor) – Labels of each anchor with shape
(N, num_total_anchors).
label_weights (Tensor) – Label weights of each anchor with shape
(N, num_total_anchors).
reweight_factor (List[float]) – Reweight factor for cls and reg
loss.
avg_factor (float) – Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the number
of positive priors.
Compute reg loss of a single scale level based on the features
extracted by the detection head.
Parameters:
anchors (Tensor) – Box reference for each scale level with shape
(N, num_total_anchors, 4).
bbox_pred (Tensor) – Box energies / deltas for each scale
level with shape (N, num_base_priors * 4, H, W).
iou_pred (Tensor) – IoU prediction for a single scale level, with
shape (N, num_base_priors * 1, H, W).
labels (Tensor) – Labels of each anchor with shape
(N, num_total_anchors).
label_weights (Tensor) – Label weights of each anchor with shape
(N, num_total_anchors).
bbox_targets (Tensor) – BBox regression targets of each anchor with
shape (N, num_total_anchors, 4).
bbox_weights (Tensor) – BBox weights of all anchors in the
image with shape (N, 4).
reweight_factor (List[float]) – Reweight factor for cls and reg
loss.
avg_factor (float) – Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the number
of positive priors.
Compute common vars for regression and classification targets.
Parameters:
anchor_list (List[List[Tensor]]) – anchors of each image.
valid_flag_list (List[List[Tensor]]) – Valid flags of each image.
cls_scores (List[Tensor]) – Classification scores for all scale
levels, each is a 4D-tensor, the channels number is
num_base_priors * num_classes.
bbox_preds (list[Tensor]) – Box energies / deltas for all scale
levels, each is a 4D-tensor, the channels number is
num_base_priors * 4.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
hidden_states (Tensor) – Hidden states output from each decoder
layer, has shape (num_decoder_layers, bs, num_queries_total,
dim), where num_queries_total is the sum of
num_denoising_queries, num_queries and num_dense_queries
when self.training is True, else num_queries.
references (list[Tensor]) – List of the reference from the decoder.
The first reference is the init_reference (initial) and the
other num_decoder_layers(6) references are inter_references
(intermediate). Each reference has shape (bs,
num_queries_total, 4) with the last dimension arranged as
(cx, cy, w, h).
Returns:
results of head containing the following tensors.
all_layers_outputs_classes (Tensor): Outputs from the
classification head, has shape (num_decoder_layers, bs,
num_queries_total, cls_out_channels).
all_layers_outputs_coords (Tensor): Sigmoid outputs from the
regression head with normalized coordinate format (cx, cy, w,
h), has shape (num_decoder_layers, bs, num_queries_total, 4)
with the last dimension arranged as (cx, cy, w, h).
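The rule for num_queries_total described above can be sketched as follows. This is an illustration of the bookkeeping only; the function names are hypothetical and the query counts in the example are made up.

```python
def num_queries_total(num_queries, num_denoising_queries=0,
                      num_dense_queries=0, training=True):
    """Length of the query dimension of hidden_states."""
    if training:
        return num_denoising_queries + num_queries + num_dense_queries
    return num_queries


def outputs_shape(num_decoder_layers, bs, total_queries, last_dim):
    """Shape of a stacked per-layer output such as the class scores."""
    return (num_decoder_layers, bs, total_queries, last_dim)


train_total = num_queries_total(900, num_denoising_queries=200,
                                num_dense_queries=1500, training=True)
test_total = num_queries_total(900, num_denoising_queries=200,
                               num_dense_queries=1500, training=False)
coords_shape = outputs_shape(6, 2, test_total, 4)
```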
Perform forward propagation and loss calculation of the detection
head on the queries of the upstream network.
Parameters:
hidden_states (Tensor) – Hidden states output from each decoder
layer, has shape (num_decoder_layers, bs, num_queries_total,
dim), where num_queries_total is the sum of
num_denoising_queries, num_queries and num_dense_queries
when self.training is True, else num_queries.
references (list[Tensor]) – List of the reference from the decoder.
The first reference is the init_reference (initial) and the
other num_decoder_layers(6) references are inter_references
(intermediate). Each reference has shape (bs,
num_queries_total, 4) with the last dimension arranged as
(cx, cy, w, h).
enc_outputs_class (Tensor) – The top k classification score of
each point on encoder feature map, has shape (bs, num_queries,
cls_out_channels).
enc_outputs_coord (Tensor) – The proposal generated from points
with top k score, has shape (bs, num_queries, 4) with the
last dimension arranged as (cx, cy, w, h).
batch_data_samples (list[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
dn_meta (Dict[str, int]) – The dictionary saves information about
group collation, including ‘num_denoising_queries’ and
‘num_denoising_groups’. It is used to split the outputs of the
denoising and matching parts and to calculate the loss.
aux_enc_outputs_class (Tensor) – The dense_topk classification
score of each point on encoder feature map, has shape (bs,
num_dense_queries, cls_out_channels).
It is None when self.training is False.
aux_enc_outputs_coord (Tensor) – The proposal generated from points
with dense_topk score, has shape (bs, num_dense_queries, 4)
with the last dimension arranged as (cx, cy, w, h).
It is None when self.training is False.
all_layers_cls_scores (Tensor) – Classification scores of all
decoder layers, has shape (num_decoder_layers, bs,
num_queries_total, cls_out_channels).
all_layers_bbox_preds (Tensor) – Bbox coordinates of all decoder
layers. Each has shape (num_decoder_layers, bs,
num_queries_total, 4) with normalized coordinate format
(cx, cy, w, h).
enc_cls_scores (Tensor) – The top k score of each point on
encoder feature map, has shape (bs, num_queries,
cls_out_channels).
enc_bbox_preds (Tensor) – The proposal generated from points
with top k score, has shape (bs, num_queries, 4) with the
last dimension arranged as (cx, cy, w, h).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image,
e.g., image size, scaling factor, etc.
dn_meta (Dict[str, int]) – The dictionary saves information about
group collation, including ‘num_denoising_queries’ and
‘num_denoising_groups’. It is used to split the outputs of the
denoising and matching parts and to calculate the loss.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Calculate the loss of distinct queries, that is, excluding denoising
and dense queries. Only select the distinct queries in decoder for
loss.
Parameters:
all_layers_cls_scores (Tensor) – Classification scores of all
decoder layers, has shape (num_decoder_layers, bs,
num_queries, cls_out_channels).
all_layers_bbox_preds (Tensor) – Bbox coordinates of all decoder
layers. It has shape (num_decoder_layers, bs,
num_queries, 4) with the last dimension arranged as
(cx, cy, w, h).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image,
e.g., image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Transform a batch of output features extracted from the head into
bbox results.
Parameters:
layer_cls_scores (Tensor) – Classification scores of all
decoder layers, has shape (num_decoder_layers, bs,
num_queries, cls_out_channels).
layer_bbox_preds (Tensor) – Bbox coordinates of all decoder layers.
Each has shape (num_decoder_layers, bs, num_queries, 4)
with normalized coordinate format (cx, cy, w, h).
batch_img_metas (list[dict]) – Meta information of each image.
rescale (bool, optional) – If True, return boxes in original
image space. Default False.
Returns:
list[InstanceData]: Detection results of each image
after the post process.
hidden_states (Tensor) – Features from transformer decoder. If
return_intermediate_dec in detr.py is True output has shape
(num_decoder_layers, bs, num_queries, dim), else has shape
(1, bs, num_queries, dim) which only contains the last layer
outputs.
Returns:
results of head containing the following tensor.
layers_cls_scores (Tensor): Outputs from the classification head,
shape (num_decoder_layers, bs, num_queries, cls_out_channels).
Note cls_out_channels should include background.
layers_bbox_preds (Tensor): Sigmoid outputs from the regression
head with normalized coordinate format (cx, cy, w, h), has shape
(num_decoder_layers, bs, num_queries, 4).
Compute regression and classification targets for a batch image.
Outputs from a single decoder layer of a single feature level are used.
Parameters:
cls_scores_list (list[Tensor]) – Box score logits from a single
decoder layer for each image, has shape [num_queries,
cls_out_channels].
bbox_preds_list (list[Tensor]) – Sigmoid outputs from a single
decoder layer for each image, with normalized coordinate
(cx, cy, w, h) and shape [num_queries, 4].
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
Returns:
a tuple containing the following targets.
labels_list (list[Tensor]): Labels for all images.
label_weights_list (list[Tensor]): Label weights for all images.
bbox_targets_list (list[Tensor]): BBox targets for all images.
bbox_weights_list (list[Tensor]): BBox weights for all images.
num_total_pos (int): Number of positive samples in all images.
num_total_neg (int): Number of negative samples in all images.
Perform forward propagation and loss calculation of the detection
head on the features of the upstream network.
Parameters:
hidden_states (Tensor) – Feature from the transformer decoder, has
shape (num_decoder_layers, bs, num_queries, cls_out_channels)
or (num_decoder_layers, num_queries, bs, cls_out_channels).
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
Perform forward propagation of the head, then calculate loss and
predictions from the features and data samples. This method is
overridden because img_metas are needed as inputs for bbox_head.
Parameters:
hidden_states (tuple[Tensor]) – Feature from the transformer
decoder, has shape (num_decoder_layers, bs, num_queries, dim).
batch_data_samples (list[DetDataSample]) – Each item contains
the meta information of each image and corresponding
annotations.
Returns:
A tuple containing:
losses (dict[str, Tensor]): A dictionary of loss components.
predictions (list[InstanceData]): Detection
results of each image after the post process.
Only outputs from the last feature level are used for computing
losses by default.
Parameters:
all_layers_cls_scores (Tensor) – Classification outputs
of all decoder layers, a 4D-tensor with shape
(num_decoder_layers, bs, num_queries, cls_out_channels).
all_layers_bbox_preds (Tensor) – Sigmoid regression
outputs of all decoder layers, a 4D-tensor with
normalized coordinate format (cx, cy, w, h) and shape
(num_decoder_layers, bs, num_queries, 4).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
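When losses are computed for every decoder layer, one way to report them is a flat dictionary in which the last layer keeps the plain names and earlier layers get a prefix. The sketch below is an assumption for illustration: the "d{i}." prefix and the function name are hypothetical, and the inputs are plain floats rather than tensors.

```python
def collect_layer_losses(losses_cls, losses_bbox, losses_iou):
    """Gather per-decoder-layer losses into one flat dict.

    Each argument holds one value per decoder layer, last layer last.
    """
    loss_dict = {
        "loss_cls": losses_cls[-1],
        "loss_bbox": losses_bbox[-1],
        "loss_iou": losses_iou[-1],
    }
    # Earlier layers act as auxiliary losses under a hypothetical prefix.
    earlier = zip(losses_cls[:-1], losses_bbox[:-1], losses_iou[:-1])
    for i, (cls_i, bbox_i, iou_i) in enumerate(earlier):
        loss_dict[f"d{i}.loss_cls"] = cls_i
        loss_dict[f"d{i}.loss_bbox"] = bbox_i
        loss_dict[f"d{i}.loss_iou"] = iou_i
    return loss_dict


layer_losses = collect_layer_losses([3.0, 2.0, 1.0],
                                    [0.3, 0.2, 0.1],
                                    [0.6, 0.5, 0.4])
```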
Loss function for outputs from a single decoder layer of a single
feature level.
Parameters:
cls_scores (Tensor) – Box score logits from a single decoder layer
for all images, has shape (bs, num_queries, cls_out_channels).
bbox_preds (Tensor) – Sigmoid outputs from a single decoder layer
for all images, with normalized coordinate (cx, cy, w, h) and
shape (bs, num_queries, 4).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
Returns:
A tuple including loss_cls, loss_box and
loss_iou.
Perform forward propagation of the detection head and predict
detection results on the features of the upstream network. This method
is overridden because img_metas are needed as inputs for bbox_head.
Parameters:
hidden_states (tuple[Tensor]) – Multi-level features from the
upstream network, each is a 4D-tensor.
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
rescale (bool, optional) – Whether to rescale the results.
Defaults to True.
Returns:
list[InstanceData]: Detection results of each image
after the post process.
Transform network outputs for a batch into bbox predictions.
Parameters:
layer_cls_scores (Tensor) – Classification outputs of the last or
all decoder layers, a 4D-tensor with shape
(num_decoder_layers, bs, num_queries, cls_out_channels).
layer_bbox_preds (Tensor) – Sigmoid regression outputs of the last
or all decoder layers, a 4D-tensor with normalized
coordinate format (cx, cy, w, h) and shape
(num_decoder_layers, bs, num_queries, 4).
batch_img_metas (list[dict]) – Meta information of each image.
rescale (bool, optional) – If True, return boxes in original
image space. Defaults to True.
Returns:
Object detection results of each image
after the post process. Each item usually contains following keys.
scores (Tensor): Classification scores, has a shape
(num_instances, ).
labels (Tensor): Labels of bboxes, has a shape
(num_instances, ).
bboxes (Tensor): Has a shape (num_instances, 4), with
the last dimension arranged as (x1, y1, x2, y2).
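The conversion from normalized (cx, cy, w, h) predictions to (x1, y1, x2, y2) boxes can be sketched as below. This is a simplified illustration on plain tuples, not the library's implementation: the function names are hypothetical, img_shape is assumed to be the (height, width) of the network input, and scale_factor is assumed to be a (w_scale, h_scale) pair that is divided out when rescale is requested.

```python
def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def to_pixel_xyxy(norm_box, img_shape, scale_factor=None):
    """Map a normalized box to pixels, optionally in the original image space."""
    img_h, img_w = img_shape
    x1, y1, x2, y2 = cxcywh_to_xyxy(norm_box)
    # Scale to the network input size and clip to the image border.
    x1, x2 = (min(max(v * img_w, 0.0), img_w) for v in (x1, x2))
    y1, y2 = (min(max(v * img_h, 0.0), img_h) for v in (y1, y2))
    if scale_factor is not None:
        # rescale=True: undo the resize applied during preprocessing.
        w_scale, h_scale = scale_factor
        x1, x2 = x1 / w_scale, x2 / w_scale
        y1, y2 = y1 / h_scale, y2 / h_scale
    return (x1, y1, x2, y2)


box_input_space = to_pixel_xyxy((0.5, 0.5, 0.5, 0.5), (100, 200))
box_original_space = to_pixel_xyxy((0.5, 0.5, 0.5, 0.5), (100, 200),
                                   scale_factor=(2.0, 2.0))
```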
Get targets in denoising part for a batch of images.
Parameters:
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
dn_meta (Dict[str, int]) – The dictionary saves information about
group collation, including ‘num_denoising_queries’ and
‘num_denoising_groups’. It is used to split the outputs of the
denoising and matching parts and to calculate the loss.
Returns:
a tuple containing the following targets.
labels_list (list[Tensor]): Labels for all images.
label_weights_list (list[Tensor]): Label weights for all images.
bbox_targets_list (list[Tensor]): BBox targets for all images.
bbox_weights_list (list[Tensor]): BBox weights for all images.
num_total_pos (int): Number of positive samples in all images.
num_total_neg (int): Number of negative samples in all images.
Perform forward propagation and loss calculation of the detection
head on the queries of the upstream network.
Parameters:
hidden_states (Tensor) – Hidden states output from each decoder
layer, has shape (num_decoder_layers, bs, num_queries_total,
dim), where num_queries_total is the sum of
num_denoising_queries and num_matching_queries when
self.training is True, else num_matching_queries.
references (list[Tensor]) – List of the reference from the decoder.
The first reference is the init_reference (initial) and the
other num_decoder_layers(6) references are inter_references
(intermediate). The init_reference has shape (bs,
num_queries_total, 4) and each inter_reference has shape
(bs, num_queries, 4) with the last dimension arranged as
(cx, cy, w, h).
enc_outputs_class (Tensor) – The score of each point on the encoder
feature map, has shape (bs, num_feat_points, cls_out_channels).
enc_outputs_coord (Tensor) – The proposal generated from the
encoder feature map, has shape (bs, num_feat_points, 4) with the
last dimension arranged as (cx, cy, w, h).
batch_data_samples (list[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
dn_meta (Dict[str, int]) – The dictionary saves information about
group collation, including ‘num_denoising_queries’ and
‘num_denoising_groups’. It is used to split the outputs of the
denoising and matching parts and to calculate the loss.
all_layers_cls_scores (Tensor) – Classification scores of all
decoder layers, has shape (num_decoder_layers, bs,
num_queries_total, cls_out_channels), where
num_queries_total is the sum of num_denoising_queries
and num_matching_queries.
all_layers_bbox_preds (Tensor) – Regression outputs of all decoder
layers. Each is a 4D-tensor with normalized coordinate format
(cx, cy, w, h) and has shape (num_decoder_layers, bs,
num_queries_total, 4).
enc_cls_scores (Tensor) – The score of each point on the encoder
feature map, has shape (bs, num_feat_points, cls_out_channels).
enc_bbox_preds (Tensor) – The proposal generated from the encoder
feature map, has shape (bs, num_feat_points, 4) with the last
dimension arranged as (cx, cy, w, h).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
dn_meta (Dict[str, int]) – The dictionary saves information about
group collation, including ‘num_denoising_queries’ and
‘num_denoising_groups’. It is used to split the outputs of the
denoising and matching parts and to calculate the loss.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
all_layers_denoising_cls_scores (Tensor) – Classification scores of
all decoder layers in denoising part, has shape (
num_decoder_layers, bs, num_denoising_queries,
cls_out_channels).
all_layers_denoising_bbox_preds (Tensor) – Regression outputs of all
decoder layers in denoising part. Each is a 4D-tensor with
normalized coordinate format (cx, cy, w, h) and has shape
(num_decoder_layers, bs, num_denoising_queries, 4).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
dn_meta (Dict[str, int]) – The dictionary saves information about
group collation, including ‘num_denoising_queries’ and
‘num_denoising_groups’. It is used to split the outputs of the
denoising and matching parts and to calculate the loss.
Returns:
The loss_dn_cls, loss_dn_bbox, and loss_dn_iou
of each decoder layer.
Split outputs of the denoising part and the matching part.
The total outputs have length num_queries_total. The first
num_denoising_queries outputs come from denoising queries, and
the remaining num_matching_queries outputs come from matching queries,
where num_queries_total is the sum of num_denoising_queries and
num_matching_queries.
Parameters:
all_layers_cls_scores (Tensor) – Classification scores of all
decoder layers, has shape (num_decoder_layers, bs,
num_queries_total, cls_out_channels).
all_layers_bbox_preds (Tensor) – Regression outputs of all decoder
layers. Each is a 4D-tensor with normalized coordinate format
(cx, cy, w, h) and has shape (num_decoder_layers, bs,
num_queries_total, 4).
dn_meta (Dict[str, int]) – The dictionary saves information about
group collation, including ‘num_denoising_queries’ and
‘num_denoising_groups’.
Returns:
a tuple containing the following outputs.
all_layers_matching_cls_scores (Tensor): Classification scores
of all decoder layers in matching part, has shape
(num_decoder_layers, bs, num_matching_queries, cls_out_channels).
all_layers_matching_bbox_preds (Tensor): Regression outputs of
all decoder layers in matching part. Each is a 4D-tensor with
normalized coordinate format (cx, cy, w, h) and has shape
(num_decoder_layers, bs, num_matching_queries, 4).
all_layers_denoising_cls_scores (Tensor): Classification scores
of all decoder layers in denoising part, has shape
(num_decoder_layers, bs, num_denoising_queries,
cls_out_channels).
all_layers_denoising_bbox_preds (Tensor): Regression outputs of
all decoder layers in denoising part. Each is a 4D-tensor with
normalized coordinate format (cx, cy, w, h) and has shape
(num_decoder_layers, bs, num_denoising_queries, 4).
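The split described above amounts to slicing the query dimension at num_denoising_queries. The sketch below shows this on nested Python lists laid out as (num_decoder_layers, bs, num_queries_total, channels). It is an illustration only: the function name is hypothetical, and a real implementation would slice tensors instead.

```python
def split_denoising_and_matching(all_layers_outputs, dn_meta):
    """Split nested lists along the query dimension (the third one)."""
    num_dn = dn_meta["num_denoising_queries"]
    denoising = [[img[:num_dn] for img in layer]
                 for layer in all_layers_outputs]
    matching = [[img[num_dn:] for img in layer]
                for layer in all_layers_outputs]
    return matching, denoising


# One decoder layer, one image, five queries with a single channel each.
outputs = [[[[0.0], [1.0], [2.0], [3.0], [4.0]]]]
dn_meta = {"num_denoising_queries": 2, "num_denoising_groups": 1}
matching_part, denoising_part = split_denoising_and_matching(outputs, dn_meta)
```

Here the first two queries land in the denoising part and the remaining three in the matching part.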
hidden_states (Tensor) – Hidden states output from each decoder
layer, has shape (num_decoder_layers, bs, num_queries, dim).
references (list[Tensor]) – List of the reference from the decoder.
The first reference is the init_reference (initial) and the
other num_decoder_layers(6) references are inter_references
(intermediate). The init_reference has shape (bs,
num_queries, 4) when as_two_stage of the detector is True,
otherwise (bs, num_queries, 2). Each inter_reference has
shape (bs, num_queries, 4) when with_box_refine of the
detector is True, otherwise (bs, num_queries, 2). The
coordinates are arranged as (cx, cy) when the last dimension is
2, and (cx, cy, w, h) when it is 4.
Returns:
results of head containing the following tensor.
all_layers_outputs_classes (Tensor): Outputs from the
classification head, has shape (num_decoder_layers, bs,
num_queries, cls_out_channels).
all_layers_outputs_coords (Tensor): Sigmoid outputs from the
regression head with normalized coordinate format (cx, cy, w,
h), has shape (num_decoder_layers, bs, num_queries, 4) with the
last dimension arranged as (cx, cy, w, h).
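The dependence of the reference shapes on as_two_stage and with_box_refine can be summarized with a small helper. This is a sketch of the rule stated above, with a hypothetical function name; it returns only the size of the last dimension of each reference.

```python
def reference_last_dims(as_two_stage, with_box_refine, num_decoder_layers=6):
    """Last-dimension size of init_reference followed by each inter_reference."""
    init_dim = 4 if as_two_stage else 2          # (cx, cy, w, h) or (cx, cy)
    inter_dim = 4 if with_box_refine else 2
    return [init_dim] + [inter_dim] * num_decoder_layers


plain = reference_last_dims(as_two_stage=False, with_box_refine=False)
refine_only = reference_last_dims(as_two_stage=False, with_box_refine=True)
two_stage = reference_last_dims(as_two_stage=True, with_box_refine=True)
```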
Perform forward propagation and loss calculation of the detection
head on the queries of the upstream network.
Parameters:
hidden_states (Tensor) – Hidden states output from each decoder
layer, has shape (num_decoder_layers, num_queries, bs, dim).
references (list[Tensor]) – List of the reference from the decoder.
The first reference is the init_reference (initial) and the
other num_decoder_layers(6) references are inter_references
(intermediate). The init_reference has shape (bs,
num_queries, 4) when as_two_stage of the detector is True,
otherwise (bs, num_queries, 2). Each inter_reference has
shape (bs, num_queries, 4) when with_box_refine of the
detector is True, otherwise (bs, num_queries, 2). The
coordinates are arranged as (cx, cy) when the last dimension is
2, and (cx, cy, w, h) when it is 4.
enc_outputs_class (Tensor) – The score of each point on the encoder
feature map, has shape (bs, num_feat_points, cls_out_channels).
It is passed in only when as_two_stage is True,
otherwise it is None.
enc_outputs_coord (Tensor) – The proposal generated from the encoder
feature map, has shape (bs, num_feat_points, 4) with the last
dimension arranged as (cx, cy, w, h). It is passed in only when
as_two_stage is True, otherwise it is None.
batch_data_samples (list[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
all_layers_cls_scores (Tensor) – Classification scores of all
decoder layers, has shape (num_decoder_layers, bs, num_queries,
cls_out_channels).
all_layers_bbox_preds (Tensor) – Regression outputs of all decoder
layers. Each is a 4D-tensor with normalized coordinate format
(cx, cy, w, h) and has shape (num_decoder_layers, bs,
num_queries, 4) with the last dimension arranged as
(cx, cy, w, h).
enc_cls_scores (Tensor) – The score of each point on the encoder
feature map, has shape (bs, num_feat_points, cls_out_channels).
It is passed in only when as_two_stage is True,
otherwise it is None.
enc_bbox_preds (Tensor) – The proposal generated from the encoder
feature map, has shape (bs, num_feat_points, 4) with the last
dimension arranged as (cx, cy, w, h). It is passed in only when
as_two_stage is True, otherwise it is None.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Perform forward propagation and loss calculation of the detection
head on the queries of the upstream network.
Parameters:
hidden_states (Tensor) – Hidden states output from each decoder
layer, has shape (num_decoder_layers, num_queries, bs, dim).
references (list[Tensor]) – List of the reference from the decoder.
The first reference is the init_reference (initial) and the
other num_decoder_layers(6) references are inter_references
(intermediate). The init_reference has shape (bs,
num_queries, 4) when as_two_stage of the detector is True,
otherwise (bs, num_queries, 2). Each inter_reference has
shape (bs, num_queries, 4) when with_box_refine of the
detector is True, otherwise (bs, num_queries, 2). The
coordinates are arranged as (cx, cy) when the last dimension is
2, and (cx, cy, w, h) when it is 4.
batch_data_samples (list[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
rescale (bool, optional) – If True, return boxes in original
image space. Defaults to True.
Returns:
list[InstanceData]: Detection results of each image
after the post process.
Transform a batch of output features extracted from the head into
bbox results.
Parameters:
all_layers_cls_scores (Tensor) – Classification scores of all
decoder layers, has shape (num_decoder_layers, bs, num_queries,
cls_out_channels).
all_layers_bbox_preds (Tensor) – Regression outputs of all decoder
layers. Each is a 4D-tensor with normalized coordinate format
(cx, cy, w, h) and shape (num_decoder_layers, bs, num_queries,
4) with the last dimension arranged as (cx, cy, w, h).
batch_img_metas (list[dict]) – Meta information of each image.
rescale (bool, optional) – If True, return boxes in original
image space. Default False.
Returns:
list[InstanceData]: Detection results of each image
after the post process.
Unlike traditional RPNHead, this module does not need FPN input, but just
decode init_proposal_bboxes and expand the first dimension of
init_proposal_bboxes and init_proposal_features to the batch_size.
Parameters:
num_proposals (int) – Number of init_proposals. Defaults to 100.
proposal_feature_channel (int) – Channel number of
init_proposal_feature. Defaults to 256.
init_cfg (ConfigDict or dict or list[ConfigDict or dict]) – Initialization config dict. Defaults to None.
The FCOS head does not use anchor boxes. Instead bounding boxes are
predicted at each pixel and a centerness measure is used to suppress
low-quality predictions.
Here norm_on_bbox, centerness_on_reg and dcn_on_last_conv are training
tricks used in the official repo, which bring mAP gains
of up to 4.9. Please see https://github.com/tianzhi0549/FCOS for
more detail.
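The centerness measure can be sketched from its definition in the FCOS paper: for a point with distances (left, top, right, bottom) to the four sides of its ground-truth box, centerness is the square root of the product of the two min/max ratios. The function below is a scalar illustration with a hypothetical name, and it assumes the point lies strictly inside the box so that all distances are positive.

```python
import math


def centerness(left, top, right, bottom):
    """Centerness of a point from its distances to the four box sides."""
    lr_ratio = min(left, right) / max(left, right)
    tb_ratio = min(top, bottom) / max(top, bottom)
    return math.sqrt(lr_ratio * tb_ratio)


center_point = centerness(10.0, 10.0, 10.0, 10.0)   # at the box center
off_center_point = centerness(1.0, 10.0, 9.0, 10.0)  # close to the left edge
```

Points near a box edge get a low value, so multiplying the classification score by centerness down-weights boxes predicted from such locations.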
Parameters:
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
strides (Sequence[int] or Sequence[Tuple[int, int]]) – Strides of points
in multiple feature levels. Defaults to (4, 8, 16, 32, 64).
regress_ranges (Sequence[Tuple[int, int]]) – Regress range of multiple
level points.
center_sampling (bool) – If true, use center sampling.
Defaults to False.
center_sample_radius (float) – Radius of center sampling.
Defaults to 1.5.
norm_on_bbox (bool) – If true, normalize the regression targets with
FPN strides. Defaults to False.
conv_bias (bool or str) – If specified as auto, it will be decided by
the norm_cfg. Bias of conv will be set as True if norm_cfg is
None, otherwise False. Defaults to “auto”.
loss_cls (ConfigDict or dict) – Config of classification loss.
loss_bbox (ConfigDict or dict) – Config of localization loss.
loss_centerness (ConfigDict or dict) – Config of centerness
loss.
norm_cfg (ConfigDict or dict) – Dictionary to construct and
configure the norm layer. Defaults to
norm_cfg=dict(type='GN', num_groups=32, requires_grad=True).
cls_predictor_cfg (ConfigDict or dict) – Dictionary to construct and
configure conv_cls. Defaults to None.
init_cfg (ConfigDict or dict or list[ConfigDict or dict]) – Initialization config dict.
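The "auto" behavior of conv_bias described above can be written as a small resolver. This is a sketch of the stated rule only; the function name is hypothetical and the norm_cfg value in the example mirrors the documented default.

```python
def resolve_conv_bias(conv_bias, norm_cfg):
    """Return the effective bias flag for a conv layer."""
    if conv_bias == "auto":
        # Bias is used only when no norm layer follows the conv.
        return norm_cfg is None
    return bool(conv_bias)


gn_cfg = dict(type="GN", num_groups=32, requires_grad=True)
bias_with_norm = resolve_conv_bias("auto", gn_cfg)
bias_without_norm = resolve_conv_bias("auto", None)
bias_explicit = resolve_conv_bias(True, gn_cfg)
```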
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each is a 4D-tensor, the channel number is
num_points * num_classes.
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level, each is a 4D-tensor, the channel number is
num_points * 4.
centernesses (list[Tensor]) – centerness for each scale level, each
is a 4D-tensor, the channel number is num_points * 1.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
The head contains two subnetworks. The first classifies anchor boxes and
the second regresses deltas for the anchors (num_anchors is 1 for
anchor-free methods).
score_threshold (float, optional) – The score threshold used to calculate
positive recall. If given, predictions with scores lower than this value
are counted as incorrect. Defaults to None.
init_cfg (ConfigDict or dict or list[ConfigDict or dict]) – Initialization config dict.
>>> import torch
>>> self = FSAFHead(11, 7)
>>> x = torch.rand(1, 7, 32, 32)
>>> cls_score, bbox_pred = self.forward_single(x)
>>> # Each anchor predicts a score for each class except background
>>> cls_per_anchor = cls_score.shape[1] / self.num_anchors
>>> box_per_anchor = bbox_pred.shape[1] / self.num_anchors
>>> assert cls_per_anchor == self.num_classes
>>> assert box_per_anchor == 4
cls_scores (list[Tensor]) – Box scores for each scale level,
with shape (N, num_points * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_points * 4, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
assigned_gt_inds (Tensor) – The gt indices that each anchor bbox
is assigned to. -1 denotes a negative anchor, otherwise it is the
gt index (0-based). Shape: (num_anchors, ).
labels (Tensor) – Label assigned to anchors. Shape: (num_anchors, ).
level (int) – The current level index in the pyramid
(0-4 for RetinaNet).
min_levels (Tensor) – The best-matching level for each gt.
Shape: (num_gts, ).
Feature Adaption Module is implemented based on DCN v1.
It uses anchor shape prediction rather than feature map to
predict offsets of deform conv layer.
Parameters:
in_channels (int) – Number of channels in the input feature map.
out_channels (int) – Number of channels in the output feature map.
kernel_size (int) – Deformable conv kernel size. Defaults to 3.
deform_groups (int) – Deformable conv group size. Defaults to 4.
init_cfg (ConfigDict or list[ConfigDict] or dict or list[dict], optional) – Initialization config dict.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each is a 4D-tensor, the channel number is
num_priors * num_classes.
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level, each is a 4D-tensor, the channel number is
num_priors * 4.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
stacked_convs (int) – Number of conv layers in cls and reg tower.
Defaults to 4.
conv_cfg (ConfigDict or dict, optional) – Dictionary to
construct and configure the conv layer. Defaults to None.
norm_cfg (ConfigDict or dict, optional) – Dictionary to
construct and configure the norm layer. Defaults to
norm_cfg=dict(type='GN', num_groups=32, requires_grad=True).
pre_anchor_topk (int) – Number of boxes taken in each bag.
Defaults to 50.
bbox_thr (float) – The threshold of the saturated linear function.
It is usually the same as the IoU threshold used in NMS.
Defaults to 0.6.
gamma (float) – Gamma parameter in focal loss. Defaults to 2.0.
alpha (float) – Alpha parameter in focal loss. Defaults to 0.5.
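To make the role of bbox_thr more concrete, here is a minimal sketch of a saturated linear function of the kind the parameter refers to. It assumes the form described in the FreeAnchor paper, where the upper saturation point max_iou is the largest IoU observed for the object; the function name and the small epsilon guard are illustrative choices, not the library's code.

```python
def saturated_linear(iou: float, bbox_thr: float, max_iou: float) -> float:
    """Map an IoU to [0, 1]: 0 at or below bbox_thr, 1 at or above max_iou."""
    # Assumption: guard against a degenerate interval when max_iou <= bbox_thr.
    upper = max(max_iou, bbox_thr + 1e-12)
    value = (iou - bbox_thr) / (upper - bbox_thr)
    # Clamp to the saturated range.
    return min(max(value, 0.0), 1.0)
```

With bbox_thr=0.6 and max_iou=0.9, an IoU of 0.5 maps to 0, an IoU of 0.9 maps to 1, and values in between grow linearly.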
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level
has shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level
has shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
shape_preds (list[Tensor]) – shape predictions for each scale
level with shape (N, 1, H, W).
loc_preds (list[Tensor]) – location predictions for each scale
level with shape (N, num_anchors * 2, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Generalized Focal Loss: Learning Qualified and Distributed Bounding
Boxes for Dense Object Detection.
The GFL head structure is similar to ATSS; however, GFL uses
1) a joint representation for classification and localization quality, and
2) a flexible General distribution for bounding box locations,
which are supervised by
Quality Focal Loss (QFL) and Distribution Focal Loss (DFL), respectively.
x (tuple[Tensor]) – Features from the upstream network, each is
a 4D-tensor.
Returns:
Usually a tuple of classification scores and bbox predictions.
cls_scores (list[Tensor]): Classification and quality (IoU)
joint scores for all scale levels, each is a 4D-tensor,
the channel number is num_classes.
bbox_preds (list[Tensor]): Box distribution logits for all
scale levels, each is a 4D-tensor, the channel number is
4*(n+1), n is max value of integral set.
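The 4*(n+1) channels can be read as four discrete distributions over the integral set {0, ..., n}, one per box side. The sketch below, written in plain Python rather than with tensors, shows how such logits could be turned into four distances by taking the expectation of each distribution, as described in the Generalized Focal Loss paper. The function names are hypothetical, and any later scaling by the stride is left out.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distribution_to_distance(logits: List[float]) -> float:
    """Expected value of a discrete distribution over {0, ..., n}."""
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))


def decode_box_sides(bbox_logits: List[float], n: int) -> List[float]:
    """Split 4 * (n + 1) logits into four sides and integrate each one."""
    bins = n + 1
    if len(bbox_logits) != 4 * bins:
        raise ValueError("expected 4 * (n + 1) logits")
    return [
        distribution_to_distance(bbox_logits[s * bins:(s + 1) * bins])
        for s in range(4)
    ]
```

A uniform distribution over {0, ..., 16} integrates to 8 for every side, while a sharply peaked distribution integrates to roughly the index of its peak.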
This method is almost the same as AnchorHead.get_targets(). Besides
returning the targets as the parent method does, it also returns the
anchors as the first element of the returned tuple.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Cls and quality scores for each scale
level has shape (N, num_classes, H, W).
bbox_preds (list[Tensor]) – Box distribution logits for each scale
level with shape (N, 4*(n+1), H, W), n is max value of integral
set.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Calculate the loss of a single scale level based on the features
extracted by the detection head.
Parameters:
anchors (Tensor) – Box reference for each scale level with shape
(N, num_total_anchors, 4).
cls_score (Tensor) – Cls and quality joint scores for each scale
level has shape (N, num_classes, H, W).
bbox_pred (Tensor) – Box distribution logits for each scale
level with shape (N, 4*(n+1), H, W), n is max value of integral
set.
labels (Tensor) – Labels of each anchor with shape
(N, num_total_anchors).
label_weights (Tensor) – Label weights of each anchor with shape
(N, num_total_anchors).
bbox_targets (Tensor) – BBox regression targets of each anchor with
shape (N, num_total_anchors, 4).
stride (Tuple[int]) – Stride in this scale level.
avg_factor (int) – Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the number
of positive priors.
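The effect of avg_factor can be sketched as a weighted sum divided by the factor. This is a simplified, plain-Python stand-in for a tensor reduction; the function name and the lower bound of 1.0 on the divisor are assumptions added for illustration.

```python
from typing import Sequence


def reduce_loss(
    losses: Sequence[float],
    weights: Sequence[float],
    avg_factor: float,
) -> float:
    """Weighted sum of per-prior losses divided by avg_factor."""
    total = sum(loss * weight for loss, weight in zip(losses, weights))
    # Assumption: avoid division by zero when no prior contributes.
    return total / max(avg_factor, 1.0)
```

With a pseudo sampler, avg_factor would typically be the number of positive priors, so priors with zero weight add nothing to the numerator and are not counted in the denominator either.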
hidden_states (Tensor) – Hidden states output from each decoder
layer, has shape (num_decoder_layers, bs, num_queries, dim).
references (List[Tensor]) – List of the reference from the decoder.
The first reference is the init_reference (initial) and the
other num_decoder_layers(6) references are inter_references
(intermediate). The init_reference has shape (bs,
num_queries, 4) when as_two_stage of the detector is True,
otherwise (bs, num_queries, 2). Each inter_reference has
shape (bs, num_queries, 4) when with_box_refine of the
detector is True, otherwise (bs, num_queries, 2). The
coordinates are arranged as (cx, cy) when the last dimension is
2, and (cx, cy, w, h) when it is 4.
memory_text (Tensor) – Memory text. It has shape (bs, len_text,
text_embed_dims).
text_token_mask (Tensor) – Text token mask. It has shape (bs,
len_text).
Returns:
results of head containing the following tensor.
all_layers_outputs_classes (Tensor): Outputs from the
classification head, has shape (num_decoder_layers, bs,
num_queries, cls_out_channels).
all_layers_outputs_coords (Tensor): Sigmoid outputs from the
regression head in normalized coordinate format, has shape
(num_decoder_layers, bs, num_queries, 4) with the
last dimension arranged as (cx, cy, w, h).
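Since the regression outputs are normalized (cx, cy, w, h) boxes, a conversion to absolute corner coordinates is usually needed before they can be drawn or evaluated. The following sketch shows that conversion for a single box; the function name and the use of plain tuples instead of tensors are assumptions for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def cxcywh_to_xyxy(box: Box, img_w: float, img_h: float) -> Box:
    """Convert a normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    x1 = (cx - 0.5 * w) * img_w
    y1 = (cy - 0.5 * h) * img_h
    x2 = (cx + 0.5 * w) * img_w
    y2 = (cy + 0.5 * h) * img_h
    return (x1, y1, x2, y2)
```

For example, a centered box covering half of each dimension of a 200x100 image becomes (50, 25, 150, 75).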
Perform forward propagation and loss calculation of the detection
head on the queries of the upstream network.
Parameters:
hidden_states (Tensor) – Hidden states output from each decoder
layer, has shape (num_decoder_layers, bs, num_queries_total,
dim), where num_queries_total is the sum of
num_denoising_queries and num_matching_queries when
self.training is True, else num_matching_queries.
references (list[Tensor]) – List of the reference from the decoder.
The first reference is the init_reference (initial) and the
other num_decoder_layers(6) references are inter_references
(intermediate). The init_reference has shape (bs,
num_queries_total, 4) and each inter_reference has shape
(bs, num_queries, 4) with the last dimension arranged as
(cx, cy, w, h).
memory_text (Tensor) – Memory text. It has shape (bs, len_text,
text_embed_dims).
enc_outputs_class (Tensor) – The score of each point on the encoder
feature map, has shape (bs, num_feat_points, cls_out_channels).
enc_outputs_coord (Tensor) – The proposals generated from the
encoder feature map, has shape (bs, num_feat_points, 4) with the
last dimension arranged as (cx, cy, w, h).
batch_data_samples (list[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
dn_meta (Dict[str, int]) – The dictionary saves information about
group collation, including ‘num_denoising_queries’ and
‘num_denoising_groups’. It is used to split the outputs into
denoising and matching parts and to calculate the loss.
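The split driven by dn_meta can be sketched as a slice along the query dimension. The example below uses nested lists indexed as [layer][batch][query] in place of tensors, and assumes that the denoising queries come first along the query dimension; the function name split_outputs_sketch is hypothetical.

```python
from typing import Dict, List, Tuple

Nested = List[List[List[str]]]


def split_outputs_sketch(
    all_layers_outputs: Nested,
    dn_meta: Dict[str, int],
) -> Tuple[Nested, Nested]:
    """Split outputs into (denoising part, matching part) per layer and image."""
    num_dn = dn_meta['num_denoising_queries']
    # Assumption: the first num_dn queries of every image are denoising queries.
    denoising = [[img[:num_dn] for img in layer] for layer in all_layers_outputs]
    matching = [[img[num_dn:] for img in layer] for layer in all_layers_outputs]
    return denoising, matching
```

Each part can then be passed to its own loss computation, with the matching part going through the usual query-to-target assignment.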
Loss function for outputs from a single decoder layer of a single
feature level.
Parameters:
cls_scores (Tensor) – Box score logits from a single decoder layer
for all images, has shape (bs, num_queries, cls_out_channels).
bbox_preds (Tensor) – Sigmoid outputs from a single decoder layer
for all images, with normalized coordinate (cx, cy, w, h) and
shape (bs, num_queries, 4).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
Returns:
A tuple including loss_cls, loss_box and
loss_iou.
Perform forward propagation and loss calculation of the detection
head on the queries of the upstream network.
Parameters:
hidden_states (Tensor) – Hidden states output from each decoder
layer, has shape (num_decoder_layers, bs, num_queries, dim).
references (List[Tensor]) – List of the reference from the decoder.
The first reference is the init_reference (initial) and the
other num_decoder_layers(6) references are inter_references
(intermediate). The init_reference has shape (bs,
num_queries, 4) when as_two_stage of the detector is True,
otherwise (bs, num_queries, 2). Each inter_reference has
shape (bs, num_queries, 4) when with_box_refine of the
detector is True, otherwise (bs, num_queries, 2). The
coordinates are arranged as (cx, cy) when the last dimension is
2, and (cx, cy, w, h) when it is 4.
memory_text (Tensor) – Memory text. It has shape (bs, len_text,
text_embed_dims).
text_token_mask (Tensor) – Text token mask. It has shape (bs,
len_text).
batch_data_samples (SampleList) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
rescale (bool, optional) – If True, return boxes in original
image space. Defaults to True.
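What rescale=True amounts to can be sketched as dividing box coordinates by the resize factors recorded for the image. The sketch assumes the scale factor is stored as a (w_scale, h_scale) pair and that boxes are (x1, y1, x2, y2) tuples; the function name is illustrative.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]


def rescale_boxes(boxes: List[Box], scale_factor: Tuple[float, float]) -> List[Box]:
    """Map boxes from the resized input image back to the original image."""
    w_scale, h_scale = scale_factor
    return [
        (x1 / w_scale, y1 / h_scale, x2 / w_scale, y2 / h_scale)
        for x1, y1, x2, y2 in boxes
    ]
```

If the image was enlarged by 2 horizontally and shrunk by 0.5 vertically before inference, a predicted box (100, 50, 200, 100) maps back to (50, 100, 100, 200).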
Transform a batch of output features extracted from the head into
bbox results.
Parameters:
all_layers_cls_scores (Tensor) – Classification scores of all
decoder layers, has shape (num_decoder_layers, bs, num_queries,
cls_out_channels).
all_layers_bbox_preds (Tensor) – Regression outputs of all decoder
layers in normalized coordinate format, has shape
(num_decoder_layers, bs, num_queries, 4) with the last
dimension arranged as (cx, cy, w, h).
Guided-Anchor-based head (GA-RPN, GA-RetinaNet, etc.).
This GuidedAnchorHead will predict high-quality feature guided
anchors and locations where anchors will be kept in inference.
There are mainly 3 categories of bounding boxes:
Sampled 9 pairs for target assignment (approxes).
The square boxes on which the predicted anchors are based (squares).
Guided anchors.
in_channels (int) – Number of channels in the input feature map.
feat_channels (int) – Number of hidden channels. Defaults to 256.
approx_anchor_generator (ConfigDict or dict) – Config dict
for approx generator
square_anchor_generator (ConfigDict or dict) – Config dict
for square generator
anchor_coder (ConfigDict or dict) – Config dict for anchor coder
bbox_coder (ConfigDict or dict) – Config dict for bbox coder
reg_decoded_bbox (bool) – If true, the regression loss would be
applied directly on decoded bounding boxes, converting both
the predicted boxes and regression targets to absolute
coordinates format. Defaults to False. It should be True when
using IoULoss, GIoULoss, or DIoULoss in the bbox head.
deform_groups (int) – Group number of DCN in the FeatureAdaption module.
Defaults to 4.
loc_filter_thr (float) – Threshold to filter out unconcerned regions.
Defaults to 0.01.
loss_loc (ConfigDict or dict) – Config of location loss.
loss_shape (ConfigDict or dict) – Config of anchor shape loss.
loss_cls (ConfigDict or dict) – Config of classification loss.
loss_bbox (ConfigDict or dict) – Config of bbox regression loss.
init_cfg (ConfigDict or list[ConfigDict] or dict or list[dict], optional) – Initialization config dict.
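The role of loc_filter_thr can be pictured as a per-location mask over the location predictions. The sketch below assumes the location branch produces logits that are passed through a sigmoid before thresholding; the function names are hypothetical and nested lists stand in for a feature map.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def location_mask(
    loc_logits: List[List[float]],
    loc_filter_thr: float = 0.01,
) -> List[List[bool]]:
    """Mark the locations whose predicted probability reaches the threshold."""
    return [[sigmoid(v) >= loc_filter_thr for v in row] for row in loc_logits]
```

Locations marked False would be treated as unconcerned regions, so no anchors need to be kept there at inference time.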
approx_list (list[list[Tensor]]) – Multi level approxs of each
image.
inside_flag_list (list[list[Tensor]]) – Multi level inside flags
of each image.
square_list (list[list[Tensor]]) – Multi level squares of each
image.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
unmap_outputs (bool) – Whether to unmap outputs. Defaults to None.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level
has shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
shape_preds (list[Tensor]) – shape predictions for each scale
level with shape (N, 1, H, W).
loc_preds (list[Tensor]) – location predictions for each scale
level with shape (N, num_anchors * 2, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
cls_scores (list[Tensor]) – Box scores for each scale level
with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
iou_preds (list[Tensor]) – IoU predictions for each scale
level with shape (N, num_anchors * 1, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Returns:
Returns a tuple containing label assignment variables.
labels (Tensor): Labels of all anchors, each with
shape (num_anchors,).
labels_weight (Tensor): Label weights of all anchors,
each with shape (num_anchors,).
bboxes_target (Tensor): BBox targets of all anchors,
each with shape (num_anchors, 4).
bboxes_weight (Tensor): BBox weights of all anchors,
each with shape (num_anchors, 4).
pos_inds_flatten (Tensor): Indices of all positive
samples among all anchors.
Forward train with the available label assignment (student receives
from teacher).
Parameters:
x (list[Tensor]) – Features from FPN.
label_assignment_results (tuple) – As the outputs defined in the
function self.get_label_assignment.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Returns:
(dict[str, Tensor]): A dictionary of loss components.
cls_scores (list[Tensor]) – Box scores for each scale level
with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
iou_preds (list[Tensor]) – IoU predictions for each scale
level with shape (N, num_anchors * 1, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
label_assignment_results (tuple, optional) – As the outputs defined
in the function self.get_label_assignment.
out_teacher (tuple[Tensor]) – The output of teacher.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Returns:
The loss components and proposals of each image.
losses (dict[str, Tensor]): A dictionary of loss components.
proposal_list (list[Tensor]): Proposals of each image.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
avg_factor (int) – Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the number
of positive priors.
in_channels (list[int]) – Number of channels in the input feature map.
feat_channels (int) – Number of channels for features.
out_channels (int) – Number of channels for output.
num_things_classes (int) – Number of things.
num_stuff_classes (int) – Number of stuff.
num_queries (int) – Number of queries in the Transformer decoder.
pixel_decoder (ConfigDict or dict) – Config for pixel
decoder. Defaults to None.
enforce_decoder_input_project (bool, optional) – Whether to add
a layer to change the embed_dim of transformer encoder in
pixel decoder to the embed_dim of transformer decoder.
Defaults to False.
transformer_decoder (ConfigDict or dict) – Config for
transformer decoder. Defaults to None.
positional_encoding (ConfigDict or dict) – Config for
transformer decoder position encoding. Defaults to
dict(num_feats=128, normalize=True).
loss_cls (ConfigDict or dict) – Config of the classification
loss. Defaults to None.
loss_mask (ConfigDict or dict) – Config of the mask loss.
Defaults to None.
loss_dice (ConfigDict or dict) – Config of the dice loss.
Defaults to None.
train_cfg (ConfigDict or dict, optional) – Training config of
Mask2Former head.
test_cfg (ConfigDict or dict, optional) – Testing config of
Mask2Former head.
init_cfg (ConfigDict or dict or list[ConfigDict or dict], optional) – Initialization config dict. Defaults to None.
x (list[Tensor]) – Multi scale Features from the
upstream network, each is a 4D-tensor.
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
Returns:
A tuple containing two elements.
cls_pred_list (list[Tensor]): Classification logits for each decoder layer. Each is a 3D-tensor with shape (batch_size, num_queries, cls_out_channels). Note cls_out_channels should include background.
mask_pred_list (list[Tensor]): Mask logits for each decoder layer. Each with shape (batch_size, num_queries, h, w).
in_channels (list[int]) – Number of channels in the input feature map.
feat_channels (int) – Number of channels for feature.
out_channels (int) – Number of channels for output.
num_things_classes (int) – Number of things.
num_stuff_classes (int) – Number of stuff.
num_queries (int) – Number of queries in the Transformer.
pixel_decoder (ConfigDict or dict) – Config for pixel
decoder.
enforce_decoder_input_project (bool) – Whether to add a layer
to change the embed_dim of transformer encoder in pixel decoder to
the embed_dim of transformer decoder. Defaults to False.
transformer_decoder (ConfigDict or dict) – Config for
transformer decoder.
positional_encoding (ConfigDict or dict) – Config for
transformer decoder position encoding.
loss_cls (ConfigDict or dict) – Config of the classification
loss. Defaults to CrossEntropyLoss.
loss_mask (ConfigDict or dict) – Config of the mask loss.
Defaults to FocalLoss.
loss_dice (ConfigDict or dict) – Config of the dice loss.
Defaults to DiceLoss.
train_cfg (ConfigDict or dict, optional) – Training config of
MaskFormer head.
test_cfg (ConfigDict or dict, optional) – Testing config of
MaskFormer head.
init_cfg (ConfigDict or dict or list[ConfigDict or dict], optional) – Initialization config dict. Defaults to None.
x (tuple[Tensor]) – Features from the upstream network, each
is a 4D-tensor.
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
Returns:
A tuple containing two elements.
all_cls_scores (Tensor): Classification scores for each decoder layer, a 4D-tensor with shape (num_decoder, batch_size, num_queries, cls_out_channels). Note cls_out_channels should include background.
all_mask_preds (Tensor): Mask scores for each decoder layer, with shape (num_decoder, batch_size, num_queries, h, w).
Compute classification and mask targets for all images for a decoder
layer.
Parameters:
cls_scores_list (list[Tensor]) – Classification score logits from a single
decoder layer for all images. Each with shape (num_queries,
cls_out_channels).
mask_preds_list (list[Tensor]) – Mask logits from a single decoder
layer for all images. Each with shape (num_queries, h, w).
batch_gt_instances (list[InstanceData]) – Each contains
labels and masks.
batch_img_metas (list[dict]) – List of image meta information.
return_sampling_results (bool) – Whether to return the sampling
results. Defaults to False.
Returns:
A tuple containing the following targets.
labels_list (list[Tensor]): Labels of all images. Each with shape (num_queries, ).
label_weights_list (list[Tensor]): Label weights of all images. Each with shape (num_queries, ).
mask_targets_list (list[Tensor]): Mask targets of all images. Each with shape (num_queries, h, w).
mask_weights_list (list[Tensor]): Mask weights of all images. Each with shape (num_queries, ).
avg_factor (int): Average factor that is used to average the loss. When using sampling method, avg_factor is
usually the sum of positive and negative priors. When
using MaskPseudoSampler, avg_factor is usually equal
to the number of positive priors.
additional_returns: This function enables user-defined returns from
self._get_targets_single. These returns are currently refined
to properties at each feature map (i.e. having HxW dimension).
The results will be concatenated after the end.
all_cls_scores (Tensor) – Classification scores for all decoder
layers with shape (num_decoder, batch_size, num_queries,
cls_out_channels). Note cls_out_channels should include
background.
all_mask_preds (Tensor) – Mask scores for all decoder layers with
shape (num_decoder, batch_size, num_queries, h, w).
batch_gt_instances (list[InstanceData]) – Each contains
labels and masks.
batch_img_metas (list[dict]) – List of image meta information.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes labels, each being the
ground truth labels of each bbox, with shape (num_gts, ),
and masks, each being the ground truth masks of each instance
of an image, with shape (num_gts, h, w).
gt_semantic_seg (list[Optional[PixelData]]) – Ground truth of
semantic segmentation, each with the shape (1, h, w).
[0, num_thing_class - 1] means things,
[num_thing_class, num_class-1] means stuff,
255 means VOID. It’s None when training instance segmentation.
Returns:
list[InstanceData]: Each contains the following keys.
labels (Tensor): Ground truth class indices for an image, with shape (n, ), where n is the sum of the number of stuff types and the number of instances in an image.
masks (Tensor): Ground truth mask for an image, with shape (n, h, w).
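The label ranges used for the semantic segmentation ground truth can be summarised with a small helper. This is only a sketch of the convention stated above (things first, then stuff, 255 for VOID); the function name and the returned strings are made up for illustration.

```python
def label_kind(
    label: int,
    num_things_classes: int,
    num_stuff_classes: int,
    void_label: int = 255,
) -> str:
    """Classify a semantic label id as 'thing', 'stuff' or 'void'."""
    num_classes = num_things_classes + num_stuff_classes
    if label == void_label:
        return 'void'
    if 0 <= label < num_things_classes:
        return 'thing'
    if num_things_classes <= label < num_classes:
        return 'stuff'
    raise ValueError(f'label {label} is outside the known ranges')
```

With 80 thing classes and 53 stuff classes, ids 0 to 79 are things, ids 80 to 132 are stuff, and 255 is VOID.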
It is quite similar to the FCOS head, except for the searched structure of
classification branch and bbox regression branch, where a structure of
“dconv3x3, conv3x3, dconv3x3, conv1x1” is utilized instead.
Parameters:
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
strides (Sequence[int] or Sequence[Tuple[int, int]]) – Strides of points
in multiple feature levels. Defaults to (4, 8, 16, 32, 64).
regress_ranges (Sequence[Tuple[int, int]]) – Regress range of multiple
level points.
center_sampling (bool) – If true, use center sampling.
Defaults to False.
center_sample_radius (float) – Radius of center sampling.
Defaults to 1.5.
norm_on_bbox (bool) – If true, normalize the regression targets with
FPN strides. Defaults to False.
conv_bias (bool or str) – If specified as auto, it will be decided by
the norm_cfg. Bias of conv will be set as True if norm_cfg is
None, otherwise False. Defaults to “auto”.
loss_cls (ConfigDict or dict) – Config of classification loss.
loss_bbox (ConfigDict or dict) – Config of localization loss.
loss_centerness (ConfigDict or dict) – Config of centerness
loss.
norm_cfg (ConfigDict or dict) – Dictionary to construct and
config norm layer. Defaults to
dict(type='GN', num_groups=32, requires_grad=True).
init_cfg (ConfigDict or dict or list[ConfigDict or dict], optional) – Initialization config dict.
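The "auto" rule for conv_bias described above fits in a few lines. The helper below is a sketch of that rule only; its name is hypothetical and it does not build any layer.

```python
from typing import Optional, Union


def resolve_conv_bias(conv_bias: Union[bool, str], norm_cfg: Optional[dict]) -> bool:
    """Return the effective bias flag for a conv layer."""
    if conv_bias == 'auto':
        # A bias is redundant when a norm layer follows the convolution.
        return norm_cfg is None
    return bool(conv_bias)
```

So with the GN norm config, "auto" resolves to False, and with norm_cfg=None it resolves to True; an explicit boolean is passed through unchanged.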
topk (int) – Select topk samples with smallest loss in
each level.
score_voting (bool) – Whether to use score voting in post-process.
covariance_type (str) – String describing the type of covariance
parameters to be used in sklearn.mixture.GaussianMixture.
It must be one of:
‘full’: each component has its own general covariance matrix.
‘tied’: all components share the same general covariance matrix.
‘diag’: each component has its own diagonal covariance matrix.
‘spherical’: each component has its own single variance.
Defaults to ‘diag’. From ‘full’ to ‘spherical’, GMM fitting
becomes faster, but detection performance may be affected. For most
cases, ‘diag’ should be a good choice.
This method is almost the same as AnchorHead.get_targets(). It directly
returns the results from _get_targets_single instead of mapping them to
levels with the images_to_levels function.
Parameters:
anchor_list (list[list[Tensor]]) – Multi level anchors of each
image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_anchors, 4).
valid_flag_list (list[list[Tensor]]) – Multi level valid flags of
each image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_anchors, )
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
unmap_outputs (bool) – Whether to map outputs back to the original
set of anchors. Defaults to True.
Returns:
Usually returns a tuple containing learning targets.
labels (list[Tensor]): Labels of all anchors, each with
shape (num_anchors,).
label_weights (list[Tensor]): Label weights of all anchors,
each with shape (num_anchors,).
bbox_targets (list[Tensor]): BBox targets of all anchors,
each with shape (num_anchors, 4).
bbox_weights (list[Tensor]): BBox weights of all anchors,
each with shape (num_anchors, 4).
pos_inds (list[Tensor]): Indices of all positive
samples among all anchors.
gt_inds (list[Tensor]): Ground-truth indices of all positive
samples among all anchors.
It separates a GMM distribution of candidate samples into three
parts: 0, 1, and uncertain areas. You can implement other
separation schemes by overriding this function.
Parameters:
gmm_assignment (Tensor) – The prediction of GMM which is of shape
(num_samples,). The 0/1 value indicates the distribution
that each sample comes from.
scores (Tensor) – The probability of each sample coming from the
fitted GMM distribution. The tensor is of shape (num_samples,).
pos_inds_gmm (Tensor) – All the indexes of samples which are used
to fit the GMM model. The tensor is of shape (num_samples,).
Returns:
The indices of positive and ignored samples.
pos_inds_temp (Tensor): Indices of positive samples.
ignore_inds_temp (Tensor): Indices of ignore samples.
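To show what a separation scheme might look like, the sketch below implements one simple possibility in plain Python. It assumes that the candidate samples are ordered by ascending loss and that GMM component 0 is the low-loss, positive-like component; it keeps every component-0 sample up to the most confident one and ignores nothing. It is an illustration of the interface, not a statement of the library's default behaviour, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def separate_by_gmm(
    gmm_assignment: Sequence[int],
    scores: Sequence[float],
    pos_inds_gmm: Sequence[int],
) -> Tuple[List[int], List[int]]:
    """Return (positive indices, ignored indices) for one ground-truth box."""
    # Positions of the samples assigned to component 0.
    fg = [k for k, a in enumerate(gmm_assignment) if a == 0]
    if not fg:
        return [], []
    # Position, within the component-0 subset, of the most confident sample.
    best = max(range(len(fg)), key=lambda k: scores[fg[k]])
    # Keep every component-0 sample up to and including that one.
    pos_inds_temp = [pos_inds_gmm[k] for k in fg[:best + 1]]
    ignore_inds_temp: List[int] = []
    return pos_inds_temp, ignore_inds_temp
```

A different scheme could, for instance, place low-confidence samples into the ignored list instead of leaving it empty.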
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level
with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
iou_preds (list[Tensor]) – IoU predictions for each scale
level with shape (N, num_anchors * 1, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
cls_scores (list[Tensor]) – Box scores for each scale level
with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Returns:
A loss dict comprising the classification loss, the regression loss and
the CARL loss.
num_classes (int) – Number of categories excluding the background
category.
in_channels (Sequence[int]) – Number of channels in the input feature
map.
stacked_convs (int) – Number of conv layers in cls and reg tower.
Defaults to 0.
feat_channels (int) – Number of hidden channels when stacked_convs
> 0. Defaults to 256.
use_depthwise (bool) – Whether to use DepthwiseSeparableConv.
Defaults to False.
conv_cfg (ConfigDict or dict, Optional) – Dictionary to construct
and config conv layer. Defaults to None.
norm_cfg (ConfigDict or dict, Optional) – Dictionary to construct
and config norm layer. Defaults to None.
act_cfg (ConfigDict or dict, Optional) – Dictionary to construct
and config activation layer. Defaults to None.
anchor_generator (ConfigDict or dict) – Config dict for anchor
generator.
bbox_coder (ConfigDict or dict) – Config of bounding box coder.
reg_decoded_bbox (bool) – If true, the regression loss would be
applied directly on decoded bounding boxes, converting both
the predicted boxes and regression targets to absolute
coordinates format. Defaults to False. It should be True when
using IoULoss, GIoULoss, or DIoULoss in the bbox head.
train_cfg (ConfigDict or dict, Optional) – Training config of
anchor head.
test_cfg (ConfigDict or dict, Optional) – Testing config of
anchor head.
init_cfg (ConfigDict or dict or list[ConfigDict or dict], Optional) – Initialization config dict.
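When reg_decoded_bbox is True, the predicted deltas are decoded into absolute boxes before the regression loss is applied. The sketch below shows one common delta parameterisation, (dx, dy, dw, dh) relative to an (x1, y1, x2, y2) anchor with zero means and unit standard deviations. This is an assumption for illustration; the actual decoding depends on the configured bbox_coder.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]


def decode_delta(anchor: Box, delta: Box) -> Box:
    """Apply (dx, dy, dw, dh) deltas to an (x1, y1, x2, y2) anchor."""
    ax1, ay1, ax2, ay2 = anchor
    dx, dy, dw, dh = delta
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah
    # Shift the centre in units of the anchor size.
    cx, cy = acx + dx * aw, acy + dy * ah
    # Scale width and height exponentially.
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
```

IoU-based losses compare boxes in absolute coordinates, which is why decoding first is recommended when IoULoss, GIoULoss or DIoULoss is used.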
cls_scores (list[Tensor]) – Box scores for each scale level
with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Returns:
A dictionary of loss
components. The dict has the components below:
loss_cls (list[Tensor]): A list containing each feature map classification loss.
loss_bbox (list[Tensor]): A list containing each feature map regression loss.
cls_score (Tensor): Cls scores for a single scale level, the channels number is num_base_priors * num_classes.
bbox_pred (Tensor): Box energies / deltas for a single scale level, the channels number is num_base_priors * 4.
Compute regression and classification targets for anchors in
multiple images.
Parameters:
cls_scores (Tensor) – Classification predictions of images,
a 3D-Tensor with shape [num_imgs, num_priors, num_classes].
bbox_preds (Tensor) – Decoded bbox predictions of images,
a 3D-Tensor with shape [num_imgs, num_priors, 4] in [tl_x,
tl_y, br_x, br_y] format.
anchor_list (list[list[Tensor]]) – Multi level anchors of each
image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_anchors, 4).
valid_flag_list (list[list[Tensor]]) – Multi level valid flags of
each image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_anchors, )
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
unmap_outputs (bool) – Whether to map outputs back to the original
set of anchors. Defaults to True.
Returns:
A tuple containing learning targets:
anchors_list (list[list[Tensor]]): Anchors of each level.
labels_list (list[Tensor]): Labels of each level.
label_weights_list (list[Tensor]): Label weights of each
level.
bbox_targets_list (list[Tensor]): BBox targets of each level.
assign_metrics_list (list[Tensor]): alignment metrics of each
level.
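To make the unmap_outputs option more concrete, here is a minimal pure-Python sketch of what "mapping outputs back to the original set of anchors" could look like. Targets are computed only for valid anchors and then scattered back, with a fill value for the anchors that were skipped. The function name unmap_targets and the fill default are assumptions for illustration, not the library's implementation.

```python
from typing import List, Sequence


def unmap_targets(values: Sequence[float],
                  valid_flags: Sequence[bool],
                  fill: float = 0.0) -> List[float]:
    """Scatter per-valid-anchor values back onto the full anchor set."""
    if len(values) != sum(valid_flags):
        raise ValueError("need exactly one value per valid anchor")
    it = iter(values)
    # Valid positions consume the next value; invalid ones get the fill value.
    return [next(it) if flag else fill for flag in valid_flags]


full = unmap_targets([3.0, 7.0], [True, False, False, True])
print(full)  # [3.0, 0.0, 0.0, 7.0]
```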
cls_scores (list[Tensor]) – Box scores for each scale level,
each with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Decoded box for each scale
level with shape (N, num_anchors * 4, H, W) in
[tl_x, tl_y, br_x, br_y] format.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
cls_scores (list[Tensor]) – Box scores for each scale level,
each with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Decoded box for each scale
level with shape (N, num_anchors * 4, H, W) in
[tl_x, tl_y, br_x, br_y] format.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Transform a batch of output features extracted from the head into
bbox results.
Note: When score_factors is not None, the cls_scores are
usually multiplied by them to obtain the real scores used in NMS,
e.g. the centerness in FCOS or the IoU branch in ATSS.
Parameters:
cls_scores (list[Tensor]) – Classification scores for all
scale levels, each is a 4D-tensor, has shape
(batch_size, num_priors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for all
scale levels, each is a 4D-tensor, has shape
(batch_size, num_priors * 4, H, W).
kernel_preds (list[Tensor]) – Kernel predictions of dynamic
convs for all scale levels, each is a 4D-tensor, has shape
(batch_size, num_params, H, W).
mask_feat (Tensor) – Mask prototype features extracted from the
mask head, has shape (batch_size, num_prototypes, H, W).
score_factors (list[Tensor], optional) – Score factor for
all scale level, each is a 4D-tensor, has shape
(batch_size, num_priors * 1, H, W). Defaults to None.
batch_img_metas (list[dict], Optional) – Batch image meta info.
Defaults to None.
cfg (ConfigDict, optional) – Test / postprocessing
configuration, if None, test_cfg would be used.
Defaults to None.
rescale (bool) – If True, return boxes in original image space.
Defaults to False.
with_nms (bool) – If True, do nms before return boxes.
Defaults to True.
Returns:
Object detection results of each image
after post-processing. Each item usually contains the following keys:
scores (Tensor): Classification scores, with shape
(num_instances, ).
labels (Tensor): Labels of bboxes, with shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4);
the last dimension is arranged as (x1, y1, x2, y2).
masks (Tensor): Has shape (num_instances, h, w).
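The note on score_factors can be illustrated with a small sketch. It assumes the simplest combination, a plain element-wise product of classification score and score factor; individual detectors may combine the two differently. The name fuse_scores is hypothetical.

```python
from typing import List


def fuse_scores(cls_scores: List[float], score_factors: List[float]) -> List[float]:
    """Element-wise product of class scores and per-prior score factors."""
    if len(cls_scores) != len(score_factors):
        raise ValueError("scores and factors must align one-to-one")
    return [s * f for s, f in zip(cls_scores, score_factors)]


# A confident class score is damped by a low quality factor before NMS.
fused = fuse_scores([0.5, 0.25], [0.5, 1.0])
print(fused)  # [0.25, 0.25]
```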
Compute corresponding GT box and classification targets for
proposals.
Parameters:
proposals_list (list[Tensor]) – Multi level points/bboxes of each
image.
valid_flag_list (list[Tensor]) – Multi level valid flags of each
image.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
stage (str) – ‘init’ or ‘refine’. Generate target for init stage or
refine stage.
unmap_outputs (bool) – Whether to map outputs back to the original
set of anchors.
return_sampling_results (bool) – Whether to return the sampling
results. Defaults to False.
Returns:
labels_list (list[Tensor]): Labels of each level.
label_weights_list (list[Tensor]): Label weights of each
level.
bbox_gt_list (list[Tensor]): Ground truth bbox of each level.
proposals_list (list[Tensor]): Proposals(points/bboxes) of
each level.
proposal_weights_list (list[Tensor]): Proposal weights of
each level.
avg_factor (int): Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the number
of positive priors.
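The two avg_factor conventions described above can be summarised in a few lines. This is a sketch of the stated rule of thumb only; the function name and the uses_sampling flag are hypothetical.

```python
def average_factor(num_pos: int, num_neg: int, uses_sampling: bool) -> int:
    """Return the divisor used to average the loss.

    With a sampling method both positives and negatives count;
    with a pseudo sampler only the positives do.
    """
    if uses_sampling:
        return num_pos + num_neg
    return num_pos


print(average_factor(12, 244, uses_sampling=True))   # 256
print(average_factor(12, 244, uses_sampling=False))  # 12
```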
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each is a 4D-tensor, of shape (batch_size, num_classes, h, w).
pts_preds_init (list[Tensor]) – Points for each scale level, each is
a 3D-tensor, of shape (batch_size, h_i * w_i, num_points * 2).
pts_preds_refine (list[Tensor]) – Points refined for each scale
level, each is a 3D-tensor, of shape
(batch_size, h_i * w_i, num_points * 2).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
pts (Tensor) – The input point sets (fields); each point
set (field) is represented as 2n scalars.
y_first (bool) – If y_first=True, the point set is
represented as [y1, x1, y2, x2 … yn, xn], otherwise
the point set is represented as
[x1, y1, x2, y2 … xn, yn]. Defaults to True.
Returns:
Each point set converted to a bbox [x1, y1, x2, y2].
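As a rough sketch of the conversion, the snippet below turns one flat point set into a box by taking the minimum and maximum coordinates. Min-max is only one possible conversion strategy and is assumed here for illustration; the function name is hypothetical.

```python
from typing import List, Sequence


def points_to_bbox_minmax(pts: Sequence[float], y_first: bool = True) -> List[float]:
    """Convert a flat point set of 2n scalars to [x1, y1, x2, y2]."""
    if len(pts) == 0 or len(pts) % 2 != 0:
        raise ValueError("a point set needs a positive, even number of scalars")
    first, second = pts[0::2], pts[1::2]
    # With y_first the layout is [y1, x1, y2, x2, ...]; otherwise [x1, y1, ...].
    ys, xs = (first, second) if y_first else (second, first)
    return [min(xs), min(ys), max(xs), max(ys)]


print(points_to_bbox_minmax([1, 2, 3, 4, 5, 0], y_first=True))   # [0, 1, 4, 5]
print(points_to_bbox_minmax([1, 2, 3, 4, 5, 0], y_first=False))  # [1, 0, 5, 4]
```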
The head contains two subnetworks. The first classifies anchor boxes and
the second regresses deltas for the anchors.
Example
>>> import torch
>>> from mmdet.models.dense_heads import RetinaHead
>>> self = RetinaHead(11, 7)
>>> x = torch.rand(1, 7, 32, 32)
>>> cls_score, bbox_pred = self.forward_single(x)
>>> # Each anchor predicts a score for each class except background
>>> cls_per_anchor = cls_score.shape[1] / self.num_anchors
>>> box_per_anchor = bbox_pred.shape[1] / self.num_anchors
>>> assert cls_per_anchor == (self.num_classes)
>>> assert box_per_anchor == 4
In RetinaHead, conv/norm layers are shared across different FPN levels,
while in RetinaSepBNHead, conv layers are shared across different FPN
levels, but BN layers are separated.
in_channels (int) – Number of channels in the input feature map.
stacked_convs (int) – Number of Convs for classification and
regression branches. Defaults to 4.
feat_channels (int) – Number of hidden channels. Defaults to 256.
approx_anchor_generator (ConfigDict or dict) – Config dict for
approx generator.
square_anchor_generator (ConfigDict or dict) – Config dict for
square generator.
conv_cfg (ConfigDict or dict, optional) – Config dict for
ConvModule. Defaults to None.
norm_cfg (ConfigDict or dict, optional) – Config dict for
Norm Layer. Defaults to None.
bbox_coder (ConfigDict or dict) – Config dict for bbox coder.
reg_decoded_bbox (bool) – If true, the regression loss would be
applied directly on decoded bounding boxes, converting both
the predicted boxes and regression targets to absolute
coordinates format. Defaults to False. It should be True when
using IoULoss, GIoULoss, or DIoULoss in the bbox head.
train_cfg (ConfigDict or dict, optional) – Training config of
SABLRetinaHead.
test_cfg (ConfigDict or dict, optional) – Testing config of
SABLRetinaHead.
loss_cls (ConfigDict or dict) – Config of classification loss.
loss_bbox_cls (ConfigDict or dict) – Config of classification
loss for bbox branch.
loss_bbox_reg (ConfigDict or dict) – Config of regression loss
for bbox branch.
init_cfg (ConfigDict or dict or list[ConfigDict or dict], optional) – Initialization config dict.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
approx_list (list[list[Tensor]]) – Multi level approxs of each
image.
inside_flag_list (list[list[Tensor]]) – Multi level inside flags of
each image.
square_list (list[list[Tensor]]) – Multi level squares of each
image.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
unmap_outputs (bool) – Whether to map outputs back to the original
set of anchors. Defaults to True.
Returns:
A tuple containing learning targets:
labels_list (list[Tensor]): Labels of each level.
label_weights_list (list[Tensor]): Label weights of each level.
bbox_cls_targets_list (list[Tensor]): BBox cls targets of each level.
bbox_cls_weights_list (list[Tensor]): BBox cls weights of each level.
bbox_reg_targets_list (list[Tensor]): BBox reg targets of each level.
bbox_reg_weights_list (list[Tensor]): BBox reg weights of each level.
num_total_pos (int): Number of positive samples in all images.
num_total_neg (int): Number of negative samples in all images.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Transform a batch of output features extracted from the head into
bbox results.
Note: When score_factors is not None, the cls_scores are
usually multiplied by them to obtain the real scores used in NMS,
e.g. the centerness in FCOS or the IoU branch in ATSS.
Parameters:
cls_scores (list[Tensor]) – Classification scores for all
scale levels, each is a 4D-tensor, has shape
(batch_size, num_priors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for all
scale levels, each is a 4D-tensor, has shape
(batch_size, num_priors * 4, H, W).
batch_img_metas (list[dict], Optional) – Batch image meta info.
cfg (ConfigDict, optional) – Test / postprocessing
configuration, if None, test_cfg would be used.
Defaults to None.
rescale (bool) – If True, return boxes in original image space.
Defaults to False.
with_nms (bool) – If True, do nms before return boxes.
Defaults to True.
Returns:
Object detection results of each image
after post-processing. Each item usually contains the following keys:
scores (Tensor): Classification scores, with shape
(num_instances, ).
labels (Tensor): Labels of bboxes, with shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4);
the last dimension is arranged as (x1, y1, x2, y2).
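The rescale flag returns boxes in the original image space. A minimal sketch of that step follows, assuming the image meta records a (w_scale, h_scale) pair describing how the image was resized before inference. That layout and the function name are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def rescale_boxes(boxes: Sequence[Sequence[float]],
                  scale_factor: Tuple[float, float]) -> List[List[float]]:
    """Map (x1, y1, x2, y2) boxes from the resized image back to the original."""
    w_scale, h_scale = scale_factor
    # x coordinates undo the horizontal scaling, y coordinates the vertical one.
    return [[x1 / w_scale, y1 / h_scale, x2 / w_scale, y2 / h_scale]
            for x1, y1, x2, y2 in boxes]


print(rescale_boxes([[10.0, 20.0, 30.0, 40.0]], (2.0, 0.5)))
# [[5.0, 40.0, 15.0, 80.0]]
```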
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
feat_channels (int) – Number of hidden channels. Used in child classes.
Defaults to 256.
stacked_convs (int) – Number of stacking convs of the head.
Defaults to 4.
strides (tuple) – Downsample factor of each feature map.
scale_ranges (tuple[tuple[int, int]]) – Area range of multiple
level masks, in the format [(min1, max1), (min2, max2), …].
A range of (16, 64) means the area lies between 16 and 64.
pos_scale (float) – Constant scale factor to control the center region.
num_grids (list[int]) – Divide the image into uniform grids; each
feature map has a different grid value. The number of output
channels is grid ** 2. Defaults to [40, 36, 24, 16, 12].
cls_down_index (int) – The index of downsample operation in
classification branch. Defaults to 0.
loss_mask (dict) – Config of mask loss.
loss_cls (dict) – Config of classification loss.
norm_cfg (dict) – Dictionary to construct and config norm layer.
Defaults to norm_cfg=dict(type='GN', num_groups=32,
requires_grad=True).
train_cfg (dict) – Training config of head.
test_cfg (dict) – Testing config of head.
init_cfg (dict or list[dict], optional) – Initialization config dict.
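The relation between num_grids and the number of output channels (grid ** 2) can be checked with a short sketch. The helper name is hypothetical; the grid values are the documented defaults.

```python
from typing import List, Sequence


def mask_channels_per_level(num_grids: Sequence[int]) -> List[int]:
    """Each level predicts one mask channel per grid cell, i.e. grid ** 2."""
    return [grid ** 2 for grid in num_grids]


print(mask_channels_per_level([40, 36, 24, 16, 12]))
# [1600, 1296, 576, 256, 144]
```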
mask_feature_head (dict) – Config of SOLOv2MaskFeatHead.
dynamic_conv_size (int) – Dynamic Conv kernel size. Defaults to 1.
dcn_cfg (dict) – Dcn conv configurations in kernel_convs and cls_conv.
Defaults to None.
dcn_apply_to_all_conv (bool) – Whether to use dcn in every layer of
kernel_convs and cls_convs, or only the last layer. It shall be set
True for the normal version of SOLOv2 and False for the
light-weight version. Defaults to True.
init_cfg (dict or list[dict], optional) – Initialization config dict.
x (tuple[Tensor]) – Features from the upstream network, each is
a 4D-tensor.
Returns:
A tuple of classification scores, mask prediction,
and mask features.
mlvl_kernel_preds (list[Tensor]): Multi-level dynamic kernel
prediction. The kernel is used to generate instance
segmentation masks by dynamic convolution. Each element in
the list has shape
(batch_size, kernel_out_channels, num_grids, num_grids).
mlvl_cls_preds (list[Tensor]): Multi-level scores. Each
element in the list has shape
(batch_size, num_classes, num_grids, num_grids).
mask_feats (Tensor): Unified mask feature map used to
generate instance segmentation masks by dynamic convolution.
Has shape (batch_size, mask_out_channels, h, w).
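To show how a predicted kernel and the unified mask features combine, here is a pure-Python sketch of a dynamic convolution for the simplest case, a 1x1 kernel for a single instance. Nested lists stand in for tensors, and the function name is hypothetical. With a 1x1 kernel, each output pixel is just a weighted sum over the feature channels.

```python
from typing import List


def dynamic_conv_1x1(mask_feats: List[List[List[float]]],
                     kernel: List[float]) -> List[List[float]]:
    """Apply one predicted 1x1 kernel to mask features laid out as [C][H][W]."""
    if len(mask_feats) != len(kernel):
        raise ValueError("kernel needs one weight per feature channel")
    height, width = len(mask_feats[0]), len(mask_feats[0][0])
    return [[sum(kernel[c] * mask_feats[c][y][x] for c in range(len(kernel)))
             for x in range(width)]
            for y in range(height)]


feats = [[[1.0, 2.0]], [[3.0, 4.0]]]  # C=2, H=1, W=2
mask_logits = dynamic_conv_1x1(feats, [1.0, 0.5])
print(mask_logits)  # [[2.5, 4.0]]
```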
Calculate the loss based on the features extracted by the mask head.
Parameters:
mlvl_kernel_preds (list[Tensor]) – Multi-level dynamic kernel
prediction. The kernel is used to generate instance
segmentation masks by dynamic convolution. Each element in the
list has shape
(batch_size, kernel_out_channels, num_grids, num_grids).
mlvl_cls_preds (list[Tensor]) – Multi-level scores. Each element
in the list has shape
(batch_size, num_classes, num_grids, num_grids).
mask_feats (Tensor) – Unified mask feature map used to generate
instance segmentation masks by dynamic convolution. Has shape
(batch_size, mask_out_channels, h, w).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes, masks,
and labels attributes.
batch_img_metas (list[dict]) – Meta information of multiple images.
Transform a batch of output features extracted from the head into
mask results.
Parameters:
mlvl_kernel_preds (list[Tensor]) – Multi-level dynamic kernel
prediction. The kernel is used to generate instance
segmentation masks by dynamic convolution. Each element in the
list has shape
(batch_size, kernel_out_channels, num_grids, num_grids).
mlvl_cls_scores (list[Tensor]) – Multi-level scores. Each element
in the list has shape
(batch_size, num_classes, num_grids, num_grids).
mask_feats (Tensor) – Unified mask feature map used to generate
instance segmentation masks by dynamic convolution. Has shape
(batch_size, mask_out_channels, h, w).
batch_img_metas (list[dict]) – Meta information of all images.
Returns:
Processed results of multiple
images. Each InstanceData usually contains
the following keys:
scores (Tensor): Classification scores, has shape
(num_instances,).
labels (Tensor): Has shape (num_instances,).
masks (Tensor): Processed mask results, has
shape (num_instances, h, w).
num_classes (int) – Number of categories excluding the background
category.
in_channels (Sequence[int]) – Number of channels in the input feature
map.
stacked_convs (int) – Number of conv layers in cls and reg tower.
Defaults to 0.
feat_channels (int) – Number of hidden channels when stacked_convs
> 0. Defaults to 256.
use_depthwise (bool) – Whether to use DepthwiseSeparableConv.
Defaults to False.
conv_cfg (ConfigDict or dict, Optional) – Dictionary to construct
and config conv layer. Defaults to None.
norm_cfg (ConfigDict or dict, Optional) – Dictionary to construct
and config norm layer. Defaults to None.
act_cfg (ConfigDict or dict, Optional) – Dictionary to construct
and config activation layer. Defaults to None.
anchor_generator (ConfigDict or dict) – Config dict for anchor
generator.
bbox_coder (ConfigDict or dict) – Config of bounding box coder.
reg_decoded_bbox (bool) – If true, the regression loss would be
applied directly on decoded bounding boxes, converting both
the predicted boxes and regression targets to absolute
coordinates format. Defaults to False. It should be True when
using IoULoss, GIoULoss, or DIoULoss in the bbox head.
train_cfg (ConfigDict or dict, Optional) – Training config of
anchor head.
test_cfg (ConfigDict or dict, Optional) – Testing config of
anchor head.
init_cfg (ConfigDict or dict or list[ConfigDict or dict], Optional) – Initialization config dict.
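The reg_decoded_bbox option refers to converting predicted deltas into absolute box coordinates before the loss is applied. The sketch below uses the common centre/size delta parameterisation without any mean or std normalisation. That choice and the function name are assumptions, and the configured bbox_coder may differ.

```python
import math
from typing import Sequence, Tuple


def decode_deltas(anchor: Sequence[float],
                  deltas: Sequence[float]) -> Tuple[float, float, float, float]:
    """Decode (dx, dy, dw, dh) relative to an (x1, y1, x2, y2) anchor."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the centre proportionally to the anchor size, scale size by exp().
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


# Zero deltas reproduce the anchor itself.
print(decode_deltas((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 0.0, 0.0)))
```

An IoU-based loss compares boxes geometrically, which is why it needs such decoded coordinates rather than raw deltas.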
cls_scores (list[Tensor]) – Box scores for each scale level,
each with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level, each with shape (N, num_anchors * 4, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Returns:
A dictionary of loss components. The dict
has the components below:
loss_cls (list[Tensor]): A list containing each feature map classification loss.
loss_bbox (list[Tensor]): A list containing each feature map regression loss.
cls_score (Tensor) – Box scores for each image,
with shape (num_total_anchors, num_classes).
bbox_pred (Tensor) – Box energies / deltas for each image,
with shape (num_total_anchors, 4).
anchors (Tensor) – Box reference for each scale level with shape
(num_total_anchors, 4).
labels (Tensor) – Labels of each anchor with shape
(num_total_anchors,).
label_weights (Tensor) – Label weights of each anchor with shape
(num_total_anchors,).
bbox_targets (Tensor) – BBox regression targets of each anchor with
shape (num_total_anchors, 4).
bbox_weights (Tensor) – BBox regression loss weights of each anchor
with shape (num_total_anchors, 4).
avg_factor (int) – Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the number
of positive priors.
Returns:
A tuple of cls loss and bbox loss of one
feature map.
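How label_weights and avg_factor enter the final value can be sketched as a weighted sum divided by the average factor. This is only an illustration of the averaging step with a hypothetical function name; the per-anchor loss values are taken as given.

```python
from typing import Sequence


def weighted_mean_loss(losses: Sequence[float],
                       weights: Sequence[float],
                       avg_factor: float) -> float:
    """Weight each per-anchor loss, sum, and divide by avg_factor."""
    if avg_factor <= 0:
        raise ValueError("avg_factor must be positive")
    return sum(l * w for l, w in zip(losses, weights)) / avg_factor


# The second anchor has weight 0 and does not contribute.
print(weighted_mean_loss([2.0, 4.0, 6.0], [1.0, 0.0, 1.0], avg_factor=2))  # 4.0
```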
Compute regression and classification targets for anchors.
Parameters:
anchor_list (list[list[Tensor]]) – Multi level anchors of each
image.
valid_flag_list (list[list[Tensor]]) – Multi level valid flags of
each image.
featmap_sizes (list[Tuple[int, int]]) – Feature map size each level.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
return_sampling_results (bool) – Whether to return the sampling
results. Defaults to False.
Returns:
labels_list (list[Tensor]): Labels of each level.
label_weights_list (list[Tensor]): Label weights of each
level.
bbox_targets_list (list[Tensor]): BBox targets of each level.
bbox_weights_list (list[Tensor]): BBox weights of each level.
avg_factor (int): Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the
number of positive priors.
anchor_list (list[list[Tensor]]) – Multi level anchors of each
image.
valid_flag_list (list[list[Tensor]]) – Multi level valid flags of
each image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_anchors, )
cls_scores (list[Tensor]) – Box scores for each scale level,
each with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level, each with shape (N, num_anchors * 4, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Compute regression and classification targets for anchors when using
RegionAssigner.
Parameters:
anchor_list (list[list[Tensor]]) – Multi level anchors of each
image.
valid_flag_list (list[list[Tensor]]) – Multi level valid flags of
each image.
featmap_sizes (list[Tuple[int, int]]) – Feature map size each level.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Returns:
labels_list (list[Tensor]): Labels of each level.
label_weights_list (list[Tensor]): Label weights of each
level.
bbox_targets_list (list[Tensor]): BBox targets of each level.
bbox_weights_list (list[Tensor]): BBox weights of each level.
avg_factor (int): Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the
number of positive priors.
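The inputs above are organised per image, while the returned lists are organised per level. The following sketch shows one way such a regrouping can be done once the number of anchors on each level is known. The function name and the flat per-image layout are assumptions for illustration.

```python
from typing import List, Sequence


def split_by_level(per_image_targets: Sequence[Sequence[int]],
                   num_level_anchors: Sequence[int]) -> List[List[List[int]]]:
    """Regroup flat per-image targets into one entry per feature level."""
    levels = []
    start = 0
    for count in num_level_anchors:
        # Take the same slice from every image to form this level's batch.
        levels.append([list(img[start:start + count]) for img in per_image_targets])
        start += count
    return levels


labels = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]  # two images, five anchors each
print(split_by_level(labels, [3, 2]))
# [[[0, 1, 2], [5, 6, 7]], [[3, 4], [8, 9]]]
```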
TOOD uses Task-aligned head (T-head) and is optimized by Task Alignment
Learning (TAL).
Parameters:
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
num_dcn (int) – Number of deformable convolution in the head.
Defaults to 0.
anchor_type (str) – If set to anchor_free, the head will use centers
to regress bboxes. If set to anchor_based, the head will
regress bboxes based on anchors. Defaults to anchor_free.
initial_loss_cls (ConfigDict or dict) – Config of initial loss.
Compute regression and classification targets for anchors in
multiple images.
Parameters:
cls_scores (list[list[Tensor]]) – Classification predictions of
images, a 3D-Tensor with shape [num_imgs, num_priors,
num_classes].
bbox_preds (list[list[Tensor]]) – Decoded bboxes predictions of one
image, a 3D-Tensor with shape [num_imgs, num_priors, 4] in
[tl_x, tl_y, br_x, br_y] format.
anchor_list (list[list[Tensor]]) – Multi level anchors of each
image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_anchors, 4).
valid_flag_list (list[list[Tensor]]) – Multi level valid flags of
each image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_anchors, )
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
unmap_outputs (bool) – Whether to map outputs back to the original
set of anchors.
Returns:
A tuple containing learning targets:
anchors_list (list[list[Tensor]]): Anchors of each level.
labels_list (list[Tensor]): Labels of each level.
label_weights_list (list[Tensor]): Label weights of each
level.
bbox_targets_list (list[Tensor]): BBox targets of each level.
norm_alignment_metrics_list (list[Tensor]): Normalized
alignment metrics of each level.
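For context, the TOOD paper defines the alignment metric as t = s^alpha * u^beta, where s is the classification score and u the IoU between the predicted box and the ground truth. The sketch below computes that raw metric before any normalisation. The default exponents are assumed values used only for illustration, and the function name is hypothetical.

```python
def alignment_metric(cls_score: float, iou: float,
                     alpha: float = 1.0, beta: float = 6.0) -> float:
    """Raw task-alignment metric t = s**alpha * u**beta."""
    return (cls_score ** alpha) * (iou ** beta)


# A well-localised prediction scores far higher than a poorly localised one.
print(alignment_metric(0.5, 0.5))  # 0.0078125
print(alignment_metric(0.5, 1.0))  # 0.5
```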
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
each with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Decoded box for each scale
level with shape (N, num_anchors * 4, H, W) in
[tl_x, tl_y, br_x, br_y] format.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
The VFNet predicts IoU-aware classification scores which mix the
object presence confidence and object localization accuracy as the
detection score. It is built on the FCOS architecture and uses ATSS
for defining positive/negative training examples. The VFNet is trained
with Varifocal Loss and employs star-shaped deformable convolution to
extract features for a bbox.
Parameters:
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
regress_ranges (Sequence[Tuple[int, int]]) – Regress range of multiple
level points.
center_sampling (bool) – If true, use center sampling. Defaults to False.
center_sample_radius (float) – Radius of center sampling. Defaults to 1.5.
sync_num_pos (bool) – If true, synchronize the number of positive
examples across GPUs. Defaults to True.
gradient_mul (float) – The multiplier to gradients from bbox refinement
and recognition. Defaults to 0.1.
bbox_norm_type (str) – The bbox normalization type, ‘reg_denom’ or
‘stride’. Defaults to ‘reg_denom’.
loss_cls_fl (ConfigDict or dict) – Config of focal loss.
use_vfl (bool) – If true, use varifocal loss for training.
Defaults to True.
loss_cls (ConfigDict or dict) – Config of varifocal loss.
loss_bbox (ConfigDict or dict) – Config of localization loss,
GIoU Loss.
loss_bbox_refine (ConfigDict or dict) – Config of localization
refinement loss, GIoU Loss.
norm_cfg (ConfigDict or dict) – Dictionary to construct and
config norm layer. Defaults to norm_cfg=dict(type='GN',
num_groups=32, requires_grad=True).
use_atss (bool) – If true, use ATSS to define positive/negative
examples. Defaults to True.
anchor_generator (ConfigDict or dict) – Config of anchor
generator for ATSS.
init_cfg (ConfigDict or dict or list[dict] or list[ConfigDict]) – Initialization config dict.
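As background for the use_vfl and loss_cls options, the Varifocal Loss from the VarifocalNet paper treats positives and negatives asymmetrically. For a positive with IoU target q > 0 it is a binary cross-entropy weighted by q; for a negative (q = 0) it is down-weighted by alpha * p ** gamma. The scalar sketch below follows that definition. The alpha and gamma defaults are illustrative assumptions, and the function is not the library's implementation.

```python
import math


def varifocal_loss(p: float, q: float, alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Scalar Varifocal Loss for a predicted score p in (0, 1) and target q."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if q > 0:
        # Positive sample: BCE against the IoU target, weighted by the target.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: focal-style down-weighting of easy negatives.
    return -alpha * (p ** gamma) * math.log(1.0 - p)


print(varifocal_loss(0.5, 1.0))
print(varifocal_loss(0.5, 0.0))
```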
A wrapper for computing ATSS targets for points in multiple images.
Parameters:
cls_scores (list[Tensor]) – Box iou-aware scores for each scale
level with shape (N, num_points * num_classes, H, W).
mlvl_points (list[Tensor]) – Points of each fpn level, each has
shape (num_points, 2).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Returns:
labels_list (list[Tensor]): Labels of each level.
label_weights (Tensor): Label weights of all levels.
bbox_targets_list (list[Tensor]): Regression targets of each
level, (l, t, r, b).
bbox_weights (Tensor): Bbox weights of all levels.
A wrapper for computing ATSS and FCOS targets for points in multiple
images.
Parameters:
cls_scores (list[Tensor]) – Box iou-aware scores for each scale
level with shape (N, num_points * num_classes, H, W).
mlvl_points (list[Tensor]) – Points of each fpn level, each has
shape (num_points, 2).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Returns:
labels_list (list[Tensor]): Labels of each level.
label_weights (Tensor/None): Label weights of all levels.
bbox_targets_list (list[Tensor]): Regression targets of each
level, (l, t, r, b).
bbox_weights (Tensor/None): Bbox weights of all levels.
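The (l, t, r, b) regression targets are the distances from a point to the four sides of its assigned ground-truth box. A minimal sketch with a hypothetical function name:

```python
from typing import Sequence, Tuple


def ltrb_targets(point: Sequence[float],
                 bbox: Sequence[float]) -> Tuple[float, float, float, float]:
    """Distances from (x, y) to the left, top, right and bottom box sides."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return (x - x1, y - y1, x2 - x, y2 - y)


print(ltrb_targets((5, 5), (0, 0, 10, 20)))  # (5, 5, 5, 15)
```

All four values are positive only when the point lies inside the box.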
cls_scores (list[Tensor]) – Box iou-aware scores for each scale
level, each is a 4D-tensor, the channel number is
num_points * num_classes.
bbox_preds (list[Tensor]) – Box offsets for each
scale level, each is a 4D-tensor, the channel number is
num_points * 4.
bbox_preds_refine (list[Tensor]) – Refined Box offsets for
each scale level, each is a 4D-tensor, the channel
number is num_points * 4.
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], Optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Compute loss of a single image. Similar to
SSDHead.loss_by_feat_single.
Parameters:
cls_score (Tensor) – Box scores for each image,
with shape (num_total_anchors, num_classes).
bbox_pred (Tensor) – Box energies / deltas for each image,
with shape (num_total_anchors, 4).
anchors (Tensor) – Box reference for each scale level with shape
(num_total_anchors, 4).
labels (Tensor) – Labels of each anchor with shape
(num_total_anchors,).
label_weights (Tensor) – Label weights of each anchor with shape
(num_total_anchors,).
bbox_targets (Tensor) – BBox regression targets of each anchor with
shape (num_total_anchors, 4).
bbox_weights (Tensor) – BBox regression loss weights of each anchor
with shape (num_total_anchors, 4).
avg_factor (int) – Average factor that is used to average
the loss. When using sampling method, avg_factor is usually
the sum of positive and negative priors. When using
PseudoSampler, avg_factor is usually equal to the number
of positive priors.
Returns:
A tuple of cls loss and bbox loss of one
feature map.
Calculate the loss based on the features extracted by the bbox head.
When self.use_ohem==True, it functions like SSDHead.loss,
otherwise, it follows AnchorHead.loss.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level
has shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
coeff_preds (list[Tensor]) – Mask coefficients for each scale
level with shape (N, num_anchors * num_protos, H, W)
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
in_channels (int) – Number of channels in the input feature map.
proto_channels (tuple[int]) – Output channels of protonet convs.
proto_kernel_sizes (tuple[int]) – Kernel sizes of protonet convs.
include_last_relu (bool) – Whether to keep the last ReLU of the protonet.
num_protos (int) – Number of prototypes.
num_classes (int) – Number of categories excluding the background
category.
loss_mask_weight (float) – Reweight the mask loss by this factor.
max_masks_to_train (int) – Maximum number of masks to train for
each image.
with_seg_branch (bool) – Whether to apply a semantic segmentation
branch and calculate loss during training to increase
performance with no speed penalty. Defaults to True.
loss_segm (ConfigDict or dict, optional) – Config of
semantic segmentation loss.
train_cfg (ConfigDict or dict, optional) – Training config
of head.
test_cfg (ConfigDict or dict, optional) – Testing config of
head.
init_cfg (ConfigDict or list[ConfigDict] or dict or list[dict], optional) – Initialization config dict.
Forward features from the upstream network to get prototypes and
linearly combine the prototypes, using mask coefficients, into
instance masks. Finally, crop the instance masks with the given bboxes.
Parameters:
x (Tuple[Tensor]) – Features from the upstream network, each of which is
a 4D-tensor.
positive_infos (List[InstanceData]) – Positive information
calculated by the detection head.
Returns:
Predicted instance segmentation masks and
semantic segmentation map.
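The linear combination of prototypes described above can be sketched in plain Python. This is only an illustration on nested lists: the function name assemble_masks, the data layout, and the final sigmoid are assumptions of this sketch, and the cropping step is left out.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def assemble_masks(prototypes, coefficients):
    """Combine prototypes into per-instance masks (hypothetical helper).

    prototypes: H x W x K nested list, coefficients: N x K nested list.
    Returns an N x H x W nested list with values in (0, 1).
    """
    masks = []
    for coeff in coefficients:
        mask = [
            # Weighted sum over the K prototypes at each pixel, then a sigmoid.
            [sigmoid(sum(p * c for p, c in zip(pixel, coeff))) for pixel in row]
            for row in prototypes
        ]
        masks.append(mask)
    return masks


# One row with two pixels and two prototypes, and a single instance.
prototypes = [[[1.0, 0.0], [0.0, 1.0]]]
coefficients = [[2.0, -2.0]]
masks = assemble_masks(prototypes, coefficients)
```

A positive coefficient switches a prototype on for the instance and a negative one suppresses it, so the first pixel ends up above 0.5 and the second below.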
Sanitizes the input coordinates so that x1 < x2, x1 != x2, x1 >= 0,
and x2 <= image_size. Also converts from relative to absolute
coordinates and casts the results to long tensors.
Warning: this does things in-place behind the scenes so
copy if necessary.
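A minimal sketch of this kind of sanitization, written for plain Python lists, might look as follows. The name sanitize_coords_sketch is hypothetical; unlike the function documented above, it returns new lists instead of working in place, and it only guarantees x1 <= x2.

```python
def sanitize_coords_sketch(x1, x2, img_size):
    """Return absolute integer (x1, x2) lists, ordered and clipped to the image."""
    out1, out2 = [], []
    for a, b in zip(x1, x2):
        # Relative -> absolute coordinates, truncated to integers.
        a, b = int(a * img_size), int(b * img_size)
        # Order each pair, then clip it to [0, img_size].
        low, high = min(a, b), max(a, b)
        out1.append(max(low, 0))
        out2.append(min(high, img_size))
    return out1, out2


# The first pair is swapped, the second pair exceeds the image on both sides.
left, right = sanitize_coords_sketch([0.75, -0.1], [0.25, 1.2], img_size=100)
```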
normalized_cls_score (Tensor): Normalized Cls scores for a single scale level, the channels number is num_base_priors * num_classes.
bbox_reg (Tensor): Box energies / deltas for a single scale level, the channels number is num_base_priors * 4.
Compute regression and classification targets for anchors in
multiple images.
Parameters:
cls_scores_list (list[Tensor]) – Classification scores of
each image. Each is a tensor of shape
(h * w, num_anchors * num_classes).
bbox_preds_list (list[Tensor]) – Bbox preds of each image.
Each is a tensor of shape (h * w, num_anchors * 4).
anchor_list (list[Tensor]) – Anchors of each image. Each element
is a tensor of shape (h * w * num_anchors, 4).
valid_flag_list (list[Tensor]) – Valid flags of each image. Each
element is a tensor of shape (h * w * num_anchors, ).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
unmap_outputs (bool) – Whether to map outputs back to the original
set of anchors.
Returns:
Usually returns a tuple containing learning targets.
batch_labels (Tensor): Labels of all images, a tensor of shape (batch, h * w * num_anchors).
batch_label_weights (Tensor): Label weights of all images, a tensor of shape (batch, h * w * num_anchors).
num_total_pos (int): Number of positive samples in all images.
num_total_neg (int): Number of negative samples in all images.
additional_returns: This function enables user-defined returns from
self._get_targets_single. These returns are currently restricted
to properties at each feature map (i.e. having HxW dimension).
The results will be concatenated after the end.
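To make the unmap_outputs option more concrete, the sketch below scatters values computed only for valid anchors back onto the full anchor set. The helper name unmap_sketch, the list-based inputs, and the fill value are assumptions for illustration, not the library's implementation.

```python
def unmap_sketch(values, valid_flags, fill=0):
    """Place values of valid anchors at their original positions (hypothetical helper)."""
    remaining = iter(values)
    # Valid positions take the next computed value, the others take the fill value.
    return [next(remaining) if flag else fill for flag in valid_flags]


# Four anchors in total, of which only the first and third were valid.
valid_flags = [True, False, True, False]
labels_for_valid = [3, 7]
labels_for_all = unmap_sketch(labels_for_valid, valid_flags, fill=-1)
```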
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (list[Tensor]) – Box scores for each scale level,
with shape (N, num_anchors * num_classes, H, W).
bbox_preds (list[Tensor]) – Box energies / deltas for each scale
level with shape (N, num_anchors * 4, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Compute target maps for anchors in multiple images.
Parameters:
anchor_list (list[list[Tensor]]) – Multi level anchors of each
image. The outer list indicates images, and the inner list
corresponds to feature levels of the image. Each element of
the inner list is a tensor of shape (num_total_anchors, 4).
responsible_flag_list (list[list[Tensor]]) – Multi level responsible
flags of each image. Each element is a tensor of shape
(num_total_anchors, )
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
Returns:
Usually returns a tuple containing learning targets.
target_map_list (list[Tensor]): Target map of each level.
neg_map_list (list[Tensor]): Negative map of each level.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
pred_maps (list[Tensor]) – Prediction map for each scale level,
shape (N, num_anchors * num_attrib, H, W)
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
Calculate the loss of a single scale level based on the features
extracted by the detection head.
Parameters:
pred_map (Tensor) – Raw predictions for a single level.
target_map (Tensor) – The Ground-Truth target for a single level.
neg_map (Tensor) – The negative masks for a single level.
Returns:
loss_cls (Tensor): Classification loss.
loss_conf (Tensor): Confidence loss.
loss_xy (Tensor): Regression loss of x, y coordinate.
loss_wh (Tensor): Regression loss of w, h coordinate.
num_classes (int) – Number of categories excluding the background
category.
in_channels (int) – Number of channels in the input feature map.
feat_channels (int) – Number of hidden channels in stacking convs.
Defaults to 256.
stacked_convs (int) – Number of stacking convs of the head.
Defaults to 2.
strides (Sequence[int]) – Downsample factor of each feature map.
Defaults to (8, 16, 32).
use_depthwise (bool) – Whether to use depthwise separable convolution in
blocks. Defaults to False.
dcn_on_last_conv (bool) – If true, use dcn in the last layer of
towers. Defaults to False.
conv_bias (bool or str) – If specified as auto, it will be decided by
the norm_cfg. Bias of conv will be set as True if norm_cfg is
None, otherwise False. Defaults to “auto”.
conv_cfg (ConfigDict or dict, optional) – Config dict for
convolution layer. Defaults to None.
norm_cfg (ConfigDict or dict) – Config dict for normalization
layer. Defaults to dict(type='BN', momentum=0.03, eps=0.001).
act_cfg (ConfigDict or dict) – Config dict for activation layer.
Defaults to None.
loss_cls (ConfigDict or dict) – Config of classification loss.
loss_bbox (ConfigDict or dict) – Config of localization loss.
loss_obj (ConfigDict or dict) – Config of objectness loss.
loss_l1 (ConfigDict or dict) – Config of L1 loss.
train_cfg (ConfigDict or dict, optional) – Training config of
anchor head. Defaults to None.
test_cfg (ConfigDict or dict, optional) – Testing config of
anchor head. Defaults to None.
init_cfg (ConfigDict or list[ConfigDict] or dict or list[dict], optional) – Initialization config dict.
Calculate the loss based on the features extracted by the detection
head.
Parameters:
cls_scores (Sequence[Tensor]) – Box scores for each scale level,
each is a 4D-tensor, the channel number is
num_priors * num_classes.
bbox_preds (Sequence[Tensor]) – Box energies / deltas for each scale
level, each is a 4D-tensor, the channel number is
num_priors * 4.
objectnesses (Sequence[Tensor]) – Score factor for
all scale levels, each is a 4D-tensor with shape
(batch_size, 1, H, W).
batch_gt_instances (list[InstanceData]) – Batch of
gt_instance. It usually includes bboxes and labels
attributes.
batch_img_metas (list[dict]) – Meta information of each image, e.g.,
image size, scaling factor, etc.
batch_gt_instances_ignore (list[InstanceData], optional) – Batch of gt_instances_ignore. It includes bboxes attribute
data that is ignored during training and testing.
Defaults to None.
The unified entry for a forward process in both training and test.
The method should accept three modes: “tensor”, “predict” and “loss”:
- “tensor”: Forward the whole network and return tensor or tuple of
tensor without any post-processing, same as a common nn.Module.
- “predict”: Forward and return the predictions, which are fully
processed to a list of DetDataSample.
- “loss”: Forward and return a dict of losses according to the given
inputs and data samples.
Note that this method doesn’t handle either back propagation or
parameter update, which are supposed to be done in train_step().
Parameters:
inputs (torch.Tensor) – The input tensor with shape
(N, C, …) in general.
data_samples (list[DetDataSample], optional) – A batch of
data samples that contain annotations and predictions.
Defaults to None.
mode (str) – Return what kind of value. Defaults to ‘tensor’.
Returns:
The return type depends on mode.
If mode="tensor", return a tensor or a tuple of tensor.
If mode="predict", return a list of DetDataSample.
If mode="loss", return a dict of losses.
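The three modes can be pictured as a simple dispatch. The class below is a hypothetical stand-in written for this illustration: the method names loss, predict and _forward and their placeholder return values are assumptions, and no real network is involved.

```python
class TinyDetector:
    """Hypothetical stand-in that shows only the mode dispatch."""

    def forward(self, inputs, data_samples=None, mode="tensor"):
        if mode == "loss":
            return self.loss(inputs, data_samples)
        if mode == "predict":
            return self.predict(inputs, data_samples)
        if mode == "tensor":
            return self._forward(inputs, data_samples)
        raise ValueError(f"Invalid mode {mode!r}; expected 'tensor', 'predict' or 'loss'.")

    def loss(self, inputs, data_samples):
        # A real detector would compute losses from predictions and annotations.
        return {"loss_cls": 0.0, "loss_bbox": 0.0}

    def predict(self, inputs, data_samples):
        # One post-processed result per input image.
        return [{"pred_instances": []} for _ in inputs]

    def _forward(self, inputs, data_samples):
        # Raw network outputs without post-processing.
        return tuple(inputs)


detector = TinyDetector()
raw = detector.forward([0.1, 0.2])
results = detector.forward([0.1, 0.2], mode="predict")
losses = detector.forward([0.1, 0.2], mode="loss")
```

Keeping the dispatch in one entry point means back propagation and parameter updates can stay outside the model, as the note above describes.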
Prepare intermediate variables before entering Transformer decoder,
such as query, query_pos.
Parameters:
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
Returns:
The first dict contains the inputs of decoder
and the second dict contains the inputs of the bbox_head function.
decoder_inputs_dict (dict): The keyword args dictionary of
self.forward_decoder(), which includes ‘query’, ‘query_pos’,
‘memory’ and ‘reg_branches’.
head_inputs_dict (dict): The keyword args dictionary of the
bbox_head functions, which is usually empty, or includes
enc_outputs_class and enc_outputs_coord when the detector
supports ‘two stage’ or ‘query selection’ strategies.
Prepare intermediate variables before entering Transformer decoder,
such as query, memory, and reference_points.
Parameters:
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
memory_mask (Tensor) – ByteTensor, the padding mask of the memory,
has shape (bs, num_feat_points). Will only be used when
as_two_stage is True.
spatial_shapes (Tensor) – Spatial shapes of features in all levels.
With shape (num_levels, 2), last dimension represents (h, w).
Will only be used when as_two_stage is True.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Defaults to None.
Returns:
The decoder_inputs_dict and head_inputs_dict.
decoder_inputs_dict (dict): The keyword dictionary args of
self.forward_decoder(), which includes ‘query’, ‘memory’,
reference_points, and dn_mask. The reference points of
decoder input here are 4D boxes, although it has points
in its name.
head_inputs_dict (dict): The keyword dictionary args of the
bbox_head functions, which includes topk_score, topk_coords,
dense_topk_score, dense_topk_coords,
and dn_meta, when self.training is True, else is empty.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
query (Tensor) – The queries of decoder inputs, has shape
(bs, num_queries, dim).
query_pos (Tensor) – The positional queries of decoder inputs,
has shape (bs, num_queries, dim).
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
memory_mask (Tensor) – ByteTensor, the padding mask of the memory,
has shape (bs, num_feat_points).
memory_pos (Tensor) – The positional embeddings of memory, has
shape (bs, num_feat_points, dim).
Returns:
The dictionary of decoder outputs, which includes the
hidden_states of the decoder output.
hidden_states (Tensor): Has shape
(num_decoder_layers, bs, num_queries, dim)
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
feat (Tensor) – Sequential features, has shape (bs, num_feat_points,
dim).
feat_mask (Tensor) – ByteTensor, the padding mask of the features,
has shape (bs, num_feat_points).
feat_pos (Tensor) – The positional embeddings of the features, has
shape (bs, num_feat_points, dim).
Returns:
The dictionary of encoder outputs, which includes the
memory of the encoder output.
Prepare intermediate variables before entering Transformer decoder,
such as query, query_pos.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
Returns:
The first dict contains the inputs of decoder
and the second dict contains the inputs of the bbox_head function.
decoder_inputs_dict (dict): The keyword args dictionary of
self.forward_decoder(), which includes ‘query’, ‘query_pos’,
‘memory’.
head_inputs_dict (dict): The keyword args dictionary of the
bbox_head functions, which is usually empty, or includes
enc_outputs_class and enc_outputs_coord when the detector
supports ‘two stage’ or ‘query selection’ strategies.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
img_feats (Tuple[Tensor]) – Tuple of features output from the neck,
has shape (bs, c, h, w).
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such as
gt_instance or gt_panoptic_seg or gt_sem_seg.
Defaults to None.
Returns:
The first dict contains the inputs of encoder
and the second dict contains the inputs of decoder.
encoder_inputs_dict (dict): The keyword args dictionary of
self.forward_encoder(), which includes ‘feat’, ‘feat_mask’,
and ‘feat_pos’.
decoder_inputs_dict (dict): The keyword args dictionary of
self.forward_decoder(), which includes ‘memory_mask’
and ‘memory_pos’.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
query (Tensor) – The queries of decoder inputs, has shape
(bs, num_queries_total, dim), where num_queries_total is the
sum of num_denoising_queries and num_matching_queries when
self.training is True, else num_matching_queries.
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
memory_mask (Tensor) – ByteTensor, the padding mask of the memory,
has shape (bs, num_feat_points).
reference_points (Tensor) – The initial reference, has shape
(bs, num_queries_total, 4) with the last dimension arranged as
(cx, cy, w, h).
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level.
A tensor has shape (num_levels, ) and can be represented
as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
dn_mask (Tensor, optional) – The attention mask to prevent
information leakage from different denoising groups and
matching parts, will be used as self_attn_mask of the
self.decoder, has shape (num_queries_total,
num_queries_total).
It is None when self.training is False.
Returns:
The dictionary of decoder outputs, which includes the
hidden_states of the decoder output and references including
the initial and intermediate reference_points.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
The difference is that the ground truth in batch_data_samples is
required for the pre_decoder to prepare the query of DINO.
Additionally, DINO inherits the pre_transformer method and the
forward_encoder method of DeformableDETR. More details about the
two methods can be found in mmdet/detector/deformable_detr.py.
Parameters:
img_feats (tuple[Tensor]) – Tuple of feature maps from neck. Each
feature map has shape (bs, dim, H, W).
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Defaults to None.
Returns:
The dictionary of bbox_head function inputs, which always
includes the hidden_states of the decoder output and may contain
references including the initial and intermediate references.
Prepare intermediate variables before entering Transformer decoder,
such as query, query_pos, and reference_points.
Parameters:
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
memory_mask (Tensor) – ByteTensor, the padding mask of the memory,
has shape (bs, num_feat_points). Will only be used when
as_two_stage is True.
spatial_shapes (Tensor) – Spatial shapes of features in all levels.
With shape (num_levels, 2), last dimension represents (h, w).
Will only be used when as_two_stage is True.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Defaults to None.
Returns:
The decoder_inputs_dict and head_inputs_dict.
decoder_inputs_dict (dict): The keyword dictionary args of
self.forward_decoder(), which includes ‘query’, ‘memory’,
reference_points, and dn_mask. The reference points of
decoder input here are 4D boxes, although it has points
in its name.
head_inputs_dict (dict): The keyword dictionary args of the
bbox_head functions, which includes topk_score, topk_coords,
and dn_meta when self.training is True, else is empty.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
query (Tensor) – The queries of decoder inputs, has shape
(bs, num_queries, dim).
query_pos (Tensor) – The positional queries of decoder inputs,
has shape (bs, num_queries, dim).
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
memory_mask (Tensor) – ByteTensor, the padding mask of the memory,
has shape (bs, num_feat_points).
reference_points (Tensor) – The initial reference, has shape
(bs, num_queries, 4) with the last dimension arranged as
(cx, cy, w, h) when as_two_stage is True, otherwise has
shape (bs, num_queries, 2) with the last dimension arranged as
(cx, cy).
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level.
A tensor has shape (num_levels, ) and can be represented
as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
Returns:
The dictionary of decoder outputs, which includes the
hidden_states of the decoder output and references including
the initial and intermediate reference_points.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
feat (Tensor) – Sequential features, has shape (bs, num_feat_points,
dim).
feat_mask (Tensor) – ByteTensor, the padding mask of the features,
has shape (bs, num_feat_points).
feat_pos (Tensor) – The positional embeddings of the features, has
shape (bs, num_feat_points, dim).
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level.
A tensor has shape (num_levels, ) and can be represented
as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
Returns:
The dictionary of encoder outputs, which includes the
memory of the encoder output.
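The relation between spatial_shapes and level_start_index given above, [0, h_0*w_0, h_0*w_0+h_1*w_1, …], can be sketched in plain Python. The function name and the list-of-tuples input are assumptions for illustration; the documented arguments are tensors.

```python
def level_start_index_sketch(spatial_shapes):
    """Return the offset of each level in the flattened feature sequence."""
    starts, total = [], 0
    for height, width in spatial_shapes:
        # A level starts where all previous levels end.
        starts.append(total)
        total += height * width
    return starts


# Three feature levels given as (h, w).
starts = level_start_index_sketch([(4, 6), (2, 3), (1, 2)])
```

With these offsets, the features of level i occupy positions starts[i] to starts[i] + h_i * w_i - 1 along the num_feat_points axis.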
proposals (Tensor) – Not normalized proposals, has shape
(bs, num_queries, 4) with the last dimension arranged as
(cx, cy, w, h).
num_pos_feats (int, optional) – The feature dimension for each
position along the x, y, w, and h axes. Note that the final returned
dimension for each position is 4 times num_pos_feats.
Defaults to 128.
temperature (int, optional) – The temperature used for scaling the
position embedding. Defaults to 10000.
Returns:
The position embedding of the proposals, has shape
(bs, num_queries, num_pos_feats * 4), with the last dimension
arranged as (cx, cy, w, h).
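As a sketch of how such an embedding can reach num_pos_feats * 4 dimensions, the example below embeds a single (cx, cy, w, h) proposal with interleaved sine and cosine terms. The sigmoid squashing, the 2π scale, and the function name are assumptions of this sketch, following the common sinusoidal recipe rather than quoting the library code.

```python
import math


def proposal_pos_embed_sketch(proposal, num_pos_feats=128, temperature=10000):
    """Sine/cosine embedding of one (cx, cy, w, h) proposal (hypothetical helper)."""
    scale = 2 * math.pi
    # Consecutive pairs of slots share one frequency.
    dim_t = [temperature ** (2 * (i // 2) / num_pos_feats) for i in range(num_pos_feats)]
    embedding = []
    for value in proposal:
        # Squash to (0, 1) with a sigmoid, then scale to (0, 2 * pi).
        angle = scale / (1.0 + math.exp(-value))
        for i, t in enumerate(dim_t):
            # Even slots use sine, odd slots use cosine.
            embedding.append(math.sin(angle / t) if i % 2 == 0 else math.cos(angle / t))
    return embedding


embedding = proposal_pos_embed_sketch((0.0, 0.0, 0.0, 0.0), num_pos_feats=8)
```

Each of the four coordinates contributes num_pos_feats values, so the result has length 4 * num_pos_feats, ordered as (cx, cy, w, h).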
           |---> valid_W <---|
        ---+-----------------+-----+---
         A |                 |     | A
         | |                 |     | |
         | |                 |     | |
   valid_H |                 |     | |
         | |                 |     | H
         | |                 |     | |
         V |                 |     | |
        ---+-----------------+     | |
           |                       | V
           +-----------------------+---
           |---------> W <---------|
The valid_ratios are defined as:
r_h = valid_H / H, r_w = valid_W / W
They are the factors to re-normalize the relative coordinates of the
image to the relative coordinates of the current level feature map.
Parameters:
mask (Tensor) – Binary mask of a feature map, has shape (bs, H, W).
Returns:
valid ratios [r_w, r_h] of a feature map, has shape (1, 2).
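The definition r_h = valid_H / H, r_w = valid_W / W can be sketched for a single mask as below. This assumes that True marks padded positions and that the valid region sits in the top-left corner, as in the diagram; the function name and the nested-list input are hypothetical.

```python
def valid_ratio_sketch(mask):
    """Return [r_w, r_h] for one H x W mask where True marks padding (assumption)."""
    height, width = len(mask), len(mask[0])
    # Count unpadded rows along the first column and unpadded columns along the first row.
    valid_h = sum(1 for row in mask if not row[0])
    valid_w = sum(1 for value in mask[0] if not value)
    return [valid_w / width, valid_h / height]


# A 4 x 5 feature map whose top-left 3 x 4 region is valid.
mask = [[not (i < 3 and j < 4) for j in range(5)] for i in range(4)]
ratios = valid_ratio_sketch(mask)
```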
Prepare intermediate variables before entering Transformer decoder,
such as query, query_pos, and reference_points.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
memory_mask (Tensor) – ByteTensor, the padding mask of the memory,
has shape (bs, num_feat_points). It will only be used when
as_two_stage is True.
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
It will only be used when as_two_stage is True.
Returns:
The decoder_inputs_dict and head_inputs_dict.
decoder_inputs_dict (dict): The keyword dictionary args of
self.forward_decoder(), which includes ‘query’, ‘query_pos’,
‘memory’, and reference_points. The reference_points of
decoder input here are 4D boxes when as_two_stage is True,
otherwise 2D points, although it has points in its name.
The reference_points in encoder is always 2D points.
head_inputs_dict (dict): The keyword dictionary args of the
bbox_head functions, which includes enc_outputs_class and
enc_outputs_coord. They are both None when ‘as_two_stage’
is False. The dict is empty when self.training is False.
Process image features before feeding them to the transformer.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
mlvl_feats (tuple[Tensor]) – Multi-level features that may have
different resolutions, output from neck. Each feature has
shape (bs, dim, h_lvl, w_lvl), where ‘lvl’ means ‘level’.
batch_data_samples (list[DetDataSample], optional) – The
batch data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Defaults to None.
Returns:
The first dict contains the inputs of encoder and the
second dict contains the inputs of decoder.
encoder_inputs_dict (dict): The keyword args dictionary of
self.forward_encoder(), which includes ‘feat’, ‘feat_mask’,
and ‘feat_pos’.
decoder_inputs_dict (dict): The keyword args dictionary of
self.forward_decoder(), which includes ‘memory_mask’.
In Detection Transformer, an encoder is used to process output features of
neck, then several queries interact with the encoder features using a
decoder and do the regression and classification with the bounding box
head.
Parameters:
backbone (ConfigDict or dict) – Config of the backbone.
neck (ConfigDict or dict, optional) – Config of the neck.
Defaults to None.
encoder (ConfigDict or dict, optional) – Config of the
Transformer encoder. Defaults to None.
decoder (ConfigDict or dict, optional) – Config of the
Transformer decoder. Defaults to None.
bbox_head (ConfigDict or dict, optional) – Config for the
bounding box head module. Defaults to None.
positional_encoding (ConfigDict or dict, optional) – Config
of the positional encoding module. Defaults to None.
num_queries (int, optional) – Number of decoder queries in the Transformer.
Defaults to 100.
train_cfg (ConfigDict or dict, optional) – Training config of
the bounding box head module. Defaults to None.
test_cfg (ConfigDict or dict, optional) – Testing config of
the bounding box head module. Defaults to None.
data_preprocessor (dict or ConfigDict, optional) – The pre-process
config of BaseDataPreprocessor. It usually includes
pad_size_divisor, pad_value, mean and std.
Defaults to None.
init_cfg (ConfigDict or dict, optional) – The config to control
the initialization. Defaults to None.
query (Tensor) – The queries of decoder inputs, has shape
(bs, num_queries, dim).
query_pos (Tensor) – The positional queries of decoder inputs,
has shape (bs, num_queries, dim).
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
Returns:
The dictionary of decoder outputs, which includes the
hidden_states of the decoder output, references including
the initial and intermediate reference_points, and other
algorithm-specific arguments.
Forward process of Transformer, which includes four steps:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’. The
parameter flow of the existing DETR-like detectors can be
illustrated as follows:
   img_feats & batch_data_samples
                 |
                 V
        +-----------------+
        | pre_transformer |
        +-----------------+
            |          |
            |          V
            |    +-----------------+
            |    | forward_encoder |
            |    +-----------------+
            |             |
            |             V
            |     +---------------+
            |     |  pre_decoder  |
            |     +---------------+
            |         |       |
            V         V       |
        +-----------------+   |
        | forward_decoder |   |
        +-----------------+   |
                  |           |
                  V           V
                 head_inputs_dict
Parameters:
img_feats (tuple[Tensor]) – Tuple of feature maps from neck. Each
feature map has shape (bs, dim, H, W).
batch_data_samples (list[DetDataSample], optional) – The
batch data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Defaults to None.
Returns:
The dictionary of bbox_head function inputs, which always
includes the hidden_states of the decoder output and may contain
references including the initial and intermediate references.
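The four-step flow can be sketched as a chain of calls that pass keyword dictionaries along and merge them. The stub detector below is hypothetical and only moves labelled strings around; the dictionary keys follow the ones named in this reference, while the merging logic is an assumption of this sketch.

```python
def forward_transformer_sketch(detector, img_feats, batch_data_samples=None):
    """Chain the four steps and merge their keyword dictionaries."""
    encoder_inputs, decoder_inputs = detector.pre_transformer(img_feats, batch_data_samples)
    encoder_outputs = detector.forward_encoder(**encoder_inputs)
    extra_decoder_inputs, head_inputs = detector.pre_decoder(**encoder_outputs)
    # The decoder sees inputs from both pre_transformer and pre_decoder.
    decoder_inputs.update(extra_decoder_inputs)
    decoder_outputs = detector.forward_decoder(**decoder_inputs)
    # The head sees inputs from both pre_decoder and forward_decoder.
    head_inputs.update(decoder_outputs)
    return head_inputs


class StubDetector:
    """Hypothetical detector whose steps only pass labelled strings around."""

    def pre_transformer(self, img_feats, batch_data_samples=None):
        encoder_inputs = dict(feat="feat", feat_mask="mask", feat_pos="pos")
        decoder_inputs = dict(memory_mask="mask", memory_pos="pos")
        return encoder_inputs, decoder_inputs

    def forward_encoder(self, feat, feat_mask, feat_pos):
        return dict(memory="memory")

    def pre_decoder(self, memory):
        decoder_inputs = dict(query="query", query_pos="query_pos", memory=memory)
        return decoder_inputs, dict()

    def forward_decoder(self, query, query_pos, memory, memory_mask, memory_pos):
        return dict(hidden_states="hidden_states")


head_inputs = forward_transformer_sketch(StubDetector(), img_feats=("feat_map",))
```

Because each step returns plain keyword dictionaries, a subclass can add algorithm-specific entries in one step and consume them in a later one without changing this chain.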
Calculate losses from a batch of inputs and data samples.
Parameters:
batch_inputs (Tensor) – Input images of shape (bs, dim, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Prepare intermediate variables before entering Transformer decoder,
such as query, query_pos, and reference_points.
Parameters:
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
Returns:
The first dict contains the inputs of decoder
and the second dict contains the inputs of the bbox_head function.
decoder_inputs_dict (dict): The keyword dictionary args of
self.forward_decoder(), which includes ‘query’, ‘query_pos’,
‘memory’, and other algorithm-specific arguments.
head_inputs_dict (dict): The keyword dictionary args of the
bbox_head functions, which is usually empty, or includes
enc_outputs_class and enc_outputs_coord when the detector
supports ‘two stage’ or ‘query selection’ strategies.
Process image features before feeding them to the transformer.
Parameters:
img_feats (tuple[Tensor]) – Tuple of feature maps from neck. Each
feature map has shape (bs, dim, H, W).
batch_data_samples (list[DetDataSample], optional) – The
batch data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Defaults to None.
Returns:
The first dict contains the inputs of encoder
and the second dict contains the inputs of decoder.
encoder_inputs_dict (dict): The keyword args dictionary of
self.forward_encoder(), which includes ‘feat’, ‘feat_mask’,
‘feat_pos’, and other algorithm-specific arguments.
decoder_inputs_dict (dict): The keyword args dictionary of
self.forward_decoder(), which includes ‘memory_mask’, and
other algorithm-specific arguments.
Predict results from a batch of inputs and data samples with post-
processing.
Parameters:
batch_inputs (Tensor) – Inputs, has shape (bs, dim, H, W).
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
rescale (bool) – Whether to rescale the results.
Defaults to True.
Returns:
Detection results of the input images.
Each DetDataSample usually contains ‘pred_instances’, and the
pred_instances usually contains the following keys.
scores (Tensor): Classification scores, has a shape
(num_instances, ).
labels (Tensor): Labels of bboxes, has a shape
(num_instances, ).
bboxes (Tensor): Has a shape (num_instances, 4),
with the last dimension arranged as (x1, y1, x2, y2).
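As an illustration of how the three keys line up per instance, the sketch below filters predictions by score. A plain dict of lists stands in for pred_instances, and both the helper name and the threshold value are assumptions for this example.

```python
def filter_by_score(pred_instances, score_thr=0.3):
    """Keep only instances whose score exceeds score_thr (hypothetical helper)."""
    keep = [i for i, score in enumerate(pred_instances["scores"]) if score > score_thr]
    # Apply the same selection to every key so the fields stay aligned.
    return {key: [values[i] for i in keep] for key, values in pred_instances.items()}


# A plain-dict stand-in with three detections; bboxes are (x1, y1, x2, y2).
pred_instances = {
    "scores": [0.9, 0.2, 0.6],
    "labels": [0, 2, 1],
    "bboxes": [[10, 20, 50, 80], [0, 0, 5, 5], [30, 40, 90, 120]],
}
confident = filter_by_score(pred_instances, score_thr=0.5)
```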
Wrapper of a Detectron2 model. Input/output formats of this class follow
MMDetection’s convention, so a Detectron2 model can be trained and
evaluated in MMDetection.
Parameters:
detector (ConfigDict or dict) – The module config of
Detectron2.
bgr_to_rgb (bool) – Whether to convert image from BGR to RGB.
Defaults to False.
rgb_to_bgr (bool) – Whether to convert image from RGB to BGR.
Defaults to False.
Calculate losses from a batch of inputs and data samples.
The inputs are first converted to the Detectron2 format and then fed into
the D2 model.
Parameters:
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Predict results from a batch of inputs and data samples with post-
processing.
The inputs are first converted to the Detectron2 format and then fed into
the D2 model. The results are then converted back to the MMDet format.
Parameters:
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Returns:
Detection results of the
input images. Each DetDataSample usually contains
‘pred_instances’, and pred_instances usually
contains the following keys.
scores (Tensor): Classification scores, has shape
(num_instances, ).
labels (Tensor): Labels of bboxes, has shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4),
with the last dimension arranged as (x1, y1, x2, y2).
Implementation of FoveaBox
Parameters:
backbone (ConfigDict or dict) – The backbone config.
neck (ConfigDict or dict) – The neck config.
bbox_head (ConfigDict or dict) – The bbox head config.
train_cfg (ConfigDict or dict, optional) – The training config
of FOVEA. Defaults to None.
test_cfg (ConfigDict or dict, optional) – The testing config
of FOVEA. Defaults to None.
data_preprocessor (ConfigDict or dict, optional) – Config of
DetDataPreprocessor to process the input data.
Defaults to None.
init_cfg (ConfigDict or list[ConfigDict] or dict or list[dict], optional) – Initialization config dict.
Implementation of GLIP
Parameters:
backbone (ConfigDict or dict) – The backbone config.
neck (ConfigDict or dict) – The neck config.
bbox_head (ConfigDict or dict) – The bbox head config.
language_model (ConfigDict or dict) – The language model config.
train_cfg (ConfigDict or dict, optional) – The training config
of GLIP. Defaults to None.
test_cfg (ConfigDict or dict, optional) – The testing config
of GLIP. Defaults to None.
data_preprocessor (ConfigDict or dict, optional) – Config of
DetDataPreprocessor to process the input data.
Defaults to None.
init_cfg (ConfigDict or list[ConfigDict] or dict or list[dict], optional) – Initialization config dict.
Calculate losses from a batch of inputs and data samples.
Parameters:
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
Parameters:
feat (Tensor) – Sequential features, has shape (bs, num_feat_points,
dim).
feat_mask (Tensor) – ByteTensor, the padding mask of the features,
has shape (bs, num_feat_points).
feat_pos (Tensor) – The positional embeddings of the features, has
shape (bs, num_feat_points, dim).
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level.
A tensor has shape (num_levels, ) and can be represented
as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
Returns:
The dictionary of encoder outputs, which includes the
memory of the encoder output.
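The relation between spatial_shapes and level_start_index described above is an exclusive prefix sum over the per-level areas. A minimal sketch, using plain Python lists instead of tensors (the function name is chosen for this example only):

```python
from itertools import accumulate


def level_start_index(spatial_shapes):
    """Return the start offset of each level in the flattened feature sequence.

    spatial_shapes is a non-empty list of (h, w) pairs, one per feature level.
    """
    areas = [h * w for h, w in spatial_shapes]
    # Exclusive prefix sum: level i starts after all points of levels < i.
    return [0] + list(accumulate(areas))[:-1]


shapes = [(100, 150), (50, 75), (25, 38)]
starts = level_start_index(shapes)
# Total length of the flattened sequence, i.e. num_feat_points.
num_feat_points = sum(h * w for h, w in shapes)
```

With these offsets, the features of level i occupy positions starts[i] up to, but not including, starts[i] + h_i * w_i along the num_feat_points axis.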
The forward procedure of the transformer is defined as:
‘pre_transformer’ -> ‘encoder’ -> ‘pre_decoder’ -> ‘decoder’
More details can be found at TransformerDetector.forward_transformer
in mmdet/detector/base_detr.py.
The difference is that the ground truth in batch_data_samples is
required for the pre_decoder to prepare the query of DINO.
Additionally, DINO inherits the pre_transformer method and the
forward_encoder method of DeformableDETR. More details about the
two methods can be found in mmdet/detector/deformable_detr.py.
Parameters:
img_feats (tuple[Tensor]) – Tuple of feature maps from neck. Each
feature map has shape (bs, dim, H, W).
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Defaults to None.
Returns:
The dictionary of bbox_head function inputs, which always
includes the hidden_states of the decoder output and may contain
references including the initial and intermediate references.
Get the positive tokens and the prompts for the caption.
Parameters:
original_caption (str) – The original caption, e.g. ‘bench . car .’
custom_entities (bool, optional) – Whether to use custom entities.
If True, the original_caption should be a list of
strings, each of which is a word. Defaults to False.
Returns:
The dict is a mapping from each entity
id, which is numbered from 1, to its positive token id.
The str represents the prompts.
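To make the shape of the returned mapping concrete, the sketch below builds a dict keyed by entity id (starting at 1) for a caption such as ‘bench . car .’. It is a simplification under stated assumptions: it records character spans instead of tokenizer token ids, it assumes entities are separated by ‘.’, and `caption_entity_spans` is a hypothetical helper, not the library method.

```python
def caption_entity_spans(caption, sep="."):
    """Map each entity id (starting at 1) to its character span in caption.

    This sketch works on characters rather than tokenizer output, so the
    values are [start, end) character offsets, not token ids.
    """
    spans = {}
    cursor = 0
    entity_id = 1
    for chunk in caption.split(sep):
        name = chunk.strip()
        if name:
            # Search from the start of the current chunk only.
            start = caption.index(name, cursor)
            spans[entity_id] = (start, start + len(name))
            entity_id += 1
        cursor += len(chunk) + len(sep)
    return spans


caption = "bench . car ."
spans = caption_entity_spans(caption)
```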
Calculate losses from a batch of inputs and data samples.
Parameters:
batch_inputs (Tensor) – Input images of shape (bs, dim, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Prepare intermediate variables before entering Transformer decoder,
such as query, query_pos, and reference_points.
Parameters:
memory (Tensor) – The output embeddings of the Transformer encoder,
has shape (bs, num_feat_points, dim).
memory_mask (Tensor) – ByteTensor, the padding mask of the memory,
has shape (bs, num_feat_points). Will only be used when
as_two_stage is True.
spatial_shapes (Tensor) – Spatial shapes of features in all levels.
With shape (num_levels, 2), last dimension represents (h, w).
Will only be used when as_two_stage is True.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Defaults to None.
Returns:
The decoder_inputs_dict and head_inputs_dict.
decoder_inputs_dict (dict): The keyword args dictionary of
self.forward_decoder(), which includes ‘query’, ‘memory’,
‘reference_points’, and ‘dn_mask’. The reference points of the
decoder input here are 4D boxes, although the name contains
‘points’.
head_inputs_dict (dict): The keyword args dictionary of the
bbox_head functions, which includes ‘topk_score’, ‘topk_coords’,
and ‘dn_meta’ when self.training is True, and is empty otherwise.
Predict results from a batch of inputs and data samples with post-
processing.
Parameters:
batch_inputs (Tensor) – Inputs, has shape (bs, dim, H, W).
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
rescale (bool) – Whether to rescale the results.
Defaults to True.
Returns:
Detection results of the input images.
Each DetDataSample usually contains ‘pred_instances’, and
pred_instances usually contains the following keys.
scores (Tensor): Classification scores, has shape
(num_instances, ).
labels (Tensor): Labels of bboxes, has shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4),
with the last dimension arranged as (x1, y1, x2, y2).
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Detection results of the
input images. Each DetDataSample usually contains
‘pred_instances’ and ‘pred_panoptic_seg’. The
pred_instances usually contains the following keys.
scores (Tensor): Classification scores, has shape
(num_instances, ).
labels (Tensor): Labels of bboxes, has shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4),
with the last dimension arranged as (x1, y1, x2, y2).
masks (Tensor): Has shape (num_instances, H, W).
The pred_panoptic_seg contains the following key.
sem_seg (Tensor): panoptic segmentation mask, has a
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Predict results from a batch of inputs and data samples with post-
processing.
Parameters:
batch_inputs (Tensor) – Inputs with shape (N, C, H, W).
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
rescale (bool) – Whether to rescale the results.
Defaults to True.
Returns:
Detection results of the
input images. Each DetDataSample usually contains
‘pred_instances’ and ‘pred_panoptic_seg’. The
pred_instances usually contains the following keys.
scores (Tensor): Classification scores, has shape
(num_instances, ).
labels (Tensor): Labels of bboxes, has shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4),
with the last dimension arranged as (x1, y1, x2, y2).
masks (Tensor): Has shape (num_instances, H, W).
The pred_panoptic_seg contains the following key.
sem_seg (Tensor): panoptic segmentation mask, has a
Calculate losses from a batch of inputs and data samples.
Parameters:
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Semi-supervised detectors typically consisting of a teacher model
updated by exponential moving average and a student model updated
by gradient descent.
Parameters:
detector (ConfigDict or dict) – The detector config.
semi_train_cfg (ConfigDict or dict, optional) – The semi-supervised training config.
semi_test_cfg (ConfigDict or dict, optional) – The semi-supervised testing config.
data_preprocessor (ConfigDict or dict, optional) – Config of
DetDataPreprocessor to process the input data.
Defaults to None.
init_cfg (ConfigDict or list[ConfigDict] or dict or list[dict], optional) – Initialization config dict.
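The teacher/student relationship mentioned above can be sketched as follows: the student is trained by gradient descent as usual, and after each step the teacher is moved slightly toward the student. This is an illustrative sketch with floats in place of tensors; the function name, the momentum value, and the convention that momentum weights the old teacher value are all assumptions (libraries differ on that convention).

```python
def ema_update(teacher, student, momentum=0.999):
    """Update teacher parameters in place as an EMA of student parameters.

    teacher and student are dicts mapping parameter names to floats here;
    real models hold tensors. In this sketch momentum is the weight kept
    on the previous teacher value.
    """
    for name, student_value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * student_value
    return teacher


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
ema_update(teacher, student, momentum=0.9)
```

Because the teacher only ever receives averaged weights, it changes more smoothly than the student, which is why its predictions are typically used as pseudo labels.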
Calculate losses from multi-branch inputs and data samples.
Parameters:
multi_batch_inputs (Dict[str, Tensor]) – The dict of multi-branch
input images, each value with shape (N, C, H, W).
Each value should usually be mean centered and std scaled.
multi_batch_data_samples (Dict[str, List[DetDataSample]]) – The dict of multi-branch data samples.
Calculate losses from a batch of inputs and ground-truth data
samples.
Parameters:
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Calculate losses from a batch of inputs and pseudo data samples.
Parameters:
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg,
which are pseudo_instance or pseudo_panoptic_seg
or pseudo_sem_seg in fact.
batch_info (dict) – Batch information of teacher model
forward propagation process. Defaults to None.
Predict results from a batch of inputs and data samples with post-
processing.
Parameters:
batch_inputs (Tensor) – Inputs with shape (N, C, H, W).
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
rescale (bool) – Whether to rescale the results.
Defaults to True.
Returns:
Detection results of the
input images. Each returned DetDataSample
usually contains ‘pred_instances’, and
pred_instances usually contains the following keys.
scores (Tensor): Classification scores, has shape
(num_instances, ).
labels (Tensor): Labels of bboxes, has shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4),
with the last dimension arranged as (x1, y1, x2, y2).
masks (Tensor): Has a shape (num_instances, H, W).
Calculate losses from a batch of inputs and data samples.
Parameters:
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
x (tuple[Tensor]) – List of multi-level img features.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg,
which are pseudo_instance or pseudo_panoptic_seg
or pseudo_sem_seg in fact.
Calculate losses from a batch of inputs and pseudo data samples.
Parameters:
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg,
which are pseudo_instance or pseudo_panoptic_seg
or pseudo_sem_seg in fact.
batch_info (dict) – Batch information of teacher model
forward propagation process. Defaults to None.
Calculate classification loss from a batch of inputs and pseudo data
samples.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
unsup_rpn_results_list (list[InstanceData]) – List of region proposals.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg,
which are pseudo_instance or pseudo_panoptic_seg
or pseudo_sem_seg in fact.
batch_info (dict) – Batch information of teacher model
forward propagation process.
Calculate rcnn regression loss from a batch of inputs and pseudo
data samples.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
unsup_rpn_results_list (list[InstanceData]) – List of region proposals.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg,
which are pseudo_instance or pseudo_panoptic_seg
or pseudo_sem_seg in fact.
Calculate rpn loss from a batch of inputs and pseudo data samples.
Parameters:
x (tuple[Tensor]) – Features from FPN.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg,
which are pseudo_instance or pseudo_panoptic_seg
or pseudo_sem_seg in fact.
Calculate losses from a batch of inputs and data samples.
Parameters:
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (List[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Predict results from a batch of inputs and data samples with post-
processing.
Parameters:
batch_inputs (Tensor) – Inputs with shape (N, C, H, W).
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
rescale (bool) – Whether to rescale the results.
Defaults to True.
Returns:
Detection results of the
input images. Each returned DetDataSample
usually contains ‘pred_instances’, and
pred_instances usually contains the following keys.
scores (Tensor): Classification scores, has shape
(num_instances, ).
labels (Tensor): Labels of bboxes, has shape
(num_instances, ).
bboxes (Tensor): Has shape (num_instances, 4),
with the last dimension arranged as (x1, y1, x2, y2).
masks (Tensor): Has a shape (num_instances, H, W).
data_samples (list[DetDataSample]) – The
annotation data of every sample.
results_list (List[PixelData]) – Panoptic segmentation results of
each image.
Returns:
Return the packed panoptic segmentation
results of input images. Each DetDataSample usually contains
‘pred_panoptic_seg’. And the ‘pred_panoptic_seg’ has a key
sem_seg, which is a tensor of shape (1, h, w).
batch_inputs (Tensor) – Input images of shape (N, C, H, W).
These should usually be mean centered and std scaled.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Predict results from a batch of inputs and data samples with post-
processing.
Parameters:
batch_inputs (Tensor) – Inputs with shape (N, C, H, W).
batch_data_samples (List[DetDataSample]) – The Data
Samples. It usually includes information such as
gt_instance, gt_panoptic_seg and gt_sem_seg.
rescale (bool) – Whether to rescale the results.
Defaults to True.
Returns:
Return the packed panoptic segmentation
results of input images. Each DetDataSample usually contains
‘pred_panoptic_seg’. And the ‘pred_panoptic_seg’ has a key
sem_seg, which is a tensor of shape (1, h, w).
backbone (ConfigDict or dict) – The backbone module.
neck (ConfigDict or dict) – The neck module.
bbox_head (ConfigDict or dict) – The bbox head module.
train_cfg (ConfigDict or dict, optional) – The training config
of YOLOF. Defaults to None.
test_cfg (ConfigDict or dict, optional) – The testing config
of YOLOF. Defaults to None.
data_preprocessor (ConfigDict or dict, optional) – Model preprocessing config for processing the input data.
It usually includes to_rgb, pad_size_divisor,
pad_value, mean and std. Defaults to None.
init_cfg (ConfigDict or dict, optional) – The config to control
the initialization. Defaults to None.
backbone (ConfigDict or dict) – The backbone module.
neck (ConfigDict or dict) – The neck module.
bbox_head (ConfigDict or dict) – The bbox head module.
train_cfg (ConfigDict or dict, optional) – The training config
of YOLOX. Defaults to None.
test_cfg (ConfigDict or dict, optional) – The testing config
of YOLOX. Defaults to None.
data_preprocessor (ConfigDict or dict, optional) – Model preprocessing config for processing the input data.
It usually includes to_rgb, pad_size_divisor,
pad_value, mean and std. Defaults to None.
init_cfg (ConfigDict or dict, optional) – The config to control
the initialization. Defaults to None.
Applies padding to the input (if needed) so that the input is fully covered
by the specified filter. It supports two modes, “same” and “corner”. The
“same” mode is the same as the “SAME” padding mode in TensorFlow and pads
zeros around the input. The “corner” mode pads zeros to the bottom right.
Parameters:
kernel_size (int | tuple) – Size of the kernel.
stride (int | tuple) – Stride of the filter. Default: 1.
padding (str) – Supports “same” and “corner”. The “corner” mode
pads zeros to the bottom right, and the “same” mode
pads zeros around the input. Default: “corner”.
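The difference between the two modes can be shown along a single spatial dimension. The sketch below computes the smallest amount of zero padding that lets the kernel windows cover the whole input, then distributes it according to the mode. It assumes a dilation of 1, and `adaptive_pad_1d` is a hypothetical helper written for this illustration.

```python
import math


def adaptive_pad_1d(size, kernel_size, stride=1, padding="corner"):
    """Return (pad_before, pad_after) along one spatial dimension.

    The total is the minimum zero padding that lets windows of kernel_size
    placed every stride cover all `size` input positions (dilation 1 assumed).
    """
    out_size = math.ceil(size / stride)
    total = max((out_size - 1) * stride + kernel_size - size, 0)
    if padding == "corner":
        # All padding goes to the bottom/right end.
        return 0, total
    if padding == "same":
        # Padding is split around the input; the extra element goes last.
        return total // 2, total - total // 2
    raise ValueError(f"unsupported padding mode: {padding!r}")


# A length-13 input with a 4-wide, stride-4 filter needs 3 padded positions.
corner = adaptive_pad_1d(13, kernel_size=4, stride=4, padding="corner")
same = adaptive_pad_1d(13, kernel_size=4, stride=4, padding="same")
```

For a 2D input the same computation would be applied independently to the height and the width.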
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
embed_dims (int) – The embedding dimensions of the generated queries.
num_matching_queries (int) – The queries number of the matching part.
Used for generating dn_mask.
label_noise_scale (float) – The scale of label noise, defaults to 0.5.
box_noise_scale (float) – The scale of box noise, defaults to 1.0.
group_cfg (ConfigDict or dict, optional) – The config of the
denoising queries grouping, includes dynamic, num_dn_queries,
and num_groups. Two grouping strategies, ‘static dn groups’ and
‘dynamic dn groups’, are supported. When dynamic is False,
the num_groups should be set, and the number of denoising query
groups will always be num_groups. When dynamic is True, the
num_dn_queries should be set, and the group number will be
dynamic to ensure that the denoising queries number will not exceed
num_dn_queries to prevent large fluctuations of memory. Defaults
to None.
input_label_query (Tensor) – The generated label queries of all
targets, has shape (num_target_total, embed_dims) where
num_target_total = sum(num_target_list).
input_bbox_query (Tensor) – The generated bbox queries of all
targets, has shape (num_target_total, 4) with the last
dimension arranged as (cx, cy, w, h).
batch_idx (Tensor) – The batch index of the corresponding sample
for each target, has shape (num_target_total).
batch_size (int) – The size of the input batch.
num_groups (int) – The number of denoising query groups.
Returns:
Output batched label and bbox queries.
batched_label_query (Tensor): The output batched label queries,
has shape (batch_size, max_num_target, embed_dims).
batched_bbox_query (Tensor): The output batched bbox queries,
has shape (batch_size, max_num_target, 4) with the last dimension
arranged as (cx, cy, w, h).
The strategy for generating noisy bboxes is as follows:
+--------------------+
|      negative      |
|    +----------+    |
|    | positive |    |
|    |    +-----|----+------------+
|    |    |     |    |            |
|    +----+-----+    |            |
|         |          |            |
+---------+----------+            |
          |                       |
          |        gt bbox        |
          |                       |
          |             +---------+----------+
          |             |         |          |
          |             |    +----+-----+    |
          |             |    |    |     |    |
          +-------------|----+----+     |    |
                        |    | positive |    |
                        |    +----------+    |
                        |      negative      |
                        +--------------------+
The random noise is added to the top-left and bottom-right point
positions; hence, bboxes in normalized (x, y, x, y) format are
required. The noisy bboxes of positive queries have both points
within the inner square, while those of negative queries
have both points between the inner and outer squares.
Besides, the side length of the outer square is twice that of
the inner square, i.e., self.box_noise_scale * w_or_h / 2.
NOTE: The noise is added to all the bboxes. Moreover, the case in
which one point is within the inner square and the other is between
the inner and outer squares is still not considered.
Parameters:
gt_bboxes (Tensor) – The concatenated gt bboxes of all samples
in the batch, has shape (num_target_total, 4) with the last
dimension arranged as (cx, cy, w, h) where
num_target_total = sum(num_target_list).
num_groups (int) – The number of denoising query groups.
Returns:
The output noisy bboxes, which are embedded by normalized
(cx, cy, w, h) format bboxes going through inverse_sigmoid, has
shape (num_noisy_targets, 4) with the last dimension arranged as
(cx, cy, w, h), where
num_noisy_targets = num_target_total * num_groups * 2.
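A minimal sketch of the corner-noise strategy for a single box follows, assuming normalized coordinates and plain Python floats. It stops at the clamped noisy (x1, y1, x2, y2) corners; converting back to (cx, cy, w, h) and applying inverse_sigmoid, as the return description states, is left out. The function name and the use of Python's random module are assumptions made for illustration.

```python
import random


def noisy_corners(gt_bbox, box_noise_scale=1.0, negative=False, rng=random):
    """Perturb one normalized (cx, cy, w, h) box; return noisy (x1, y1, x2, y2).

    Positive queries shift each corner coordinate by less than
    box_noise_scale * w_or_h / 2; negative queries shift it by between one
    and two times that amount. Results are clamped to [0, 1].
    """
    cx, cy, w, h = gt_bbox
    corners = [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
    half_sizes = [w / 2, h / 2, w / 2, h / 2]
    noisy = []
    for value, half in zip(corners, half_sizes):
        magnitude = rng.random()  # in [0, 1): stays inside the inner square
        if negative:
            magnitude += 1.0      # in [1, 2): between inner and outer squares
        sign = rng.choice((-1.0, 1.0))
        shifted = value + sign * magnitude * half * box_noise_scale
        noisy.append(min(max(shifted, 0.0), 1.0))
    return noisy


rng = random.Random(0)
gt = (0.5, 0.5, 0.2, 0.2)
positive = noisy_corners(gt, negative=False, rng=rng)
negative = noisy_corners(gt, negative=True, rng=rng)
```

Repeating this for every target, once as positive and once as negative in each group, yields the num_target_total * num_groups * 2 noisy boxes stated above.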
The strategy for generating noisy labels is: Randomly choose labels of
self.label_noise_scale * 0.5 proportion and override each of them
with a random object category label.
NOTE: Noise is not added to all labels. Besides, self.label_noise_scale
* 0.5 is the ratio of the chosen positions, which is higher than
the actual proportion of noisy labels, because a label chosen for
overriding may be replaced by the same (correct) label. The gap becomes
larger as the number of target categories decreases. Users should take
this into account and modify the scale arg or the corresponding logic
according to the specific dataset.
Parameters:
gt_labels (Tensor) – The concatenated gt labels of all samples
in the batch, has shape (num_target_total, ) where
num_target_total = sum(num_target_list).
num_groups (int) – The number of denoising query groups.
Returns:
The query embeddings of noisy labels, has shape
(num_noisy_targets, embed_dims), where num_noisy_targets =
num_target_total * num_groups * 2.
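The label-noise strategy can be sketched with plain Python lists. The example repeats the ground-truth labels 2 * num_groups times and overrides each position with probability label_noise_scale * 0.5. It stops before the embedding lookup that produces the final query embeddings, and `num_classes` is a parameter name assumed for this sketch.

```python
import random


def noisy_labels(gt_labels, num_groups, num_classes,
                 label_noise_scale=0.5, rng=random):
    """Repeat gt labels 2 * num_groups times and randomly override some.

    Each position is chosen with probability label_noise_scale * 0.5 and
    replaced by a label drawn uniformly from range(num_classes); the drawn
    label may equal the original one.
    """
    repeated = list(gt_labels) * (2 * num_groups)
    p = label_noise_scale * 0.5
    return [
        rng.randrange(num_classes) if rng.random() < p else label
        for label in repeated
    ]


labels = noisy_labels([3, 1, 4], num_groups=2, num_classes=80,
                      rng=random.Random(0))
```

Because the replacement is drawn from all categories, a fraction of roughly 1 / num_classes of the chosen positions keeps its original label, which is the gap the note above refers to.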
max_num_target (int) – The max target number of the input batch
samples.
num_groups (int) – The number of denoising query groups.
device (obj:device or str) – The device of the generated mask.
Returns:
The attention mask to prevent information leakage from
different denoising groups and matching parts, will be used as
self_attn_mask of the decoder, has shape (num_queries_total,
num_queries_total), where num_queries_total is the sum of
num_denoising_queries and num_matching_queries.
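The structure of such a mask can be illustrated with nested Python lists. In this sketch True means the query (row) may not attend to the key (column). The layout, with denoising queries first in num_groups blocks of 2 * max_num_target followed by the matching queries, is an assumption made for the example, as is the function name.

```python
def build_dn_mask(max_num_target, num_groups, num_matching_queries):
    """Return a square boolean mask where True means "may not attend".

    Rows are queries and columns are keys. Denoising queries come first,
    laid out as num_groups blocks of 2 * max_num_target queries each,
    followed by the matching queries (layout assumed for this sketch).
    """
    group_size = 2 * max_num_target
    num_dn = group_size * num_groups
    total = num_dn + num_matching_queries
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(num_dn):
            if q >= num_dn:
                # Matching queries must not see any denoising query.
                mask[q][k] = True
            elif q // group_size != k // group_size:
                # Denoising queries must not see other groups.
                mask[q][k] = True
    return mask


mask = build_dn_mask(max_num_target=1, num_groups=2, num_matching_queries=2)
```

Denoising queries remain free to attend to the matching queries, since only the columns of the denoising part are ever masked.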
Two grouping strategies, ‘static dn groups’ and ‘dynamic dn groups’,
are supported. When self.dynamic_dn_groups is False, the number
of denoising query groups will always be self.num_groups. When
self.dynamic_dn_groups is True, the group number will be dynamic,
ensuring the denoising queries number will not exceed
self.num_dn_queries to prevent large fluctuations of memory.
NOTE: num_groups is shared by all samples in a batch. When
the target numbers of the samples vary, the denoising queries of the
samples containing fewer targets are padded to the max length.
Parameters:
max_num_target (int, optional) – The max target number of the batch
samples. It will only be used when self.dynamic_dn_groups is
True. Defaults to None.
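The two grouping strategies can be sketched as a small function. The static branch returns the configured number of groups. The dynamic branch picks the largest number of groups whose 2 * max_num_target queries per group stay within num_dn_queries. The divisor, the fallback to one group, and the function signature are assumptions of this sketch; the library's exact formula may differ.

```python
def get_num_groups(dynamic, num_groups=None, num_dn_queries=None,
                   max_num_target=None):
    """Pick the number of denoising query groups for one batch."""
    if dynamic:
        if not max_num_target:
            # No targets in the batch: fall back to a single group (assumed).
            groups = 1
        else:
            # Each group holds one positive and one negative query per
            # target, i.e. 2 * max_num_target queries (assumed budget rule).
            groups = num_dn_queries // (2 * max_num_target)
    else:
        groups = num_groups
    # Always keep at least one group.
    return max(int(groups), 1)


static_groups = get_num_groups(dynamic=False, num_groups=3)
dynamic_groups = get_num_groups(dynamic=True, num_dn_queries=100,
                                max_num_target=10)
```

Note that the floor of one group means a batch with very many targets can still exceed the budget in this sketch.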
query (Tensor) – The input query with shape [bs, num_queries,
embed_dims].
key (Tensor) – The key tensor with shape [bs, num_keys,
embed_dims].
If None, the query will be used. Defaults to None.
query_pos (Tensor) – The positional encoding for query in self
attention, with the same shape as x. If not None, it will
be added to x before forward function.
Defaults to None.
query_sine_embed (Tensor) – The positional encoding for query in
cross attention, with the same shape as x. If not None, it
will be added to x before forward function.
Defaults to None.
key_pos (Tensor) – The positional encoding for key, with the
same shape as key. Defaults to None. If not None, it will
be added to key before forward function. If None, and
query_pos has the same shape as key, then query_pos
will be used for key_pos. Defaults to None.
attn_mask (Tensor) – ByteTensor mask with shape [num_queries,
num_keys]. Same in nn.MultiheadAttention.forward.
Defaults to None.
key_padding_mask (Tensor) – ByteTensor with shape [bs, num_keys].
Defaults to None.
is_first (bool) – An indicator of whether the current layer
is the first layer of the decoder.
Defaults to False.
Returns:
forwarded results with shape
[bs, num_queries, embed_dims].
query (Tensor) – The input query with shape [bs, num_queries,
embed_dims].
key (Tensor) – The key tensor with shape [bs, num_keys,
embed_dims].
If None, the query will be used. Defaults to None.
value (Tensor) – The value tensor with the same shape as key.
Same as in nn.MultiheadAttention.forward. If None, the key
will be used. Defaults to None.
attn_mask (Tensor) – ByteTensor mask with shape [num_queries,
num_keys]. Same in nn.MultiheadAttention.forward.
Defaults to None.
key_padding_mask (Tensor) – ByteTensor with shape [bs, num_keys].
Defaults to None.
Returns:
Attention outputs of shape \((N, L, E)\),
where \(N\) is the batch size, \(L\) is the target
sequence length, and \(E\) is the embedding dimension
embed_dim. Attention weights per head of shape
\((num\_heads, L, S)\), where \(S\) is the source sequence
length.
query (Tensor) – The input query with shape
(bs, num_queries, dim).
key (Tensor) – The input key with shape (bs, num_keys, dim). If
None, the query will be used. Defaults to None.
query_pos (Tensor) – The positional encoding for query, with the
same shape as query. If not None, it will be added to
query before forward function. Defaults to None.
key_pos (Tensor) – The positional encoding for key, with the
same shape as key. If not None, it will be added to
key before forward function. If None, and query_pos
has the same shape as key, then query_pos will be used
as key_pos. Defaults to None.
key_padding_mask (Tensor) – ByteTensor with shape (bs, num_keys).
Defaults to None.
Returns:
forwarded results with shape (num_decoder_layers,
bs, num_queries, dim) if return_intermediate is True, otherwise
with shape (1, bs, num_queries, dim). References with shape
(bs, num_queries, 2).
query (Tensor) – The input query, has shape (bs, num_queries, dim).
key (Tensor, optional) – The input key, has shape (bs, num_keys,
dim). If None, the query will be used. Defaults to None.
query_pos (Tensor, optional) – The positional encoding for query,
has the same shape as query. If not None, it will be
added to query before forward function. Defaults to None.
ref_sine_embed (Tensor) – The positional encoding for query in
cross attention, with the same shape as x. Defaults to None.
key_pos (Tensor, optional) – The positional encoding for key, has
the same shape as key. If not None, it will be added to
key before forward function. If None, and query_pos has
the same shape as key, then query_pos will be used for
key_pos. Defaults to None.
self_attn_masks (Tensor, optional) – ByteTensor mask, has shape
(num_queries, num_keys), same as in
nn.MultiheadAttention.forward. Defaults to None.
cross_attn_masks (Tensor, optional) – ByteTensor mask, has shape
(num_queries, num_keys), same as in
nn.MultiheadAttention.forward. Defaults to None.
key_padding_mask (Tensor, optional) – ByteTensor, has shape
(bs, num_keys). Defaults to None.
is_first (bool) – An indicator of whether the current layer
is the first layer of the decoder. Defaults to False.
Returns:
Forwarded results, has shape (bs, num_queries, dim).
There are several ConvModule layers. In the first few layers, upsampling
is applied after each convolution layer. The number of upsampling layers
must be no more than the number of ConvModule layers.
Parameters:
in_channels (int) – Number of channels in the input feature map.
inner_channels (int) – Number of channels produced by the convolution.
num_layers (int) – Number of convolution layers.
num_upsample (int, optional) – Number of upsampling layers. Must be no
more than num_layers. Upsampling will be applied after the first
num_upsample layers of convolution. Default: num_layers.
conv_cfg (dict) – Config dict for convolution layer. Default: None,
which means using conv2d.
norm_cfg (dict) – Config dict for normalization layer. Default: None.
init_cfg (dict) – Config dict for initialization. Default: None.
kwargs (keyword arguments) – Other arguments used in ConvModule.
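The interaction between num_layers and num_upsample can be made explicit by listing the operations in order. The sketch below returns that order as strings and builds no real modules; `layer_plan` is a hypothetical helper used only for illustration.

```python
def layer_plan(num_layers, num_upsample=None):
    """List the operations of the head, in order, as strings."""
    if num_upsample is None:
        num_upsample = num_layers  # documented default
    if num_upsample > num_layers:
        raise ValueError("num_upsample must be no more than num_layers")
    plan = []
    for i in range(num_layers):
        plan.append("conv")
        # Upsampling follows only the first num_upsample convolutions.
        if i < num_upsample:
            plan.append("upsample")
    return plan


plan = layer_plan(num_layers=3, num_upsample=1)
default_plan = layer_plan(num_layers=2)
```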
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
query (Tensor) – The input query, has shape (bs, num_queries,
dims).
value (Tensor) – The input values, has shape (bs, num_value, dim).
key_padding_mask (Tensor) – The key_padding_mask of cross_attn
input. ByteTensor, has shape (bs, num_value).
self_attn_mask (Tensor) – The attention mask to prevent information
leakage from different denoising groups, distinct queries and
dense queries, has shape (num_queries_total,
num_queries_total). It will be updated for distinct queries
selection in this forward function. It is None when
self.training is False.
reference_points (Tensor) – The initial reference, has shape
(bs, num_queries, 4) with the last dimension arranged as
(cx, cy, w, h).
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level.
A tensor has shape (num_levels, ) and can be represented
as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
reg_branches (obj:nn.ModuleList) – Used for refining the
regression results.
Returns:
Output queries and references of Transformer
decoder
query (Tensor): Output embeddings of the last decoder, has
shape (bs, num_queries, embed_dims) when return_intermediate
is False. Otherwise, Intermediate output embeddings of all
decoder layers, has shape (num_decoder_layers, bs, num_queries,
embed_dims).
reference_points (Tensor): The reference of the last decoder
layer, has shape (bs, num_queries, 4) when return_intermediate
is False. Otherwise, Intermediate references of all decoder
layers, has shape (1 + num_decoder_layers, bs, num_queries, 4).
The coordinates are arranged as (cx, cy, w, h).
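The relationship between spatial_shapes and level_start_index described above can be sketched in plain Python. This is a minimal illustration rather than the library's implementation; the helper name compute_level_start_index and the example shapes are assumptions made here.

```python
from itertools import accumulate


def compute_level_start_index(spatial_shapes):
    """Return the flattened start index of each feature level.

    spatial_shapes is a list of (h, w) pairs, one per level
    (hypothetical helper, plain lists instead of tensors).
    """
    areas = [h * w for h, w in spatial_shapes]
    # Prefix sums of the level areas, shifted so the first level starts at 0.
    return [0] + list(accumulate(areas))[:-1]


# Hypothetical 3-level feature pyramid.
shapes = [(4, 6), (2, 3), (1, 2)]
starts = compute_level_start_index(shapes)
```

Each entry is the offset of a level inside the flattened value sequence, so num_value equals the sum of all h * w.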
query (Tensor) – The input queries, has shape (bs, num_queries,
dim).
query_pos (Tensor) – The input positional query, has shape
(bs, num_queries, dim). It will be added to query before
forward function.
value (Tensor) – The input values, has shape (bs, num_value, dim).
key_padding_mask (Tensor) – The key_padding_mask of cross_attn
input. ByteTensor, has shape (bs, num_value).
reference_points (Tensor) – The initial reference, has shape
(bs, num_queries, 4) with the last dimension arranged as
(cx, cy, w, h) when as_two_stage is True, otherwise has
shape (bs, num_queries, 2) with the last dimension arranged
as (cx, cy).
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level.
A tensor has shape (num_levels, ) and can be represented
as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
reg_branches (nn.ModuleList, optional) – Used for refining
the regression results. It is only passed when
with_box_refine is True; otherwise it is None.
Returns:
Outputs of Deformable Transformer Decoder.
output (Tensor): Output embeddings of the last decoder, has
shape (num_queries, bs, embed_dims) when return_intermediate
is False. Otherwise, Intermediate output embeddings of all
decoder layers, has shape (num_decoder_layers, num_queries, bs,
embed_dims).
reference_points (Tensor): The reference of the last decoder
layer, has shape (bs, num_queries, 4) when return_intermediate
is False. Otherwise, Intermediate references of all decoder
layers, has shape (num_decoder_layers, bs, num_queries, 4). The
coordinates are arranged as (cx, cy, w, h).
query (Tensor) – The input query, has shape (bs, num_queries, dim).
query_pos (Tensor) – The positional encoding for query, has shape
(bs, num_queries, dim).
key_padding_mask (Tensor) – The key_padding_mask of self_attn
input. ByteTensor, has shape (bs, num_queries).
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level.
A tensor has shape (num_levels, ) and can be represented
as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
Returns:
Output queries of Transformer encoder, which is also
called ‘encoder output embeddings’ or ‘memory’, has shape
(bs, num_queries, dim)
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
device (device or str) – The device acquired by the
reference_points.
Returns:
Reference points used in decoder, has shape (bs, length,
num_levels, 2).
Forward function of decoder.
Parameters:
query (Tensor) – The input query, has shape (bs, num_queries, dim).
key (Tensor) – The input key, has shape (bs, num_keys, dim).
value (Tensor) – The input value with the same shape as key.
query_pos (Tensor) – The positional encoding for query, with the
same shape as query.
key_pos (Tensor) – The positional encoding for key, with the
same shape as key.
key_padding_mask (Tensor) – The key_padding_mask of cross_attn
input. ByteTensor, has shape (bs, num_value).
Returns:
The forwarded results will have shape
(num_decoder_layers, bs, num_queries, dim) if
return_intermediate is True else (1, bs, num_queries, dim).
query (Tensor) – The input query, has shape (bs, num_queries, dim).
key (Tensor, optional) – The input key, has shape (bs, num_keys,
dim). If None, the query will be used. Defaults to None.
value (Tensor, optional) – The input value, has the same shape as
key, as in nn.MultiheadAttention.forward. If None, the
key will be used. Defaults to None.
query_pos (Tensor, optional) – The positional encoding for query,
has the same shape as query. If not None, it will be added
to query before forward function. Defaults to None.
key_pos (Tensor, optional) – The positional encoding for key, has
the same shape as key. If not None, it will be added to
key before forward function. If None, and query_pos has the
same shape as key, then query_pos will be used for
key_pos. Defaults to None.
self_attn_mask (Tensor, optional) – ByteTensor mask, has shape
(num_queries, num_keys), as in nn.MultiheadAttention.forward.
Defaults to None.
cross_attn_mask (Tensor, optional) – ByteTensor mask, has shape
(num_queries, num_keys), as in nn.MultiheadAttention.forward.
Defaults to None.
key_padding_mask (Tensor, optional) – The key_padding_mask of
self_attn input. ByteTensor, has shape (bs, num_value).
Defaults to None.
Returns:
forwarded results, has shape (bs, num_queries, dim).
query (Tensor) – The input query, has shape (num_queries, bs, dim).
value (Tensor) – The input values, has shape (num_value, bs, dim).
key_padding_mask (Tensor) – The key_padding_mask of self_attn
input. ByteTensor, has shape (num_queries, bs).
self_attn_mask (Tensor) – The attention mask to prevent information
leakage from different denoising groups and matching parts, has
shape (num_queries_total, num_queries_total). It is None when
self.training is False.
reference_points (Tensor) – The initial reference, has shape
(bs, num_queries, 4) with the last dimension arranged as
(cx, cy, w, h).
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level.
A tensor has shape (num_levels, ) and can be represented
as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
reg_branches (nn.ModuleList) – Used for refining the
regression results.
Returns:
Output queries and references of Transformer
decoder
query (Tensor): Output embeddings of the last decoder, has
shape (num_queries, bs, embed_dims) when return_intermediate
is False. Otherwise, Intermediate output embeddings of all
decoder layers, has shape (num_decoder_layers, num_queries, bs,
embed_dims).
reference_points (Tensor): The reference of the last decoder
layer, has shape (bs, num_queries, 4) when return_intermediate
is False. Otherwise, Intermediate references of all decoder
layers, has shape (num_decoder_layers, bs, num_queries, 4). The
coordinates are arranged as (cx, cy, w, h).
To print customized extra information, you should re-implement
this method in your own modules. Both single-line and multi-line
strings are acceptable.
channels (int) – The input (and output) channels of DyReLU module.
ratio (int) – Squeeze ratio in Squeeze-and-Excitation-like module,
the intermediate channel will be int(channels/ratio).
Defaults to 4.
conv_cfg (None or dict) – Config dict for convolution layer.
Defaults to None, which means using conv2d.
act_cfg (dict or Sequence[dict]) – Config dict for activation layer.
If act_cfg is a dict, two activation layers will be configured
by this dict. If act_cfg is a sequence of dicts, the first
activation layer will be configured by the first dict and the
second activation layer will be configured by the second dict.
Defaults to (dict(type=’ReLU’), dict(type=’HSigmoid’, bias=3.0,
divisor=6.0))
init_cfg (dict or list[dict], optional) – Initialization config dict.
Defaults to None
gamma (int) – Use a larger momentum early in training and gradually
anneal it to a smaller value to update the ema model smoothly. The
momentum is calculated as
(1 - momentum) * exp(-(1 + steps) / gamma) + momentum.
Defaults to 2000.
interval (int) – Interval between two updates. Defaults to 1.
device (torch.device, optional) – If provided, the averaged model will
be stored on the device. Defaults to None.
update_buffers (bool) – If True, it will compute running averages for
both the parameters and the buffers of the model. Defaults to
False.
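The annealed momentum formula quoted for gamma can be evaluated directly. The following is a small sketch of that formula only; the function name ema_momentum and the base momentum value 0.0002 used in the example are assumptions, not values taken from this section.

```python
import math


def ema_momentum(steps, momentum, gamma=2000):
    """Evaluate (1 - momentum) * exp(-(1 + steps) / gamma) + momentum."""
    return (1 - momentum) * math.exp(-(1 + steps) / gamma) + momentum


# Hypothetical base momentum; the schedule starts near 1 and decays
# towards the base momentum as the number of steps grows.
base = 0.0002
early = ema_momentum(0, base)
middle = ema_momentum(2000, base)
late = ema_momentum(10**6, base)
```

A larger gamma stretches the schedule, so the averaged model keeps following the source model closely for more steps.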
BatchNorm2d where the batch statistics and the affine parameters are
fixed.
It contains non-trainable buffers called
“weight” and “bias”, “running_mean”, “running_var”,
initialized to perform identity transformation.
Parameters:
num_features (int) – C from an expected input of size
(N, C, H, W).
eps (float) – a value added to the denominator for numerical stability.
Default: 1e-5
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
Position embedding with learnable embedding weights.
Parameters:
num_feats (int) – The feature dimension for each position
along x-axis or y-axis. The final returned dimension for
each position is 2 times of this value.
row_num_embed (int, optional) – The dictionary size of row embeddings.
Defaults to 50.
col_num_embed (int, optional) – The dictionary size of col embeddings.
Defaults to 50.
init_cfg (dict or list[dict], optional) – Initialization config dict.
mask (Tensor) – ByteTensor mask. Non-zero values represent
ignored positions, while zero values mean valid positions
for this image. Shape [bs, h, w].
query (Tensor) – The input query, has shape (bs, num_queries, dim).
key (Tensor, optional) – The input key, has shape (bs, num_keys,
dim). If None, the query will be used. Defaults to None.
value (Tensor, optional) – The input value, has the same shape as
key, as in nn.MultiheadAttention.forward. If None, the
key will be used. Defaults to None.
query_pos (Tensor, optional) – The positional encoding for query,
has the same shape as query. If not None, it will be added
to query before forward function. Defaults to None.
key_pos (Tensor, optional) – The positional encoding for key, has
the same shape as key. If not None, it will be added to
key before forward function. If None, and query_pos has the
same shape as key, then query_pos will be used for
key_pos. Defaults to None.
self_attn_mask (Tensor, optional) – ByteTensor mask, has shape
(num_queries, num_keys), as in nn.MultiheadAttention.forward.
Defaults to None.
cross_attn_mask (Tensor, optional) – ByteTensor mask, has shape
(num_queries, num_keys), as in nn.MultiheadAttention.forward.
Defaults to None.
key_padding_mask (Tensor, optional) – The key_padding_mask of
self_attn input. ByteTensor, has shape (bs, num_value).
Defaults to None.
Returns:
forwarded results, has shape (bs, num_queries, dim).
query (Tensor) – The input query, has shape (bs, num_queries, dim).
query_pos (Tensor) – The positional encoding for query, has shape
(bs, num_queries, dim). If not None, it will be added to the
query before forward function. Defaults to None.
key_padding_mask (Tensor) – The key_padding_mask of self_attn
input. ByteTensor, has shape (bs, num_queries).
spatial_shapes (Tensor) – Spatial shapes of features in all levels,
has shape (num_levels, 2), last dimension represents (h, w).
level_start_index (Tensor) – The start index of each level.
A tensor has shape (num_levels, ) and can be represented
as [0, h_0*w_0, h_0*w_0+h_1*w_1, …].
valid_ratios (Tensor) – The ratios of the valid width and the valid
height relative to the width and the height of features in all
levels, has shape (bs, num_levels, 2).
reference_points (Tensor) – The initial reference, has shape
(bs, num_queries, 2) with the last dimension arranged
as (cx, cy).
Returns:
Output queries of Transformer encoder, which is also
called ‘encoder output embeddings’ or ‘memory’, has shape
(bs, num_queries, dim)
in_channels (int) – The num of input channels. Default: 3
embed_dims (int) – The dimensions of embedding. Default: 768
conv_type (str) – The config dict for embedding
conv layer type selection. Default: “Conv2d”.
kernel_size (int) – The kernel_size of embedding conv. Default: 16.
stride (int) – The slide stride of embedding conv.
Default: None (Would be set as kernel_size).
padding (int | tuple | string) – The padding length of
embedding conv. When it is a string, it means the mode
of adaptive padding, support “same” and “corner” now.
Default: “corner”.
dilation (int) – The dilation rate of embedding conv. Default: 1.
input_size (int | tuple | None) – The size of input, which will be
used to calculate the out size. Only works when dynamic_size
is False. Default: None.
init_cfg (mmengine.ConfigDict, optional) – The Config for
initialization. Default: None.
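The adaptive padding modes named for the padding argument can be illustrated with a short calculation along one axis. This is a sketch under stated assumptions: “corner” is taken to mean that all padding goes to the bottom/right side and “same” that it is split around the input; the helper names get_pad_amount and split_padding are hypothetical.

```python
import math


def get_pad_amount(input_size, kernel_size, stride, dilation=1):
    """Total padding along one axis so that the kernel fully covers the input."""
    output_size = math.ceil(input_size / stride)
    needed = (output_size - 1) * stride + (kernel_size - 1) * dilation + 1
    return max(needed - input_size, 0)


def split_padding(total, mode="corner"):
    """Return (before, after) padding for one axis."""
    if mode == "corner":
        # Assumed: pad only after the data (bottom/right).
        return 0, total
    if mode == "same":
        # Assumed: pad on both sides, extra pixel goes after the data.
        return total // 2, total - total // 2
    raise ValueError(f"unsupported padding mode: {mode}")


# A width of 18 with kernel_size=16 and stride=16 needs 14 extra pixels.
total = get_pad_amount(18, kernel_size=16, stride=16)
corner = split_padding(total, "corner")
same = split_padding(total, "same")
```

When the input size is already a multiple of the stride and the kernel equals the stride, no padding is added.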
This layer groups feature map by kernel_size, and applies norm and linear
layers to the grouped feature map. Our implementation uses nn.Unfold to
merge patches, which is about 25% faster than the original implementation.
In exchange, pretrained models need to be modified for compatibility.
Parameters:
in_channels (int) – The num of input channels.
out_channels (int) – The num of output channels.
kernel_size (int | tuple, optional) – the kernel size in the unfold
layer. Defaults to 2.
stride (int | tuple, optional) – the stride of the sliding blocks in the
unfold layer. Default: None. (Would be set as kernel_size)
padding (int | tuple | string) – The padding length of
embedding conv. When it is a string, it means the mode
of adaptive padding, support “same” and “corner” now.
Default: “corner”.
dilation (int | tuple, optional) – dilation parameter in the unfold
layer. Default: 1.
bias (bool, optional) – Whether to add bias in linear layer or not.
Defaults to False.
in_channels (list[int] | tuple[int]) – Number of channels in the
input feature maps.
feat_channels (int) – Number of channels for feature.
out_channels (int) – Number of channels for output.
norm_cfg (ConfigDict or dict) – Config for normalization.
Defaults to dict(type=’GN’, num_groups=32).
act_cfg (ConfigDict or dict) – Config for activation.
Defaults to dict(type=’ReLU’).
encoder (ConfigDict or dict) – Config for transformer
encoder. Defaults to None.
positional_encoding (ConfigDict or dict) – Config for
transformer encoder position encoding. Defaults to
dict(type=’SinePositionalEncoding’, num_feats=128,
normalize=True).
init_cfg (ConfigDict or dict or list[ConfigDict or dict], optional) – Initialization config dict. Defaults to None.
channels (int) – The input (and output) channels of the SE layer.
ratio (int) – Squeeze ratio in SELayer, the intermediate channel will be
int(channels/ratio). Defaults to 16.
conv_cfg (None or dict) – Config dict for convolution layer.
Defaults to None, which means using conv2d.
act_cfg (dict or Sequence[dict]) – Config dict for activation layer.
If act_cfg is a dict, two activation layers will be configured
by this dict. If act_cfg is a sequence of dicts, the first
activation layer will be configured by the first dict and the
second activation layer will be configured by the second dict.
Defaults to (dict(type=’ReLU’), dict(type=’Sigmoid’))
init_cfg (dict or list[dict], optional) – Initialization config dict.
Defaults to None
num_feats (int) – The feature dimension for each position
along x-axis or y-axis. Note the final returned dimension
for each position is 2 times of this value.
temperature (int, optional) – The temperature used for scaling
the position embedding. Defaults to 10000.
normalize (bool, optional) – Whether to normalize the position
embedding. Defaults to False.
scale (float, optional) – A scale factor that scales the position
embedding. The scale will be used only when normalize is True.
Defaults to 2*pi.
eps (float, optional) – A value added to the denominator for
numerical stability. Defaults to 1e-6.
offset (float) – Offset added to embed when doing the normalization.
Defaults to 0.
init_cfg (dict or list[dict], optional) – Initialization config dict.
Defaults to None
mask (Tensor) – ByteTensor mask. Non-zero values represent
ignored positions, while zero values mean valid positions
for this image. Shape [bs, h, w].
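How num_feats, temperature, normalize, scale, eps and offset interact can be shown for a single position value. This is a sketch of the usual DETR-style sine encoding, written with scalars instead of tensors; the interleaved sin/cos layout and the helper names are assumptions, not a copy of the module's code.

```python
import math


def normalize_position(position, max_position, scale=2 * math.pi,
                       eps=1e-6, offset=0.0):
    """Map a position into [0, scale]; only applied when normalize is True."""
    return (position + offset) / (max_position + eps) * scale


def sine_encode(position, num_feats=4, temperature=10000):
    """Encode one scalar position into num_feats interleaved sin/cos values."""
    encoded = []
    for i in range(num_feats):
        # Pairs of channels share one frequency.
        dim_t = temperature ** (2 * (i // 2) / num_feats)
        angle = position / dim_t
        encoded.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoded


# The y and x encodings are concatenated, which is why the final
# dimension per position is 2 * num_feats.
y = normalize_position(3, max_position=10)
x = normalize_position(7, max_position=10)
pos = sine_encode(y) + sine_encode(x)
```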
num_feats (int) – The feature dimension for each position
along x-axis or y-axis. Note the final returned dimension
for each position is 2 times of this value.
temperature (int, optional) – The temperature used for scaling
the position embedding. Defaults to 10000.
normalize (bool, optional) – Whether to normalize the position
embedding. Defaults to False.
scale (float, optional) – A scale factor that scales the position
embedding. The scale will be used only when normalize is True.
Defaults to 2*pi.
eps (float, optional) – A value added to the denominator for
numerical stability. Defaults to 1e-6.
offset (float) – Offset added to embed when doing the normalization.
Defaults to 0.
init_cfg (dict or list[dict], optional) – Initialization config dict.
Defaults to None.
mask (Tensor) – ByteTensor mask. Non-zero values represent
ignored positions, while zero values mean valid positions
for this image. Shape [bs, t, h, w].
coord_tensor (Tensor) – Coordinate tensor to be converted to
positional encoding. With the last dimension as 2 or 4.
num_feats (int, optional) – The feature dimension for each position
along x-axis or y-axis. Note the final returned dimension
for each position is 2 times of this value. Defaults to 128.
temperature (int, optional) – The temperature used for scaling
the position embedding. Defaults to 10000.
scale (float, optional) – A scale factor that scales the position
embedding. The scale will be used only when normalize is True.
Defaults to 2*pi.
Fast NMS allows already-removed detections to suppress other detections so
that every instance can be decided to be kept or discarded in parallel,
which is not possible in traditional NMS. This relaxation allows us to
implement Fast NMS entirely in standard GPU-accelerated matrix operations.
Parameters:
multi_bboxes (Tensor) – shape (n, #class*4) or (n, 4)
multi_scores (Tensor) – shape (n, #class+1), where the last column
contains scores of the background class, but this will be ignored.
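The relaxation described above, where already-removed detections may still suppress others, can be seen in a single-class sketch. It uses plain lists and a loop for readability, whereas the text describes a matrix formulation on GPU; the function names, the iou_thr default and the example boxes are assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fast_nms_single_class(boxes, scores, iou_thr=0.5):
    """Keep a box only if no higher-scored box overlaps it above iou_thr.

    Unlike traditional NMS, the higher-scored box does not need to have
    been kept itself, so every decision is independent of the others.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for rank, j in enumerate(order):
        max_iou = max(
            (box_iou(boxes[order[r]], boxes[j]) for r in range(rank)),
            default=0.0)
        if max_iou <= iou_thr:
            keep.append(j)
    return keep


# Chain of overlapping boxes: A suppresses B, and B (although removed)
# still suppresses C. Traditional NMS would keep both A and C.
boxes = [(0, 0, 10, 10), (3, 0, 13, 10), (6, 0, 16, 10)]
scores = [0.9, 0.8, 0.7]
kept = fast_nms_single_class(boxes, scores)
```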
labels (Tensor) – Labels of corresponding masks,
has shape (num_instances,).
scores (Tensor) – Mask scores of corresponding masks,
has shape (num_instances,).
filter_thr (float) – Score threshold to filter the masks
after matrix nms. Default: -1, which means do not
use filter_thr.
nms_pre (int) – The max number of instances to do the matrix nms.
Default: -1, which means do not use nms_pre.
max_num (int, optional) – If there are more than max_num masks after
matrix nms, only top max_num will be kept. Default: -1, which means
do not use max_num.
kernel (str) – ‘linear’ or ‘gaussian’.
sigma (float) – std in gaussian method.
mask_area (Tensor) – The sum of seg_masks.
Returns:
Processed mask results.
scores (Tensor): Updated scores, has shape (n,).
labels (Tensor): Remained labels, has shape (n,).
masks (Tensor): Remained masks, has shape (n, w, h).
keep_inds (Tensor): The indices of the remaining masks
in the input masks, has shape (n,).
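The role of kernel and sigma can be illustrated with the score decay at the heart of matrix nms. The sketch below assumes the masks are already sorted by descending score, share one label, and that their pairwise IoUs are given; the function name and the sigma default are assumptions, and filtering by filter_thr, nms_pre and max_num is left out.

```python
import math


def matrix_nms_decay(ious, scores, kernel="gaussian", sigma=2.0):
    """Decay scores instead of hard-removing masks.

    ious[i][j] is the IoU between mask i and mask j for i < j.
    The linear kernel assumes all IoUs are below 1.
    """
    n = len(scores)
    # Largest overlap of mask i with any higher-scored mask.
    compensate = [max((ious[k][i] for k in range(i)), default=0.0)
                  for i in range(n)]
    decayed = []
    for j in range(n):
        coeff = 1.0
        for i in range(j):
            if kernel == "gaussian":
                ratio = (math.exp(-sigma * ious[i][j] ** 2)
                         / math.exp(-sigma * compensate[i] ** 2))
            elif kernel == "linear":
                ratio = (1 - ious[i][j]) / (1 - compensate[i])
            else:
                raise ValueError(f"unsupported kernel: {kernel}")
            coeff = min(coeff, ratio)
        decayed.append(scores[j] * coeff)
    return decayed


# Mask 1 overlaps mask 0 heavily; mask 2 overlaps nothing.
ious = [[0.0, 0.8, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
linear_scores = matrix_nms_decay(ious, [0.9, 0.8, 0.7], kernel="linear")
gauss_scores = matrix_nms_decay(ious, [0.9, 0.8, 0.7], kernel="gaussian")
```

Masks whose decayed score falls below filter_thr would then be dropped.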
alpha (float) – The denominator alpha in the balanced L1 loss.
Defaults to 0.5.
gamma (float) – The gamma in the balanced L1 loss. Defaults to 1.5.
beta (float, optional) – The loss is a piecewise function of prediction
and target. beta serves as a threshold for the difference
between the prediction and target. Defaults to 1.0.
reduction (str, optional) – The method that reduces the loss to a
scalar. Options are “none”, “mean” and “sum”.
loss_weight (float, optional) – The weight of the loss. Defaults to 1.0.
pred (torch.Tensor) – The prediction with shape (N, 4).
target (torch.Tensor) – The learning target of the prediction with
shape (N, 4).
weight (torch.Tensor, optional) – Sample-wise loss weight with
shape (N, ).
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Options are “none”, “mean” and “sum”.
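The interplay of weight, avg_factor and the reduction options recurs across the loss functions in this section, so a small sketch may help. It operates on a list of per-sample losses; the function name reduce_loss_values is hypothetical, and rejecting avg_factor together with “sum” is an assumption about the intended behaviour.

```python
def reduce_loss_values(losses, weight=None, reduction="mean", avg_factor=None):
    """Apply sample-wise weights, then reduce a non-empty list of losses."""
    if weight is not None:
        losses = [loss * w for loss, w in zip(losses, weight)]
    if reduction == "none":
        return losses
    if reduction == "sum":
        if avg_factor is not None:
            # Assumed: an averaging factor makes no sense for a plain sum.
            raise ValueError("avg_factor cannot be used with reduction='sum'")
        return sum(losses)
    if reduction == "mean":
        # avg_factor replaces the number of elements as the denominator.
        denom = len(losses) if avg_factor is None else avg_factor
        return sum(losses) / denom
    raise ValueError(f"unsupported reduction: {reduction}")


per_sample = [1.0, 2.0, 3.0]
mask = [1.0, 0.0, 1.0]  # the second sample is ignored
unreduced = reduce_loss_values(per_sample, mask, reduction="none")
mean_all = reduce_loss_values(per_sample, mask, reduction="mean")
mean_valid = reduce_loss_values(per_sample, mask, reduction="mean", avg_factor=2)
```

Passing the number of valid samples as avg_factor averages over those samples only, instead of over every element.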
pred (Tensor) – Predicted bboxes of format (x1, y1, x2, y2),
shape (n, 4).
target (Tensor) – The learning target of the prediction,
shape (n, 4).
weight (Optional[Tensor], optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (Optional[int], optional) – Average factor that is used
to average the loss. Defaults to None.
reduction_override (Optional[str], optional) – The reduction method
used to override the original reduction method of the loss.
Defaults to None. Options are “none”, “mean” and “sum”.
pred (Tensor) – Predicted bboxes of format (x1, y1, x2, y2),
shape (n, 4).
target (Tensor) – The learning target of the prediction,
shape (n, 4).
weight (Optional[Tensor], optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (Optional[int], optional) – Average factor that is used
to average the loss. Defaults to None.
reduction_override (Optional[str], optional) – The reduction method
used to override the original reduction method of the loss.
Defaults to None. Options are “none”, “mean” and “sum”.
bbox_preds (Tensor) – Predicted unnormalized bbox coordinates,
has shape (bs, num_dense_queries, 4) with the last
dimension arranged as (x1, y1, x2, y2).
gt_bboxes_list (List[Tensor]) – List of unnormalized ground truth
bboxes for each image, each has shape (num_gt, 4) with the
last dimension arranged as (x1, y1, x2, y2).
NOTE: num_gt is dynamic for each image.
img_metas (list[dict]) – Meta information for one image,
e.g., image size, scaling factor, etc.
gt_labels_list (list[Tensor]) – List of ground truth classification
index for each image, each has shape (num_gt,).
NOTE: num_gt is dynamic for each image.
Default: None.
Returns:
a tuple containing the following targets.
all_labels (list[Tensor]): Labels for all images.
all_label_weights (list[Tensor]): Label weights for all images.
all_bbox_targets (list[Tensor]): Bbox targets for all images.
bbox_preds (Tensor) – Predicted unnormalized bbox coordinates,
has shape (bs, num_dense_queries, 4) with the last
dimension arranged as (x1, y1, x2, y2).
gt_bboxes (list[Tensor]) – List of unnormalized ground truth
bboxes for each image, each has shape (num_gt, 4) with the
last dimension arranged as (x1, y1, x2, y2).
NOTE: num_gt is dynamic for each image.
gt_labels (list[Tensor]) – List of ground truth classification
index for each image, each has shape (num_gt,).
NOTE: num_gt is dynamic for each image.
img_metas (list[dict]) – Meta information for one image,
e.g., image size, scaling factor, etc.
Calculate auxiliary branches loss for dense queries for one image.
Parameters:
cls_score (Tensor) – Predicted normalized classification
scores for one image, has shape (num_dense_queries,
cls_out_channels).
bbox_pred (Tensor) – Predicted unnormalized bbox coordinates
for one image, has shape (num_dense_queries, 4) with the
last dimension arranged as (x1, y1, x2, y2).
labels (Tensor) – Labels for one image.
label_weights (Tensor) – Label weights for one image.
bbox_targets (Tensor) – Bbox targets for one image.
alignment_metrics (Tensor) – Normalized alignment metrics for one
image.
pred (Tensor) – Predicted bboxes of format (x1, y1, x2, y2),
shape (n, 4).
target (Tensor) – The learning target of the prediction,
shape (n, 4).
weight (Optional[Tensor], optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (Optional[int], optional) – Average factor that is used
to average the loss. Defaults to None.
reduction_override (Optional[str], optional) – The reduction method
used to override the original reduction method of the loss.
Defaults to None. Options are “none”, “mean” and “sum”.
pred (torch.Tensor) – The prediction, has a shape (n, *).
target (torch.Tensor) – The label of the prediction,
shape (n, *), same shape of pred.
weight (torch.Tensor, optional) – The weight of loss for each
prediction, has a shape (n,). Defaults to None.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Options are “none”, “mean” and “sum”.
pred (torch.Tensor) – Predicted general distribution of bounding
boxes (before softmax) with shape (N, n+1), n is the max value
of the integral set {0, …, n} in paper.
target (torch.Tensor) – Target distance label for bounding boxes
with shape (N,).
weight (torch.Tensor, optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Defaults to None.
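For a single box side, the way a continuous target distance is matched against the (n+1)-bin distribution can be sketched as follows. This follows the usual formulation of distribution focal loss, with cross entropy against the two integer bins that bracket the target; it is written with plain lists, and the function names are assumptions.

```python
import math


def cross_entropy(logits, index):
    """Softmax cross entropy of one logit vector against one class index."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[index]


def distribution_focal_loss_single(logits, target):
    """Loss for one distance target in [0, n], given n + 1 logits."""
    left = int(math.floor(target))
    right = left + 1
    weight_left = right - target   # closer to left -> larger weight
    weight_right = target - left
    loss = cross_entropy(logits, left) * weight_left
    if weight_right > 0:
        # Skipped when the target is an integer, so right is never indexed
        # out of range for target == n.
        loss += cross_entropy(logits, right) * weight_right
    return loss


# Uniform prediction over 4 bins (n = 3) with a target between bins 1 and 2.
uniform_loss = distribution_focal_loss_single([0.0, 0.0, 0.0, 0.0], 1.3)
# A prediction peaked at the bracketing bins gives a smaller loss.
peaked_loss = distribution_focal_loss_single([-5.0, 3.0, 2.0, -5.0], 1.3)
```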
pred (Tensor) – Predicted bboxes of format (x1, y1, x2, y2),
shape (n, 4).
target (Tensor) – The learning target of the prediction,
shape (n, 4).
weight (Optional[Tensor], optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (Optional[int], optional) – Average factor that is used
to average the loss. Defaults to None.
reduction_override (Optional[str], optional) – The reduction method
used to override the original reduction method of the loss.
Defaults to None. Options are “none”, “mean” and “sum”.
cls_score (Tensor) – The prediction with shape (N, C), C is the
number of classes.
label (Tensor) – The ground truth label of the predicted target with
shape (N, C), C is the number of classes.
weight (Tensor, optional) – The weight of loss for each prediction.
Defaults to None.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Options are “none”, “mean” and “sum”.
target (torch.Tensor) – The learning label of the prediction.
weight (torch.Tensor, optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Options are “none”, “mean” and “sum”.
target (torch.Tensor) – The learning label of the prediction.
The target shape support (N,C) or (N,), (N,C) means
one-hot form.
weight (torch.Tensor, optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Options are “none”, “mean” and “sum”.
pred (float tensor of size [batch_num, 4 (* class_num)]) – The prediction of box regression layer. Channel number can be 4
or 4 * class_num depending on whether it is class-agnostic.
target (float tensor of size [batch_num, 4 (* class_num)]) – The target regression values with the same size of pred.
label_weight (float tensor of size [batch_num, 4 (* class_num)]) – The weight of each sample, 0 if ignored.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Defaults to None.
pred (Tensor) – Predicted bboxes of format (x1, y1, x2, y2),
shape (n, 4).
target (Tensor) – The learning target of the prediction,
shape (n, 4).
weight (Optional[Tensor], optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (Optional[int], optional) – Average factor that is used
to average the loss. Defaults to None.
reduction_override (Optional[str], optional) – The reduction method
used to override the original reduction method of the loss.
Defaults to None. Options are “none”, “mean” and “sum”.
More details can be found in the paper.
Code is modified from kp_utils.py.
Please notice that the target in GaussianFocalLoss is a gaussian heatmap,
not 0/1 binary target.
Parameters:
alpha (float) – Power of prediction.
gamma (float) – Power of target for negative samples.
reduction (str) – Options are “none”, “mean” and “sum”.
loss_weight (float) – Loss weight of current loss.
pos_weight (float) – Positive sample loss weight. Defaults to 1.0.
neg_weight (float) – Negative sample loss weight. Defaults to 1.0.
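The roles of alpha, gamma, pos_weight and neg_weight can be illustrated element-wise. The sketch below follows the CornerNet-style formulation for a gaussian heatmap target; the default values for alpha and gamma, the eps term and the function name are assumptions, since they are not stated in this section.

```python
import math


def gaussian_focal_loss_elem(pred, target, alpha=2.0, gamma=4.0,
                             pos_weight=1.0, neg_weight=1.0, eps=1e-12):
    """Loss for one heatmap location; pred and target lie in [0, 1]."""
    is_pos = 1.0 if target == 1 else 0.0
    # Locations near a gaussian peak (target close to 1) are penalised
    # less as negatives.
    neg_factor = (1 - target) ** gamma
    pos_loss = -math.log(pred + eps) * (1 - pred) ** alpha * is_pos
    neg_loss = -math.log(1 - pred + eps) * pred ** alpha * neg_factor
    return pos_weight * pos_loss + neg_weight * neg_loss


at_center = gaussian_focal_loss_elem(0.5, 1.0)
near_center = gaussian_focal_loss_elem(0.5, 0.9)
background = gaussian_focal_loss_elem(0.5, 0.0)
```

With the same prediction, a location just beside an object centre contributes far less negative loss than a pure background location.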
If you want to manually determine which positions are
positive samples, you can set the pos_index and pos_label
parameter. Currently, only the CenterNet update version uses
the parameter.
Parameters:
pred (torch.Tensor) – The prediction. The shape is (N, num_classes).
target (torch.Tensor) – The learning target of the prediction
in gaussian distribution. The shape is (N, num_classes).
pos_inds (torch.Tensor) – The positive sample index.
Defaults to None.
pos_labels (torch.Tensor) – The label corresponding to the positive
sample index. Defaults to None.
weight (torch.Tensor, optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (int, float, optional) – Average factor that is used to
average the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Defaults to None.
pred (Tensor) – Predicted bboxes of format (x1, y1, x2, y2),
shape (n, 4).
target (Tensor) – The learning target of the prediction,
shape (n, 4).
weight (Tensor, optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Defaults to None. Options are “none”, “mean” and “sum”.
use_sigmoid (bool) – Whether sigmoid operation is conducted in QFL.
Defaults to True.
beta (float) – The beta parameter for calculating the modulating factor.
Defaults to 2.0.
reduction (str) – Options are “none”, “mean” and “sum”.
loss_weight (float) – Loss weight of current loss.
activated (bool, optional) – Whether the input is activated.
If True, it means the input has been activated and can be
treated as probabilities. Else, it should be treated as logits.
Defaults to False.
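How beta and activated affect a single class score can be sketched element-wise. This follows the commonly used form of quality focal loss, binary cross entropy against the quality target scaled by the absolute gap raised to beta; the function names and the eps term are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def quality_focal_loss_elem(pred, quality, beta=2.0, activated=False,
                            eps=1e-12):
    """Loss for one class score against a quality (IoU) target in [0, 1]."""
    # activated=True means pred is already a probability, not a logit.
    prob = pred if activated else sigmoid(pred)
    bce = -(quality * math.log(prob + eps)
            + (1 - quality) * math.log(1 - prob + eps))
    # Modulating factor: small when the score already matches the quality.
    return bce * abs(quality - prob) ** beta


matched = quality_focal_loss_elem(0.0, 0.5)        # sigmoid(0) = 0.5
negative = quality_focal_loss_elem(0.0, 0.0)       # quality 0 for negatives
from_prob = quality_focal_loss_elem(0.5, 0.0, activated=True)
```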
pred (torch.Tensor) – Predicted joint representation of
classification and quality (IoU) estimation with shape (N, C),
C is the number of classes.
target (Union(tuple([torch.Tensor]), torch.Tensor)) – If the type is
tuple, it should include the target category label with
shape (N,) and the target quality label with shape (N,). If the type
is torch.Tensor, the target should be in one-hot form with
soft weights.
weight (torch.Tensor, optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Defaults to None.
pred (Tensor) – Predicted bboxes of format (x1, y1, x2, y2),
shape (n, 4).
target (Tensor) – The learning target of the prediction,
shape (n, 4).
weight (Optional[Tensor], optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (Optional[int], optional) – Average factor that is used
to average the loss. Defaults to None.
reduction_override (Optional[str], optional) – The reduction method
used to override the original reduction method of the loss.
Defaults to None. Options are “none”, “mean” and “sum”.
pred (Tensor) – The prediction with shape (N, C), C is the
number of classes.
target (Tensor) – The learning target of the iou-aware
classification score with shape (N, C), C is
the number of classes.
weight (Tensor, optional) – The weight of loss for each
prediction. Defaults to None.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
reduction_override (str, optional) – The reduction method used to
override the original reduction method of the loss.
Options are “none”, “mean” and “sum”.
Calculate accuracy according to the prediction and target.
Parameters:
pred (torch.Tensor) – The model prediction, shape (N, num_class)
target (torch.Tensor) – The target of each prediction, shape (N, )
topk (int | tuple[int], optional) – If the predictions in topk
matches the target, the predictions will be regarded as
correct ones. Defaults to 1.
thresh (float, optional) – If not None, predictions with scores under
this threshold are considered incorrect. Defaults to None.
Returns:
If the input topk is a single integer,
the function will return a single float as accuracy. If
topk is a tuple containing multiple integers, the
function will return a tuple containing accuracies of
each topk number.
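A minimal sketch of the top-k logic described above, assuming predictions are per-class scores. The name `topk_accuracy_sketch` is hypothetical, and returning a fraction in [0, 1] is a choice of this sketch; the library function may scale its result differently.

```python
import torch


def topk_accuracy_sketch(pred, target, topk=1, thresh=None):
    """Fraction of samples whose target is among the top-k predictions."""
    return_single = isinstance(topk, int)
    topk = (topk,) if return_single else tuple(topk)
    maxk = max(topk)

    pred_value, pred_label = pred.topk(maxk, dim=1)    # both (N, maxk)
    correct = pred_label.t().eq(target.view(1, -1))    # (maxk, N)
    if thresh is not None:
        # Predictions scored at or below the threshold count as incorrect.
        correct = correct & (pred_value > thresh).t()

    results = []
    for k in topk:
        hits = correct[:k].reshape(-1).float().sum().item()
        results.append(hits / pred.size(0))
    return results[0] if return_single else tuple(results)


pred = torch.tensor([[0.10, 0.70, 0.20],
                     [0.60, 0.30, 0.10],
                     [0.20, 0.30, 0.50],
                     [0.40, 0.35, 0.25]])
target = torch.tensor([1, 1, 2, 0])

acc_top1 = topk_accuracy_sketch(pred, target, topk=1)
acc_top1_top2 = topk_accuracy_sketch(pred, target, topk=(1, 2))
acc_thresh = topk_accuracy_sketch(pred, target, topk=1, thresh=0.55)
```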
pred (torch.Tensor) – The prediction with shape (N, 4).
target (torch.Tensor) – The learning target of the prediction with
shape (N, 4).
beta (float) – The loss is a piecewise function of prediction and target
and beta serves as a threshold for the difference between the
prediction and target. Defaults to 1.0.
alpha (float) – The denominator alpha in the balanced L1 loss.
Defaults to 0.5.
gamma (float) – The gamma in the balanced L1 loss.
Defaults to 1.5.
reduction (str, optional) – The method that reduces the loss to a
scalar. Options are “none”, “mean” and “sum”.
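To make the roles of beta, alpha and gamma concrete, here is a scalar sketch of the piecewise function, written from the balanced L1 definition in the Libra R-CNN paper as best recalled. The helper name `balanced_l1_sketch` is hypothetical, and it operates on a single difference rather than on tensors.

```python
import math


def balanced_l1_sketch(diff, beta=1.0, alpha=0.5, gamma=1.5):
    """Balanced L1 loss for one |pred - target| difference."""
    diff = abs(diff)
    # b is chosen so that the two branches meet at diff == beta.
    b = math.exp(gamma / alpha) - 1
    if diff < beta:
        # Inlier branch: smooth and log-shaped.
        return (alpha / b * (b * diff + 1) * math.log(b * diff / beta + 1)
                - alpha * diff)
    # Outlier branch: linear in diff with slope gamma.
    return gamma * diff + gamma / b - alpha * beta


loss_at_zero = balanced_l1_sketch(0.0)
loss_just_below_beta = balanced_l1_sketch(1.0 - 1e-9)
loss_at_beta = balanced_l1_sketch(1.0)
loss_outlier = balanced_l1_sketch(2.0)
```

With these definitions the loss is zero at a zero difference, continuous at beta, and grows by gamma per unit difference beyond beta.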
pred (torch.Tensor) – The prediction with shape (N, 1) or (N, ).
When the shape of pred is (N, 1), label will be expanded to
one-hot format, and when the shape of pred is (N, ), label
will not be expanded to one-hot format.
label (torch.Tensor) – The learning label of the prediction,
with shape (N, ).
weight (torch.Tensor, optional) – Sample-wise loss weight.
reduction (str, optional) – The method used to reduce the loss.
Options are “none”, “mean” and “sum”.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
class_weight (list[float], optional) – The weight for each class.
ignore_index (int | None) – The label index to be ignored.
If None, it will be set to default value. Default: -100.
avg_non_ignore (bool) – Whether the loss is only averaged over
non-ignored targets. Default: False.
pred (torch.Tensor) – The prediction with shape (N, C, *), C is the
number of classes. The trailing * indicates arbitrary shape.
target (torch.Tensor) – The learning label of the prediction.
label (torch.Tensor) – The class label of the object that each mask
corresponds to. It is used to select the mask of the class the
object belongs to when the mask prediction is not class-agnostic.
reduction (str, optional) – The method used to reduce the loss.
Options are “none”, “mean” and “sum”.
avg_factor (int, optional) – Average factor that is used to average
the loss. Defaults to None.
class_weight (list[float], optional) – The weight for each class.
ignore_index (None) – Placeholder, to be consistent with other loss.
Default: None.
Create a weighted version of a given loss function.
To use this decorator, the loss function must have the signature like
loss_func(pred, target, **kwargs). The function only needs to compute
element-wise loss without any reduction. This decorator will add weight
and reduction arguments to the function. The decorated function will have
the signature like loss_func(pred, target, weight=None, reduction=’mean’,
avg_factor=None, **kwargs).
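The decorator pattern described above can be sketched as follows. The name `weighted_loss_sketch` and the exact handling of avg_factor are assumptions for illustration, not the library's implementation.

```python
import functools

import torch


def weighted_loss_sketch(loss_func):
    """Add weight, reduction and avg_factor to an element-wise loss."""

    @functools.wraps(loss_func)
    def wrapper(pred, target, weight=None, reduction='mean',
                avg_factor=None, **kwargs):
        loss = loss_func(pred, target, **kwargs)   # element-wise, unreduced
        if weight is not None:
            loss = loss * weight
        if avg_factor is None:
            if reduction == 'mean':
                return loss.mean()
            if reduction == 'sum':
                return loss.sum()
            return loss                            # reduction == 'none'
        # Assumption of this sketch: avg_factor replaces the element count.
        if reduction == 'mean':
            return loss.sum() / avg_factor
        if reduction == 'none':
            return loss
        raise ValueError('avg_factor cannot be combined with reduction="sum"')

    return wrapper


@weighted_loss_sketch
def l1_loss(pred, target):
    # Only the element-wise loss is written by hand.
    return (pred - target).abs()


pred = torch.tensor([0.0, 2.0, 3.0])
target = torch.tensor([1.0, 1.0, 1.0])
weight = torch.tensor([1.0, 0.5, 0.5])

mean_loss = l1_loss(pred, target)
weighted_sum = l1_loss(pred, target, weight, reduction='sum')
averaged = l1_loss(pred, target, weight, avg_factor=2)
```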
BFP takes multi-level features as inputs and gathers them into a single one,
then refines the gathered feature and scatters the refined results to
multi-level features. This module is used in Libra R-CNN (CVPR 2019); see
the paper Libra R-CNN: Towards Balanced Learning for Object Detection for details.
Parameters:
in_channels (int) – Number of input channels (feature maps of all levels
should have the same channels).
num_levels (int) – Number of input feature levels.
refine_level (int) – Index of integration and refine level of BSF in
multi-level features from bottom to top.
refine_type (str) – Type of the refine op. Currently supports
[None, ‘conv’, ‘non_local’].
conv_cfg (ConfigDict or dict, optional) – The config dict for
convolution layers.
norm_cfg (ConfigDict or dict, optional) – The config dict for
normalization layers.
init_cfg (ConfigDict or dict or list[ConfigDict or dict], optional) – Initialization config dict.
Channel Mapper to reduce/increase channels of backbone features.
Parameters:
in_channels (List[int]) – Number of input channels per scale.
out_channels (int) – Number of output channels (used at each scale).
kernel_size (int, optional) – kernel_size for reducing channels (used
at each scale). Default: 3.
conv_cfg (ConfigDict or dict, optional) – Config dict for
convolution layer. Default: None.
norm_cfg (ConfigDict or dict, optional) – Config dict for
normalization layer. Default: None.
act_cfg (ConfigDict or dict, optional) – Config dict for
activation layer in ConvModule. Default: dict(type=’ReLU’).
bias (bool | str) – If specified as auto, it will be decided by the
norm_cfg. Bias will be set as True if norm_cfg is None, otherwise
False. Default: “auto”.
num_outs (int, optional) – Number of output feature maps. There would
be extra_convs when num_outs larger than the length of in_channels.
init_cfg (ConfigDict or dict or list[ConfigDict or dict], optional) – Initialization config dict.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
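The difference between calling the module instance and calling forward() directly can be seen with a forward hook. This is a small sketch using a plain `torch.nn.Linear` layer as a stand-in for any module.

```python
import torch
from torch import nn

calls = []
layer = nn.Linear(4, 2)
# The hook records every forward pass that goes through layer.__call__.
layer.register_forward_hook(
    lambda module, inputs, output: calls.append('hook'))

x = torch.randn(1, 4)

out_call = layer(x)                # runs the registered hook
hooks_after_call = len(calls)

out_forward = layer.forward(x)     # bypasses the hook
hooks_after_forward = len(calls)
```

Both calls produce the same output tensor; only the first one triggers the hook.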
Implementation of Feature Pyramid Grids (FPG).
This implementation only gives the basic structure stated in the paper.
But users can implement different types of transitions to fully explore
the potential power of the structure of FPG.
Parameters:
in_channels (int) – Number of input channels (feature maps of all levels
should have the same channels).
out_channels (int) – Number of output channels (used at each scale)
num_outs (int) – Number of output scales.
stack_times (int) – The number of times the pyramid architecture will
be stacked.
paths (list[str]) – Specify the path order of each stack level.
Each element in the list should be either ‘bu’ (bottom-up) or
‘td’ (top-down).
inter_channels (int) – Number of inter channels.
same_up_trans (dict) – Transition that goes down at the same stage.
same_down_trans (dict) – Transition that goes up at the same stage.
output_trans (dict) – Transition applied to the output of the
last stage.
start_level (int) – Index of the start input backbone level used to
build the feature pyramid. Default: 0.
end_level (int) – Index of the end input backbone level (exclusive) to
build the feature pyramid. Default: -1, which means the last level.
add_extra_convs (bool) – It decides whether to add conv
layers on top of the original feature maps. Defaults to False.
If True, its actual mode is specified by extra_convs_on_inputs.
norm_cfg (dict) – Config dict for normalization layer. Default: None.
init_cfg (dict or list[dict], optional) – Initialization config dict.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
in_channels (list[int]) – Number of input channels per scale.
out_channels (int) – Number of output channels (used at each scale).
num_outs (int) – Number of output scales.
start_level (int) – Index of the start input backbone level used to
build the feature pyramid. Defaults to 0.
end_level (int) – Index of the end input backbone level (exclusive) to
build the feature pyramid. Defaults to -1, which means the
last level.
add_extra_convs (bool | str) –
If bool, it decides whether to add conv
layers on top of the original feature maps. Defaults to False.
If True, it is equivalent to add_extra_convs=’on_input’.
If str, it specifies the source feature map of the extra convs.
Only the following options are allowed
’on_input’: Last feat map of neck inputs (i.e. backbone feature).
’on_lateral’: Last feature map after lateral convs.
’on_output’: The last output feature map after fpn convs.
relu_before_extra_convs (bool) – Whether to apply relu before the extra
conv. Defaults to False.
no_norm_on_lateral (bool) – Whether to apply norm on lateral.
Defaults to False.
conv_cfg (ConfigDict or dict, optional) – Config dict for
convolution layer. Defaults to None.
norm_cfg (ConfigDict or dict, optional) – Config dict for
normalization layer. Defaults to None.
act_cfg (ConfigDict or dict, optional) – Config dict for
activation layer in ConvModule. Defaults to None.
upsample_cfg (ConfigDict or dict, optional) – Config dict
for interpolate layer. Defaults to dict(mode=’nearest’).
init_cfg (ConfigDict or dict or list[ConfigDict or dict]) – Initialization config dict.
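As an illustration of how these parameters combine, the config fragment below sketches an FPN neck in the dict style used by the config files. The channel numbers are assumed values typical of a ResNet-50 backbone, and the arithmetic for the number of extra levels reflects one reading of the parameter descriptions rather than a statement of the implementation.

```python
# Hypothetical neck config; values are illustrative, not taken from a
# specific config file.
neck = dict(
    type='FPN',
    in_channels=[256, 512, 1024, 2048],  # one entry per backbone level
    out_channels=256,                    # same width at every output scale
    start_level=1,                       # skip the first backbone level
    end_level=-1,                        # use levels up to the last one
    add_extra_convs='on_output',         # extra convs read the last FPN output
    num_outs=5)

# Levels taken from the backbone, and levels that must come from extra convs.
num_backbone_levels = len(neck['in_channels']) - neck['start_level']
num_extra_levels = neck['num_outs'] - num_backbone_levels
```

Here three levels come from the backbone and the remaining two output scales are produced by the extra convolutions.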
FPN_CARAFE is a more flexible implementation of FPN. It allows more
choice for upsample methods during the top-down pathway.
It can reproduce the performance of the ICCV 2019 paper
CARAFE: Content-Aware ReAssembly of FEatures.
Please refer to https://arxiv.org/abs/1905.02188 for more details.
Parameters:
in_channels (list[int]) – Number of channels for each input feature map.
out_channels (int) – Output channels of feature pyramids.
num_outs (int) – Number of output stages.
start_level (int) – Start level of feature pyramids.
(Default: 0)
end_level (int) – End level of feature pyramids.
(Default: -1 indicates the last level).
norm_cfg (dict) – Dictionary to construct and config norm layer.
activate (str) – Type of activation function in ConvModule
(Default: None indicates w/o activation).
order (dict) – Order of components in ConvModule.
upsample (str) – Type of upsample layer.
upsample_cfg (dict) – Dictionary to construct and config upsample layer.
in_channels (List[int]) – Number of input channels per scale.
out_channels (int) – Number of output channels (used at each scale)
num_outs (int) – Number of output scales.
start_level (int) – Index of the start input backbone level used to
build the feature pyramid. Default: 0.
end_level (int) – Index of the end input backbone level (exclusive) to
build the feature pyramid. Default: -1, which means the last level.
add_extra_convs (bool) – It decides whether to add conv
layers on top of the original feature maps. Defaults to False.
If True, its actual mode is specified by extra_convs_on_inputs.
conv_cfg (dict) – dictionary to construct and config conv layer.
norm_cfg (dict) – dictionary to construct and config norm layer.
in_channels (List[int]) – Number of input channels per scale.
out_channels (int) – Number of output channels (used at each scale)
num_outs (int) – Number of output scales.
start_level (int) – Index of the start input backbone level used to
build the feature pyramid. Default: 0.
end_level (int) – Index of the end input backbone level (exclusive) to
build the feature pyramid. Default: -1, which means the last level.
add_extra_convs (bool | str) –
If bool, it decides whether to add conv
layers on top of the original feature maps. Defaults to False.
If True, it is equivalent to add_extra_convs=’on_input’.
If str, it specifies the source feature map of the extra convs.
Only the following options are allowed
’on_input’: Last feat map of neck inputs (i.e. backbone feature).
’on_lateral’: Last feature map after lateral convs.
’on_output’: The last output feature map after fpn convs.
relu_before_extra_convs (bool) – Whether to apply relu before the extra
conv. Default: False.
no_norm_on_lateral (bool) – Whether to apply norm on lateral.
Default: False.
conv_cfg (dict) – Config dict for convolution layer. Default: None.
norm_cfg (dict) – Config dict for normalization layer. Default: None.
act_cfg (str) – Config dict for activation layer in ConvModule.
Default: None.
init_cfg (dict or list[dict], optional) – Initialization config dict.
This is an implementation of RFP in DetectoRS. Unlike the standard FPN, the
input of RFP should be multi-level features along with the original input image
of the backbone.
Parameters:
rfp_steps (int) – Number of unrolled steps of RFP.
rfp_backbone (dict) – Configuration of the backbone for RFP.
aspp_out_channels (int) – Number of output channels of ASPP module.
aspp_dilations (tuple[int]) – Dilation rates of four branches.
Default: (1, 3, 6, 1)
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
It can be treated as a simplified version of FPN. It
will take the result from Darknet backbone and do some upsampling and
concatenation. It will finally output the detection result.
Note
The input feats should be from top to bottom.
i.e., from high-lvl to low-lvl
But YOLOV3Neck will process them in reversed order.
i.e., from bottom (high-lvl) to top (low-lvl)
Parameters:
num_scales (int) – The number of scales / stages.
in_channels (List[int]) – The number of input channels per scale.
out_channels (List[int]) – The number of output channels per scale.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
Calculate the ground truth for all samples in a batch according to
the sampling_results.
Almost the same as the implementation in bbox_head, we passed
additional parameters pos_inds_list and neg_inds_list to
_get_targets_single function.
Parameters:
sampling_results (List[SamplingResult]) – Assign results of
all images in a batch after sampling.
rcnn_train_cfg (ConfigDict) – train_cfg of RCNN.
concat (bool) – Whether to concatenate the results of all
the images in a single batch.
Returns:
Ground truth for proposals in a single image.
Containing the following list of Tensors:
labels (list[Tensor],Tensor): Gt_labels for all
proposals in a batch, each tensor in list has
shape (num_proposals,) when concat=False, otherwise
just a single tensor has shape (num_all_proposals,).
label_weights (list[Tensor]): Labels_weights for
all proposals in a batch, each tensor in list has
shape (num_proposals,) when concat=False, otherwise
just a single tensor has shape (num_all_proposals,).
bbox_targets (list[Tensor],Tensor): Regression target
for all proposals in a batch, each tensor in list
has shape (num_proposals, 4) when concat=False,
otherwise just a single tensor has shape
(num_all_proposals, 4), the last dimension 4 represents
[tl_x, tl_y, br_x, br_y].
bbox_weights (list[tensor],Tensor): Regression weights for
all proposals in a batch, each tensor in list has shape
(num_proposals, 4) when concat=False, otherwise just a
single tensor has shape (num_all_proposals, 4).
Calculate the loss based on the network predictions and targets.
Parameters:
cls_score (Tensor) – Classification prediction
results of all class, has shape
(batch_size * num_proposals_single_image, num_classes)
bbox_pred (Tensor) – Regression prediction results,
has shape
(batch_size * num_proposals_single_image, 4), the last
dimension 4 represents [tl_x, tl_y, br_x, br_y].
rois (Tensor) – RoIs with the shape
(batch_size * num_proposals_single_image, 5) where the first
column indicates batch id of each RoI.
labels (Tensor) – Gt_labels for all proposals in a batch, has
shape (batch_size * num_proposals_single_image, ).
label_weights (Tensor) – Labels_weights for all proposals in a
batch, has shape (batch_size * num_proposals_single_image, ).
bbox_targets (Tensor) – Regression target for all proposals in a
batch, has shape (batch_size * num_proposals_single_image, 4),
the last dimension 4 represents [tl_x, tl_y, br_x, br_y].
bbox_weights (Tensor) – Regression weights for all proposals in a
batch, has shape (batch_size * num_proposals_single_image, 4).
reduction_override (str, optional) – The reduction
method used to override the original reduction
method of the loss. Options are “none”,
“mean” and “sum”. Defaults to None.
Calculate the loss based on the features extracted by the bbox head.
Parameters:
cls_score (Tensor) – Classification prediction
results of all class, has shape
(batch_size * num_proposals_single_image, num_classes)
bbox_pred (Tensor) – Regression prediction results,
has shape
(batch_size * num_proposals_single_image, 4), the last
dimension 4 represents [tl_x, tl_y, br_x, br_y].
rois (Tensor) – RoIs with the shape
(batch_size * num_proposals_single_image, 5) where the first
column indicates batch id of each RoI.
sampling_results (List[SamplingResult]) – Assign results of
all images in a batch after sampling.
rcnn_train_cfg (ConfigDict) – train_cfg of RCNN.
concat (bool) – Whether to concatenate the results of all
the images in a single batch. Defaults to True.
reduction_override (str, optional) – The reduction
method used to override the original reduction
method of the loss. Options are “none”,
“mean” and “sum”. Defaults to None.
SamplingResult is the real sampling result
calculated from bbox_head, while InstanceData is a
fake sampling result, e.g., in Sparse R-CNN or QueryInst.
Parameters:
bbox_results (dict) –
Usually is a dictionary with keys:
cls_score (Tensor): Classification scores.
bbox_pred (Tensor): Box energies / deltas.
rois (Tensor): RoIs with the shape (n, 5) where the first
column indicates batch id of each RoI.
bbox_targets (tuple): Ground truth for proposals in a
single image. Containing the following list of Tensors:
(labels, label_weights, bbox_targets, bbox_weights)
batch_img_metas (List[dict]) – List of image information.
Returns:
Refined bboxes of each image.
Return type:
list[InstanceData]
Example
>>> # xdoctest: +REQUIRES(module:kwarray)
>>> import numpy as np
>>> import torch
>>> from mmdet.models.task_modules.samplers.sampling_result import random_boxes
>>> from mmdet.models.task_modules.samplers import SamplingResult
>>> self = BBoxHead(reg_class_agnostic=True)
>>> n_roi = 2
>>> n_img = 4
>>> scale = 512
>>> rng = np.random.RandomState(0)
>>> batch_img_metas = [{'img_shape': (scale, scale)}
...                    for _ in range(n_img)]
>>> sampling_results = [SamplingResult.random(rng=10)
...                     for _ in range(n_img)]
>>> # Create rois in the expected format
>>> roi_boxes = random_boxes(n_roi, scale=scale, rng=rng)
>>> img_ids = torch.randint(0, n_img, (n_roi,))
>>> img_ids = img_ids.float()
>>> rois = torch.cat([img_ids[:, None], roi_boxes], dim=1)
>>> # Create other args
>>> labels = torch.randint(0, 81, (scale,)).long()
>>> bbox_preds = random_boxes(n_roi, scale=scale, rng=rng)
>>> cls_score = torch.randn((scale, 81))
>>> # For each image, pretend random positive boxes are gts
>>> bbox_targets = (labels, None, None, None)
>>> bbox_results = dict(rois=rois, bbox_pred=bbox_preds,
...                     cls_score=cls_score,
...                     bbox_targets=bbox_targets)
>>> bboxes_list = self.refine_bboxes(sampling_results,
...                                  bbox_results,
...                                  batch_img_metas)
>>> print(bboxes_list)
Regress the bbox for the predicted class. Used in Cascade R-CNN.
Parameters:
priors (Tensor) – Priors from rpn_head or last stage
bbox_head, has shape (num_proposals, 4).
label (Tensor) – Only used when self.reg_class_agnostic
is False, has shape (num_proposals, ).
bbox_pred (Tensor) – Regression prediction of
current stage bbox_head. When self.reg_class_agnostic
is False, it has shape (n, num_classes * 4), otherwise
it has shape (n, 4).
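When the regression is not class-agnostic, each row of bbox_pred holds four deltas per class, and only the four belonging to the given label are needed. The sketch below shows one way to select them with `torch.gather`; the helper name `select_class_deltas` is hypothetical, and the subsequent decoding of deltas into boxes is omitted.

```python
import torch


def select_class_deltas(bbox_pred, label, reg_dim=4):
    """Pick the reg_dim deltas of each row's labelled class.

    bbox_pred: (n, num_classes * reg_dim)
    label:     (n,) integer class index per row
    returns:   (n, reg_dim)
    """
    # Column indices label * reg_dim + [0, 1, ..., reg_dim - 1] for every row.
    inds = torch.stack(
        [label * reg_dim + i for i in range(reg_dim)], dim=1)
    return torch.gather(bbox_pred, 1, inds)


# Two proposals, three classes: columns 0-3 are class 0, 4-7 class 1, etc.
bbox_pred = torch.arange(24, dtype=torch.float32).view(2, 12)
label = torch.tensor([1, 2])
selected = select_class_deltas(bbox_pred, label)
```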
Build RoI operator to extract feature from each level feature map.
Parameters:
layer_cfg (ConfigDict or dict) – Dictionary to construct and
config RoI layer operation. Options are modules under
mmcv/ops such as RoIAlign.
featmap_strides (list[int]) – The stride of each input feature map w.r.t.
the original image size, which would be used to scale RoI
coordinate (original image coordinate system) to feature
coordinate system.
loss_bbox (dict): A dictionary of bbox loss components.
rois (Tensor): RoIs with the shape (n, 5) where the first
column indicates batch id of each RoI.
bbox_targets (tuple): Ground truth for proposals in a
single image. Containing the following list of Tensors:
(labels, label_weights, bbox_targets, bbox_weights)
Perform forward propagation and loss calculation of the detection
roi on the features of the upstream network.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
rpn_results_list (list[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Calculate the ground truth for all samples in a batch according to
the sampling_results.
Almost the same as the implementation in bbox_head, we passed
additional parameters pos_inds_list and neg_inds_list to
_get_targets_single function.
Parameters:
sampling_results (List[SamplingResult]) – Assign results of
all images in a batch after sampling.
rcnn_train_cfg (ConfigDict) – train_cfg of RCNN.
concat (bool) – Whether to concatenate the results of all
the images in a single batch.
Returns:
Ground truth for proposals in a single image.
Containing the following list of Tensors:
labels (list[Tensor],Tensor): Gt_labels for all
proposals in a batch, each tensor in list has
shape (num_proposals,) when concat=False, otherwise just
a single tensor has shape (num_all_proposals,).
label_weights (list[Tensor]): Labels_weights for
all proposals in a batch, each tensor in list has shape
(num_proposals,) when concat=False, otherwise just a
single tensor has shape (num_all_proposals,).
bbox_targets (list[Tensor],Tensor): Regression target
for all proposals in a batch, each tensor in list has
shape (num_proposals, 4) when concat=False, otherwise
just a single tensor has shape (num_all_proposals, 4),
the last dimension 4 represents [tl_x, tl_y, br_x, br_y].
bbox_weights (list[tensor],Tensor): Regression weights for
all proposals in a batch, each tensor in list has shape
(num_proposals, 4) when concat=False, otherwise just a
single tensor has shape (num_all_proposals, 4).
Calculate the loss based on the features extracted by the DIIHead.
Parameters:
cls_score (Tensor) – Classification prediction
results of all class, has shape
(batch_size * num_proposals_single_image, num_classes)
bbox_pred (Tensor) – Regression prediction results, has shape
(batch_size * num_proposals_single_image, 4), the last
dimension 4 represents [tl_x, tl_y, br_x, br_y].
sampling_results (List[SamplingResult]) – Assign results of
all images in a batch after sampling.
rcnn_train_cfg (ConfigDict) – train_cfg of RCNN.
imgs_whwh (Tensor) – Tensor with shape (batch_size, num_proposals, 4),
the last dimension means
[img_width, img_height, img_width, img_height].
concat (bool) – Whether to concatenate the results of all
the images in a single batch. Defaults to True.
reduction_override (str, optional) – The reduction
method used to override the original reduction
method of the loss. Options are “none”,
“mean” and “sum”. Defaults to None.
Returns:
A dictionary of loss and targets components.
The targets are only used for cascade rcnn.
x (tuple[Tensor]) – List of multi-level img features.
rpn_results_list (list[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Perform forward propagation and loss calculation of the detection
roi on the features of the upstream network.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
rpn_results_list (list[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
stage_loss_weights (list[float]) – Loss weight for every stage.
semantic_roi_extractor (ConfigDict or dict, optional) – Config of semantic roi extractor. Defaults to None.
semantic_head (ConfigDict or dict, optional) – Config of semantic head. Defaults to None.
interleaved (bool) – Whether to interleave the box branch and mask
branch. If True, the mask branch can take the refined bounding
box predictions. Defaults to True.
mask_info_flow (bool) – Whether to turn on the mask information flow,
which feeds the mask features of the preceding stage
to the current stage. Defaults to True.
semantic_feat (Tensor, optional) – Semantic feature. Defaults to
None.
Returns:
Usually returns a dictionary with keys:
cls_score (Tensor): Classification scores.
bbox_pred (Tensor): Box energies / deltas.
bbox_feats (Tensor): Extract bbox RoI features.
loss_bbox (dict): A dictionary of bbox loss components.
rois (Tensor): RoIs with the shape (n, 5) where the first
column indicates batch id of each RoI.
bbox_targets (tuple): Ground truth for proposals in a
single image. Containing the following list of Tensors:
(labels, label_weights, bbox_targets, bbox_weights)
Perform forward propagation and loss calculation of the detection
roi on the features of the upstream network.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
rpn_results_list (list[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Mask IoU target is the IoU of the predicted mask (inside a bbox) and
the corresponding gt mask (the whole instance).
The intersection area is computed inside the bbox, and the gt mask area
is computed in two steps: first we compute the gt area inside the
bbox, then divide it by the ratio of the gt area inside the bbox to
the gt area of the whole instance.
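The two-step computation can be sketched with plain areas. The function name `mask_iou_target_sketch` and its arguments are hypothetical; in practice the areas would come from summing binary masks.

```python
def mask_iou_target_sketch(pred_area, gt_area_in_box, overlap, area_ratio):
    """Mask IoU target from areas measured inside the bbox.

    pred_area:      area of the predicted mask (inside the bbox)
    gt_area_in_box: area of the gt mask that falls inside the bbox
    overlap:        intersection of the two masks inside the bbox
    area_ratio:     gt_area_in_box / gt area of the whole instance
    """
    # Step 2 of the description: recover the whole-instance gt area.
    gt_area_full = gt_area_in_box / area_ratio
    union = pred_area + gt_area_full - overlap
    return overlap / union


# Half of the gt instance lies inside the bbox, so its full area is 60.
iou_target = mask_iou_target_sketch(
    pred_area=40, gt_area_in_box=30, overlap=20, area_ratio=0.5)
# union = 40 + 60 - 20 = 80, so the target is 20 / 80 = 0.25
```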
MaskPointHead uses a shared multi-layer perceptron (equivalent to
nn.Conv1d) to predict the logit of input points. The fine-grained feature
and coarse feature will be concatenated together for prediction.
Parameters:
num_fcs (int) – Number of fc layers in the head. Defaults to 3.
in_channels (int) – Number of input channels. Defaults to 256.
fc_channels (int) – Number of fc channels. Defaults to 256.
num_classes (int) – Number of classes for logits. Defaults to 80.
class_agnostic (bool) – Whether to use class-agnostic classification.
If so, the output channels of logits will be 1. Defaults to False.
coarse_pred_each_layer (bool) – Whether to concatenate the coarse feature with
the output of each fc layer. Defaults to True.
conv_cfg (ConfigDict or dict) – Dictionary to construct
and config conv layer. Defaults to dict(type=’Conv1d’).
norm_cfg (ConfigDict or dict, optional) – Dictionary to construct
and config norm layer. Defaults to None.
loss_point (ConfigDict or dict) – Dictionary to construct and
config loss layer of point head. Defaults to
dict(type=’CrossEntropyLoss’, use_mask=True, loss_weight=1.0).
init_cfg (ConfigDict or dict or list[ConfigDict or dict], optional) – Initialization config dict.
mask_preds (Tensor) – A tensor of shape (num_rois, num_classes,
mask_height, mask_width) for class-specific or class-agnostic
prediction.
label_preds (Tensor) – The predicted class for each instance.
cfg (ConfigDict or dict) – Testing config of point head.
Returns:
point_indices (Tensor): A tensor of shape (num_rois, num_points)
that contains indices from [0, mask_height x mask_width) of the
most uncertain points.
point_coords (Tensor): A tensor of shape (num_rois, num_points,
2) that contains [0, 1] x [0, 1] normalized coordinates of the
most uncertain points from the [mask_height, mask_width] grid.
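A minimal sketch of how the returned indices and normalized coordinates relate, assuming the mask logits have already been reduced to a single channel per RoI and that uncertainty is measured as the negative absolute logit (an assumption of this sketch). The helper name `most_uncertain_points` is hypothetical.

```python
import torch


def most_uncertain_points(mask_logits, num_points):
    """Select the points whose logits are closest to zero.

    mask_logits: (num_rois, 1, mask_height, mask_width)
    returns:     point_indices (num_rois, num_points),
                 point_coords  (num_rois, num_points, 2) as (x, y) in [0, 1]
    """
    num_rois, _, height, width = mask_logits.shape
    # Logits near zero are the least confident, hence the most uncertain.
    uncertainty = -mask_logits.abs().view(num_rois, height * width)
    num_points = min(height * width, num_points)
    point_indices = uncertainty.topk(num_points, dim=1)[1]

    # Convert flat indices to the centers of the grid cells, normalized.
    cols = point_indices % width
    rows = torch.div(point_indices, width, rounding_mode='floor')
    xs = (cols.float() + 0.5) / width
    ys = (rows.float() + 0.5) / height
    point_coords = torch.stack([xs, ys], dim=2)
    return point_indices, point_coords


logits = torch.tensor([[[[5.0, -0.1],
                         [3.0, 4.0]]]])      # one RoI, 2 x 2 mask
indices, coords = most_uncertain_points(logits, num_points=2)
```

In this example the two least confident cells are the top-right one (logit -0.1) and the bottom-left one (logit 3.0).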
Get num_points most uncertain points with random points during
train.
Sample points in [0, 1] x [0, 1] coordinate space based on their
uncertainty. The uncertainties are calculated for each point using
‘_get_uncertainty()’ function that takes point’s logit prediction as
input.
Parameters:
mask_preds (Tensor) – A tensor of shape (num_rois, num_classes,
mask_height, mask_width) for class-specific or class-agnostic
prediction.
labels (Tensor) – The ground truth class for each instance.
cfg (ConfigDict or dict) – Training config of point head.
Returns:
A tensor of shape (num_rois, num_points, 2)
that contains the coordinates of the sampled points.
Perform forward propagation and loss calculation of the detection
roi on the features of the upstream network.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
rpn_results_list (list[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Perform forward propagation and loss calculation of the detection
roi on the features of the upstream network.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
rpn_results_list (list[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Although the recipe for forward pass needs to be defined within
this function, one should call the Module instance afterwards
instead of this since the former takes care of running the
registered hooks while the latter silently ignores them.
This has an effect only on certain modules. See the documentation of
particular modules for details of their behaviors in training/evaluation
mode, i.e., whether they are affected, e.g. Dropout, BatchNorm,
etc.
Parameters:
mode (bool) – whether to set training mode (True) or evaluation
mode (False). Default: True.
Side-Aware Boundary Localization (SABL) for RoI-Head.
Side-Aware features are extracted by conv layers
with an attention mechanism.
Boundary Localization with Bucketing and Bucketing Guided Rescoring
are implemented in BucketingBBoxCoder.
Calculate the loss based on the network predictions and targets.
Parameters:
cls_score (Tensor) – Classification prediction
results of all class, has shape
(batch_size * num_proposals_single_image, num_classes)
bbox_pred (Tensor) – A tuple of regression prediction results
containing bucket_cls_preds and bucket_offset_preds.
rois (Tensor) – RoIs with the shape
(batch_size * num_proposals_single_image, 5) where the first
column indicates batch id of each RoI.
labels (Tensor) – Gt_labels for all proposals in a batch, has
shape (batch_size * num_proposals_single_image, ).
label_weights (Tensor) – Labels_weights for all proposals in a
batch, has shape (batch_size * num_proposals_single_image, ).
bbox_targets (Tuple[Tensor, Tensor]) – A tuple of regression target
containing bucket_cls_targets and bucket_offset_targets.
the last dimension 4 represents [tl_x, tl_y, br_x, br_y].
bbox_weights (Tuple[Tensor, Tensor]) – A tuple of regression
weights containing bucket_cls_weights and
bucket_offset_weights.
reduction_override (str, optional) – The reduction
method used to override the original reduction
method of the loss. Options are “none”,
“mean” and “sum”. Defaults to None.
rois (Tensor): RoIs with the shape (n, 5) where the first
column indicates batch id of each RoI.
bbox_targets (tuple): Ground truth for proposals in a
single image. Containing the following list of Tensors:
(labels, label_weights, bbox_targets, bbox_weights)
batch_img_metas (List[dict]) – List of image information.
semantic_feat (Tensor) – Semantic feature. Defaults to None.
glbctx_feat (Tensor) – Global context feature. Defaults to None.
Returns:
Usually returns a dictionary with keys:
cls_score (Tensor): Classification scores.
bbox_pred (Tensor): Box energies / deltas.
bbox_feats (Tensor): Extract bbox RoI features.
loss_bbox (dict): A dictionary of bbox loss components.
rois (Tensor): RoIs with the shape (n, 5) where the first
column indicates batch id of each RoI.
bbox_targets (tuple): Ground truth for proposals in a
single image. Containing the following list of Tensors:
(labels, label_weights, bbox_targets, bbox_weights)
Perform forward propagation and loss calculation of the detection
roi on the features of the upstream network.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
rpn_results_list (list[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
x (Tuple[Tensor]) – Tuple of multi-level img features.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Network forward process. Usually includes backbone, neck and head
forward without any post-processing.
Parameters:
x (List[Tensor]) – Multi-level features that may have different
resolutions.
rpn_results_list (List[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Returns:
tuple: A tuple of features from bbox_head and mask_head
forward.
Perform forward propagation and loss calculation of the detection
roi on the features of the upstream network.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
rpn_results_list (List[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
Perform forward propagation and loss calculation of the detection
roi on the features of the upstream network.
Parameters:
x (tuple[Tensor]) – List of multi-level img features.
rpn_results_list (list[InstanceData]) – List of region
proposals.
batch_data_samples (list[DetDataSample]) – The batch
data samples. It usually includes information such
as gt_instance or gt_panoptic_seg or gt_sem_seg.
num_branch (int) – Number of branches in TridentNet.
test_branch_idx (int) – In inference, all 3 branches will be used
if test_branch_idx==-1, otherwise only branch with index
test_branch_idx will be used.
data_samples_list (List[List[DetDataSample]]) – List of predictions
of all enhanced data. The outer list indicates images, and the
inner list corresponds to the different views of one image.
Each element of the inner list is a DetDataSample.
aug_proposals (list[Tensor]) – proposals from different testing
schemes, shape (n, 5). Note that they are not rescaled to the
original image size.
img_metas (list[dict]) – list of image info dict where each dict has:
‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain
‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’.
For details on the values of these keys see
mmdet/datasets/pipelines/formatting.py:Collect.
cfg (dict) – rpn test config.
Returns:
shape (n, 4), proposals corresponding to original image scale.
Merge augmented detection results. Currently, only bboxes and their
corresponding scores under flipping and multi-scale resizing can be processed.
Parameters:
aug_batch_results (list[list[InstanceData]]) –
Detection results of multiple images with
different augmentations.
The outer list indicates the augmentations. The inner
list indicates the batch dimension.
Each item usually contains the following keys.
scores (Tensor): Classification scores, in shape
(num_instances,).
labels (Tensor): Labels of bboxes, in shape
(num_instances,).
bboxes (Tensor): In shape (num_instances, 4),
the last dimension 4 arrange as (x1, y1, x2, y2).
aug_batch_img_metas (list[list[dict]]) – The outer list
indicates test-time augs (multiscale, flip, etc.)
and the inner list indicates
images in a batch. Each dict in the list contains
information of an image in the batch.
Returns:
list[InstanceData]: Same as
the input aug_results except that all bboxes have
been mapped to the original scale.
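As a rough illustration of what mapping a box back to the original scale involves, the sketch below undoes a horizontal flip and then a resize for a single box. The helper name map_box_back and its arguments are hypothetical, and plain Python lists stand in for tensors; it is not MMDetection's implementation.

```python
def map_box_back(box, img_width, scale_factor, flipped):
    """Map one (x1, y1, x2, y2) box from augmented to original coordinates.

    Hypothetical helper: `img_width` is the width of the augmented image,
    `scale_factor` is (sx, sy), and `flipped` marks a horizontal flip.
    """
    x1, y1, x2, y2 = box
    if flipped:
        # Undo the horizontal flip in the augmented image frame.
        x1, x2 = img_width - x2, img_width - x1
    sx, sy = scale_factor
    # Undo the resize.
    return [x1 / sx, y1 / sy, x2 / sx, y2 / sy]


# A box detected on a 2x upscaled, horizontally flipped view.
restored = map_box_back([140.0, 40.0, 180.0, 80.0], img_width=200,
                        scale_factor=(2.0, 2.0), flipped=True)
```

Once every augmented view is expressed in the original frame, the per-view results can be concatenated and filtered together.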
Note: If instance_results is not None, it will be modified
in place internally and then returned.
Parameters:
batch_img_metas (list[dict]) – List of image information.
device (torch.device) – Device of tensor.
task_type (str) – Expected returned task type. It currently
supports bbox and mask.
instance_results (list[InstanceData]) – List of instance
results.
mask_thr_binary (int, float) – mask binarization threshold.
Defaults to 0.
box_type (str or type) – The empty box type. Defaults to hbox.
use_box_type (bool) – Whether to wrap boxes with the box type.
Defaults to False.
num_classes (int) – num_classes of bbox_head. Defaults to 80.
score_per_cls (bool) – Whether to generate classwise score for
the empty instance. score_per_cls will be True when the model
needs to produce raw results without nms. Defaults to False.
Given min_overlap, the radius can be computed by solving a quadratic equation
according to Vieta’s formulas.
There are 3 cases for computing the gaussian radius, detailed as follows:
Explanation of figure: lt and br indicate the left-top and
bottom-right corners of the ground truth box. x indicates the
generated corner at the limited position when radius=r.
Case1: one corner is inside the gt box and the other is outside.
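A minimal sketch of Case 1, assuming the shifted box keeps the ground truth size (w, h) and both of its corners are displaced by r, so that IoU = (w - r)(h - r) / (w*h + (w + h)*r - r^2). Requiring IoU >= min_overlap gives r^2 - (w + h)*r + w*h*(1 - min_overlap)/(1 + min_overlap) >= 0, and the smaller root is the limiting radius. The function names are hypothetical, and the other two cases are not covered here.

```python
import math


def gaussian_radius_case1(height, width, min_overlap):
    """Limiting radius for Case 1 under the assumption stated above."""
    a = 1.0
    b = -(height + width)
    c = width * height * (1 - min_overlap) / (1 + min_overlap)
    # Smaller root of a*r**2 + b*r + c = 0.
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)


def iou_case1(height, width, r):
    """IoU between the gt box and the same-size box shifted by r."""
    inter = (width - r) * (height - r)
    union = width * height + (width + height) * r - r * r
    return inter / union


r = gaussian_radius_case1(height=30.0, width=40.0, min_overlap=0.7)
```

Plugging r back into iou_case1 recovers min_overlap, which is a quick way to check the algebra.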
Get num_points most uncertain points with random points during
train.
Sample points in [0, 1] x [0, 1] coordinate space based on their
uncertainty. The uncertainties are calculated for each point using
‘get_uncertainty()’ function that takes point’s logit prediction as
input.
Parameters:
mask_preds (Tensor) – A tensor of shape (num_rois, num_classes,
mask_height, mask_width) for class-specific or class-agnostic
prediction.
labels (Tensor) – The ground truth class for each instance.
num_points (int) – The number of points to sample.
[feature_level0, feature_level1…] -> [feature_image0, feature_image1…]
Convert the shape of each element in mlvl_tensor from (N, C, H, W) to
(N, H*W, C), then split the element into N elements with shape (H*W, C), and
concatenate the elements of the same image across all levels along the first dimension.
Parameters:
mlvl_tensor (list[Tensor]) – List of tensors collected from the
corresponding levels. Each element is of shape (N, C, H, W).
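The reshaping described above can be sketched without any tensor library. In this illustration nested Python lists shaped (N, C, H, W) stand in for tensors, and the function name is hypothetical.

```python
def levels_to_images_sketch(mlvl_tensor):
    """Regroup per-level features into per-image features.

    mlvl_tensor: list of nested lists, each shaped (N, C, H, W).
    Returns a list of N items, each shaped (sum of H*W over levels, C).
    """
    num_imgs = len(mlvl_tensor[0])
    per_image = [[] for _ in range(num_imgs)]
    for level in mlvl_tensor:
        for img_id, feat in enumerate(level):  # feat is (C, H, W)
            num_ch, h, w = len(feat), len(feat[0]), len(feat[0][0])
            # (C, H, W) -> (H*W, C), appended along the first dimension.
            for y in range(h):
                for x in range(w):
                    per_image[img_id].append(
                        [feat[c][y][x] for c in range(num_ch)])
    return per_image


# Two levels, N=1, C=2; level 0 is 1x2 and level 1 is 1x1.
level0 = [[[[1, 2]], [[3, 4]]]]
level1 = [[[[5]], [[6]]]]
images = levels_to_images_sketch([level0, level1])
```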
This function applies func to multiple inputs and
maps the multiple outputs of func into different
lists. Each list contains the same type of outputs corresponding
to different inputs.
Parameters:
func (Function) – A function that will be applied to a list of
arguments
Returns:
A tuple containing multiple lists; each list contains one kind of result returned by the function.
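A minimal sketch consistent with this description, using only the standard library. The name multi_apply_sketch and the toy function sum_and_diff are illustrative assumptions.

```python
from functools import partial


def multi_apply_sketch(func, *args, **kwargs):
    """Apply func to each group of inputs and regroup outputs by kind."""
    pfunc = partial(func, **kwargs) if kwargs else func
    map_results = map(pfunc, *args)
    # zip(*...) transposes [(a0, b0), (a1, b1)] into ([a0, a1], [b0, b1]).
    return tuple(map(list, zip(*map_results)))


def sum_and_diff(a, b, scale=1):
    return (a + b) * scale, (a - b) * scale


sums, diffs = multi_apply_sketch(sum_and_diff, [1, 2, 3], [4, 5, 6], scale=2)
```

Each positional list supplies one argument per call, while keyword arguments are shared by all calls.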
gt_labels (Tensor) – Ground truth labels of each bbox,
with shape (num_gts, ).
gt_masks (BitmapMasks) – Ground truth masks of each instance
of an image, shape (num_gts, h, w).
gt_semantic_seg (Tensor | None) – Ground truth of semantic
segmentation with the shape (1, h, w).
[0, num_thing_class - 1] means things,
[num_thing_class, num_class-1] means stuff,
255 means VOID. It’s None when training instance segmentation.
Returns:
a tuple containing the following targets.
labels (Tensor): Ground truth class indices for an
image, with shape (n, ), n is the sum of number
of stuff type and number of instance in an image.
masks (Tensor): Ground truth mask for an image, with
shape (n, h, w). Contains stuff and things when training
panoptic segmentation, and things only when training
instance segmentation.
Extract a multi-scale single image tensor from a multi-scale batch
tensor based on batch index.
Note: The default value of detach is True, because the proposal gradient
needs to be detached during the training of two-stage models, e.g.,
Cascade Mask R-CNN.
Parameters:
mlvl_tensors (list[Tensor]) – Batch tensor for all scale levels,
each is a 4D-tensor.
Weighted boxes fusion <https://arxiv.org/abs/1910.13302> is a method for
fusing predictions from different object detection models, which utilizes
confidence scores of all proposed bounding boxes to construct averaged
boxes.
Parameters:
bboxes_list (list) – list of boxes predictions from each model,
each box is 4 numbers.
scores_list (list) – list of scores for each model
labels_list (list) – list of labels for each model
weights – list of weights for each model.
Default: None, which means weight == 1 for each model
iou_thr – IoU value for boxes to be a match
skip_box_thr – exclude boxes with score lower than this variable.
conf_type –
how to calculate confidence in weighted boxes.
‘avg’: average value,
‘max’: maximum value,
‘box_and_model_avg’: box and model wise hybrid weighted average,
‘absent_model_aware_avg’: weighted average that takes into
account the absent model.
allows_overflow – False if the confidence score should not exceed 1.0.
A data structure interface of tracking task in MMDetection. It is used
as interfaces between different components.
This data structure can be viewed as a wrapper of multiple DetDataSample to
some extent. Specifically, it contains only one property:
video_data_samples, which is a list of DetDataSample, each of which
corresponds to a single frame. To get a property of a single
frame, first get the corresponding DetDataSample by indexing
and then get the property of the frame, such as gt_instances,
pred_instances and so on. As for metainfo, it differs from
DetDataSample in that each value corresponding to a metainfo key is a
list where each element corresponds to information of a single frame.
Examples
>>> import torch
>>> from mmengine.structures import InstanceData
>>> from mmdet.structures import DetDataSample, TrackDataSample
>>> track_data_sample = TrackDataSample()
>>> # set the 1st frame
>>> frame1_data_sample = DetDataSample(metainfo=dict(
...     img_shape=(100, 100), frame_id=0))
>>> frame1_gt_instances = InstanceData()
>>> frame1_gt_instances.bbox = torch.zeros([2, 4])
>>> frame1_data_sample.gt_instances = frame1_gt_instances
>>> # set the 2nd frame
>>> frame2_data_sample = DetDataSample(metainfo=dict(
...     img_shape=(100, 100), frame_id=1))
>>> frame2_gt_instances = InstanceData()
>>> frame2_gt_instances.bbox = torch.ones([3, 4])
>>> frame2_data_sample.gt_instances = frame2_gt_instances
>>> track_data_sample.video_data_samples = [frame1_data_sample,
...                                         frame2_data_sample]
>>> # set metainfo for track_data_sample
>>> track_data_sample.set_metainfo(dict(key_frames_inds=[0]))
>>> track_data_sample.set_metainfo(dict(ref_frames_inds=[1]))
>>> print(track_data_sample)
<TrackDataSample(
META INFORMATION
key_frames_inds: [0]
ref_frames_inds: [1]
In __init__, BaseBoxes verifies the validity of the data shape
w.r.t. box_dim. A tensor with dimension >= 2 and with the length
of the last dimension equal to box_dim is regarded as valid.
BaseBoxes stores it in the field tensor. It is necessary
to override box_dim in subclasses to guarantee the data shape is
correct.
There are many basic tensor-like functions implemented in BaseBoxes.
In most cases, users can operate on a BaseBoxes instance like a normal
tensor. To protect the validity of the data shape, all tensor-like functions
cannot modify the last dimension of self.tensor.
When creating a new box type, users need to inherit from BaseBoxes
and override abstract methods and specify the box_dim. Then, register
the new box type by using the decorator register_box_type.
Parameters:
data (Tensor or np.ndarray or Sequence) – The box data with shape
(…, box_dim).
dtype (torch.dtype, Optional) – data type of boxes. Defaults to None.
device (str or torch.device, Optional) – device of boxes.
Defaults to None.
clone (bool) – Whether clone boxes or not. Defaults to True.
Find inside box points. Boxes dimension must be 2.
Parameters:
points (Tensor) – Points coordinates. Has shape of (m, 2).
is_aligned (bool) – Whether points have been aligned with boxes
or not. If True, the length of boxes and points should be
the same. Defaults to False.
Returns:
A BoolTensor indicating whether a point is inside
boxes. Assume the boxes have shape (n, box_dim). If
is_aligned is False, the index has shape (m, n). If
is_aligned is True, m should be equal to n and the index has
shape (m, ).
Both rescale_ and resize_ will enlarge or shrink boxes
w.r.t. scale_factor. The difference is that resize_ only
changes the width and the height of boxes, but rescale_ also
rescales the box centers simultaneously.
Parameters:
scale_factor (Tuple[float, float]) – factors for scaling boxes.
The length should be 2.
Resize the box width and height w.r.t scale_factor in-place.
Note
Both rescale_ and resize_ will enlarge or shrink boxes
w.r.t. scale_factor. The difference is that resize_ only
changes the width and the height of boxes, but rescale_ also
rescales the box centers simultaneously.
Parameters:
scale_factor (Tuple[float, float]) – factors for scaling box
shapes. The length should be 2.
The horizontal box class used in MMDetection by default.
The box_dim of HorizontalBoxes is 4, which means the length of
the last dimension of the data should be 4. Two modes of box data are
supported in HorizontalBoxes:
‘xyxy’: Each row of data indicates (x1, y1, x2, y2), which are the
coordinates of the left-top and right-bottom points.
‘cxcywh’: Each row of data indicates (x, y, w, h), where (x, y) are the
coordinates of the box centers and (w, h) are the width and height.
HorizontalBoxes only stores data in ‘xyxy’ mode. If the data is
in ‘cxcywh’ mode, users need to pass in_mode='cxcywh' and the code
will convert the ‘cxcywh’ data to ‘xyxy’ automatically.
Parameters:
data (Tensor or np.ndarray or Sequence) – The box data with shape of
(…, 4).
dtype (torch.dtype, Optional) – data type of boxes. Defaults to None.
device (str or torch.device, Optional) – device of boxes.
Defaults to None.
clone (bool) – Whether clone boxes or not. Defaults to True.
mode (str, Optional) – the mode of boxes. If it is ‘cxcywh’, the
data will be converted to ‘xyxy’ mode. Defaults to None.
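The conversion between the two modes is simple arithmetic. The sketch below shows it for a single box held in a plain list; the function name is hypothetical and this is not the class's own code.

```python
def cxcywh_to_xyxy(box):
    """Convert one (cx, cy, w, h) box to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]


converted = cxcywh_to_xyxy([10.0, 10.0, 4.0, 6.0])
```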
Find inside box points. Boxes dimension must be 2.
Parameters:
points (Tensor) – Points coordinates. Has shape of (m, 2).
is_aligned (bool) – Whether points have been aligned with boxes
or not. If True, the length of boxes and points should be
the same. Defaults to False.
Returns:
A BoolTensor indicating whether a point is inside
boxes. Assume the boxes have shape (n, 4). If is_aligned
is False, the index has shape (m, n). If is_aligned is
True, m should be equal to n and the index has shape (m, ).
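To make the two output shapes concrete, here is a small sketch for ‘xyxy’ boxes using plain lists. The function name is hypothetical, and treating points on the box border as inside is an assumption of this sketch.

```python
def find_inside_points_sketch(boxes, points, is_aligned=False):
    """Return booleans telling whether each point lies inside each box."""

    def inside(point, box):
        x, y = point
        x1, y1, x2, y2 = box
        # Assumption: the border counts as inside.
        return x1 <= x <= x2 and y1 <= y <= y2

    if is_aligned:
        # One point per box: result has shape (m, ).
        assert len(boxes) == len(points)
        return [inside(p, b) for p, b in zip(points, boxes)]
    # Every point against every box: result has shape (m, n).
    return [[inside(p, b) for b in boxes] for p in points]


boxes = [[0, 0, 10, 10], [5, 5, 15, 15]]
points = [[1, 1], [12, 12]]
pairwise = find_inside_points_sketch(boxes, points)
aligned = find_inside_points_sketch(boxes, points, is_aligned=True)
```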
Both rescale_ and resize_ will enlarge or shrink boxes
w.r.t. scale_factor. The difference is that resize_ only
changes the width and the height of boxes, but rescale_ also
rescales the box centers simultaneously.
Parameters:
scale_factor (Tuple[float, float]) – factors for scaling boxes.
The length should be 2.
Resize the box width and height w.r.t scale_factor in-place.
Note
Both rescale_ and resize_ will enlarge or shrink boxes
w.r.t. scale_factor. The difference is that resize_ only
changes the width and the height of boxes, but rescale_ also
rescales the box centers simultaneously.
Parameters:
scale_factor (Tuple[float, float]) – factors for scaling box
shapes. The length should be 2.
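The difference between the two operations is easiest to see on one ‘xyxy’ box. The sketch below uses hypothetical helper names and returns new lists, whereas the documented methods work in place.

```python
def rescale_box(box, scale_factor):
    """Scale every coordinate, so the box center moves as well."""
    sx, sy = scale_factor
    x1, y1, x2, y2 = box
    return [x1 * sx, y1 * sy, x2 * sx, y2 * sy]


def resize_box(box, scale_factor):
    """Scale only width and height, keeping the box center fixed."""
    sx, sy = scale_factor
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * sx / 2, (y2 - y1) * sy / 2
    return [cx - half_w, cy - half_h, cx + half_w, cy + half_h]


box = [10.0, 10.0, 20.0, 30.0]
rescaled = rescale_box(box, (2.0, 2.0))
resized = resize_box(box, (2.0, 2.0))
```

Both results are twice as wide and tall as the input, but only the resized box is still centered at (15, 20).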
bbox_list (List[Union[Tensor, BaseBoxes]]) – a list of bboxes
corresponding to a batch of images.
Returns:
shape (n, box_dim + 1), where box_dim depends on the
different box types. For example, if the box type in bbox_list
is HorizontalBoxes, the output shape is (n, 5). Each row of data
indicates [batch_ind, x1, y1, x2, y2].
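A minimal sketch of the output layout for horizontal boxes, with plain lists standing in for tensors and a hypothetical function name:

```python
def bbox2roi_sketch(bbox_list):
    """Prefix every box with the index of the image it belongs to."""
    rois = []
    for batch_ind, bboxes in enumerate(bbox_list):
        for box in bboxes:
            rois.append([float(batch_ind), *box])
    return rois


rois = bbox2roi_sketch([
    [[0.0, 0.0, 4.0, 4.0]],                        # image 0: one box
    [[1.0, 1.0, 2.0, 2.0], [3.0, 3.0, 5.0, 5.0]],  # image 1: two boxes
])
```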
If is_aligned is False, then calculate the overlaps between each
bbox of bboxes1 and bboxes2, otherwise the overlaps between each aligned
pair of bboxes1 and bboxes2.
Parameters:
bboxes1 (Tensor) – shape (B, m, 4) in <x1, y1, x2, y2> format or empty.
bboxes2 (Tensor) – shape (B, n, 4) in <x1, y1, x2, y2> format or empty.
B indicates the batch dim, in shape (B1, B2, …, Bn).
If is_aligned is True, then m and n must be equal.
mode (str) – “iou” (intersection over union), “iof” (intersection over
foreground) or “giou” (generalized intersection over union).
Default “iou”.
is_aligned (bool, optional) – If True, then m and n must be equal.
Default False.
eps (float, optional) – A value added to the denominator for numerical
stability. Default 1e-6.
Returns:
shape (m, n) if is_aligned is False else shape (m,)
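The sketch below illustrates only the “iou” mode without a batch dimension, using plain lists. The function names are hypothetical, and applying eps as a lower bound on the union is an assumption made here to guard the division.

```python
def pair_iou(box1, box2, eps=1e-6):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(min(box1[2], box2[2]) - max(box1[0], box2[0]), 0.0)
    inter_h = max(min(box1[3], box2[3]) - max(box1[1], box2[1]), 0.0)
    inter = inter_w * inter_h
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    # Assumption: eps acts as a floor on the union.
    union = max(area1 + area2 - inter, eps)
    return inter / union


def bbox_overlaps_sketch(bboxes1, bboxes2, is_aligned=False, eps=1e-6):
    if is_aligned:
        # Pairwise by position: result has shape (m,).
        assert len(bboxes1) == len(bboxes2)
        return [pair_iou(a, b, eps) for a, b in zip(bboxes1, bboxes2)]
    # All combinations: result has shape (m, n).
    return [[pair_iou(a, b, eps) for b in bboxes2] for a in bboxes1]


bboxes1 = [[0.0, 0.0, 10.0, 10.0], [5.0, 5.0, 15.0, 15.0]]
bboxes2 = [[0.0, 0.0, 10.0, 10.0]]
full = bbox_overlaps_sketch(bboxes1, bboxes2)
aligned = bbox_overlaps_sketch(bboxes1, bboxes1, is_aligned=True)
```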
distance (Tensor) – Distance from the given point to 4
boundaries (left, top, right, bottom). Shape (B, N, 4) or (N, 4)
max_shape (Union[Sequence[int], Tensor, Sequence[Sequence[int]]], optional) – Maximum
bounds for boxes, specifies (H, W, C) or (H, W). If priors shape is (B, N, 4), then
the max_shape should be a Sequence[Sequence[int]]
and the length of max_shape should also be B.
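As an illustration of how the four distances and max_shape interact, the sketch below decodes a single box from one point. The function name and the point argument are assumptions of this sketch, and only the unbatched (H, W) form of max_shape is handled.

```python
def distance2bbox_sketch(point, distance, max_shape=None):
    """Decode (left, top, right, bottom) distances into (x1, y1, x2, y2)."""
    px, py = point
    left, top, right, bottom = distance
    x1, y1, x2, y2 = px - left, py - top, px + right, py + bottom
    if max_shape is not None:
        height, width = max_shape[0], max_shape[1]
        # Clamp x to [0, W] and y to [0, H].
        x1, x2 = min(max(x1, 0), width), min(max(x2, 0), width)
        y1, y2 = min(max(y1, 0), height), min(max(y2, 0), height)
    return [x1, y1, x2, y2]


unclipped = distance2bbox_sketch((10, 10), (15, 5, 100, 20))
clipped = distance2bbox_sketch((10, 10), (15, 5, 100, 20), max_shape=(50, 60))
```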
boxes (Tensor or BaseBoxes) – boxes with type of tensor or box type.
If its type is a tensor, the boxes will be directly returned.
If its type is a box type, the boxes.tensor will be returned.
A record will be added to bbox_types, whose key is the box type name
and value is the box type itself. Simultaneously, a reverse dictionary
_box_type_to_name will be updated. It can be used as a decorator or
a normal function.
Parameters:
name (str) – The name of box type.
bbox_type (type, Optional) – Box type class to be registered.
Defaults to None.
force (bool) – Whether to override the existing box type with the same
name. Defaults to False.
A record will be added to box_converter, whose key is
‘{src_type_name}2{dst_type_name}’ and value is the convert function.
It can be used as a decorator or a normal function.
Parameters:
src_type (str or type) – source box type name or class.
dst_type (str or type) – destination box type name or class.
converter (Callable) – Convert function. Defaults to None.
force (bool) – Whether to override the existing box type with the same
name. Defaults to False.
Examples
>>> from mmdet.structures.bbox import register_box_converter
>>> # as a decorator
>>> @register_box_converter('hbox', 'rbox')
>>> def converter_A(boxes):
>>>     pass
>>> # as a normal function
>>> def converter_B(boxes):
>>>     pass
>>> register_box_converter('rbox', 'hbox', converter_B)
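How one function can serve both as a decorator and as a normal call can be sketched with a small standalone registry. Everything below (the dictionary, the function name, the error raised on duplicates) is a hypothetical illustration that only accepts type names as strings; it is not MMDetection's code.

```python
box_converters_sketch = {}  # hypothetical stand-in for the converter registry


def register_box_converter_sketch(src_type, dst_type, converter=None,
                                  force=False):
    key = f'{src_type}2{dst_type}'

    def _register(func):
        if key in box_converters_sketch and not force:
            raise KeyError(f'{key} is already registered')
        box_converters_sketch[key] = func
        return func

    if converter is not None:
        # Used as a normal function: register immediately.
        return _register(converter)
    # Used as a decorator: return the function that does the registration.
    return _register


@register_box_converter_sketch('hbox', 'rbox')
def converter_a(boxes):
    return boxes


def converter_b(boxes):
    return boxes


register_box_converter_sketch('rbox', 'hbox', converter_b)
```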
This function is mainly used in mask targets computation.
It first aligns masks to bboxes by assigned_inds, then crops each mask by the
assigned bbox and resizes it to the size of (mask_h, mask_w).
Parameters:
bboxes (Tensor) – Bboxes in format [x1, y1, x2, y2], shape (N, 4)
out_shape (tuple[int]) – Target (h, w) of resized mask
inds (ndarray) – Indexes to assign masks to each bbox,
shape (N,) and values should be between [0, num_masks - 1].
device (str) – Device of bboxes
interpolation (str) – See mmcv.imresize
binarize (bool) – If True, fractional values are rounded to 0 or 1
after the resize operation. If False and unsupported, an error
will be raised. Defaults to True.
>>> import numpy as np
>>> from mmdet.data_elements.mask.structures import BitmapMasks
>>> self = BitmapMasks.random(dtype=np.uint8)
>>> out_shape = (32, 32)
>>> offset = 4
>>> direction = 'horizontal'
>>> border_value = 0
>>> interpolation = 'bilinear'
>>> # Note, There seem to be issues when:
>>> # * the mask dtype is not supported by cv2.AffineWarp
>>> new = self.translate(out_shape, offset, direction,
>>>                      border_value, interpolation)
>>> assert len(new) == len(self)
>>> assert (new.height, new.width) == out_shape
This class represents masks in the form of polygons.
Polygons is a list of three levels. The first level of the list
corresponds to objects, the second level to the polys that compose the
object, the third level to the poly coordinates
Parameters:
masks (list[list[ndarray]]) – The first level of the list
corresponds to objects, the second level to the polys that
compose the object, the third level to the poly coordinates
Compute mask target for positive proposals in multiple images.
Parameters:
pos_proposals_list (list[Tensor]) – Positive proposals in multiple
images, each has shape (num_pos, 4).
pos_assigned_gt_inds_list (list[Tensor]) – Assigned GT indices for each
positive proposals, each has shape (num_pos,).
gt_masks_list (list[BaseInstanceMasks]) – Ground truth masks of
each image.
cfg (dict) – Config dict that specifies the mask size.
Returns:
Mask target of each image, has shape (num_pos, w, h).
Return type:
Tensor
Example
>>> from mmengine.config import Config
>>> import mmdet
>>> from mmdet.data_elements.mask import BitmapMasks
>>> from mmdet.data_elements.mask.mask_target import *
>>> H, W = 17, 18
>>> cfg = Config({'mask_size': (13, 14)})
>>> rng = np.random.RandomState(0)
>>> # Positive proposals (tl_x, tl_y, br_x, br_y) for each image
>>> pos_proposals_list = [
>>>     torch.Tensor([
>>>         [7.2425, 5.5929, 13.9414, 14.9541],
>>>         [7.3241, 3.6170, 16.3850, 15.3102],
>>>     ]),
>>>     torch.Tensor([
>>>         [4.8448, 6.4010, 7.0314, 9.7681],
>>>         [5.9790, 2.6989, 7.4416, 4.8580],
>>>         [0.0000, 0.0000, 0.1398, 9.8232],
>>>     ]),
>>> ]
>>> # Corresponding class index for each proposal for each image
>>> pos_assigned_gt_inds_list = [
>>>     torch.LongTensor([7, 0]),
>>>     torch.LongTensor([5, 4, 1]),
>>> ]
>>> # Ground truth mask for each true object for each image
>>> gt_masks_list = [
>>>     BitmapMasks(rng.rand(8, H, W), height=H, width=W),
>>>     BitmapMasks(rng.rand(6, H, W), height=H, width=W),
>>> ]
>>> mask_targets = mask_target(
>>>     pos_proposals_list, pos_assigned_gt_inds_list,
>>>     gt_masks_list, cfg)
>>> assert mask_targets.shape == (5,) + cfg['mask_size']
A mask is represented as a list of polys, and a poly is represented as
a 1-D array. In dataset, all masks are concatenated into a single 1-D
tensor. Here we need to split the tensor into original representations.
Parameters:
polys (list) – a list (length = image num) of 1-D tensors
poly_lens (list) – a list (length = image num) of poly length
polys_per_mask (list) – a list (length = image num) of poly number
of each mask
Returns:
a list (length = image num) of list (length = mask num) of list (length = poly num) of numpy array.
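The two-stage split described above (first into polys, then into masks) can be sketched with plain lists. The function names below are hypothetical, and Python lists stand in for tensors and numpy arrays.

```python
def slice_list(in_list, lens):
    """Cut in_list into consecutive chunks with the given lengths."""
    out, idx = [], 0
    for n in lens:
        out.append(in_list[idx:idx + n])
        idx += n
    return out


def split_combined_polys_sketch(polys, poly_lens, polys_per_mask):
    mask_polys_list = []
    for polys_single, lens_single, per_mask_single in zip(
            polys, poly_lens, polys_per_mask):
        # First split the flat coordinates into individual polys ...
        split_polys = slice_list(polys_single, lens_single)
        # ... then group the polys that belong to the same mask.
        mask_polys_list.append(slice_list(split_polys, per_mask_single))
    return mask_polys_list


# One image with two masks: the first has two polys, the second has one.
flat = [0, 0, 1, 0, 1, 1, 2, 2, 3, 2, 3, 3, 2, 3, 5, 5, 6, 5, 6, 6]
result = split_combined_polys_sketch([flat], [[6, 8, 6]], [[2, 1]])
```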
If GT and prediction are plotted at the same time, they are
displayed in a stitched image where the left image is the
ground truth and the right image is the prediction.
- If show is True, all storage backends are ignored, and
the images will be displayed in a local window.
- If out_file is specified, the drawn image will be
saved to out_file. It is usually used when the display
is not available.
Parameters:
name (str) – The image identifier.
image (np.ndarray) – The image to draw.
data_sample (DetDataSample, optional) – A data
sample that contains annotations and predictions.
Defaults to None.
draw_gt (bool) – Whether to draw GT DetDataSample. Defaults to True.
draw_pred (bool) – Whether to draw Prediction DetDataSample.
Defaults to True.
show (bool) – Whether to display the drawn image. Defaults to False.
wait_time (float) – The interval of show (s). Defaults to 0.
out_file (str) – Path to output file. Defaults to None.
pred_score_thr (float) – The threshold to visualize the bboxes
and masks. Defaults to 0.3.
step (int) – Global step value to record. Defaults to 0.
If GT and prediction are plotted at the same time, they are
displayed in a stitched image where the left image is the
ground truth and the right image is the prediction.
- If show is True, all storage backends are ignored, and
the images will be displayed in a local window.
- If out_file is specified, the drawn image will be
saved to out_file. It is usually used when the display
is not available.
Parameters:
name (str) – The image identifier.
image (np.ndarray) – The image to draw.
data_sample – A data sample that contains annotations and predictions.
Defaults to None.
draw_gt (bool) – Whether to draw GT TrackDataSample.
Defaults to True.
draw_pred (bool) – Whether to draw Prediction TrackDataSample.
Defaults to True.
show (bool) – Whether to display the drawn image. Defaults to False.
wait_time (int) – The interval of show (s). Defaults to 0.
out_file (str) – Path to output file. Defaults to None.
pred_score_thr (float) – The threshold to visualize the bboxes
and masks. Defaults to 0.3.
step (int) – Global step value to record. Defaults to 0.
Try to convert inputs to FP16 and then to CPU if a PyTorch CUDA Out of
Memory error occurs. It performs the following steps:
First, retry after calling torch.cuda.empty_cache().
If that still fails, retry after converting the inputs
to FP16.
If that still fails, retry after moving the inputs to the CPU.
In this case, it expects the function to dispatch to a
CPU implementation.
Parameters:
to_cpu (bool) – Whether to convert outputs to CPU if an OOM
error occurs. This will slow down the code significantly.
Defaults to True.
test (bool) – Skip the _ignore_torch_cuda_oom operation so that
lightweight data can be used in unit tests. Only used in
unit tests. Defaults to False.
Examples
>>> from mmdet.utils.memory import AvoidOOM
>>> AvoidCUDAOOM = AvoidOOM()
>>> output = AvoidCUDAOOM.retry_if_cuda_oom(
>>>     some_torch_function)(input1, input2)
>>> # To use as a decorator
>>> # from mmdet.utils import AvoidCUDAOOM
>>> @AvoidCUDAOOM.retry_if_cuda_oom
>>> def function(*args, **kwargs):
>>>     return None
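The order of the retry steps can be illustrated without a GPU. The sketch below is only an analogy: FakeOutOfMemory, retry_with_fallbacks and the toy heavy function are hypothetical, and halving an integer stands in for converting inputs to a cheaper form.

```python
class FakeOutOfMemory(RuntimeError):
    """Hypothetical stand-in for a CUDA out-of-memory error."""


def retry_with_fallbacks(func, fallbacks):
    """Wrap func so that it retries with each fallback conversion in order.

    `fallbacks` is an ordered list of (name, convert) pairs, where convert
    maps the original args tuple to a cheaper args tuple.
    """
    def wrapped(*args):
        attempts = [lambda a: a] + [convert for _name, convert in fallbacks]
        last_error = None
        for convert in attempts:
            try:
                return func(*convert(args))
            except FakeOutOfMemory as err:
                last_error = err
        raise last_error
    return wrapped


calls = []


def heavy(size):
    calls.append(size)
    if size > 4:
        raise FakeOutOfMemory(f'cannot allocate {size}')
    return size


safe_heavy = retry_with_fallbacks(heavy, [
    ('empty_cache', lambda args: args),      # retry with unchanged inputs
    ('fp16', lambda args: (args[0] // 2,)),  # retry with cheaper inputs
    ('cpu', lambda args: (args[0] // 4,)),   # retry with the cheapest inputs
])
result = safe_heavy(16)
```

Each fallback is applied to the original arguments, so a later step does not depend on an earlier one having succeeded.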
Register all modules in mmdet into the registries.
Parameters:
init_default_scope (bool) – Whether to initialize the mmdet default scope.
When init_default_scope=True, the global default scope will be
set to mmdet, and all registries will build modules from mmdet’s
registry node. To understand more about the registry, please refer
to https://github.com/vbti-development/onedl-mmengine/blob/main/docs/en/tutorials/registry.md
Defaults to True.
Note: Due to the dynamic shape of the loss calculation and
post-processing parts in the object detection algorithm, these
functions must be compiled every time they are run.
Setting a large value for torch._dynamo.config.cache_size_limit
may result in repeated compilation, which can slow down training
and testing speed. Therefore, we need to set the default value of
cache_size_limit smaller. An empirical value is 4.
img (Tensor) – of shape (N, C, H, W) encoding input images.
Typically these should be mean centered and std scaled.
img_metas (list[dict]) – List of image info dict where each dict
has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain
‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’.
For details on the values of these keys, see
mmdet.datasets.pipelines.Collect.
kwargs (dict) – Specific to concrete implementation.
Returns:
A dict in which data_batch is split by tags,
such as ‘sup’, ‘unsup_teacher’, and ‘unsup_student’.
All workers must call this function, otherwise it will deadlock.
This method is generally used in DistributedSampler,
because the seed should be identical across all processes
in the distributed group.
In distributed sampling, different ranks should sample non-overlapped
data in the dataset. Therefore, this function is used to make sure that
each rank shuffles the data indices in the same order based
on the same seed. Then different ranks could use different indices
to select non-overlapped data from the same data list.
Parameters:
seed (int, Optional) – The seed. Defaults to None.
device (str) – The device where the seed will be put on.
Defaults to ‘cuda’.
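Why a shared seed yields non-overlapping shards can be shown with the standard library alone. In this sketch the function name and the strided selection are assumptions; an actual distributed sampler may partition the shuffled indices differently.

```python
import random


def shard_indices(num_samples, seed, rank, world_size):
    """Shuffle with the shared seed, then take this rank's strided slice."""
    indices = list(range(num_samples))
    # Every rank builds the same permutation because the seed is identical.
    random.Random(seed).shuffle(indices)
    return indices[rank::world_size]


shards = [shard_indices(10, seed=42, rank=r, world_size=3) for r in range(3)]
```

If the ranks used different seeds, each would shuffle differently and the strided slices could overlap or miss samples.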