model

A PyTorch implementation of the YOLOX object detection model based on OpenMMLab’s implementation in the mmdetection library.
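
The available model configurations are listed in MODEL_TYPES; the examples below use the first entry: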
model_type = MODEL_TYPES[0]
model_type
'yolox_tiny'

source

ConvModule

 ConvModule (in_channels:int, out_channels:int, kernel_size:int,
             stride:int=1, padding:int=0, bias:bool=True, eps:float=1e-05,
             momentum:float=0.1, affine:bool=True,
             track_running_stats:bool=True,
             activation_function:Type[torch.nn.Module]=torch.nn.SiLU)

Configurable Convolution2d-Normalization-Activation block.

Pseudocode

Function forward(input x):

1. Pass the input (x) through the convolutional layer and store the result back to x.
2. Pass the output from the convolutional layer (now stored in x) through the batch normalization layer and store the result back to x.
3. Apply the activation function to the output of the batch normalization layer (x) and return the result.
| | Type | Default | Details |
|---|---|---|---|
| in_channels | int | | Number of channels in the input image. |
| out_channels | int | | Number of channels produced by the convolution. |
| kernel_size | int | | Size of the convolving kernel. |
| stride | int | 1 | Stride of the convolution. |
| padding | int | 0 | Zero-padding added to both sides of the input. |
| bias | bool | True | If set to False, the layer will not learn an additive bias. |
| eps | float | 1e-05 | A value added to the denominator for numerical stability in BatchNorm2d. |
| momentum | float | 0.1 | The value used for the running_mean and running_var computation in BatchNorm2d. |
| affine | bool | True | If set to True, this module has learnable affine parameters. |
| track_running_stats | bool | True | If set to True, this module tracks the running mean and variance. |
| activation_function | Type | SiLU | The activation function to be applied after batch normalization. |
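
A minimal usage sketch, assuming the constructor documented above (the channel counts and input size are arbitrary illustration values); padding=1 with a 3x3 kernel at stride 1 keeps the spatial size unchanged:

import torch

conv_module = ConvModule(in_channels=3, out_channels=16, kernel_size=3, padding=1)

with torch.no_grad():
    out = conv_module(torch.randn(1, 3, 32, 32))
out.shape
torch.Size([1, 16, 32, 32])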

source

DarknetBottleneck

 DarknetBottleneck (in_channels:int, out_channels:int, eps:float=0.001,
                    momentum:float=0.03, affine:bool=True,
                    track_running_stats:bool=True, add_identity:bool=True)

Basic Darknet bottleneck block.

This block consists of two convolutional layers with an optional identity shortcut.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|---|---|---|---|
| in_channels | int | | The number of input channels to the block. |
| out_channels | int | | The number of output channels from the block. |
| eps | float | 0.001 | A value added to the denominator for numerical stability in the ConvModule’s BatchNorm layer. |
| momentum | float | 0.03 | The value used for the running_mean and running_var computation in the ConvModule’s BatchNorm layer. |
| affine | bool | True | A flag that, when set to True, gives the ConvModule’s BatchNorm layer learnable affine parameters. |
| track_running_stats | bool | True | If True, the ConvModule’s BatchNorm layer will track the running mean and variance. |
| add_identity | bool | True | If True, add an identity shortcut (also known as a skip connection) to the output. |
| **Returns** | **None** | | |
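
A usage sketch with illustrative sizes: with add_identity=True the block adds its input to its output, so the input and output channel counts must match.

bottleneck = DarknetBottleneck(in_channels=64, out_channels=64, add_identity=True)

with torch.no_grad():
    out = bottleneck(torch.randn(1, 64, 32, 32))
out.shape
torch.Size([1, 64, 32, 32])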

source

CSPLayer

 CSPLayer (in_channels:int, out_channels:int, num_blocks:int,
           kernel_size:int=1, stride:int=1, padding:int=0,
           eps:float=0.001, momentum:float=0.03, affine:bool=True,
           track_running_stats:bool=True, add_identity:bool=True)

Cross Stage Partial Layer (CSPLayer).

This layer consists of a series of convolutions, blocks of transformations, and a final convolution. The inputs are processed via two paths: a main path with blocks and a shortcut path. The results from both paths are concatenated and further processed before returning the final output.

The blocks are instances of the DarknetBottleneck class, which perform additional transformations.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|---|---|---|---|
| in_channels | int | | Number of input channels. |
| out_channels | int | | Number of output channels. |
| num_blocks | int | | Number of blocks in the bottleneck. |
| kernel_size | int | 1 | Size of the convolving kernel. |
| stride | int | 1 | Stride of the convolution. |
| padding | int | 0 | Zero-padding added to both sides of the input. |
| eps | float | 0.001 | A value added to the denominator for numerical stability. |
| momentum | float | 0.03 | The value used for the running_mean and running_var computation. |
| affine | bool | True | A flag that, when set to True, gives the layer learnable affine parameters. |
| track_running_stats | bool | True | Whether or not to track the running mean and variance during training. |
| add_identity | bool | True | Whether or not to add an identity shortcut connection if the input and output are the same size. |
| **Returns** | **None** | | |
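
A usage sketch with illustrative sizes; with the default kernel_size=1, stride=1, and padding=0, the spatial dimensions pass through unchanged:

csp_layer = CSPLayer(in_channels=64, out_channels=64, num_blocks=1)

with torch.no_grad():
    out = csp_layer(torch.randn(1, 64, 32, 32))
out.shape
torch.Size([1, 64, 32, 32])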

source

Focus

 Focus (in_channels:int, out_channels:int, kernel_size:int=1,
        stride:int=1, bias:bool=False, eps:float=0.001,
        momentum:float=0.03, affine:bool=True,
        track_running_stats:bool=True)

Focus width and height information into channel space.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|---|---|---|---|
| in_channels | int | | Number of input channels. |
| out_channels | int | | Number of output channels. |
| kernel_size | int | 1 | Size of the convolving kernel. |
| stride | int | 1 | Stride of the convolution. |
| bias | bool | False | If set to False, the layer will not learn an additive bias. |
| eps | float | 0.001 | A value added to the denominator for numerical stability in the ConvModule’s BatchNorm layer. |
| momentum | float | 0.03 | The value used for the running_mean and running_var computation in the ConvModule’s BatchNorm layer. |
| affine | bool | True | A flag that, when set to True, gives the ConvModule’s BatchNorm layer learnable affine parameters. |
| track_running_stats | bool | True | Whether or not to track the running mean and variance during training. |
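
The rearrangement behind Focus can be sketched directly with tensor slicing: every 2x2 patch is spread across four channel groups (one common slicing order shown below), halving the height and width and quadrupling the channel count before the convolution runs.

x = torch.randn(1, 3, 256, 256)

# Space-to-depth: gather the four pixels of each 2x2 patch into channels.
patches = torch.cat(
    [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
    dim=1,
)
patches.shape
torch.Size([1, 12, 128, 128])

The module itself then applies its ConvModule to the rearranged tensor:

focus = Focus(in_channels=3, out_channels=32)

with torch.no_grad():
    out = focus(x)
out.shape
torch.Size([1, 32, 128, 128])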

source

SPPBottleneck

 SPPBottleneck (in_channels:int, out_channels:int,
                pool_sizes:List[int]=[5, 9, 13], eps:float=0.001,
                momentum:float=0.03, affine:bool=True,
                track_running_stats:bool=True)

Spatial Pyramid Pooling layer used in YOLOv3-SPP.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|---|---|---|---|
| in_channels | int | | The number of input channels. |
| out_channels | int | | The number of output channels. |
| pool_sizes | List | [5, 9, 13] | The sizes of the pooling areas. |
| eps | float | 0.001 | A value added to the denominator for numerical stability in the BatchNorm layer. |
| momentum | float | 0.03 | The value used for the running_mean and running_var computation in the BatchNorm layer. |
| affine | bool | True | A flag that, when set to True, gives the BatchNorm layer learnable affine parameters. |
| track_running_stats | bool | True | Whether to keep track of running mean and variance in BatchNorm. |
| **Returns** | **None** | | |
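
SPP-style layers run several max-pools with different kernel sizes in parallel (stride 1, with padding of half the kernel size, so spatial dimensions are preserved) and concatenate the results with their input along the channel axis. A usage sketch with illustrative sizes:

spp = SPPBottleneck(in_channels=64, out_channels=64)

with torch.no_grad():
    out = spp(torch.randn(1, 64, 32, 32))
out.shape
torch.Size([1, 64, 32, 32])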

source

CSPDarknet

 CSPDarknet (arch='P5', deepen_factor=1.0, widen_factor=1.0,
             out_indices=(2, 3, 4), spp_kernal_sizes=(5, 9, 13),
             momentum=0.03, eps=0.001)

The CSPDarknet class implements a CSPDarknet backbone, a convolutional neural network (CNN) used in various image recognition tasks. It forms an integral part of the YOLOX object detection model.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|---|---|---|---|
| arch | str | P5 | Architecture configuration, ‘P5’ or ‘P6’. |
| deepen_factor | float | 1.0 | Factor to adjust the number of blocks in each CSP layer. |
| widen_factor | float | 1.0 | Factor to adjust the number of channels in each layer. |
| out_indices | tuple | (2, 3, 4) | Indices of the stages to output. |
| spp_kernal_sizes | tuple | (5, 9, 13) | Sizes of the pooling operations in the Spatial Pyramid Pooling. |
| momentum | float | 0.03 | Momentum for the moving average in batch normalization. |
| eps | float | 0.001 | Epsilon for batch normalization to avoid numerical instability. |
csp_darknet_cfg = CSP_DARKNET_CFGS[model_type]
csp_darknet = CSPDarknet(**csp_darknet_cfg)

backbone_inp = torch.randn(1, 3, 256, 256)

with torch.no_grad():
    backbone_out = csp_darknet(backbone_inp)
[out.shape for out in backbone_out]
[torch.Size([1, 96, 32, 32]),
 torch.Size([1, 192, 16, 16]),
 torch.Size([1, 384, 8, 8])]
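
The three feature maps come from the stages selected by out_indices=(2, 3, 4) and are downsampled by factors of 8, 16, and 32 relative to the 256×256 input.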

source

YOLOXPAFPN

 YOLOXPAFPN (in_channels, out_channels, num_csp_blocks=3,
             upsample_cfg={'scale_factor': 2, 'mode': 'nearest'},
             momentum=0.03, eps=0.001)

Path Aggregation Feature Pyramid Network (PAFPN) used in YOLOX.

In object detection tasks, this class merges the feature maps from different layers of the backbone network. It helps in aggregating multi-scale feature maps to enhance the detection of objects of various sizes.

Based on OpenMMLab’s implementation in the mmdetection library.

pafpn_cfg = PAFPN_CFGS[model_type]
yolox_pafpn = YOLOXPAFPN(**pafpn_cfg)

with torch.no_grad():
    neck_out = yolox_pafpn(backbone_out)
[out.shape for out in neck_out]
[torch.Size([1, 96, 32, 32]),
 torch.Size([1, 96, 16, 16]),
 torch.Size([1, 96, 8, 8])]
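
All three scales now share a common channel width (96 for yolox_tiny) while keeping the spatial resolutions produced by the backbone, which is the layout the detection head expects.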

source

YOLOXHead

 YOLOXHead (num_classes:int, in_channels:int, feat_channels=256,
            stacked_convs=2, strides=[8, 16, 32], momentum=0.03,
            eps=0.001)

The YOLOXHead class is a PyTorch module that implements the head of a YOLOX model (https://arxiv.org/abs/2107.08430), used for bounding box prediction.

The head takes as input feature maps at multiple scale levels (e.g., from a feature pyramid network) and outputs predicted class scores, bounding box coordinates, and objectness scores for each scale level.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|---|---|---|---|
| num_classes | int | | The number of target classes. |
| in_channels | int | | The number of input channels. |
| feat_channels | int | 256 | The number of feature channels. |
| stacked_convs | int | 2 | The number of convolution layers to stack. |
| strides | list | [8, 16, 32] | The stride of each scale level in the feature pyramid. |
| momentum | float | 0.03 | The momentum for the moving average in batch normalization. |
| eps | float | 0.001 | The epsilon to avoid division by zero in batch normalization. |
head_cfg = HEAD_CFGS[model_type]
yolox_head = YOLOXHead(num_classes=80, **head_cfg)

with torch.no_grad():
    cls_scores, bbox_preds, objectness = yolox_head(neck_out)    
print(f"cls_scores: {[cls_score.shape for cls_score in cls_scores]}")
print(f"bbox_preds: {[bbox_pred.shape for bbox_pred in bbox_preds]}")
print(f"objectness: {[objectness.shape for objectness in objectness]}")
cls_scores: [torch.Size([1, 80, 32, 32]), torch.Size([1, 80, 16, 16]), torch.Size([1, 80, 8, 8])]
bbox_preds: [torch.Size([1, 4, 32, 32]), torch.Size([1, 4, 16, 16]), torch.Size([1, 4, 8, 8])]
objectness: [torch.Size([1, 1, 32, 32]), torch.Size([1, 1, 16, 16]), torch.Size([1, 1, 8, 8])]
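
At each scale level, the head predicts one score per class (80 here, matching the num_classes argument), 4 bounding box values, and 1 objectness score per spatial location, which is exactly what the channel dimensions above show.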

source

YOLOX

 YOLOX (backbone:__main__.CSPDarknet, neck:__main__.YOLOXPAFPN,
        bbox_head:__main__.YOLOXHead)

Implementation of YOLOX: Exceeding YOLO Series in 2021.

Pseudocode

Function forward(input_tensor x):

  1. Pass x through the backbone module, which extracts features from the input images. Store the output as x.
  2. Pass x through the neck module, which aggregates the extracted multi-scale features. Update x with the new output.
  3. Pass x through the bbox_head module, which uses the aggregated features to predict bounding boxes for potential objects in the images. Update x with the new output.
  4. Return x as the final output, representing the model’s predictions for object locations within the input images.
| | Type | Details |
|---|---|---|
| backbone | CSPDarknet | Backbone module for feature extraction. |
| neck | YOLOXPAFPN | Neck module for feature aggregation. |
| bbox_head | YOLOXHead | Bbox head module for predicting bounding boxes. |
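
Assembling the full model from the backbone, neck, and head modules created above: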
yolox = YOLOX(csp_darknet, yolox_pafpn, yolox_head)

with torch.no_grad():
    cls_scores, bbox_preds, objectness = yolox(backbone_inp)    
print(f"cls_scores: {[cls_score.shape for cls_score in cls_scores]}")
print(f"bbox_preds: {[bbox_pred.shape for bbox_pred in bbox_preds]}")
print(f"objectness: {[objectness.shape for objectness in objectness]}")
cls_scores: [torch.Size([1, 80, 32, 32]), torch.Size([1, 80, 16, 16]), torch.Size([1, 80, 8, 8])]
bbox_preds: [torch.Size([1, 4, 32, 32]), torch.Size([1, 4, 16, 16]), torch.Size([1, 4, 8, 8])]
objectness: [torch.Size([1, 1, 32, 32]), torch.Size([1, 1, 16, 16]), torch.Size([1, 1, 8, 8])]

source

init_head

 init_head (head:__main__.YOLOXHead, num_classes:int)

Initialize the YOLOXHead with appropriate class outputs and convolution layers.

This function configures the output channels in the YOLOX head to match the number of classes in the dataset. It also initializes the multi-level convolutional layers for each stride in the YOLOX head.

| | Type | Details |
|---|---|---|
| head | YOLOXHead | The YOLOX head to be initialized. |
| num_classes | int | The number of classes in the dataset. |
| **Returns** | **None** | |
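
The classification branch of the head created earlier has 80 output channels; init_head reconfigures it for a 19-class dataset: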
yolox_head.multi_level_conv_cls
ModuleList(
  (0-2): 3 x Conv2d(96, 80, kernel_size=(1, 1), stride=(1, 1))
)
init_head(yolox_head, 19)
yolox_head.multi_level_conv_cls
ModuleList(
  (0-2): 3 x Conv2d(96, 19, kernel_size=(1, 1), stride=(1, 1))
)

source

build_model

 build_model (model_type:str, num_classes:int, pretrained:bool=True,
              checkpoint_dir:str='./pretrained_checkpoints/')

Builds a YOLOX model based on the given parameters.

| | Type | Default | Details |
|---|---|---|---|
| model_type | str | | Type of the model to be built. |
| num_classes | int | | Number of classes for the model. |
| pretrained | bool | True | Whether to load pretrained weights. |
| checkpoint_dir | str | ./pretrained_checkpoints/ | Directory to store checkpoints. |
| **Returns** | **YOLOX** | | The built YOLOX model. |
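
Building a pretrained yolox_tiny model for a 19-class dataset and checking the forward pass: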
yolox = build_model(model_type, 19, pretrained=True)

test_inp = torch.randn(1, 3, 256, 256)

with torch.no_grad():
    cls_scores, bbox_preds, objectness = yolox(test_inp)
    
print(f"cls_scores: {[cls_score.shape for cls_score in cls_scores]}")
print(f"bbox_preds: {[bbox_pred.shape for bbox_pred in bbox_preds]}")
print(f"objectness: {[objectness.shape for objectness in objectness]}")