model

A PyTorch implementation of the YOLOX object detection model based on OpenMMLab’s implementation in the mmdetection library.
model_type = MODEL_TYPES[0]
model_type
'yolox_tiny'

source

ConvModule

 ConvModule (in_channels:int, out_channels:int, kernel_size:int,
             stride:int=1, padding:int=0, bias:bool=True, eps:float=1e-05,
             momentum:float=0.1, affine:bool=True,
             track_running_stats:bool=True,
             activation_function:Type[torch.nn.Module]=torch.nn.SiLU)

A configurable Convolution2d-Normalization-Activation block.

Pseudocode

Function forward(input x):

1. Pass the input x through the convolutional layer.
2. Pass the convolution output through the batch normalization layer.
3. Apply the activation function to the normalized output and return the result.
| | Type | Default | Details |
|----|------|---------|---------|
| in_channels | int | | Number of channels in the input image |
| out_channels | int | | Number of channels produced by the convolution |
| kernel_size | int | | Size of the convolving kernel |
| stride | int | 1 | Stride of the convolution |
| padding | int | 0 | Zero-padding added to both sides of the input |
| bias | bool | True | If set to False, the layer will not learn an additive bias |
| eps | float | 1e-05 | A value added to the denominator for numerical stability in BatchNorm2d |
| momentum | float | 0.1 | The value used for the running_mean and running_var computation in BatchNorm2d |
| affine | bool | True | If set to True, this module has learnable affine parameters |
| track_running_stats | bool | True | If set to True, this module tracks the running mean and variance |
| activation_function | typing.Type[torch.nn.Module] | SiLU | The activation function to be applied after batch normalization |
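
As a quick sanity check, a ConvModule with a 3x3 kernel and padding of 1 should preserve the spatial dimensions while changing the channel count. A minimal sketch, assuming ConvModule and torch are in scope as in the cells below (the layer and tensor names here are illustrative):

conv_block = ConvModule(in_channels=3, out_channels=16, kernel_size=3, padding=1)

with torch.no_grad():
    conv_out = conv_block(torch.randn(1, 3, 64, 64))  # conv -> batch norm -> SiLU
conv_out.shape  # expected: torch.Size([1, 16, 64, 64])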

source

DarknetBottleneck

 DarknetBottleneck (in_channels:int, out_channels:int, eps:float=0.001,
                    momentum:float=0.03, affine:bool=True,
                    track_running_stats:bool=True, add_identity:bool=True)

Basic bottleneck block used in Darknet.

This class represents a basic bottleneck block used in Darknet, which consists of two convolutional layers with a possible identity shortcut.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|----|------|---------|---------|
| in_channels | int | | The number of input channels to the block. |
| out_channels | int | | The number of output channels from the block. |
| eps | float | 0.001 | A value added to the denominator for numerical stability in the ConvModule’s BatchNorm layer. |
| momentum | float | 0.03 | The value used for the running_mean and running_var computation in the ConvModule’s BatchNorm layer. |
| affine | bool | True | A flag that when set to True, gives the ConvModule’s BatchNorm layer learnable affine parameters. |
| track_running_stats | bool | True | If True, the ConvModule’s BatchNorm layer will track the running mean and variance. |
| add_identity | bool | True | If True, add an identity shortcut (also known as skip connection) to the output. |
| Returns | None | | |
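
The identity shortcut requires the input and output shapes to match. A minimal sketch, assuming DarknetBottleneck and torch are in scope as in the surrounding cells (names are illustrative):

bottleneck = DarknetBottleneck(in_channels=64, out_channels=64, add_identity=True)

with torch.no_grad():
    bottleneck_out = bottleneck(torch.randn(1, 64, 32, 32))
bottleneck_out.shape  # expected: torch.Size([1, 64, 32, 32]); the shortcut adds the input to the conv output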

source

CSPLayer

 CSPLayer (in_channels:int, out_channels:int, num_blocks:int,
           kernel_size:int=1, stride:int=1, padding:int=0,
           eps:float=0.001, momentum:float=0.03, affine:bool=True,
           track_running_stats:bool=True, add_identity:bool=True)

Cross Stage Partial Layer (CSPLayer).

This layer consists of a series of convolutions, blocks of transformations, and a final convolution. The inputs are processed via two paths: a main path with blocks and a shortcut path. The results from both paths are concatenated and further processed before returning the final output.

The blocks are instances of the DarknetBottleneck class which perform additional transformations.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|----|------|---------|---------|
| in_channels | int | | Number of input channels. |
| out_channels | int | | Number of output channels. |
| num_blocks | int | | Number of blocks in the bottleneck. |
| kernel_size | int | 1 | Size of the convolving kernel. |
| stride | int | 1 | Stride of the convolution. |
| padding | int | 0 | Zero-padding added to both sides of the input. |
| eps | float | 0.001 | A value added to the denominator for numerical stability. |
| momentum | float | 0.03 | The value used for the running_mean and running_var computation. |
| affine | bool | True | A flag that when set to True, gives the layer learnable affine parameters. |
| track_running_stats | bool | True | Whether or not to track the running mean and variance during training. |
| add_identity | bool | True | Whether or not to add an identity shortcut connection if the input and output are the same size. |
| Returns | None | | |
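
A minimal sketch of the two-path structure in use, assuming CSPLayer and torch are in scope as in the surrounding cells (names are illustrative):

csp_layer = CSPLayer(in_channels=64, out_channels=64, num_blocks=2)

with torch.no_grad():
    csp_out = csp_layer(torch.randn(1, 64, 32, 32))  # main path (2 bottlenecks) and shortcut path, concatenated
csp_out.shape  # expected: torch.Size([1, 64, 32, 32])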

source

Focus

 Focus (in_channels:int, out_channels:int, kernel_size:int=1,
        stride:int=1, bias:bool=False, eps:float=0.001,
        momentum:float=0.03, affine:bool=True,
        track_running_stats:bool=True)

Focus width and height information into channel space.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|----|------|---------|---------|
| in_channels | int | | Number of input channels. |
| out_channels | int | | Number of output channels. |
| kernel_size | int | 1 | Size of the convolving kernel. |
| stride | int | 1 | Stride of the convolution. |
| bias | bool | False | If set to False, the layer will not learn an additive bias. |
| eps | float | 0.001 | A value added to the denominator for numerical stability in the ConvModule’s BatchNorm layer. |
| momentum | float | 0.03 | The value used for the running_mean and running_var computation in the ConvModule’s BatchNorm layer. |
| affine | bool | True | A flag that when set to True, gives the ConvModule’s BatchNorm layer learnable affine parameters. |
| track_running_stats | bool | True | Whether or not to track the running mean and variance during training. |
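
Focus slices the input into four spatially offset sub-tensors and concatenates them along the channel axis before the convolution, so height and width are halved. A minimal sketch, assuming Focus and torch are in scope as in the surrounding cells:

focus = Focus(in_channels=3, out_channels=32)

with torch.no_grad():
    focus_out = focus(torch.randn(1, 3, 256, 256))
focus_out.shape  # expected: torch.Size([1, 32, 128, 128]); H and W halved, spatial detail folded into channels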

source

SPPBottleneck

 SPPBottleneck (in_channels:int, out_channels:int,
                pool_sizes:List[int]=[5, 9, 13], eps:float=0.001,
                momentum:float=0.03, affine:bool=True,
                track_running_stats:bool=True)

Spatial Pyramid Pooling layer used in YOLOv3-SPP.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|----|------|---------|---------|
| in_channels | int | | The number of input channels. |
| out_channels | int | | The number of output channels. |
| pool_sizes | typing.List[int] | [5, 9, 13] | The sizes of the pooling areas. |
| eps | float | 0.001 | A value added to the denominator for numerical stability in the BatchNorm layer. |
| momentum | float | 0.03 | The value used for the running_mean and running_var computation in the BatchNorm layer. |
| affine | bool | True | A flag that when set to True, gives the BatchNorm layer learnable affine parameters. |
| track_running_stats | bool | True | Whether to keep track of running mean and variance in BatchNorm. |
| Returns | None | | |
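
The parallel max-pooling branches use stride 1 with padding, so the spatial dimensions pass through unchanged. A minimal sketch, assuming SPPBottleneck and torch are in scope as in the surrounding cells:

spp = SPPBottleneck(in_channels=384, out_channels=384)

with torch.no_grad():
    spp_out = spp(torch.randn(1, 384, 8, 8))
spp_out.shape  # expected: torch.Size([1, 384, 8, 8])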

source

CSPDarknet

 CSPDarknet (arch='P5', deepen_factor=1.0, widen_factor=1.0,
             out_indices=(2, 3, 4), spp_kernal_sizes=(5, 9, 13),
             momentum=0.03, eps=0.001)

The CSPDarknet class implements a CSPDarknet backbone, a convolutional neural network (CNN) used in various image recognition tasks. The CSPDarknet backbone forms an integral part of the YOLOX object detection model.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|----|------|---------|---------|
| arch | str | P5 | Architecture configuration, ‘P5’ or ‘P6’. |
| deepen_factor | float | 1.0 | Factor to adjust the number of blocks in each CSP layer. |
| widen_factor | float | 1.0 | Factor to adjust the number of channels in each layer. |
| out_indices | tuple | (2, 3, 4) | Indices of the stages to output. |
| spp_kernal_sizes | tuple | (5, 9, 13) | Sizes of the pooling operations in the Spatial Pyramid Pooling. |
| momentum | float | 0.03 | Momentum for the moving average in batch normalization. |
| eps | float | 0.001 | Epsilon for batch normalization to avoid numerical instability. |
csp_darknet_cfg = CSP_DARKNET_CFGS[model_type]
csp_darknet = CSPDarknet(**csp_darknet_cfg)

backbone_inp = torch.randn(1, 3, 256, 256)

with torch.no_grad():
    backbone_out = csp_darknet(backbone_inp)
[out.shape for out in backbone_out]
[torch.Size([1, 96, 32, 32]),
 torch.Size([1, 192, 16, 16]),
 torch.Size([1, 384, 8, 8])]
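
The channel counts above follow from the widen_factor: the P5 architecture’s three output stages have base widths of 256, 512, and 1024, which yolox_tiny scales down. A quick check, assuming csp_darknet_cfg carries a widen_factor entry (0.375 for yolox_tiny):

base_widths = [256, 512, 1024]  # base channel counts of stages 2, 3, and 4
[int(w * csp_darknet_cfg['widen_factor']) for w in base_widths]  # expected: [96, 192, 384], matching the shapes above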

source

YOLOXPAFPN

 YOLOXPAFPN (in_channels, out_channels, num_csp_blocks=3,
             upsample_cfg={'scale_factor': 2, 'mode': 'nearest'},
             momentum=0.03, eps=0.001)

Path Aggregation Feature Pyramid Network (PAFPN) used in YOLOX.

In object detection tasks, this class merges the feature maps from different layers of the backbone network. It helps in aggregating multi-scale feature maps to enhance the detection of objects of various sizes.

Based on OpenMMLab’s implementation in the mmdetection library.

pafpn_cfg = PAFPN_CFGS[model_type]
yolox_pafpn = YOLOXPAFPN(**pafpn_cfg)

with torch.no_grad():
    neck_out = yolox_pafpn(backbone_out)
[out.shape for out in neck_out]
[torch.Size([1, 96, 32, 32]),
 torch.Size([1, 96, 16, 16]),
 torch.Size([1, 96, 8, 8])]
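
Note that the neck projects every pyramid level to the same channel count, which is why YOLOXHead below accepts a single in_channels value. A quick check on the outputs above:

{out.shape[1] for out in neck_out}  # expected: {96}; all levels share one channel count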

source

YOLOXHead

 YOLOXHead (num_classes:int, in_channels:int, feat_channels=256,
            stacked_convs=2, strides=[8, 16, 32], momentum=0.03,
            eps=0.001)

The YOLOXHead class is a PyTorch module that implements the head of a YOLOX model (https://arxiv.org/abs/2107.08430), used for bounding box prediction.

The head takes as input feature maps at multiple scale levels (e.g., from a feature pyramid network) and outputs predicted class scores, bounding box coordinates, and objectness scores for each scale level.

Based on OpenMMLab’s implementation in the mmdetection library.

| | Type | Default | Details |
|----|------|---------|---------|
| num_classes | int | | The number of target classes. |
| in_channels | int | | The number of input channels. |
| feat_channels | int | 256 | The number of feature channels. |
| stacked_convs | int | 2 | The number of convolution layers to stack. |
| strides | list | [8, 16, 32] | The stride of each scale level in the feature pyramid. |
| momentum | float | 0.03 | The momentum for the moving average in batch normalization. |
| eps | float | 0.001 | The epsilon to avoid division by zero in batch normalization. |
head_cfg = HEAD_CFGS[model_type]
yolox_head = YOLOXHead(num_classes=80, **head_cfg)

with torch.no_grad():
    cls_scores, bbox_preds, objectness = yolox_head(neck_out)    
print(f"cls_scores: {[cls_score.shape for cls_score in cls_scores]}")
print(f"bbox_preds: {[bbox_pred.shape for bbox_pred in bbox_preds]}")
print(f"objectness: {[objectness.shape for objectness in objectness]}")
cls_scores: [torch.Size([1, 80, 32, 32]), torch.Size([1, 80, 16, 16]), torch.Size([1, 80, 8, 8])]
bbox_preds: [torch.Size([1, 4, 32, 32]), torch.Size([1, 4, 16, 16]), torch.Size([1, 4, 8, 8])]
objectness: [torch.Size([1, 1, 32, 32]), torch.Size([1, 1, 16, 16]), torch.Size([1, 1, 8, 8])]
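
For illustration only (not a library API), the per-level maps can be flattened into a single (batch, anchors, dim) tensor, the layout typically used when decoding YOLOX predictions. Each level contributes H*W locations, so 32*32 + 16*16 + 8*8 = 1344 here:

flat_cls = torch.cat([s.flatten(2).transpose(1, 2) for s in cls_scores], dim=1)
flat_cls.shape  # expected: torch.Size([1, 1344, 80])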

source

YOLOX

 YOLOX (backbone:__main__.CSPDarknet, neck:__main__.YOLOXPAFPN,
        bbox_head:__main__.YOLOXHead)

Implementation of YOLOX: Exceeding YOLO Series in 2021

Pseudocode

Function forward(input_tensor x):

  1. Pass x through the backbone module. The backbone module performs feature extraction from the input images. Store the output as ‘x’.
  2. Pass the updated x through the neck module. The neck module performs feature aggregation of the extracted features. Update ‘x’ with the new output.
  3. Pass the updated x through the bbox_head module. The bbox_head module predicts bounding boxes for potential objects in the images using the aggregated features. Update ‘x’ with the new output.
  4. Return ‘x’ as the final output. The final ‘x’ represents the model’s predictions for object locations within the input images.
| | Type | Details |
|----|------|---------|
| backbone | CSPDarknet | Backbone module for feature extraction. |
| neck | YOLOXPAFPN | Neck module for feature aggregation. |
| bbox_head | YOLOXHead | Bbox head module for predicting bounding boxes. |
yolox = YOLOX(csp_darknet, yolox_pafpn, yolox_head)

with torch.no_grad():
    cls_scores, bbox_preds, objectness = yolox(backbone_inp)    
print(f"cls_scores: {[cls_score.shape for cls_score in cls_scores]}")
print(f"bbox_preds: {[bbox_pred.shape for bbox_pred in bbox_preds]}")
print(f"objectness: {[objectness.shape for objectness in objectness]}")
cls_scores: [torch.Size([1, 80, 32, 32]), torch.Size([1, 80, 16, 16]), torch.Size([1, 80, 8, 8])]
bbox_preds: [torch.Size([1, 4, 32, 32]), torch.Size([1, 4, 16, 16]), torch.Size([1, 4, 8, 8])]
objectness: [torch.Size([1, 1, 32, 32]), torch.Size([1, 1, 16, 16]), torch.Size([1, 1, 8, 8])]

source

init_head

 init_head (head:__main__.YOLOXHead, num_classes:int)

Initialize the YOLOXHead with appropriate class outputs and convolution layers.

This function configures the output channels in the YOLOX head to match the number of classes in the dataset. It also initializes multiple level convolutional layers for each stride in the YOLOX head.

| | Type | Details |
|----|------|---------|
| head | YOLOXHead | The YOLOX head to be initialized. |
| num_classes | int | The number of classes in the dataset. |
| Returns | None | |
yolox_head.multi_level_conv_cls
ModuleList(
  (0-2): 3 x Conv2d(96, 80, kernel_size=(1, 1), stride=(1, 1))
)
init_head(yolox_head, 19)
yolox_head.multi_level_conv_cls
ModuleList(
  (0-2): 3 x Conv2d(96, 19, kernel_size=(1, 1), stride=(1, 1))
)

source

build_model

 build_model (model_type:str, num_classes:int, pretrained:bool=True,
              checkpoint_dir:str='./pretrained_checkpoints/')

Builds a YOLOX model based on the given parameters.

| | Type | Default | Details |
|----|------|---------|---------|
| model_type | str | | Type of the model to be built. |
| num_classes | int | | Number of classes for the model. |
| pretrained | bool | True | Whether to load pretrained weights. |
| checkpoint_dir | str | ./pretrained_checkpoints/ | Directory to store checkpoints. |
| Returns | YOLOX | | The built YOLOX model. |
yolox = build_model(model_type, 19, pretrained=True)

test_inp = torch.randn(1, 3, 256, 256)

with torch.no_grad():
    cls_scores, bbox_preds, objectness = yolox(test_inp)
    
print(f"cls_scores: {[cls_score.shape for cls_score in cls_scores]}")
print(f"bbox_preds: {[bbox_pred.shape for bbox_pred in bbox_preds]}")
print(f"objectness: {[objectness.shape for objectness in objectness]}")
The file ./pretrained_checkpoints/yolox_tiny.pth already exists and overwrite is set to False.
cls_scores: [torch.Size([1, 19, 32, 32]), torch.Size([1, 19, 16, 16]), torch.Size([1, 19, 8, 8])]
bbox_preds: [torch.Size([1, 4, 32, 32]), torch.Size([1, 4, 16, 16]), torch.Size([1, 4, 8, 8])]
objectness: [torch.Size([1, 1, 32, 32]), torch.Size([1, 1, 16, 16]), torch.Size([1, 1, 8, 8])]