1 Introduction

Object detection is one of the most fundamental tasks in computer vision. Due to the rapid progress of deep convolutional neural networks (CNNs) [10,11,12, 15,16,17, 35, 36, 38, 40], the performance of object detection has improved significantly.

Recent CNN-based object detectors can be categorized into one-stage detectors, like YOLO [29, 30], SSD [24], and RetinaNet [22], and two-stage detectors, e.g., Faster R-CNN [31], R-FCN [18], and FPN [21]. Both depend on a backbone network pretrained on the ImageNet classification task. However, there is a gap between image classification and object detection, which requires not only recognizing the categories of object instances but also spatially localizing their bounding boxes. More specifically, there are two problems with using a classification backbone for object detection. (i) Recent detectors, e.g., FPN, involve extra stages compared with the backbone network for ImageNet classification in order to detect objects of various sizes. (ii) Traditional backbones obtain a large receptive field through large downsampling factors, which benefits visual classification. However, the spatial resolution is compromised, which makes it hard to accurately localize large objects and recognize small objects.

A well-designed detection backbone should tackle all of the problems above. In this paper, we propose DetNet, a novel backbone designed for object detection. More specifically, to address the large scale variation of object instances, DetNet includes the additional stages that are utilized in recent object detectors like FPN. Unlike traditional models pre-trained for ImageNet classification, we maintain the spatial resolution of the features even though extra stages are included. However, high-resolution feature maps make it more challenging to build a deep neural network due to the computational and memory cost. To keep DetNet efficient, we employ a low-complexity dilated bottleneck structure. With these improvements, DetNet not only maintains high-resolution feature maps but also keeps a large receptive field, both of which are important for object detection.

To summarize, we have the following contributions:

  • We are the first to analyze the inherent drawbacks of traditional ImageNet pre-trained models for fine-tuning recent object detectors.

  • We propose a novel backbone, called DetNet, which is specifically designed for the object detection task by maintaining the spatial resolution and enlarging the receptive field.

  • We achieve new state-of-the-art results on the MSCOCO object detection and instance segmentation tracks using a low-complexity DetNet-59 backbone.

2 Related Works

Object detection is a heavily researched topic in computer vision. It aims at finding “where” and “what” each object instance is in a given image. Early detectors extracted image features using hand-engineered descriptors such as HOG [5] and SIFT [26], often combined with region proposal methods such as Selective Search [37] and Edge Boxes [41]. For a long time, DPM [8] and its variants were the dominant methods among traditional object detectors. With the rapid progress of deep convolutional neural networks, CNN-based object detectors have yielded remarkable results and become the new trend in the detection literature. Structurally, recent CNN-based detectors are usually split into two parts: one is the backbone network, and the other is the detection branch. We briefly introduce these two parts as follows.

2.1 Backbone Network

The backbone networks for object detection are usually borrowed from ImageNet [32] classification. In the last few years, ImageNet has been regarded as the most authoritative dataset for evaluating the capability of deep convolutional neural networks, and many novel networks have been designed to achieve higher performance on it. AlexNet [17] is among the first to increase the depth of CNNs. In order to reduce computation and increase the valid receptive field, AlexNet down-samples the feature map with a total stride of 32, which became a standard setting for later works. VGGNet [35] stacks 3 \(\times \) 3 convolutions to build a deeper network, while still using a total stride of 32. Most subsequent works adopt a VGG-like structure and design better components within each stage (split by stride). GoogLeNet [36] proposes a novel inception block to capture more diverse features. ResNet [10] adopts a “bottleneck” design with a residual sum operation in each stage, which has proved a simple and efficient way to build deeper networks. ResNeXt [38] and Xception [2] replace traditional convolutions with group convolutions, reducing parameters and increasing accuracy simultaneously. DenseNet [13] densely concatenates several layers, further reducing parameters while keeping competitive accuracy. A different line of work is the Dilated Residual Network [39], which extracts features with a smaller stride; DRN achieves notable results on segmentation, but offers little discussion of object detection. There is also much research on efficient backbones, such as [11, 15, 40]; however, they are usually designed for classification.

2.2 Object Detection Branch

The detection branch is usually attached to a base model designed and trained on the ImageNet classification dataset. There are two design paradigms for object detection. The first is the one-stage detector, which directly uses the backbone for object instance prediction. For example, YOLO [29, 30] uses a simple and efficient backbone, DarkNet [29], and then casts detection as a regression problem. SSD [24] adopts a reduced VGGNet [35] and extracts features from multiple layers, which makes the network better at handling varying object scales. RetinaNet [22] uses ResNet as the basic feature extractor and introduces the focal loss [22] to address the class imbalance caused by the extreme foreground-background ratio. The other popular pipeline is the two-stage detector. Specifically, recent two-stage detectors first predict a large number of proposals based on the backbone, then involve an additional classifier for proposal classification and regression. Faster R-CNN [31] directly generates proposals from the backbone using a Region Proposal Network (RPN). R-FCN [18] generates a position-sensitive feature map from the output of the backbone, then applies a novel pooling method called position-sensitive pooling to each proposal. Deformable Convolutional Networks [4] enable convolution operations with geometric transformations by learning additional offsets without supervision, and are among the first to modify the backbone for object detection. Feature Pyramid Network [21] constructs feature pyramids by exploiting the inherent multi-scale, pyramidal hierarchy of deep convolutional networks; specifically, FPN combines multi-layer outputs through a U-shaped structure, but still borrows the traditional ResNet without further study. DSOD [33] is the first to propose training detectors from scratch, though its results are below those of pre-trained methods.

In conclusion, traditional backbones are usually designed for ImageNet classification; what constitutes a suitable backbone for object detection remains largely unexplored. Most recent object detectors, whether one-stage or two-stage, follow the pipeline of ImageNet pre-trained models, which is not optimal for detection performance. In this paper, we propose DetNet, whose key idea is to design a better backbone for object detection.

3 DetNet: A Backbone Network for Object Detection

3.1 Motivation

Recent object detectors usually rely on a backbone network pretrained on the ImageNet classification dataset. However, the ImageNet classification task differs from object detection, which requires not only recognizing the categories of objects but also spatially localizing their bounding boxes. The design principles of image classification networks are not good for the localization task, as the spatial resolution of the feature maps is gradually decreased in standard networks like VGG16 and ResNet. A few techniques, like the Feature Pyramid Network (FPN) [21] shown in Fig. 1A and dilation, are applied to these networks to maintain spatial resolution. However, the following three problems still exist when training with these backbone networks.

Fig. 1. Comparisons of different backbones used in FPN. Feature pyramid networks (FPN) with the traditional backbone is illustrated in (A). The traditional backbone for image classification is illustrated in (B). Our proposed backbone is illustrated in (C), which has higher spatial resolution and the same stages as FPN. We do not illustrate the stage 1 (stride 2) feature map due to the limitation of figure size.

The Number of Network Stages Is Different. As shown in Fig. 1B, a typical classification network involves 5 stages, each down-sampling the feature maps by 2\(\times \) pooling or stride-2 convolution, so the output feature map is 32\(\times \) sub-sampled. Different from traditional classification networks, feature pyramid detectors usually adopt more stages. For example, in the Feature Pyramid Network (FPN) [21], an additional stage P6 is added to handle larger objects; stages P6 and P7 are added in RetinaNet [22] in a similar way. Obviously, extra stages like P6 are not pre-trained on the ImageNet dataset.

Weak Visibility (Localization) of Large Objects. The feature map with strong semantic information has a stride of 32 with respect to the input image, which brings a large valid receptive field and underlies the success of ImageNet classification. However, a large stride is harmful for object localization: on a stride-32 feature map, a one-cell regression error already corresponds to 32 pixels in the image. In Feature Pyramid Networks, large objects are generated and predicted in the deeper layers, where the boundaries of these objects may be too blurry for accurate regression. This gets even worse when more stages are added to the classification network, since more down-sampling means an even larger stride.

Invisibility (Recall) of Small Objects. Another drawback of a large stride is missing small objects: a 32 \(\times \) 32-pixel object covers only about one cell of a stride-32 feature map. The information from small objects is easily weakened as the spatial resolution of the feature maps decreases and large context is integrated. Therefore, the Feature Pyramid Network predicts small objects in the shallower layers. However, shallow layers usually carry only low-level semantic information, which may not be sufficient to recognize the categories of object instances. Detectors therefore usually enhance their classification capability by involving context cues from high-level representations in the deeper layers. As Fig. 1A shows, the Feature Pyramid Network relieves this with its top-down pathway. However, if the small objects are missing in the deeper layers, these context cues vanish simultaneously.

To address these problems, we propose DetNet, which has the following characteristics. (i) The number of stages is directly designed for object detection. (ii) Even though we involve more stages (such as 6 or 7) than traditional classification networks, we maintain high spatial resolution of the feature maps while keeping a large receptive field.

DetNet has several advantages over traditional backbone networks like ResNet for object detection. First, DetNet has exactly the same number of stages as the detector it serves, so extra stages like P6 can be pre-trained on the ImageNet dataset. Second, benefiting from the high-resolution feature maps in the last stage, DetNet is more powerful at locating the boundaries of large objects and finding small objects. A more detailed discussion can be found in Sect. 4.

3.2 DetNet Design

In this subsection, we present the detailed structure of DetNet. We adopt ResNet-50 as our baseline, which is widely used as the backbone network in many object detectors. For a fair comparison with ResNet-50, we keep stages 1-4 of our DetNet the same as the original ResNet-50.

There are two challenges in building an efficient and effective backbone for object detection. On the one hand, keeping the spatial resolution high in a deep neural network costs an extremely large amount of time and memory. On the other hand, reducing the down-sampling factor leads to a small valid receptive field, which is harmful to many vision tasks, such as image classification and semantic segmentation.

DetNet is carefully designed to address these two challenges. Specifically, DetNet follows the same setting as ResNet from the first stage to the fourth stage; the difference starts from the fifth stage. An overview of our DetNet for image classification can be found in Fig. 2D. Here we discuss the implementation details of DetNet-59, derived from ResNet-50; DetNet can likewise be extended with more layers, analogous to ResNet-101. The detailed design of our DetNet-59 is as follows:

  • We introduce extra stages, e.g., P6, in the backbone, which will be utilized for object detection as in FPN. Meanwhile, we fix the spatial resolution at 16\(\times \) downsampling after stage 4.

  • Since the spatial size is fixed after stage 4, in order to introduce a new stage, we employ a dilated [1, 25, 27] bottleneck with a 1 \(\times \) 1 convolution projection (Fig. 2B) at the beginning of each stage (a code sketch of both block variants follows this list). We find the block in Fig. 2B is important for multi-stage detectors like FPN.

  • We apply the bottleneck with dilation as the basic network block to efficiently enlarge the receptive field. Since dilated convolution is still time-consuming, our stage 5 and stage 6 keep the same channel count as stage 4 (256 input channels for the bottleneck block). This differs from traditional backbone design, which doubles the channels in each later stage.
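To make the block design concrete, below is a minimal PyTorch sketch of the two dilated bottleneck variants (Fig. 2A with an identity shortcut and Fig. 2B with a 1 \(\times \) 1 projection). The bottleneck width of 64 channels inside a 256-channel block and the dilation rate of 2 are our assumptions for illustration; the released model may differ.

```python
import torch
import torch.nn as nn

class DilatedBottleneck(nn.Module):
    """Sketch of the DetNet dilated bottleneck (Fig. 2A/B).

    with_proj=False: identity shortcut (Fig. 2A).
    with_proj=True:  1x1 conv projection on the shortcut (Fig. 2B),
                     used to open a new stage although the spatial
                     size is unchanged.
    """
    def __init__(self, channels=256, mid_channels=64, dilation=2,
                 with_proj=False):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            # Dilated 3x3 conv enlarges the receptive field without striding.
            nn.Conv2d(mid_channels, mid_channels, 3, padding=dilation,
                      dilation=dilation, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.shortcut = (
            nn.Sequential(nn.Conv2d(channels, channels, 1, bias=False),
                          nn.BatchNorm2d(channels))
            if with_proj else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

# Stage 6 in this sketch: a projection block opens the stage, identity
# blocks follow; the spatial stride stays at 16x throughout.
stage6 = nn.Sequential(DilatedBottleneck(with_proj=True),
                       DilatedBottleneck(), DilatedBottleneck())
```

Note that the only difference between the two variants is the shortcut path; Sect. 4.5 shows this small change is what turns stage 6 into a genuinely new semantic stage.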

Fig. 2. Detailed structure of DetNet (D) and the DetNet-based Feature Pyramid Network (E). The two bottleneck blocks used in DetNet are illustrated in (A, B); the original bottleneck is illustrated in (C). DetNet follows the same design as ResNet up to stage 4, while keeping the spatial size after stage 4 (i.e., in stages 5 and 6).

It is easy to integrate DetNet with any detector, with or without a feature pyramid. Without loss of generality, we adopt the prominent FPN detector as our baseline to validate the effectiveness of DetNet. Since DetNet only changes the backbone of FPN, we fix all other structures of FPN. Because we do not reduce the spatial size after stage 4, we simply sum the outputs of these stages in the top-down pathway.
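As an illustration of this integration (a sketch under our assumptions about channel widths, not the authors' code), the top-down pathway degenerates to a plain element-wise sum between the stages that share the 16\(\times \) stride; up-sampling is only needed once the pathway reaches the earlier, higher-resolution stages:

```python
import torch
import torch.nn.functional as F
from torch import nn

# Hypothetical stage outputs of a DetNet-style backbone on an 800x800
# input. Channel widths are illustrative; the key point is that stages
# 4-6 all share the 16x stride.
feats = {
    "c2": torch.randn(1, 256, 200, 200),   # stride 4
    "c3": torch.randn(1, 512, 100, 100),   # stride 8
    "c4": torch.randn(1, 1024, 50, 50),    # stride 16
    "c5": torch.randn(1, 256, 50, 50),     # stride 16 (kept high-res)
    "c6": torch.randn(1, 256, 50, 50),     # stride 16 (kept high-res)
}
# 1x1 lateral convs map every stage to the common 256-d pyramid width.
lateral = {k: nn.Conv2d(v.shape[1], 256, 1) for k, v in feats.items()}

# Top-down pathway: between c6 -> c5 -> c4 the spatial sizes match, so
# the merge is a plain sum; up-sampling only happens below stage 4.
p = lateral["c6"](feats["c6"])
pyramid = {"p6": p}
for name in ("c5", "c4", "c3", "c2"):
    lat = lateral[name](feats[name])
    if p.shape[-2:] != lat.shape[-2:]:
        p = F.interpolate(p, size=lat.shape[-2:], mode="nearest")
    p = p + lat
    pyramid["p" + name[1]] = p
```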

4 Experiments

In this section, we evaluate our approach on the popular MS COCO benchmark, which has 80 object categories. There are 80k images in the training set and 40k images in the validation set. Following common practice, we further split the 40k validation set into a 35k large-val set and a 5k mini-val set. All of our validation experiments use the training set plus large-val for training (about 115k images) and test on the 5k mini-val set. We also report the final results of our approach on COCO test-dev, which has no disclosed labels.

We use the standard COCO metrics to evaluate our approach, including AP (precision averaged over intersection-over-union thresholds), AP\(_{50}\) and AP\(_{75}\) (AP at specific IoU thresholds), and AP\(_{S}\), AP\(_{M}\), AP\(_{L}\) (AP at different scales: small, medium, large).
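For reproducibility, these are the metrics computed by the standard pycocotools evaluation; a typical invocation looks like the sketch below (file paths are placeholders).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO-format ground truth and detection results.
coco_gt = COCO("annotations/instances_minival.json")
coco_dt = coco_gt.loadRes("detections.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP, AP50, AP75, and AP_S / AP_M / AP_L
```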

4.1 Detector Training and Inference

Following the training strategies provided by the Detectron repository [7], our detectors are trained end-to-end on 8 Pascal TITAN XP GPUs, optimized by synchronized SGD with a weight decay of 0.0001 and momentum of 0.9. Each mini-batch has 2 images, so the effective batch size is 16. We resize the shorter edge of each image to 800 pixels and cap the longer edge at 1333 pixels to avoid excessive memory cost. We pad the images within a mini-batch to the same size by filling zeros at the bottom-right of the image. We use the typical “2x” training settings of Detectron [7]: the learning rate is set to 0.02 at the beginning of training, decreased by a factor of 0.1 after 120k and 160k iterations, and training terminates at 180k iterations. We also warm up training by using the smaller learning rate \(0.02\,\times \,0.3\) for the first 500 iterations.
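For clarity, the schedule above can be written as a small helper (a sketch reproducing the described hyper-parameters, not code from the Detectron repository):

```python
def learning_rate(it, base_lr=0.02, warmup_factor=0.3, warmup_iters=500,
                  steps=(120_000, 160_000), gamma=0.1):
    """Step schedule with constant warm-up, as described above."""
    if it < warmup_iters:
        return base_lr * warmup_factor      # warm-up: 0.02 * 0.3
    lr = base_lr
    for step in steps:
        if it >= step:
            lr *= gamma                     # decay by 0.1 at 120k and 160k
    return lr

# 0.006 for it < 500, then 0.02, 0.002 after 120k, 0.0002 after 160k.
for it in (0, 1_000, 130_000, 170_000):
    print(it, learning_rate(it))
```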

All experiments are initialized with ImageNet pre-trained weights. We fix the parameters of stage 1 in the backbone network, and batch normalization is also fixed during detector fine-tuning. We only adopt simple horizontal-flip data augmentation. As for proposal generation, unless explicitly stated, we first pick the 12000 proposals with the highest scores, then apply non-maximum suppression (NMS) to obtain at most 2000 RoIs for training. During testing, we use a 6000/1000 setting (6000 highest-scoring proposals before NMS, 1000 RoIs after NMS). We also adopt the popular RoI-Align technique used in Mask R-CNN [9].
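The proposal filtering step can be sketched with torchvision's NMS as follows; the 0.7 IoU threshold is our assumption (a common RPN default), since the paper does not state it.

```python
import torch
from torchvision.ops import nms

def select_proposals(boxes, scores, pre_nms_topk=12000,
                     post_nms_topk=2000, nms_thresh=0.7):
    """Top-k by score, NMS, then cap the survivors (the 12000/2000
    training setting; use 6000/1000 at test time).

    boxes: (N, 4) tensor in (x1, y1, x2, y2) format; scores: (N,) tensor.
    """
    k = min(pre_nms_topk, scores.numel())
    scores, idx = scores.topk(k)
    boxes = boxes[idx]
    keep = nms(boxes, scores, nms_thresh)[:post_nms_topk]
    return boxes[keep], scores[keep]
```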

4.2 Backbone Training and Inference

Following most hyper-parameters and training settings provided by ResNeXt [38], we train the backbone on the ImageNet classification dataset on 8 Pascal TITAN XP GPUs with a total batch size of 256. Following the standard evaluation strategy, we report the error on a single 224 \(\times \) 224 center crop taken from an image whose shorter side is resized to 256.
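In torchvision terms, this evaluation pre-processing corresponds to the standard pipeline below (the normalization statistics are the usual ImageNet values, which we assume here since the paper does not list them):

```python
from torchvision import transforms

# Resize the shorter side to 256, then take a single 224x224 center crop.
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    # Standard ImageNet normalization (assumed; not stated in the paper).
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```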

4.3 Main Results

We adopt FPN with the ResNet-50 backbone as our baseline because FPN is a prominent detector that also serves many other vision tasks, such as instance segmentation and skeleton detection [9]. To validate the effectiveness of DetNet for FPN, we propose DetNet-59, which involves an additional stage compared with ResNet-50 (design details are in Sect. 3). We then replace the ResNet-50 backbone with DetNet-59 and keep the other structures the same as in the original FPN.

We first train DetNet-59 on ImageNet classification; results are shown in Table 1. DetNet-59 has 23.5% top-1 error at a cost of 4.8G FLOPs. We then train FPN with DetNet-59 and compare it with the ResNet-50-based FPN. From Table 1 we can see that DetNet-59 achieves superior performance to ResNet-50 (over 2 points gain in mAP).

Table 1. Results of different backbones used in FPN. We first report the standard top-1 error on ImageNet classification (the lower the error, the better the classification accuracy). FLOPs measures computational complexity. We also report FPN COCO results to investigate the effectiveness of these backbones for object detection.

Since DetNet-59 has more parameters than ResNet-50 (because we involve an additional stage for FPN P6), a natural hypothesis is that the improvement is mainly due to more parameters. To rule this out, we also train FPN with ResNet-101, which has 7.6G FLOPs; the result is 39.8 mAP. ResNet-101 has far more FLOPs than DetNet-59, yet still yields lower mAP. We further add FPN experiments based on DetNet-101. Specifically, DetNet-101 has 20 repeated bottleneck blocks in stage 4 (versus 6 in DetNet-59). As expected, DetNet-101 achieves superior results to ResNet-101, which validates that DetNet is more suitable than ResNet as a backbone network for object detection.

As DetNet is directly designed for object detection, to further validate its advantage, we train FPN based on DetNet-59 and ResNet-50 from scratch. The results are shown in Table 2. Note that we use multi-GPU synchronized batch normalization during training, as in [28], in order to train from scratch. DetNet-59 still outperforms ResNet-50 by 1.8 points, which further validates that DetNet is more suitable for object detection.

Table 2. FPN results with different backbones trained from scratch. Since we do not involve ImageNet pre-trained weights, we can directly compare backbone capability for object detection.

4.4 Results Analysis

In this subsection, we analyze how DetNet improves object detection. There are two key quantities in object detection evaluation: average precision (AP) and average recall (AR). AR measures how many objects we can find; AP measures how many detected objects are correctly classified and localized. AP and AR are usually evaluated at different IoU thresholds to assess the regression capability for object localization: the larger the IoU threshold, the more accurate the regression must be. AP and AR are also evaluated over different ranges of bounding-box area (small, medium, and large) to give detailed results at various object scales.
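As a quick reminder of the underlying quantity, the IoU between a detection and a ground-truth box is computed as below; a detection counts as correct at AP\(_{50}\) if its IoU exceeds 0.5, and at AP\(_{85}\) only if it exceeds 0.85.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# A detection 5 px off on each side of a 100x100 ground-truth box:
print(iou((0, 0, 100, 100), (5, 5, 105, 105)))  # ~0.82: passes AP50, fails AP85
```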

First, we investigate the impact of DetNet on detection accuracy. We evaluate performance at different IoU thresholds and object scales, as shown in Table 3.

Table 3. Comparison of average precision (AP) of FPN at different IoU thresholds and different bounding-box scales. AP\(_{50}\) is an effective metric for evaluating classification capability. AP\(_{85}\) requires accurate localization of the bounding-box predictions and therefore validates the regression capability of our approach. We also report AP at different scales to capture the influence of high-resolution feature maps in the backbone.
Table 4. Comparison of average recall (AR) of FPN at different IoU thresholds and different bounding-box scales. AR\(_{50}\) is an effective metric for how many reasonable bounding boxes we find (class-agnostic). AR\(_{85}\) reflects how accurate the box locations are.

DetNet-59 achieves an impressive improvement in large-object localization, bringing a 5.5-point gain (40.0 vs 34.5) in AP\(_{85}\)@large. The reason is that the original ResNet-based FPN has a large stride in its deeper feature maps, so large objects can be hard to regress accurately.

We also investigate the influence of DetNet on finding small objects. As shown in Table 4, we report detailed statistics on average recall at different IoU thresholds and scales. We summarize the table as follows:

  • Compared with ResNet-50, DetNet-59 is more powerful at finding small objects that would otherwise be missed, yielding a 6.4-point gain (66.4 vs 60.0) in AR\(_{50}\) for small objects. DetNet keeps higher resolution in the deeper stages than ResNet, so smaller objects can still be found there; since the up-sampling (top-down) pathway in Fig. 1A is used, shallow layers can also receive context cues for finding small objects. However, AR\(_{85}\)@small is comparable (18.7 vs 19.6) between ResNet-50 and DetNet-59. This is reasonable: DetNet does not further help small-object localization, because the ResNet-based FPN already uses large feature maps for small objects.

  • DetNet is good for large-object localization, achieving 56.3 (vs 50.2) AR\(_{85}\) for large objects. However, AR\(_{50}\) for large objects does not change much (95.4 vs 95.0). In general, DetNet localizes large objects more accurately rather than finding more of them.

Fig. 3. The detailed structure of DetNet-59-NoProj, which adopts the module in Fig. 2A to split stage 6 (while the original DetNet-59 adopts Fig. 2B to split stage 6). We design DetNet-59-NoProj to validate the importance of involving a new semantic stage, as in FPN, for object detection. (Color figure online)

4.5 Discussion

As mentioned in Sect. 3, the key idea of DetNet is a backbone specifically designed for object detection. Built on a prominent object detector like the Feature Pyramid Network, DetNet-59 follows exactly the same number of stages as FPN while maintaining high spatial resolution. To discuss the importance of the backbone for object detection, we first investigate the influence of the stages.

Since stage 6 of DetNet-59 has the same spatial size as stage 5, a natural hypothesis is that DetNet-59 simply deepens stage 5 rather than producing a new stage 6. To show that DetNet-59 indeed involves an additional stage, we look closely at its design. As shown in Fig. 2B, DetNet-59 adopts a dilated bottleneck with a simple 1 \(\times \) 1 convolution as the projection layer to split stage 6. This differs from the traditional ResNet, where, when the spatial size of the feature map does not change, the projection in the bottleneck structure is a simple identity (Fig. 2A) rather than a 1 \(\times \) 1 convolution (Fig. 2B). We break this convention: we claim that the bottleneck with a 1 \(\times \) 1 convolution projection is effective for creating a new stage even when the spatial size is unchanged.

To prove this, we introduce DetNet-59-NoProj, which is DetNet-59 modified by removing the 1 \(\times \) 1 projection convolution. The detailed structure is shown in Fig. 3; there are only minor differences (red cells) between DetNet-59 (Fig. 2D) and DetNet-59-NoProj (Fig. 3).

We first train DetNet-59-NoProj on ImageNet classification; results are shown in Table 5. DetNet-59-NoProj has 0.5 higher top-1 error than DetNet-59. We then train FPN based on DetNet-59-NoProj (Table 5). DetNet-59 outperforms DetNet-59-NoProj by over 1 point on object detection.

These experimental results validate the importance of involving a new stage, as FPN does, for object detection. When we use the module in Fig. 2A, the output feature map does not differ much from the input, because the output is just the sum of the input feature map and its transformation; therefore it is hard to create a new semantic stage for the network. If we instead adopt the module in Fig. 2B, the input and output feature maps diverge more, which enables us to create a new semantic stage.

Table 5. Comparison of DetNet-59 and DetNet-59-NoProj. We report both ImageNet classification and FPN COCO detection results. DetNet-59 consistently outperforms DetNet-59-NoProj, which validates the importance of designing the backbone with the same semantic stages as FPN.
Table 6. Comparison of FPN results on DetNet-59 and ResNet-50-dilated to validate the importance of pre-training the backbone for detection. ResNet-50-dilated means that we fine-tune the detector from ResNet-50 weights while involving dilated convolution in stage 5 of ResNet-50. We do not report the top-1 error of ResNet-50-dilated because it cannot be directly used for image classification.

Another natural question is: what happens if we train FPN initialized with ResNet-50 parameters and dilate stage 5 of ResNet-50 during detector fine-tuning (for simplicity, we denote this as ResNet-50-dilated)? To show the importance of pre-training the backbone for detection, we compare DetNet-59-based FPN with ResNet-50-dilated-based FPN in Table 6. ResNet-50-dilated has more FLOPs than DetNet-59, yet achieves lower performance. This demonstrates the importance of directly pre-training the backbone for object detection.

Fig. 4. Illustrative results of DetNet-59 based FPN.

Fig. 5. Illustrative results of DetNet-59 based Mask R-CNN.

4.6 Comparison to State of the Art

We evaluate DetNet-59-based FPN on the MSCOCO [20, 23] detection test-dev dataset and compare it with recent state-of-the-art methods in Table 7. Note that the test-dev dataset is different from the mini-validation set used in our ablation experiments: it has no disclosed labels and is evaluated on the server. Without any bells and whistles, our simple but efficient backbone achieves a new state of the art on COCO object detection, even outperforming strong competitors with the ResNet-101 backbone. It is worth noting that DetNet-59 has only 4.8G FLOPs of complexity while ResNet-101 has 7.6G FLOPs. We quote the original FPN results provided in Mask R-CNN [9]; they would be higher using the Detectron [7] repository, which yields 39.8 mAP for FPN-ResNet-101.

Table 7. Comparison of object detection results between our approach and the state of the art on the MSCOCO test-dev dataset. Based on our simple and effective backbone DetNet-59, our model outperforms all previous state-of-the-art methods. It is worth noting that DetNet-59 yields better results with much lower FLOPs.

To validate the generalization capability of our approach, we also evaluate DetNet-59 on MSCOCO instance segmentation based on Mask R-CNN. Results on test-dev are shown in Table 8. Thanks to the impressive capability of DetNet-59, we obtain a new state-of-the-art result on instance segmentation as well.

Table 8. Comparison of instance segmentation results between our approach and other state-of-the-art methods on the MSCOCO test-dev dataset. Benefiting from DetNet-59, we achieve a new state of the art on the instance segmentation task.

Some results are visualized in Figs. 4 and 5. Detection results of FPN with the DetNet-59 backbone are shown in Fig. 4, and instance segmentation results of Mask R-CNN with the DetNet-59 backbone are shown in Fig. 5. We only show bounding boxes and instance segmentations with classification scores of at least 0.5.

5 Conclusion

In this paper, we design a novel backbone network specifically for the object detection task. Traditionally, the backbone network is designed for the image classification task, and there is a gap when it is transferred to object detection. To address this issue, we present a novel backbone structure called DetNet, which is not only optimized for classification but also localization-friendly. Impressive results are reported for object detection and instance segmentation on the COCO benchmark.