Modeling spatial layout for scene image understanding via a novel multiscale sum-product network
Introduction
Natural scene image understanding, which aims at labeling each pixel of a scene image with a predefined object class while simultaneously segmenting or recognizing the multiple objects that occur in the scene, has been extensively studied over the past years (Farabet, Couprie, Najman, LeCun, 2013, Gritti, Damkat, Monaci, 2013, Liu, Xu, Feng, 2011, Rincón, Bachiller, Mira, 2005, Shotton, Winn, Rother, Criminisi, 2009, Tighe, Lazebnik, 2013, Tu, Chen, Yuille, Zhu, 2005, Yin, Jiao, Chai, Fang, 2015). However, since even objects of the same class tend to exhibit large intra-class variations in natural scenes, automatically extracting satisfying high-level semantics from complex images remains a very challenging task. With the recent development of vision-based hardware and network techniques, it has become an active research topic that attracts more and more researchers in the computer vision and pattern recognition community.
Fortunately, in addition to low-level cues, most natural scene images contain intrinsic spatial structures, namely, scene contextual information. Thus the state-of-the-art approaches consider the spatial layout of each scene image as a kind of prior information to improve image understanding results. These methods jointly model the low-level appearance of every single patch and the spatial structures between adjacent patch pairs through a unified framework. A common choice for spatial layout modeling is to resort to graphical models such as the Conditional Random Field (CRF) (Krähenbühl, Koltun, 2011, Ladicky, Russell, Kohli, Torr, 2009), where nodes are built on image pixels or superpixels, while edges incorporate second-order priors such as the smoothness between adjacent nodes. The problem of image understanding is then treated as Maximum-a-Posteriori (MAP) inference in the graphical model. These methods are particularly effective for modeling the relationship between adjacent objects. However, they may not perform well for complex scene images due to the following limitations. First, CRF with a grid structure has the ability to utilize adjacent spatial relations, but cannot characterize the wider range of spatial context constraints among non-adjacent scene objects, which sometimes play a more important role than adjacent relations in automatically understanding scene image contents. For example, it is often hard to decide scene content directly from local visual features due to the instabilities brought by intensity, color, texture, illumination, occlusion and viewpoint variations. Instead, by combining a wider range of spatial relations on different scales, scene content understanding results can be greatly improved. Unfortunately, how to combine both the adjacent (short-range) and the non-adjacent (long-range) spatial layouts of image content to enforce the consistency of scene image parsing with such graphical models remains admittedly a hard problem.
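To make the grid-CRF baseline concrete, the following minimal sketch scores a label assignment with unary costs plus a Potts smoothness penalty over 4-connected neighbours; MAP inference would then search for the assignment minimizing this energy. The function and parameter names are illustrative, not taken from any cited implementation.

```python
import numpy as np

def grid_crf_energy(labels, unary, pairwise_weight=1.0):
    """Energy of a label assignment under a simple grid CRF:
    sum of per-pixel unary costs plus a Potts penalty for every
    pair of 4-connected neighbours carrying different labels.
    labels : (H, W) int array of class indices.
    unary  : (H, W, C) array of per-pixel, per-class costs.
    pairwise_weight : hypothetical smoothness strength."""
    H, W = labels.shape
    rows, cols = np.indices((H, W))
    # Unary term: cost of the chosen label at every pixel.
    energy = unary[rows, cols, labels].sum()
    # Pairwise Potts term over horizontal and vertical neighbours.
    energy += pairwise_weight * (labels[:, 1:] != labels[:, :-1]).sum()
    energy += pairwise_weight * (labels[1:, :] != labels[:-1, :]).sum()
    return float(energy)
```

Note how the pairwise term only ever touches adjacent pixels; this locality is exactly the limitation discussed above, since no term couples non-adjacent regions.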
Second, the inference and training of some complex graphical models may be inefficient, which makes the use of graphical models in real-life computer vision applications inflexible and difficult. For example, a CRF with a fully connected structure is often hard to infer and train (Farabet et al., 2013). Finally, the state-of-the-art graphical models often have difficulty integrating high-order object shape priors, which are essential for parsing or understanding image semantics accurately. As a result, object shape priors are often not well integrated into such models.
In our previous research, we explored scene image co-segmentation with a topic-level random walk framework (Yuan, Lu, & Shivakumara, 2014) and object category discovery in natural scenes with a context-aware graphical model (Yuan & Lu, 2014). To further address the issues discussed above in visual scene image understanding, this paper proposes a novel deep architecture named Multiscale Sum-Product Network (MSPN), which can be viewed as a stacked sum-product network (SPN) (Poon & Domingos, 2011) that jointly models the distribution over image-level labels and unary potentials from different scene scales. Due to its deep structure, the proposed MSPN is able to characterize both the local (short-range) and the global (long-range) spatial relations on different scales, from the pixel level and patch level up to the image level, in a hierarchical manner, for better parsing the semantics of complex scene images. Ideally, the combination of these two types of spatial relations helps understand scene images more accurately, since long-range interactions among image patches can be well characterized by MSPN. Additionally, by stacking SPNs, we gain the ability to characterize high-order relations among pixels in a flexible and implicit way. To the best of our knowledge, this is the first work to introduce the deep sum-product network concept into scene image understanding or content parsing research, with the goal of reducing the instabilities brought by unpredictable variations of low-level local visual appearances.
In our architecture, the product operation models various correlations between every two adjacent patches, on which the sum operation further integrates these correlations into the “feature” of a larger patch. On the bottom layer of MSPN, an SPN is designed for each patch to model the joint distribution over the unary potentials and image labels from the previous scale, aiming at modeling the local context information within the image patch. On the upper layer of MSPN, a global SPN is proposed for the whole image, aggregating the information from all the patch SPNs on the bottom layer together with the unary potentials and image labels. Thus, the upper layer of MSPN is able to capture long-range interactions among image patches and thereby successfully models the global context information of image content, parsing complex scene semantics more accurately.
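The product/sum composition just described can be sketched with three tiny node classes: product nodes multiply the values of children with disjoint scopes (e.g. two adjacent patches), while sum nodes take a weighted mixture of alternative decompositions of a larger patch. This is a generic SPN evaluation sketch, not the paper's exact network; all class and variable names are illustrative.

```python
class Leaf:
    """Leaf over one variable: returns its likelihood for the evidence."""
    def __init__(self, var, dist):
        self.var, self.dist = var, dist  # dist maps value -> probability

    def value(self, evidence):
        return self.dist[evidence[self.var]]

class Product:
    """Product node: combines children over disjoint sub-scopes,
    e.g. the correlations of two adjacent patches."""
    def __init__(self, children):
        self.children = children

    def value(self, evidence):
        out = 1.0
        for child in self.children:
            out *= child.value(evidence)
        return out

class Sum:
    """Sum node: weighted mixture of alternative decompositions
    of a larger patch."""
    def __init__(self, weighted_children):
        self.weighted_children = weighted_children  # list of (w, child)

    def value(self, evidence):
        return sum(w * c.value(evidence) for w, c in self.weighted_children)
```

A two-variable toy network, `Sum([(1.0, Product([leaf_a, leaf_b]))])`, already shows the pattern: the product fuses the two scopes, the sum mixes decompositions.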
In addition to the deep modeling of spatial layouts in every scene, MSPN also allows efficient inference during the testing phase, which benefits from the fact that the proposed MSPN is a deep tractable model containing only relatively simple product and max operations. This is particularly useful for designing real-life systems with a much lower computational load. We show the overall pipeline for understanding the semantics of an unknown scene image in Fig. 1, where we first compute the multiscale unary potentials (middle-left in Fig. 1), and then infer the label of each pixel by maximizing the posterior with the learned MSPN (middle-right), which can be conducted efficiently by a two-pass algorithm. Finally, the scene image understanding results are further refined by using the over-segmented region information from the original scene image (right).
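The two-pass inference mentioned above is, in generic SPN terms, most-probable-explanation (MPE) inference: an upward pass replaces each sum with a max over its weighted children, and a downward pass follows the maximizing child at every sum node to read off a labeling. The sketch below uses a hypothetical dict encoding of nodes and is only an illustration of the standard technique, not the paper's exact algorithm.

```python
def mpe_two_pass(node, evidence, assignment):
    """MPE sketch: upward pass turns sums into maxes; the downward
    choice is realized by descending only into the maximizing child
    of each sum node and collecting leaf states in `assignment`."""
    kind = node['kind']
    if kind == 'leaf':
        var = node['var']
        if var in evidence:                       # observed variable
            assignment[var] = evidence[var]
            return node['dist'][evidence[var]]
        best = max(node['dist'], key=node['dist'].get)
        assignment[var] = best                    # most likely state
        return node['dist'][best]
    if kind == 'product':                         # multiply sub-results
        val = 1.0
        for child in node['children']:
            val *= mpe_two_pass(child, evidence, assignment)
        return val
    # Sum node: evaluate all children, keep only the best branch.
    scored = []
    for w, child in node['children']:
        sub = {}
        scored.append((w * mpe_two_pass(child, evidence, sub), sub))
    best_val, best_sub = max(scored, key=lambda t: t[0])
    assignment.update(best_sub)
    return best_val
```

Because every operation is a product or a max over a tree-structured recursion, the cost is linear in the number of network edges, which is the source of the efficiency claimed above.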
Our main contributions are twofold: (1) a novel deep network framework named MSPN is proposed to perform semantic image segmentation in a more effective and efficient way compared with the popular graphical models; and (2) the architecture of MSPN is elaborately designed to model multiscale features, both the local and the global spatial layout of a scene image, and higher-order priors under a unified framework. The results on two popular benchmarks show that the proposed MSPN addresses semantic image segmentation effectively.
The rest of the paper is organized as follows. Section 2 discusses the related work. In Section 3, we introduce the structure of MSPN. Section 4 gives the training and inference methods for scene image understanding. Experimental results and discussions are given in Section 5, and finally Section 6 concludes the paper.
Related work
The problem of image understanding or parsing has been extensively studied in previous research, and the existing approaches can be roughly classified into three categories, namely, bottom-up scoring, top-down refinement, and region label reasoning.
In bottom-up scoring methods, a fairly large number of object hypotheses are first generated, and then low-level color, texture and shape features are extracted from these segments to classify object regions. For example, Gu, Lim, Arbelaez, and
Multiscale sum-product network for scene understanding
To model both the short-range scene context within every image patch and the long-range scene context among different patches, we put forward MSPN to characterize these correlations. Essentially, our MSPN can be considered as a stacked SPN, as introduced above. The SPN on the bottom layer aims at modeling local context for each scene image patch, while the SPN on the upper layer makes use of the output of the bottom layer, namely, local correlations inside patches, and the unary potentials on this scale
Learning multiscale sum-product network
In the proposed MSPN, there are generally a large number of nodes and densely connected edges. For a given dataset with images, corresponding groundtruth, and multiscale unary potentials, our goal is to obtain a learned MSPN, parameterized by proper weights and structures, that best explains the training dataset. For this purpose, we learn the MSPN in a generative fashion, aiming at maximizing the following log-likelihood: where the
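A common way to fit SPN sum-node weights generatively is a hard-EM style counting update: for each training example, the child winning the weighted max gets its count bumped and the weights are renormalized. The sketch below shows one such update for a single sum node; it is an illustrative standard rule, not necessarily the paper's exact training procedure, and the learning-rate name is an assumption.

```python
import numpy as np

def update_sum_weights(child_values, weights, lr=0.1):
    """One hard-EM style update for a single SPN sum node.
    child_values : likelihoods of the children for one example.
    weights      : current mixture weights (a distribution).
    lr           : hypothetical count increment per example."""
    scores = weights * child_values
    winner = int(np.argmax(scores))   # child that explains the example best
    counts = weights.copy()
    counts[winner] += lr              # bump the winning child's count
    return counts / counts.sum()      # renormalize to a distribution
```

Repeating this update over the training set drives each sum node's weights toward the empirical frequency with which each decomposition wins, which increases the training log-likelihood in the hard-EM sense.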
Experiments
We evaluate our method on two benchmarks: the MSRC-21 dataset (Shotton et al., 2009) and the SIFT FLOW dataset (Liu, Yuen, & Torralba, 2009). As both datasets contain a number of object classes organized in particular spatial layouts, they are very suitable for evaluating the proposed scene understanding framework. The MSRC-21 dataset consists of 591 color images of 213 × 320 pixels with corresponding groundtruth labels for altogether 21 classes. The
Discussion and conclusion
This paper has proposed a multiscale sum-product network (MSPN) to capture the spatial layout in a coarse-to-fine manner for scene image understanding. We first construct a multiscale representation of unary potentials and object labels. We then use MSPN to model the joint distribution over multiscale unary potentials and object labels. For testing, we infer object labels by using the most probable explanation (MPE) technique. The inference results are then fed into a superpixel-based refinement
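The superpixel-based refinement mentioned here and in the pipeline description is commonly realized as a majority vote: within each over-segmented region, every pixel takes the region's most frequent inferred label, snapping the parsing to image boundaries. The sketch below shows this standard post-processing step under that assumption; array names are illustrative.

```python
import numpy as np

def refine_with_superpixels(pixel_labels, superpixels):
    """Majority-vote refinement: within each over-segmented region,
    replace every pixel's label by the region's most frequent label.
    pixel_labels : (H, W) int array of inferred class indices.
    superpixels  : (H, W) int array of region ids."""
    refined = pixel_labels.copy()
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        votes = np.bincount(pixel_labels[mask])  # label histogram in region
        refined[mask] = np.argmax(votes)         # majority label wins
    return refined
```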
Acknowledgment
The work described in this paper was supported by the Natural Science Foundation of China under Grant No. 61272218 and No. 61321491, and the Program for New Century Excellent Talents under NCET-11-0232.
References (40)
- et al. Region contextual visual words for scene categorization. Expert Systems with Applications (2011).
- et al. Semantic segmentation of images exploiting DCT based features and random forest. Pattern Recognition (2016).
- et al. Knowledge modeling for the image understanding task as a design task. Expert Systems with Applications (2005).
- et al. Hierarchical Bayesian models for unsupervised scene understanding. Computer Vision and Image Understanding (2015).
- et al. Interpretation of complex scenes using dynamic tree-structure Bayesian networks. Computer Vision and Image Understanding (2007).
- et al. Scene classification based on single-layer SAE and SVM. Expert Systems with Applications (2015).
- et al. Sum-product networks for modeling activities with stochastic structure. CVPR (2012).
- et al. Semantic segmentation using regions and parts. CVPR (2012).
- et al. Harmony potentials - fusing global and local scale for semantic image segmentation. International Journal of Computer Vision (2012).
- et al. Object segmentation by alignment of poselet activations to image contours. CVPR (2011).