Rare category exploration with noisy labels
Introduction
Rare categories were first proposed by Pelleg and Moore in Pelleg and Moore (2004), where a rare category was defined as tiny clusters of similar anomalies. Such rare categories are pervasive in real life. For instance, in cyber security, a rare category can be seen as a cluster of a specific kind of network attacks (Gu, Perdisci, Zhang, & Lee, 2008). Rare category examples (e.g., network attacks) are usually of higher significance than that of data examples from the major category (e.g., normal network accesses). Hence, rare category exploration (RCE) is proposed to conduct a comprehensive analysis on interesting rare categories.
RCE is formally formulated as finding the remaining data examples of a rare category from a few seeds of labelled data examples belonging to this rare category. That is, extracting data examples of the rare category with the help of a few labels. The RCE technique can be widely used in many real-world applications. For example, in the area of financial transactions, a rare category may correspond to a small fraction of malicious transactions (Bay, Kumaraswamy, Anderle, Kumar, & Steier, 2006). Finding out all the malicious transactions can help us analyze the specific security leaks in a financial system.
Driven by the wide applications of RCE, researchers have explored various possibilities to solve the RCE task. A thread of the existing RCE algorithms relies on a few labelled data examples for model building and training (He, Tong, Carbonell, 2010, Liu, Huang, He, Chiew, Gao, 2015). He el al. proposed an optimization based solution to train a rare category classifier (He et al., 2010). Liu et al. utilized wavelet transformation for rare category exploration (Liu et al., 2015). Another thread of RCE algorithms concentrates on extracting a rare category only from one single correctly labelled data example from this rare category (Huang, Chiew, Gao, He, & Li, 2014).
Though various solutions have been proposed for the RCE task, the following challenges still remain in many real-world applications. First, in practice, it is quite common that there may be some erroneous labels (referred to as noisy labels henceforth). These noisy labels are generated from many ways, including the false tagging by crowd sourcing (Kazai, Kamps, Milic-Frayling, 2013, Krishna, Hata, Chen, Kravitz, Shamma, Fei-Fei, Bernstein, 2016), the misdiagnoses by domain experts (Rebbapragada, Brodley, Sulla-Menashe, Friedl, 2012, Song, Wang, Zhang, Sun, Yang, 2015), and even the adversarial countermeasures by attackers (Nelson, Rubinstein, Huang, Joseph, Lau, Lee, Rao, Tran, Tygar, 2010, Wang, Wang, Zheng, Zhao, 2014). However, most of the existing methods assume the labels are 100% correct and reliable. It seems that the performance of these algorithms will unfortunately degrade substantially in practical RCE scenarios. Second, rare categories are usually complex-shaped and even overlapped with major categories in some special cases. Therefore, such approaches that extract a rare category only from one correct label may wrongly identify some of the data examples located at the boundary of the rare category, due to the lack of sufficient label information.
To address the above challenges, we investigate an RCE scenario in the presence of noisy labels based on the following two basic assumptions, namely (1) the label for the seed of the objective rare category is correct, and (2) noisy labels exist across the other few labelled data examples from both rare and major categories. To this end, we first propose a label propagation based algorithm known as SLP (seed label propagation) to identify the coarse shape of the objective rare category from a seed. Equipped with a compactness based similarity matrix, SLP propagates the seed label along the high density areas of the objective rare category. Then, we develop a mixture-information based propagation model termed as RLP to extract the latent and useful information in the other few labels while explicitly addressing the noisy label issue. Specifically, RLP first refines the noisy labels and the coarse shape; then, it propagates the refined labels over the whole dataset to find those data examples not included in the coarse shape. Without any rigid assumption on the structure of a rare category, our algorithms can find complex-shaped rare categories in the presence of noisy labels.
In brief, the key contributions of this work are:
- 1.
To our knowledge we are the first to explicitly define and address the rare category exploration problem in the presence of noisy labels.
- 2.
We propose a compactness based similarity matrix among data examples for capturing the characteristics of rare categories. To handle a dataset with complex shaped rare categories, we propose the kernelized version of this compactness based similarity matrix.
- 3.
Extensive experiments have been conducted on six real world datasets. Experimental results show that our proposed methods can achieve a relatively satisfactory performance even when 20% noisy labels are presented. As another contribution, we have proven the effectiveness of the proposed approaches via theoretical analysis.
The remaining sections are organized as follows. We review the related work in Section 2, and give the problem formulation and assumptions in Section 3. In Section 4, we first introduce the construction of the compactness based similarity matrix and its kernelized version, and then present the label propagation based algorithm SLP and the label-noise robust algorithm RLP. Finally, we report the experimental results and findings in Section 5 and conclude the paper in Section 6.
Section snippets
Related work
The related work can be classified into five categories, namely (1) imbalanced classification, (2) semi-supervised learning, (3) rare category detection, (4) rare category exploration, and (5) noisy labels learning.
Problem formulation and assumptions
First, let denote the set of unlabelled data examples, where d is the dimensionality of X. Further, let O ⊂ X denote the set of the target rare category hidden in X. Let s0 ∈ O denote the seed of the objective rare category, which is assumed to be labelled correctly by trustable human experts. Now, we formulate the problem of RCE in the presence of noisy labels as follows.
Definition 1 RCE in the presence of noisy labels Given an unlabelled dataset X, where the first data example X0 is the seed (referred to as s0 henceforth)
Our RCE algorithms
A complete architecture of our solution to the RCE process is shown in Fig. 1. Starting from a trustable seed of the objective rare category, our solution works as follows. First, we construct a compactness based similarity matrix (as discussed in Section 4.1). To handle some special cases in which data examples cannot be linearly reconstructed, we kernelize this compactness based similarity matrix (as discussed in Section 4.2). Then, our proposed algorithm, seed label propagation (SLP),
Experimental evaluation
In this section, we verify the effectiveness and efficiency of our algorithms from three aspects, i.e., (1) the accuracy comparison in terms of F-score, (2) label-noise robustness, and (3) the ability to find complex-shaped rare categories. In the following, we first introduce the datasets used in our experiments.
Conclusion
In this paper, we have proposed two algorithms known as SLP and RLP for RCE in the presence of noisy labels. In detail, we have proposed a similarity matrix which has enabled our SLP algorithm to identify the coarse shape of the objective rare category, and a mechanism of refinement which has enabled RLP to refine data examples of the rare category in the presence of noisy labels. Our experiments conducted on different datasets have further verified the effectiveness and efficiency of our
Acknowledgment
This work was partly supported by NSFC under No. 61472359.
References (42)
- et al.
rare category exploration
Expert Systems with Applications
(2014) - et al.
Prior-free rare category detection: More effective and efficient solutions
Expert Systems with Applications
(2014) - et al.
Semi-supervised learning
Academic press library in signal processing
(2014) - et al.
Frequency-tuned salient region detection
Proceedings of CVPR
(2009) - et al.
Learning from noisy examples
Machine Learning
(1988) - Asuncion, A., & Newman, D. (2007). UCI machine learning...
- et al.
Large scale detection of irregularities in accounting data
Proceedings of ICDM
(2006) - et al.
Label-noise robust logistic regression and its applications
Proceedings of ECML PKDD
(2012) - et al.
Cluster kernels for semi-supervised learning
Proceedings of NIPS
(2002) - et al.
Active learning for class imbalance problem
Proceedings of SIGIR
(2007)