Elsevier

Expert Systems with Applications

Volume 114, 30 December 2018, Pages 503-515
Expert Systems with Applications

Rare category exploration with noisy labels

https://doi.org/10.1016/j.eswa.2018.07.050Get rights and content

Highlights

  • We define and address the rare category exploration problem with noisy labels.

  • We propose a compactness based similarity matrix for RCE.

  • Experimental evaluation demonstrates the effectiveness of our two methods.

  • We prove the effectiveness of our methods via theoretical analysis.

Abstract

Starting from a few labelled data examples as the seeds, rare category exploration (RCE) aims to find out the target rare category hidden in the given dataset. However, the performance of conventional RCE approaches is very sensitive to noisy labels while the presence of noises in manually generated labels is almost inevitable. To address this deficiency of traditional RCE approaches, this paper investigates the RCE process in the presence of noisy labels, which to the best of our knowledge has not yet been intensively studied by previous research. Based on the assumption that only one labelled data example of the rare category is correctly labelled while the other few data examples may be wrongly labelled, we first propose a label propagation based algorithm SLP to extract the coarse shape of a rare category. Then, we refine the result by proposing a mixture-information based propagation model, RLP. Extensive experiments have been conducted on six real-world datasets, which show that our method outperforms the state-of-the-art RCE approaches. We also show that even with 20% noisy labels, our method is able to achieve a satisfactory accuracy.

Introduction

Rare categories were first proposed by Pelleg and Moore in Pelleg and Moore (2004), where a rare category was defined as tiny clusters of similar anomalies. Such rare categories are pervasive in real life. For instance, in cyber security, a rare category can be seen as a cluster of a specific kind of network attacks (Gu, Perdisci, Zhang, & Lee, 2008). Rare category examples (e.g., network attacks) are usually of higher significance than that of data examples from the major category (e.g., normal network accesses). Hence, rare category exploration (RCE) is proposed to conduct a comprehensive analysis on interesting rare categories.

RCE is formally formulated as finding the remaining data examples of a rare category from a few seeds of labelled data examples belonging to this rare category. That is, extracting data examples of the rare category with the help of a few labels. The RCE technique can be widely used in many real-world applications. For example, in the area of financial transactions, a rare category may correspond to a small fraction of malicious transactions (Bay, Kumaraswamy, Anderle, Kumar, & Steier, 2006). Finding out all the malicious transactions can help us analyze the specific security leaks in a financial system.

Driven by the wide applications of RCE, researchers have explored various possibilities to solve the RCE task. A thread of the existing RCE algorithms relies on a few labelled data examples for model building and training (He, Tong, Carbonell, 2010, Liu, Huang, He, Chiew, Gao, 2015). He el al. proposed an optimization based solution to train a rare category classifier (He et al., 2010). Liu et al. utilized wavelet transformation for rare category exploration (Liu et al., 2015). Another thread of RCE algorithms concentrates on extracting a rare category only from one single correctly labelled data example from this rare category (Huang, Chiew, Gao, He, & Li, 2014).

Though various solutions have been proposed for the RCE task, the following challenges still remain in many real-world applications. First, in practice, it is quite common that there may be some erroneous labels (referred to as noisy labels henceforth). These noisy labels are generated from many ways, including the false tagging by crowd sourcing (Kazai, Kamps, Milic-Frayling, 2013, Krishna, Hata, Chen, Kravitz, Shamma, Fei-Fei, Bernstein, 2016), the misdiagnoses by domain experts (Rebbapragada, Brodley, Sulla-Menashe, Friedl, 2012, Song, Wang, Zhang, Sun, Yang, 2015), and even the adversarial countermeasures by attackers (Nelson, Rubinstein, Huang, Joseph, Lau, Lee, Rao, Tran, Tygar, 2010, Wang, Wang, Zheng, Zhao, 2014). However, most of the existing methods assume the labels are 100% correct and reliable. It seems that the performance of these algorithms will unfortunately degrade substantially in practical RCE scenarios. Second, rare categories are usually complex-shaped and even overlapped with major categories in some special cases. Therefore, such approaches that extract a rare category only from one correct label may wrongly identify some of the data examples located at the boundary of the rare category, due to the lack of sufficient label information.

To address the above challenges, we investigate an RCE scenario in the presence of noisy labels based on the following two basic assumptions, namely (1) the label for the seed of the objective rare category is correct, and (2) noisy labels exist across the other few labelled data examples from both rare and major categories. To this end, we first propose a label propagation based algorithm known as SLP (seed label propagation) to identify the coarse shape of the objective rare category from a seed. Equipped with a compactness based similarity matrix, SLP propagates the seed label along the high density areas of the objective rare category. Then, we develop a mixture-information based propagation model termed as RLP to extract the latent and useful information in the other few labels while explicitly addressing the noisy label issue. Specifically, RLP first refines the noisy labels and the coarse shape; then, it propagates the refined labels over the whole dataset to find those data examples not included in the coarse shape. Without any rigid assumption on the structure of a rare category, our algorithms can find complex-shaped rare categories in the presence of noisy labels.

In brief, the key contributions of this work are:

  • 1.

    To our knowledge we are the first to explicitly define and address the rare category exploration problem in the presence of noisy labels.

  • 2.

    We propose a compactness based similarity matrix among data examples for capturing the characteristics of rare categories. To handle a dataset with complex shaped rare categories, we propose the kernelized version of this compactness based similarity matrix.

  • 3.

    Extensive experiments have been conducted on six real world datasets. Experimental results show that our proposed methods can achieve a relatively satisfactory performance even when 20% noisy labels are presented. As another contribution, we have proven the effectiveness of the proposed approaches via theoretical analysis.

The remaining sections are organized as follows. We review the related work in Section 2, and give the problem formulation and assumptions in Section 3. In Section 4, we first introduce the construction of the compactness based similarity matrix and its kernelized version, and then present the label propagation based algorithm SLP and the label-noise robust algorithm RLP. Finally, we report the experimental results and findings in Section 5 and conclude the paper in Section 6.

Section snippets

Related work

The related work can be classified into five categories, namely (1) imbalanced classification, (2) semi-supervised learning, (3) rare category detection, (4) rare category exploration, and (5) noisy labels learning.

Problem formulation and assumptions

First, let X={X0,X1,,Xn}Rd denote the set of unlabelled data examples, where d is the dimensionality of X. Further, let O ⊂ X denote the set of the target rare category hidden in X. Let s0 ∈ O denote the seed of the objective rare category, which is assumed to be labelled correctly by trustable human experts. Now, we formulate the problem of RCE in the presence of noisy labels as follows.

Definition 1 RCE in the presence of noisy labels

Given an unlabelled dataset X, where the first data example X0 is the seed (referred to as s0 henceforth)

Our RCE algorithms

A complete architecture of our solution to the RCE process is shown in Fig. 1. Starting from a trustable seed of the objective rare category, our solution works as follows. First, we construct a compactness based similarity matrix (as discussed in Section 4.1). To handle some special cases in which data examples cannot be linearly reconstructed, we kernelize this compactness based similarity matrix (as discussed in Section 4.2). Then, our proposed algorithm, seed label propagation (SLP),

Experimental evaluation

In this section, we verify the effectiveness and efficiency of our algorithms from three aspects, i.e., (1) the accuracy comparison in terms of F-score, (2) label-noise robustness, and (3) the ability to find complex-shaped rare categories. In the following, we first introduce the datasets used in our experiments.

Conclusion

In this paper, we have proposed two algorithms known as SLP and RLP for RCE in the presence of noisy labels. In detail, we have proposed a similarity matrix which has enabled our SLP algorithm to identify the coarse shape of the objective rare category, and a mechanism of refinement which has enabled RLP to refine data examples of the rare category in the presence of noisy labels. Our experiments conducted on different datasets have further verified the effectiveness and efficiency of our

Acknowledgment

This work was partly supported by NSFC under No. 61472359.

References (42)

  • H. Huang et al.

    rare category exploration

    Expert Systems with Applications

    (2014)
  • Z. Liu et al.

    Prior-free rare category detection: More effective and efficient solutions

    Expert Systems with Applications

    (2014)
  • X. Zhou et al.

    Semi-supervised learning

    Academic press library in signal processing

    (2014)
  • R. Achanta et al.

    Frequency-tuned salient region detection

    Proceedings of CVPR

    (2009)
  • D. Angluin et al.

    Learning from noisy examples

    Machine Learning

    (1988)
  • Asuncion, A., & Newman, D. (2007). UCI machine learning...
  • S. Bay et al.

    Large scale detection of irregularities in accounting data

    Proceedings of ICDM

    (2006)
  • J. Bootkrajang et al.

    Label-noise robust logistic regression and its applications

    Proceedings of ECML PKDD

    (2012)
  • O. Chapelle et al.

    Cluster kernels for semi-supervised learning

    Proceedings of NIPS

    (2002)
  • S. Ertekin et al.

    Active learning for class imbalance problem

    Proceedings of SIGIR

    (2007)
  • H.G. Golub et al.

    Matrix computations

    (2012)
  • G. Gu et al.

    Botminer: Clustering analysis of network traffic for protocol- and structure-independent botnet detection

    Proceedings of USENIX security

    (2008)
  • S. Hassani

    Mathematical methods: for students of physics and related fields

    (2008)
  • J. He et al.

    Nearest-neighbor-based active learning for rare category detection

    Proceedings of NIPS

    (2007)
  • J. He et al.

    Prior-free rare category detection

    Proceedings of SDM

    (2009)
  • J. He et al.

    Graph-based rare category detection

    Proceedings of ICDM

    (2008)
  • J. He et al.

    Rare category characterization

    Proceedings of ICDM

    (2010)
  • T.M. Hospedales et al.

    Finding rare classes: Active learning with generative and discriminative models

    IEEE transactions on knowledge and data engineering

    (2013)
  • H. Huang et al.

    CLOVER: A faster prior-free approach to rare-category detection

    Knowledge and information systems

    (2013)
  • H. Huang et al.

    RADAR: rare category detection via computation of boundary degree

    Proceedings of PAKDD

    (2011)
  • G. Kazai et al.

    An analysis of human factors and label accuracy in crowdsourcing relevance judgments

    Information retrieval

    (2013)
  • Cited by (0)

    View full text