Jump to content

Isolation forest: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Reference was citing an article which did not include any information about the iForest. Further more in the "History" Section are the bullet points obvioulsy from Liu et al. 2012
Mulinaky (talk | contribs)
No edit summary
Line 2: Line 2:
{{short description|Algorithm for anomaly detection}}
{{short description|Algorithm for anomaly detection}}
{{COI|date=July 2022}}
{{COI|date=July 2022}}
'''Isolation Forest''' is an algorithm for data [[anomaly detection]] initially developed by Fei Tony Liu and Zhi-Hua Zhou in 2008.<ref>{{Cite journal |last=Tony Liu |first=Fei |last2=Zhou |first2=Zhi-Hua |title=Isolation Forest |url=https://ieeexplore.ieee.org/abstract/document/4781136 |journal=[[IEEE Xplore]]}}</ref> It detects anomalies using isolation (how far a data point is from the rest of the data). This method deviates from the mainstream philosophy that underpinned most existing anomaly detectors at the time. Instead of profiling all normal instances before anomalies are identified, Isolation Forest detects anomalies using [[Binary tree|binary trees]]. The algorithm has a linear time complexity with a low constant and a low memory requirement, which works well with high volume data.<ref name=":0">{{Cite journal|last1=Liu|first1=Fei Tony|last2=Ting|first2=Kai Ming|last3=Zhou|first3=Zhi-Hua|date=December 2008|title=Isolation Forest|journal=2008 Eighth IEEE International Conference on Data Mining|pages=413–422|doi=10.1109/ICDM.2008.17|isbn=978-0-7695-3502-9|s2cid=6505449}}</ref><ref name=":1">{{Cite journal|last1=Liu|first1=Fei Tony|last2=Ting|first2=Kai Ming|last3=Zhou|first3=Zhi-Hua|date=December 2008|title=Isolation-Based Anomaly Detection|journal=2012 ACM Transactions on Knowledge Discovery from Data|pages=3:1–3:39|url=https://dl.acm.org/doi/10.1145/2133360.2133363|doi=10.1145/2133360.2133363}}</ref>
'''Isolation Forest''' is an algorithm for data [[anomaly detection]] initially developed by [[Fei Tony Liu|Fei Tony Li]]<nowiki/>u and [[Zhou Zhi-Hua|Zhi-Hua Zhou]] in 2008.<ref>{{Cite journal |last=Tony Liu |first=Fei |last2=Zhou |first2=Zhi-Hua |title=Isolation Forest |url=https://ieeexplore.ieee.org/abstract/document/4781136 |journal=[[IEEE Xplore]]}}</ref> It detects anomalies using isolation (how far a data point is from the rest of the data). This method deviates from the mainstream philosophy that underpinned most existing anomaly detectors at the time. Instead of profiling all normal instances before anomalies are identified, Isolation Forest detects anomalies using [[Binary tree|binary trees]]. The algorithm has a linear time complexity with a low constant and a low memory requirement, which works well with high volume data.<ref name=":0">{{Cite journal|last1=Liu|first1=Fei Tony|last2=Ting|first2=Kai Ming|last3=Zhou|first3=Zhi-Hua|date=December 2008|title=Isolation Forest|journal=2008 Eighth IEEE International Conference on Data Mining|pages=413–422|doi=10.1109/ICDM.2008.17|isbn=978-0-7695-3502-9|s2cid=6505449}}</ref><ref name=":1">{{Cite journal|last1=Liu|first1=Fei Tony|last2=Ting|first2=Kai Ming|last3=Zhou|first3=Zhi-Hua|date=December 2008|title=Isolation-Based Anomaly Detection|journal=2012 ACM Transactions on Knowledge Discovery from Data|pages=3:1–3:39|url=https://dl.acm.org/doi/10.1145/2133360.2133363|doi=10.1145/2133360.2133363}}</ref>
[[File:Normalized Anomaly Scores of Isolation Forest.png|thumb|The normalized anomaly scores of the Isolation Forest algorithm fit on the Old Faithful dataset]]
[[File:Normalized Anomaly Scores of Isolation Forest.png|thumb|The normalized anomaly scores of the Isolation Forest algorithm fit on the Old Faithful dataset]]


Line 69: Line 69:
The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of [[Binary search tree|Binary Search Trees]] (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST.<ref name=":2" /> As a consequence, the estimation of average <math>h(x)</math> for external node terminations is the same as that of the unsuccessful searches in BST, that is<ref>{{Cite book |title=Data structures & algorithm analysis in Java|last=Shaffer |first=Clifford A. |date=2011|publisher=Dover Publications |isbn=9780486485812|edition=3rd Dover |location=Mineola, NY|oclc=721884651}}</ref>
The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of [[Binary search tree|Binary Search Trees]] (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST.<ref name=":2" /> As a consequence, the estimation of average <math>h(x)</math> for external node terminations is the same as that of the unsuccessful searches in BST, that is<ref>{{Cite book |title=Data structures & algorithm analysis in Java|last=Shaffer |first=Clifford A. |date=2011|publisher=Dover Publications |isbn=9780486485812|edition=3rd Dover |location=Mineola, NY|oclc=721884651}}</ref>


<math>c(m) = \begin{cases} 2H(m-1)-\frac{2(m-1)}{n} & \text{for }m>2 \\ 1 & \text{for }m=2 \\ 0 & \text{otherwise} \end{cases}</math>


where <math>n</math> is the testing data size, <math>m</math> is the size of the sample set and <math>H</math> is the harmonic number, which can be estimated by <math>H(i)=ln(i)+\gamma</math>, where <math>\gamma=0.5772156649 </math> is the [[Euler–Mascheroni constant|Euler-Mascheroni constant]].
where <math>n</math> is the testing data size, <math>m</math> is the size of the sample set and <math>H</math> is the harmonic number, which can be estimated by <math>H(i)=ln(i)+\gamma</math>, where <math>\gamma=0.5772156649 </math> is the [[Euler–Mascheroni constant|Euler-Mascheroni constant]].
Line 75: Line 74:
The value of c(m) above represents the average of <math>h(x)</math> given <math>m</math>, so we can use it to normalise <math>h(x)</math> and get an estimation of the anomaly score for a given instance x:
The value of c(m) above represents the average of <math>h(x)</math> given <math>m</math>, so we can use it to normalise <math>h(x)</math> and get an estimation of the anomaly score for a given instance x:


<math>s(x,m)=2^\frac{-E(h(x))}{c(m)}</math>
<math>s(x,m)=2^\frac{-E(h(x))}{c(m)}</math><math>c(m) = \begin{cases} 2H(m-1)-\frac{2(m-1)}{n} & \text{for }m>2 \\ 1 & \text{for }m=2 \\ 0 & \text{otherwise} \end{cases}</math>


where <math>E(h(x))</math> is the average value of <math>h(x)</math> from a collection of iTrees. It is interesting to note that for any given instance <math>x</math>:
where <math>E(h(x))</math> is the average value of <math>h(x)</math> from a collection of iTrees. It is interesting to note that for any given instance <math>x</math>:

Revision as of 13:54, 30 December 2022

Isolation Forest is an algorithm for data anomaly detection initially developed by Fei Tony Liu and Zhi-Hua Zhou in 2008.[1] It detects anomalies using isolation (how far a data point is from the rest of the data). This method deviates from the mainstream philosophy that underpinned most existing anomaly detectors at the time. Instead of profiling all normal instances before anomalies are identified, Isolation Forest detects anomalies using binary trees. The algorithm has a linear time complexity with a low constant and a low memory requirement, which works well with high volume data.[2][3]

The normalized anomaly scores of the Isolation Forest algorithm fit on the Old Faithful dataset

Isolation Forest splits the data space using lines that are orthogonal to the origin and assigns higher anomaly scores to data points that need fewer splits to be isolated. The figure on the right shows an application of the Isolation Forest algorithm to the waiting time between eruptions and the duration of the eruption of the Old Faithful geyser in Yellowstone National Park. Darker shades of red indicate higher estimated anomaly scores.

Anomalies in a large dataset may follow very complicated patterns, which are difficult to detect visually in the majority of cases. This makes the field of anomaly detection well suited for the application of machine learning techniques.

History

The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008.[2] It takes advantage of two quantitative properties of anomalous data points in a sample:

  1. Few - anomalous data points are the minority, consisting of fewer instances
  2. Different - anomalous data points have attribute-values that are very different from those of normal instances

Since anomalies are "few and different", they are easier to “isolate” compared to normal points. Isolation Forest builds an ensemble of “Isolation Trees” (iTrees) for the data set, and anomalies are the points that have shorter average path lengths.

In 2010, an extension of the algorithm - SCiforest [4] was published to address the issues of clustered anomalies and utilize random hyperplanes to enhance the algorithm's ability to detect axis-paralleled anomalies. These abilities are lacking in the original implementation.

In a later paper, published in 2012[3] the same authors described a set of experiments to prove that iForest:

  • has a low linear time complexity and a small memory requirement
  • is able to deal with high dimensional data with irrelevant attributes
  • can be trained with or without anomalies in the training set
  • can provide detection results with different levels of granularity without re-training

In 2013 Zhiguo Ding and Minrui Fei proposed a framework based on iForest to resolve the problem of detecting anomalies in streaming data.[5] More applications of iForest to streaming data are described in papers by Tan et al.,[6] Susto et al.[7] and Weng et al.[8]

One of the main problems encountered when applying iForest to anomaly detection is the way the “anomaly score” is computed. This problem was highlighted in a 2018 paper,[9] wherein an improved iForest model named Extended Isolation Forest (EIF) was proposed. The paper describes the improvements made to the original model that enhance the consistency and reliability of the anomaly score produced for a given data point. It is not sure if the random slope enhancement proposed in EIF is novel, as a similar construct "random hyperplane" was proposed in SCiForest in 2010, which mitigates the same axis-parallel bias in iForest.

Algorithm

Fig. 2 - an example of isolating a non-anomalous point in a 2D Gaussian distribution.

At the basis of the Isolation Forest algorithm, there is the tendency of anomalous instances in a dataset to be easier to separate from the rest of the sample (isolate), compared to normal points. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value between the minimum and maximum values allowed for that attribute.

Isolating an Anomalous Point
Fig. 3 - an example of isolating an anomalous point in a 2D Gaussian distribution.

An example of random partitioning in a 2D dataset of normally distributed points is given in Fig. 2 for a non-anomalous point and Fig. 3 for a point that's more likely to be an anomaly. It is apparent from the pictures how anomalies require fewer random partitions to be isolated, compared to normal points.

Mathematically, recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point in Fig. 2 is greater than the path length of in Fig. 3.

More formally, let be a set of d-dimensional points and . An Isolation Tree (iTree) is defined as a data structure with the following properties:

  1. for each node in the Tree, is either an external-node with no child, or an internal-node with one “test” and exactly two daughter nodes ( and )
  2. a test at node consists of an attribute and a split value such that the test determines the traversal of a data point to either oder .

In order to build an iTree, the algorithm recursively divides by randomly selecting an attribute and a split value , until either

  1. the node has only one instance, or
  2. all data at the node have the same values.

When the iTree is fully grown, each point in is isolated at one of the external nodes. Intuitively, the anomalous points are those (easier to isolate, hence) with the smaller path length in the tree, where the path length of point is defined as the number of edges traverses from the root node to get to an external node.

A probabilistic explanation of iTree is provided in the iForest original paper.[2]

Properties of isolation forest

  • Sub-sampling: since iForest does not need to isolate all normal instances, it can frequently ignore the big majority of the training sample. As a consequence, iForest works very well when the sampling size is kept small, a property that is in contrast with the great majority of existing methods, where a large sampling size is usually desirable.[2][3]
  • Swamping: when normal instances are too close to anomalies, the number of partitions required to separate anomalies increases, a phenomenon known as swamping, which makes it more difficult for iForest to discriminate between anomalies and normal points. One of the main reasons for swamping is the presence of too much data for the purpose of anomaly detection, which implies one possible solution to the problem is sub-sampling. Since iForest responds very well to sub-sampling in terms of performance, the reduction of the number of points in the sample is also a good way to reduce the effect of swamping.[2]
  • Masking: when the number of anomalies is high it is possible that some of those aggregate in a dense and large cluster, making it more difficult to separate the single anomalies and, in turn, to detect such points as anomalous. Similarly to swamping, this phenomenon (known as “masking”) is also more likely when the number of points in the sample is big and can be alleviated through sub-sampling.[2]
  • High Dimensional Data: one of the main limitations of standard, distance-based methods is their inefficiency in dealing with high dimensional datasets.[10] The main reason for that is, in a high dimensional space every point is equally sparse, so using a distance-based measure of separation is pretty ineffective. Unfortunately, high-dimensional data also affects the detection performance of iForest, but the performance can be vastly improved by adding a features selection test like Kurtosis to reduce the dimensionality of the sample space.[2][4]
  • Normal Instances Only: iForest performs well even if the training set does not contain any anomalous point,[4] the reason being that iForest describes data distributions in such a way that high values of the path length correspond to the presence of data points. As a consequence, the presence of anomalies is pretty irrelevant to iForest's detection performance.

Anomaly detection with isolation forest

Anomaly detection with Isolation Forest is a process composed of two main stages:[4]

  1. in the first stage, a training dataset is used to build iTrees.
  2. in the second stage, each instance in the test set is passed through these iTrees, and a proper “anomaly score” is assigned to the instance.

Once all the instances in the test set have been assigned an anomaly score, it is possible to mark as “anomaly” any point whose score is greater than a predefined threshold, which depends on the domain the analysis is being applied to.

Anomaly score

The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST.[4] As a consequence, the estimation of average for external node terminations is the same as that of the unsuccessful searches in BST, that is[11]


where is the testing data size, is the size of the sample set and is the harmonic number, which can be estimated by , where is the Euler-Mascheroni constant.

The value of c(m) above represents the average of given , so we can use it to normalise and get an estimation of the anomaly score for a given instance x:

where is the average value of from a collection of iTrees. It is interesting to note that for any given instance :

  • if is close to then is very likely to be an anomaly
  • if is smaller than then is likely to be a normal value
  • if for a given sample all instances are assigned an anomaly score of around , then it is safe to assume that the sample doesn't have any anomaly

Open source implementations

Original implementation:

  • Isolation Forest, an algorithm that detects data-anomalies using binary trees written in R. Released by the paper's first author Liu, Fei Tony in 2009.

Other implementations (in alphabetical order):

Other variations of Isolation Forest algorithm implementations:

See also

References

  1. ^ Tony Liu, Fei; Zhou, Zhi-Hua. "Isolation Forest". IEEE Xplore.
  2. ^ a b c d e f g Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). "Isolation Forest". 2008 Eighth IEEE International Conference on Data Mining: 413–422. doi:10.1109/ICDM.2008.17. ISBN 978-0-7695-3502-9. S2CID 6505449.
  3. ^ a b c Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). "Isolation-Based Anomaly Detection". 2012 ACM Transactions on Knowledge Discovery from Data: 3:1–3:39. doi:10.1145/2133360.2133363.
  4. ^ a b c d e Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (September 2010). "On Detecting Clustered Anomalies Using SCiForest". Joint European Conference on Machine Learning and Knowledge Discovery in Databases - ECML PKDD 2010: Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. 6322: 274–290. doi:10.1007/978-3-642-15883-4_18. ISBN 978-3-642-15882-7.
  5. ^ Ding, Zhiguo; Fei, Minrui (September 2013). "An Anomaly Detection Approach Based on Isolation Forest Algorithm for Streaming Data using Sliding Window". 3rd IFAC International Conference on Intelligent Control and Automation Science.
  6. ^ Tan, Swee Chuan; Ting, Kai Ming; Liu, Fei Tony (2011). "Fast anomaly detection for streaming data". Proceedings of the Twenty-Second international joint conference on Artificial Intelligence. Vol. 2. AAAI Press. pp. 1511–1516. doi:10.5591/978-1-57735-516-8/IJCAI11-254. ISBN 9781577355144.
  7. ^ Susto, Gian Antonio; Beghi, Alessandro; McLoone, Sean (2017). "Anomaly detection through on-line isolation Forest: An application to plasma etching". 2017 28th Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC) (PDF). pp. 89–94. doi:10.1109/ASMC.2017.7969205. ISBN 978-1-5090-5448-0.
  8. ^ Weng, Yu; Liu, Lei (15 April 2019). "A Collective Anomaly Detection Approach for Multidimensional Streams in Mobile Service Security". IEEE Access. 7: 49157–49168. doi:10.1109/ACCESS.2019.2909750.
  9. ^ Hariri, Sahand; Carrasco Kind, Matias; Brunner, Robert J. (2019). "Extended Isolation Forest". IEEE Transactions on Knowledge and Data Engineering. 33 (4): 1479–1489. arXiv:1811.02141. doi:10.1109/TKDE.2019.2947676. S2CID 53236735.
  10. ^ Dilini Talagala, Priyanga; Hyndman, Rob J.; Smith-Miles, Kate (12 Aug 2019). "Anomaly Detection in High Dimensional Data". arXiv:1908.04000 [stat.ML].
  11. ^ Shaffer, Clifford A. (2011). Data structures & algorithm analysis in Java (3rd Dover ed.). Mineola, NY: Dover Publications. ISBN 9780486485812. OCLC 721884651.