Welcome to the course, which is all about Anomaly Detection in R!
Let's start by considering what is meant when we talk about anomalies.
An anomaly can be defined as a data point, or a collection of data points, that does not follow the same pattern as the rest of the data.
There are a number of different ways in which a data point can differ from the rest of a data set. To make this clearer, let's consider some specific examples.
A point anomaly is the simplest type of anomaly and is the motivation for many of the techniques covered by this course.
A point anomaly is defined as a single data point that is unusual or anomalous compared to the rest of the data. For example, observing a single unseasonably hot spring day could be considered anomalous.
The hot spring day is anomalous because its temperature is extreme compared to all of the others. Point anomalies often occur in this way, as a single extreme value on one attribute of the data point.
The summary() function prints the minimum, maximum, lower and upper quartiles, and the mean and median, and can give a sense of how far an extreme point lies from the rest of the data. It's quite clear in this case that the 30-degree Celsius day is a long way from the median of 22.45.
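As a minimal sketch of this step, the code below calls summary() on a small temperature vector. The values are made up for illustration (they are not the course data), chosen only so that the median is 22.45 and the maximum is 30, as in the example above.

```r
# Hypothetical daily spring temperatures in Celsius (illustrative values,
# not the course data set)
temperature <- c(21.5, 22.0, 22.3, 22.4, 22.4, 22.5, 22.6, 23.0, 23.4, 30.0)

# Minimum, quartiles, median, mean and maximum in one call
temp_summary <- summary(temperature)
print(temp_summary)

# How far the hottest day lies above the median
gap <- max(temperature) - median(temperature)
print(gap)
```

Comparing the gap between the maximum and the median with the gap between the quartiles and the median gives a rough feel for how extreme the hottest day is.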
The easiest way to see how unusual a particular value is, is to use a graphical summary like a boxplot. In R this is created using the boxplot() function.
The boxplot() function takes a column of values as an input argument, here illustrated with the temperature data, and produces a box-and-whiskers representation of the distribution of the values. Note that the ylab argument accepts a character string with which to label the y-axis. In this case the units are degrees Celsius.
The box extends from the lower quartile to the upper quartile, while the whiskers stretch further and often reach the minimum and maximum values in the data. When extreme points are present, the whiskers stop short of them, and the extreme values are drawn as distinct points instead, making them easier to spot. By default, R draws a value as a distinct point when it lies more than 1.5 times the interquartile range beyond the box.
In the case shown here, the maximum temperature of 30 degrees Celsius stands out from the others and looks like a clear point anomaly.
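The boxplot described above can be sketched as follows. The temperature vector is again hypothetical, not the course data. The code also reads the `out` element of the list that boxplot() returns invisibly, which holds the values drawn as distinct points beyond the whiskers.

```r
# Hypothetical daily spring temperatures in Celsius (illustrative values,
# not the course data set)
temperature <- c(21.5, 22.0, 22.3, 22.4, 22.4, 22.5, 22.6, 23.0, 23.4, 30.0)

# Draw the boxplot and label the y-axis with the units
bp <- boxplot(temperature, ylab = "Celsius")

# Values plotted as distinct points beyond the whiskers
flagged <- bp$out
print(flagged)
```

Reading `bp$out` is a convenient way to list the points the plot singles out, but whether they are treated as anomalies remains a judgment call.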
It's important to note that a point anomaly is not always an extreme value. A point anomaly can also arise as an unusual combination of values across several attributes.
A collective anomaly is a collection of similar data instances that can be considered anomalous together when compared to the rest of the data.
For example, a period of 10 consecutive days of high temperatures is shown by the red points in the plot. These daily temperatures are unusual because they occur together, and are likely caused by the same underlying weather event.
Data points in a collective anomaly may each also be point anomalies, but this needn't be true. For example, in the case of daily temperatures in a heatwave, a single warm day in summer may be completely normal for the season, but several such days that occur consecutively can cause the event to be considered an anomaly.
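To make the heatwave idea concrete, here is a rough sketch using made-up data in which no single day is extreme, but ten warm days occur in a row. Flagging runs with rle(), the "warm" threshold of 27 and the minimum run length of 5 are all assumptions chosen for illustration, not a method taken from the course.

```r
# Hypothetical summer temperatures in Celsius: 60 days cycling through
# typical values, with days 31 to 40 replaced by a sustained warm spell
typical <- c(22, 24, 27, 23, 25, 21)
temperature <- rep(typical, times = 10)
temperature[31:40] <- 27

# No individual day is flagged as a boxplot outlier
n_point_outliers <- length(boxplot.stats(temperature)$out)

# Flag warm days (assumed threshold) and measure consecutive runs
warm <- temperature >= 27
runs <- rle(warm)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Keep only warm runs that last at least min_run days (assumed length)
min_run <- 5
is_collective <- runs$values & runs$lengths >= min_run
collective <- data.frame(
  start = run_start[is_collective],
  end   = run_end[is_collective],
  days  = runs$lengths[is_collective]
)
print(collective)
```

The isolated 27-degree days elsewhere in the series are ignored, because only the length of the run makes the warm spell stand out.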
Collective anomalies are particularly important in studies over time, where events can cause several data points to appear anomalous at the same time.
Let's put this into practice.