Sunday, January 24, 2021

Feature Engineering Time Series


"The loss of information that I mentioned above could actually be a useful feature of the Fourier Transformation. Let me explain. Your timeseries probably has some noise in it. One very useful thing that you can do in your feature engineering process would be to take the Fourier Transformation of your time-series. It will probably look pretty chaotic when you do that, because there is some noise built into the signal. So now that you are in the frequency-domain, you can just threshold your transformed data-set. You can do this by simply setting an arbitrary threshold say any frequency with an amplitude greater than 10, any amplitude less than 10, you set to zero. And then you can transform the series back to the time-domain. What you will find when you do this is that your time-series has been simplified. There is less noise. Essentially, what you have done is you have applied a filter to your data. You have filtered out the components of the signal that are likely noise." - Ryan Barnes

Some Fourier transform examples

With code stolen from here [SO] we can see an example:

"A property of the Fourier Transform which is used, for example, for the removal of additive noise, is its distributivity over addition." [1]

When you look at the code to generate this, you'll see it mention it will try to "remove DC". In this context, DC means the average (that is, frequency 0) values. This is a historical legacy of electrical engineering where DC means the Direct Current rather than the waves that make AC electricity.

Now, let's try a 2D representation where I'm actually trying to simulate human behaviour. Let's say it's people going to Church or Synagogue one day per week. We can do this "because the Fourier Transform is separable ... Using these two formulas, the spatial domain image is first transformed into an intermediate image using N one-dimensional Fourier Transforms. This intermediate image is then transformed into the final image, again using N one-dimensional Fourier Transforms." [1]

Can Fourier transforms pick it out of the data? (Code is on my GitHub repository):
Fig 2: Regular activity. Time is the y-axis in the top graphic. The second is frequency space. The third, amplitudes per column.

The first thing to notice is that instead of our spectrum having a peak both side of the origin for the frequency, there are many peaks at multiples of the frequency. These are the harmonics. The reason Figure 1 does not have them is that the signal is made of two pure sinusoidal waves. There are no other frequencies at play.

Now, let's add some noise.
Regular activity with noise
Our regular periodicity is still the most prominent signal in frequency space and the lowest harmonic does indeed correspond to to a regular event - in this case, it's something that happens every 8 days and the lowest harmonic is indeed at 0.125.

Worked example

So much for the theory, let's examine the practise. I note that my /var/log/syslog has lots of noise in it. I want to de-noise it to see if there is something suspicious in it. I used this code in my GitHub repository to spot irregular activity. Note, this is a very simple implementation that just looks at the process that logged an event and its time. I'll make it more sophisticated later.

Anyway, running the parsed log through a Fourier transform resulted in this:

A Fourier transform of syslog events

We can see that there are clearly some regulars so taking them out, we see that the three (arbitrary number alert) most irregular events came from anacron, systemd and nvidia-persistenced. Looking at these logs in syslog did indeed indicated that they were isolated events albeit nothing to worry about (dockerd and containerd were the most regular naggers).


No comments:

Post a Comment