Time Series Features

Features:

 Time domain:
mean, variance, standard deviation, maximum, minimum, number of zero crossings, range (maximum minus minimum), mode
Frequency domain:
DC component; mean, variance, standard deviation, skewness and kurtosis of the spectrum shape; mean, variance, standard deviation, skewness and kurtosis of the amplitudes
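The time-domain features above can be sketched as follows. This is a minimal sketch with NumPy, not code from the original post; the zero-crossing convention (a sign change between consecutive samples) and the tie-breaking for the mode are my assumptions.

```python
import numpy as np

def time_domain_features(x):
    # Compute the listed time-domain features of a 1-D series.
    x = np.asarray(x, dtype=float)
    values, counts = np.unique(x, return_counts=True)
    return {
        "mean": x.mean(),
        "variance": x.var(),
        "std": x.std(),
        "max": x.max(),
        "min": x.min(),
        # count sign changes between consecutive samples
        "zero_crossings": int(np.sum(np.diff(np.signbit(x).astype(int)) != 0)),
        "range": x.max() - x.min(),
        # most frequent value (first one on ties)
        "mode": values[np.argmax(counts)],
    }
```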

R

1. Spectral entropy of a time series
The following compute features based on tiled (non-overlapping) windows:
2. lumpiness: the variance of the window variances
3. stability: the variance of the window means
The following compute features based on sliding (overlapping) windows:
4. max_level_shift: finds the largest mean shift between two consecutive windows
5. max_var_shift: finds the largest variance shift between two consecutive windows
6. max_kl_shift: finds the largest shift in Kullback-Leibler divergence between two consecutive windows
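As an illustration of the sliding-window shift features, here is a minimal sketch of a max_level_shift-style computation. This is my assumed implementation, not the tsfeatures source; the window width is a free parameter.

```python
import numpy as np

def max_level_shift(x, width=10):
    # Largest jump in the mean between two consecutive
    # (non-overlapping) windows of length `width`, scanned
    # over all sliding positions.
    x = np.asarray(x, dtype=float)
    # means of every sliding window of length `width`
    means = np.convolve(x, np.ones(width) / width, mode="valid")
    # difference between the means of two windows `width` apart,
    # i.e. two back-to-back windows
    shifts = np.abs(means[width:] - means[:-width])
    return shifts.max()
```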


7. Number of crossing points: the number of times a time series crosses the median
8. Number of flat spots: the number of flat spots in a time series
9. Hurst coefficient: computes the Hurst coefficient, indicating the level of fractional differencing of a time series
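The median-crossing count and flat-spot count can be sketched as below. This is an assumed implementation, not the tsfeatures source; in particular, defining a flat spot as a run of values falling in the same equal-width bin (10 bins here) is my assumption.

```python
import numpy as np

def crossing_points(x):
    # Number of times the series crosses its median.
    x = np.asarray(x, dtype=float)
    above = x > np.median(x)
    return int(np.sum(above[1:] != above[:-1]))

def flat_spots(x, bins=10):
    # Longest run of values that stay inside the same
    # equal-width bin of the sample range.
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), bins + 1)[1:-1]
    binned = np.digitize(x, edges)
    run, longest = 1, 1
    for a, b in zip(binned[:-1], binned[1:]):
        run = run + 1 if a == b else 1
        longest = max(longest, run)
    return longest
```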


10. Autocorrelation-based features: Computes various measures based on autocorrelation coefficients of the original series, first-differenced series and second-differenced series
x_acf1
x_acf10
diff1_acf1
diff1_acf10
diff2_acf1
diff2_acf10
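The autocorrelation-based features can be sketched as follows. This is an assumed implementation, not the tsfeatures source: `acf` uses the standard biased sample autocorrelation, and the `*_acf10` values are the sums of squares of the first 10 autocorrelations.

```python
import numpy as np

def acf(x, lag):
    # Biased sample autocorrelation at the given lag.
    x = np.asarray(x, dtype=float)
    mu, var, n = x.mean(), x.var(), len(x)
    return np.sum((x[:n - lag] - mu) * (x[lag:] - mu)) / (n * var)

def acf_features(x):
    # Lag-1 ACF and sum of squares of the first 10 ACF values,
    # for the series and its first and second differences.
    x = np.asarray(x, dtype=float)
    d1, d2 = np.diff(x), np.diff(x, n=2)
    return {
        "x_acf1": acf(x, 1),
        "x_acf10": sum(acf(x, k) ** 2 for k in range(1, 11)),
        "diff1_acf1": acf(d1, 1),
        "diff1_acf10": sum(acf(d1, k) ** 2 for k in range(1, 11)),
        "diff2_acf1": acf(d2, 1),
        "diff2_acf10": sum(acf(d2, k) ** 2 for k in range(1, 11)),
    }
```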
11. Partial autocorrelation-based features: Computes various measures based on partial autocorrelation coefficients of the original series, first-differenced series and second-differenced series
x_pacf5
diff1_pacf5
diff2_pacf5

12. Parameter estimates of Holt's linear trend method: estimates the smoothing parameter for the level (alpha) and the smoothing parameter for the trend (beta)
13. Autocorrelation coefficient at lag 1 of the residual: Computes the first order autocorrelation of the residual series of the deterministic trend model

**. Strength of trend and seasonality of a time series:
Computes various measures of trend and seasonality of a time series based on an STL decomposition
- Summary statistics:
14. the length of a time series
15. the variance of a time series
16. the variance of the residuals
17. the variance of the detrend series
18. the variance of the deseasonal series
19. the number of seasonal periods
20. Measure of trend strength
21. Measure of seasonal strength
22. Find time of peak and trough for each component
23. Compute measure of spikiness
24. Compute measures of linearity and curvature
25. ACF of remainder
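The trend- and seasonal-strength measures can be sketched as below, assuming the components have already been obtained from an STL decomposition (e.g. statsmodels' `STL`). The formulas `1 - Var(remainder)/Var(component + remainder)`, clipped at zero, are the usual definitions; treat this as a sketch, not the tsfeatures source.

```python
import numpy as np

def decomposition_strengths(trend, seasonal, remainder):
    # Strength of trend: how much variance disappears when the
    # trend component is removed from the detrended-plus-remainder series.
    trend_strength = max(0.0, 1 - np.var(remainder) / np.var(np.asarray(trend) + np.asarray(remainder)))
    # Strength of seasonality, analogously.
    seasonal_strength = max(0.0, 1 - np.var(remainder) / np.var(np.asarray(seasonal) + np.asarray(remainder)))
    return trend_strength, seasonal_strength
```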
**. Heterogeneity coefficients
Computes various measures of heterogeneity of a time series. First the series is pre-whitened using an AR model to give a new series y.
We fit a GARCH(1,1) model to y and obtain the residuals, e. Then the four measures of heterogeneity are:
26. the sum of squares of the first 12 autocorrelations of y^2
27. the sum of squares of the first 12 autocorrelations of e^2
28. the R^2 value of an AR model applied to y^2
29. the R^2 value of an AR model applied to e^2
The statistics obtained from y^2 are the ARCH effects, while those from e^2 are the GARCH effects.

python
1. the absolute energy of the time series which is the sum over the squared values
E = \sum_{i=1,\ldots, n} x_i^2
2. the sum over the absolute value of consecutive changes in the series x
\sum_{i=1, \ldots, n-1} \mid x_{i+1}- x_i \mid
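The two formulas above have direct one-line implementations; this is a minimal sketch (tsfresh provides them as `abs_energy` and `absolute_sum_of_changes`).

```python
import numpy as np

def abs_energy(x):
    # Sum over the squared values of the series.
    x = np.asarray(x, dtype=float)
    return np.dot(x, x)

def absolute_sum_of_changes(x):
    # Sum of absolute differences between consecutive values.
    x = np.asarray(x, dtype=float)
    return np.abs(np.diff(x)).sum()
```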
3. Calculates the value of an aggregation function (e.g. var or mean) of the autocorrelation taken over all possible lags (1 to the length of x)
\frac{1}{n-1} \sum_{l=1,\ldots, n} \frac{1}{(n-l)\sigma^{2}} \sum_{t=1}^{n-l}(X_{t}-\mu )(X_{t+l}-\mu)
where n is the length of the time series X_i, \sigma^2 its variance and \mu its mean.
4. Calculates a linear least-squares regression for values of the time series that were aggregated over chunks versus the sequence from 0 up to the number of chunks minus one. The chunk size specifies how many time series values are in each chunk. Further, the aggregation function could be "max", "min", "mean" or "median".
5. Implements a vectorized Approximate entropy algorithm.
For short time-series this method is highly dependent on the parameters, but should be stable for N > 2000, see:
Yentes et al. (2012) - The Appropriate Use of Approximate Entropy and Sample Entropy with Short Data Sets
Other shortcomings and alternatives discussed in:
Richman & Moorman (2000) - Physiological time-series analysis using approximate entropy and sample entropy
6. This feature fits the unconditional maximum likelihood of an autoregressive AR(k) process. The k parameter is the maximum lag of the process
X_{t}=\varphi_0 +\sum _{{i=1}}^{k}\varphi_{i}X_{{t-i}}+\varepsilon_{t}
For each configuration from param, which should contain the maxlag "k", such an AR process is fitted. Then the coefficients \varphi_{i} whose index i is contained in "coeff" are returned.
7. The Augmented Dickey-Fuller test is a hypothesis test which checks whether a unit root is present in a time series sample. 
8.  Calculates the autocorrelation of the specified lag, according to the formula [1]
\frac{1}{(n-l)\sigma^{2}} \sum_{t=1}^{n-l}(X_{t}-\mu )(X_{t+l}-\mu)
where n is the length of the time series X_i, \sigma^2 its variance, \mu its mean, and l the lag.
Compare with 3.

9. First bins the values of x into max_bins equidistant bins. Then calculates the value of
- \sum_{k=0}^{min(max\_bins, len(x))} p_k log(p_k) \cdot \mathbf{1}_{(p_k > 0)}
where p_k is the percentage of samples in bin k.
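The binned-entropy formula above can be sketched as follows (tsfresh calls this `binned_entropy`); this is an assumed implementation using `np.histogram` for the equidistant binning.

```python
import numpy as np

def binned_entropy(x, max_bins=10):
    # Entropy of the distribution of x over max_bins equidistant bins.
    x = np.asarray(x, dtype=float)
    counts, _ = np.histogram(x, bins=max_bins)
    p = counts / len(x)
    p = p[p > 0]  # the indicator 1_{p_k > 0} drops empty bins
    return -np.sum(p * np.log(p))
```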

10. This function calculates the value of
\frac{1}{n-2lag} \sum_{i=0}^{n-2lag} x_{i + 2 \cdot lag}^2 \cdot x_{i + lag} \cdot x_{i}
which is
\mathbb{E}[L^2(X)^2 \cdot L(X) \cdot X]
where \mathbb{E} is the mean and L is the lag operator. It was proposed in [1] as a measure of non linearity in the time series.
References
[1] Schreiber, T. and Schmitz, A. (1997).
Discrimination power of measures for nonlinearity in a time series
PHYSICAL REVIEW E, VOLUME 55, NUMBER 5
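The c3 statistic above can be implemented with three shifted views of the series; a minimal sketch (tsfresh calls this `c3`), returning 0 when the series is too short for the lag:

```python
import numpy as np

def c3(x, lag):
    # Mean of x_{i+2*lag}^2 * x_{i+lag} * x_i over all valid i.
    x = np.asarray(x, dtype=float)
    n = len(x)
    if 2 * lag >= n:
        return 0.0
    return np.mean(x[2 * lag:] ** 2 * x[lag:n - lag] * x[:n - 2 * lag])
```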
11. First fixes a corridor given by the quantiles ql and qh of the distribution of x. Then calculates the average, absolute value of consecutive changes of the series x inside this corridor.
Think of selecting a corridor on the y-axis and only calculating the mean of the absolute change of the time series inside this corridor.
12. This is an estimate of time series complexity [1] (a more complex time series has more peaks, valleys, etc.). It calculates the value of
\sqrt{ \sum_{i=0}^{n-2} ( x_{i} - x_{i+1})^2 }
References
[1] Batista, Gustavo EAPA, et al (2014).
CID: an efficient complexity-invariant distance for time series.
Data Mining and Knowledge Discovery 28.3 (2014): 634-669.
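The complexity estimate above reduces to one line; a minimal sketch (tsfresh calls this `cid_ce`), with the optional z-normalization that the CID paper recommends:

```python
import numpy as np

def cid_ce(x, normalize=False):
    # Square root of the sum of squared consecutive differences.
    x = np.asarray(x, dtype=float)
    if normalize:
        x = (x - x.mean()) / x.std()  # z-normalize first if requested
    return np.sqrt(np.sum(np.diff(x) ** 2))
```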
13. the number of values in x that are higher than the mean of x

14. the number of values in x that are lower than the mean of x

15. Calculates a Continuous wavelet transform for the Ricker wavelet, also known as the “Mexican hat wavelet” which is defined by
\frac{2}{\sqrt{3a} \pi^{\frac{1}{4}}} (1 - \frac{x^2}{a^2}) exp(-\frac{x^2}{2a^2})
where a is the width parameter of the wavelet function.
This feature calculator takes three different parameters: widths, coeff and w. It calculates the CWT once for each different widths array, and then returns the values for the given coefficient coeff and width w. (For each dict in param, one feature is returned.)
16. Calculates the sum of squares of chunk i out of N chunks expressed as a ratio with the sum of squares over the whole series

17. the spectral centroid (mean), variance, skew, and kurtosis of the absolute fourier transform spectrum.
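The spectral moments in 17 can be sketched by treating the absolute FFT spectrum as a distribution over frequency bins. This is an assumed implementation (tsfresh calls this `fft_aggregated`); using the bin index as the frequency axis is my simplification.

```python
import numpy as np

def fft_aggregated(x):
    # Centroid, variance, skew and kurtosis of the absolute
    # rFFT spectrum, viewed as a distribution over bin indices.
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.arange(len(spectrum))
    p = spectrum / spectrum.sum()          # normalize to a distribution
    centroid = np.sum(freqs * p)
    variance = np.sum((freqs - centroid) ** 2 * p)
    skew = np.sum((freqs - centroid) ** 3 * p) / variance ** 1.5
    kurtosis = np.sum((freqs - centroid) ** 4 * p) / variance ** 2
    return centroid, variance, skew, kurtosis
```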

18. Calculates the Fourier coefficients of the one-dimensional discrete Fourier transform for real input, using the fast Fourier transform algorithm
A_k =  \sum_{m=0}^{n-1} a_m \exp \left \{ -2 \pi i \frac{m k}{n} \right \}, \qquad k = 0,
\ldots , n-1.
The resulting coefficients will be complex; this feature calculator can return the real part (attr=="real"), the imaginary part (attr=="imag"), the absolute value (attr=="abs") and the angle in degrees (attr=="angle").

19. Returns the first location of the maximum value of x. The position is calculated relative to the length of x.

20. the first location of the minimal value of x. The position is calculated relative to the length of x.

21. Coefficients of the polynomial h(x), which has been fitted to the deterministic dynamics of the Langevin model
\dot{x}(t) = h(x(t)) + \mathcal{N}(0,R)
as described by [1].
For short time-series this method is highly dependent on the parameters.
References
[1] Friedrich et al. (2000): Physics Letters A 271, p. 217-222
Extracting model equations from experimental data
22. Checks if any value in x occurs more than once

23. Checks if the maximum value of x is observed more than once

24. Checks if the minimal value of x is observed more than once
25. Calculates the relative index i where q% of the mass of the time series x lies to the left of i. For example, for q = 50% this feature calculator returns the mass center of the time series

26.  the kurtosis of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G2).

27. Boolean variable denoting if the standard deviation of x is higher than 'r' times the range (the difference between the max and min of x). Hence it checks if
std(x) > r * (max(X)-min(X))
According to a rule of thumb, the standard deviation should be about a fourth of the range of the values.

28. the last location of the maximum value of x. The position is calculated relative to the length of x.

29. the last location of the minimal value of x. The position is calculated relative to the length of x.
30. the length of x
31. Calculates a linear least-squares regression for the values of the time series versus the sequence from 0 to the length of the time series minus one. This feature assumes the signal to be uniformly sampled. It will not use the time stamps to fit the model. The parameters control which of the characteristics are returned.
Possible extracted attributes are “pvalue”, “rvalue”, “intercept”, “slope”, “stderr”, see the documentation of linregress for more information.
32. the length of the longest consecutive subsequence in x that is bigger than the mean of x

33. the length of the longest consecutive subsequence in x that is smaller than the mean of x

34. Largest fixed point of the dynamics, \arg\max_x \{h(x)=0\}, estimated from the polynomial h(x), which has been fitted to the deterministic dynamics of the Langevin model
\dot{x}(t) = h(x(t)) + \mathcal{N}(0,R)
as described by
Friedrich et al. (2000): Physics Letters A 271, p. 217-222. Extracting model equations from experimental data.
For short time-series this method is highly dependent on the parameters.

35. Calculates the highest value of the time series x.
36. the mean of x
37. the mean over the absolute differences between subsequent time series values, which is
\frac{1}{n} \sum_{i=1,\ldots, n-1} | x_{i+1} - x_{i}|
38. the mean over the differences between subsequent time series values, which is
\frac{1}{n} \sum_{i=1,\ldots, n-1}  x_{i+1} - x_{i}

39.  the mean value of a central approximation of the second derivative
\frac{1}{n} \sum_{i=1,\ldots, n-1}  \frac{1}{2} (x_{i+2} - 2 \cdot x_{i+1} + x_i)
40. the median of x

41. Calculates the lowest value of the time series x.
42. Calculates the number of crossings of x on m. A crossing is defined as two sequential values where the first value is lower than m and the next is greater, or vice-versa. If you set m to zero, you will get the number of zero crossings.
43. This feature calculator searches for different peaks in x. To do so, x is smoothed by a Ricker wavelet for widths ranging from 1 to n. This feature calculator returns the number of peaks that occur at enough width scales and with a sufficiently high signal-to-noise ratio (SNR).
44. Calculates the number of peaks of at least support n in the time series x. A peak of support n is defined as a subsequence of x where a value occurs, which is bigger than its n neighbours to the left and to the right.
Hence in the sequence
>>> x = [3, 0, 0, 4, 0, 0, 13]
4 is a peak of support 1 and 2 because in the subsequences
>>> [0, 4, 0]
>>> [0, 0, 4, 0, 0]
4 is still the highest value. Here, 4 is not a peak of support 3 because 13 is the third neighbour to the right of 4 and is bigger than 4.
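The support-n peak count can be sketched as below; a minimal assumed implementation (tsfresh calls this `number_peaks`), checking that a value strictly exceeds all n neighbours on each side. It reproduces the worked example above.

```python
import numpy as np

def number_peaks(x, n):
    # Count values strictly bigger than their n neighbours
    # on both the left and the right.
    x = np.asarray(x, dtype=float)
    count = 0
    for i in range(n, len(x) - n):
        left, right = x[i - n:i], x[i + 1:i + n + 1]
        if np.all(x[i] > left) and np.all(x[i] > right):
            count += 1
    return count
```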
45. Calculates the value of the partial autocorrelation function at the given lag. The lag k partial autocorrelation of a time series \lbrace x_t, t = 1 \ldots T \rbrace equals the partial correlation of x_t and x_{t-k}, adjusted for the intermediate variables \lbrace x_{t-1}, \ldots, x_{t-k+1} \rbrace ([1]). Following [2], it can be defined as
\alpha_k = \frac{ Cov(x_t, x_{t-k} | x_{t-1}, \ldots, x_{t-k+1})}
{\sqrt{ Var(x_t | x_{t-1}, \ldots, x_{t-k+1}) Var(x_{t-k} | x_{t-1}, \ldots, x_{t-k+1} )}}
with (a) x_t = f(x_{t-1}, \ldots, x_{t-k+1}) and (b) x_{t-k} = f(x_{t-1}, \ldots, x_{t-k+1}) being AR(k-1) models that can be fitted by OLS. Be aware that in (a), the regression is done on past values to predict x_t, whereas in (b), future values are used to calculate the past value x_{t-k}. It is said in [1] that "for an AR(p), the partial autocorrelations [ \alpha_k ] will be nonzero for k<=p and zero for k>p." With this property, it is used to determine the lag of an AR process.
References
[1] Box, G. E., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015).
Time series analysis: forecasting and control. John Wiley & Sons.
 
46. the percentage of unique values that are present in the time series more than once:
len(different values occurring more than once) / len(different values)
This means the percentage is normalized to the number of unique values, in contrast to percentage_of_reoccurring_datapoints_to_all_datapoints.

47. the ratio of data points that are present in the time series more than once:
# of data points occurring more than once / # of all data points
This means the ratio is normalized to the number of data points in the time series, in contrast to percentage_of_reoccurring_values_to_all_values.
48. Calculates the q quantile of x. This is the value of x greater than q% of the ordered values from x.

49. Count observed values within the interval [min, max)
50. Ratio of values that are more than r*std(x) (so r sigma) away from the mean of x. 
51. Returns a factor which is 1 if all values in the time series occur only once, and below one if this is not the case. In principle, it just returns
# unique values / # values
52. Calculates and returns the sample entropy of x.
53. the sample skewness of x (calculated with the adjusted Fisher-Pearson standardized moment coefficient G1).
54. This feature calculator estimates the cross power spectral density of the time series x at different frequencies. To do so, the time series is first shifted from the time domain to the frequency domain.
The feature calculator returns the power spectrum of the different frequencies.
55. the standard deviation of x
56. the sum of all data points that are present in the time series more than once
57. the sum of all values that are present in the time series more than once
58. Calculates the sum over the time series values
59. Boolean variable denoting if the distribution of x looks symmetric. This is the case if
| mean(X)-median(X)| < r * (max(X)-min(X))

60. This function calculates the value of
\frac{1}{n-2lag} \sum_{i=0}^{n-2lag} x_{i + 2 \cdot lag}^2 \cdot x_{i + lag} - x_{i + lag} \cdot  x_{i}^2
which is
\mathbb{E}[L^2(X)^2 \cdot L(X) - L(X) \cdot X^2]
where \mathbb{E} is the mean and L is the lag operator. It was proposed in [1] as a promising feature to extract from time series.
References
[1] Fulcher, B.D., Jones, N.S. (2014).
Highly comparative feature-based time-series classification.
Knowledge and Data Engineering, IEEE Transactions on 26, 3026–3037.
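The time-reversal asymmetry statistic above can be sketched with three shifted views of the series; a minimal assumed implementation (tsfresh calls this `time_reversal_asymmetry_statistic`), returning 0 when the series is too short for the lag:

```python
import numpy as np

def time_reversal_asymmetry_statistic(x, lag):
    # Mean of x_{i+2*lag}^2 * x_{i+lag} - x_{i+lag} * x_i^2 over all valid i.
    x = np.asarray(x, dtype=float)
    n = len(x)
    if 2 * lag >= n:
        return 0.0
    one = x[2 * lag:]       # x_{i+2*lag}
    two = x[lag:n - lag]    # x_{i+lag}
    return np.mean(one ** 2 * two - two * x[:n - 2 * lag] ** 2)
```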

61. Counts occurrences of a given value in the time series x.
62. the variance of x 
63. Boolean variable denoting if the variance of x is greater than its standard deviation, which is equivalent to the variance of x being larger than 1.
