Fast, Interpretable, and Deterministic Time Series Classification with a Bag-Of-Receptive-Fields
When working with time series classification, the most accurate models today are often convolution-based methods such as ROCKET, MiniRocket, Hydra, and InceptionTime, or hybrid models like MR-Hydra. These approaches represent time series in highly complex, non-interpretable feature spaces. Their strengths are clear: they are both accurate and efficient. But they also come with downsides: they are often opaque, making it hard to understand their decisions, and they can be non-deterministic, producing slightly different outputs each time they run.
On the other hand, interpretable models, such as shapelet-based (ST, RDST), interval-based (DR-CIF, RSTSF), and dictionary-based (MR-SEQL, MR-SQM, WEASEL+MUSE) approaches, focus on building features that are easier for humans to understand. These methods are reasonably accurate and efficient and, most importantly, interpretable. However, even here, many remain non-deterministic.
The goal of the Bag-of-Receptive-Fields (BORF) is to design a time series transformation that is accurate, efficient, interpretable, and deterministic, all at once.
From Bag-of-Patterns to Bag-of-Receptive-Fields
One early interpretable approach is the Bag-of-Patterns (BOP) by Lin et al. (2012). It works by extracting small “words” from a time series and counting how often they occur.
BOP’s advantages are its determinism and interpretability. But it is inefficient, achieves relatively poor accuracy, and was originally limited to univariate series. With the Bag-of-Receptive-Fields (BORF), we improve and extend BOP to address these limitations.
BORF
Step 1. - PAA (Naive)
BORF is built on the Piecewise Aggregate Approximation (PAA). Given a univariate time series $X$ of length $m$, the naive method used in BOP works like this:
Sliding Window
Extract all $v$ overlapping windows of length $w$ from the time series, where $v = m - w + 1$ for unit stride (we visualize only the first five in the image).
Segmenting
Split each window into $l$ segments of length $q$ (assuming $w$ divisible by $l$).
Averaging
Take the average of each segment.
The problem with this naive approach is its complexity:
\[O(v \times w)\]
This can be quadratic in the length of the time series, i.e., $O(m^2)$, for a sufficiently large $w$, since $v$ is itself of the order of $m$.
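To make this concrete, here is a minimal NumPy sketch of the naive procedure (illustrative only, not the reference implementation; the function and variable names are ours):

```python
import numpy as np

def paa_naive(x, w, l):
    """Naive PAA used in BOP: for every window, average each of its l segments.

    x : 1D array, the time series (length m)
    w : window length (assumed divisible by l)
    l : number of segments per window (the word length)
    """
    m = len(x)
    q = w // l                      # segment length
    v = m - w + 1                   # number of sliding windows (stride 1)
    M = np.empty((v, l))
    for i in range(v):              # O(v) windows ...
        window = x[i:i + w]
        for j in range(l):          # ... each costing O(w), since every segment
            M[i, j] = window[j * q:(j + 1) * q].mean()  # mean is recomputed from scratch
    return M
```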
Step 1. - PAA (Optimized)
The PAA values can be stored in a matrix of shape $v\times l$:
\[M= \underbrace{ \left[ \begin{array}{ccc} & \vdots & \\ \dots & {\color{green}\mu_{i,j}} & \dots \\ & \vdots & \\ \end{array} \right] }_{l} \left.\vphantom{\begin{array}{c} \mu_{1,1} \\ \vdots \\ \mu_{v,1} \end{array}}\right\} ~_{v}\]
So, if we precompute all possible segment averages, we can collect them by iterating over $v\times l$ entries, instead of $v\times w$ as before. Indeed, the naive computation repeats averages unnecessarily. Instead, we compute each segment average once, using a moving average, $\boldsymbol\mu^{\textit{seg}}$, with a window equal to the segment length $q$, and collect the averages into $M$:
\[{\color{green}{\mu_{i,j}}}= \color{orange}\mu^{\textit{seg}}_{1 + (i-1) +(j-1) q}\]
The formula complicates slightly if we borrow dilation ($d$) and stride ($s$) from convolution-based methods to define receptive fields:
\[{\color{green}{\mu_{i,j}}}= \color{orange} \mu^{\textit{seg}}_{1 + (i-1) \mathop{s}\limits_{\color{black}\uparrow} +(j-1) \mathop{d}\limits_{\color{black}\uparrow} \, q}\]
In this case, segments can skip observations (dilation), and consecutive windows need not start at adjacent positions (stride). A small detail here: to compute the moving average with dilation $d$, we partition the time series into $d$ interleaved subsequences, compute a standard length-$q$ moving average on each subsequence, and then interleave the results back into the original index order.
This gives an optimized complexity:
\[\underbrace{O(v \times l)}_{\text{OPTIMIZED}}\le \underbrace{O(v \times w)}_{\text{NAIVE}}\]
In practice, this means that the per-window cost no longer depends on the window length, but only on the word length $l$. This is very useful as the length of the time series grows.
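Below is a sketch of this optimized computation: a (dilated) moving average is precomputed once, and the $v \times l$ entries of $M$ are gathered by indexing. The function names (`moving_average`, `dilated_moving_average`, `paa_fast`) are introduced here for illustration and are not the paper's code:

```python
import numpy as np

def moving_average(x, q):
    """Length-q moving average: out[t] = mean(x[t:t+q])."""
    c = np.cumsum(np.insert(x.astype(float), 0, 0.0))
    return (c[q:] - c[:-q]) / q

def dilated_moving_average(x, q, d):
    """Mean of q observations spaced d apart: out[t] = mean(x[t], x[t+d], ..., x[t+(q-1)d]).

    Computed by splitting x into d interleaved subsequences, running a standard
    length-q moving average on each, and interleaving the results back.
    """
    m = len(x)
    out = np.empty(m - (q - 1) * d)
    for r in range(d):
        out[r::d] = moving_average(x[r::d], q)   # subsequence x[r], x[r+d], ...
    return out

def paa_fast(x, w, l, d=1, s=1):
    """Collect segment means by indexing the precomputed (dilated) moving average.

    M[i, j] = mu_seg[i*s + j*q*d] (the 0-indexed version of the formula above),
    so filling M costs O(v * l) instead of O(v * w).
    """
    q = w // l                                   # samples per segment
    mu_seg = dilated_moving_average(x, q, d)
    rf = (w - 1) * d + 1                         # receptive-field span of one window
    v = (len(x) - rf) // s + 1                   # number of windows
    i = np.arange(v)[:, None] * s                # window start positions
    j = np.arange(l)[None, :] * q * d            # segment offsets inside a window
    return mu_seg[i + j]
```

With `d=1` and `s=1` this reproduces exactly the naive `paa_naive` output, only faster.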
Step 2. - SAX
Once we have the averages, we normalise each using its window mean and standard deviation (again computed as moving quantities):
\[\text{z-score}({\color{green}\mu_{i,j}})=\frac{\color{green}\mu_{i,j} - \color{red}\mu^{\textit{win}}_{is}}{\color{purple}\sigma^{\textit{win}}_{is}}\]
We then bin these values into symbols according to a chosen alphabet size:
Next, we hash the symbol sequences (our “words”) into integers and count their frequency:
Finally, we record them as a sparse coordinate tensor (COO format), keeping track of which time series and signal each pattern came from:
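Continuing the same sketch (it reuses `paa_fast` and `dilated_moving_average` from above), the discretization, hashing, and counting steps could look as follows; the Gaussian breakpoints and the base-3 hash are illustrative choices, not necessarily the exact ones used in the paper:

```python
import numpy as np

# Standard SAX breakpoints for alphabet size 3 (equiprobable bins under a
# standard normal). The binning rule is an assumption of this sketch.
BREAKPOINTS = np.array([-0.4307, 0.4307])

def sax_transform(x, w, l, d=1, s=1):
    """Turn one signal into integer word ids, one per sliding window."""
    q = w // l
    M = paa_fast(x, w, l, d, s)                        # (v, l) segment means
    starts = np.arange(M.shape[0]) * s                 # window start positions
    win_mean = dilated_moving_average(x, w, d)[starts]
    win_var = dilated_moving_average(x**2, w, d)[starts] - win_mean**2
    win_std = np.sqrt(np.clip(win_var, 1e-12, None))   # guard against zero variance
    Z = (M - win_mean[:, None]) / win_std[:, None]     # z-score each window's segments
    S = np.digitize(Z, BREAKPOINTS)                    # symbols in {0, 1, 2}
    return S @ (3 ** np.arange(l))                     # hash each word as a base-3 integer

def bag_of_words(X, w, l, d=1, s=1):
    """Count word occurrences for a collection X of shape (n_series, n_signals, m).

    Returns COO-style coordinates (series, signal, word) and their counts.
    """
    coords, counts = [], []
    for n, series in enumerate(X):
        for c, signal in enumerate(series):
            words, freq = np.unique(sax_transform(signal, w, l, d, s),
                                    return_counts=True)
            for word, f in zip(words, freq):
                coords.append((n, c, int(word)))
                counts.append(int(f))
    return coords, counts
```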
Multiple Representations
To capture patterns at different scales, we repeat the process using multiple parameter combinations:
- Window lengths: $[2^2, 2^3, \dots]$
- Dilations: $[2^0, 2^1, \dots]$
- Word lengths: $[2, 4, 8]$
We fix the alphabet size to 3, set stride to 1, and flatten + concatenate all features.
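A sketch of how the multi-resolution bag could be assembled from the functions above; the grid upper bounds are fixed here for brevity (in the paper they are tied to the series length), and a dense matrix is used only for readability, whereas the actual transform stays sparse:

```python
import numpy as np

def borf_like_features(X, windows=(4, 8, 16), dilations=(1, 2, 4), word_lengths=(2, 4, 8)):
    """Flatten and concatenate bags over several (window, dilation, word-length) combinations."""
    n, n_signals, m = X.shape
    blocks = []
    for w in windows:
        for d in dilations:
            for l in word_lengths:
                if l > w or w % l != 0 or (w - 1) * d + 1 > m:
                    continue                              # receptive field must fit
                counts = np.zeros((n, n_signals * 3 ** l))
                coords, vals = bag_of_words(X, w, l, d, s=1)
                for (i, c, word), f in zip(coords, vals):
                    counts[i, c * 3 ** l + word] = f      # one column per (signal, word)
                blocks.append(counts)
    return np.hstack(blocks)
```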
BORF is implemented in the aeon library and can be plugged into any machine learning pipeline.
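For example, a hypothetical pipeline could look like the following; the BORF import path is an assumption, so check the aeon documentation of your installed version for the exact location of the transformer:

```python
from aeon.datasets import load_basic_motions
from sklearn.linear_model import RidgeClassifierCV
from sklearn.pipeline import make_pipeline

# NOTE: this import path is an assumption about where BORF is exposed in aeon.
from aeon.transformations.collection.dictionary_based import BORF

# Any collection shaped (n_cases, n_channels, n_timepoints) works here.
X_train, y_train = load_basic_motions(split="train")
X_test, y_test = load_basic_motions(split="test")

pipe = make_pipeline(BORF(), RidgeClassifierCV())
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```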
Performance and Explanation
On the UEA and UCR benchmarks, BORF shows strong accuracy while being much faster than many competitors.
Because BORF produces a tabular representation, we can apply standard explainability methods like SHAP:
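For instance, a minimal SHAP sketch over the tabular features; the classifier choice and the variable names are illustrative, not the exact setup used in the paper:

```python
import shap
from sklearn.ensemble import RandomForestClassifier

# X_borf: tabular BORF features (n_cases, n_features); y: class labels.
# Both are placeholders for the output of the transform above.
model = RandomForestClassifier(random_state=0).fit(X_borf, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_borf)   # per-class, per-feature attributions

# Each column corresponds to one (window, dilation, word) pattern, so the
# attributions can be mapped back onto the original time series.
```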
On the CBF dataset, we can inspect both local and global explanations:
The picture below shows an instance correctly classified as a Cylinder. On the left, we show a saliency map, obtained by mapping the importance of each contained pattern back to the original time series. The most important points seem to be in the center of the time series, with the edges mainly composed of noise.
On the right, we show the most important absent pattern (0,1,1,2 or low-medium-medium-high), a rising one.
The global explanation below shows the behavior of the pattern in the full dataset. On the left, the medoid (in dark gray), and all possible alignments of that pattern in the dataset (light gray). On the right, we show the frequency of that pattern (x-axis) depending on the label (y-axis), colored by its importance. So low-medium-medium-high is contained many times in Bell instances, and when that pattern is contained many times, it pushes the prediction towards Bell. When the pattern is contained only a few times, or not contained at all, it pushes the prediction either towards Cylinder or Funnel.
We also test the approach on real-world datasets like FingerMovements. Here is the paper reference for more details:
@article{spinnato2024fast,
title={Fast, interpretable and deterministic time series classification with a bag-of-receptive-fields},
author={Spinnato, Francesco and Guidotti, Riccardo and Monreale, Anna and Nanni, Mirco},
journal={IEEE Access},
year={2024},
publisher={IEEE}
}