iTwin IoT - Analytics Module Part 1: What does the Correlation Analysis tool do?

Overview

This article provides a brief overview of the Correlation Analysis tool and offers information about how the tool works. The remaining articles in this section of the product wiki will guide you in using the Analytics module to create Correlation alerts.

What does the Correlation Analysis do?

The Correlation Analysis tool lets you select a target sensor and then one or more sensors to compare against it. This helps users identify which factors are most strongly linked to changes in key sensor metrics (such as piezometric levels in a dam) and quantify the strength of these relationships (e.g. piezometric levels vs. water elevation). The tool then helps you configure correlation alerts based on the selected risk level using a confidence interval, to help drive more informed, data-driven decisions about the need for asset intervention and maintenance.

The application will then run an analysis and generate a linear correlation matrix heatmap, showing the strength and direction of relationships between the target sensor and other selected metrics. This visual representation helps users quickly assess correlations between variables.

The analysis also generates a Variable Importance bar chart highlighting the overall importance (linear and non-linear) of variables in relation to the target sensor metric, aiding in feature selection and understanding the significance of each metric.

Digging further into the analysis results, you can also evaluate best-fit linear and quadratic correlation formulas and create correlation alerts from them.

How does the Correlation Analysis work?

This section will provide behind-the-scenes information about how the tool works so that users can better understand the tool and trust their analyses.

Sampling Interval

The Sampling Interval is a function toggle under Advanced settings in the Analysis configuration.

Sampling interval is the amount of time between two consecutive data points in a time series used for the analysis. The sampling interval averages each sensor's data over each period, independently.

The example below shows a sampling interval set to 1 hour.
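In code terms, the averaging can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: the function name resample_mean, the alignment of buckets to the Unix epoch, and the sample readings are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def resample_mean(readings, interval):
    """Average (timestamp, value) readings into fixed-width time buckets."""
    epoch = datetime(1970, 1, 1)  # assumed bucket alignment
    buckets = defaultdict(list)
    for ts, value in readings:
        # Floor each timestamp to the start of the interval it falls in.
        index = (ts - epoch) // interval
        buckets[epoch + index * interval].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Hypothetical piezometer readings; each sensor would be resampled on its own.
piezometer = [
    (datetime(2024, 1, 1, 0, 5), 10.0),
    (datetime(2024, 1, 1, 0, 35), 12.0),
    (datetime(2024, 1, 1, 1, 10), 11.0),
]
hourly = resample_mean(piezometer, timedelta(hours=1))
```

With a 1 hour interval, the two readings taken between 00:00 and 01:00 collapse into a single averaged point, and the 01:10 reading becomes the point for the next hour.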

Outlier Removal

The Outlier Removal is a function toggle under Advanced settings in the Analysis configuration.

If enabled, outliers (not anomalies) are removed for the correlation analysis, but all data is plotted in the correlation graphs for reference. A Hampel Filter is applied to the data to remove outliers.

Hampel Filter: A sliding window of size 10 that flags points whose deviation from the local median exceeds 6 times the scaled Median Absolute Deviation (MAD), where MAD = median(|x_i - median(x)|).


No outlier removal is used in small datasets (n < 100).
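The filter described above can be sketched as follows. This is a simplified illustration under stated assumptions: the window is roughly centred on each point (window // 2 neighbours on each side), and the MAD scale factor of 1.4826 is the conventional constant for normally distributed data, which this article does not specify. The function name hampel_outliers is hypothetical.

```python
from statistics import median

def hampel_outliers(values, window=10, n_sigmas=6.0, scale=1.4826):
    """Return the indices of points flagged by a sliding-window Hampel filter."""
    if len(values) < 100:
        return []  # small datasets are left untouched, as stated above
    half = window // 2  # assumed: window centred on the current point
    flagged = []
    for i, x in enumerate(values):
        neighbours = values[max(0, i - half): i + half + 1]
        local_median = median(neighbours)
        # MAD = median(|x_i - median(x)|) over the local window
        mad = median(abs(v - local_median) for v in neighbours)
        if abs(x - local_median) > n_sigmas * scale * mad:
            flagged.append(i)
    return flagged

# Hypothetical series: a repeating pattern with one spike at index 60.
readings = [float(i % 7) for i in range(120)]
readings[60] = 500.0
flagged = hampel_outliers(readings)
cleaned = [v for i, v in enumerate(readings) if i not in flagged]
```

Only the flagged points would be left out of the correlation calculation; the full series can still be plotted for reference.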

Linear Correlation

The tool computes the linear correlation between pairs of variables in a set of sensor time series data. A correlation value closer to +1 or -1 indicates that the relationship between the two metrics is well described by a straight line, while a value near 0 indicates little or no linear relationship. This helps in understanding dependencies, identifying trends, and detecting redundancies in sensor data.
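As an illustration, a linear correlation matrix can be built from pairwise coefficients. The sketch below assumes the Pearson correlation coefficient, which is the usual choice for linear correlation but is not named in this article. The sensor names and values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(series):
    """Nested dict of pairwise correlations, one row per sensor metric."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names} for a in names}

# Hypothetical, already time-aligned sensor metrics.
series = {
    "water_elevation": [1.0, 2.0, 3.0, 4.0, 5.0],
    "piezometer": [2.1, 3.9, 6.2, 8.0, 9.9],
    "temperature": [5.0, 4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(series)
```

Here the piezometer row would show a value close to +1 against water elevation and close to -1 against temperature, which is the strength and direction information the heatmap conveys.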

Relative Importance

The Relative Importance calculation is used in the Variable Importance bar charts.

The analysis builds a machine learning model that uses all sensor metric data to predict (fit a regression model over) the values of the target sensor metric. It then looks at the contribution of each sensor to that model.

Since the variable importance values are relative to each other, and they always sum up to 1, a large value does not necessarily mean a high correlation exists between that metric and the target metric.
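A small arithmetic example shows why. The sketch below only illustrates the normalisation step; the raw scores are invented, and the model that would produce them is not specified in this article.

```python
def relative_importance(raw_scores):
    """Scale raw importance scores so that they sum to 1."""
    total = sum(raw_scores.values())
    return {name: score / total for name, score in raw_scores.items()}

# Hypothetical raw scores from a model in which neither metric explains much.
weak = relative_importance({"air_temperature": 0.02, "rainfall": 0.01})
```

In this case air_temperature ends up with roughly two thirds of the total importance even though both raw contributions are small, so the bar chart is best read alongside the correlation values and fit scores.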

Regression Models and Fit Scores

The polynomial regression tool fits the best linear or quadratic regression model using historical correlation data between two time series. The tool provides the regression equation along with the R² score to indicate the goodness of fit. It uses least squares error to model relationships between two sensor time series. The equations supported at the moment are:

Linear (1st degree):
y = ax + b
Quadratic (2nd degree):
y = ax^2 + bx + c

R² value

R² = 1 - least_square_error / variance
Adjusted R² = 1 - (1 - R²)(n - 1) / (n - degree)

n = number of data points
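The fit and both scores can be reproduced with a short sketch. It solves the least squares normal equations directly and is only an illustration: the function names and sample data are assumptions, and the adjusted R² uses the denominator (n - degree) exactly as written above.

```python
def polyfit(x, y, degree):
    """Least squares polynomial coefficients, highest power first."""
    size = degree + 1
    # Normal equations A c = b, with c ordered from lowest to highest power.
    a = [[sum(xi ** (r + c) for xi in x) for c in range(size)] for r in range(size)]
    b = [sum(yi * xi ** r for xi, yi in zip(x, y)) for r in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in reversed(range(size)):
        known = sum(a[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (b[row] - known) / a[row][row]
    return coeffs[::-1]

def predict(coeffs, xi):
    """Evaluate the polynomial at xi (Horner's method)."""
    result = 0.0
    for c in coeffs:
        result = result * xi + c
    return result

def r_squared(x, y, coeffs):
    """R² = 1 - squared error / total variation (the 1/n factors cancel)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(coeffs, xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, degree):
    """Adjusted R² with the denominator as given in this article."""
    return 1 - (1 - r2) * (n - 1) / (n - degree)

# Hypothetical paired sensor readings that follow a roughly quadratic trend.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.9, 9.2, 18.8, 33.1, 51.0]
linear, quadratic = polyfit(x, y, 1), polyfit(x, y, 2)
r2_linear, r2_quadratic = r_squared(x, y, linear), r_squared(x, y, quadratic)
adj_quadratic = adjusted_r_squared(r2_quadratic, len(x), 2)
```

For data like this the quadratic model scores a higher R² than the linear one, which is the kind of comparison the fit scores are meant to support.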

Confidence Interval

The user can pick a symmetrical confidence value, or set the lower and upper lines independently to achieve a non-symmetric setup.

The center line in red is the best fit curve, and the presented confidence band around it refers to the region or range around the regression curve (whether linear or quadratic) that accounts for the uncertainty in the predicted values of the dependent variable (y axis) for given values of the independent variable (x axis).

When you decrease the confidence value from the default 95% to, say, 70%, the confidence band gets narrower and contains less of the historical data.

Given a normal, or Gaussian, distribution, the confidence intervals can be illustrated as such:

You can imagine this might roughly be the historical distribution of the y-axis sensor metric data at a given x-axis value (a vertical line in the graph).

The shaded regions highlight the range of values within each confidence interval, while dashed black lines mark their boundaries. The percentage labels indicate the confidence levels at the center of each horizontal line. The lower and upper bounds of each confidence interval are also displayed, showing the corresponding percentage values at each of the boundary points.

For a 95% confidence interval (C), the lower (L) and upper (U) values should be set to 2.5% and 97.5%, respectively, for a symmetric setup:

L = (1-C)/2 = (1-0.95)/2 = 0.025 = 2.5%
U = (1+C)/2 = (1+0.95)/2 = 0.975 = 97.5%
C = U − L = 0.975 - 0.025 = 0.95 = 95%

When adjusting the confidence band manually (the second option under the confidence interval section, potentially to achieve a non-symmetric setup), the values should satisfy the condition:

1 > U > 0.5
0.5 > L > 0

If the user modifies the confidence value, the starting lower and upper bounds should reflect this formula. For example, a 90% confidence interval would set the lower to 5% and the upper to 95%. The app should ensure that the lower value is always smaller than the upper, and both should be between 0 and 1. Additionally, when switching from a manually adjusted band to a symmetric setup, the app should automatically adjust the confidence value and boundaries to reflect the current manual settings.

When setting the lower value, the lowest the user can go is 0%, which gives the widest possible band on the lower side. The lower value cannot exceed 50%, at which point the lower line coincides with the red best-fit line.
When setting the upper value, the highest the user can go is 100%, which gives the widest possible band on the upper side. The upper value cannot go below 50%, at which point the upper line coincides with the red best-fit line.
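The bound rules above can be summarised in a short sketch. It expresses the values as fractions between 0 and 1 and applies the strict inequalities listed earlier; whether the app accepts the end values exactly is not something this sketch settles. The function names are hypothetical, and to_symmetric is one possible reading of how a manual band could be turned back into a symmetric one.

```python
def symmetric_bounds(confidence):
    """Lower and upper fractions for a symmetric band: L = (1-C)/2, U = (1+C)/2."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return (1 - confidence) / 2, (1 + confidence) / 2

def validate_manual_bounds(lower, upper):
    """Check the manual-band conditions 0 < L < 0.5 < U < 1."""
    if not (0 < lower < 0.5 < upper < 1):
        raise ValueError("expected 0 < lower < 0.5 < upper < 1")
    return lower, upper

def to_symmetric(lower, upper):
    """Assumed behaviour: keep the coverage C = U - L and re-centre the band."""
    validate_manual_bounds(lower, upper)
    confidence = upper - lower
    return confidence, symmetric_bounds(confidence)

lower_95, upper_95 = symmetric_bounds(0.95)            # about 0.025 and 0.975
confidence, (lower, upper) = to_symmetric(0.10, 0.95)  # coverage of about 0.85
```

In this reading, a manual band of 10% to 95% covers 85% of the distribution, so switching to the symmetric setup would yield a confidence value of 85% with bounds of 7.5% and 92.5%.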

As for the impact of the confidence interval on Correlation alerts: as the confidence value is decreased, the resulting band contains less of the data. An alert created from the identified thresholds (the upper and lower lines of the band) would therefore trigger more often, since more of the data is likely to fall outside the band.
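The effect can be illustrated with a simplified sketch. It assumes the band is placed at normal-distribution quantiles around the best-fit curve with a constant spread (sigma), which is a simplification for illustration and not necessarily how the product computes the band. The residuals (observed value minus best-fit value) are invented.

```python
from statistics import NormalDist

def band_offsets(lower, upper, sigma):
    """Offsets of the lower and upper lines from the best-fit curve."""
    standard_normal = NormalDist()
    return standard_normal.inv_cdf(lower) * sigma, standard_normal.inv_cdf(upper) * sigma

def count_alerts(residuals, lower, upper, sigma):
    """Count the points that fall outside the band."""
    low, high = band_offsets(lower, upper, sigma)
    return sum(1 for r in residuals if r < low or r > high)

# Hypothetical residuals, in the units of the y-axis sensor metric.
residuals = [-2.5, -1.5, -1.2, -0.4, 0.0, 0.3, 0.9, 1.1, 1.4, 2.2]
alerts_95 = count_alerts(residuals, 0.025, 0.975, sigma=1.0)  # 2 points outside
alerts_70 = count_alerts(residuals, 0.15, 0.85, sigma=1.0)    # 6 points outside
```

The same ten points produce two alerts with the 95% band and six with the 70% band, which matches the behaviour described above.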