# Data interpolation with Kriging

## Introduction: Rationale of the kriging approach

Environmental features, such as coastal bathymetry or seabed composition, are the product of many interacting geophysical processes. These processes are in principle deterministic, but their interactions are so complex that the variation seems random. This complexity and incomplete knowledge of the processes means that a deterministic or mathematical solution to quantify the variation is out of reach. Statistical methods may therefore be used to describe these characteristics..

Suppose one has to predict the value of a variable from experimental data in a location where no measurements are available, knowing that the underlying process has a significant random component. A common procedure is to fit an a priori chosen parameterized function to the data by minimizing the sum of squared deviations, the so-called ordinary least squares method (OLS). However, this method has a number of shortcomings. It does not provide the best possible estimate due to the arbitrariness of the chosen interpolation function. OLS does not give the true uncertainty of the prediction; an estimate based on the squared residuals is always significantly smaller than the real uncertainty. OLS does not take into account any correlations between the data. In cases where the measurement locations are not uniformly distributed around the target location, OLS will generally overweight data clusters compared to isolated data closer to the prediction location. The latter deficiency can be mitigated by linear regression using generalized least squares (GLS).

Kriging is a method to overcome the shortcomings of the least-square methods. It is basically a linear regression method for estimating point values (or spatial averages) at any location of a region. The concept was developed by D. Krige (1951) and the theoretical foundation was given by G. Matheron (1969). Kriging refers to a group of geostatistical interpolation methods in which the value at an unobserved location is predicted by a linear combination of the values at surrounding locations, using weights according to a model that takes into account the spatial correlation. Kriging provides unbiased estimates with minimum variance.

In the following we present the basics of the theory underlying kriging following Goovaerts (1997).

## Kriging principles

Observations provide a set of data $z_i \equiv z(\vec{x_i}), \; i=1, .., n$ in spatial locations $\vec{x_i}$, which are considered to be a particular realization of a process $Z$ with a significant random component. Measuring locations $\vec{x_i}, \; i=1, .. n$ are selected in a close neighborhood of a target location $\vec{x_0}$ where we want to predict the value of $Z$. The random component $\epsilon(\vec{x})$ of the observations represents the result of unobserved small-scale processes and observation errors.

The variable $Z(\vec{x_i})$ representing the random process is characterized by a smoothly varying deterministic component $m$ and a stochastic component $\epsilon$,

$Z(\vec{x})=m(\vec{x})+\epsilon(\vec{x}), \qquad (1)$

with $E[Z(\vec{x})]=m(\vec{x}), \; E[\epsilon(\vec{x})]=0 , \;$ where $E[f]$ designates the statistical expected value of the random function $f$.

The estimator $Z^*$ of $Z$ at location $\vec{x_0}$ is constructed as a weighted sum of the values of $Z$ at the observed neighborhood locations $\vec{x_i}, i=1, .. ,n$:

$Z^*(\vec{x_0}) = \sum_{i=1}^{n} w_i Z_i = m_0 + \epsilon (\vec{x_0}) = m_0 + \sum_{i=1}^{n} w_i \big( Z_i - m_i \big) , \qquad (2)$

where $Z_i \equiv Z(\vec{x_i}), \; m_0 \equiv m(\vec{x_0}), \; m_i \equiv m(\vec{x_i}) = E[Z(\vec{x_i})]$ and $w_i \equiv w_i(\vec{x_0})$ for $i=1, .. ,n$.

The predictor $Z^*$ is unbiased, because $E[Z^*(\vec{x_0})]=m_0 =E[Z(\vec{x_0})] .$

The kriging estimator (2) can also be written $Z^*(\vec{x_0}) = \sum_{i=1}^{n} w_i Z_i + m_0 - \sum_{i=1}^{n} w_i m_i , \quad$ showing that

$\quad m_0 - \sum_{i=1}^{n} w_i m_i = 0 . \qquad (3)$

The basic principles of the kriging method can be summarized as follows:

• ASSUMPTION A: the spatial distribution of the variable $Z(\vec{x})$ has small-scale fluctuations $\epsilon(\vec{x})$ that are not well captured by the observation grid $\vec{x_i}$;
• ASSUMPTION B: the fluctuations $\epsilon_i \equiv \epsilon(\vec{x_i})$ at the observation points $\vec{x_i}$ can be considered as representing a stochastic process which is approximately Gaussian-distributed with zero mean;
• ASSUMPTION C: the fluctuations $\epsilon_i$ at neighboring observation points are significantly correlated;
• ASSUMPTION D of translation/rotation invariance: the covariances $C(r_{ij}) \equiv E[\epsilon_i \epsilon_j]$ depend only on the distances $r_{ij} \equiv |\vec{x_i} - \vec{x_j}|$ for $i,j=0,1, .., n$;
• METHOD: the estimator $Z^*(\vec{x_0})$ for interpolation at an intermediate location $\vec{x_0}$ is given by the weighted sum of the values of the variable $Z(\vec{x})$ at neighboring observation points $\vec{x_i}$, where the weights $w_i \equiv w_i(\vec{x_0})$ are determined by requiring that the squared error $\sigma_0^2 \equiv E \Big[ \big( Z^*(\vec{x_0}) - Z(\vec{x_0}) \big)^2 \Big]$ is minimal, taking into account the condition (3).

The METHOD implies

$\Large\frac{\partial \sigma_0^2}{\partial w_i}\normalsize = 0, \; i=1, .., n ,\qquad (4)$

where the partial derivates are subject to the condition (3).

## Ordinary Kriging (OK)

In ordinary kriging (OK) it is assumed that there are sufficient data locations $\vec{x_i}$ close to $\vec{x_0}$ that it is possible to choose a neighborhood in which $m(\vec{x})$ is approximately constant,

$E[Z_i] \equiv m_i \approx m(\vec{x_0}) \equiv m . \qquad (5)$

The predictor $Z^*$ is written as:

$Z^*(\vec{x_0}) = m + \sum_{i=1}^n w_i (Z_i -m) = \sum_{i=1}^n w_i Z_i + m \big(1 - \sum_{i=1}^n w_i \big). \qquad (6)$

Because $Z^*(\vec{x_0}) = \sum_{i=1}^n w_i Z_i , \quad$ we have

$\sum_{i=1}^n w_i - 1 = 0 . \qquad (7)$

The coefficients $w_i$ can be determined from this equation and from the condition (4), taking into account the constraint Eq. (7). This system of $n+1$ equations is solved by introducing Eq. (7) in Eq. (4) with a so-called Lagrange multiplier $\mu$. The Lagrange multiplier is an additional unknown constant which makes the number of unknown parameters equal to the number of equations.

The system of linear equations from which $w_i$ and $\mu$ must be solved is given by

$\phi=\sigma_0^2 + 2 \mu \big(\sum_{i=1}^n w_i – 1 \big) , \quad \Large\frac{\partial \phi}{\partial \mu}\normalsize = 0, \; \Large\frac{\partial \phi}{\partial w_i}\normalsize = 0, \; i=1, .., n\qquad (8).$

The partial derivatives of condition (8) yield Eq. (7) and a system of $n$ linear equations (see Appendix A):

$\sum_{j=1}^n \gamma_{ij} w_j - \mu = \gamma_{i0} , \; i=1, .., n , \qquad (9)$

where the semivariances $\gamma_{ij}, \gamma_{i0}$ are defined by

$\gamma_{ij} \equiv \gamma(r_{ij}) \equiv ½ E[(\epsilon_i-\epsilon_j)^2] = ½ E[(Z_i-Z_j)^2]; \; \gamma_{i0} \equiv \gamma(r_{i0}) ; \quad r_{ij}=|\vec{x_i}-\vec{x_j}| ; \; r_{i0} = |\vec{x_i}-\vec{x_0}| , \qquad (10)$

and where $\mu$ is a parameter to be determined from the $n+1$ equations (7) and (9).

## Solution by means of a variogram

For solving the equations (9) the variograms $\gamma_{ij}, \gamma_{i0}$ must be known. An approximate estimate of the variograms can be obtained if translation invariance is assumed (i.e., the variograms depend for the particular chosen neighborhood only on the distances $r_{ij}=|\vec{x_i}-\vec{x_j}|, \; r_{i0} = |\vec{x_i}-\vec{x_0}|$) and if the chosen neighborhood contains a sufficient number $n$ of data points. In this case the function $\gamma(r)$ (the variogram) can be constructed from the $½n(n-1)$ data values $½ (z_i-z_j)^2, \; i,j=1, .., n, i \ne j$, each corresponding to a particular value $r=r_{ij}$. An explanation is given in appendix B and a typical example is shown in Fig. 1. The variogram generally appears as a plot with data points scattered around a curve, which can be represented by a smooth function. Different parameterized functions can be tried out. According to the pattern of data points, the most suited function should be chosen and fitted to the variogram. Once the variogram function $\gamma(r)$ has been determined, the weights $w_i$ can be found from the system of $n+1$ linear equations (7, 9). The value of the random process $Z$ at location $x_0$ can then be determined from

$Z^*(\vec{x_0}) = \sum_{i=1}^n w_i z_i . \qquad (11)$

The error is given by (see appendix A)

$\sigma_0^2 = \mu + \sum_{i=1}^n w_i \gamma_{i0}. \qquad (12)$

## Kriging the mean

The mean $m$ can be determined in a way similar to the kriging procedure for the stochastic component. The estimator of the mean is given by

$m^*=\sum_{i=1}^n u_i Z_i . \qquad (13)$

The conditions $E[m^*] = E[m]$ and $E[Z_i]=m$ imply $\sum_{i=1}^n u_i=1 . \qquad (14)$

The mean has no stochastic component: $E[m^*m]=E[m^2]=m^2 .$

The squared error is thus given by $\sigma_m^2=E[(m^*-m)^2]=\ \sum_{i,j=1}^n u_i u_j C_{ij} , \qquad (15)$

where $C_{ij}$ are the covariances $C_{ij} \equiv C(r_{ij}) \equiv E[(Z_i-m)(Z_j-m)]$.

Minimizing the error taking into account the condition (13) is done as for the stochastic component

$\phi=\sigma_m^2 + 2 \mu \big(\sum_{i=1}^n u_i – 1 \big) , \quad \Large\frac{\partial \phi}{\partial \mu}\normalsize = 0, \; \Large\frac{\partial \phi}{\partial u_i}\normalsize = 0, \; i=1, .., n\qquad (16).$

The condition (16) yields together with Eq. (13) a system of $n+1$ linear equations

$\sum_{j=1}^n C_{ij} u_j + \mu = 0 , \; i=1, .., n , \qquad (17)$

from which the coefficients $u_i$ and the Lagrange multiplier $\mu$ can be determined. The mean can now be computed from the data $z_i$:

$m^* = \sum_{i-1}^n u_i z_i . \qquad (18)$

## Kriging in the presence of trends

The minimum observation area required for data interpolation to a target location can be too large to ignore trends. The ordinary kriging (OK) method cannot be applied in this case. Removing the trend by subtracting from the data a fitted function based on least-squares will generally introduce a bias as explained in the introduction and does not provide a correct error estimate.

The trend $m(\vec{x})$ can be determined by a kriging procedure similar to the procedure for the mean. Therefore it is assumed that an estimator of the trend $m^*$ can be written as a linear superposition of known functions $f_k(\vec{x}), \; k=1, .., K$,

$m^*(\vec{x})=\sum_{k=1}^K a_k f_k(\vec{x}) . \qquad (19)$

For example, in case of a linear trend one may choose $f_1=1, f_2=x, f_3=y$.

The estimator of the process $Z$ at the target location $\vec{x_0}$ can now be written as:

$Z^*(\vec{x_0}) = \sum_{i=1}^n w_i Z_i = m^*(\vec{x_0}) + \sum_{i=1}^n w_i \big( Z_i - m^*(\vec{x_i}) \big) = \sum_{i=1}^n w_i Z_i + \sum_{k=1}^K a_k \Big(f_{0k} - \sum_{i=1}^n w_i f_{ik} \Big) , \qquad (20)$

where $f_{0k}=f_k(\vec{x_0}), f_{ik}=f_k(\vec{x_i})$.

The condition $E[Z^*(\vec{x_i})] = E[Z(\vec{x_i})] = m^*(\vec{x_i})$ yields $K+1$ equations $f_{0k} - \sum_{i=1}^n w_i f_{ik} = 0 , \quad \sum_{i=1}^n w_i - 1 =0 . \qquad (21)$

The squared error $\sigma^2$ is the same as for ordinary kriging, $\sigma^2=\sigma_0^2$ (Eq. (A1)).

The function to be minimized requires $K+1$ Lagrange multipliers $\mu_k$ in order to account for the conditions (21),

$\phi=\sigma_0^2+ 2\mu_0 \Big(\sum_{i=1}^n w_i - 1\Big) + 2\sum_{k=1}^K \mu_k \Big(\sum_{i=1}^n w_i f_{ik} - f_{0k} \Big) . \qquad (22)$.

Minimization with respect to $w_i, i=1, .., n$ and $\mu_k, k=0, 1, .., K$ yields the system of $K+1$ linear equations (21) and $n$ linear equations

$\sum_{i=1}^n w_i \gamma_{ij} - \mu_0 - \sum_{k=1}^K \mu_k f_{ik} = \gamma_{i0} , \qquad (23)$

from which the parameters $w_i, \mu_k$ can be determined if the variogram function $\gamma(r)$ is known. However, the semivariances $\gamma_{ij} \equiv ½ E[(\epsilon_i-\epsilon_j)^2]$ cannot be estimated directly from the data because the trend components $m_i$ are not yet known. This issue can be solved by using an iterative procedure.

Another procedure to remove the trend from the data makes use of regression analysis with generalized least squares (GLS). However, the estimation of the residuals by generalized least squares is also an iterative process: first the drift model is estimated using ordinary least squares (OLS); then the covariance function of the residuals is used to obtain the GLS coefficients; these can be used to re-compute residuals and so on.

## Applications and limitations

Many environmental parameters are influenced by processes that interact at a wide range of spatial scales. Spatial surveys generally do not resolve the smallest scales, where subscale processes produce correlations between values observed in a wider neighborhood. In such cases, kriging is the most appropriate technique to determine parameter values at intermediate unobserved locations and to estimate the uncertainty of the interpolated values. Examples in the coastal and marine environment include data collected for bathymetric mapping, sediment maps, mapping of benthic communities, plankton distributions, seabed geochemical components, pollutants and mineral resources.

Several commercial and free GIS software packages can perform kriging interpolation. However, the user must be aware that the kriging procedure is highly sensitive to the specification of the correlation and variance functions inferred from the data. Mis-specification of the variogram model leads to erroneous weights of the kriging interpolator and inaccurate interpolations. Translational invariance of the correlations is a crucial assumption underlying the kriging method. Whether this assumption is met must be verified experimentally and/or theoretically. So some other methods (e.g. Bayesian approaches) have been developed if the kriging conditions are not well satisfied. In general, the accuracy of interpolation by kriging will be limited if the number of observations sampled is small, the data has limited spatial scope, or the data is in fact not sufficiently spatially correlated. In these cases, a sample variogram is hard to generate, and other methods (e.g. regression) may prove preferable to kriging for spatial prediction.

## Appendix A

The error $\sigma_0^2$ can be written

$\sigma_0^2 \equiv E \Big[ \big( Z^*(\vec{x_0}) - Z(\vec{x_0}) \big)^2 \Big] = E \Big[ \big( \sum_{i=1}^n w_i (m+\epsilon_i) - m - \epsilon_0 \big)^2 \Big] = E \Big[ \big( \sum_{i=1}^n w_i \epsilon_i - \epsilon_0 \big)^2 \Big] = \sum_{i,j=1}^n w_i w_j C_{ij} + C(0) -2 \sum_{i=1}^n w_i C_{i0} \; , \qquad (A1)$

where the covariances $C_{ij} \equiv C(r_{ij}) \equiv E[\epsilon_i \epsilon_j] \equiv E[(Z_i-m) (Z_j-m)] , \; C_{i0} \equiv C(r_{i0}) \equiv E[\epsilon_i \epsilon_0] , \; C(0)=E[\epsilon_i^2]=E[\epsilon_0^2]$ and $r_{ij}=|\vec{x_i}-\vec{x_j}|, \; r_{i0}=|\vec{x_i}-\vec{x_0}| .$

The condition (8) then yields

$\Large\frac{\partial \phi}{\partial w_i}\normalsize = 2 \Big( \mu + \sum_{j=1}^n w_j C_{ij} - C_{i0} \Big) = 0 , \; i=1, .., n . \qquad (A2)$

Instead of the covariance it is often more practical to consider the variance. The semivariances $\gamma_{ij}, \gamma_{i0}$ are defined by $\gamma_{ij} \equiv \gamma(r_{ij}) \equiv ½ E[(Z_i-Z_j)^2]$, $\gamma_{i0} \equiv \gamma(r_{i0})$.

Because $2 \gamma_{ij} \equiv E[(Z_i-Z_j)^2] = E[(Z_i-m)^2]+E[(Z_j-m)^2] – 2 E[(Z_i-m)( Z_j-m)] = 2 C(0) – 2 C_{ij}$, we have

$C_{ij} = C(0) - \gamma_{ij} . \qquad (A3)$

Substitution in Eq. (A2) yields Eq. (9),

$\sum_{j=1}^n \gamma_{ij} w_j - \mu = \gamma_{i0} , \; i=1, .., n , \qquad (9)$

Substitution of (A3) in the expression (A1) gives

$\sigma_0^2 = \sum_{i,j=1}^n w_i w_j C_{ij} + C(0) -2 \sum_{i=1}^n w_i C_{i0} = -\sum_{i,j=1}^n w_i w_j \gamma_{ij} +2 \sum_{i=1}^n w_i \gamma_{i0} . \qquad (A4)$

The value of $\mu$ can be found by multiplying Eq. (A2) by $w_i$ and by summing over $i$. From comparison with the expression (A4) we find

$\sigma_0^2 = \mu + \sum_{i=1}^n w_i \gamma_{i0} . \qquad (A5)$

## Appendix B

The variogram function $\gamma(r)$ is related to the covariance function $C(r)$ by (A3),

$\gamma(r)=C(0)-C(r) . \qquad (B1)$

The practical way to establish the variogram is by taking the average of all data pairs $\vec{x_i}, \vec{x_j}$ lying in a strip of width $\Delta r$ at a distance $r$ from each other. The variogram is computed as

$\gamma(r)=\Large\frac{1}{2 n_r}\normalsize \sum_{i,j, i \ne j} (z_i – z_j)^2 , \qquad (B2)$

for the $n_r$ data locations satisfying $r \lt |\vec{x_i} - \vec{x_j}|\lt r+\Delta r$. The values of $\gamma(r)$ computed in this way are plotted as in Fig. 1.

The strongest correlation is expected at short distances $r$. However, if one extrapolates the variance $\gamma(r)$ to small values of $r$, the result will generally not vanish. This can be due to measurement errors, to temporal fluctuations in case of non-simultaneous observations or to processes at smaller scales than the observation grid. The result $\gamma(0)$ is called the nugget, see Fig. 1. The correlation weakens at large distances, so the variance increases. The kriging method is meaningful only if the variance becomes substantially larger than the nugget. The variance may level off at great distance to some value close to $C(0)$. However, it may also increase further. In that case there can be a trend in the data that falsifies the assumption of uniform $m(\vec{x})$ that underlies the ordinary kriging model. If the number of datapoints in a neighborhood where $m(\vec{x})$ is approximately uniform is too small for constructing a variogram, a procedure must be chosen for incorporating trend analysis into the kriging method, see Kriging in the presence of trends.

Kriging is a local predictor, and only the nearest few points to the interpolation point carry significant weight, especially if the nugget variance is a small proportion of the total variance. As a rule of thumb, the minimum number of datapoints in the neighborhood of uniform $m(\vec{x})$ for applying ordinary kriging should be larger than 10. 

Outliers and skewness of the dataset can introduce bias into the variogram. Assumption B regards the Gaussian distribution of the random component of the random process. Strong skewness can often be reduced by log-transformation of the data. In this case the kriging method is applied to the log-transformed dataset. Outliers due to measurement errors or other causes should be removed from the dataset. A bias in the variogram can also result from anisotropy of the random process. If the variance along one axis is very different from the variance along the perpendicular axis, the effect of anisotropy can be overcome by a linear transformation (contraction or dilation) of one of the axes.

A well-chosen representation of the variogram is essential for the quality of the kriging interpolation. This can be checked through the MSRD (mean squared deviation ratio) criterion for the mean of the squared errors divided by the corresponding kriging variances. Ideally the MRSD should equal 1, and so for kriging one should choose the variogram model for which the MSDR is closest to 1. 

## Related articles

Interpolation of measured grain-size fractions
Acoustic kelp bed mapping in shallow rocky coasts - case study Helgoland (North Sea)
Estimation of spatial distribution of phytoplankton in the North Sea