Difference between revisions of "Data analysis techniques for the coastal zone"

From Coastal Wiki
 
(One intermediate revision by the same user not shown)
Line 6: Line 6:
 
* Artificial neural networks
 
 
* Kriging
 
 +
* Random Forest Regression
 +
* Support Vector Regression
 
Each technique has advantages and disadvantages. The most suitable technique depends on the problem at hand and on the quantity and quality of the available data. In the table below we provide some guidance for choosing the most appropriate technique for the analysis of data on coastal processes.
 
  
Line 38: Line 40:
 
|-
 
|-
 
| style="border:2px solid lightblue; font-weight:bold;  font-size: 12px; text-align:center"| [[Data interpolation with Kriging |Kriging]]   
 
| style="border:2px solid lightblue; font-size: 12px;text-align:left"|* Optimal interpolation method in case of correlated data errors <br> * Provides uncertainty estimate <br> * Can handle non-uniform sampling
+
| style="border:2px solid lightblue; font-size: 12px;text-align:left"|* Optimal interpolation method if errors in the data are spatially or temporally correlated <br> * Provides uncertainty estimate <br> * Can handle non-uniform sampling
| style="border:2px solid lightblue; font-size: 12px;text-align:left"|* Assumption that error correlations only depend on distance <br> * Data records must be either in space or time domain
+
| style="border:2px solid lightblue; font-size: 12px;text-align:left"|* Assumption that correlations of data deviations from the interpolated function decrease with distance <br> * Data records must be either in space or time domain
 
| style="border:2px solid lightblue; font-size: 12px;text-align:center"| Data records with variability at a wide range of scales
 
 +
|-
 +
| style="border:2px solid lightblue; font-weight:bold;  font-size: 12px; text-align:center"| [[Support Vector Regression]] 
 +
| style="border:2px solid lightblue; font-size: 12px;text-align:left"|* Prediction tool based on machine learning from training data<br> * Handles unstructured data and nonlinear relationships in high dimensional spaces <br> * Does classification and regression <br> * Robust method based on sound mathematical principles <br> * Efficient for small datasets <br> * Overfitting can be easily avoided
 +
| style="border:2px solid lightblue; font-size: 12px;text-align:left"|* Black box, no easy interpretation of results, no probability estimates <br> * Sensitivity to noise and outliers <br> * Less efficient for large datasets <br> * Not reliable outside the range of trained situations <br> * Results influenced by the choice of the kernel transformation 
 +
| style="border:2px solid lightblue; font-size: 12px;text-align:center"| Pattern recognition from images, e.g. interpretation of remote sensing images
 +
|-
 +
| style="border:2px solid lightblue; font-weight:bold;  font-size: 12px; text-align:center"| [[Random Forest Regression]] 
 +
| style="border:2px solid lightblue; font-size: 12px;text-align:left"|* Prediction tool based on machine learning from training data<br> * Handles nonlinear relationships <br> * Does classification and regression <br> * Resilient to data noise and data gaps <br> * Computationally efficient <br> * Low overfitting risk
 +
| style="border:2px solid lightblue; font-size: 12px;text-align:left"|* Black box, no easy interpretation of results, no probability estimates <br> * Less efficient if many trees <br> * Not reliable outside the range of trained situations
 +
| style="border:2px solid lightblue; font-size: 12px;text-align:center"| Time series forecasting, pattern recognition from images, e.g. interpretation of remote sensing images
 
|}
 
|}
  
Line 50: Line 62:
 
:[[Artificial Neural Networks and coastal applications]]
 
 
:[[Data interpolation with Kriging]]
 
 +
:[[Random Forest Regression]]
 +
:[[Support Vector Regression]]
  
  
Line 75: Line 89:
 
[[Category:Coastal and marine observation and monitoring]]
 
 
[[Category:Data analysis methods]]
 
[[Category:Physical coastal and marine processes]]
 

Latest revision as of 13:14, 13 February 2024

Here we introduce a series of Coastal Wiki articles dealing with data analysis techniques. The aim of data analysis is generally to find a small number of functions that describe, with sufficient accuracy, the spatial and temporal properties of the data in terms of external forcing factors. The data analysis techniques presented in the Coastal Wiki are:

  • Linear regression
  • Principal component analysis, empirical orthogonal functions and singular spectrum analysis
  • Wavelets
  • Artificial neural networks
  • Kriging
  • Random Forest Regression
  • Support Vector Regression

Each technique has advantages and disadvantages. The most suitable technique depends on the problem at hand and on the quantity and quality of the available data. In the table below we provide some guidance for choosing the most appropriate technique for the analysis of data on coastal processes.


Table 1. Comparison of data analysis techniques
Each entry lists the analysis technique, its strengths, its limitations and an application example.

Linear regression analysis
  Strengths:
  • Trend detection (linear, nonlinear) from data records
  • Robust, cheap, easy to implement
  Limitations:
  • Data errors must be uncorrelated and Gaussian distributed
  • Error margins of interpolations and extrapolations are underestimated
  • Trend functions are arbitrarily chosen
  Application example: Trend analysis

Principal component analysis, empirical orthogonal functions and singular spectrum analysis
  Strengths:
  • The three techniques are essentially the same
  • Can handle large data sets
  • Identification of 'hidden' spatial (1D, 2D) or temporal patterns
  • Guides interpretation towards underlying processes
  • Enables data reduction and noise removal
  Limitations:
  • Bias towards variables with high variance
  • Less suited than wavelets in case of phase-shifted patterns
  Application example: Identification of patterns in large datasets

Wavelets
  Strengths:
  • Analysis of irregular, non-cyclic and nonlinear processes
  • Can handle large data sets
  • Enables data reduction and noise removal
  • Guides interpretation towards underlying processes
  Limitations:
  • Requires equidistant data
  • Not suited for small data records
  • Performs less well than Fourier or harmonic analysis in case of regular cyclic processes
  Application example: Analysis of phenomena with strong spatial and temporal variation

Artificial Neural Networks
  Strengths:
  • Prediction tool based on machine learning from training data
  • Can handle complex nonlinear systems
  • Identification of major influencing factors
  Limitations:
  • Predictions only within the range of the trained situations
  • Black box prediction tool
  • Requires large datasets
  • No general prescription for optimal network design
  • Possibly unreliable results due to overfitting
  • No guarantee for convergence to optimal solution
  Application example: Prediction of features driven by multiple external factors

Kriging
  Strengths:
  • Optimal interpolation method if errors in the data are spatially or temporally correlated
  • Provides uncertainty estimate
  • Can handle non-uniform sampling
  Limitations:
  • Assumption that correlations of data deviations from the interpolated function decrease with distance
  • Data records must be either in space or time domain
  Application example: Data records with variability at a wide range of scales

Support Vector Regression
  Strengths:
  • Prediction tool based on machine learning from training data
  • Handles unstructured data and nonlinear relationships in high dimensional spaces
  • Does classification and regression
  • Robust method based on sound mathematical principles
  • Efficient for small datasets
  • Overfitting can be easily avoided
  Limitations:
  • Black box, no easy interpretation of results, no probability estimates
  • Sensitivity to noise and outliers
  • Less efficient for large datasets
  • Not reliable outside the range of trained situations
  • Results influenced by the choice of the kernel transformation
  Application example: Pattern recognition from images, e.g. interpretation of remote sensing images

Random Forest Regression
  Strengths:
  • Prediction tool based on machine learning from training data
  • Handles nonlinear relationships
  • Does classification and regression
  • Resilient to data noise and data gaps
  • Computationally efficient
  • Low overfitting risk
  Limitations:
  • Black box, no easy interpretation of results, no probability estimates
  • Less efficient if many trees
  • Not reliable outside the range of trained situations
  Application example: Time series forecasting, pattern recognition from images, e.g. interpretation of remote sensing images
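
As an illustration of the trend detection listed for linear regression, the sketch below fits a straight line to a short data record by ordinary least squares. The function name fit_linear_trend and the shoreline positions are hypothetical and invented for this example; they do not come from the Coastal Wiki articles.

```python
def fit_linear_trend(t, y):
    """Ordinary least-squares fit of y = intercept + slope * t."""
    n = len(t)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired samples")
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    # Sum of squared deviations of t, and cross-deviations of t and y
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    if s_tt == 0:
        raise ValueError("all sample times are identical")
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = s_ty / s_tt
    intercept = y_mean - slope * t_mean
    return slope, intercept


# Hypothetical yearly shoreline positions in metres (invented values)
years = [2015, 2016, 2017, 2018, 2019, 2020]
shoreline = [100.2, 98.1, 97.3, 95.4, 94.2, 92.3]

slope, intercept = fit_linear_trend(years, shoreline)
print(f"estimated trend: {slope:.2f} m per year")
```

A negative slope here would indicate a retreating shoreline. The fit itself says nothing about whether the data errors are uncorrelated and Gaussian, which is the condition under which the usual error margins hold.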

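The limitation "not reliable outside the range of trained situations" can be made concrete with a minimal sketch of a bagged ensemble of regression trees, the building block of random forest regression. Everything here is an assumption made for illustration: the function names, the synthetic training data (y = 2x for x between 0 and 10) and the simplification to a single input variable, for which the random feature selection of a full random forest has no effect.

```python
import random


def sum_sq(points):
    """Sum of squared deviations of y from its mean."""
    mean = sum(y for _, y in points) / len(points)
    return sum((y - mean) ** 2 for _, y in points)


def build_tree(points, min_leaf=2):
    """Recursively build a 1-D regression tree from (x, y) pairs."""
    mean = sum(y for _, y in points) / len(points)
    xs = sorted({x for x, _ in points})
    if len(points) < 2 * min_leaf or len(xs) < 2:
        return mean  # leaf: predict the mean of its samples
    best = None
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2  # candidate split halfway between neighbours
        left = [p for p in points if p[0] <= thr]
        right = [p for p in points if p[0] > thr]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        sse = sum_sq(left) + sum_sq(right)
        if best is None or sse < best[0]:
            best = (sse, thr, left, right)
    if best is None:
        return mean
    _, thr, left, right = best
    return (thr, build_tree(left, min_leaf), build_tree(right, min_leaf))


def predict_tree(tree, x):
    """Walk down the tree until a leaf value is reached."""
    while isinstance(tree, tuple):
        thr, left, right = tree
        tree = left if x <= thr else right
    return tree


def build_forest(points, n_trees=25, seed=0):
    """Train each tree on a bootstrap resample of the data (bagging)."""
    rng = random.Random(seed)
    return [build_tree([rng.choice(points) for _ in points])
            for _ in range(n_trees)]


def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)


# Synthetic training data: y = 2x sampled on 0 <= x <= 10
train = [(i * 0.5, 2.0 * i * 0.5) for i in range(21)]
forest = build_forest(train)

edge = predict_forest(forest, 10.0)     # last trained situation
outside = predict_forest(forest, 20.0)  # far outside the training range
print(edge, outside)
```

Because every split threshold lies inside the training range, both inputs end in the same leaf of every tree, so the prediction at x = 20 equals the prediction at x = 10 and the underlying relation y = 2x (which gives 40) is not followed.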

Related articles

Linear regression analysis of coastal processes
Analysis of coastal processes with Empirical Orthogonal Functions
Wavelet analysis of coastal processes
Artificial Neural Networks and coastal applications
Data interpolation with Kriging
Random Forest Regression
Support Vector Regression




The main authors of this article are Job Dronkers, Grzegorz Rozynski, Vanessa Magar and James Sutherland.
Please note that others may also have edited the contents of this article.