Classification

PROBLEM


Classification is the problem of identifying which category an object belongs to. It is considered an instance of supervised learning and predictive modeling. Classification tasks are distinguished into one-class, binary (two-class) and multiclass tasks.

SAMPLE USE-CASES:
Credit scoring, Spam filtering, Image and speech recognition, OCR, Document classification.

ALGORITHM FAMILIES:
Support Vector Machines, Decision Tree Learning, Instance-Based Learning, Generalized Linear Models, Neural Networks.

Regression

PROBLEM


Regression is the problem of predicting a continuous-valued attribute associated with an object. It is considered an instance of supervised learning and predictive modeling.

Regression is related to classification, but the two are different. Informally, classification predicts whether something will happen, whereas regression predicts how much something will happen.

SAMPLE USE-CASES:
Stock price prediction, Demand forecasting, Drug response modeling, House price estimation.

ALGORITHM FAMILIES:
Support Vector Machines, Decision Tree Learning, Instance-Based Learning, Generalized Linear Models, Neural Networks.

Clustering

PROBLEM


Clustering is the problem of automatically grouping similar objects into sets. It is considered an instance of unsupervised learning and descriptive modeling.

Unlike in classification, the output categories (labels) are not known beforehand in clustering.

SAMPLE USE-CASES:
Customer segmentation, Grouping shopping items, Crime hot spots identification, Outlier detection.

ALGORITHM FAMILIES:
Centroid-Based Clustering, Connectivity-Based Clustering, Density-Based Clustering.

Anomaly Detection

PROBLEM


Anomaly Detection is the problem of identifying items, events or observations that do not conform to an expected pattern or to other items in a dataset. It relies on a number of machine learning techniques, including unsupervised, supervised and semi-supervised learning.

SAMPLE USE-CASES:
Detection of bank fraud, Security intrusions, System faults, Medical problems, Textual errors.

ALGORITHM FAMILIES:
Support Vector Machines, Decision Tree Learning.

Support Vector Machines

ALGORITHM FAMILY


SVMs are supervised learning algorithms that can be used for both classification and regression problems (except One-Class SVM, which is unsupervised). Given labeled training data, the algorithm outputs an optimal hyperplane which categorizes new examples.

CHARACTERISTICS:
Big Data — computation- and memory-intensive while training; data sampling can be used as a workaround
Small Data — good accuracy even for a small # of observations
Imbalanced Data — compensates for imbalance through class weights; works fine out of the box for moderately imbalanced data
Results Interpretation — applicable for regulated fields (e.g. credit scoring)
Online Learning — most implementations only support batch setting
Ease of Use — moderate number of parameters.

IMPLEMENTATIONS:
Linear SVM, Non-Linear SVM, One-Class SVM.

Decision Tree Learning

ALGORITHM FAMILY


Decision Tree Learning uses a decision tree structure to go from observations about an item to conclusions about the item's target value. It is one of the most interpretable families of machine learning algorithms. This approach can be used for both classification and regression problems.

CHARACTERISTICS:
Big Data — interpretability gets worse on large datasets
Small Data — sufficient generalization even for very small datasets, but can lead to overfitting
Imbalanced Data — can be handled by a stratified bootstrap technique
Results Interpretation — represented by a set of decision rules
Online Learning — can be trained sequentially
Ease of Use — model tuning is user-friendly

ALGORITHMS:
Classification/Regression Decision Tree, Random Forest, Isolation Forest.

Generalized Linear Models

ALGORITHM FAMILY


Generalized Linear Models are a flexible generalization of ordinary linear regression: a linear combination of the features is related to the target variable through a link function, which allows response variables with error distributions other than the normal. This family can be used for both classification and regression problems.

CHARACTERISTICS:
Big Data — there are many implementations for large datasets, including iterative approaches for solving systems of linear equations
Small Data — overfitting resistance due to representing statistically significant dependencies only
Imbalanced Data — poor; can be handled by applying imbalance compensation techniques, e.g. complex sampling
Results Interpretation — result drivers can be easily extracted
Online Learning — most implementations only support batch setting
Ease of Use — model tuning is user-friendly

ALGORITHMS:
Linear Regression, Logistic Regression, Bayesian Linear Regression, Bayesian Naive Classifier.

Instance-Based Learning

ALGORITHM FAMILY


Instance-Based Learning (sometimes called memory-based learning) is a family of learning algorithms that, instead of performing explicit generalization, compare new problem instances with instances seen in training and stored in memory. This approach can be used for both classification and regression problems.

CHARACTERISTICS:

Big Data — generally, the application of these algorithms to big data is not feasible due to memory and prediction speed restrictions
Small Data — good accuracy, even for a small number of observations
Imbalanced Data — more robust for data with unbalanced classes and efficient for multiclass classification with a small number of features
Results Interpretation — transparent inference process, but without the possibility of extracting explanation rules
Online Learning — can be trained sequentially
Ease of Use — model tuning is user-friendly

ALGORITHMS:

K-Nearest Neighbors, Radius Neighbors.

Centroid-Based Clustering

ALGORITHM FAMILY


Centroid-based models represent clusters by a central vector, which does not need to be an actual object. The analyst needs to define the number of clusters in advance, and the clustering algorithm then targets the optimization problem of finding k centers and assigning objects to the nearest center in a way that minimizes the squared distances.

CHARACTERISTICS:

Big Data — can handle big data well, but it is only fast for low-dimensional data
Noise Resistance — sensitive to noisy data
Interpretability — clustering results are easy to interpret, except in unimodal distribution cases
Online Learning — can be trained sequentially
Data Flexibility — extremely difficult due to a limited set of distance metrics and their properties

ALGORITHMS:

K-Means Clustering.

Density-Based Clustering

ALGORITHM FAMILY


In density-based clustering, clusters are defined as areas of higher density in comparison with the rest of the dataset. Objects that do not belong to the high-density clusters are usually considered to be noise or border points.

CHARACTERISTICS:
Big Data — depends on implementation; worst-case scenarios are characterized by sub-quadratic time or memory complexity
Noise Resistance — has a notion of noise, and is robust to outliers
Interpretability — clusters are characterized by a higher density than the remainder of the dataset
Online Learning — can be trained sequentially
Data Flexibility — possible, but only in cases where a symmetric distance function can be defined

ALGORITHMS:
DBSCAN.

Artificial Neural Network

ALGORITHM FAMILY


ANN is a computational model based on the structure and functions of biological neural networks. It can be used for both classification and regression as well as unsupervised learning. The model works well for unstructured data and complex non-linear relationships, but is usually computationally expensive.

CHARACTERISTICS:

Big Data — performance positively correlates with data volume, but at the expense of computational resources
Small Data — the less data available, the greater the chance that other algorithms will outperform neural networks
Imbalanced Data — class imbalance increases the possibility of overfitting, but it can be mitigated with class weights
Results Interpretation — usually difficult to interpret
Online Learning — can be trained sequentially
Ease of Use — requires substantial understanding of the ANN field

ALGORITHMS:

Feedforward ANN (Multilayer Perceptron).

Classification/Regression Trees

DECISION TREE LEARNING | ALGORITHM


A non-parametric supervised learning method used for classification and regression. The goal is to create a model that
predicts the value of a target variable by learning simple decision rules inferred from the data features.

CHARACTERISTICS:

Accuracy — depends heavily on # of features; a tree with few samples in high-dimensional space is very likely to overfit
Training Speed — heuristic algorithms provide the linear dependency on a training dataset size
Prediction Speed — depends on the tree size
Overfitting Resistance — tends to create over-complex trees that don't generalize well from the training data
Probabilistic Interpretation — can be extracted

TIPS:
• Naturally works with both categorical and numerical data
• The most interpretable results

IMPLEMENTATIONS:
R, Python (scikit-learn), Spark (mllib).
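
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch of how the scikit-learn implementation listed above might be used; the Iris dataset and the max_depth value are illustrative assumptions, not part of the original card.

# Fit a small classification tree and check held-out accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting tree depth is one way to counter the over-complex trees noted above.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))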

Random Forest

DECISION TREE LEARNING | ALGORITHM


An ensemble learning method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. It has improved predictive accuracy and controls overfitting.

CHARACTERISTICS:

Accuracy — improved due to bootstrap aggregation of independent trees
Training Speed — depends on model complexity, but it can be parallelized
Prediction Speed — depends on model complexity but it can be improved due to independent trees
Overfitting Resistance — complexity doesn't lead to overfitting
Probabilistic Interpretation — can be extracted

TIPS: 
• Naturally works with both categorical and numerical data
• High accuracy without extensive tuning

IMPLEMENTATIONS:
R, Python (scikit-learn), Spark (mllib), Azure.
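
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch, assuming the Iris dataset and an arbitrary n_estimators value purely for illustration.

# Cross-validate a random forest; n_jobs=-1 parallelizes tree construction.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
print("cv accuracy:", cross_val_score(clf, X, y, cv=5).mean())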

Isolation Forest

DECISION TREE LEARNING | ALGORITHM


Isolation Forest (IForest) is an unsupervised algorithm that learns a set of trees, which gives the ability to classify a new observation as an anomaly or not. The algorithm expects that outliers are easier to isolate from the rest of the data, so the "tree path" length is used as a criterion of separation.

CHARACTERISTICS:

Training Speed — linear dependence on # of observations, can vary on data/model complexity, easily parallelized
Prediction Speed — almost linear, depends on # of observations
Noise Sensitivity — IForest is built using trees, but bootstrapping improves its resistance to noise
Distribution Invariance — good with any kind of data distribution

TIPS: 

• Can achieve high detection performance with high-dimensional problems that have a large number of irrelevant attributes
• Works well even with anomalies in the training set

IMPLEMENTATIONS:

R, Python (scikit-learn).
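
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch on synthetic data; the contamination value is an assumption chosen to match the injected outlier share, not a recommendation.

# Flag anomalies in a toy dataset with a few injected outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([inliers, outliers])

iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(X)  # +1 = inlier, -1 = anomaly
print("flagged anomalies:", (labels == -1).sum())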

Bayesian Naive Classifier

GENERALIZED LINEAR MODELS | ALGORITHM


A simple probabilistic classifier based on Bayes' Theorem with an assumption of strong (naive) independence among features.

CHARACTERISTICS:

Accuracy — a big dataset is required to make reliable estimations of the probability of each class
Training Speed — converges toward its asymptotic accuracy faster than other methods, like logistic regression
Prediction Speed — depends on # of classes
Overfitting Resistance — great generalization to unseen data based on its training set, due to a very simple hypothesis function
Probabilistic Interpretation — class probabilities are naturally estimated for binary classification

TIPS: 

• Can handle missing data
• Converges quicker than discriminative models like logistic regression, so less training data is needed

IMPLEMENTATIONS:

R, Python (scikit-learn), Spark (mllib), Azure.
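
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal Gaussian Naive Bayes sketch; the Iris dataset is an illustrative assumption.

# Fit the classifier and inspect the class-probability estimates mentioned above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))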

Linear Regression

GENERALIZED LINEAR MODELS | ALGORITHM


A linear approach for modelling the relationship between variables. Used for predicting a continuous output based on one or multiple inputs. The algorithm is simple and fast, often used at the early exploration stage, but can be overly simplistic.

CHARACTERISTICS:

Accuracy — limited due to only linear relations in data
Training Speed — linear dependence on the training dataset size and # of features
Prediction Speed — prediction can be performed as a multiplication of two matrices, so it is fast and easy to parallelize
Overfitting Resistance — very sensitive to outliers and # of features; outlier treatment and regularization techniques are recommended

TIPS:

• In some implementations, categorical variables should be transformed into a set of binary ones to be used in the algorithm
• Results are interpretable, as feature influences are represented by learned coefficients

IMPLEMENTATIONS:

R, Python (scikit-learn), Spark (mllib), Azure.
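
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch, assuming the built-in diabetes dataset purely for illustration.

# Fit ordinary least squares and read off the learned coefficients (feature influences).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
reg = LinearRegression()
reg.fit(X, y)
print("coefficients:", reg.coef_)
print("R^2 on training data:", reg.score(X, y))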

Logistic Regression

GENERALIZED LINEAR MODELS | ALGORITHM


Logistic regression is a linear model for classification: it estimates the probability of an outcome by applying the logistic (sigmoid) function to a linear combination of the input features. Despite its name, it is used for classification rather than regression.

CHARACTERISTICS:

Accuracy — separates observations into two groups. If observations are not linearly separable, it’s impossible to achieve high accuracy
Training Speed — fast for binary classification; it can also be trained on a large dataset using an online training approach
Prediction Speed — predictions can be extracted relatively fast and computationally easy
Overfitting Resistance — considered robust to outliers; be careful with # of features; regularization techniques are recommended
Probabilistic Interpretation — extracted by weighting predictions

TIPS:

• Multiclass classification is done using one-vs-all or one-vs-one
• Results are interpretable, as the influence of the features is represented by learned coefficients

IMPLEMENTATIONS:

R, Python (scikit-learn), Spark (mllib), Azure.
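
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal binary-classification sketch; the Iris subset and the C value are illustrative assumptions.

# C controls the strength of L2 regularization (smaller = stronger).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]  # keep two classes for a binary example

clf = LogisticRegression(C=1.0)
clf.fit(X, y)
print("class probabilities:", clf.predict_proba(X[:2]))
print("coefficients:", clf.coef_)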

Bayesian Linear Regression

GENERALIZED LINEAR MODELS | ALGORITHM


Bayesian linear regression models treat regression coefficients and the disturbance variance as random variables, rather than
fixed but unknown quantities. This assumption leads to a more flexible model and intuitive inferences.

CHARACTERISTICS:

Accuracy — determined by the confidence level of estimated distribution parameters
Training Speed — depends on the model complexity
Prediction Speed — similar to a linear regression model
Overfitting Resistance — due to prior probabilities, it’s possible to constrain model parameters by some more probable value, inferred from the overall data

TIPS:

• More accurate in small samples
• Can incorporate prior information

IMPLEMENTATIONS:

R, Python (scikit-learn), Azure.
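
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch using scikit-learn's BayesianRidge as one possible Bayesian linear-regression implementation; the diabetes dataset is an illustrative assumption.

# return_std=True exposes a per-sample predictive standard deviation from the posterior.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import BayesianRidge

X, y = load_diabetes(return_X_y=True)
reg = BayesianRidge()
reg.fit(X, y)
mean, std = reg.predict(X[:3], return_std=True)
print("predictive mean:", mean)
print("predictive std:", std)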

K-Nearest Neighbors (KNN)

INSTANCE-BASED LEARNING | ALGORITHM


A non-parametric supervised learning method used for classification and regression; a type of lazy learning, where the
function is only approximated locally and all computation is deferred until classification.

CHARACTERISTICS:

Accuracy — sufficient accuracy for most tasks, but there is a tradeoff between accuracy and avoiding overfitting
Training Speed — training time is high on large datasets
Prediction Speed — full training set processing is required
Overfitting Resistance — with an increase of k nearest training objects, the probability of overfitting decreases
Probabilistic Interpretation — naturally determined by the inference process

TIPS:

• One of the simplest machine learning algorithms
• Good choice for low dimensional space

IMPLEMENTATIONS:

R, Python (scikit-learn).
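
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch; the Iris dataset and k=5 are illustrative assumptions.

# Larger n_neighbors smooths the decision boundary, per the overfitting note above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))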

Radius Neighbors

INSTANCE-BASED LEARNING | ALGORITHM


An alternative to the KNN algorithm, wherein the nearest neighbors are determined by a radius hyperparameter. It is also used for classification and regression tasks.

CHARACTERISTICS:

Accuracy — sufficient accuracy for most tasks, but there is a tradeoff between accuracy and avoiding overfitting
Training Speed — training time is high on large datasets
Prediction Speed — full training set processing is required
Overfitting Resistance — with an increase of radius the probability of overfitting decreases
Probabilistic Interpretation — naturally determined by the inference process

TIPS:

• Can be a better choice than KNN when the data is not uniformly sampled
• Becomes less effective in high-dimensional spaces

IMPLEMENTATIONS:

R, Python (scikit-learn), Azure.
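
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch; the Iris dataset, the radius value and the outlier_label choice are illustrative assumptions.

# outlier_label is returned for points that have no neighbors inside the radius.
from sklearn.datasets import load_iris
from sklearn.neighbors import RadiusNeighborsClassifier

X, y = load_iris(return_X_y=True)
clf = RadiusNeighborsClassifier(radius=1.0, outlier_label=-1)
clf.fit(X, y)
print(clf.predict(X[:5]))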

Linear SVM

SUPPORT VECTOR MACHINES | ALGORITHM


An SVM algorithm with a linear kernel.

CHARACTERISTICS:

Accuracy — depends on the data nature (linear separability is required)
Training Speed — linear dependency on the training dataset size
Prediction Speed — depends on # of support vectors and features
Overfitting Resistance — be careful with model hyperparameters
Probabilistic Interpretation — can be calculated using an expensive cross-validation

TIPS:

• Great performance in high-dimensional space
• Works well when # of features is larger than # of observations
• Margin maximization provides good robustness.

IMPLEMENTATIONS:

R, Python (scikit-learn), Spark (mllib), Azure.
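
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch; the Iris dataset and the C value are illustrative assumptions, and feature scaling is included because SVMs are sensitive to feature scale.

# C trades margin width against training errors, per the hyperparameter note above.
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))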

Non-Linear SVM

SUPPORT VECTOR MACHINES | ALGORITHM


An SVM algorithm with a non-linear kernel, such as a Gaussian radial basis function (RBF).

CHARACTERISTICS:

Accuracy — high accuracy for most tasks when tuning is done well
Training Speed — training time is high on large datasets
Prediction Speed — depends on # of support vectors and features
Overfitting Resistance — theoretically, SVMs should be highly resistant to overfitting, but in practice it depends on kernel parameters
Probabilistic Interpretation — can be calculated using an expensive cross-validation

TIPS:
• Great performance in high-dimensional space
• Expert knowledge about the problem can be built into a model by engineering the kernel
• Deep expertise in machine learning is required

IMPLEMENTATIONS:
R, Python (scikit-learn), Azure.
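
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal RBF-kernel sketch; the dataset and the C/gamma values are illustrative assumptions that would normally be tuned.

# Kernel parameters (here gamma) largely determine overfitting behaviour, as noted above.
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))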

One-Class SVM

SUPPORT VECTOR MACHINES | ALGORITHM


One-Class SVM is an unsupervised algorithm that learns a decision function for novelty detection: classifying new data as similar or different to the training set.

CHARACTERISTICS:

Training Speed — quadratic dependence on the dataset size
Prediction Speed — linear dependence on # of processed observations, can vary on # of support vectors and features
Noise Sensitivity — less robust against noise in the training dataset in comparison with other algorithms
Distribution Invariance — good with any kind of data distribution

TIPS:
• Great performance in high-dimensional space
• Can be used to create an anomaly detection model

IMPLEMENTATIONS:

R, Python (scikit-learn), Azure.
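
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal novelty-detection sketch on synthetic data; the nu value is an illustrative assumption.

# nu upper-bounds the fraction of training points treated as outliers.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))          # "normal" observations only
X_new = np.array([[0.1, -0.2], [4.0, 4.0]])  # one typical point, one novelty

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(X_train)
print(ocsvm.predict(X_new))  # +1 = similar to training data, -1 = novelty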

Single-Linkage Clustering

HIERARCHICAL CLUSTERING | ALGORITHM


An agglomerative clustering approach in which, at each step, the two clusters with the smallest distance between their closest observations are combined.

CHARACTERISTICS:

Time complexity — slow when building a distance matrix, especially with high-dimensional data and many observations
Memory complexity — quadratic complexity
Deterministic results — the agglomerative approach is clearly defined
Tuning complexity — dendrogram can be split by # of groups or height

TIPS:

• Catches complicated data structures
• Tends to produce long, thin clusters, which may make the algorithm difficult to use
• Useful in anomaly detection

IMPLEMENTATIONS:

R, Python (scikit-learn).
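
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch using AgglomerativeClustering with single linkage; the two-moons dataset is an illustrative assumption, chosen because single linkage can follow such elongated structures.

# linkage="single" merges the pair of clusters with the closest pair of observations;
# linkage="average" would give the average-linkage variant described further below.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(X)
print(labels[:10])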

Average-Linkage Clustering

HIERARCHICAL CLUSTERING | ALGORITHM


An agglomerative clustering approach in which, at each step, the two clusters with the smallest average distance between all of their data points are combined.

CHARACTERISTICS:

Time complexity — slower than single- or complete-linkage algorithms, as more operations are needed to compute distances
Memory complexity — quadratic complexity
Deterministic results — the agglomerative approach is clearly defined
Tuning complexity — dendrogram could be split by # of groups or height

TIPS:

• All points in the clusters contribute equally, compared with single/complete linkage

IMPLEMENTATIONS:

R, Python (scikit-learn).

K-Means Clustering

CENTROID-BASED CLUSTERING | ALGORITHM


A method of vector quantization, popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

CHARACTERISTICS:

Time Complexity — very fast, but falls into local minima and needs restarting
Memory Complexity — sub-quadratic complexity
Deterministic Results — depend on the initial choice of centroids
Tuning Complexity — user-friendly, with a small number of hyperparameters

TIPS:

• The result of the algorithm is a set of cluster centers
• Cluster shapes can only be hyperspheres

IMPLEMENTATIONS:

R, Python (scikit-learn), Spark (mllib), Azure.
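
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch on synthetic blobs; k=3 is an illustrative assumption.

# n_init restarts the algorithm to mitigate the local-minima issue noted above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print("cluster centers:\n", km.cluster_centers_)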

DBSCAN

DENSITY-BASED CLUSTERING | ALGORITHM


The most popular density-based clustering method. DBSCAN is based on the so-called "density-reachability" cluster model. It connects points that satisfy a density criterion, such as a minimum number of other objects within a defined radius.

CHARACTERISTICS:

Time Complexity — requires a linear number of range queries on the database
Memory Complexity — sub-quadratic complexity
Deterministic Results — border points that are reachable from more than one cluster can be part of either cluster, depending on the order in which the data is processed
Tuning Complexity — algorithm parameters can be set by a domain expert, if the data is well understood

TIPS:

• Does not require # of clusters as a parameter
• Good results for non-sphere clusters with equal density

IMPLEMENTATIONS:

R, Python (scikit-learn).
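
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch on the two-moons dataset; eps and min_samples are illustrative assumptions that a domain expert would normally set.

# No cluster count is required; points labelled -1 are treated as noise.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))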

Multilayer Perceptron

ARTIFICIAL NEURAL NETWORK | ALGORITHM


A class of feedforward ANNs that consists of at least one hidden layer with a nonlinear activation function. Popular activation functions include the rectified linear unit (ReLU), the sigmoid function and the hyperbolic tangent.

CHARACTERISTICS:

Accuracy — works well for both linear and nonlinear dependencies
Training Speed — depends heavily on model complexity and on training dataset size
Prediction Speed — depends on # of model features, scales well
Overfitting Resistance — requires big training set and regularization
Probabilistic Interpretation — thanks to the softmax activation

TIPS:

• Good performance for high dimensional space
• Works well with numerical and categorical data

IMPLEMENTATIONS:
R, Python (scikit-learn), Spark (mllib, classifier only), Azure.
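
EXAMPLE (PYTHON, SCIKIT-LEARN):
A minimal sketch; the digits dataset, hidden layer size and regularization strength are illustrative assumptions.

# One hidden layer with ReLU activation; alpha adds L2 regularization against overfitting.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64,), activation="relu",
                  alpha=1e-4, max_iter=500, random_state=0),
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))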