Machine Learning, Deep Learning & AI Methods for Data Analysis, Processing & Engineering


Last updated: March 2, 2026
Author: Paul Namalomba
- SESKA Computational Engineer
- SEAT Backend Developer
- Software Developer
- PhD Candidate (Civil Engineering Spec. Computational and Applied Mechanics)
Contact: kabwenzenamalomba@gmail.com
Website: paulnamalomba.github.io

Languages: Python, C# | Domain: ML/DL/AI | License: MIT

Overview

This document is an exhaustive technical reference cataloguing every significant Machine Learning (ML), Deep Learning (DL), and Artificial Intelligence (AI) method currently utilized in production data analysis, data processing, and data engineering pipelines. Each method is presented with its mathematical foundation, its specific role in data workflows, and its computational complexity profile.

The target reader is assumed fluent in linear algebra, multivariate calculus, probability theory, and systems architecture. Elementary explanations are intentionally omitted. If a concept here is unfamiliar, consult Bishop (2006), Goodfellow et al. (2016), or Murphy (2022) before proceeding.

Phase 1 provides the algorithmic taxonomy. Phase 2 maps these algorithms to the concrete library and framework landscape in Python and C# / .NET, covering hardware acceleration, API philosophy, and production integration patterns.

Contents


Phase 1 — Comprehensive Taxonomy of ML/DL Methods in Data Analysis


1) Data Preprocessing & Feature Engineering

The quality of any downstream model is bounded by the quality of the data entering it. The methods in this section address the three cardinal sins of raw data: missingness, contamination (outliers), and scale heterogeneity.


1.1 Multiple Imputation by Chained Equations (MICE)

Theoretical Foundation

MICE (also called Fully Conditional Specification) treats each feature with missing values as a dependent variable and imputes it using the remaining features as predictors in a round-robin fashion. For $p$ features with missingness, MICE iterates:

$$ X_j^{(t+1)} \sim P\left(X_j \mid X_{-j}^{(t)}, \theta_j\right) \quad \forall\, j \in \{1, \ldots, p\} $$

where $X_{-j}^{(t)}$ denotes all features except $j$ at iteration $t$, and $\theta_j$ parameterizes the conditional model for feature $j$ (typically linear regression for continuous, logistic for binary, polytomous for categorical). The process converges when the joint distribution $P(X_{\text{miss}} \mid X_{\text{obs}})$ stabilizes. Convergence is assessed via trace plots of imputed statistics across iterations.

Critically, MICE does not assume a joint multivariate distribution. It operates purely on conditionals, which makes it both flexible and theoretically fragile — there is no guarantee the conditional models are compatible with a coherent joint distribution. In practice, this rarely matters. In theory, it should keep you awake at night.

Data Processing Application

MICE is the default advanced imputation strategy for tabular datasets in which missingness is MAR (Missing At Random) or MCAR (Missing Completely At Random). It preserves inter-feature correlations and distributional properties far better than mean/median imputation or simple regression imputation. It generates $m$ completed datasets, enabling Rubin's rules for pooled inference that correctly propagates imputation uncertainty into downstream confidence intervals.

  • ETL pipelines with structured clinical, financial, or sensor data
  • Upstream of any supervised learner where listwise deletion would decimate the training set
  • Feature engineering stages where imputed features feed into interaction terms or polynomial expansions
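The round-robin scheme above can be sketched with scikit-learn's experimental IterativeImputer, which implements chained-equations imputation. The dataset below is synthetic; parameters are illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)  # correlated feature pair
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan            # ~10% MCAR missingness

# Round-robin conditional imputation; sample_posterior=True draws from the
# conditional model rather than plugging in its mean, as MICE proper requires.
imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
X_filled = imputer.fit_transform(X_miss)
```

Note that IterativeImputer returns a single completed dataset; producing the $m$ imputations needed for Rubin's rules means repeating the fit with different random_state values and pooling downstream estimates.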

Algorithmic Complexity

Phase Time Space
Per iteration (one feature) $O(n \cdot p)$ for linear regression; $O(n \cdot p^2)$ if OLS closed-form $O(n \cdot p)$
Full MICE (T iterations, p features) $O(T \cdot p \cdot n \cdot p) = O(T \cdot p^2 \cdot n)$ $O(m \cdot n \cdot p)$ for $m$ imputations

1.2 KNN Imputation

Theoretical Foundation

For a sample $\mathbf{x}_i$ with missing feature $j$, KNN imputation identifies the $k$ nearest neighbors of $\mathbf{x}_i$ in the observed feature subspace using a distance metric (typically Euclidean, Nan-Euclidean, or Gower for mixed types), then imputes:

$$ \hat{x}_{ij} = \frac{\sum_{l \in \mathcal{N}_k(i)} w_l \cdot x_{lj}}{\sum_{l \in \mathcal{N}_k(i)} w_l} $$

where $w_l = 1/d(\mathbf{x}_i, \mathbf{x}_l)$ for distance-weighted variants, or $w_l = 1$ for uniform weighting. Distance computation over incomplete vectors uses only mutually observed dimensions, then rescales:

$$ d(\mathbf{x}_i, \mathbf{x}_l) = \sqrt{\frac{p}{p_{\text{obs}}}} \cdot \sqrt{\sum_{j \in \text{obs}} (x_{ij} - x_{lj})^2} $$

Data Processing Application

KNN imputation is non-parametric - it makes no distributional assumptions about the features. This makes it suitable for datasets with complex, non-linear inter-feature relationships where MICE's conditional models (linear/logistic) might be misspecified. It is particularly effective when:

  • Feature distributions are heavily multimodal
  • The dataset is moderate-sized (KNN is memory-bound for large $n$)
  • Mixed-type features are present (with Gower distance)
  • Imputation must respect local data geometry

Algorithmic Complexity

Phase Time Space
Brute-force $O(n^2 \cdot p)$ $O(n \cdot p)$
With KD-Tree (dense, low-$p$) $O(n \cdot p \cdot \log n)$ $O(n \cdot p)$
With Ball-Tree (metric spaces) $O(n \cdot p \cdot \log n)$ $O(n \cdot p)$

KNN imputation does not scale gracefully. For $n > 10^5$, approximate nearest neighbors (FAISS, Annoy) should replace exact search, or MICE should be preferred outright.
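A minimal sketch with scikit-learn's KNNImputer, which implements the nan-Euclidean distance described above (data is synthetic):

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_miss = X.copy()
X_miss[5, 1] = np.nan   # a single missing cell

# Distance-weighted imputation: closer neighbours contribute more,
# i.e. w_l = 1/d in the formula above.
imputer = KNNImputer(n_neighbors=5, weights="distance")
X_filled = imputer.fit_transform(X_miss)
```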


1.3 Matrix Completion via Nuclear Norm Minimization

Theoretical Foundation

Given partially observed matrix $M \in \mathbb{R}^{n \times p}$ with observed entries indexed by $\Omega$, the goal is to recover the full low-rank matrix:

$$ \min_X | X |_* \quad \text{subject to} \quad X_{ij} = M_{ij} \quad \forall (i,j) \in \Omega $$

where $| X |_*$ is the nuclear norm (sum of singular values), serving as the convex relaxation of the rank function. The optimization is typically solved via proximal gradient descent or Alternating Direction Method of Multipliers (ADMM). The proximal operator of the nuclear norm is the singular value thresholding operator:

$$ \text{prox}_{\lambda | \cdot |_*}(X) = U \cdot \text{diag}\left(\max(\sigma_i - \lambda, 0)\right) \cdot V^T $$

Data Processing Application

Matrix completion is the mathematically principled approach to imputation when the data matrix is genuinely low-rank (or approximately so). This assumption holds in recommendation systems, sensor networks with correlated channels, and any dataset where latent factors drive the observed variables. It is distinct from MICE in that it operates on the entire matrix simultaneously rather than column-by-column.
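A soft-impute-style sketch of singular value thresholding in plain NumPy: fill missing entries with the current estimate, shrink the spectrum, re-impose the observed entries. The matrix, mask, and $\lambda$ below are illustrative.

```python
import numpy as np

def svt_complete(M, mask, lam=0.5, iters=300):
    """Nuclear-norm completion by iterated singular value thresholding."""
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U * np.maximum(s - lam, 0.0)) @ Vt   # prox of the nuclear norm
        X = np.where(mask, M, X)                  # observed entries stay fixed
    return X

rng = np.random.default_rng(0)
M = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))  # exactly rank-2
mask = rng.random(M.shape) < 0.7                          # ~70% observed
X_hat = svt_complete(M, mask)
rel_err = np.linalg.norm(X_hat - M) / np.linalg.norm(M)
```

Because the matrix is genuinely low-rank and well-sampled, the missing entries are recovered to small relative error.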

Algorithmic Complexity

Phase Time Space
SVD per iteration $O(\min(n^2 p, n p^2))$ $O(n \cdot p)$
Full optimization (T iterations) $O(T \cdot \min(n^2 p, n p^2))$ $O(n \cdot p)$
Truncated SVD (rank-$r$) $O(n \cdot p \cdot r)$ per iteration $O((n + p) \cdot r)$

1.4 Isolation Forest (Anomaly Detection)

Theoretical Foundation

Isolation Forest exploits the principle that anomalies are few and different, making them easier to isolate via random partitioning. The algorithm constructs an ensemble of $T$ isolation trees (iTrees), each built by:

  1. Randomly selecting a feature $q$ from the feature set
  2. Randomly selecting a split value $s$ uniformly between the feature's min and max in the current node
  3. Recursing until the point is isolated (node size = 1) or tree height reaches $\lceil \log_2 n \rceil$

The anomaly score for point $\mathbf{x}$ is:

$$ s(\mathbf{x}, n) = 2^{-\frac{E[h(\mathbf{x})]}{c(n)}} $$

where $E[h(\mathbf{x})]$ is the average path length over $T$ trees, and $c(n)$ is the expected path length of unsuccessful search in a Binary Search Tree:

$$ c(n) = 2H(n-1) - \frac{2(n-1)}{n}, \quad H(k) = \ln(k) + \gamma \quad (\text{Euler-Mascheroni}) $$

Scores near 1 indicate anomaly; near 0.5 indicate normal.

Data Processing Application

Isolation Forest is the workhorse anomaly detector for high-dimensional tabular data in production. Unlike distance-based methods, it does not suffer from the curse of dimensionality. It is used for:

  • Pre-training data cleansing (removing corrupted records)
  • Real-time fraud detection in streaming pipelines
  • Sensor data quality gates in IoT ETL
  • Identifying data drift samples in model monitoring

Algorithmic Complexity

Phase Time Space
Training ($T$ trees, $\psi$ subsamples) $O(T \cdot \psi \cdot \log \psi)$ $O(T \cdot \psi)$
Scoring (single point) $O(T \cdot \log \psi)$ $O(1)$
Scoring (all $n$ points) $O(n \cdot T \cdot \log \psi)$ $O(n)$

The subsampling parameter $\psi$ (default 256) decouples training cost from $n$, making Isolation Forest one of the few anomaly detectors that genuinely scales to $10^7+$ rows.
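A minimal sketch with scikit-learn's IsolationForest on synthetic data with ten obvious anomalies injected:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 5))
X_anom = rng.uniform(6.0, 8.0, size=(10, 5))   # far from the bulk
X = np.vstack([X_normal, X_anom])

# max_samples=256 is the subsampling parameter psi discussed above.
forest = IsolationForest(n_estimators=200, max_samples=256, random_state=0)
labels = forest.fit_predict(X)   # +1 = inlier, -1 = anomaly
```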


1.5 One-Class SVM (Anomaly Detection)

Theoretical Foundation

One-Class SVM (Schölkopf et al., 2001) maps the data into a feature space $\mathcal{H}$ via kernel $\phi$ and finds the hyperplane that separates the data from the origin with maximum margin:

$$ \min_{w, \xi, \rho} \frac{1}{2}|w|^2 + \frac{1}{\nu n}\sum_{i=1}^n \xi_i - \rho $$

subject to $\langle w, \phi(\mathbf{x}_i) \rangle \geq \rho - \xi_i$, $\xi_i \geq 0$. Here $\nu \in (0, 1]$ upper-bounds the fraction of outliers and lower-bounds the fraction of support vectors. The dual formulation yields:

$$ \max_\alpha -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) $$

subject to $0 \leq \alpha_i \leq \frac{1}{\nu n}$, $\sum_i \alpha_i = 1$. The decision function is:

$$ f(\mathbf{x}) = \text{sgn}\left(\sum_{i=1}^n \alpha_i K(\mathbf{x}_i, \mathbf{x}) - \rho\right) $$

Data Processing Application

One-Class SVM is preferred when the "normal" class has complex, non-convex boundaries in feature space. The RBF kernel $K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma |\mathbf{x} - \mathbf{y}|^2)$ enables it to capture arbitrary decision boundaries. It is used for:

  • Novelty detection (the model sees only "normal" data during training)
  • Manufacturing defect detection where defective samples are scarce
  • Network intrusion detection

Its fatal weakness: it does not scale. The kernel matrix is $O(n^2)$ in memory. For $n > 10^4$, Isolation Forest or deep one-class methods (Deep SVDD) are categorically superior.

Algorithmic Complexity

Phase Time Space
Training (kernel matrix + QP) $O(n^2 \cdot p)$ to $O(n^3)$ $O(n^2)$
Inference (per point) $O(n_{\text{SV}} \cdot p)$ $O(n_{\text{SV}} \cdot p)$
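A novelty-detection sketch with scikit-learn's OneClassSVM: the model is fit on "normal" data only, then queried with unseen points (data and $\nu$ are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_train = rng.normal(0.0, 1.0, size=(300, 2))   # "normal" class only

# nu upper-bounds the outlier fraction and lower-bounds the SV fraction.
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
pred = oc_svm.predict(np.array([[0.1, -0.2], [6.0, 6.0]]))  # +1 / -1
```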

1.6 Local Outlier Factor (LOF)

Theoretical Foundation

LOF (Breunig et al., 2000) quantifies the local density deviation of a point relative to its neighbors. For point $\mathbf{x}$:

  1. $k$-distance: distance to the $k$-th nearest neighbor
  2. Reachability distance: $\text{rd}_k(\mathbf{x}, \mathbf{y}) = \max(k\text{-dist}(\mathbf{y}), \, d(\mathbf{x}, \mathbf{y}))$
  3. Local reachability density: $\text{lrd}_k(\mathbf{x}) = \left(\frac{1}{k}\sum_{\mathbf{y} \in \mathcal{N}_k(\mathbf{x})} \text{rd}_k(\mathbf{x}, \mathbf{y})\right)^{-1}$
  4. LOF score: $\text{LOF}_k(\mathbf{x}) = \frac{1}{k}\sum_{\mathbf{y} \in \mathcal{N}_k(\mathbf{x})} \frac{\text{lrd}_k(\mathbf{y})}{\text{lrd}_k(\mathbf{x})}$

$\text{LOF} \approx 1$ means normal density; $\text{LOF} \gg 1$ means locally anomalous.

Data Processing Application

LOF excels at detecting local anomalies — points that are outliers relative to their neighborhood but not necessarily globally extreme. This is critical in datasets with clusters of varying density, where global methods (Isolation Forest, One-Class SVM) may fail. Applications:

  • Geospatial anomaly detection (e.g., GPS trajectories with localized deviations)
  • Multi-modal distributions where each mode has distinct density
  • Quality control in manufacturing with multiple product lines

Algorithmic Complexity

Phase Time Space
KNN graph construction $O(n^2 \cdot p)$ brute-force; $O(n \cdot p \cdot \log n)$ with KD-Tree $O(n \cdot k)$
LOF computation $O(n \cdot k)$ $O(n)$
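The varying-density scenario above can be sketched with scikit-learn's LocalOutlierFactor: a point near a tight cluster but outside its density is flagged, even though it is unremarkable on a global scale (data is synthetic):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.1, size=(100, 2))    # tight cluster
diffuse = rng.normal(5.0, 1.0, size=(100, 2))  # loose cluster
local_anom = np.array([[0.0, 1.0]])            # anomalous only relative to
X = np.vstack([dense, diffuse, local_anom])    # the tight cluster's density

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                    # +1 = inlier, -1 = outlier
```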

1.7 Scaling Transformations

Theoretical Foundation

Scaling is not optional — it is a mathematical prerequisite for any method involving distance metrics, gradient descent, or regularization penalties:

Transform Formula Effect
Standard (Z-score) $z = \frac{x - \mu}{\sigma}$ Centers at 0, unit variance. Assumes Gaussian.
Min-Max $z = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ Maps to $[0, 1]$. Preserves zero entries in sparse data.
Robust $z = \frac{x - \text{median}}{\text{IQR}}$ Median-centered, IQR-scaled. Resistant to outliers.
Max-Abs $z = \frac{x}{\max_i |x_i|}$ Maps to $[-1, 1]$. Preserves sparsity and sign.
Quantile (Uniform) $z = F_n(x)$ Maps to $\text{Uniform}(0, 1)$ via empirical CDF. Non-linear.
Quantile (Gaussian) $z = \Phi^{-1}(F_n(x))$ Forces Gaussianity. Non-linear. Destroys multimodality.
Power (Yeo-Johnson) $z = \frac{(x+1)^\lambda - 1}{\lambda}$ (if $\lambda \neq 0$, $x \geq 0$) Stabilizes variance. Handles negatives (unlike Box-Cox).
Log / Log1p $z = \log(1 + x)$ Compresses right-skewed distributions. Requires $x \geq 0$.

Data Processing Application

  • Z-score: Default for SVM, logistic regression, neural networks, PCA
  • Min-Max: Image pixel normalization, features feeding into sigmoid activations
  • Robust: Any pipeline where outlier rows cannot be dropped (clinical, financial)
  • Quantile-Gaussian: Forcing normality assumptions for methods like LDA or naive Bayes
  • Yeo-Johnson: Preprocessing for linear models on heteroscedastic features
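The standard vs. robust trade-off can be seen directly with scikit-learn's scalers on a right-skewed, outlier-prone distribution (data is synthetic):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
x = rng.lognormal(sigma=1.0, size=(1000, 1))   # right-skewed, heavy tail

x_std = StandardScaler().fit_transform(x)      # z = (x - mu) / sigma
x_rob = RobustScaler().fit_transform(x)        # z = (x - median) / IQR
```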

Algorithmic Complexity

All scaling transforms are $O(n)$ time, $O(1)$ to $O(n)$ space (quantile requires sorting: $O(n \log n)$).


1.8 Feature Hashing (Hashing Trick)

Theoretical Foundation

Given a categorical feature with cardinality $C$ (potentially $C > 10^6$), feature hashing maps categories to a fixed-size vector of dimension $m \ll C$ using a hash function $h: \text{categories} \to \{0, 1, \ldots, m-1\}$ and an optional sign function $\xi: \text{categories} \to \{-1, +1\}$:

$$ \phi(\mathbf{x})_j = \sum_{i:\, h(x_i) = j} \xi(x_i) \cdot v_i $$

The sign function $\xi$ ensures that the inner product is preserved in expectation: $E[\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle] = \langle \mathbf{x}, \mathbf{y} \rangle$. Collisions introduce noise but the expected error shrinks as $O(1/\sqrt{m})$.

Data Processing Application

  • Real-time feature encoding for streaming data where the category vocabulary is unbounded (URLs, user-agents, IP address prefixes)
  • NLP pipelines: hashing n-grams instead of maintaining a vocabulary dictionary
  • Memory-constrained environments where one-hot encoding is infeasible
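A minimal sketch with scikit-learn's FeatureHasher; the feature strings are illustrative:

```python
from sklearn.feature_extraction import FeatureHasher

# m = 2**10 output dimensions, independent of the category vocabulary size.
hasher = FeatureHasher(n_features=2**10, input_type="string")
X = hasher.transform([["user:alice", "ua:firefox"],
                      ["user:bob", "ua:curl"]])
```

The output is a fixed-width sparse matrix regardless of how many distinct categories ever appear, which is the entire point in streaming settings.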

Algorithmic Complexity

Phase Time Space
Encoding $O(n \cdot \text{nnz})$ $O(m)$ per sample (fixed, independent of $C$)

1.9 Target Encoding

Theoretical Foundation

For categorical feature $x$ with levels $\{c_1, \ldots, c_C\}$ and target variable $y$, target encoding replaces each level with a smoothed estimate of the conditional expectation:

$$ \hat{y}_{c_k} = \lambda(n_k) \cdot \bar{y}_{c_k} + (1 - \lambda(n_k)) \cdot \bar{y}_{\text{global}} $$

where $\bar{y}_{c_k}$ is the mean target for level $c_k$, $n_k$ is the count, and $\lambda(n_k) = \frac{n_k}{n_k + \alpha}$ is a shrinkage factor (empirical Bayes smoothing). The hyperparameter $\alpha$ controls regularization toward the global mean $\bar{y}_{\text{global}}$.

To prevent target leakage, target encoding must be computed using leave-one-out or $k$-fold strategies within the training set.

Data Processing Application

  • High-cardinality categorical features ($C > 100$) where one-hot encoding induces sparsity and dimensionality explosion
  • Tree-based model pipelines (CatBoost internalizes a variant of this)
  • Situations where ordinal encoding is inadequate because the categories have no natural ordering
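A minimal smoothed-encoding sketch in pandas; for brevity it fits on the full frame, whereas production code must fit on training folds only, per the leakage caveat above. The toy frame and helper name are illustrative.

```python
import pandas as pd

def target_encode(train, col, target, alpha=10.0):
    """Empirical-Bayes smoothed target encoding (fit on training folds only)."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    lam = stats["count"] / (stats["count"] + alpha)          # shrinkage factor
    encoding = lam * stats["mean"] + (1 - lam) * global_mean
    return train[col].map(encoding).fillna(global_mean)

df = pd.DataFrame({"city": ["a", "a", "b", "b", "b", "c"],
                   "y":    [1,   0,   1,   1,   0,   1]})
df["city_te"] = target_encode(df, "city", "y", alpha=2.0)
```

The rare level "c" (one observation) is pulled toward the global mean, exactly the regularization $\alpha$ is meant to provide.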

Algorithmic Complexity

$O(n)$ time, $O(C)$ space.


2) Dimensionality Reduction & Manifold Learning

High-dimensional data is not merely inconvenient — it is pathological. Distances concentrate, volumes explode, and models overfit catastrophically. The methods below compress data into lower-dimensional representations while preserving specific geometric or statistical properties.


2.1 Principal Component Analysis (PCA)

Theoretical Foundation

PCA finds the orthogonal directions of maximum variance. Given centered data matrix $X \in \mathbb{R}^{n \times p}$, PCA solves:

$$ \max_{w: |w|=1} w^T \Sigma w \quad \text{where} \quad \Sigma = \frac{1}{n-1}X^T X $$

The solution is the eigendecomposition $\Sigma = W \Lambda W^T$ where columns of $W$ are eigenvectors (principal components) and $\Lambda = \text{diag}(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p)$ are eigenvalues (variance captured per component). The projection to $d$ dimensions is:

$$ Z = X W_d \quad \text{where} \quad W_d = [w_1 \mid w_2 \mid \cdots \mid w_d] $$

Equivalently, PCA minimizes reconstruction error $|X - ZW_d^T|_F^2$, establishing its connection to the truncated SVD: $X \approx U_d \Sigma_d V_d^T$.

Data Processing Application

  • Decorrelation of multicollinear features prior to regression
  • Noise reduction by projecting onto top-$d$ components (discarding low-variance noise dimensions)
  • Visualization (projecting to $d=2$ or $d=3$)
  • Compression of high-dimensional sensor data in streaming pipelines
  • Eigenfaces-style feature extraction for image pipelines

Variance retention criterion: choose $d$ such that $\frac{\sum_{i=1}^d \lambda_i}{\sum_{i=1}^p \lambda_i} \geq \tau$ (typically $\tau = 0.95$).
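In scikit-learn the variance-retention criterion is applied by passing $\tau$ directly as n_components; the two-factor synthetic dataset below is illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))                         # two latent factors
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(500, 10))

# A float in (0, 1) selects the smallest d whose cumulative ratio >= tau.
pca = PCA(n_components=0.95)
Z = pca.fit_transform(X)
retained = pca.explained_variance_ratio_.sum()
```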

Algorithmic Complexity

Phase Time Space
Full SVD $O(\min(n^2 p, n p^2))$ $O(n \cdot p)$
Truncated SVD (rank-$d$, randomized) $O(n \cdot p \cdot d)$ $O((n + p) \cdot d)$
Incremental PCA (mini-batch) $O(n \cdot p \cdot d)$ total $O(b \cdot p)$ per batch of size $b$

2.2 Kernel PCA

Theoretical Foundation

Kernel PCA performs PCA in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ induced by kernel function $K$. The kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ replaces the covariance matrix. After centering $\tilde{K} = K - \frac{1}{n}\mathbf{1}K - K\frac{1}{n}\mathbf{1} + \frac{1}{n}\mathbf{1}K\frac{1}{n}\mathbf{1}$, we solve:

$$ \tilde{K} \alpha^{(k)} = \lambda_k \alpha^{(k)} $$

Projection of new point $\mathbf{x}$: $z_k = \sum_{i=1}^n \alpha_i^{(k)} \kappa(\mathbf{x}_i, \mathbf{x})$.

Common kernels:

  • RBF: $\kappa(\mathbf{x}, \mathbf{y}) = \exp(-\gamma |\mathbf{x} - \mathbf{y}|^2)$
  • Polynomial: $\kappa(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + c)^d$
  • Sigmoid: $\kappa(\mathbf{x}, \mathbf{y}) = \tanh(\alpha \mathbf{x}^T \mathbf{y} + c)$

Data Processing Application

  • Nonlinear dimensionality reduction where the data manifold is curved (e.g., the "Swiss roll")
  • Feature extraction for downstream classifiers when linear PCA discards discriminative nonlinear structure
  • Pre-processing for One-Class SVM (using the same kernel for geometric consistency)

Algorithmic Complexity

Phase Time Space
Kernel matrix computation $O(n^2 \cdot p)$ $O(n^2)$
Eigendecomposition of $K$ $O(n^3)$ $O(n^2)$
Projection (new point) $O(n \cdot p)$ $O(n)$

The $O(n^2)$ memory requirement is the kill shot. For large $n$, use Nyström approximation (subsample $m \ll n$ landmarks) reducing cost to $O(n \cdot m^2)$.
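The curved-manifold case can be demonstrated on concentric circles, which are not linearly separable in the input space; the gamma value is illustrative:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

# Concentric circles: linear PCA cannot separate them, an RBF kernel can.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)

# A linear classifier on the kernel-PCA features now separates the classes.
accuracy = LogisticRegression().fit(Z, y).score(Z, y)
```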


2.3 Sparse PCA

Theoretical Foundation

Sparse PCA trades explained variance for interpretability by enforcing sparsity (L1 penalty) on the loadings:

$$ \min_{W, Z} \frac{1}{2}|X - ZW^T|_F^2 + \alpha |W|_1 $$

subject to $|z_k|_2 \leq 1$. This yields components that load on only a subset of original features. The elastic net variant adds an L2 term:

$$ \min_W \frac{1}{2}|X - XW\tilde{W}^T|_F^2 + \alpha_1 |W|_1 + \alpha_2 |W|_F^2 $$

Solved via coordinate descent or ADMM.

Data Processing Application

  • Genomics (identifying which genes drive principal variation from $p > 10^4$ features)
  • Financial factor modeling where interpretable factors are regulatory requirements
  • Any domain where "the first PC is a linear combination of all 500 features" is unacceptable

Algorithmic Complexity

Phase Time Space
Per component (coordinate descent) $O(p^2 \cdot n)$ per iteration $O(n \cdot p)$
$d$ components, $T$ iterations $O(d \cdot T \cdot p^2 \cdot n)$ $O(n \cdot p)$

2.4 t-Distributed Stochastic Neighbor Embedding (t-SNE)

Theoretical Foundation

t-SNE (van der Maaten & Hinton, 2008) constructs two probability distributions: one over pairs in high-dimensional space (Gaussian kernel), one over pairs in the low-dimensional embedding (Student-t with 1 DOF):

High-dimensional affinity: $$ p_{j|i} = \frac{\exp(-|\mathbf{x}_i - \mathbf{x}_j|^2 / 2\sigma_i^2)}{\sum_{k \neq i}\exp(-|\mathbf{x}_i - \mathbf{x}_k|^2 / 2\sigma_i^2)}, \quad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} $$

Low-dimensional affinity (heavy-tailed Student-t): $$ q_{ij} = \frac{(1 + |\mathbf{y}_i - \mathbf{y}_j|^2)^{-1}}{\sum_{k \neq l}(1 + |\mathbf{y}_k - \mathbf{y}_l|^2)^{-1}} $$

Objective (KL divergence): $$ \mathcal{L} = \text{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} $$

Minimized via gradient descent with momentum. The bandwidth $\sigma_i$ per point is set by binary search to match a specified perplexity: $\text{Perp}(P_i) = 2^{H(P_i)}$ where $H$ is Shannon entropy.

Data Processing Application

  • Visualization only. t-SNE is not a general-purpose dimensionality reduction method — the mapping is non-parametric (no out-of-sample extension), the KL divergence is asymmetric (penalizes placing distant points close, not the reverse), and global structure is not preserved.
  • Cluster visualization for quality inspection of embeddings
  • EDA on high-dimensional datasets (genomics, NLP embeddings)

Algorithmic Complexity

Phase Time Space
Exact $O(n^2)$ per gradient step $O(n^2)$
Barnes-Hut approximation $O(n \log n)$ per step $O(n)$
FFT-accelerated (FIt-SNE) $O(n \log n)$ per step $O(n)$
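A minimal visualization sketch with scikit-learn's TSNE on two synthetic clusters; the cluster layout is illustrative:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 20)),
               rng.normal(8.0, 1.0, size=(100, 20))])   # two distant clusters

# Embedding for visualization only: there is no transform() for new points.
Z = TSNE(n_components=2, perplexity=30.0, init="pca",
         random_state=0).fit_transform(X)
gap = np.linalg.norm(Z[:100].mean(axis=0) - Z[100:].mean(axis=0))
```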

2.5 Uniform Manifold Approximation and Projection (UMAP)

Theoretical Foundation

UMAP (McInnes et al., 2018) constructs a weighted $k$-nearest-neighbor graph in high dimensions, interprets it as a fuzzy simplicial set (via Riemannian geometry on the data manifold), then optimizes a low-dimensional layout to preserve the topological structure. The key equations:

High-dimensional fuzzy membership: $$ \mu_{ij} = \exp\left(-\frac{d(\mathbf{x}_i, \mathbf{x}_j) - \rho_i}{\sigma_i}\right) $$

where $\rho_i$ is the distance to the nearest neighbor (local connectivity guarantee), and $\sigma_i$ is calibrated to achieve $\log_2(k)$ effective neighbors.

Symmetrization: $\mu_{ij}^{\text{sym}} = \mu_{ij} + \mu_{ji} - \mu_{ij} \cdot \mu_{ji}$ (fuzzy set union, not average).

Low-dimensional affinity: $$ \nu_{ij} = \left(1 + a |\mathbf{y}_i - \mathbf{y}_j|^{2b}\right)^{-1} $$

where $a, b$ are fitted so that $\nu_{ij}$ smoothly approximates the target low-dimensional membership curve (controlled by the min_dist hyperparameter). Optimization proceeds via stochastic gradient descent on the cross-entropy:

$$ \mathcal{L} = \sum_{(i,j)} \left[\mu_{ij} \log \frac{\mu_{ij}}{\nu_{ij}} + (1 - \mu_{ij}) \log \frac{1 - \mu_{ij}}{1 - \nu_{ij}}\right] $$

Data Processing Application

UMAP supersedes t-SNE for most practical purposes:

  • Preserves both local and (approximate) global structure
  • Supports out-of-sample transforms via transform() (parametric UMAP uses a neural network)
  • Directly usable as a dimensionality reduction step upstream of classifiers (unlike t-SNE)
  • Scales to $n > 10^6$ via approximate nearest neighbors (Annoy, NN-Descent)

Applications: scRNA-seq visualization, embedding space exploration, outlier detection in reduced space.

Algorithmic Complexity

Phase Time Space
KNN graph (NN-Descent) $O(n^{1.14})$ empirical $O(n \cdot k)$
Optimization (SGD) $O(n \cdot E)$ for $E$ epochs $O(n \cdot d)$
Total $O(n^{1.14} + n \cdot E)$ $O(n \cdot k + n \cdot d)$

2.6 Autoencoders (Vanilla)

Theoretical Foundation

An autoencoder is a neural network trained to reconstruct its input through a bottleneck. Given encoder $f_\theta: \mathbb{R}^p \to \mathbb{R}^d$ and decoder $g_\phi: \mathbb{R}^d \to \mathbb{R}^p$ (where $d \ll p$):

$$ \min_{\theta, \phi} \frac{1}{n}\sum_{i=1}^n \mathcal{L}\left(\mathbf{x}_i, g_\phi(f_\theta(\mathbf{x}_i))\right) $$

For continuous data: $\mathcal{L} = |\mathbf{x} - \hat{\mathbf{x}}|_2^2$ (MSE). For binary data: $\mathcal{L} = -\sum_j [x_j \log \hat{x}_j + (1-x_j)\log(1-\hat{x}_j)]$ (binary cross-entropy).

With linear activations, the autoencoder learns a subspace identical to PCA. Nonlinear activations (ReLU, GELU) enable learning of curved manifolds.

Data Processing Application

  • Nonlinear dimensionality reduction (the latent vector $\mathbf{z} = f_\theta(\mathbf{x})$ is the compressed representation)
  • Denoising (Denoising Autoencoders: train on corrupted input $\tilde{\mathbf{x}}$, reconstruct clean $\mathbf{x}$)
  • Anomaly detection via reconstruction error: if $|\mathbf{x} - g_\phi(f_\theta(\mathbf{x}))|^2 > \tau$, flag as anomaly
  • Feature extraction for downstream supervised models
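Production autoencoders are written in PyTorch or TensorFlow; as a dependency-free sketch, an MLPRegressor trained to reproduce its own input behaves as an autoencoder with a hidden bottleneck. The data, layer sizes, and error threshold below are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X = np.tanh(latent @ rng.normal(size=(2, 8)))   # curved 2-D manifold in R^8

# hidden_layer_sizes=(16, 2, 16): the middle layer is the d=2 bottleneck.
autoencoder = MLPRegressor(hidden_layer_sizes=(16, 2, 16), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(X, X)                            # target = input
recon_error = np.mean((autoencoder.predict(X) - X) ** 2)
```

Because the data has intrinsic dimension 2, the 2-unit bottleneck reconstructs far better than predicting the feature means would.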

Algorithmic Complexity

Depends on architecture. For a single hidden layer of width $h$:

Phase Time (per sample) Space
Forward pass $O(p \cdot h + h \cdot d + d \cdot h + h \cdot p) = O(p \cdot h)$ $O(p + h + d)$
Backward pass $\approx 2\times$ forward + gradient buffers
Training ($n$ samples, $E$ epochs) $O(E \cdot n \cdot p \cdot h)$ $O(\text{params})$

2.7 Variational Autoencoders (VAE)

Theoretical Foundation

VAEs (Kingma & Welling, 2014) impose a probabilistic structure on the latent space. The generative model assumes $\mathbf{z} \sim \mathcal{N}(0, I)$ and $\mathbf{x} | \mathbf{z} \sim p_\theta(\mathbf{x}|\mathbf{z})$. Since the true posterior $p(\mathbf{z}|\mathbf{x})$ is intractable, we approximate it with $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mu_\phi(\mathbf{x}), \text{diag}(\sigma_\phi^2(\mathbf{x})))$.

The training objective maximizes the Evidence Lower Bound (ELBO):

$$ \mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})) $$

The first term is reconstruction quality. The second term regularizes the encoder to stay near the prior. Training uses the reparameterization trick: $\mathbf{z} = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, enabling backpropagation through the stochastic node.

The KL term has a closed-form solution for Gaussian posteriors against a Gaussian prior:

$$ D_{\text{KL}} = -\frac{1}{2}\sum_{j=1}^d \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right) $$

Data Processing Application

  • Generative modeling: sampling $\mathbf{z} \sim \mathcal{N}(0, I)$ and decoding produces novel data points
  • Disentangled representation learning ($\beta$-VAE penalizes the KL term more heavily)
  • Semi-supervised learning: partially labeled data leverages both reconstruction and classification losses
  • Data augmentation for imbalanced datasets (generate minority-class synthetic samples by conditioning the latent space)
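The two VAE-specific ingredients, the reparameterization trick and the closed-form Gaussian KL, are small enough to state in plain NumPy (function names are illustrative; a real VAE computes these inside the training graph):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), per sample."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, eps ~ N(0, I): gradients flow through mu, sigma."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros((4, 3)), np.zeros((4, 3))   # posterior equals prior
z = reparameterize(mu, log_var, rng)
```

When the posterior equals the prior the KL term vanishes; shifting every $\mu_j$ by 1 costs $0.5$ nats per dimension.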

Algorithmic Complexity

Same as vanilla autoencoder + cost of sampling and KL computation ($O(d)$ per sample). The reparameterization trick adds negligible overhead.


2.8 Sparse Autoencoders

Theoretical Foundation

Sparse autoencoders add a sparsity constraint to the latent representation. The loss function augments reconstruction error with a sparsity penalty:

$$ \mathcal{L} = \frac{1}{n}\sum_{i=1}^n |\mathbf{x}_i - \hat{\mathbf{x}}_i|^2 + \beta \sum_{j=1}^d \text{KL}(\rho \,\|\, \hat{\rho}_j) $$

where $\hat{\rho}_j = \frac{1}{n}\sum_{i=1}^n h_j(\mathbf{x}_i)$ is the average activation of hidden unit $j$, $\rho$ is the target sparsity (e.g., 0.05), and:

$$ \text{KL}(\rho \,\|\, \hat{\rho}) = \rho \log \frac{\rho}{\hat{\rho}} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}} $$

This forces most hidden units to be inactive for any given input, yielding overcomplete but sparse representations.

Data Processing Application

  • Feature extraction where the latent dimension $d > p$ (overcomplete), but each sample activates only a sparse subset
  • Interpretability: active units form a "parts-based" decomposition (analogous to NMF)
  • Pre-training layer-wise in deep networks (historically important, now largely superseded by batch normalization and residual connections)

Algorithmic Complexity

Same as vanilla autoencoder. The sparsity penalty adds $O(n \cdot d)$ per epoch for computing mean activations.


3) Supervised Learning — Regression & Classification


3.1 Gradient Boosted Decision Trees — XGBoost

Theoretical Foundation

XGBoost (Chen & Guestrin, 2016) implements gradient boosting with second-order Taylor expansion of the loss function. Given ensemble prediction $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)$ at round $t$, the objective is:

$$ \mathcal{L}^{(t)} \approx \sum_{i=1}^n \left[g_i f_t(\mathbf{x}_i) + \frac{1}{2}h_i f_t^2(\mathbf{x}_i)\right] + \Omega(f_t) $$

where $g_i = \partial_{\hat{y}^{(t-1)}} \ell(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} \ell(y_i, \hat{y}_i^{(t-1)})$ are the first and second derivatives of the loss function evaluated at the current prediction. The regularization term:

$$ \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2 $$

penalizes number of leaves $T$ and leaf weight magnitudes $w_j$.

Split finding: For each leaf node, XGBoost evaluates the gain of splitting on feature $k$ at threshold $v$:

$$ \text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma $$

where $G_L = \sum_{i \in I_L} g_i$, $H_L = \sum_{i \in I_L} h_i$ are the gradient and Hessian sums for the left child. The optimal leaf weight is $w_j^* = -\frac{G_j}{H_j + \lambda}$.

Approximate split finding: For large datasets, XGBoost uses a weighted quantile sketch to propose candidate split points, then greedily evaluates gains at these candidates only.
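The gain formula is worth evaluating by hand once; the sketch below does so in NumPy for squared-error loss, where $g_i = \hat{y}_i - y_i$ and $h_i = 1$ (the helper and toy data are illustrative, not XGBoost's internals):

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Second-order split gain from per-sample gradients g and Hessians h."""
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    score = lambda G, H: G**2 / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR)
                  - score(GL + GR, HL + HR)) - gamma

# Squared-error loss at yhat = 0: g_i = yhat_i - y_i, h_i = 1.
y = np.array([0.0, 0.0, 1.0, 1.0])
g, h = np.zeros(4) - y, np.ones(4)
gain_pure = split_gain(g, h, np.array([True, True, False, False]))   # clean
gain_mixed = split_gain(g, h, np.array([True, False, True, False]))  # useless
```

The split that cleanly separates the two target values has positive gain; the split that mixes them has negative gain and would be pruned by $\gamma$.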

Data Processing Application

XGBoost is the de facto standard for structured/tabular prediction tasks. Its dominance in Kaggle competitions is not accidental — it handles:

  • Mixed feature types (continuous, ordinal, categorical via one-hot)
  • Missing values natively (learns optimal default direction at each split)
  • Feature interactions automatically (depth controls interaction order)
  • Non-linear relationships without feature engineering
  • Class imbalance via scale_pos_weight parameter

Algorithmic Complexity

Phase Time Space
Training (exact greedy, $T$ trees, depth $D$) $O(T \cdot D \cdot n \cdot p \cdot \log n)$ $O(n \cdot p)$
Training (histogram approx., $B$ bins) $O(T \cdot D \cdot n \cdot p + T \cdot D \cdot B \cdot p)$ $O(n \cdot p + B \cdot p)$
Inference (single sample) $O(T \cdot D)$ $O(1)$

3.2 LightGBM

Theoretical Foundation

LightGBM (Ke et al., 2017) introduces two critical innovations over XGBoost:

1. Gradient-based One-Side Sampling (GOSS): Instead of using all $n$ samples to estimate split gains, GOSS keeps all samples with large gradients (top-$a$ fraction) and randomly samples from the rest (fraction $b$). The gradient sums are corrected:

$$ \tilde{G}_j = \sum_{i \in A_j} g_i + \frac{1-a}{b}\sum_{i \in B_j} g_i $$

where $A$ is the top-$a$ set and $B$ is the random subsample of the complement.

2. Exclusive Feature Bundling (EFB): Mutually exclusive features (features that rarely take non-zero values simultaneously) are bundled into a single feature via offset encoding, reducing effective dimensionality from $p$ to $p'$ where $p' \ll p$ for sparse data.

Leaf-wise (best-first) growth: Unlike XGBoost's level-wise growth, LightGBM grows the leaf with the highest gain reduction, producing asymmetric trees that reduce loss faster with fewer leaves.

Data Processing Application

LightGBM dominates when:

  • $n$ is very large ($> 10^6$) — GOSS dramatically reduces per-iteration cost
  • $p$ is high with sparsity (NLP feature counts, one-hot explosions) — EFB compresses effectively
  • Training time is a constraint (2-10x faster than XGBoost on equivalent configurations)
  • Native categorical feature support is needed (optimal split finding without one-hot encoding)

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Training (GOSS + EFB, $T$ trees, max leaves $L$) | $O(T \cdot L \cdot (a \cdot n + b \cdot n) \cdot p')$ | $O(n \cdot p' + B \cdot p')$ |
| Inference | $O(T \cdot L)$ worst case | $O(1)$ |

3.3 CatBoost

Theoretical Foundation

CatBoost (Prokhorenkova et al., 2018) addresses prediction shift — a subtle form of target leakage in gradient boosting where residuals used to fit tree $t$ are computed on the same data used to find splits, causing overfitting.

Ordered Boosting: CatBoost trains on a random permutation of the data. For sample $i$ (in permutation order), the model used to compute gradients is trained only on samples $\{1, \ldots, i-1\}$. This eliminates the conditional bias $E[\hat{y}_i \mid \mathbf{x}_i] \neq f(\mathbf{x}_i)$ that plagues standard boosting.

Ordered Target Encoding: For categorical feature $c$ and target $y$, the encoding for sample $\sigma(i)$ (the $i$-th sample in permutation $\sigma$) is:

$$ \hat{x}_{\sigma(i)} = \frac{\sum_{j < i:\, x_{\sigma(j)} = x_{\sigma(i)}} y_{\sigma(j)} + a \cdot P}{|\{j < i: x_{\sigma(j)} = x_{\sigma(i)}\}| + a} $$

where $P$ is the global target prior and $a$ is a smoothing parameter. This is target encoding with a built-in leave-future-out guarantee.
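The encoding above can be sketched directly in NumPy. This is an illustrative re-implementation of the formula, not CatBoost's internal code; the function name and toy data are hypothetical.

```python
import numpy as np

def ordered_target_encode(x_cat, y, a=1.0, seed=0):
    """Sketch of CatBoost-style ordered target encoding.

    For each sample (taken in a random permutation), the encoding uses
    only the target values of *earlier* samples with the same category,
    smoothed toward the global prior P with strength `a`.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    prior = y.mean()                        # global target prior P
    sums, counts = {}, {}
    enc = np.empty(len(y))
    for i in perm:                          # permutation order = "time"
        c = x_cat[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + a * prior) / (k + a)  # leave-future-out estimate
        sums[c] = s + y[i]                  # only now reveal y_i
        counts[c] = k + 1
    return enc

x = np.array(["a", "b", "a", "a", "b", "c"])
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
enc = ordered_target_encode(x, y)
```

Note that a category seen for the first time (here `"c"`) is always encoded as the prior, regardless of its own target value — this is the leave-future-out guarantee.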

Oblivious trees: CatBoost uses symmetric (oblivious) decision trees where the same splitting condition is applied at each level. This constrains the tree but enables SIMD-vectorized inference and acts as regularization.

Data Processing Application

CatBoost is the correct choice when:

  • Categorical features are abundant and high-cardinality (no manual encoding needed)
  • Prediction shift is a concern (small datasets, few boosting rounds)
  • Inference latency matters (oblivious trees are extremely cache-friendly)
  • Production deployment targets CPUs (CatBoost's SIMD inference is among the fastest)

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Training ($T$ trees, depth $D$, ordered boosting) | $O(T \cdot n \cdot D \cdot p)$ | $O(n \cdot p + T \cdot 2^D)$ |
| Inference (oblivious tree, single sample) | $O(T \cdot D)$ (bit operations, SIMD) | $O(1)$ |

3.4 Support Vector Machines (SVM)

Theoretical Foundation

The SVM dual formulation for the soft-margin classifier:

$$ \max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) $$

subject to $0 \leq \alpha_i \leq C$, $\sum_i \alpha_i y_i = 0$.

The Kernel Trick replaces $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$ with $K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$ where $\phi$ maps to a (potentially infinite-dimensional) RKHS. All operations depend only on pairwise kernel evaluations — $\phi$ itself is never computed.

For the RBF kernel, $\phi$ maps to an infinite-dimensional space, yet the kernel evaluation is $O(p)$: $$ K(\mathbf{x}, \mathbf{y}) = \exp(-\gamma \|\mathbf{x} - \mathbf{y}\|^2) = \exp\left(-\gamma \sum_{k=1}^p (x_k - y_k)^2\right) $$

The SMO (Sequential Minimal Optimization) algorithm solves the QP by iteratively optimizing pairs of $\alpha_i$.

For SVM regression ($\epsilon$-SVR), the loss is the $\epsilon$-insensitive tube: $$ \ell(y, \hat{y}) = \max(0, |y - \hat{y}| - \epsilon) $$
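Both the RBF kernel and the $\epsilon$-insensitive loss are one-liners in NumPy; a minimal sketch (function names and test points are illustrative, not from a library):

```python
import numpy as np

def rbf_kernel_matrix(X, Y, gamma=0.5):
    """Pairwise RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clamp tiny negatives

def eps_insensitive_loss(y, y_hat, eps=0.1):
    """The epsilon-insensitive tube loss used by epsilon-SVR:
    zero inside the tube, linear outside."""
    return np.maximum(0.0, np.abs(y - y_hat) - eps)

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel_matrix(X, X, gamma=0.5)   # K[0,0] = 1, K[0,1] = exp(-0.5)
```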

Data Processing Application

  • Small to medium datasets ($n < 10^4$) with complex decision boundaries
  • Text classification with TF-IDF features (linear SVM is competitive with deep models)
  • Bioinformatics (protein classification, gene expression analysis)

SVMs are rarely used in modern large-scale data engineering: mathematically elegant, but practically deprecated for $n > 10^5$, where the kernel matrix no longer fits in memory and tree ensembles match or exceed SVM accuracy on tabular data.

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Training (SMO, RBF kernel) | $O(n^2 \cdot p)$ to $O(n^3)$ | $O(n^2)$ kernel matrix |
| Training (linear, LIBLINEAR) | $O(n \cdot p \cdot \text{nnz})$ | $O(n \cdot p)$ |
| Inference (per sample) | $O(n_{\text{SV}} \cdot p)$ RBF; $O(p)$ linear | $O(p)$ or $O(n_{\text{SV}})$ |

3.5 Elastic Net (L1 + L2 Regularized Regression)

Theoretical Foundation

Elastic Net combines Lasso (L1) and Ridge (L2) penalties:

$$ \min_\beta \frac{1}{2n}\|y - X\beta\|_2^2 + \alpha \left[\rho \|\beta\|_1 + \frac{1-\rho}{2}\|\beta\|_2^2\right] $$

where $\alpha > 0$ controls overall regularization and $\rho \in [0,1]$ interpolates between Ridge ($\rho=0$) and Lasso ($\rho=1$).

The L1 term induces sparsity (drives coefficients to exactly zero). The L2 term handles correlated features (Lasso alone arbitrarily selects one from a correlated group; Elastic Net shrinks them together).

Solved via coordinate descent, cycling through features:

$$ \beta_j \leftarrow \frac{S(r_j^T X_j / n, \, \alpha \rho)}{X_j^T X_j / n + \alpha(1-\rho)} $$

where $S(z, \gamma) = \text{sign}(z) \cdot \max(|z| - \gamma, 0)$ is the soft-thresholding operator and $r_j = y - X_{-j}\beta_{-j}$ is the partial residual.
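The coordinate-descent update can be sketched as follows; a toy NumPy implementation of the formula above (variable names and synthetic data are illustrative, and this omits the screening rules and warm starts a production solver would use):

```python
import numpy as np

def soft_threshold(z, g):
    """S(z, g) = sign(z) * max(|z| - g, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def elastic_net_cd(X, y, alpha=0.1, rho=0.5, n_iter=200):
    """Coordinate descent for the Elastic Net: cycle through features,
    soft-thresholding the partial-residual correlation."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
            z = X[:, j] @ r_j / n
            denom = X[:, j] @ X[:, j] / n + alpha * (1 - rho)
            beta[j] = soft_threshold(z, alpha * rho) / denom
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ true + 0.01 * rng.normal(size=200)
beta = elastic_net_cd(X, y, alpha=0.05, rho=0.9)   # recovers the sparsity pattern
```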

Data Processing Application

  • Feature selection with multicollinearity (genomics: $p \gg n$, features correlated in blocks)
  • Interpretable models with provably sparse coefficient vectors
  • Baseline regression before escalating to gradient boosting
  • Regularization path computation via warm starts for model selection

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Coordinate descent (single $\alpha$) | $O(n \cdot p)$ per pass, typically $O(n \cdot p \cdot T)$ for convergence | $O(n \cdot p)$ |
| Full regularization path ($K$ values) | $O(K \cdot n \cdot p \cdot T)$ with warm starts | $O(n \cdot p)$ |

3.6 Random Forests

Theoretical Foundation

Random Forest (Breiman, 2001) constructs an ensemble of $T$ decision trees, each trained on a bootstrap sample of the data. At each split, a random subset of $m = \lfloor\sqrt{p}\rfloor$ features (classification) or $m = \lfloor p/3 \rfloor$ (regression) is considered. The ensemble prediction:

$$ \hat{y}(\mathbf{x}) = \frac{1}{T}\sum_{t=1}^T f_t(\mathbf{x}) \quad \text{(regression)} $$

$$ \hat{y}(\mathbf{x}) = \text{mode}\{f_t(\mathbf{x})\}_{t=1}^T \quad \text{(classification)} $$

The generalization error bound (Breiman):

$$ \text{PE}^* \leq \frac{\bar{\rho}\,(1 - s^2)}{s^2} $$

where $\bar{\rho}$ is the average correlation between trees and $s$ is the average margin strength. The key insight: reducing tree correlation (via feature subsampling and bagging) drives down ensemble error, even if individual tree error remains constant.

Data Processing Application

  • Robust baseline for any tabular task (classification or regression)
  • Feature importance estimation (permutation importance, impurity decrease)
  • Missing value proximity-based imputation (missForest)
  • Out-of-bag error estimation eliminates the need for a validation set

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Training ($T$ trees, max depth $D$) | $O(T \cdot n \cdot m \cdot \log^2 n)$ | $O(T \cdot n)$ leaf storage |
| Inference (single sample) | $O(T \cdot D)$ | $O(1)$ |

4) Unsupervised Learning — Clustering & Density Estimation


4.1 DBSCAN

Theoretical Foundation

DBSCAN (Ester et al., 1996) defines clusters as connected components of density-reachable points. Parameters: $\epsilon$ (neighborhood radius), $\text{minPts}$ (minimum neighbors).

  • Core point: $|\mathcal{N}_\epsilon(\mathbf{x})| \geq \text{minPts}$
  • Border point: within $\epsilon$ of a core point, but not core itself
  • Noise: neither core nor border

Two points $\mathbf{x}$ and $\mathbf{y}$ are density-reachable if there exists a chain of core points from $\mathbf{x}$ to $\mathbf{y}$, each within $\epsilon$ of the next. A cluster is the maximal set of density-connected points.

Data Processing Application

  • Discovering clusters of arbitrary shape (DBSCAN does not assume convexity or Gaussianity)
  • Geospatial clustering (GPS coordinate grouping with Haversine distance)
  • Noise/outlier labeling is a native feature (points labeled as -1)
  • Preprocessing: identifying dense regions in feature space to guide feature engineering

The critical weakness: DBSCAN cannot handle clusters of varying density. A single $\epsilon$ that captures dense clusters will merge sparse ones, and vice versa.
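The core/border/noise logic above fits in a short brute-force sketch. This is a toy $O(n^2)$ illustration of the definitions, not a replacement for an indexed library implementation; the function name and synthetic blobs are hypothetical.

```python
import numpy as np

def dbscan_labels(X, eps=0.5, min_pts=4):
    """Brute-force DBSCAN sketch: full distance matrix, then cluster
    expansion from core points. Noise points keep label -1 (the common
    library convention)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        stack, labels[i] = [i], cid          # expand density-reachable set
        while stack:
            j = stack.pop()
            if not core[j]:
                continue                     # border points don't expand
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cid
                    stack.append(k)
        cid += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (20, 2)),   # dense blob 1
               rng.normal(5.0, 0.05, (20, 2)),   # dense blob 2
               [[10.0, 10.0]]])                  # isolated outlier
labels = dbscan_labels(X, eps=0.5, min_pts=4)    # two clusters + noise
```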

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| With spatial index (KD-Tree/Ball-Tree) | $O(n \log n)$ average | $O(n)$ |
| Brute-force (high-dimensional) | $O(n^2)$ | $O(n^2)$ distance matrix |

4.2 HDBSCAN

Theoretical Foundation

HDBSCAN (Campello et al., 2013) extends DBSCAN by eliminating the $\epsilon$ parameter entirely. The algorithm:

  1. Compute the mutual reachability distance: $d_{\text{mreach}}(\mathbf{x}_i, \mathbf{x}_j) = \max(\text{core}_k(\mathbf{x}_i), \text{core}_k(\mathbf{x}_j), d(\mathbf{x}_i, \mathbf{x}_j))$ where $\text{core}_k(\mathbf{x}) = d(\mathbf{x}, \mathbf{x}^{(k)})$ is the distance to the $k$-th nearest neighbor.

  2. Build the minimum spanning tree of the mutual reachability graph.

  3. Construct a cluster hierarchy (dendrogram) by removing edges in decreasing weight order.

  4. Condense the hierarchy using minimum cluster size $m_{\text{clSize}}$: components smaller than $m_{\text{clSize}}$ at a split become noise, not subclusters.

  5. Extract the optimal flat clustering by maximizing cluster stability: $S(C) = \sum_{\mathbf{x} \in C}(\lambda_{\text{death}}(\mathbf{x}) - \lambda_{\text{birth}}(C))$ where $\lambda = 1/\epsilon$.
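Step 1 of the pipeline above can be sketched directly; a toy NumPy computation of the mutual reachability matrix (brute-force, for illustration only — HDBSCAN libraries use spatial indexes):

```python
import numpy as np

def mutual_reachability(X, k=3):
    """Mutual reachability distance matrix (HDBSCAN step 1).

    core_k(x) is the distance to the k-th nearest neighbor; the mutual
    reachability distance lower-bounds each pairwise distance by both
    core distances, smoothing over density differences before the MST
    is built."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    core = np.sort(d, axis=1)[:, k]      # k-th NN (column 0 is self)
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))

X = np.array([[0.0], [1.0], [2.0], [10.0]])
M = mutual_reachability(X, k=2)
# The isolated point at 10 inflates every distance involving it,
# because its core distance dominates the max.
```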

Data Processing Application

HDBSCAN is the correct default choice for density-based clustering in production:

  • Handles varying-density clusters (fundamentally fixes DBSCAN's weakness)
  • Provides cluster membership probabilities (soft assignment)
  • The condensed tree is a rich multi-scale representation of data structure
  • Robust to noise — noisy points are explicitly identified, not forced into clusters

Applications: customer segmentation, document clustering, spatial event detection, anomaly detection (low-stability points).

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Mutual reachability distances | $O(n^2)$ brute-force; $O(n \log n)$ with spatial index | $O(n^2)$ or $O(n \cdot k)$ |
| MST (Prim's) | $O(n^2)$ or $O(E \log n)$ with sparse graph | $O(n)$ |
| Hierarchy + extraction | $O(n \log n)$ | $O(n)$ |
| Total (with KD-Tree) | $O(n \log n)$ best, $O(n^2)$ worst (high $p$) | $O(n \cdot k)$ |

4.3 Gaussian Mixture Models (GMM) — Expectation-Maximization

Theoretical Foundation

A GMM models data as a mixture of $K$ Gaussian components:

$$ p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k) $$

where $\pi_k$ are mixing weights ($\sum_k \pi_k = 1$), $\mu_k$ are means, and $\Sigma_k$ are covariance matrices.

The EM algorithm iterates:

E-step — compute responsibilities: $$ \gamma_{ik} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(\mathbf{x}_i \mid \mu_j, \Sigma_j)} $$

M-step — update parameters: $$ N_k = \sum_{i=1}^n \gamma_{ik}, \quad \mu_k = \frac{1}{N_k}\sum_{i=1}^n \gamma_{ik} \mathbf{x}_i $$

$$ \Sigma_k = \frac{1}{N_k}\sum_{i=1}^n \gamma_{ik}(\mathbf{x}_i - \mu_k)(\mathbf{x}_i - \mu_k)^T, \quad \pi_k = \frac{N_k}{n} $$

EM monotonically increases the log-likelihood $\sum_i \log p(\mathbf{x}_i)$ and converges to a local maximum. Model selection (choosing $K$) uses BIC:

$$ \text{BIC} = -2\log \mathcal{L} + k \log n $$

where $k$ is the number of free parameters: $k = K \cdot p + K \cdot p(p+1)/2 + K - 1$ for full covariance.
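A single EM iteration for the one-dimensional case can be sketched as follows (a toy NumPy illustration of the E- and M-step formulas with diagonal variances; the synthetic bimodal data is hypothetical):

```python
import numpy as np

def em_step(X, pi, mu, var):
    """One EM iteration for a 1-D GMM with K components."""
    # E-step: responsibilities gamma_{ik}
    dens = np.exp(-0.5 * (X[:, None] - mu[None, :])**2 / var) / np.sqrt(2 * np.pi * var)
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: soft counts N_k, then weights, means, variances
    Nk = gamma.sum(axis=0)
    mu_new = (gamma * X[:, None]).sum(axis=0) / Nk
    var_new = (gamma * (X[:, None] - mu_new[None, :])**2).sum(axis=0) / Nk
    return Nk / len(X), mu_new, var_new

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-3, 0.5, 300), rng.normal(3, 0.5, 300)])
pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    pi, mu, var = em_step(X, pi, mu, var)   # converges to the two modes
```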

Data Processing Application

  • Soft clustering (probabilistic assignment to each cluster)
  • Density estimation for anomaly detection (low-likelihood points are anomalies)
  • Modeling heterogeneous populations (e.g., multi-modal sensor distributions)
  • Initialization for more complex models (GMM priors in Bayesian methods)
  • Covariance regularization variants: full, tied, diag, spherical trade flexibility for stability

Algorithmic Complexity

| Phase | Time (per iteration) | Space |
|---|---|---|
| E-step | $O(n \cdot K \cdot p^2)$ (Mahalanobis distance) | $O(n \cdot K)$ |
| M-step | $O(n \cdot K \cdot p^2)$ (covariance updates) | $O(K \cdot p^2)$ |
| Total ($T$ iterations) | $O(T \cdot n \cdot K \cdot p^2)$ | $O(n \cdot K + K \cdot p^2)$ |

4.4 Agglomerative Hierarchical Clustering

Theoretical Foundation

Bottom-up clustering: each point starts as its own cluster. At each step, merge the two clusters minimizing a linkage criterion:

| Linkage | Formula | Properties |
|---|---|---|
| Single | $\min_{a \in A, b \in B} d(a, b)$ | Chaining effect. Finds elongated clusters. |
| Complete | $\max_{a \in A, b \in B} d(a, b)$ | Compact, spherical clusters. Sensitive to outliers. |
| Average | $\frac{1}{\lvert A\rvert\,\lvert B\rvert}\sum_{a \in A}\sum_{b \in B} d(a, b)$ | Compromise between single and complete linkage. |
| Ward | $\Delta(\text{SSE}) = \text{SSE}_{A \cup B} - \text{SSE}_A - \text{SSE}_B$ | Minimizes within-cluster variance. Produces balanced trees. |

Ward's method is equivalent to merging the pair that minimizes the increase in the total within-cluster sum of squares $\sum_k \sum_{i \in C_k} \|\mathbf{x}_i - \mu_k\|^2$.

Data Processing Application

  • Dendrograms for multi-scale cluster exploration (no need to pre-specify $K$)
  • Taxonomy construction (hierarchical relationships are the output, not a nuisance)
  • Connectivity-constrained clustering (only merge spatially adjacent clusters via a connectivity matrix)
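A minimal SciPy example of Ward linkage and dendrogram cutting (the two-blob dataset is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Ward linkage on two well-separated blobs; cutting the dendrogram at
# K=2 recovers them. `linkage` returns the (n-1) x 4 merge history.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(5, 0.3, (10, 2))])
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")   # flat clustering, K=2
```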

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Naive (full distance matrix) | $O(n^3)$ | $O(n^2)$ |
| With nearest-neighbor chain (Ward, complete, average) | $O(n^2)$ | $O(n^2)$ |
| Single linkage (via MST) | $O(n^2)$ or $O(n \log n)$ with spatial index | $O(n)$ |

4.5 Spectral Clustering

Theoretical Foundation

Spectral clustering reformulates the clustering problem as a graph partitioning problem. Given affinity matrix $W$ (e.g., $W_{ij} = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$):

  1. Compute degree matrix $D = \text{diag}(\sum_j W_{ij})$
  2. Form the normalized Laplacian: $L_{\text{sym}} = I - D^{-1/2}WD^{-1/2}$ (or $L_{\text{rw}} = I - D^{-1}W$)
  3. Compute the bottom-$K$ eigenvectors $U \in \mathbb{R}^{n \times K}$ of $L$
  4. Row-normalize $U$ and run $K$-means on the rows

The spectral embedding maps points into a space where cluster structure is linearly separable, even if the original clusters are non-convex. The number of connected components equals the multiplicity of eigenvalue 0 in the Laplacian; for approximately separated clusters, the bottom-$K$ eigenvectors capture this near-disconnection.
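Steps 1-3 can be sketched with a dense eigendecomposition (a toy NumPy illustration; production code would use the sparse path mentioned below, and the blob data is hypothetical):

```python
import numpy as np

def spectral_embedding(X, K=2, gamma=1.0):
    """Affinity matrix -> normalized Laplacian -> bottom-K eigenvectors,
    row-normalized (the input K-means would then operate on)."""
    sq = ((X[:, None, :] - X[None, :, :])**2).sum(-1)
    W = np.exp(-gamma * sq)
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    U = vecs[:, :K]                          # bottom-K eigenvectors
    return U / np.linalg.norm(U, axis=1, keepdims=True)

# Two far-apart blobs: in the embedding, each blob collapses to a single
# point on the unit circle, so the clusters are trivially separable.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (15, 2)), rng.normal(5, 0.1, (15, 2))])
U = spectral_embedding(X, K=2, gamma=1.0)
```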

Data Processing Application

  • Image segmentation (pixels as nodes, affinity from spatial proximity and intensity)
  • Community detection in graphs (social networks, citation networks)
  • Non-convex cluster shapes where $K$-means and GMMs fail

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Affinity matrix | $O(n^2 \cdot p)$ | $O(n^2)$ |
| Eigen decomposition (bottom-$K$) | $O(n^2 \cdot K)$ via Lanczos | $O(n \cdot K)$ |
| K-means on embedding | $O(n \cdot K^2 \cdot T_{\text{km}})$ | $O(n \cdot K)$ |

For large $n$: use sparse affinity (KNN graph), then scipy.sparse.linalg.eigsh or LOBPCG.


5) Deep Learning for Data


5.1 Feedforward Networks (MLP)

Theoretical Foundation

A multilayer perceptron with $L$ hidden layers computes:

$$ \mathbf{h}^{(0)} = \mathbf{x}, \quad \mathbf{h}^{(\ell)} = \sigma(W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}), \quad \hat{y} = W^{(L+1)} \mathbf{h}^{(L)} + \mathbf{b}^{(L+1)} $$

Activation functions:

  • ReLU: $\sigma(z) = \max(0, z)$. Sparse activation. Dead neuron problem.
  • GELU: $\sigma(z) = z \cdot \Phi(z)$. Smooth approximation to ReLU. Default in Transformers.
  • SiLU (Swish): $\sigma(z) = z \cdot \text{sigmoid}(z)$. Non-monotonic. Used in EfficientNet.

Loss functions:

  • Regression: MSE $= \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$; Huber loss for robustness
  • Binary classification: BCE $= -\frac{1}{n}\sum_i [y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]$
  • Multi-class: Cross-entropy $= -\frac{1}{n}\sum_i \sum_c y_{ic}\log \hat{y}_{ic}$

Optimization: Adam (Kingma & Ba, 2015) with adaptive learning rates: $$ m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 $$ $$ \theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} $$
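The Adam update can be sketched in a few lines; a toy NumPy implementation of the formulas above, applied to minimizing $f(\theta) = \theta^2$ (the quadratic objective and hyperparameters are illustrative):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * g            # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g**2         # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)              # bias correction (t starts at 1)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 starting from theta = 1.
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    g = 2 * theta                        # gradient of theta^2
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.01)
```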

Data Processing Application

  • Universal function approximator for tabular data (when tree methods plateau)
  • Entity embedding layers for categorical features (learned dense representations)
  • Multi-task learning (shared trunk with task-specific heads)
  • Recent work (TabNet, FT-Transformer) shows MLPs with proper regularization competitive with XGBoost on tabular benchmarks

Algorithmic Complexity

For layer widths $[p, h_1, h_2, \ldots, h_L, c]$:

| Phase | Time (per sample) | Space |
|---|---|---|
| Forward | $O(\sum_{\ell} h_\ell \cdot h_{\ell-1})$ | $O(\sum_\ell h_\ell)$ activations |
| Backward | $\approx 2\times$ forward | as forward + gradient buffers |
| Total training | $O(E \cdot n \cdot \sum h_\ell \cdot h_{\ell-1})$ | $O(\text{params} + \text{activations})$ |

5.2 Long Short-Term Memory (LSTM)

Theoretical Foundation

LSTM (Hochreiter & Schmidhuber, 1997) addresses the vanishing gradient problem in RNNs by introducing a gated cell state $\mathbf{c}_t$:

Forget gate: $\mathbf{f}_t = \sigma(W_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$

Input gate: $\mathbf{i}_t = \sigma(W_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$

Candidate cell: $\tilde{\mathbf{c}}_t = \tanh(W_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c)$

Cell state update: $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$

Output gate: $\mathbf{o}_t = \sigma(W_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$

Hidden state: $\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$

The cell state $\mathbf{c}_t$ acts as a "conveyor belt" — information flows across time steps with only multiplicative interactions (no matrix multiplications), enabling gradient flow over hundreds of steps.

Parameter count: $4(p \cdot h + h^2 + h)$ for input dim $p$, hidden dim $h$.
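The six equations above amount to one matrix-vector product plus elementwise gating. A minimal NumPy cell (weights packed as one matrix for clarity; the random weights and toy sequence are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One LSTM time step. W packs the four gates row-wise
    [forget; input; candidate; output], each of shape (h, p + h)."""
    hdim = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0*hdim:1*hdim])            # forget gate
    i = sigmoid(z[1*hdim:2*hdim])            # input gate
    c_tilde = np.tanh(z[2*hdim:3*hdim])      # candidate cell
    o = sigmoid(z[3*hdim:4*hdim])            # output gate
    c = f * c_prev + i * c_tilde             # "conveyor belt" update
    h = o * np.tanh(c)
    return h, c

p, hdim = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * hdim, p + hdim))
b = np.zeros(4 * hdim)
h, c = np.zeros(hdim), np.zeros(hdim)
for x in rng.normal(size=(10, p)):           # run a length-10 sequence
    h, c = lstm_cell(x, h, c, W, b)
```

The packed sizes recover the parameter count stated above: `W.size + b.size` equals $4(p \cdot h + h^2 + h)$.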

Data Processing Application

  • Time-series forecasting (sensor data, financial markers, demand prediction)
  • Sequential anomaly detection (learn normal sequence patterns, flag deviations)
  • Log sequence analysis (IT operations, event correlation)
  • Encoder-decoder architectures for sequence-to-sequence data transformation

Note: For most time-series tasks, Transformers and temporal CNNs have overtaken LSTMs in accuracy. LSTMs remain relevant for streaming inference where the hidden state provides a natural incremental computation.

Algorithmic Complexity

| Phase | Time (sequence length $T$) | Space |
|---|---|---|
| Forward (single sequence) | $O(T \cdot (p \cdot h + h^2))$ | $O(T \cdot h)$ (store states for BPTT) |
| BPTT | $O(T \cdot (p \cdot h + h^2))$ | $O(T \cdot h)$ |
| Truncated BPTT (window $\tau$) | $O(T \cdot (p \cdot h + h^2))$ fwd, $O(\tau \cdot (p \cdot h + h^2))$ bwd | $O(\tau \cdot h)$ |

5.3 Gated Recurrent Unit (GRU)

Theoretical Foundation

GRU (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state, and combining the forget and input gates into a single update gate:

Update gate: $\mathbf{z}_t = \sigma(W_z [\mathbf{h}_{t-1}, \mathbf{x}_t])$

Reset gate: $\mathbf{r}_t = \sigma(W_r [\mathbf{h}_{t-1}, \mathbf{x}_t])$

Candidate hidden: $\tilde{\mathbf{h}}_t = \tanh(W [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])$

Hidden state: $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$

Parameter count: $3(p \cdot h + h^2 + h)$ — 25% fewer parameters than LSTM.

Data Processing Application

  • Same as LSTM with empirically similar (sometimes better) performance on shorter sequences
  • Preferred when parameter budget is constrained (edge devices, embedded inference)
  • Streaming telemetry analysis where model size affects deployment cost

Algorithmic Complexity

Same asymptotic as LSTM but with a smaller constant factor (3 gates vs. 4).


5.4 Convolutional Neural Networks (CNN) on Structured Data

Theoretical Foundation

1D convolution applies a filter $\mathbf{w} \in \mathbb{R}^{k \times c_{\text{in}}}$ across consecutive positions:

$$ (\mathbf{x} * \mathbf{w})_t = \sum_{\tau=0}^{k-1} \sum_{c=1}^{c_{\text{in}}} w_{\tau,c} \cdot x_{t+\tau, c} $$

For 2D data (images, spectrograms, or tabular data reshaped as a matrix):

$$ (X * W)_{i,j} = \sum_{m=0}^{k_h-1}\sum_{n=0}^{k_w-1}\sum_{c} W_{m,n,c} \cdot X_{i+m, j+n, c} $$

Dilated (atrous) convolutions insert gaps of size $d-1$ between filter taps, expanding the receptive field to $k + (k-1)(d-1)$ without increasing parameters.
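A dilated causal 1-D convolution (the WaveNet-style variant, which looks backward in time rather than forward as in the formula above) can be sketched for a single channel; the helper name and toy filter are illustrative:

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """Dilated causal 1-D convolution (single channel): the output at t
    depends only on x[t], x[t-d], ..., x[t-(k-1)d], left-padded with
    zeros so the output has the same length as the input."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[tau] * xp[t + pad - tau * dilation] for tau in range(k))
        for t in range(len(x))
    ])

x = np.arange(8, dtype=float)
y = causal_conv1d(x, np.array([1.0, 1.0]), dilation=2)  # y[t] = x[t] + x[t-2]
```

With $k = 2$ and $d = 2$, the receptive field is $k + (k-1)(d-1) = 3$ time steps, matching the formula above.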

Data Processing Application

CNNs are not limited to images. On structured data:

  • 1D-CNN on time series: WaveNet-style dilated causal convolutions for temporal modeling (faster than RNNs, parallelizable)
  • 2D-CNN on tabular data: Reshape feature vectors into a 2D grid (e.g., by correlation-based ordering), then apply standard CV architectures. SuperTML and DeepInsight use this approach.
  • Spectral analysis: FFT of time series → spectrogram → 2D-CNN for frequency-domain pattern recognition
  • Log/event sequence classification: Treat tokenized log lines as 1D signals

Algorithmic Complexity

For input size $n$, $c_{\text{in}}$ channels, filter size $k$, $c_{\text{out}}$ filters:

| Phase | Time (per layer) | Space |
|---|---|---|
| Forward (1D) | $O(n \cdot k \cdot c_{\text{in}} \cdot c_{\text{out}})$ | $O(n \cdot c_{\text{out}})$ |
| Forward (2D, $H \times W$) | $O(H \cdot W \cdot k^2 \cdot c_{\text{in}} \cdot c_{\text{out}})$ | $O(H \cdot W \cdot c_{\text{out}})$ |

5.5 Attention Mechanisms & Transformers for Structured Data

Theoretical Foundation

Scaled Dot-Product Attention (Vaswani et al., 2017):

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

where $Q, K, V \in \mathbb{R}^{n \times d_k}$ are queries, keys, and values respectively. The $1/\sqrt{d_k}$ scaling prevents softmax saturation.

Multi-Head Attention runs $h$ parallel attention heads with different learned projections: $$ \text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O $$ $$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$

A transformer block = LayerNorm → MHA → Residual → LayerNorm → FFN → Residual.
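The attention formula itself is a few lines of NumPy; a single-head sketch with a numerically stable softmax (the random $Q, K, V$ are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, row-wise."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # rows sum to 1
    return A @ V, A

rng = np.random.default_rng(0)
L, d = 5, 8
Q, K, V = rng.normal(size=(3, L, d))               # three (L, d) matrices
out, A = scaled_dot_product_attention(Q, K, V)
```

The attention matrix `A` is the $O(L^2)$ object that dominates the memory profile in the complexity table below; multi-head attention simply runs this with $h$ different learned projections.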

For tabular data (FT-Transformer, TabTransformer):

  • Categorical features → learned embeddings
  • Numerical features → linear projection (or periodic encoding)
  • Feature tokens attend to each other via self-attention
  • A [CLS] token aggregates the feature tokens for prediction

Data Processing Application

  • Feature interaction modeling: Self-attention explicitly learns pairwise and higher-order feature interactions without manual engineering
  • Handling heterogeneous features: Embedding layers handle mixed types naturally
  • Time series with irregular sampling: Positional encodings can encode actual timestamps, handling gaps gracefully
  • Tabular data: FT-Transformer matches or beats XGBoost on some benchmarks, particularly with large $n$ and complex interactions

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Self-attention (sequence/features = $L$) | $O(L^2 \cdot d)$ | $O(L^2 + L \cdot d)$ |
| FFN block | $O(L \cdot d \cdot d_{\text{ff}})$ | $O(L \cdot d_{\text{ff}})$ |
| Full transformer ($N$ layers) | $O(N \cdot (L^2 d + L d \cdot d_{\text{ff}}))$ | $O(N \cdot (L^2 + L \cdot d_{\text{ff}}))$ |

For long sequences: linear attention variants (Performer, Linformer) reduce to $O(L \cdot d^2)$.


5.6 Temporal Fusion Transformers (TFT)

Theoretical Foundation

TFT (Lim et al., 2021) is purpose-built for multi-horizon time-series forecasting with heterogeneous inputs. Architecture components:

  1. Variable Selection Networks (VSN): Gated Residual Networks (GRN) with softmax attention over input features, learning which variables matter at each time step: $$ \text{VSN}(\mathbf{x}_t) = \sum_{j=1}^p v_j(t) \cdot \tilde{x}_j(t), \quad v_j(t) = \text{softmax}(\text{GRN}(\mathbf{x}_t))_j $$

  2. Static covariate encoders: Entity-level features (e.g., store ID) are encoded once and injected into temporal processing via context vectors.

  3. Temporal self-attention with a decoder mask (causal attention for future steps).

  4. Quantile output: Predicts quantiles $\hat{y}_{t+\tau}^{(q)}$ for each forecast horizon $\tau$, trained with quantile loss: $$ \mathcal{L}_q = \sum_{\tau} \sum_i \left[q \cdot (y_{i,t+\tau} - \hat{y}_{i,t+\tau}^{(q)})^+ + (1-q) \cdot (\hat{y}_{i,t+\tau}^{(q)} - y_{i,t+\tau})^+\right] $$
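The quantile (pinball) loss in step 4 can be verified in a few lines (a toy NumPy sketch; the sample values are illustrative):

```python
import numpy as np

def quantile_loss(y, y_hat, q):
    """Pinball loss for quantile level q: asymmetric penalty on
    under- vs. over-prediction."""
    diff = y - y_hat
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

y = np.array([1.0, 2.0, 3.0])
# Over-predicting by 1 is penalized heavily at low quantiles (q=0.1)
# and lightly at high quantiles (q=0.9):
lo = quantile_loss(y, y + 1.0, q=0.1)   # 0.9
hi = quantile_loss(y, y + 1.0, q=0.9)   # 0.1
```

This asymmetry is what pushes each output head toward its target quantile of the predictive distribution.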

Data Processing Application

  • Multi-step probabilistic time-series forecasting with exogenous variables
  • Demand forecasting (retail, energy) where known future inputs (holidays, promotions) exist
  • Interpretable attention weights reveal which time steps and features drive predictions

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Forward (lookback $T_{\text{enc}}$, horizon $T_{\text{dec}}$) | $O((T_{\text{enc}} + T_{\text{dec}})^2 \cdot d + p \cdot d)$ | $O((T_{\text{enc}} + T_{\text{dec}}) \cdot d)$ |

5.7 Neural Ordinary Differential Equations (Neural ODEs)

Theoretical Foundation

Neural ODEs (Chen et al., 2018) replace discrete hidden layers with a continuous dynamics parameterization:

$$ \frac{d\mathbf{h}(t)}{dt} = f_\theta(\mathbf{h}(t), t) $$

The hidden state at any time is obtained by solving the ODE: $\mathbf{h}(t_1) = \mathbf{h}(t_0) + \int_{t_0}^{t_1} f_\theta(\mathbf{h}(t), t) \, dt$, computed via adaptive ODE solvers (Dormand-Prince, RK45). Gradients are computed via the adjoint method, avoiding backpropagation through solver steps:

$$ \frac{d\mathcal{L}}{d\theta} = -\int_{t_1}^{t_0} \mathbf{a}(t)^T \frac{\partial f}{\partial \theta} dt, \quad \mathbf{a}(t) = \frac{d\mathcal{L}}{d\mathbf{h}(t)} $$

Memory cost is $O(1)$ in depth (only the final state and adjoint are stored, not intermediate activations).
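The forward pass can be sketched with SciPy's adaptive solver. Here a fixed linear map $A$ stands in for a trained $f_\theta$, and the adjoint backward pass is omitted; everything below is illustrative:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy "Neural ODE" forward pass: rotation dynamics h' = A h, whose
# exact solution from h0 = (1, 0) is h(t) = (cos t, -sin t).
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def f_theta(t, h):
    """Stand-in for the learned dynamics network f_theta(h, t)."""
    return A @ h

h0 = np.array([1.0, 0.0])
sol = solve_ivp(f_theta, (0.0, np.pi / 2), h0,
                method="RK45", rtol=1e-8, atol=1e-8)
h1 = sol.y[:, -1]        # the "layer output" h(t1), here ~(0, -1)
```

The solver chooses its own step count $S$; tightening `rtol`/`atol` (or stiffer dynamics) increases $S$ and hence runtime, which is the unpredictability noted in the complexity discussion below.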

Data Processing Application

  • Irregularly sampled time series: The ODE solver naturally handles non-uniform time gaps (no padding, no interpolation)
  • Physics-informed modeling: $f_\theta$ can encode known dynamics with learnable corrections
  • Continuous normalizing flows: Generative modeling with exact log-likelihood computation
  • Latent ODE models: Encode irregular sequences into a latent state, evolve with Neural ODE, decode at arbitrary query times

Algorithmic Complexity

| Phase | Time | Space |
|---|---|---|
| Forward (adaptive solver, $S$ steps) | $O(S \cdot \text{cost}(f_\theta))$ | $O(\text{params})$ (adjoint: constant in depth) |
| Backward (adjoint method) | $O(S \cdot \text{cost}(f_\theta))$ | $O(\text{params})$ |

$S$ is adaptive — stiff dynamics or tight tolerances increase $S$ dramatically. This makes training time hard to predict, which is exactly the kind of thing that delights no one in production.


6) Generative AI in Data Processing

Generative models in data engineering serve a specific, pragmatic purpose: manufacturing data that doesn't exist or augmenting data that is pathologically imbalanced. This is not about generating art — it is about generating rows.


6.1 Generative Adversarial Networks (GANs)

Theoretical Foundation

GANs (Goodfellow et al., 2014) train two networks in a minimax game:

Generator $G: \mathbf{z} \to \mathbf{x}_{\text{fake}}$ maps noise $\mathbf{z} \sim p_z$ to data space.

Discriminator $D: \mathbf{x} \to [0, 1]$ classifies real vs. generated samples.

$$ \min_G \max_D \, V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p_z}[\log(1 - D(G(\mathbf{z})))] $$

At the Nash equilibrium, $G$ generates samples indistinguishable from real data, and $D$ outputs 0.5 everywhere. In practice, GANs suffer from:

  • Mode collapse: $G$ generates a narrow subset of the data distribution
  • Training instability: The discriminator may overpower the generator (or vice versa), collapsing the gradient signal
  • Evaluation difficulty: No tractable likelihood; FID and IS are approximate metrics

Wasserstein GAN (WGAN) addresses instability by replacing the JS divergence with the Earth Mover distance: $$ W(p_{\text{data}}, p_G) = \sup_{\|f\|_L \leq 1} \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim p_G}[f(\mathbf{x})] $$

enforcing the Lipschitz constraint via gradient penalty (WGAN-GP) or spectral normalization.

Data Processing Application

  • Synthetic data generation for privacy-preserving data sharing (generate statistically equivalent datasets without exposing individual records)
  • Data augmentation for severely imbalanced classes (generate minority samples)
  • Simulation of rare events (fraud, equipment failure, natural disasters) for model training

Algorithmic Complexity

| Phase | Time (per iteration) | Space |
|---|---|---|
| Generator forward + backward | $O(B \cdot \text{params}_G)$ | $O(\text{params}_G + B \cdot p)$ |
| Discriminator forward + backward | $O(B \cdot \text{params}_D)$ | $O(\text{params}_D + B \cdot p)$ |
| Total ($E$ epochs, $n/B$ batches) | $O(E \cdot n \cdot (\text{params}_G + \text{params}_D))$ | $O(\text{params}_G + \text{params}_D)$ |

6.2 Conditional GANs (CTGAN) for Tabular Data

Theoretical Foundation

CTGAN (Xu et al., 2019) addresses the unique challenges of tabular data generation:

  1. Mode-specific normalization: Continuous columns are modeled as mixtures of Gaussians (via VGM — Variational Gaussian Mixture). Each value is represented as a (mode, normalized_value) pair, preventing mode collapse across multi-modal distributions.

  2. Conditional generator: The generator is conditioned on a one-hot vector indicating which column/category to generate, with a training-by-sampling strategy that ensures balanced representation of all category values: $$ G(\mathbf{z}, \mathbf{cond}) \to \mathbf{x}_{\text{fake}} $$

  3. PacGAN discriminator: Operates on "pacs" of multiple generated rows simultaneously to detect mode collapse.

Data Processing Application

CTGAN is the primary method for synthetic tabular data generation:

  • Generating realistic training data when original data is under privacy restrictions (GDPR, HIPAA)
  • Augmenting minority classes in imbalanced tabular datasets
  • Creating realistic test/staging datasets for data pipeline development
  • Statistical disclosure control (releasing synthetic data instead of anonymized real data)

Algorithmic Complexity

Same as standard GAN with additional overhead for VGM fitting ($O(n \cdot K_{\text{modes}})$ per continuous column) and conditional vector sampling ($O(C)$ per batch).


6.3 Variational Autoencoders (VAE) for Data Augmentation

Theoretical Foundation

See Section 2.7 for the full VAE formulation. For data augmentation specifically:

  1. Train VAE on the underrepresented class $C_{\text{min}}$
  2. Sample $\mathbf{z} \sim q_\phi(\mathbf{z} | \mathbf{x})$ for observed minority samples → interpolate in latent space
  3. Decode: $\hat{\mathbf{x}} = g_\theta(\mathbf{z})$ → generate new synthetic minority samples
  4. Alternatively: sample $\mathbf{z} \sim \mathcal{N}(0, I)$ from the prior → decode → filter via discriminative quality check
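The latent interpolation in step 2 reduces to convex combinations of encoded points; a minimal sketch (the `decode` step is omitted, and all values here are hypothetical stand-ins for a trained encoder's outputs):

```python
import numpy as np

def interpolate_latents(z_a, z_b, n=5):
    """Convex combinations between two latent codes; decoding each row
    with the trained decoder g_theta would yield the synthetic samples."""
    lams = np.linspace(0.0, 1.0, n)
    return np.array([(1 - lam) * z_a + lam * z_b for lam in lams])

z_a, z_b = np.array([0.0, 1.0]), np.array([2.0, -1.0])
Z = interpolate_latents(z_a, z_b, n=5)   # endpoints are z_a and z_b
```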

Data Processing Application

VAEs produce smoother, less artifact-prone synthetic data than GANs due to the continuous, regularized latent space. Preferred when:

  • Data fidelity is less critical than latent space regularity (e.g., for downstream model training, not privacy)
  • The data has meaningful continuous structure in latent space
  • Stable training is prioritized (VAEs converge reliably; GANs may not)

Algorithmic Complexity

Same as Section 2.7. Augmentation adds $O(m \cdot d)$ for sampling $m$ new points from the latent space and $O(m \cdot \text{decoder_cost})$ for decoding.


6.4 Diffusion Models for Tabular Synthesis

Theoretical Foundation

Denoising Diffusion Probabilistic Models (DDPM, Ho et al., 2020) define a forward process that gradually adds Gaussian noise over $T$ steps:

$$ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t I) $$

and a learnable reverse process:

$$ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \sigma_t^2 I) $$

The training objective reduces to predicting the noise:

$$ \mathcal{L} = \mathbb{E}_{t, \mathbf{x}_0, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(\mathbf{x}_t, t)\right\|^2\right] $$

where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$.

For tabular data (TabDDPM, Kotelnikov et al., 2023): continuous features use Gaussian diffusion; categorical features use multinomial diffusion (discrete state transitions via transition matrices). The denoiser is typically an MLP with time embedding.
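The closed-form forward process is directly computable; a NumPy sketch of the noising step used to build training pairs (the linear schedule and toy $\mathbf{x}_0$ are illustrative):

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng=None):
    """x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, in closed form."""
    rng = rng or np.random.default_rng(0)
    abar = np.cumprod(1.0 - betas)            # \bar{alpha}_t
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps
    return x_t, eps                           # eps is the denoiser's target

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear noise schedule
x0 = np.array([1.0, -2.0, 0.5])
x_early, e_early = forward_diffuse(x0, t=0, betas=betas)      # near x0
x_late, e_late = forward_diffuse(x0, t=T - 1, betas=betas)    # near pure noise
```

The denoiser $\epsilon_\theta$ is trained to recover `eps` from `(x_t, t)`; by the final step the signal content $\bar{\alpha}_T$ is negligible.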

Data Processing Application

  • Emerging as superior to GANs for tabular synthesis (better density estimation, fewer mode collapse artifacts)
  • Privacy-preserving synthetic data (differential privacy guarantees can be integrated into the training)
  • Data augmentation with controllable diversity (noise schedule controls generation fidelity vs. novelty trade-off)

Algorithmic Complexity

Phase Time Space
Training (per step) $O(B \cdot \text{params}_\theta)$ $O(\text{params}_\theta + B \cdot p)$
Sampling ($T$ denoising steps) $O(T \cdot B \cdot \text{params}_\theta)$ $O(B \cdot p)$
DDIM (accelerated, $S \ll T$ steps) $O(S \cdot B \cdot \text{params}_\theta)$ $O(B \cdot p)$

Sampling is the bottleneck — $T = 1000$ denoising steps is standard. DDIM reduces this to $\sim 50$ steps with minimal quality loss.


6.5 SMOTE and Its Neural Successors

Theoretical Foundation

SMOTE (Chawla et al., 2002) generates synthetic minority samples by interpolating between existing minority class samples and their k nearest minority-class neighbors:

$$ \mathbf{x}_{\text{new}} = \mathbf{x}_i + \lambda \cdot (\mathbf{x}_{\text{nn}} - \mathbf{x}_i), \quad \lambda \sim \text{Uniform}(0, 1) $$

Variants address specific failure modes: - Borderline-SMOTE: Only oversamples points near the decision boundary - ADASYN: Weights oversampling toward harder-to-learn regions (density-adaptive) - SMOTE-ENN: Combines SMOTE with Edited Nearest Neighbors (cleans overlapping regions post-synthesis)

Neural successors replace linear interpolation with learned latent-space interpolation (DEGAN, DeepSMOTE). This mitigates a structural limitation of SMOTE: synthetic samples always lie on segments between existing minority points, and thus never leave the convex hull of the minority class — useless if the minority manifold is non-convex.
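For contrast, vanilla SMOTE itself fits in a dozen lines. A minimal NumPy sketch with brute-force KNN on toy data (production use would go through imbalanced-learn's SMOTE class instead):

```python
import numpy as np

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 3))  # toy minority-class samples, p = 3

def smote_sample(X, k, m, rng):
    """Generate m synthetic points via SMOTE interpolation (brute-force KNN)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-matches
    nn_idx = np.argsort(d2, axis=1)[:, :k]               # k nearest neighbours
    new = np.empty((m, X.shape[1]))
    for j in range(m):
        i = rng.integers(len(X))                # random minority sample
        nn = X[rng.choice(nn_idx[i])]           # random one of its k neighbours
        lam = rng.uniform()                     # lambda ~ Uniform(0, 1)
        new[j] = X[i] + lam * (nn - X[i])
    return new

X_syn = smote_sample(X_min, k=5, m=30, rng=rng)
```

Every synthetic point lies on a segment between two existing minority samples, i.e. inside their convex hull, which is exactly the limitation the neural variants address.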

Data Processing Application

  • The standard first-line approach for class imbalance in tabular data
  • Upstream of any classifier (applied only to training data, never to validation/test)
  • Combined with undersampling of the majority class (SMOTE + Tomek links)
  • Being gradually supplanted by CTGAN/TabDDPM for complex distributions, but remains the fastest and simplest option for moderate imbalance

Algorithmic Complexity

Phase Time Space
KNN computation $O(n_{\text{min}}^2 \cdot p)$ brute-force $O(n_{\text{min}} \cdot p)$
Synthesis ($m$ new points) $O(m \cdot p)$ $O(m \cdot p)$

Phase 2 — The Production Ecosystem (Python and C#)

Theory is non-negotiable. But theory without a framework is a whiteboard exercise. This section maps the Phase 1 taxonomy to the concrete libraries that make production deployment possible.


7) Python Ecosystem

Python's dominance is infrastructure, not linguistic merit. It is a thin orchestration layer atop C/C++/CUDA kernels. The following frameworks represent the load-bearing walls of that infrastructure.


7.1 PyTorch

Primary Use Case: Dynamic computational graphs for research and production deep learning.

Architectural Philosophy: Imperative (eager) execution by default. Define-by-run: the graph is constructed on-the-fly during the forward pass. torch.compile() (TorchDynamo + TorchInductor) now bridges the gap to graph-mode performance via JIT tracing.

Hardware Acceleration: - CUDA: First-class support via torch.cuda. All tensor operations dispatch to cuBLAS/cuDNN when tensors reside on GPU. - ROCm: AMD GPU support via HIP translation layer. - MPS: Apple Silicon (M1/M2/M3) via Metal Performance Shaders backend. - XLA: Via torch_xla for TPU execution (experimental).

Key Components for Data Engineering:

Component Role
torch.nn Full neural network layer library (Linear, Conv, LSTM, Transformer, etc.)
torch.optim Optimizers (Adam, AdamW, SGD, LBFGS)
torch.utils.data.DataLoader Parallel data loading with prefetching, batching, and custom collation
torchvision.transforms Image augmentation pipeline (composable, GPU-accelerated via torchvision 0.16+)
torch.distributed Multi-GPU and multi-node training (DDP, FSDP, RPC)
torch.export + torch.compile Production graph export and ahead-of-time compilation
torch.ao.quantization Post-training and quantization-aware training (INT8/FP16)

Relevance to Phase 1: Implements MLPs, LSTMs, GRUs, CNNs, Transformers, VAEs, GANs, Neural ODEs (torchdiffeq), diffusion models, and any custom architecture.

import torch
import torch.nn as nn

# Tabular Transformer-style model
class TabularTransformer(nn.Module):
    def __init__(self, num_features, d_model=64, nhead=4, num_layers=3, num_classes=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)  # Per-feature embedding
        self.pos_enc = nn.Parameter(torch.randn(1, num_features, d_model))
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (batch, num_features) → (batch, num_features, 1) → (batch, num_features, d_model)
        x = self.embed(x.unsqueeze(-1)) + self.pos_enc
        x = self.transformer(x)
        return self.head(x.mean(dim=1))  # Mean-pool over features

7.2 TensorFlow / Keras

Primary Use Case: Production-ready, graph-optimized deep learning with emphasis on deployment and serving.

Architectural Philosophy: Graph-first execution (TF 1.x heritage). TF 2.x defaults to eager mode with tf.function for graph compilation. Keras is the high-level API, now the sole official interface.

Hardware Acceleration: - CUDA: Native GPU support via cuDNN/cuBLAS; TensorRT integration (TF-TRT) for inference optimization. - TPU: First-class support via tf.distribute.TPUStrategy (TensorFlow's raison d'être for Google Cloud). - TF Lite: Mobile/edge deployment (INT8 quantization, delegate to GPU/NPU). - TF.js: Browser inference via WebGL/WASM.

Key Components for Data Engineering:

Component Role
tf.data High-performance input pipeline (prefetch, interleave, cache, parallel map)
tf.keras.layers Standard layer library + preprocessing layers (Normalization, CategoryEncoding)
tf.distribute Multi-GPU/TPU strategies (MirroredStrategy, MultiWorkerMirroredStrategy)
TensorFlow Serving Production model serving via gRPC/REST
TF Extended (TFX) End-to-end ML pipeline orchestration (ExampleGen → Transform → Trainer → Pusher)
TF Probability Bayesian layers, variational inference, distributions

Relevance to Phase 1: Same model coverage as PyTorch. TFX specifically addresses the data engineering lifecycle (data validation, feature engineering, model analysis).
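A minimal tf.data pipeline illustrating the prefetch/parallel-map pattern from the table above (toy in-memory data; a real pipeline would read TFRecord or Parquet sources):

```python
import tensorflow as tf

# Toy source: the integers 0..9 as float32 scalars
ds = tf.data.Dataset.from_tensor_slices(tf.range(10, dtype=tf.float32))

# Map in parallel, batch, and prefetch the next batch while the current one trains
ds = (ds.map(lambda x: x * 2.0, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(4)
        .prefetch(tf.data.AUTOTUNE))

batches = [b.numpy().tolist() for b in ds]
```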


7.3 JAX

Primary Use Case: Functional, composable numerical computing with automatic differentiation and hardware compilation.

Architectural Philosophy: JAX is not a deep learning framework — it is a NumPy replacement with four composable transformations:

Transformation Purpose
jax.grad Reverse-mode automatic differentiation (forward mode via jax.jacfwd)
jax.jit Just-in-time compilation to XLA (fuses operations, optimizes for hardware)
jax.vmap Automatic vectorization (batch dimension added without explicit loops)
jax.pmap Parallel map across devices (multi-GPU/TPU SPMD)

JAX arrays are immutable. There is no in-place mutation. This functional purity enables aggressive compiler optimizations but requires a paradigm shift for imperative programmers.
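A two-line sketch of the functional-update idiom that replaces in-place mutation (the .at[...].set(...) accessor returns a new array):

```python
import jax.numpy as jnp

x = jnp.zeros(3)
# x[0] = 1.0 would raise a TypeError: JAX arrays are immutable
y = x.at[0].set(1.0)  # functional update, returns a NEW array; x is untouched
```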

Frameworks built on JAX: Flax (neural networks), Optax (optimizers), Haiku (DeepMind), Equinox (Pythonic modules).

Hardware Acceleration: - TPU: Native XLA compilation — JAX was born at Google for TPU workloads. - CUDA: Via XLA CUDA backend. - CPU: XLA optimizes for CPU as well (SIMD vectorization).

Relevance to Phase 1: JAX excels for custom research implementations (Neural ODEs via diffrax, normalizing flows, novel architectures). Its vmap eliminates manual batching and its jit eliminates Python overhead.

import jax
import jax.numpy as jnp
from jax import grad, jit, vmap

# Pure-functional linear regression with L2 regularization
def loss_fn(params, X, y, lam=0.01):
    pred = X @ params['w'] + params['b']
    return jnp.mean((pred - y) ** 2) + lam * jnp.sum(params['w'] ** 2)

# JIT-compiled gradient computation
grad_fn = jit(grad(loss_fn))

# vmap: vectorize over a batch of different regularization strengths
batched_loss = vmap(loss_fn, in_axes=(None, None, None, 0))

7.4 Scikit-Learn

Primary Use Case: Classical ML algorithms, preprocessing, model selection, and evaluation.

Architectural Philosophy: The Estimator API: every algorithm implements fit(), predict(), and optionally transform(). Pipelines (Pipeline, ColumnTransformer) compose preprocessing and modeling into a single object with coordinated fit/transform/predict.

Hardware Acceleration: CPU-only natively. Parallelism via joblib (thread or process). Intel acceleration via scikit-learn-intelex (drop-in patching of KNN, SVM, PCA, Random Forest to use Intel oneDAL).

Key Components for Data Engineering:

Component Phase 1 Methods
sklearn.impute MICE (IterativeImputer), KNN imputation (KNNImputer)
sklearn.preprocessing StandardScaler, RobustScaler, QuantileTransformer, PowerTransformer
sklearn.decomposition PCA, KernelPCA, SparsePCA, NMF, TruncatedSVD
sklearn.ensemble RandomForest, GradientBoosting, IsolationForest
sklearn.svm SVC, OneClassSVM, SVR
sklearn.cluster DBSCAN, AgglomerativeClustering, SpectralClustering, KMeans
sklearn.neighbors LocalOutlierFactor, KNeighborsClassifier
sklearn.manifold t-SNE
sklearn.mixture GaussianMixture
sklearn.linear_model ElasticNet, Lasso, Ridge, LogisticRegression
sklearn.feature_extraction FeatureHasher, DictVectorizer
sklearn.model_selection GridSearchCV, cross_val_score, TimeSeriesSplit
sklearn.pipeline Pipeline, ColumnTransformer
sklearn.metrics Every conceivable metric

from sklearn.pipeline import Pipeline
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import IsolationForest

# Full preprocessing pipeline: impute → scale → detect anomalies
pipeline = Pipeline([
    ('imputer', IterativeImputer(max_iter=10, random_state=42)),
    ('scaler', RobustScaler()),
    ('anomaly', IsolationForest(n_estimators=200, contamination=0.05))
])

# fit_predict on raw data with missing values and outliers
labels = pipeline.fit_predict(X_raw)  # -1 = anomaly, 1 = normal

7.5 XGBoost / LightGBM / CatBoost (Native Libraries)

Each provides both a native API and a Scikit-Learn compatible wrapper.

Library Python Package Key Differentiator GPU Support
XGBoost xgboost Exact + histogram split. Broadest ecosystem. tree_method='gpu_hist' (CUDA)
LightGBM lightgbm GOSS + EFB. Fastest training on large data. device='gpu' (CUDA, OpenCL)
CatBoost catboost Native categoricals. Ordered boosting. Fastest CPU inference. task_type='GPU' (CUDA)

All three support: - Early stopping on validation loss - Feature importance (gain, split count, SHAP values) - Custom loss functions (first and second derivatives) - Distributed training (Spark, Dask, Ray integrations) - ONNX export for cross-platform inference

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    cat_features=['city', 'device_type', 'os'],  # No encoding needed
    auto_class_weights='Balanced',
    task_type='GPU',
    verbose=100
)
model.fit(X_train, y_train, eval_set=(X_val, y_val), early_stopping_rounds=50)

7.6 High-Performance Data Backends — Polars & cuDF

Polars

Aspect Detail
Language Rust core, Python bindings via PyO3
Architecture Apache Arrow columnar memory format. Lazy evaluation with query optimization.
Parallelism Automatic multithreading via Rayon. No GIL dependency.
Key Advantages 5-50x faster than Pandas for groupby/join/filter. Streaming mode for out-of-core data. Expression-based API.

import polars as pl

df = pl.scan_parquet("telemetry/*.parquet")  # Lazy frame — no data loaded yet
result = (
    df.filter(pl.col("temperature") > 80)
      .group_by("sensor_id")
      .agg([
          pl.col("vibration").mean().alias("mean_vib"),
          pl.col("failure").sum().alias("total_failures")
      ])
      .sort("total_failures", descending=True)
      .collect()  # Execution happens here — one optimized pass
)

cuDF (RAPIDS)

Aspect Detail
Architecture GPU-accelerated DataFrame library. Pandas-like API backed by CUDA kernels.
Integration Part of RAPIDS ecosystem (cuML for GPU ML, cuGraph for graph analytics).
Key Advantages 10-100x speedup over Pandas for large datasets that fit in GPU memory. Zero-copy interchange with PyTorch/TensorFlow via __cuda_array_interface__.

7.7 Supplementary Ecosystem

Library Purpose Phase 1 Coverage
UMAP (umap-learn) UMAP implementation Section 2.5
HDBSCAN (hdbscan) Hierarchical density clustering Section 4.2
imbalanced-learn (imblearn) SMOTE, ADASYN, undersampling Section 6.5
SDV (sdv) Synthetic Data Vault (CTGAN, TVAE, Copulas) Section 6.2
torchdiffeq Neural ODE solvers for PyTorch Section 5.7
diffrax JAX-based differential equation solvers Section 5.7
PyTorch Forecasting TFT, N-BEATS, DeepAR implementations Section 5.6
SHAP (shap) Shapley-value feature explanations All supervised methods
Optuna Bayesian hyperparameter optimization All methods
MLflow Experiment tracking, model registry, deployment All methods
ONNX Runtime (onnxruntime) Cross-platform model inference All methods
Ray (ray[tune], ray[train]) Distributed training and hyperparameter tuning All methods
DVC Data version control (large dataset tracking in ML pipelines) Data preprocessing

8) C# / .NET Ecosystem

The .NET ecosystem for ML is smaller, younger, and frankly less glamorous than Python's. But it fills a critical niche: embedding ML directly into high-performance backend services without the latency, operational complexity, and dependency nightmares of cross-language serving.


8.1 ML.NET

Primary Use Case: Classical ML and AutoML within .NET applications.

Architectural Philosophy: ML.NET is built around the IDataView abstraction — a lazy, cursorable, schematized data pipeline inspired by database query engines. Data flows through a chain of estimators (IEstimator<ITransformer>) that produce transformers (ITransformer):

IDataView → Estimator.Fit() → Transformer → IDataView (transformed)

This is Scikit-Learn's Pipeline philosophy, implemented with .NET's type safety.

Key Capabilities:

Component Coverage
ML.NET AutoML Automated model selection + hyperparameter tuning. Supports classification, regression, ranking, recommendation, forecasting, image classification, object detection. Uses cost-frugal optimization (FLAML-style).
Transforms Normalization, one-hot encoding, feature hashing, text featurization (TF-IDF, word embeddings), image loading/resizing, missing value replacement, key-to-value mapping.
Trainers LightGBM (native integration), FastTree, SDCA (Stochastic Dual Coordinate Ascent), L-BFGS logistic regression, averaged perceptron, OLS, PCA, K-Means, anomaly detection (randomized PCA, time-series spike detection).
Time Series Spike detection (IID and SSA), change point detection, forecasting (SSA — Singular Spectrum Analysis).
Deep Learning Via ONNX Runtime integration (load PyTorch/TensorFlow models exported to ONNX). Also image classification via transfer learning (ResNet, InceptionV3).
Model Explainability Permutation Feature Importance (PFI)

Hardware Acceleration: CPU-bound natively. GPU via ONNX Runtime (CUDA execution provider). TorchSharp integration for native GPU model execution.

using Microsoft.ML;
using Microsoft.ML.AutoML;

var context = new MLContext(seed: 42);

// Load data via IDataView (lazy — no materialization until needed)
IDataView data = context.Data.LoadFromTextFile<SensorReading>(
    "telemetry.csv", hasHeader: true, separatorChar: ',');

// AutoML: let the framework find the best pipeline
var experiment = context.Auto()
    .CreateBinaryClassificationExperiment(maxExperimentTimeInSeconds: 300)
    .Execute(data, labelColumnName: "FailureStatus");

// Best model is immediately usable
var bestModel = experiment.BestRun.Model;
var engine = context.Model.CreatePredictionEngine<SensorReading, FailurePrediction>(bestModel);
var prediction = engine.Predict(new SensorReading { Temperature = 87.5f, Vibration = 0.42f });

8.2 TorchSharp

Primary Use Case: PyTorch bindings for .NET — write PyTorch models in C#/F#.

Architectural Philosophy: TorchSharp wraps libtorch (PyTorch's C++ backend) via P/Invoke. The C# API mirrors PyTorch's Python API closely: torch.Tensor, torch.nn.Module, torch.optim, etc.

Hardware Acceleration: Full CUDA support (same libtorch backend as Python PyTorch). Install TorchSharp-cuda-windows or TorchSharp-cuda-linux NuGet packages.

Key Capabilities: - Define, train, and infer neural networks entirely in C# - Share models with Python PyTorch (save/load state_dict) - GPU tensor operations at native speed - Integration with ML.NET via Microsoft.ML.TorchSharp (transfer learning, text classification, object detection trainers)

using TorchSharp;
using static TorchSharp.torch;
using static TorchSharp.torch.nn;

public class TabularNet : Module<Tensor, Tensor>
{
    private readonly Module<Tensor, Tensor> _layers;

    public TabularNet(int inputDim, int hiddenDim, int outputDim) : base("TabularNet")
    {
        _layers = Sequential(
            Linear(inputDim, hiddenDim),
            ReLU(),
            Dropout(0.3),
            Linear(hiddenDim, hiddenDim),
            ReLU(),
            Dropout(0.3),
            Linear(hiddenDim, outputDim)
        );
        RegisterComponents();
    }

    public override Tensor forward(Tensor x) => _layers.call(x);
}

// Training on GPU
var device = cuda.is_available() ? CUDA : CPU;
var model = new TabularNet(inputDim: 50, hiddenDim: 128, outputDim: 2).to(device);
var optimizer = optim.Adam(model.parameters(), lr: 1e-3);
var criterion = CrossEntropyLoss();

8.3 TensorFlow.NET

Primary Use Case: TensorFlow bindings for .NET (part of SciSharp).

Architectural Philosophy: Wraps the TensorFlow C API. Provides tf.Session, tf.Graph, eager execution, and Keras-style sequential/functional model APIs.

Hardware Acceleration: CUDA via TensorFlow's native GPU support. Install SciSharp.TensorFlow.Redist-Windows-GPU NuGet.

Key Capabilities: - Full TensorFlow API available in C# - Keras-style model building (keras.Sequential, keras.layers) - TensorBoard logging from .NET - Pre-trained model loading (SavedModel format)

Practical Status: Less actively maintained than TorchSharp. For new .NET deep learning projects, TorchSharp is the recommended path. TensorFlow.NET is relevant for teams with existing TensorFlow model investments.


8.4 SciSharp Stack

The SciSharp organization provides .NET ports of the core Python numerical stack:

Library .NET Equivalent Purpose
NumPy NumSharp N-dimensional array operations
Pandas Pandas.NET DataFrame operations (experimental)
Matplotlib ScottPlot / Plotly.NET Visualization
SciPy MathNet.Numerics Linear algebra, distributions, optimization
TensorFlow TensorFlow.NET Deep learning

MathNet.Numerics deserves special mention: it provides production-grade implementations of: - Dense/sparse matrix operations (MKL-backed via MathNet.Numerics.MKL) - SVD, QR, Cholesky decompositions (relevant for PCA, linear regression) - Statistical distributions and hypothesis tests - Numerical integration and root finding

using MathNet.Numerics.LinearAlgebra;

// PCA via SVD using MathNet
var X = Matrix<double>.Build.DenseOfArray(rawData); // n × p
var centered = X.SubtractRowMeans();
var svd = centered.Svd(computeVectors: true);

// Top-d principal components (truncated SVD)
int d = 10;
var Z = centered.Multiply(svd.VT.SubMatrix(0, d, 0, svd.VT.ColumnCount).Transpose());

8.5 Accord.NET (Legacy)

Status: Effectively archived. Last significant update was pre-.NET Core. Included for completeness as it was historically the only comprehensive ML library for .NET.

Capabilities: Full implementations of SVM, decision trees, random forests, neural networks, KNN, PCA, K-Means, GMM, naive Bayes, HMMs, image processing, and statistical tests — all in pure C# with no native dependencies.

Modern Recommendation: Migrate to ML.NET (for classical ML), TorchSharp (for deep learning), and MathNet.Numerics (for linear algebra). Accord.NET's pure-managed implementations cannot compete with the native-backed alternatives on performance.


8.6 Production Integration Patterns

The core value proposition of the .NET ML ecosystem is in-process inference — no HTTP boundary, no serialization overhead, no Python process to babysit.

Pattern 1: ONNX Runtime in ASP.NET Core Middleware

Train in Python (PyTorch/TensorFlow/Scikit-Learn) → export to ONNX → load in ASP.NET Core via Microsoft.ML.OnnxRuntime:

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public class AnomalyDetectionMiddleware
{
    private readonly InferenceSession _session;

    public AnomalyDetectionMiddleware(string modelPath)
    {
        var options = new SessionOptions();
        options.AppendExecutionProvider_CUDA(deviceId: 0); // GPU inference
        _session = new InferenceSession(modelPath, options);
    }

    public float[] Score(float[] features)
    {
        var tensor = new DenseTensor<float>(features, new[] { 1, features.Length });
        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input", tensor)
        };

        using var results = _session.Run(inputs);
        return results.First().AsTensor<float>().ToArray();
    }
}

Pattern 2: ML.NET PredictionEngine in gRPC Service

// ASP.NET Core gRPC service with ML.NET prediction engine (thread-safe via pooling)
public class PredictionService : Predictor.PredictorBase
{
    private readonly PredictionEnginePool<InputData, OutputPrediction> _pool;

    public PredictionService(PredictionEnginePool<InputData, OutputPrediction> pool)
        => _pool = pool;

    public override Task<PredictionReply> Predict(PredictionRequest request, ServerCallContext ctx)
    {
        var input = MapToInput(request);
        var result = _pool.Predict(input);
        return Task.FromResult(new PredictionReply { Score = result.Score });
    }
}

Pattern 3: Cross-Language Pipeline (Django/FastAPI → .NET Microservice)

For teams maintaining both Python and C# backends:

┌──────────────┐    gRPC/REST    ┌─────────────────────┐
│  Django App  │ ──────────────→ │  ASP.NET Core ML    │
│  (Python)    │                 │  Microservice       │
│              │ ←────────────── │  (ONNX/ML.NET)      │
│  - Data ETL  │   predictions   │  - Model inference  │
│  - Feature   │                 │  - Low latency      │
│    eng.      │                 │  - GPU acceleration │
└──────────────┘                 └─────────────────────┘

This pattern is common in enterprises where data science teams operate in Python but serving infrastructure is .NET. The ONNX format is the lingua franca bridging the gap.


9) Cross-Language Interop & Serving Architecture

Serving Method Latency Complexity Language Agnostic
ONNX Runtime (in-process) μs–ms Low Yes (C#, C++, Java, Python, JS)
TorchServe (Python service) ms–10ms Medium Via HTTP/gRPC
TF Serving (C++ binary) ms Medium Via HTTP/gRPC
Triton Inference Server (NVIDIA) μs–ms High ONNX, TensorRT, PyTorch, TF
BentoML (Python framework) ms Medium Via containerized REST/gRPC
MLflow Serving ms Low Via REST
Custom gRPC (any language) μs–ms High By definition

Model Format Interop:

PyTorch (.pt) ──→ ONNX (.onnx) ──→ ONNX Runtime (C#/C++/Python)
                                  ──→ TensorRT (.engine) for NVIDIA GPUs
                                  ──→ Core ML (.mlmodel) for Apple
                                  ──→ TF Lite (.tflite) for mobile

TensorFlow (SavedModel) ──→ ONNX (.onnx) ──→ same targets
                         ──→ TF Lite
                         ──→ TF.js

Scikit-Learn (.pkl) ──→ ONNX (via skl2onnx)
                     ──→ PMML (legacy)

XGBoost/LightGBM ──→ ONNX (via onnxmltools)
                  ──→ Treelite (C compilation of tree models for ultra-low-latency)

ONNX is the universal serialization format. If your model cannot be exported to ONNX, your architecture has a problem. If your serving infrastructure cannot load ONNX, your operations team has a problem.


References

  1. Azur, M. J. et al. (2011). Multiple imputation by chained equations: what is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40-49.
  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  3. Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32.
  4. Breunig, M. M. et al. (2000). LOF: Identifying Density-Based Local Outliers. SIGMOD Record, 29(2), 93-104.
  5. Campello, R. J. G. B. et al. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. PAKDD 2013.
  6. Chawla, N. V. et al. (2002). SMOTE: Synthetic Minority Over-sampling Technique. JAIR, 16, 321-357.
  7. Chen, R. T. Q. et al. (2018). Neural Ordinary Differential Equations. NeurIPS 2018.
  8. Chen, T. & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD 2016.
  9. Cho, K. et al. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014.
  10. Ester, M. et al. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996.
  11. Golovin, D. et al. (2017). Google Vizier: A Service for Black-Box Optimization. KDD 2017.
  12. Goodfellow, I. et al. (2014). Generative Adversarial Nets. NeurIPS 2014.
  13. Goodfellow, I. et al. (2016). Deep Learning. MIT Press.
  14. Ho, J. et al. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
  15. Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735-1780.
  16. Ke, G. et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS 2017.
  17. Kingma, D. P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015.
  18. Kingma, D. P. & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR 2014.
  19. Kotelnikov, A. et al. (2023). TabDDPM: Modelling Tabular Data with Diffusion Models. ICML 2023.
  20. Lim, B. et al. (2021). Temporal Fusion Transformers for Interpretable Multi-Horizon Time Series Forecasting. International Journal of Forecasting.
  21. Liu, F. T. et al. (2008). Isolation Forest. ICDM 2008.
  22. McInnes, L. et al. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
  23. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
  24. Prokhorenkova, L. et al. (2018). CatBoost: Unbiased Boosting with Categorical Features. NeurIPS 2018.
  25. Schölkopf, B. et al. (2001). Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7).
  26. van der Maaten, L. & Hinton, G. (2008). Visualizing Data using t-SNE. JMLR, 9, 2579-2605.
  27. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017.
  28. Xu, L. et al. (2019). Modeling Tabular Data using Conditional GAN. NeurIPS 2019.