
This is the starting point for a four-article series analyzing 178 wines (3 cultivars, 13 chemicals) with Andrews curves and functional data analysis.

| Article | What It Does | Outcome |
|---|---|---|
| Why Andrews Curves? (this article) | Transform 13 chemicals into curves; verify distance preservation | Each wine becomes a visual fingerprint; $L^2$ distances equal $\sqrt{\pi}\times$ Euclidean, so nothing is lost |
| Outlier Detection | Depth, outliergram, MS-plot | 9 anomalies classified by type (mislabel, soil anomaly, or concentration) with corrective actions |
| Clustering & Variable Importance | K-means, fuzzy c-means, permutation test, FPCA | Cultivar recovery at 96% accuracy; top 5 chemicals identified for cost reduction |
| Quality Control | Functional boxplots, depth rankings, tolerance bands | Monitoring system that checks new batches against a validated specification in one chart |

The Problem with Tables

A quality-control analyst reviews 178 wines, each tested for 13 chemical properties — that’s 2,314 numbers. The questions are simple: Any anomalous wines? Do the three cultivars (Barolo, Grignolino, Barbera) look chemically distinct? Which chemicals matter most?

The business needs are concrete:

  1. Detect production anomalies before a defective batch ships
  2. Validate cultivar identity for denomination-of-origin certification
  3. Reduce testing costs by identifying redundant assays
  4. Establish ongoing monitoring against a validated reference profile

Standard practice uses separate tools for each (spreadsheet, MANOVA, PCA, per-variable control charts), each with its own distance metric and assumptions. Nothing connects them.

The alternative: transform all 13 chemicals into Andrews curves and apply a unified pipeline of functional data analysis methods. Every step operates on the same fdata object with the same $L^2$ distance semantics.

# UCI Wine dataset: 178 wines, 3 cultivars, 13 chemical measurements
wine_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
col_names <- c("Cultivar", "Alcohol", "MalicAcid", "Ash", "Alkalinity",
               "Magnesium", "Phenols", "Flavanoids", "NonflavPhenols",
               "Proanthocyanins", "ColorIntensity", "Hue",
               "OD280_OD315", "Proline")
wine <- read.csv(wine_url, header = FALSE, col.names = col_names)

# Use real cultivar names
cultivar <- factor(wine$Cultivar,
                   levels = 1:3,
                   labels = c("Barolo", "Grignolino", "Barbera"))

# Standardize the 13 chemical variables
X <- scale(as.matrix(wine[, -1]))
chem_names <- colnames(wine)[-1]

cat(nrow(wine), "wines,", ncol(X), "chemicals,",
    nlevels(cultivar), "cultivars\n")
#> 178 wines, 13 chemicals, 3 cultivars

library(dplyr)    # mutate()
library(ggplot2)  # used for all plots below

df_box <- data.frame(X) |>
  mutate(Cultivar = cultivar) |>
  tidyr::pivot_longer(-Cultivar, names_to = "Chemical", values_to = "Value")

ggplot(df_box, aes(x = Cultivar, y = Value, fill = Cultivar)) +
  geom_boxplot(alpha = 0.7, outlier.size = 0.8) +
  facet_wrap(~ Chemical, scales = "free_y", ncol = 4) +
  scale_fill_manual(values = c("Barolo" = "#8B0000",
                                "Grignolino" = "#DAA520",
                                "Barbera" = "#2E8B57")) +
  labs(x = NULL, y = "Standardized Value") +
  theme(legend.position = "bottom",
        strip.text = element_text(size = 9))

Some variables (Flavanoids, Proline, Color Intensity) clearly separate cultivars. Others (Ash, Magnesium) overlap heavily. But you can’t see a wine here — you see 13 disconnected box-and-whisker slices. Every decision about blending, pricing, or fraud detection requires a holistic view of each wine’s chemical fingerprint. That’s where Andrews curves come in.

Turning Rows into Curves

The Andrews transformation (Andrews, 1972) maps each $p$-dimensional observation $\mathbf{x} = (x_1, \ldots, x_p)$ to a curve on $[-\pi, \pi]$ using a Fourier expansion:

$$f_{\mathbf{x}}(t) = \frac{x_1}{\sqrt{2}} + x_2\sin(t) + x_3\cos(t) + x_4\sin(2t) + x_5\cos(2t) + \cdots$$

The key guarantee: the $L^2$ distance between two curves equals $\sqrt{\pi}$ times the Euclidean distance between the original vectors. Nothing is lost. Nothing is distorted.
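Under the hood the transform is just a matrix product: evaluate the Fourier basis on a grid and multiply the data matrix by it. The sketch below is a minimal standalone version written for illustration; the andrews_transform() helper used in this article additionally wraps the result as an fdata object.

```r
# Minimal Andrews transform: each p-dimensional row becomes a curve on [-pi, pi].
# Illustrative sketch only, not the package implementation.
andrews_basis <- function(p, t) {
  B <- matrix(0, nrow = length(t), ncol = p)
  B[, 1] <- 1 / sqrt(2)                 # constant term x1 / sqrt(2)
  for (k in 2:p) {
    h <- ceiling((k - 1) / 2)           # harmonic index: 1, 1, 2, 2, 3, ...
    B[, k] <- if (k %% 2 == 0) sin(h * t) else cos(h * t)
  }
  B
}

# Tiny example: unit vectors map to the basis functions themselves
t_grid <- seq(-pi, pi, length.out = 200)
Z <- rbind(c(1, 0, 0, 0),               # -> constant curve 1/sqrt(2)
           c(0, 1, 0, 0))               # -> sin(t)
curves <- Z %*% t(andrews_basis(ncol(Z), t_grid))
```

Applying the same product to the standardized wine matrix X yields the 178 x 200 matrix of curve values plotted below.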

fd_wine <- andrews_transform(X)
cat("Andrews curves:", nrow(fd_wine$data), "observations,",
    ncol(fd_wine$data), "grid points\n")
#> Andrews curves: 178 observations, 200 grid points

n <- nrow(fd_wine$data)
m <- ncol(fd_wine$data)
t_grid <- fd_wine$argvals

df_curves <- data.frame(
  t = rep(t_grid, n),
  value = as.vector(t(fd_wine$data)),
  curve = rep(1:n, each = m),
  Cultivar = rep(cultivar, each = m)
)

ggplot(df_curves, aes(x = t, y = value, group = curve, color = Cultivar)) +
  geom_line(alpha = 0.35, linewidth = 0.4) +
  scale_color_manual(values = c("Barolo" = "#8B0000",
                                 "Grignolino" = "#DAA520",
                                 "Barbera" = "#2E8B57")) +
  labs(title = "Andrews Curves of Wine Chemical Profiles",
       x = expression(t), y = expression(f[x](t))) +
  theme(legend.position = "bottom")

Now each wine has a signature — a single visual object that encodes all 13 measurements. A quality manager can glance at a curve and compare it to the cultivar’s typical profile. Wines that “look different” are immediately suspicious. This is the multivariate equivalent of a chromatography trace — one picture captures the whole chemical identity.

Proving the Bridge Works

Pretty pictures are nice, but for regulatory or audit contexts, we need a mathematical guarantee. The Andrews distance preservation theorem says:

$$\|f_{\mathbf{x}} - f_{\mathbf{y}}\|_{L^2} = \sqrt{\pi} \cdot \|\mathbf{x} - \mathbf{y}\|_2$$

Let's verify this on all $\binom{178}{2} = 15{,}753$ pairwise distances.

# Compute pairwise distances in both domains
dist_andrews <- metric.lp(fd_wine)      # L2 distances via fda.usc's metric.lp()
dist_euclid  <- as.matrix(dist(X))      # Euclidean distances on the raw matrix

# Extract upper triangle
idx_upper <- upper.tri(dist_andrews)
d_a <- dist_andrews[idx_upper]
d_e <- dist_euclid[idx_upper]

# Filter zero-distance pairs (if any duplicates)
nonzero <- d_e > 1e-10
ratio <- d_a[nonzero] / d_e[nonzero]

cat(sprintf("Distance ratio (Andrews / Euclidean):\n"))
#> Distance ratio (Andrews / Euclidean):
cat(sprintf("  Mean:     %.4f\n", mean(ratio)))
#>   Mean:     1.7725
cat(sprintf("  Median:   %.4f\n", median(ratio)))
#>   Median:   1.7725
cat(sprintf("  SD:       %.2e\n", sd(ratio)))
#>   SD:       3.97e-16
cat(sprintf("  sqrt(pi): %.4f\n", sqrt(pi)))
#>   sqrt(pi): 1.7725

df_dist <- data.frame(euclidean = d_e[nonzero], andrews = d_a[nonzero])

ggplot(df_dist, aes(x = euclidean, y = andrews)) +
  geom_point(alpha = 0.08, size = 0.5, color = "steelblue") +
  geom_abline(slope = sqrt(pi), intercept = 0, color = "red", linewidth = 1) +
  labs(title = "Andrews L² Distance vs Euclidean Distance",
       subtitle = sprintf("Red line: slope = √π ≈ %.4f", sqrt(pi)),
       x = "Euclidean Distance (standardized data)",
       y = "Andrews L² Distance") +
  coord_equal()

The ratio is $\sqrt{\pi} \approx 1.7725$ to machine precision. Any conclusion drawn from the curves (outliers, clusters, similarities) translates exactly back to the original 13 chemical measurements. For regulatory or audit contexts this guarantee is essential: you can report findings in chemical units, not abstract curve distances.
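The identity can also be reproduced from scratch, with no FDA package at all: evaluate two Andrews curves directly from the definition and approximate the $L^2$ norm by numerical integration. This is a self-contained sketch; the inline curve evaluator is written here for illustration, not taken from the package.

```r
# Standalone check of ||f_x - f_y||_{L2} = sqrt(pi) * ||x - y||_2
andrews_curve <- function(x, t) {
  f <- rep(x[1] / sqrt(2), length(t))
  for (k in 2:length(x)) {
    h <- ceiling((k - 1) / 2)           # harmonic index: 1, 1, 2, 2, ...
    f <- f + x[k] * (if (k %% 2 == 0) sin(h * t) else cos(h * t))
  }
  f
}

set.seed(42)
x <- rnorm(13); y <- rnorm(13)          # two random 13-chemical "wines"
t <- seq(-pi, pi, length.out = 2001)
d2 <- (andrews_curve(x, t) - andrews_curve(y, t))^2
# Trapezoidal rule; essentially exact here because the integrand is a
# trigonometric polynomial sampled over a full period
l2 <- sqrt(sum((d2[-1] + d2[-length(d2)]) / 2) * (t[2] - t[1]))
l2 / sqrt(sum((x - y)^2))               # sqrt(pi), about 1.7725
```

The grid resolution barely matters: the trapezoidal rule converges spectrally for periodic integrands, which is why even the 200-point grid above hits machine precision.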

What This Means for Analysis

Because the transformation is isometric, distance-based methods give numerically equivalent results whether you apply them to the curves or to the raw 13-column matrix. FPCA on Andrews curves and prcomp() on the standardized data produce scores that correlate at $\pm 1$. K-means recovers the same clusters. If all you need is PCA + clustering on a clean matrix, prcomp() and kmeans() are simpler and sufficient.
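That equivalence is easy to demonstrate on synthetic data: run prcomp() on a random matrix and on its discretized Andrews curves, then compare first-component scores. A sketch under one assumption worth noting: the grid omits the duplicated endpoint ($-\pi$ and $\pi$ are the same point of the period), which makes the discrete basis exactly orthogonal.

```r
set.seed(7)
n <- 50; p <- 5
A <- matrix(rnorm(n * p), n, p)         # synthetic standardized data

# Andrews basis on a periodic grid (endpoint pi dropped to avoid duplication)
m <- 256
t <- seq(-pi, pi, length.out = m + 1)[-(m + 1)]
B <- matrix(0, m, p)
B[, 1] <- 1 / sqrt(2)
for (k in 2:p) {
  h <- ceiling((k - 1) / 2)
  B[, k] <- if (k %% 2 == 0) sin(h * t) else cos(h * t)
}

curves <- A %*% t(B)                    # n x m matrix of sampled curves
pc_raw    <- prcomp(A)$x[, 1]           # PC1 scores on the raw matrix
pc_curves <- prcomp(curves)$x[, 1]      # PC1 scores on the curves
abs(cor(pc_raw, pc_curves))             # 1 up to sign and rounding
```

Since the basis columns are orthogonal with equal norms, curve-space scores are exactly a constant multiple of the multivariate ones, so the correlation is $\pm 1$ rather than merely close to it.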

The value of the functional representation is in the tools that have no direct multivariate equivalent:

| FDA Method | Classical Equivalent | What FDA Adds |
|---|---|---|
| outliers.depth.pond() | Mahalanobis distance | Same core idea, but combined with outliergram() and magnitudeshape() you can classify outliers by type (magnitude vs shape), something Mahalanobis alone cannot do |
| cluster.kmeans() | kmeans() on raw data | Same clusters, but centroids are curves you can plot and overlay: a visual fingerprint, not 13 numbers |
| fdata2pc() | prcomp() | Same variance decomposition; eigenfunctions are the visual version of a loading table |
| boxplot() | 13 separate control charts | No equivalent. One chart monitors all 13 chemicals simultaneously via functional depth |
| tolerance.band() | Multivariate tolerance region | No equivalent. Defines a nonparametric envelope in function space; new wines are checked against a single band |
| group.test() | MANOVA | Nonparametric permutation test; bootstrap CIs show where on the curve profiles diverge |

The bottom line: Andrews curves earn their keep when you use the functional toolbox — depth-based outlier classification, functional boxplots, tolerance bands, simultaneous monitoring. For methods where classical statistics already gives the same answer (PCA, k-means), the functional version adds visualization and pipeline integration, but not new statistical information.

Next Steps

The remaining articles apply FDA methods to these Andrews curves: depth-based outlier detection, clustering and variable importance, and quality-control monitoring.

References

  • Andrews, D.F. (1972). Plots of high-dimensional data. Biometrics, 28(1), 125–136.