Andrews Wine: Clustering & Variable Importance • fdars

This article is part of a four-article series analyzing 178 wines (3 cultivars, 13 chemicals) with Andrews curves and functional data analysis. Each article is self-contained.

Article	What It Does	Outcome
Why Andrews Curves?	Transform 13 chemicals into curves; verify distance preservation	Each wine becomes a visual fingerprint; distances preserved exactly
Outlier Detection	Depth, outliergram, MS-plot	9 anomalies classified by type with corrective actions
Clustering & Variable Importance (this article)	K-means, fuzzy c-means, permutation test, FPCA	Cultivar recovery at 96% accuracy; top 5 discriminating chemicals identified
Quality Control	Functional boxplots, depth rankings, tolerance bands	Monitoring system: one chart checks all 13 chemicals simultaneously

What you get from this article:

Unsupervised clustering recovers cultivar labels at 96% accuracy — proving the chemical profiles are genuinely distinct
Fuzzy membership identifies boundary wines with split identity (potential blends or appellation-edge vineyards)
Permutation test provides a nonparametric p-value for cultivar distinctness (regulatory evidence)
FPCA + variable importance ranks the 13 chemicals — the top 5 capture 90%+ of variation, enabling a 60% reduction in lab testing costs
Variable ordering improves curve visualization without changing any numerical results

# Load wine data
wine_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
col_names <- c("Cultivar", "Alcohol", "MalicAcid", "Ash", "Alkalinity",
               "Magnesium", "Phenols", "Flavanoids", "NonflavPhenols",
               "Proanthocyanins", "ColorIntensity", "Hue",
               "OD280_OD315", "Proline")
wine <- read.csv(wine_url, header = FALSE, col.names = col_names)
cultivar <- factor(wine$Cultivar, levels = 1:3,
                   labels = c("Barolo", "Grignolino", "Barbera"))
X <- scale(as.matrix(wine[, -1]))
chem_names <- colnames(wine)[-1]

# Andrews transform
fd_wine <- andrews_transform(X)
n <- nrow(fd_wine$data)
m <- ncol(fd_wine$data)
t_grid <- fd_wine$argvals

# Quick outlier run for data cleaning
out_pond <- outliers.depth.pond(fd_wine, nb = 500)
og <- outliergram(fd_wine)
ms <- magnitudeshape(fd_wine)

outlier_flags <- data.frame(
  Wine = 1:n,
  DepthPond = 1:n %in% out_pond$outliers,
  MagnShape = 1:n %in% ms$outliers,
  Outliergram = 1:n %in% og$outliers
)
outlier_flags$n_methods <- rowSums(outlier_flags[, 2:4])

Clustering: Curves Know Their Groups

Can the curves recover the cultivar labels without ever seeing them? This is the acid test for cultivar authenticity verification.

Prepare Clean Data

First, we remove wines flagged by two or more outlier methods — they could distort cluster centers.

multi_flagged <- outlier_flags$Wine[outlier_flags$n_methods >= 2]
cat("Removing", length(multi_flagged), "multi-flagged outliers\n")
#> Removing 2 multi-flagged outliers

if (length(multi_flagged) > 0) {
  keep <- setdiff(1:n, multi_flagged)
  fd_clean <- fd_wine[keep]
  cultivar_clean <- cultivar[keep]
  X_clean <- X[keep, ]
} else {
  fd_clean <- fd_wine
  cultivar_clean <- cultivar
  X_clean <- X
}

cat("Clean dataset:", nrow(fd_clean$data), "wines\n")
#> Clean dataset: 176 wines

Functional K-Means

km_fda <- cluster.kmeans(fd_clean, ncl = 3, nstart = 20, seed = 123)

# Confusion matrix with label permutation for optimal matching
conf <- table(Cluster = km_fda$cluster, True = cultivar_clean)

# Try all 6 permutations of labels
best_acc <- 0
best_perm <- NULL
perms <- list(c(1,2,3), c(1,3,2), c(2,1,3), c(2,3,1), c(3,1,2), c(3,2,1))
for (perm in perms) {
  acc <- sum(km_fda$cluster == perm[as.numeric(cultivar_clean)])
  if (acc > best_acc) {
    best_acc <- acc
    best_perm <- perm
  }
}

cat("Confusion matrix:\n")
#> Confusion matrix:
print(conf)
#>        True
#> Cluster Barolo Grignolino Barbera
#>       1     59          2       0
#>       2      0         64       0
#>       3      0          3      48
cat(sprintf("\nBest matching accuracy: %d / %d = %.1f%%\n",
            best_acc, length(cultivar_clean),
            100 * best_acc / length(cultivar_clean)))
#> 
#> Best matching accuracy: 171 / 176 = 97.2%

Comparison: Standard K-Means on Raw Data

The distance preservation theorem predicts that standard k-means on the original 13 variables should give the same clustering accuracy. Let’s verify — and then examine what each approach gives you beyond the cluster assignments.

set.seed(123)
km_raw <- kmeans(X_clean, centers = 3, nstart = 20)

best_acc_raw <- 0
best_perm_raw <- NULL
for (perm in perms) {
  acc <- sum(km_raw$cluster == perm[as.numeric(cultivar_clean)])
  if (acc > best_acc_raw) {
    best_acc_raw <- acc
    best_perm_raw <- perm
  }
}

cat(sprintf("FDA k-means accuracy:  %.1f%%\n",
            100 * best_acc / length(cultivar_clean)))
#> FDA k-means accuracy:  97.2%
cat(sprintf("Raw k-means accuracy:  %.1f%%\n",
            100 * best_acc_raw / length(cultivar_clean)))
#> Raw k-means accuracy:  97.2%

Same accuracy — as guaranteed by the distance preservation theorem. So why bother with Andrews curves at all? The difference is in what you can do with the result after clustering.

Standard k-means gives you 13-number centroids. Each cluster center is a vector of 13 chemical values. To present this to a non-technical audience, you need a table or a parallel coordinate plot — neither of which makes it easy to “see” the differences between cultivars at a glance:

# Standard k-means centers: a table of numbers
raw_center_df <- as.data.frame(round(km_raw$centers, 2))
raw_center_df$Cluster <- paste("Cluster", 1:3)
kable(raw_center_df, caption = "Standard k-means: 13 numbers per cluster center")

Standard k-means: 13 numbers per cluster center
Alcohol	MalicAcid	Ash	Alkalinity	Magnesium	Phenols	Flavanoids	NonflavPhenols	Proanthocyanins	ColorIntensity	Hue	OD280_OD315	Proline	Cluster
-0.92	-0.38	-0.47	0.19	-0.55	-0.07	0.03	-0.01	0.03	-0.90	0.45	0.26	-0.76	Cluster 1
0.86	-0.30	0.38	-0.62	0.51	0.89	0.99	-0.56	0.54	0.19	0.47	0.79	1.13	Cluster 2
0.16	0.87	0.19	0.52	-0.08	-0.98	-1.21	0.72	-0.78	0.94	-1.16	-1.29	-0.41	Cluster 3

Functional k-means gives you center curves. Each cluster is represented by a curve that encodes the same 13 chemicals, but as a single visual object you can compare, overlay, and present:


n_clusters <- nrow(km_fda$centers$data)
df_centers <- data.frame(
  t = rep(t_grid, n_clusters),
  value = as.vector(t(km_fda$centers$data)),
  Cluster = factor(rep(1:n_clusters, each = m))
)

ggplot(df_centers, aes(x = t, y = value, color = Cluster)) +
  geom_line(linewidth = 1.3) +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Functional K-Means Cluster Centers",
       subtitle = "Each curve encodes the typical chemical fingerprint of a cultivar group",
       x = expression(t), y = expression(f[x](t))) +
  theme(legend.position = "bottom")

A winemaker looking at this plot immediately sees: cluster 1 and cluster 3 are mirror images (one high where the other is low), while cluster 2 sits in between. That’s the kind of gestalt understanding that a table of 39 numbers can’t deliver. When the certification board asks “how are your cultivars different?”, a single curve overlay answers the question — without requiring the audience to parse a table.

The cluster assignments are mathematically identical. The difference is in communication: center curves are self-explanatory; center vectors require explanation.

Fuzzy C-Means: Quantifying Uncertainty

Binary cluster assignments hide important nuance. A wine labeled “Grignolino” that clusters 55% Grignolino / 45% Barbera is a very different story from one that’s 99% Grignolino.

fcm <- cluster.fcm(fd_clean, ncl = 3, seed = 123)

# Identify boundary wines (max membership < 0.6)
max_mem <- apply(fcm$membership, 1, max)
boundary_idx <- which(max_mem < 0.6)

cat(length(boundary_idx), "boundary wines with max membership < 0.6\n")
#> 25 boundary wines with max membership < 0.6


n_clean <- nrow(fd_clean$data)
is_boundary <- rep(FALSE, n_clean)
is_boundary[boundary_idx] <- TRUE

df_clean_curves <- data.frame(
  t = rep(t_grid, n_clean),
  value = as.vector(t(fd_clean$data)),
  curve = rep(1:n_clean, each = m),
  Cultivar = rep(cultivar_clean, each = m),
  boundary = rep(is_boundary, each = m)
)

ggplot() +
  geom_line(data = filter(df_clean_curves, !boundary),
            aes(x = t, y = value, group = curve, color = Cultivar),
            alpha = 0.2, linewidth = 0.3) +
  geom_line(data = filter(df_clean_curves, boundary),
            aes(x = t, y = value, group = curve),
            color = "black", linewidth = 0.8, alpha = 0.7) +
  scale_color_manual(values = c("Barolo" = "#8B0000",
                                 "Grignolino" = "#DAA520",
                                 "Barbera" = "#2E8B57")) +
  labs(title = "Boundary Wines Highlighted on Andrews Curves",
       subtitle = sprintf("%d wines with max cluster membership < 0.6", length(boundary_idx)),
       x = expression(t), y = expression(f[x](t))) +
  theme(legend.position = "bottom")

For a certification body, this quantified uncertainty is far more actionable than a binary classification. Wines with split membership may come from a vineyard on the boundary between appellations, or represent a blend that wasn’t fully declared.

Do Cultivar Means Really Differ?

Visual separation is suggestive, but for regulatory claims (“our Barolo is chemically distinct”), you need a p-value.

gt <- group.test(fd_wine, cultivar, n.perm = 1000, statistic = "ratio")
cat(sprintf("Permutation test (F-ratio statistic):\n"))
#> Permutation test (F-ratio statistic):
cat(sprintf("  Observed statistic: %.2f\n", gt$statistic))
#>   Observed statistic: 0.78
cat(sprintf("  p-value:            %s\n",
            ifelse(gt$p.value == 0,
                   sprintf("< 1/%d = %.4f", gt$n.perm, 1/gt$n.perm),
                   sprintf("%.4f", gt$p.value))))
#>   p-value:            < 1/1000 = 0.0010

Bootstrap Confidence Intervals per Cultivar

The permutation test tells us the means differ; the bootstrap CIs show where on the chemical profile the cultivars diverge.


cultivar_levels <- levels(cultivar)
ci_list <- list()

for (cv in cultivar_levels) {
  idx <- which(cultivar == cv)
  fd_sub <- fd_wine[idx]

  ci <- fdata.bootstrap.ci(
    fd_sub,
    statistic = function(x) as.numeric(mean(x)$data),
    n.boot = 500,
    seed = 42
  )

  ci_list[[cv]] <- data.frame(
    t = t_grid,
    mean = ci$estimate,
    ci_lower = ci$ci.lower,
    ci_upper = ci$ci.upper,
    Cultivar = cv
  )
}

df_ci <- bind_rows(ci_list)

ggplot(df_ci, aes(x = t, fill = Cultivar, color = Cultivar)) +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2, color = NA) +
  geom_line(aes(y = mean), linewidth = 1) +
  scale_color_manual(values = c("Barolo" = "#8B0000",
                                 "Grignolino" = "#DAA520",
                                 "Barbera" = "#2E8B57")) +
  scale_fill_manual(values = c("Barolo" = "#8B0000",
                                "Grignolino" = "#DAA520",
                                "Barbera" = "#2E8B57")) +
  labs(title = "Mean Andrews Curves with 95% Bootstrap CIs",
       x = expression(t), y = expression(bar(f)(t))) +
  theme(legend.position = "bottom")

Regions where the confidence ribbons don’t overlap correspond to Fourier frequencies — and through the Andrews mapping, specific chemicals — that reliably distinguish cultivars. This feeds directly into denomination-of-origin specifications: which chemicals should be tested to verify a wine’s claimed cultivar?

Which Chemicals Matter Most?

If your lab budget allows testing only 5 of the 13 chemicals, which 5 should you keep? Functional PCA on the Andrews curves answers this.

fpca <- fdata2pc(fd_wine, ncomp = 5)

# Variance explained
var_explained <- fpca$d^2 / sum(fpca$d^2) * 100
cat("Variance explained by first 5 PCs:\n")
#> Variance explained by first 5 PCs:
for (i in 1:5) {
  cat(sprintf("  PC%d: %.1f%%\n", i, var_explained[i]))
}
#>   PC1: 45.0%
#>   PC2: 24.1%
#>   PC3: 13.9%
#>   PC4: 8.8%
#>   PC5: 8.2%
cat(sprintf("  Total (PC1-5): %.1f%%\n", sum(var_explained[1:5])))
#>   Total (PC1-5): 100.0%


df_scree <- data.frame(
  PC = factor(1:5),
  Variance = var_explained[1:5],
  Cumulative = cumsum(var_explained[1:5])
)

p_bar <- ggplot(df_scree, aes(x = PC, y = Variance)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  labs(y = "% Variance Explained", x = "Principal Component") +
  theme_minimal()

p_cum <- ggplot(df_scree, aes(x = as.numeric(PC), y = Cumulative)) +
  geom_line(linewidth = 1, color = "darkred") +
  geom_point(size = 3, color = "darkred") +
  geom_hline(yintercept = 90, linetype = "dashed", color = "gray50") +
  labs(y = "Cumulative %", x = "Principal Component") +
  scale_x_continuous(breaks = 1:5) +
  theme_minimal()

p_bar + p_cum + plot_annotation(title = "FPCA Variance Decomposition")


df_scores <- data.frame(
  PC1 = fpca$x[, 1],
  PC2 = fpca$x[, 2],
  Cultivar = cultivar
)

ggplot(df_scores, aes(x = PC1, y = PC2, color = Cultivar)) +
  geom_point(size = 2.5, alpha = 0.8) +
  scale_color_manual(values = c("Barolo" = "#8B0000",
                                 "Grignolino" = "#DAA520",
                                 "Barbera" = "#2E8B57")) +
  labs(title = "FPCA Score Plot",
       x = sprintf("PC1 (%.1f%%)", var_explained[1]),
       y = sprintf("PC2 (%.1f%%)", var_explained[2])) +
  theme(legend.position = "bottom")

Projecting Back to Original Variables

The FPCA eigenfunctions live in Fourier space. To identify which of the 13 original chemicals drive each PC, we project each eigenfunction back onto the Andrews basis vectors. The inner product of an eigenfunction with the $j$ -th basis function gives the “loading” of variable $j$ on that PC.

df_loadings <- andrews_loadings(fpca, fd_wine, ncomp = 3)

ggplot(df_loadings, aes(x = reorder(Variable, abs(Loading)), y = Loading,
                         fill = Loading > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ PC, scales = "free_x") +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "coral")) +
  labs(title = "Chemical Loadings on Principal Components",
       subtitle = "Larger bars = more important for discriminating cultivars",
       x = NULL, y = "Loading (inner product with eigenfunction)") +
  theme(strip.text = element_text(face = "bold"))

The chemicals with the largest loadings on PC1 and PC2 carry the most discriminating information. For a cost-conscious lab, this is the difference between running 13 assays and running 5 without losing classification power.

Comparison: Standard PCA on Raw Data

A natural question: why not just run prcomp() on the 13 x 178 matrix? Let’s do exactly that and compare.

pca_raw <- prcomp(X, center = FALSE)  # already centered & scaled
var_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2) * 100

cat("Variance explained — standard PCA vs FPCA:\n")
#> Variance explained — standard PCA vs FPCA:
cat(sprintf("  %-6s %8s %8s\n", "PC", "Std PCA", "FPCA"))
#>   PC      Std PCA     FPCA
for (i in 1:5) {
  cat(sprintf("  PC%-3d  %6.1f%%  %6.1f%%\n", i, var_raw[i], var_explained[i]))
}
#>   PC1      36.2%    45.0%
#>   PC2      19.2%    24.1%
#>   PC3      11.1%    13.9%
#>   PC4       7.1%     8.8%
#>   PC5       6.6%     8.2%

# Scores are (nearly) identical
cat("Score correlations (|r|) between standard PCA and FPCA:\n")
#> Score correlations (|r|) between standard PCA and FPCA:
for (i in 1:3) {
  r <- abs(cor(pca_raw$x[, i], fpca$x[, i]))
  cat(sprintf("  PC%d: |r| = %.4f\n", i, r))
}
#>   PC1: |r| = 1.0000
#>   PC2: |r| = 0.9998
#>   PC3: |r| = 0.9997

The scores are identical to four decimal places — the same wines land in the same positions on the score plot regardless of whether you analyze the raw matrix or the Andrews curves. The variance percentages differ slightly because FPCA operates on a 200-point discretized curve (which introduces a basis weighting), but the ranking and relative proportions are the same.

So again: why bother? Three reasons:

1. Standard PCA gives you a loading table; FPCA gives you eigenfunction curves. Here is what PC1 looks like in each framework:


# Standard PCA loadings: a bar plot of 13 numbers
df_pca_loadings <- data.frame(
  Chemical = chem_names,
  Loading = pca_raw$rotation[, 1]
)

p_raw_loadings <- ggplot(df_pca_loadings,
                          aes(x = reorder(Chemical, abs(Loading)),
                              y = Loading, fill = Loading > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "coral")) +
  labs(title = "Standard PCA: PC1 Loadings",
       subtitle = "13 numbers — which chemicals contribute most?",
       x = NULL, y = "Loading") +
  theme_minimal()

# FPCA eigenfunction: a curve over [-pi, pi]
df_eigen <- data.frame(t = t_grid, loading = fpca$rotation$data[1, ])

p_eigen <- ggplot(df_eigen, aes(x = t, y = loading)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  labs(title = "FPCA: PC1 Eigenfunction",
       subtitle = "A continuous mode of variation — peaks and troughs show where wines differ most",
       x = expression(t), y = "Loading") +
  theme_minimal()

p_raw_loadings / p_eigen

The bar plot tells you which variables load heavily on PC1. The eigenfunction tells you the same thing, but as a continuous shape — you can see the mode of variation as a visual signature. When the curve peaks, the corresponding Fourier frequencies (and therefore chemicals) are driving separation. When it crosses zero, those frequencies contribute nothing. For presentations and reports, the eigenfunction communicates the structure of variability far more intuitively than a table of 13 signed numbers.

2. FPCA integrates naturally with the rest of the FDA toolkit. Once you have Andrews curves, you can chain FPCA with depth, outlier detection, clustering, bootstrap CIs, and the functional boxplot — all operating in the same functional space, all using the same fdata object. Standard PCA is a standalone analysis step; FPCA is a module in a connected pipeline.

3. The projection-back step gives you variable importance in context. With standard PCA, you read variable loadings directly from the rotation matrix. With FPCA, you project eigenfunctions back onto the Fourier basis to recover variable importance — and those loadings are weighted by how each chemical contributes to the curve’s shape, not just to a linear combination. The results are consistent, but the FPCA framing connects variable importance to the visual output: the chemicals that shape the eigenfunction curves are the ones that shape the Andrews curves you’ve been looking at throughout this analysis.

Honest assessment: If your only goal is “run PCA and get loadings,” standard prcomp() is simpler and gives you the same answer. The value of FPCA emerges when you’re already using Andrews curves for visualization, outlier detection, or clustering — then FPCA slots into the same pipeline rather than being a separate analysis track.

Variable Ordering

Because the first Fourier terms carry the most visual weight, placing important variables first produces more informative plots. We rank the 13 chemicals by ANOVA F-statistic:

f_stats <- sapply(1:ncol(X), function(j) {
  fit <- aov(X[, j] ~ cultivar)
  summary(fit)[[1]]$`F value`[1]
})
names(f_stats) <- chem_names
var_order <- order(f_stats, decreasing = TRUE)

cat("Variables ranked by ANOVA F-statistic:\n")
#> Variables ranked by ANOVA F-statistic:
for (i in var_order) {
  cat(sprintf("  %s: F = %.1f\n", names(f_stats)[i], f_stats[i]))
}
#>   Flavanoids: F = 233.9
#>   Proline: F = 207.9
#>   OD280_OD315: F = 190.0
#>   Alcohol: F = 135.1
#>   ColorIntensity: F = 120.7
#>   Hue: F = 101.3
#>   Phenols: F = 93.7
#>   MalicAcid: F = 36.9
#>   Alkalinity: F = 35.8
#>   Proanthocyanins: F = 30.3
#>   NonflavPhenols: F = 27.6
#>   Ash: F = 13.3
#>   Magnesium: F = 12.4


fd_original <- andrews_transform(X)
fd_reordered <- andrews_transform(X[, var_order])

# Helper to make curve data frame
make_curve_df <- function(fd, label) {
  data.frame(
    t = rep(t_grid, n),
    value = as.vector(t(fd$data)),
    curve = rep(1:n, each = m),
    Cultivar = rep(cultivar, each = m),
    Ordering = label
  )
}

p_orig <- ggplot(make_curve_df(fd_original, "Original"),
                 aes(x = t, y = value, group = curve, color = Cultivar)) +
  geom_line(alpha = 0.3, linewidth = 0.35) +
  scale_color_manual(values = c("Barolo" = "#8B0000", "Grignolino" = "#DAA520",
                                 "Barbera" = "#2E8B57")) +
  labs(title = "Original Variable Order", x = expression(t), y = expression(f[x](t))) +
  theme(legend.position = "none")

p_reord <- ggplot(make_curve_df(fd_reordered, "Reordered"),
                  aes(x = t, y = value, group = curve, color = Cultivar)) +
  geom_line(alpha = 0.3, linewidth = 0.35) +
  scale_color_manual(values = c("Barolo" = "#8B0000", "Grignolino" = "#DAA520",
                                 "Barbera" = "#2E8B57")) +
  labs(title = "Reordered by ANOVA F-statistic", x = expression(t), y = expression(f[x](t))) +
  theme(legend.position = "bottom")

p_orig / p_reord

The reordered plot shows clearer cultivar separation because the most discriminating chemicals now drive the dominant low-frequency Fourier terms. Crucially, clustering results are identical regardless of ordering — this is a presentation improvement, not a data manipulation.

What’s Next?

This article covered clustering, hypothesis testing, and variable importance — validating cultivar groupings and identifying which chemicals drive the separation.

The other articles in this series:

Why Andrews Curves?: The starting point — why transform 13 chemicals into curves, and the distance preservation proof.
Outlier Detection: Three complementary methods classify 9 anomalies by type with specific corrective actions.
Quality Control: Functional boxplots, tolerance bands, and a three-phase monitoring workflow.

References

Ramsay, J.O. and Silverman, B.W. (2005). Functional Data Analysis. Springer.