Curse of Dimensionality

Today, I hope to present a quick glimpse of the phenomenon called the “Curse of Dimensionality”. For this demonstration, I simply calculate what fraction of standard normal data lies within a Euclidean distance of two standard deviations from the mean as we go from one dimension to higher dimensions.

Random Data

Here are 10 vectors of 100 random numbers each, sampled from the standard normal distribution and stored as a matrix with 100 rows and 10 columns …

X <- matrix(rnorm(1000), nrow = 100, ncol = 10)

… and as a data frame.

df <- data.frame(X)
head(df)
##           X1          X2          X3          X4         X5         X6
## 1  0.0256453  0.28398713  0.67057945  1.09584061  0.3824753  1.4061483
## 2 -0.6176034 -0.40257645  1.13756561 -0.25761710 -0.1488482 -0.1959629
## 3 -0.5235474 -0.68893524 -0.70737197  0.80125649  0.4926702 -1.9662689
## 4 -0.3737617  0.06833939 -0.02937106 -0.60665832  0.2656111  1.2102051
## 5 -0.9429225  0.05136859  1.89588703  0.30911255 -0.2143345 -0.2801334
## 6  0.4515558 -0.31945406 -2.42236506  0.06607036  0.3371893 -0.3928504
##             X7         X8          X9         X10
## 1  0.671601688 -1.9165923 -0.81464512  0.01496321
## 2 -0.005786507  0.2001263  1.59233921  0.53711798
## 3 -0.712297275  0.8505601 -2.58287100  1.04165643
## 4  0.892004771 -1.0628529  0.37102924  0.19434494
## 5 -1.365184181  0.3970239  0.08127754  2.02421067
## 6  0.746595767 -1.1335422 -0.54150029 -0.56877836

One Dimension

For normally distributed data, we expect about 95% of the data to fall within two standard deviations of the mean.

x1 <- X[,1]
within2sd <- abs(x1) <= 2
df1 <- data.frame(x1, within2sd)
mean(within2sd)
## [1] 0.96

In this example, 96 percent of the data in the first vector is within two standard deviations of the mean.
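For comparison, the exact probability for a standard normal variable can be computed with pnorm. The sketch below is an addition rather than part of the original analysis; the names p_exact and se are made up for illustration, and the standard error assumes 100 independent draws, as in the sample above.

```r
# Exact probability that a standard normal value lies within 2 of the mean
p_exact <- pnorm(2) - pnorm(-2)
p_exact

# Approximate standard error of a sample proportion based on n = 100 draws
n <- 100
se <- sqrt(p_exact * (1 - p_exact) / n)
se
```

The exact value is about 0.954 and the standard error is about 0.02, so an observed proportion of 0.96 is well within sampling noise.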

library(tidyverse)
df1 %>% 
  ggplot(aes(x = x1, y = 0, color = within2sd)) +
  geom_point() + 
  labs(title = "One Dimension of Normal Distribution Data")

Two Dimensions

However, when we go into two or more dimensions, the colloquial “95%” expectation starts to fade. For the calculations, the row_norms function in the slam package computes the Euclidean norm of each row by default. To aid visualization, we will use a helper function (found on Stack Overflow at https://stackoverflow.com/questions/6862742/draw-a-circle-with-ggplot2) to draw a circle of radius 2.

library(slam)
within2sd <- row_norms(X[,1:2]) <= 2
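As a cross-check that does not depend on slam, the Euclidean norm of each row can also be computed in base R as the square root of the row sums of squares. The helper euclid_row_norms and the matrix Z below are hypothetical names used only for this sketch.

```r
# Hypothetical helper: Euclidean norm of every row of a numeric matrix
euclid_row_norms <- function(M) sqrt(rowSums(M^2))

# Small example matrix whose row norms are easy to verify by hand
Z <- matrix(c(3, 4,
              0, 2,
              1, 1), ncol = 2, byrow = TRUE)

euclid_row_norms(Z)       # norms 5, 2 and sqrt(2)
euclid_row_norms(Z) <= 2  # FALSE, TRUE, TRUE
```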

gg_circle <- function(r = 1, xc = 0, yc = 0, color="black", fill=NA, ...) {
    x <- xc + r*cos(seq(0, pi, length.out=100))
    ymax <- yc + r*sin(seq(0, pi, length.out=100))
    ymin <- yc + r*sin(seq(0, -pi, length.out=100))
    annotate("ribbon", x=x, ymin=ymin, ymax=ymax, color=color, fill=fill, ...)
}

df2 <- data.frame(X[,1:2], within2sd)
df2 %>%
  ggplot(aes(x = X1, y = X2, color = within2sd)) +
  geom_point() + 
  gg_circle(r = 2, color = "red") +
  coord_fixed() + 
  labs(title = "Two Dimensions of Normal Distribution Data")

mean(within2sd)
## [1] 0.88

In this example, 88 percent of the points formed by the first two columns lie within a Euclidean distance of 2 (two standard deviations) from the origin.
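The lower proportion in two dimensions is not just sampling noise. A standard result is that the squared Euclidean norm of d independent standard normal coordinates follows a chi-squared distribution with d degrees of freedom, so the exact probability of landing inside the circle of radius 2 can be read off pchisq. The sketch below is an addition to the original analysis, and p_circle is an illustrative name.

```r
# P(X1^2 + X2^2 <= 2^2) for independent standard normal X1 and X2
p_circle <- pchisq(2^2, df = 2)
p_circle

# For d = 2 the same probability has the closed form 1 - exp(-r^2 / 2), here with r = 2
1 - exp(-2)
```

Both expressions give about 0.865, close to the proportion observed in this sample.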

Higher Dimensions

Plotting scatterplots in higher dimensions is much more complicated, but we can still perform the norm calculations pretty quickly.

N <- 10 #total number of dimensions
within2sd <- rep(0, N) #initialization

# one dimension
within2sd[1] <- mean(abs(x1) <= 2)

# higher dimensions
for(d in 2:N){
  within2sd[d] <- mean(row_norms(X[,1:d]) <= 2)
}

# plot
dimensions <- 1:N
df <- data.frame(dimensions, within2sd)
df %>%
  ggplot(aes(x = dimensions, y = within2sd)) +
  geom_col(fill = "blue") +
  scale_x_continuous("Dimensions", breaks = 1:N)
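The simulated proportions can be compared with exact values. Assuming independent standard normal coordinates, the squared norm in d dimensions is chi-squared with d degrees of freedom, so pchisq gives the share of points within radius 2, and qchisq gives the radius that would be needed to keep 95% of them. This is a sketch added for illustration; dims, p_theory, and r95 are names introduced here.

```r
N <- 10
dims <- 1:N

# Exact share of standard normal points within Euclidean distance 2 of the origin
p_theory <- pchisq(2^2, df = dims)

# Radius that would be needed to contain 95% of the points in each dimension
r95 <- sqrt(qchisq(0.95, df = dims))

round(data.frame(dims, p_theory, r95), 3)
```

Under this model the share inside radius 2 falls from about 0.95 in one dimension to about 0.05 in ten, while the radius needed to capture 95% of the points grows from about 1.96 to about 4.28.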
