# Curse of Dimensionality

Today, I hope to present a quick glimpse at the phenomenon called the “Curse of Dimensionality”. For this demonstration, I simply calculate what share of standard normal data stays within two standard deviations of the mean (a Euclidean distance of 2 from the origin) as we go from one dimension to higher dimensions.

### Random Data

Here are 10 vectors of 100 random numbers each, sampled from the standard normal distribution and stored as the columns of a 100 × 10 matrix …

X <- matrix(rnorm(1000), nrow = 100, ncol = 10)

… and as a data frame.

df <- data.frame(X)
head(df)
##           X1          X2          X3          X4         X5         X6
## 1  0.0256453  0.28398713  0.67057945  1.09584061  0.3824753  1.4061483
## 2 -0.6176034 -0.40257645  1.13756561 -0.25761710 -0.1488482 -0.1959629
## 3 -0.5235474 -0.68893524 -0.70737197  0.80125649  0.4926702 -1.9662689
## 4 -0.3737617  0.06833939 -0.02937106 -0.60665832  0.2656111  1.2102051
## 5 -0.9429225  0.05136859  1.89588703  0.30911255 -0.2143345 -0.2801334
## 6  0.4515558 -0.31945406 -2.42236506  0.06607036  0.3371893 -0.3928504
##             X7         X8          X9         X10
## 1  0.671601688 -1.9165923 -0.81464512  0.01496321
## 2 -0.005786507  0.2001263  1.59233921  0.53711798
## 3 -0.712297275  0.8505601 -2.58287100  1.04165643
## 4  0.892004771 -1.0628529  0.37102924  0.19434494
## 5 -1.365184181  0.3970239  0.08127754  2.02421067
## 6  0.746595767 -1.1335422 -0.54150029 -0.56877836

### One Dimension

For normally distributed data, we expect about 95% of the data to fall within two standard deviations of the mean.

x1 <- X[,1]
within2sd <- abs(x1) <= 2
df1 <- data.frame(x1, within2sd)
mean(within2sd)
## [1] 0.96

In this example, 96 percent of the data in the first vector is within two standard deviations of the mean.
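
The “about 95%” figure can be checked against the exact value. Here is a minimal sketch using base R's `pnorm` (the standard normal CDF); the name `p_exact` is my own and not part of the analysis above:

```r
# Exact probability that a standard normal value lies in [-2, 2]
p_exact <- pnorm(2) - pnorm(-2)
round(p_exact, 4)
## [1] 0.9545
```

With only 100 draws, an observed 96 percent is comfortably within sampling noise of this value.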

library(tidyverse)
df1 %>%
  ggplot(aes(x = x1, y = 0, color = within2sd)) +
  geom_point() +
  labs(title = "One Dimension of Normal Distribution Data")

### Two Dimensions

However, when we go into two or more dimensions, the colloquial “95%” expectation starts to fade. For the calculations, the row_norms function in the slam package computes the Euclidean norm of each row by default. To aid visualization, we will use a helper function (found on Stack Overflow at https://stackoverflow.com/questions/6862742/draw-a-circle-with-ggplot2) to draw a circle of radius 2.

library(slam)
within2sd <- row_norms(X[,1:2]) <= 2
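
For readers who want to see what the Euclidean row norm computes, it can be sketched in base R as the square root of the row sums of squares. The small matrix `M` below is a made-up example, not part of the data above:

```r
# Hypothetical 3 x 2 matrix: each row is one point in the plane
M <- matrix(c(3, 4,
              0, 2,
              1, 1), ncol = 2, byrow = TRUE)

# Euclidean norm of each row
norms <- sqrt(rowSums(M^2))
norms
## [1] 5.000000 2.000000 1.414214

# Which points lie within distance 2 of the origin?
norms <= 2
## [1] FALSE  TRUE  TRUE
```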

# Draw a circle of radius r centered at (xc, yc) as a ribbon
# between its lower and upper semicircles
gg_circle <- function(r = 1, xc = 0, yc = 0, color = "black", fill = NA, ...) {
  x <- xc + r * cos(seq(0, pi, length.out = 100))
  ymax <- yc + r * sin(seq(0, pi, length.out = 100))   # upper semicircle
  ymin <- yc + r * sin(seq(0, -pi, length.out = 100))  # lower semicircle
  annotate("ribbon", x = x, ymin = ymin, ymax = ymax, color = color, fill = fill, ...)
}

df2 <- data.frame(X[, 1:2], within2sd)
df2 %>%
  ggplot(aes(x = X1, y = X2, color = within2sd)) +
  geom_point() +
  gg_circle(r = 2, color = "red") +
  coord_fixed() +
  labs(title = "Two Dimensions of Normal Distribution Data")

mean(within2sd)
## [1] 0.88

In this example, 88 percent of the points formed by the first two columns lie within two standard deviations of the mean, that is, inside the circle of radius 2 centered at the origin.
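
The drop is not a sampling accident. Assuming the coordinates are independent standard normal values (as the columns of X are), the squared Euclidean norm of a point in d dimensions follows a chi-squared distribution with d degrees of freedom, so the probability of landing within distance 2 is the probability that a chi-squared variable is at most 4. A short sketch with base R's `pchisq`; the name `p2` is my own:

```r
# P(norm <= 2) = P(chi-squared with 2 degrees of freedom <= 4)
p2 <- pchisq(4, df = 2)
round(p2, 4)
## [1] 0.8647

# For d = 2 this has the closed form 1 - exp(-2)
all.equal(p2, 1 - exp(-2))
## [1] TRUE
```

The observed 88 percent is close to this exact value of roughly 86.5 percent.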

### Higher Dimensions

Plotting scatterplots in higher dimensions is much more complicated, but we can still perform the norm calculations pretty quickly.

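N <- 10 # total number of dimensions
within2sd <- rep(0, N) # proportion within distance 2, one entry per dimension

# one dimension: x1 is a plain vector, so use abs() instead of row_norms()
within2sd[1] <- mean(abs(x1) <= 2)

# higher dimensions: Euclidean norm over the first d columns
for (d in 2:N) {
  within2sd[d] <- mean(row_norms(X[, 1:d]) <= 2)
}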

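# plot the proportion within distance 2 for each number of dimensions
dimensions <- 1:N
df <- data.frame(dimensions, within2sd)
df %>%
  ggplot(aes(x = dimensions, y = within2sd)) +
  geom_bar(stat = "identity", fill = "blue") +
  scale_x_continuous("Dimensions", breaks = 1:N) +
  labs(y = "Proportion within distance 2")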
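
The simulated proportions can be compared with the exact ones. This is a sketch under the same assumption as the simulation (independent standard normal coordinates), in which case the squared norm in d dimensions is chi-squared with d degrees of freedom; the names `exact` and `exact_df` are my own:

```r
# Exact probability of lying within distance 2 of the origin, d = 1..10
N <- 10
dimensions <- 1:N
exact <- pchisq(4, df = dimensions)

# Tabulate for comparison with the simulated within2sd values
exact_df <- data.frame(dimensions, exact = round(exact, 3))
exact_df
```

The exact proportion decreases with every added dimension and falls to roughly 5 percent by ten dimensions, which is the pattern the bar chart shows.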