Recently, I’ve been revisiting some of my older code and have noticed how much my understanding of efficient programming has grown. While I’m still exploring whether practice truly leads to perfection, it’s clear that my skills have improved 😊. In examining my past projects, I discovered a seemingly minor oversight that had a significant impact on performance: the choice between iterating over the rows versus the columns of a matrix.
The short answer: it does make a difference!
Illustration: mean computation
Best Practice R Functions
To assess the computational cost of row-wise versus column-wise operations, I implemented custom functions in base R. The row_means and col_means functions calculate the mean of each row and each column of a matrix, respectively. These implementations follow R programming best practices, using only base R functions and explicit looping constructs for clarity.
row_means <- function(x) {
  n <- nrow(x)
  p <- ncol(x)
  out <- matrix(nrow = n, ncol = 1)
  for (i in 1:n) {
    out[i, ] <- sum(x[i, ]) / p  # a row has p = ncol(x) elements
  }
  return(out)
}
col_means <- function(x) {
  n <- nrow(x)
  p <- ncol(x)
  out <- matrix(nrow = 1, ncol = p)
  for (j in 1:p) {
    out[, j] <- sum(x[, j]) / n  # a column has n = nrow(x) elements
  }
  return(out)
}
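As a quick sanity check, both loop-based functions should agree with base R’s built-in rowMeans() and colMeans(). Here is a minimal sketch on a small toy matrix:

X <- matrix(1:6, nrow = 2, ncol = 3)
all.equal(as.vector(row_means(X)), rowMeans(X))  # TRUE
all.equal(as.vector(col_means(X)), colMeans(X))  # TRUE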
RcppArmadillo
By writing custom C++ functions, we can perform these operations much faster than with the base R versions. In the following code, we define two functions using Rcpp: arma_rowMeans and arma_colMeans. These functions are thin wrappers around arma::mean() from the highly optimized Armadillo library.
Rcpp::cppFunction(
  'arma::mat arma_rowMeans(const arma::mat& X) {
    // dim = 1: take the mean along each row
    return arma::mean(X, 1);
  }',
  depends = "RcppArmadillo"
)
Rcpp::cppFunction(
  'arma::mat arma_colMeans(const arma::mat& X) {
    // dim = 0 (the default): take the mean along each column
    return arma::mean(X, 0);
  }',
  depends = "RcppArmadillo"
)
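Assuming RcppArmadillo is installed, a quick check confirms the wrappers match base R; note that Armadillo returns the row means as a one-column matrix and the column means as a one-row matrix:

X <- matrix(rnorm(12), nrow = 3)
all.equal(as.vector(arma_rowMeans(X)), rowMeans(X))  # TRUE
all.equal(as.vector(arma_colMeans(X)), colMeans(X))  # TRUE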
In the following chunk, I use the bench package to systematically compare the performance of row-wise and column-wise mean computations for matrices of different sizes. bench::press() allows us to run benchmarks across a range of matrix sizes (N = c(1e2, 1e3, 1e4)). For each matrix size, I generate a random matrix X and its transpose Y; because Y is the transpose of X, the row means of X are exactly the column means of Y, so each row-wise/column-wise pair computes the same quantities. This lets us measure the cost of calculating row means and column means with the base R functions, R’s internal .rowMeans() and .colMeans(), and the optimized Armadillo functions.
results <- bench::press(
  N = c(1e2, 1e3, 1e4),
  {
    X <- matrix(rnorm(N^2), ncol = N)
    Y <- t(X)
    bench::mark(
      ".rowMeans" = .rowMeans(X, m = N, n = N),
      ".colMeans" = .colMeans(Y, m = N, n = N),
      "rowsR"     = row_means(X),
      "colsR"     = col_means(Y),
      "rowsARMA"  = arma_rowMeans(X),
      "colsARMA"  = arma_colMeans(Y),
      iterations = 10,
      check = FALSE  # return shapes differ (vector vs. matrix), so skip equality checks
    )
  }
)
library(dplyr)
library(stringr)
library(ggplot2)

r <- results %>%
  transmute(
    N = N,
    dim = ifelse(grepl("row", as.character(expression)), "rows", "columns"),
    time = median,
    func = str_extract(as.character(expression), pattern = "[A-Za-z._]+")
  ) %>%
  mutate(
    func = case_when(
      func %in% c(".rowMeans", ".colMeans") ~ "R internal",
      func %in% c("rowsARMA", "colsARMA")   ~ "RcppArmadillo",
      func %in% c("rowsR", "colsR")         ~ "Base R function"
    )
  )
r %>%
  ggplot() +
  geom_line(aes(x = N, y = time, col = func, linetype = dim), linewidth = 1) +
  labs(linetype = '', col = '', y = "median computing time (seconds)")
Figure 1 reveals a clear performance hierarchy: the hand-written base R loops are far slower than both R’s internal functions and the Armadillo-based solutions, and the Armadillo variants in turn outperform R’s internals. Across all three implementations, column-wise operations execute faster than their row-wise counterparts.
What’s going on?
R and libraries like Armadillo (used in Rcpp) store matrices in column-major order, meaning all elements in a column are stored consecutively in memory, followed by the elements of the next column. This simple detail explains why column-wise operations are consistently faster than row-wise operations, especially for large matrices:
Column-wise operations align perfectly with this memory layout. Since the CPU can access consecutive memory locations efficiently, it uses caching mechanisms and prefetching to speed up computations. The CPU reads chunks of contiguous data into its cache, minimizing the time spent accessing slower main memory.
In contrast, row-wise operations disrupt this flow. To access the elements of a row, the CPU must jump between non-contiguous memory locations. This leads to cache misses and slower performance, as the CPU repeatedly fetches new blocks of memory instead of streaming through it sequentially.
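You can see the column-major layout directly in R with a minimal toy example: matrix() fills its cells column by column, and dropping the dim attribute reveals the underlying storage order.

m <- matrix(1:6, nrow = 2, ncol = 3)
m
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
as.vector(m)  # 1 2 3 4 5 6 -- the columns laid out one after another, as stored in memory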
The takeaway is simple: prefer column-wise operations whenever possible! By doing so, we align computations with the way data is stored in memory, enabling the CPU to work efficiently. This small optimization can lead to significant performance gains, particularly for large datasets or high-performance tasks.
And that’s it for today 😊.