--- title: "H2O4GPU: Machine Learning with GPUs in R" author: "Navdeep Gill, Erin LeDell, Yuan Tang" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{H2O4GPU: Machine Learning with GPUs in R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- **H2O4GPU** is a collection of GPU solvers by [H2O.ai](https://www.h2o.ai/) with APIs in Python and R. The Python API builds upon the easy-to-use [scikit-learn](https://scikit-learn.org/) API. The **h2o4gpu** R package is a wrapper around the **h2o4gpu** Python package. The R package makes use of RStudio's [reticulate](https://rstudio.github.io/reticulate/) package for facilitating access to Python libraries through R. Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability. **H2O4GPU** is a new project under active development and we are looking for contributors! If you find a bug, please check that we have not already fixed the issue in the bleeding edge version and then check that we do not already have an issue opened for this topic. If not, then please file a new issue with a reproducible example. - Here is the main [GitHub repo](https://github.com/h2oai/h2o4gpu). If you like the package, please 🌟 the repo on GitHub! - If you're looking to contribute, check out the [CONTRIBUTING.md](https://github.com/h2oai/h2o4gpu/blob/master/CONTRIBUTING.md) file. - All open issues that are specific to the R package are [here](https://github.com/h2oai/h2o4gpu/labels/R). - All open issues are [here](https://github.com/h2oai/h2o4gpu/issues?utf8=%E2%9C%93&q=is%3Aopen). ## Installation There are a few [system requirements](https://github.com/h2oai/h2o4gpu#requirements), including Linux with glibc 2.17+, Python >=3.6, R >=3.1, CUDA 8 or 9, and a machine with Nvidia GPUs. The code should still run if you have CPUs, but it will fall back to scikit-learn CPU based versions of the algorithms. The **h2o4gpu** Python module is a prerequisite for the R package. So first, follow the instructions [here](https://github.com/h2oai/h2o4gpu#user-installation) to install the **h2o4gpu** Python package (either at the system level or in a Python virtual envivonment). The easiest thing to do is to `pip install` the stable release `whl` file. To ensure compatibility, the Python package version number should match the R package version number. The recomended way of installing the R package can is from CRAN using `install.packages("h2o4gpu")`. To install the development version of the **h2o4gpu** R package, you can install directly from GitHub as follows: ```{r, eval = FALSE} library(devtools) devtools::install_github("h2oai/h2o4gpu", subdir = "src/interface_r") ``` ### Virtual environments Using a Python [virtual environment](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments) is a good solution if you don't want to upgrade your main Python installation to 3.6. If you installed the **h2o4gpu** Python module into a virtual environment, you will have to add a line of code to tell R which Python envivonment you want to use: ```{r, eval = FALSE} library(reticulate) use_virtualenv("/home/username/venv/h2o4gpu") # set this to the path of your venv ``` If you have installed **h2o4gpu** Python module using Anaconda, then you can use the `use_condaenv()` function instead. More information about Python environment configuration is available in the reticulate [user guide](https://rstudio.github.io/reticulate/articles/versions.html). ## Quickstart Here's a quick demo of how to train and evaluate a GPU-based Random Forest classifier model. We will use the classic Iris dataset, which is a three-class classification problem and evaluate the performance of the model using classification error. ```{r, eval = FALSE} library(h2o4gpu) library(reticulate) # only needed if using a virtual Python environment use_virtualenv("/home/username/venv/h2o4gpu") # set this to the path of your venv # Prepare data x <- iris[1:4] y <- as.integer(iris$Species) # all columns, including the response, must be numeric # Initialize and train the classifier model <- h2o4gpu.random_forest_classifier() %>% fit(x, y) # Make predictions pred <- model %>% predict(x) # Compute classification error using the Metrics package (note this is training error) library(Metrics) ce(actual = y, predicted = pred) ``` ## Supervised Learning **H2O4GPU** contains a collection of popular algorithms for supervised learning: Random Forest, Gradient Boosting Machine (GBM) and Generalized Linear Models (GLMs) with Elastic Net regularization. There are methods for regression and classification for each of these algorithms. Both Random Forest and GBM support multiclass clasification, however the GLM currently only supports binomial classification (a ticket for multinomial support is open [here](https://github.com/h2oai/h2o4gpu/issues/505)). The tree based models (Random Forest and GBM) are built on top of the very powerful [XGBoost](https://xgboost.readthedocs.io/en/latest/) library, and the Elastic Net GLM has been built upon the POGS solver. [Proximal Graph Solver (POGS)](https://stanford.edu/~boyd/papers/pogs.html) is a solver for convex optimization problems in graph form using Alternating Direction Method of Multipliers (ADMM). We have found that this method is not as fast as we'd like it to be, so we are working on implementing an entirely new GLM from scratch (follow progress [here](https://github.com/h2oai/h2o4gpu/issues/356)). The **h2o4gpu** R package does not include a suite of internal model metrics functions, therefore we encourage users to use a third-party model metrics package of their choice. For all the examples below, we will use the [Metrics](https://CRAN.R-project.org/package=Metrics) R package. This package has a large number of model metrics functions, all with a very simple, unified API. ### Binary Classification In this example, we will train and test three different models on a subset of the [HIGGS](https://archive.ics.uci.edu/ml/datasets/HIGGS) dataset. The goal in this dataset is to distinguish between signal "1" and background "0", so this is a binary classification problem. The features are all numeric. **H2O4GPU** requires all feature and response columns to be numeric, so in this case, we don't have to do any pre-processing of the data. If your response column is a factor, then you can simply convert the levels to integer values using `as.integer()`. If you have categorical/factor columns among your features, you must apply an encoding method to convert the columns into numeric data. Some options are label encoding (simply convert the levels to integers) or one hot encoding (binary indicator columns, one for each categorical level). For simplicity, in this tutorial, we will always use label encoding, however you can read more about different types of encodings [here](https://dzone.com/articles/handling-character-data-for-machine-learning). ```{r, eval = FALSE} # Load a sample dataset for binary classification # Source: https://archive.ics.uci.edu/ml/datasets/HIGGS train <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv") test <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv") # Create train & test sets (column 1 is the response) x_train <- train[, -1] y_train <- train[, 1] x_test <- test[, -1] y_test <- test[, 1] ``` Below we see that the **h2o4gpu** modeling functions follow a two-phased functional apporach. The two phased approach to modeling (first initialize model, then train) is more common in Python, and we borrow that paradigm here. We blend this with the the functional pipe syntax in R. First you define the model with it's hyperparameters, for example, `h2o4gpu.gradient_boosting_classifier(n_estimators = 500, subsample = 0.8)`. Then we pipe the initialized model object to the `fit(x, y)` function to train the model, and save the resulting object. ```{r, eval = FALSE} # Train three different binary classification models model_gbc <- h2o4gpu.gradient_boosting_classifier() %>% fit(x_train, y_train) model_rfc <- h2o4gpu.random_forest_classifier() %>% fit(x_train, y_train) model_enc <- h2o4gpu.elastic_net_classifier() %>% fit(x_train, y_train) ``` We pipe our trained models to the familiar `predict()` method. In binary classification, we are often more interested in the numeric predicted values, rather than the predicted class labels. We follow the same design as the `predict()` function in the popular **caret** package, which allows the user to specify which type of predictions they want to return using the `type` argument. This defaults to `"raw"` which in classification, yields predicted class labels. When we set it to `"prob"`, it returns the (uncalibrated) class probabilities. This is not mentioned often in modeling software documentation, but you should note that despite using the term "probabilities", these predicted values do not represent actual probabilities unless some method like [Platt scaling](https://en.wikipedia.org/wiki/Platt_scaling) is used for calibration. This is true for all machine learning packages, including **caret**, **h2o**, and **h2o4gpu** (though we do offer the option to perform Platt scaling inside the **h2o** R package). ```{r, eval = FALSE} # Generate predictions (type "prob" gives predicted values instead of predicted label) pred_gbc <- model_gbc %>% predict(x_test, type = "prob") pred_rfc <- model_rfc %>% predict(x_test, type = "prob") pred_enc <- model_enc %>% predict(x_test, type = "prob") ``` Let's take a look at what the output of the `predict()` function looks like in binary classification. It will be a two-column matrix with the column names set to the names of the classes. ```{r, eval = FALSE} head(pred_rfc) ``` To compute AUC of a binary classification model, we use the predicted values of the second column (the "positive" class) and pass that to the `Metrics::auc()` function. ```{r, eval = FALSE} # Compare test set performance using AUC auc(actual = y_test, predicted = pred_gbc[, 2]) auc(actual = y_test, predicted = pred_rfc[, 2]) auc(actual = y_test, predicted = pred_enc[, 2]) ``` ### Multiclass Classification Now that we are familiar with binary classification, there is not much more to say about multiclass classification. The predict output will have the same format as binary classification, except that if you use `type = "prob"` number of columns will match the number of classes. Often in multiclass classification, you may be interested in the predicted class label and misclassification error, which we've demonstrated already in the Quickstart section. ### Regression In this next exercise, we will compare a GBM and GLM regression model. Until [this issue](https://github.com/h2oai/h2o4gpu/issues/493) is respolved, we don't recommend that you use the Random Forest regressor, as there are some bugs that are severely affecting model performance. We will predicting the age of abalone from physical measurements, using the [Abalone](https://archive.ics.uci.edu/ml/datasets/Abalone) dataset. ```{r, eval = FALSE} # Load a sample dataset for regression # Source: https://archive.ics.uci.edu/ml/datasets/Abalone df <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", header = FALSE) str(df) ``` There is one categorical/factor column in this dataset, so we will first convert those values to integers (label encoding). Recall that label encoding is just one way of encoding the categorical column and that there may be other ways that produce better results in terms of model performance. ```{r, eval = FALSE} df[, 1] <- as.integer(df[, 1]) #label encode the one factor column ``` In this case, we started with a single data frame, so we should break the data into train and test splits at random. We can do that easily in R by sampling 80% of the row indices and subsetting the data frame by row. ```{r, eval = FALSE} # Randomly sample 80% of the rows for the training set set.seed(1) train_idx <- sample(1:nrow(df), 0.8*nrow(df)) # Create train & test sets (column 9 is the response) x_train <- df[train_idx, -9] y_train <- df[train_idx, 9] x_test <- df[-train_idx, -9] y_test <- df[-train_idx, 9] ``` ```{r, eval = FALSE} # Train two different regression models model_gbr <- h2o4gpu.gradient_boosting_regressor() %>% fit(x_train, y_train) model_enr <- h2o4gpu.elastic_net_regressor() %>% fit(x_train, y_train) # Generate predictions pred_gbr <- model_gbr %>% predict(x_test) pred_enr <- model_enr %>% predict(x_test) ``` In regression, the `predict()` function always returns a vector of predictions (not a data frame). ```{r, eval = FALSE} head(pred_gbr) ``` In regression problems, Mean Squared Error (MSE), is a common metric for model evaluation. We will use test set MSE to evaluate and compare our two models. ```{r, eval = FALSE} # Compare test set performance using MSE mse(actual = y_test, predicted = pred_gbr) mse(actual = y_test, predicted = pred_enr) ``` In this case, which is not usual, the GBM drastically outperforms the GLM. ## Unsupervised Learning The unsupervised learning algorithms in **h2o4gpu** include K-Means, Principal Component Analysis (PCA), and Truncated Singular Value Decompostion (SVD). ### K-Means Clustering First we will train a K-Means model. Let's create a train and test set from the iris dataset. ```{r, eval = FALSE} # Prepare data iris$Species <- as.integer(iris$Species) # convert to numeric data # Randomly sample 80% of the rows for the training set set.seed(1) train_idx <- sample(1:nrow(iris), 0.8*nrow(iris)) train <- iris[train_idx, ] test <- iris[-train_idx, ] ``` Train a K-Means model with three clusters. ```{r, eval = FALSE} model_km <- h2o4gpu.kmeans(n_clusters = 3L) %>% fit(train) ``` Once you have trained a K-Means model, applying the `transform()` function to a dataset transforms your points into distances from each centroid. So your `n`x`p` matrix becomes `n`x`k` (`n` is the number of observations,`p` the number of features and `k` the number of clusters). ```{r, eval = FALSE} test_dist <- model_km %>% transform(test) head(test_dist) ``` ### Principal Compoment Analysis (PCA) Let's use the HIGGS train and test datasets again for demonstration. ```{r, eval = FALSE} # Load a sample dataset for binary classification # Source: https://archive.ics.uci.edu/ml/datasets/HIGGS train <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv") test <- read.csv("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv") ``` Train a PCA model with 4 components and apply the transformation onto a dataset. Once you have created a projection model from a dataset, you can apply that transformation to a new dataset (such as a test set) using the `transform()` function. ```{r, eval = FALSE} model_pca <- h2o4gpu.pca(n_components = 4) %>% fit(train) test_transformed <- model_pca %>% transform(test) ``` ### Truncated Singular Value Decomposition (SVD) Train a truncated SVD model with 4 components and apply the transformation on a test set. ```{r, eval = FALSE} model_tsvd <- h2o4gpu.truncated_svd(n_components = 4) %>% fit(train) test_transformed <- model_tsvd %>% transform(test) ```