After Microsoft acquired Revolution Analytics in 2015, it created many R implementations, applications and extensions of existing tools. With so many options, users may struggle to select and adapt the most appropriate R distribution. This latest article in our Talking about R… series will help you make the right choice for your use case.
First, let’s look at the differences between the R distributions. Microsoft has its own repository, the Microsoft R Application Network (MRAN). It hosts previous and current versions of Microsoft R Open, as well as a set of packages along with their snapshots, thanks to CRAN Time Machine.
Microsoft R Open is an enhanced version of open source R that, in addition to the functionality available in the classic R distribution, achieves higher efficiency. This is thanks to the option of installing Intel’s Math Kernel Library (Intel MKL).
Intel MKL optimizes matrix computations in R and introduces multithreaded processing, which can increase efficiency by up to 45 times, depending on the application. You can read more about the differences in efficiency on the official website.
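To get a feel for where such gains come from, you can time a dense matrix multiplication, which is one of the operations that a BLAS library such as Intel MKL accelerates. The snippet below is only a minimal sketch: the matrix size `n` is an arbitrary value chosen for illustration, and the measured time depends entirely on your machine and on the BLAS your R session is linked against.

```r
# Time a dense matrix multiplication, a typical BLAS-bound operation.
set.seed(42)
n <- 500                                  # illustrative size, not from the article
a <- matrix(rnorm(n * n), nrow = n)
b <- matrix(rnorm(n * n), nrow = n)

# system.time() reports user, system and elapsed time in seconds.
timing <- system.time(product <- a %*% b)
cat("elapsed seconds:", timing[["elapsed"]], "\n")
```

Running the same script in classic R and in Microsoft R Open with MKL installed is a simple way to compare the two on your own hardware.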
An important benefit of the MRAN repository is the restore functionality based on snapshots of R and package versions, which are maintained using CRAN Time Machine. Moreover, because it is officially distributed, this R distribution is also supported by Microsoft.
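As a rough illustration of how a snapshot can be used, the sketch below points the session's CRAN repository at a dated snapshot, so that packages installed afterwards come from that date. The snapshot date is a hypothetical example, and the URL pattern is an assumption about how MRAN exposes its snapshots; check it against the repository before relying on it.

```r
# Hypothetical snapshot date, chosen only for illustration.
snapshot_date <- "2017-01-01"

# Assumed URL pattern for dated MRAN snapshots.
snapshot_repo <- paste0("https://mran.microsoft.com/snapshot/", snapshot_date)

# Point the session at the snapshot and keep the previous setting.
old_options <- options(repos = c(CRAN = snapshot_repo))
print(getOption("repos"))

# install.packages("data.table") would now resolve against the snapshot.

# Restore the original repository setting.
options(old_options)
```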
For multithreaded processing and computing, Microsoft R Server uses the same functions and libraries as Microsoft R Open, i.e. the Intel Math Kernel Library. What sets Microsoft R Server apart from R Open is the ability to process data on several nodes, e.g. across a number of computers.
A key feature of R Server is that it can be installed on many platforms, including Linux, Windows, Hadoop and Teradata DB. This makes it possible to deliver analysis of the received data that is as accurate as possible.
Additionally, Microsoft R Server makes it possible to carry out a sequence of operationalization tasks thanks to the DeployR package. It supports deploying code according to best practices and managing the code in a clear and transparent way.
The package also enables management of the data processing environment across more than one server. R Server already includes the Intel Math Kernel Library, so it does not require additional installation.
Another important element of R Server is the dedicated RevoScaleR library. It is a set of enhanced functions for importing, transforming and analyzing data at a larger scale, optimized for multithreaded processing.
The barrier to entry for using these functions is quite low, as the only difference in naming from the classic functions is the rx prefix. For instance, instead of the kmeans function we use rxKmeans, for correlation rxCor, etc.
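The naming parallel can be sketched side by side. The example below runs the classic functions on the built-in `iris` data and, only if RevoScaleR happens to be installed, calls the rx counterparts as well. Note that the rx functions take a formula and a `data` argument rather than a bare matrix; the choice of columns and of three clusters is purely illustrative.

```r
# Classic R: k-means clustering and a correlation matrix on two columns.
features <- iris[, c("Sepal.Length", "Sepal.Width")]

set.seed(1)
base_fit <- kmeans(features, centers = 3)
base_cor <- cor(features)

# RevoScaleR counterparts, run only when the package is available.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  rx_fit <- RevoScaleR::rxKmeans(~ Sepal.Length + Sepal.Width,
                                 data = iris, numClusters = 3)
  rx_cor <- RevoScaleR::rxCor(~ Sepal.Length + Sepal.Width, data = iris)
} else {
  message("RevoScaleR is not installed; showing the classic functions only.")
}

print(base_cor)
```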
If you are an analyst working on tasks involving machine learning, processing and cleansing data, and want to work locally with the functions available in RevoScaleR, you can do so with Microsoft R Client. It gives access to the full range of functions in this package and to multithreaded data processing, limited to a maximum of 2 threads.
However, if we need to take the processing to a higher level, we can switch the compute context from within R Client to an R Server instance, built e.g. on Hadoop clusters, and make use of all the R Server tools.
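Switching the compute context might look roughly like the sketch below. The `use_cluster` flag and the `context_name` variable are hypothetical names introduced for illustration, and the connection settings a real Hadoop cluster needs are deliberately left out because they depend on your environment. The RevoScaleR calls run only if the package is installed.

```r
# Hypothetical switch between local and cluster execution.
use_cluster <- FALSE
context_name <- if (use_cluster) "hadoop" else "local"

if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  if (use_cluster) {
    # A real cluster needs connection settings; none are given here.
    rxSetComputeContext(RxHadoopMR())
  } else {
    rxSetComputeContext("local")
  }

  # Later rx* calls run in whichever context is active.
  print(rxSummary(~ Sepal.Length, data = iris))
} else {
  message("RevoScaleR is not installed; compute context left unchanged.")
}
```

The analysis code itself stays the same; only the compute context decides where it runs.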
R Client is best used for local data processing with the RevoScaleR package (with processing limited to two threads).
R Services is simply a SQL Server feature that allows R scripts to be run in SQL Server procedures. Thanks to this feature it is possible to analyze data more accurately.
It also improves efficiency because it is no longer necessary to move the whole data source into R memory (which is where the data used to be analyzed) using the RODBC library to connect to the database. This was often a considerable limitation when memory resources were limited and the amount of data to analyze was large.
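To make this more concrete, here is a sketch of the kind of R script that could be embedded in a SQL Server procedure. Inside SQL Server, the result of the input query is handed to the script as a data frame named `InputDataSet` by default, and the script returns its result through `OutputDataSet`. Here the input is simulated with a small made-up data frame so that the script can be tried in any R session; the column names and values are assumptions for illustration only.

```r
# Stand-in for the rows SQL Server would pass in from the input query.
InputDataSet <- data.frame(
  region = c("north", "south", "north", "south"),
  sales  = c(120, 80, 100, 60)
)

# Body of the embedded script: average sales per region.
OutputDataSet <- aggregate(sales ~ region, data = InputDataSet, FUN = mean)

print(OutputDataSet)
```

Because the script runs next to the data, only the aggregated result has to leave the database.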
It is worth noting that after Microsoft’s recent changes to the names of its services, R Services and R Server have new names: SQL Server R Services is now SQL Server Machine Learning Services, and Microsoft R Server is now Microsoft Machine Learning Server.
This is because these tools were extended to support Python scripts.
This article has hopefully shed some light on the different R distributions available on the market. If you’d like further information or advice, or are thinking of doing a project utilizing R, don’t hesitate to contact us!