R (the language)

R is the current hotness in statistics. I may as well use it, as 2/3 of all
statistical algorithms I’ve run into of late are implemented in it. Of those
that remain, most are written for MATLAB, which is, IMO, some kind
of weird con job pulled on the maths community by disgruntled scientific
computation graduates who want to double bill you for the use of your own
floating point unit. C, Python and Java seem to vie for 3rd spot,
or possibly one of the commercial alternatives such as S, SPSS or Stata.
There are some disconcertingly enthusiastic persons advocating Julia,
and a few super old school command-line thingies.

Pros and cons

Good

  • combines unparalleled breadth and community, at least as pertains to
    statisticians, data miners, machine learners and other such
    assorted folk as I am pleased to call my colleagues. To get some sense of
    this thriving scene, check out R-bloggers. This community alone is
    enough to sell R, whatever you think of the language
    (cf “Your community is your best asset”).
    And believe me, I have reservations about everything else.
  • amazing, statistically-useful plotting (cf, e.g., the awful battle to
    get error bars in mayavi)
  • online web-app visualization: shiny

Bad

  • Seems, from my personal aesthetic, to have been written by a team who
    prioritise delivering statistical functionality right now over making an
    elegant, fast or consistent language to access that functionality.
    (“Elegant”, “fast”, “consistent”; you can choose… uh…
    Oh look, it’s lunch break! So what are you doing this weekend?)
    I’d rather access those same beautiful libraries through a language which has
    had as many computer scientists winnowing its ugly bits as Python or Ruby
    has had.
    Or indeed Go, Julia; even JavaScript has managed to drag itself out of hell
    these days.
    And, for that matter, I’d like as many amazing third-party
    libraries for non-statistical things as these other languages promise.
  • Poetically, R has random scope amongst other
    parser and syntax weirdness.
  • Call-by-value semantics (in a “big-data” processing language?)
  • …ameliorated not even by array views,
  • …exacerbated by bloaty design
  • Object model tacked on after the fact… in fact, several object models,
    which is fine? I guess? maybe, but…
  • …if the object model stuff is a multi-standard compatibility disaster,
    I’d like the trade-off to be speed, or functional design features, or some
    other such modern convenience. Nah.
  • One of the worst names to google for ever (cf Processing, Pure)

Tips

Easy project reload

devtools for lazy people:

Make a folder called MyCode with a DESCRIPTION file.
Make a subfolder called R.
Put R code in .R files in there.
Edit, load_all("MyCode"), use the functions.
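
A minimal sketch of the above, assuming devtools is installed (the hello function is a made-up example):

# MyCode/DESCRIPTION needs at least:
#   Package: MyCode
#   Version: 0.0.1
# MyCode/R/hello.R contains, say:
#   hello <- function() "hi"

library(devtools)
load_all("MyCode")  # sources everything under MyCode/R
hello()             # the freshly loaded function
# ...edit MyCode/R/hello.R, then load_all("MyCode") again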

Functional prog hacks

  • purrr: “A FP package for R in the spirit of underscore.js”

  • magrittr brings a compose (“pipe”) operator to R:

    %>%
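
For instance, a toy pipeline (the operator itself is all magrittr ships; the data here is made up):

library(magrittr)
# x %>% f is more or less f(x), so this reads left to right:
c(1, 4, 9) %>% sqrt %>% sum
# [1] 6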
    

split/apply

Useful functions: semi_join and friends.
plyr and dplyr are the essential packages.
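
A minimal split/apply sketch with dplyr, using the built-in iris data:

library(dplyr)
# split by Species, apply mean, combine back into a data frame
iris %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal.Length))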

subsetting hell

To subset a list-based object:

x[1]

to subset and optionally downcast the same:

x[[1]]

to subset a matrix-based object:

x[1, , drop=FALSE]

to subset and optionally downcast the same:

x[1, ]
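
A quick demonstration of the matrix asymmetry (x here is a throwaway 2×2 matrix):

x <- matrix(1:4, nrow = 2)
x[1, , drop = FALSE]  # still a 1x2 matrix
x[1, ]                # silently downcast to a bare integer vector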

plotting

ggvis is the latest iteration of the ggplot family, AFAICT.

Pro tip:
It’s worth having an install of R around just for the grammar of graphics packages.
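
A taste of the grammar, as a sketch with ggplot2 and the built-in iris data:

library(ggplot2)
# map columns to aesthetics, then add layers on top
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point() +
  geom_smooth(method = "lm")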

How to pass sparse matrices between R and Python

https://gist.github.com/howthebodyworks/9e89e65bfc58fded46ae

This FS-backed method was a couple of orders of magnitude faster than rpy2 last time I tried to pass more than a few MB of data.
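
One filesystem-backed route (not necessarily the exact one in the gist) is the MatrixMarket format, which both R’s Matrix package and scipy speak; paths here are illustrative:

library(Matrix)
m <- sparseMatrix(i = c(1, 3, 5), j = c(2, 4, 1), x = c(1.5, -2, 7))
writeMM(m, "m.mtx")
# Python side:
#   from scipy.io import mmread
#   m = mmread("m.mtx").tocsr()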

Upgrading R breaks the installed packages

This is the fix:

update.packages(checkBuilt=TRUE, ask=FALSE)

Bioconductor’s horrifyingly pwnable install

In fact, the default package management might not be much better, but the
secondary R package repository makes it terrifyingly clear:

What, you’d like to install some biostatistics software on
your campus supercomputing cluster? Easy! Simply download and run this
unverifiable, necessarily unencrypted, unsigned script from a webserver of unknown provenance!

source("http://bioconductor.org/biocLite.R")
biocLite("RBGL")

It is probably not often script kiddies spoofing you so as to trojan
your campus computing cluster to steal CPU cycles. After all,
who would do that?

On an unrelated note, I am looking for investors in a distributed bitcoin
mining operation. Contact me privately.

There are step debuggers and other such modern conveniences

  • inspecting frames post hoc: recover
    In fact, pro-tip, you can invoke it in 3rd party code gracefully:

    options(error = utils::recover)
    
  • Interactive debugger: browser (see the sketch after this list)

  • Graphical, interactive, optionally-web-based debugger available in RStudio, which, if it had any more buzzwords in it, would socially tag your Instagram and upload it to the NSA’s Internet of Things to be 3D printed.

  • easy command-line invocation: Rio: loads CSV from stdin into R as a data.frame, executes given commands, and gets the output as CSV or PNG on stdout
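
To illustrate browser, as promised: a made-up function that pauses itself mid-flight:

buggy_mean <- function(x) {
  total <- sum(x)
  browser()  # execution stops here; inspect total and x, then type c to continue
  total / length(x)
}
buggy_mean(c(1, 2, 3))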

R for Pythonistas

Many things about R are surprising to me, coming as I do most recently from
Python. I’m documenting my perpetual surprise here, in order that it may save
someone else the inconvenience of going to all that trouble to be personally surprised.

Opaque imports

Importing an R package, unlike importing a Python module, brings in random
cruft that may have little to do with the name of the thing you just imported.
That is, IMO, poor planning, although history indicates that most language
designers don’t agree with me on that:

> npreg
Error: object 'npreg' not found
> library("np")
Nonparametric Kernel Methods for Mixed Datatypes (version 0.40-4)
> npreg
function (bws, ...) #etc

Further, data structures in R can, and are intended to, provide first-class
scopes for name lookup. In your explorations into data you are apt to
bring the names of columns in a data set into scope just as much as the names
of functions from a library. This is kind of useful, although the scoping
rules do make my eyes water when this intersects with function definition.

Formulas are cool and ugly, like Adult Swim, and intimately bound up in the
prior point.
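
For instance (df is a made-up data frame):

df <- data.frame(x = 1:10, y = rnorm(10))
with(df, mean(x))     # the column names are in scope inside the expression
lm(y ~ x, data = df)  # a formula defers the name lookup into `data`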

assignment to function calls

The R terminology for these, it turns out, is “replacement functions”.

R fosters a style of programming where attributes and metadata of data objects
are set by using accessor functions, e.g. in matrix column naming:

> m <- matrix(0, nrow = 2, ncol = 2)
> m
     [,1] [,2]
[1,]    0    0
[2,]    0    0
> colnames(m)
NULL
> colnames(m) <- c('a', 'b')
> colnames(m)
[1] "a" "b"
> m
     a b
[1,] 0 0
[2,] 0 0

If you want to know, by observing its effects, whether an apparent function
returns some massaged product of its argument, or whether it decorates the
argument, well, check the manual. As a rule, the accessor functions operate on
one object and return NULL, as can, e.g., plotting functions.
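
Under the hood, the assignment form is sugar for a call to the replacement function plus a rebinding, roughly:

# colnames(m) <- c('a', 'b') desugars to approximately:
m <- `colnames<-`(m, c('a', 'b'))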

No scalar types…

A float is a float vector of length 1:

> 5
[1] 5
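
You can check:

> length(5)
[1] 1
> is.vector(5)
[1] TRUE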

…yet verbose vector literal syntax

You make vectors with a call to a function called c. Witness:

> c('a', 'b', 'c', 'd')
[1] "a" "b" "c" "d"

If you type a literal vector in though, it will throw an error:

> 'a', 'b', 'c', 'd'
Error: unexpected ',' in "'a',"

I’m sure there are Reasons for this;
it’s just that they are reasons that I don’t care about.

In short,

A powerful, effective, diverse, well-supported nightmare.

OTOH, as far as statistical languages go, this is wonderful;
the others are less supported, less diverse,
and R is now the de facto standard,
so I count my blessings.

To read
