Statistical estimation of Information and other fiddly functionals
Wed, 29/06/2016  4:31am  by dan mackinlay
Say I would like to know the mutual information of the process generating two streams of observations, with weak assumptions on the form of the generation process.
(Why would I want to do this by itself? I don’t know. I’m sure a use case will come along.)
Because observations with low frequency have high influence on the estimate, this can be tricky. It is easy to get a uselessly biased, or even inconsistent, estimator, especially in the nonparametric case.
A typical technique is to construct a joint histogram from your samples, treat the bins as a finite alphabet and then do the usual calculation.
That throws out a lot of information, and it feels clunky and stupid, especially if you suspect your distributions might have some other kind of smoothness that you’d like to exploit.
Moreover this method is highly sensitive and can be arbitrarily wrong if you don’t do it right (see Paninski, 2003).
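For concreteness, here is a minimal sketch of that histogram ("plug-in") estimator in plain Python. The function names, the equal-width binning and the default of 8 bins are my own assumptions for illustration, not anything prescribed by the references; the point is only to show how little is going on, and how the answer depends on the binning.

```python
import math
import random
from collections import Counter


def bin_index(value, lo, hi, n_bins):
    """Equal-width bin number for value in [lo, hi]; the top edge joins the last bin."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def plugin_mutual_information(xs, ys, n_bins=8):
    """Plug-in mutual information estimate (in nats) from a joint histogram."""
    n = len(xs)
    lo_x, hi_x = min(xs), max(xs)
    lo_y, hi_y = min(ys), max(ys)
    cells = [(bin_index(x, lo_x, hi_x, n_bins), bin_index(y, lo_y, hi_y, n_bins))
             for x, y in zip(xs, ys)]
    joint = Counter(cells)
    marginal_x = Counter(i for i, _ in cells)
    marginal_y = Counter(j for _, j in cells)
    # Sum of p_ij * log(p_ij / (p_i * p_j)) over the occupied cells only.
    return sum((count / n) * math.log(count * n / (marginal_x[i] * marginal_y[j]))
               for (i, j), count in joint.items())


# Hypothetical data: two independent uniform streams, so the true value is zero.
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [rng.random() for _ in range(200)]
mi_independent = plugin_mutual_information(xs, ys)
mi_identical = plugin_mutual_information(xs, xs)
```

On independent streams the estimate is typically strictly positive even though the true mutual information is zero, and it moves around as you change `n_bins`; that is the bias problem in miniature.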
So, better alternatives?
To consider:
 Based on authorship alone, KKPW14 is the best place to start.
 Kraskov’s (2004) NN-method looks nice, but doesn’t yet have any guarantees that I know of.
 The relationship between mutual information and 2-dimensional spatial statistics.
 The relationship between mutual information and copula entropy.
 Those occasional mentions of calculating mutual information from recurrence plots: how do they work?
To read
 BaBo12
 Barnett, L., & Bossomaier, T. (2012) Transfer Entropy as a Log-likelihood Ratio. arXiv:1205.6339.
 BDGM97
 Beirlant, J., Dudewicz, E. J., Györfi, L., & van der Meulen, E. C.(1997) Nonparametric entropy estimation: An overview. Journal of Mathematical and Statistical Sciences, 6(1), 17–39.
 ChSh03
 Chao, A., & Shen, T.-J. (2003) Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environmental and Ecological Statistics, 10(4), 429–443. DOI.
 DaVa99
 Darbellay, G. A., & Vajda, I. (1999) Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory, 45, 1315–1321. DOI.
 DaWu00
 Darbellay, G. A., & Wuertz, D. (2000) The entropy as a tool for analysing statistical dependences in financial time series. Physica A: Statistical Mechanics and Its Applications, 287(3–4), 429–439. DOI.
 DSSK04
 Daub, C. O., Steuer, R., Selbig, J., & Kloska, S. (2004) Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data. BMC Bioinformatics, 5(1), 118. DOI.
 DoJR13
 Doucet, A., Jacob, P. E., & Rubenthaler, S. (2013) Derivative-Free Estimation of the Score Vector and Observed Information Matrix with Application to State-Space Models. arXiv:1304.5768 [Stat].
 GaVG00
 Gao, S., Ver Steeg, G., & Galstyan, A. (n.d.) Estimating Mutual Information by Local Gaussian Approximation.
 HaSt09
 Hausser, J., & Strimmer, K. (2009) Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks. Journal of Machine Learning Research, 10, 1469.
 JVHW14
 Jiao, J., Venkat, K., Han, Y., & Weissman, T. (2014) Maximum Likelihood Estimation of Functionals of Discrete Distributions. arXiv:1406.6959 [Cs, Math, Stat].
 JVHW15
 Jiao, J., Venkat, K., Han, Y., & Weissman, T. (2015) Minimax Estimation of Functionals of Discrete Distributions. IEEE Transactions on Information Theory, 61(5), 2835–2885. DOI.
 KKPW14
 Kandasamy, K., Krishnamurthy, A., Poczos, B., Wasserman, L., & Robins, J. M.(2014) Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations. arXiv:1411.4342 [Stat].
 KSAC05
 Kennel, M. B., Shlens, J., Abarbanel, H. D. I., & Chichilnisky, E. J.(2005) Estimating Entropy Rates with Bayesian Confidence Intervals. Neural Computation, 17(7). DOI.
 KrSG04
 Kraskov, A., Stögbauer, H., & Grassberger, P. (2004) Estimating mutual information. Physical Review E, 69, 66138. DOI.
 LiVa06
 Liese, F., & Vajda, I. (2006) On Divergences and Informations in Statistics and Information Theory. IEEE Transactions on Information Theory, 52(10), 4394–4412. DOI.
 LiPZ08
 Lizier, J. T., Prokopenko, M., & Zomaya, A. Y.(2008) A framework for the local information dynamics of distributed computation in complex systems.
 MaSh94
 Marton, K., & Shields, P. C.(1994) Entropy and the consistent estimation of joint distributions. The Annals of Probability, 22(2), 960–977.
 MoRL95
 Moon, Y. I., Rajagopalan, B., & Lall, U. (1995) Estimation of mutual information using kernel density estimators. Physical Review E, 52, 2318–2321. DOI.
 NeBR04
 Nemenman, I., Bialek, W., & de Ruyter Van Steveninck, R. (2004) Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5), 56111.
 NeSB02
 Nemenman, I., Shafee, F., & Bialek, W. (2002) Entropy and inference, revisited. In Advances in Neural Information Processing Systems 14 (Vol. 14). Cambridge, MA, USA: The MIT Press
 Pani03
 Paninski, L. (2003) Estimation of entropy and mutual information. Neural Computation, 15(6), 1191–1253. DOI.
 PSMP07
 Panzeri, S., Senatore, R., Montemurro, M. A., & Petersen, R. S.(2007) Correcting for the sampling bias problem in spike train information measures. Journal of Neurophysiology, 98, 1064–1072. DOI.
 PaTr96
 Panzeri, S., & Treves, A. (1996) Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7(1), 87–107.
 Robi91
 Robinson, P. M. (1991) Consistent Nonparametric Entropy-Based Testing. The Review of Economic Studies, 58(3), 437. DOI.
 Roul99
 Roulston, M. S.(1999) Estimating the errors on measured entropy and mutual information. Physica D: Nonlinear Phenomena, 125(3–4), 285–294. DOI.
 Schü15
 Schürmann, T. (2015) A Note on Entropy Estimation. Neural Computation, 27(10), 2097–2106. DOI.
 StLe08
 Staniek, M., & Lehnertz, K. (2008) Symbolic transfer entropy. Physical Review Letters, 100(15), 158101. DOI.
 VePa08
 Vejmelka, M., & Paluš, M. (2008) Inferring the directionality of coupling with conditional mutual information. Phys. Rev. E, 77(2), 26214. DOI.
 Vict02
 Victor, J. D.(2002) Binless strategies for estimation of information from neural data. Physical Review E, 66, 51903. DOI.
 WoWo94a
 Wolf, D. R., & Wolpert, D. H. (1994a) Estimating Functions of Distributions from A Finite Set of Samples, Part 2: Bayes Estimators for Mutual Information, Chi-Squared, Covariance and other Statistics. arXiv:comp-gas/9403002.
 WoWo94b
 Wolpert, D. H., & Wolf, D. R. (1994b) Estimating Functions of Probability Distributions from a Finite Set of Samples, Part 1: Bayes Estimators and the Shannon Entropy. arXiv:comp-gas/9403001.
 WuYa14
 Wu, Y., & Yang, P. (2014) Minimax rates of entropy estimation on large alphabets via best polynomial approximation. arXiv:1407.0381 [Cs, Math, Stat].
See original: Statistical estimation of Information and other fiddly functionals
Content aggregators
Wed, 29/06/2016  2:17am  by dan mackinlay
Upon the efficient consumption and summarizing of news from around the world.
I have been told to do this through twitter or facebook, but, seriously… no.
Those are systems designed to waste time with stupid distractions to benefit someone else.
Contrarily, I would like to find ways to summarise and condense information to save time for myself.
Feed readers
The classic.
You know what podcasts are?
Podcasts are a type of feed. An audio feed.
If I care about news articles and tumblr posts and whatever, not just audio, then I use feeds, feeds of text instead of audio. Any website can have a feed. Many do.
So…
Aside:
Remember when we thought the web would be a useful tool for researching and learning, and that automated research assistants would trawl the web for us?
RSS feeds were often discussed as a piece of that machine.
Little updates dripped from the web, to be sliced, diced, prioritised and analysed by our software to keep us aware of… whatever.
Most feed readers don’t do any of that fancy analysis though,
they just give you a list of new items ordered by date.
Still, whatever. Better than nothing.
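To make "a list of new items ordered by date" concrete, here is a minimal standard-library sketch of that baseline behaviour. The feed content, titles and URLs are made up for illustration, and a real reader would also have to fetch feeds over the network and cope with Atom and malformed XML, which this does not attempt.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# A made-up RSS 2.0 document standing in for a fetched feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Older post</title><link>https://example.com/older</link>
<pubDate>Mon, 27 Jun 2016 03:54:00 +0000</pubDate></item>
<item><title>Newer post</title><link>https://example.com/newer</link>
<pubDate>Wed, 29 Jun 2016 04:31:00 +0000</pubDate></item>
</channel></rss>"""


def newest_first(feed_xml):
    """Return (published, title, link) tuples for every item, newest first."""
    root = ET.fromstring(feed_xml)
    entries = []
    for item in root.iter("item"):
        published = parsedate_to_datetime(item.findtext("pubDate"))
        entries.append((published, item.findtext("title"), item.findtext("link")))
    return sorted(entries, key=lambda entry: entry[0], reverse=True)


titles = [title for _, title, _ in newest_first(SAMPLE_FEED)]
```

Anything smarter (prioritising, deduplicating, summarising) would slot in where the `sorted` call is.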

commercial offerings
 feedly is the current boss. Targets commercial uses, like web “community managers” or marketing types. Probably works for humans too. This is how you would subscribe to my site in Feedly
 newsblur is a quirky little option that I happen to use currently. The interface defies the last 10 years of user interface conventions, which is confusing, but it works and is cheap. This is how you would subscribe to my site in Newsblur
 Feeder is a browser extension that reads feeds.
 The old reader reads feeds and this includes activity updates for people you follow on social media. Not sure if that is the worst or best of all worlds.

Indie-style
I will run a server if the application is good enough, but it has to be worth the time investment. Let’s say between backups, security issues, confusing DNS failures etc, that’s 8 hours per year of miscellaneous computer wrangling, best case, and more hours if you have complicated things like some multi-user database like MySQL. Very few things are good enough to be worth the opportunity cost of that time.
Why people insist on running enterprise databases to hold a reading list is an ongoing mystery to me. The capacity to scale to many users is nice, I suppose, but by that logic everyone should drive everywhere in a school bus.
 miniflux is open source, but also offers a hosted version for $15/year.
 stringer looks like a nice little ruby app but needs postgresql. Bloat!
 tinytinyrss is the original “minimalist” RSS reader; it still needs more databases than is sensible.
 fever is a weird commercial ($30) application that you host on your own server. It claims to learn your information preferences, negating my previous complaint. But I cannot be arsed installing some database-wanting app with suspiciously machine-learning-inappropriate language requirements (PHP3) that also costs money to try, so I will never know.
See original: Content aggregators
Practical workshop in magnetite nanoparticles preparation
Tue, 28/06/2016  10:03am  by Wesam Ahmed Tawfik
Naqaa Nanotechnology Network is organizing a one-day practical workshop on the preparation of magnetite nanoparticles, from 10:30 am till 3:30 pm on Saturday 16 July 2016. It will contain lectures about different applications of magnetite nanoparticles and the practical preparation of magnetite nanoparticles.
Important: Don't forget to bring your lab coat for the practical part.
Fees are 200 EGP.
Spaces will be limited to 12 participants, so we ask attendees to register ahead of time.
Fees include: lectures on CD + practical part + lunch break + certificate.
Certificates will be accredited by NNN.
For more information please call 01098915757, 01115831621.
Those who would like to register:
Just send us an email at naqaafoundation@gmail.com containing:
1 Your full triple name as you want it on the certificate
2 Your position
3 Your mobile
4 Your email
Subject of email:Practical Workshop i
email message: I want to attend
Best regards
Composition, music theory, mostly Western.
Mon, 27/06/2016  3:54am  by dan mackinlay
Sometimes you don’t want to generate a chord, or measure a chord, or learn a chord, you just want to write a chord.
Helpful software for the musically vexed
 Fabrizio Poce’s J74 progressive and J74 bassline are some chord progression generators from his library of very clever chord generators linked in to Ableton Live’s scripting engine, so if you are using Ableton they might be very handy. They are cheap (EUR12 + EUR15). I use them myself, but they DO make Ableton crash a wee bit, so they are not really suited for live performance, which is a pity because that would be a wonderful unique selling point. The realtime-oriented J74 HarmoTools from the same guy are less sophisticated but worth trying, especially since they are free, and he has a lot of other clever hacks there too. Basically, just go to this guy’s site and try his stuff out. You don’t have to stop there.
 Odesi (USD49) has been doing lots of advertising and has a very nice pop interface. It’s like Synfire-lite with a library of pop tricks and rhythms. The desktop version tries to install gigabytes of synths of meagre merit on your machine, which is a giant waste of space and time if you are using a computer with synths on, which you are because this is 2016.
 Helio is free and cross-platform and totally worth a shot. There is a chord model in there and version control (!) but you might not notice the chord thing if you aren’t careful.
 Mixtikl / Noatikl are granddaddy apps for this. Although the creators doubtless put much effort into the sleek user interfaces, their complete inability to explain their app or provide compelling demonstrations or use cases leaves me cold. I get the feeling they had high-art aspirations but have ended up basically doing ambient noodles in order to sell product; maybe I’m not being fair. (USD25/USD40)
 Rapid Compose (USD99/USD249) might make decent software, but can’t really explain why their app is nice or provide a demo version.
 synfire explains how it uses music theory to do large-scale scoring etc. Get the string section to behave itself or you’ll replace them with MIDI-bots. (EUR996, so I won’t be buying it, but great demo video.)
 harmony builder does classical music theory for you. USD39-USD219 depending on heinously complex pricing schemes. Will pass your conservatorium finals.
 You can’t resist rolling your own? sharp11 is a node.js music theory library for javascript with a demo application to create jazz improv.
 Supercollider of course does this and everything else, but designing user interfaces for it will take years off your life. OTOH, if you are happy with text, this might be a goer.
Arpeggiators
 Bluearp vst does 2-note chord extrapolation (free)
 Hypercyclic is an LFOable arpeggiator (free)
 kirnu (free) and kirnu cream
 Polyrhythmus
Constraint Composition
All of that too mainstream? Try a weird alternative formalism!
How about constraint composition? That’s
declarative musical composition by defining constraints which the notes must satisfy.
Sounds fun in the abstract but the details don’t grab me somehow.
The reference here is strasheela, built on an obscure, unpopular, and apparently discontinued Prolog-like language called “Oz” or “Mozart”, because using existing languages is not as grand a gesture as claiming none of them are quite Turing complete enough for your special thingy.
That is a bit of a ghost town;
if you wanted to actually do this, you’d probably use overtone + minikanren (prolog-for-lisp), as with
the composing schemer,
or to be even more mainstream, just use a normal constraint solver in a normal language.
I am fond of python and ncvx.
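To show what "defining constraints which the notes must satisfy" can look like, here is a toy sketch in plain Python. It uses brute-force enumeration rather than ncvx or any real constraint solver, and the scale, the constraints and all the names are made up for illustration.

```python
from itertools import product

C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]  # MIDI note numbers, C4 to C5
TONIC, DOMINANT = 60, 67


def satisfies(melody):
    """True if the melody meets every (made-up) constraint."""
    steps = [b - a for a, b in zip(melody, melody[1:])]
    return (
        melody[0] == TONIC and melody[-1] == TONIC  # start and end on the tonic
        and all(0 < abs(s) <= 4 for s in steps)     # no repeated notes, no leap beyond a major third
        and DOMINANT in melody                      # visit the dominant somewhere
    )


def compose(length=5):
    """Enumerate every melody of the given length that satisfies the constraints."""
    return [m for m in product(C_MAJOR, repeat=length) if satisfies(m)]


melodies = compose()
```

Enumeration is only viable because the search space here is 8^5 candidates; the appeal of a proper solver is that you keep the declarative `satisfies` part and let it prune the search for you.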
Anyway, prolog fans read on.
 Anders, T., & Miranda, E. R. (2008). Higher-Order Constraint Applicators for Music Constraint Programming. In Proceedings of the 2008 International Computer Music Conference. Belfast, UK.
 Anders, T., & Miranda, E. R. (2010). Constraint Application with Higher-Order Programming for Modeling Music Theories. Computer Music Journal, 34(2), 25–38. DOI. Online.
 Anders, T., & Miranda, E. R.(2011). Constraint programming systems for modeling music theories and composition. ACM Computing Surveys, 43(4), 1–38. DOI. Online.
 Anders, T., & Miranda, E. R.(2009). A computational model that generalises Schoenberg’s guidelines for favourable chord progressions. In Proceedings of the Sound and Music Computing Conference. Citeseer. Online.
See original: Composition, music theory, mostly Western.
Gaussian distribution and Erf and Normality
Mon, 27/06/2016  3:52am  by dan mackinlay
Stunts with Gaussian distributions.
Let’s start here with the basic thing.
The (univariate) standard Gaussian pdf
\begin{equation*}
\psi:x\mapsto \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)
\end{equation*}
We define
\begin{equation*}
\Psi:x\mapsto \int_{-\infty}^x\psi(t)\, dt
\end{equation*}
This erf function is popular, isn’t it?
Unavoidable if you do computer algebra.
But I can never remember what it is.
There’s this scaling factor tacked on.
Well…
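\begin{equation*}
\operatorname{erf}(x)\; =\; \frac{1}{\sqrt{\pi}} \int_{-x}^x e^{-t^2} \, dt
\end{equation*}
so that
\begin{equation*}
\sqrt{2\pi}\,\Psi(x) = \int_{-\infty}^x e^{-t^2/2}\, dt = \sqrt{\frac{\pi }{2}} \left(\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)+1\right)
\end{equation*}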
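Since I can never remember the scaling, here is a small numerical sanity check of the relationship between erf and the standard Gaussian cdf, namely that the cdf equals (1 + erf(x/sqrt(2)))/2. It is a sketch using only the standard library; the function names and the trapezoid-rule settings (a lower limit of -10 standing in for minus infinity, 20000 steps) are arbitrary choices of mine.

```python
import math


def std_normal_pdf(t):
    """Standard Gaussian density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def cdf_via_erf(x):
    """Standard Gaussian cdf written in terms of erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cdf_via_quadrature(x, lower=-10.0, n_steps=20000):
    """Trapezoid rule for the integral of the pdf from `lower` to x."""
    h = (x - lower) / n_steps
    total = 0.5 * (std_normal_pdf(lower) + std_normal_pdf(x))
    total += sum(std_normal_pdf(lower + k * h) for k in range(1, n_steps))
    return total * h


gap = abs(cdf_via_erf(1.0) - cdf_via_quadrature(1.0))
```

The two agree to well within the quadrature error, which is some comfort that the factors of sqrt(2) are in the right places.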
Differential representation
Univariate first-order ODE representation (linear in f, with a non-constant coefficient).
\begin{align*}
\sigma ^2 f'(x)+f(x) (x-\mu )&=0\\
f(0) &=\frac{e^{-\mu ^2/(2\sigma ^2)}}{\sqrt{2 \sigma^2\pi } }\\
L(x) &=(\sigma^2 D+x-\mu)
\end{align*}
Linear PDE representation as a diffusion equation (see, e.g. BoGK10)
\begin{align*}
\frac{\partial}{\partial t}f(x;t) &=\frac{1}{2}\frac{\partial^2}{\partial x^2}f(x;t)\\
f(x;0)&=\delta(x-\mu)
\end{align*}
Look, it’s the diffusion equation of the Wiener process.
Roughness
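Writing \phi_\sigma for the zero-mean Gaussian pdf with standard deviation \sigma, the squared L_2 norms of its derivatives are
\begin{align*}
\left\| \frac{d}{dx}\phi_\sigma \right\|_2^2 &= \frac{1}{4\sqrt{\pi}\sigma^3}\\
\left\| \left(\frac{d}{dx}\right)^n \phi_\sigma \right\|_2^2 &= \frac{\prod_{i=1}^{n}(2i-1)}{2^{n+1}\sqrt{\pi}\sigma^{2n+1}}
\end{align*}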
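As a check on the roughness formulas, read as squared L_2 norms of derivatives of the zero-mean Gaussian pdf with standard deviation sigma, here is a numerical sketch for the first and second derivatives. The closed form coded below is (2n-1)!! / (2^(n+1) sqrt(pi) sigma^(2n+1)); the integration range of ten standard deviations either side and the step count are arbitrary choices of mine.

```python
import math


def gaussian_pdf(x, sigma):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def first_derivative(x, sigma):
    return -(x / sigma ** 2) * gaussian_pdf(x, sigma)


def second_derivative(x, sigma):
    return ((x / sigma) ** 2 - 1.0) / sigma ** 2 * gaussian_pdf(x, sigma)


def squared_l2_norm(func, sigma, n_steps=20000):
    """Trapezoid rule for the integral of func(x)^2 over [-10 sigma, 10 sigma]."""
    lower, upper = -10.0 * sigma, 10.0 * sigma
    h = (upper - lower) / n_steps
    values = [func(lower + k * h, sigma) ** 2 for k in range(n_steps + 1)]
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


def closed_form(n, sigma):
    """Closed-form roughness of the n-th derivative."""
    double_factorial = math.prod(2 * i - 1 for i in range(1, n + 1))
    return double_factorial / (2 ** (n + 1) * math.sqrt(math.pi) * sigma ** (2 * n + 1))


sigma = 1.5
ratio_first = squared_l2_norm(first_derivative, sigma) / closed_form(1, sigma)
ratio_second = squared_l2_norm(second_derivative, sigma) / closed_form(2, sigma)
```

Both ratios should come out as 1 up to quadrature error.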
Refs
 Bote16
 Botev, Z. I.(2016) The Normal Law Under Linear Restrictions: Simulation and Estimation via Minimax Tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), n/an/a. DOI.
 BoGK10
 Botev, Z. I., Grotowski, J. F., & Kroese, D. P.(2010) Kernel density estimation via diffusion. The Annals of Statistics, 38(5), 2916–2957. DOI.
See original: Gaussian distribution and Erf and Normality
Sparse regression and things that look a bit like it.
Thu, 23/06/2016  7:50am  by dan mackinlay
Related to compressed sensing but here we consider sampling complexity and the effect of measurement noise.
See also matrix factorisations,
optimisation,
model selection,
multiple testing,
concentration inequalities,
sparse flavoured icecream.
To discuss:
LARS, LASSO, de-biased LASSO, Elastic net, etc.
Implementations
I’m not going to mention LASSO in (generalised) linear regression,
since everything does that these days. (Oh alright,
Jerome Friedman’s glmnet for R is the fastest,
and has a MATLAB version.)
But SPAMS (C++, MATLAB, R, python) by Mairal himself looks interesting.
It’s an optimisation library for many varieties of sparse problems.
See original: Sparse regression and things that look a bit like it.
Eating Japanese Knotweed (and other daft ideas)
Wed, 22/06/2016  6:38pm  by sciencewriterIRImage: Wikipedia 
There have been a number of calls (1,2,3,4) in recent weeks and months to control the invasive plant Japanese Knotweed, at least partially, by eating it. In recent days, Kerry County Council in Ireland heard from one member who, albeit tongue-in-cheek, urged citizens to make wine, jelly and other sweet treats from the plant.
This strikes me as a terrible idea.
The plant itself is certainly edible; the Japanese have been eating it for years. Its Japanese name, itadori, means 'well being' and it seems to have some medicinal properties. It also tastes a bit like rhubarb, apparently. I wouldn't know, I haven't tried it.
I haven't tried it for the same reason I don't advise you try it. Encouraging people to harvest and transport a regulated, invasive species is the perfect recipe (if you'll pardon the pun) for its continued and accelerated spread.
Japanese Knotweed (Fallopia japonica) is, as you will have guessed, native to Japan and the neighbouring region. It was introduced to the UK in the mid-19th century and quickly spread to Ireland and other parts of the world. Introduced as an ornamental plant, it quickly became a real problem.
The plant is capable of growing at a tremendous rate (1 metre in a month) and forms big stands 2-3 metres in height. The early shoots are spear-like, similar to asparagus in appearance, and the plants produce delicate white flowers in late summer. The real problem is underground, where the plant forms tough rhizomes, adapted root-like organs, which remain in the soil even during the winter when the rest of the plant dies back.
Japanese Knotweed thrives on disturbance and it is mainly spread by fragments of rhizome, crown or stem being accidentally or deliberately moved. This leads to some real (and expensive) problems including a massive reduction in biodiversity under the alien canopy; structural damage to buildings and infrastructure; and the significant cost of its removal.
Data from 2010 suggest that the plant costs the UK £165 million a year to control. If the plant were to be eradicated in the UK by current methods it would cost £1.56 billion. For one site alone, the 2012 London Olympic site, it cost £88 million to deal with this one invasive plant. Nobody wants Japanese Knotweed on their land.
Image: Wikipedia 
Imagine you go to the supermarket and buy a bunch of rhubarb. The first thing you do is chop the top and bottom off the stalks and chuck them on your compost heap. Do this with Japanese Knotweed and you end up costing yourself (and potentially your neighbours) thousands in a cleanup bill.
Harvesting Japanese Knotweed from the wild, no matter how careful you are, is also fraught with problems. The plant can easily regrow from small fragments the size of your fingernail. If we're lucky, you'll drop these fragments at the original, infested site. If not, you'll drop them on your walk back to the car or in your front garden when you unload the car.
Simply put, encouraging people to mess around with an invasive species like Japanese Knotweed is, in my view, irresponsible. It may also be illegal.
In Ireland, it is an offence to "plant, disperse or cause to disperse or otherwise cause to grow" the plant. It is also an offence if "he/she has in his/her possession for sale or for breeding/reproduction/transport....anything from which the plant can be reproduced or propagated".
In the meantime, there are chemical and physical control options and scientists in the UK are developing a biological control approach using a sap-sucking insect called Aphalara itadori. This is an old enemy of the plant, found in Japan and currently being tested in the UK to see if it will do the same job in this part of the world (and not eat anything else, by accident). The trials haven't been a total success, with numbers surviving over winter too low to have much of an effect, but the tests are ongoing. Hopefully, before too long we will have a sustainable control option for this invasive plant. In the meantime, stop eating it.
See original: Eating Japanese Knotweed (and other daft ideas)
Smoothing, regularisation, penalization and friends
Tue, 21/06/2016  11:04am  by dan mackinlay
In nonparametric statistics we might estimate simultaneously what look like
many, many parameters, which we constrain in some clever fashion,
which usually boils down to something we can interpret as a “smoothing”
parameter, controlling how many parameters we still have to model
from a subset of the original.
The “regularisation” nomenclature claims descent from Tikhonov (e.g. TiGl65), who wanted to solve ill-conditioned integral and differential equations, so it’s slightly more general.
“Smoothing” seems to be common in the
spline and
kernel estimate communities of
Wahba (Wahb90) and Silverman (Silv84) et al,
who usually actually want to smooth curves.
“Penalization” has a genealogy unknown to me, but is probably the least abstruse for common usage.
These are, AFAICT, more or less the same thing.
“Smoothing” is more common in my communities, which is fine,
but we have to remember that “smoothing” an estimator might not always imply smooth dynamics in the estimand;
it could be something else being smoothed, such as variance in the estimate of parameters of a rough function.
In every case, you wish to solve an ill-conditioned inverse problem, so you tame it by adding a penalty to solutions you feel one should be reluctant to accept.
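The smallest example of this that I can think of is a one-parameter least-squares fit with a quadratic (Tikhonov/ridge style) penalty. The data, the penalty weight and the function name below are all made up for illustration; the sketch only shows how the penalty tames an estimate that the data barely constrain.

```python
def penalised_slope(xs, ys, penalty=0.0):
    """Minimise sum((y - beta * x)**2) + penalty * beta**2 over the scalar beta."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative to zero gives beta = sxy / (sxx + penalty).
    return sxy / (sxx + penalty)


# A nearly degenerate design: the inputs are tiny, so the data say little about beta.
xs = [1e-3, 2e-3, -1e-3, -2e-3]
ys = [0.5, -0.3, 0.2, 0.4]

unpenalised = penalised_slope(xs, ys)            # wild, dominated by noise
penalised = penalised_slope(xs, ys, penalty=1.0)  # shrunk towards zero
```

Spline smoothing and sparse regression do the same thing with fancier penalties and many more parameters.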
TODO: make comprehensible
TODO: examples
TODO: discuss connection with model selection
TODO: discuss connection with compressed sensing.
The real classic approach here is spline smoothing of functional data.
More recent approaches are things like sparse regression.
Refs
 Bach00
 Bach, F. (n.d.) Model-Consistent Sparse Estimation through the Bootstrap.
 ChHS15
 Chernozhukov, V., Hansen, C., & Spindler, M. (2015) Valid Post-Selection and Post-Regularization Inference: An Elementary, General Approach. Annual Review of Economics, 7(1), 649–688. DOI.
 EHJT04
 Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004) Least angle regression. The Annals of Statistics, 32(2), 407–499. DOI.
 FlHS13
 Flynn, C. J., Hurvich, C. M., & Simonoff, J. S.(2013) Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models. arXiv:1302.2068 [Stat].
 FrHT10
 Friedman, J., Hastie, T., & Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22. DOI.
 JaFH13
 Janson, L., Fithian, W., & Hastie, T. (2013) Effective Degrees of Freedom: A Flawed Metaphor. arXiv:1312.7851 [Stat].
 KaRo14
 Kaufman, S., & Rosset, S. (2014) When does more regularization imply fewer degrees of freedom? Sufficient conditions and counterexamples. Biometrika, 101(4), 771–784. DOI.
 KoMi06
 Koenker, R., & Mizera, I. (2006) Density estimation by total variation regularization. Advances in Statistical Modeling and Inference, 613–634.
 LiRW10
 Liu, H., Roeder, K., & Wasserman, L. (2010) Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in Neural Information Processing Systems 23 (pp. 1432–1440). Curran Associates, Inc.
 MeBü10
 Meinshausen, N., & Bühlmann, P. (2010) Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417–473. DOI.
 Meye08
 Meyer, M. C. (2008) Inference using shape-restricted regression splines. The Annals of Applied Statistics, 2(3), 1013–1033. DOI.
 Silv84
 Silverman, B. W.(1984) Spline Smoothing: The Equivalent Variable Kernel Method. The Annals of Statistics, 12(3), 898–916. DOI.
 SmSM98
 Smola, A. J., Schölkopf, B., & Müller, K.-R. (1998) The connection between regularization operators and support vector kernels. Neural Networks, 11(4), 637–649. DOI.
 TKPS14
 Tansey, W., Koyejo, O., Poldrack, R. A., & Scott, J. G.(2014) False discovery rate smoothing. arXiv:1411.6144 [Stat].
 TiGl65
 Tikhonov, A. N., & Glasko, V. B.(1965) Use of the regularization method in nonlinear problems. USSR Computational Mathematics and Mathematical Physics, 5(3), 93–107. DOI.
 Geer14
 van de Geer, S. (2014) Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 41(1), 72–86. DOI.
 Wahb90
 Wahba, G. (1990) Spline Models for Observational Data. SIAM.
 WeMZ16
 Weng, H., Maleki, A., & Zheng, L. (2016) Overcoming The Limitations of Phase Transition by Higher Order Analysis of Regularization Techniques. arXiv:1603.07377 [Cs, Math, Stat].
 Wood00
 Wood, S. N.(2000) Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(2), 413–428. DOI.
 Wood08
 Wood, S. N.(2008) Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3), 495–518. DOI.
 ZoHa05
 Zou, H., & Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. DOI.
 ZoHT07
 Zou, H., Hastie, T., & Tibshirani, R. (2007) On the “degrees of freedom” of the lasso. The Annals of Statistics, 35(5), 2173–2192. DOI.
See original: Smoothing, regularisation, penalization and friends
DJing
Tue, 21/06/2016  5:09am  by dan mackinlay
Yet our sounds are also a vocabulary for those who detest the walled-off concentrations of wealth, and steal property back: the collectives that build their own sound systems, stage free parties, and invite DJs to perform. The international DJ becomes emblematic of global capitalism’s complicated cultural dimension. On flights and at the free Continental breakfasts in hotels, often the same soul-destroying hotel chains in each city, we get stuck chatting with our fellow Americans and Western Europeans, the executives eager to find compatriots. We make small talk with these consultants and dealmakers in the descending elevators in the evening, then go out to the city’s dead-end and unowned spaces or its luxury venues to soundtrack the night of the region’s youth, hungry for something new. DJ music is now the common art form of squatters and the nouveau riche; it is the soundtrack both for capital and for its opposition.
http://www.ibrahimshaath.co.uk/keyfinder/
tangerine echonest
see also machine listening,
audio software
DJing software
So many choices, now. I use Ableton, but Traktor and Serato are more designed for this.
Open source/ lower cost alternatives?
 flow8deck is made by the people who made mixedinkey, software for the musically vexed. It handles key changes well.
 Traktor
 Serato
 Djay
See original: DJing
Prepping
Sun, 19/06/2016  9:43am  by dan mackinlay
Surviving the collapse of civilisation
Kickstarter for a New Civilization
https://emergentbydesign.com/2015/01/14/kickstarterforanewcivilization/
https://medium.com/emergentculture/reinventeverything556860b63308#.7cttqqium
See original: Prepping
Moving the poors to marginal electorate
Fri, 17/06/2016  7:32am  by dan mackinlay
OK, let’s start treating politics like the favour machine it is and behave accordingly;
NSW under Mike Baird is a system where you buy favours with leverage.
I’d like it to be otherwise, but let’s look.
Optimal marginalness.
Invade marginal electorates.
Organised opposition means we are more likely to claim council seats as a side benefit.
See original: Moving the poors to marginal electorate
Recurrent neural networks
Fri, 17/06/2016  6:21am  by dan mackinlay
Feedback neural networks structured to have memory and a notion of “current” and “past” states, which can encode time (or whatever).
As someone who does a lot of signal processing for music, the notion that these generalise linear systems theory is suggestive of lots of interesting DSP applications.
The connection between these (IIR) and “convolutional” (FIR) neural networks is suggestive for the same reason.
 Awesome RNN is a curated links list of implementations.
 Andrej Karpathy: The unreasonable effectiveness of RNN
 Christopher Olah: Understanding LSTM RNNs
 Jeff Donahue Long term recurrent NN
 Ross Gibson Adventures in narrated reality gives an overview of text generation using RNNs
Flavours
Vanilla
The main problem here is that they are unstable in the training phase unless you are clever.
See BeSF94. One solution is LSTM; see next.
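A cartoon of the instability, under the strong simplifying assumption of a scalar, linear recurrence h_t = w * h_(t-1): backpropagating through T steps multiplies the gradient by w once per step, so it either dies away or blows up depending on whether |w| is below or above 1. The function name is mine and real networks have matrices and nonlinearities, but the mechanism is the same repeated multiplication.

```python
def backprop_gain(recurrent_weight, n_steps):
    """Magnitude of d h_T / d h_0 for the scalar linear recurrence h_t = w * h_(t-1)."""
    gain = 1.0
    for _ in range(n_steps):
        gain *= abs(recurrent_weight)  # one chain-rule factor per time step
    return gain


shrinking = backprop_gain(0.9, 100)  # gradient signal decays towards zero
growing = backprop_gain(1.1, 100)    # gradient signal blows up
```

Only a recurrent weight of magnitude exactly 1 keeps the signal stable in this toy, which is roughly the behaviour the LSTM memory cell is designed to make easy.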
Long Short Term Memory (LSTM)
As always, Christopher Olah wins the visual explanation prize:
Understanding LSTM Networks
LSTM Networks for Sentiment Analysis:
In a traditional recurrent neural network, during the gradient backpropagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that, the magnitude of weights in the transition matrix can have a strong impact on the learning process.[…]
These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell […]. A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […] The gates serve to modulate the interactions between the memory cell itself and its environment.
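A rough sketch of one step of such a memory cell, reduced to scalars for readability. The parameter layout (`params` mapping each gate name to an input weight, a recurrent weight and a bias) is my own assumption, and this is the common forget-gate variant rather than the exact code from the tutorial quoted above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params[name] = (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))         # how much new content to write
    f = sigmoid(pre("forget"))        # how much old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell to expose
    g = math.tanh(pre("candidate"))   # the new content itself
    c = f * c_prev + i * g            # gated self-recurrent connection
    h = o * math.tanh(c)
    return h, c


# Hypothetical parameters: forget and output gates wide open, input gate shut,
# so the cell simply remembers its previous state.
params = {
    "input": (0.0, 0.0, -50.0),
    "forget": (0.0, 0.0, 50.0),
    "output": (0.0, 0.0, 50.0),
    "candidate": (1.0, 1.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=params)
```

The point of the additive update `c = f * c_prev + i * g` is that, with the forget gate near 1, the cell state (and its gradient) passes through a step almost unchanged.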
GridRNN
A mini-genre.
KaDG15 connect recurrent cells across multiple axes, leading to a higher-rank MIMO system.
This is natural in many kinds of spatial random fields, and I am amazed it was uncommon enough to need formalizing in a paper; but it was and it did and good on Kalchbrenner et al.
Gated Recurrent Unit (GRU)
TBD
Liquid/ Echo State Machines
This sounds deliciously lazy;
Very roughly speaking, your first layer is a reservoir of random saturating IIR filters.
You fit a classifier on the outputs of this.
Easy to implement, that.
I wonder when it actually works, constraints on topology etc.
I wonder if you can use some kind of sparsifying transform on the recurrence operator?
These claim to be based on spiking models, but AFAICT this is not at all necessary.
Various claims are made about how they avoid the training difficulty of comparably basic RNNs by being essentially untrained; you use them as a feature factory for another supervised output algorithm.
Suggestive parallel with random projections.
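A minimal sketch of the recipe, in plain Python so the moving parts are visible. Everything here is an assumption for illustration: the names (`make_reservoir`, `run_reservoir`, `fit_readout`), the row-sum scaling used to keep the reservoir contractive, and a ridge-regression readout standing in for the classifier.

```python
import math
import random


def make_reservoir(n_units, scale=0.9, seed=0):
    """Random recurrent and input weights; nothing here is ever trained."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1) for _ in range(n_units)] for _ in range(n_units)]
    for row in w:
        # Rescale each row to absolute sum `scale` < 1, which makes the tanh
        # update below a contraction, so the influence of old inputs fades.
        total = sum(abs(v) for v in row)
        row[:] = [scale * v / total for v in row]
    w_in = [rng.uniform(-1, 1) for _ in range(n_units)]
    return w, w_in


def run_reservoir(inputs, w, w_in):
    """Drive the saturating recurrent filters with a scalar input sequence."""
    n = len(w_in)
    state, states = [0.0] * n, []
    for u in inputs:
        state = [math.tanh(sum(w[i][j] * state[j] for j in range(n)) + w_in[i] * u)
                 for i in range(n)]
        states.append(state)
    return states


def fit_readout(states, targets, ridge=1e-2):
    """Ridge regression via the normal equations and Gauss-Jordan elimination."""
    n = len(states[0])
    a = [[sum(s[i] * s[j] for s in states) + (ridge if i == j else 0.0)
          for j in range(n)]
         + [sum(s[i] * y for s, y in zip(states, targets))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                for c in range(col, n + 1):
                    a[r][c] -= factor * a[col][c]
    return [a[i][n] / a[i][i] for i in range(n)]


rng = random.Random(1)
u = [rng.uniform(-1, 1) for _ in range(300)]
w, w_in = make_reservoir(n_units=10)
states = run_reservoir(u, w, w_in)

# Toy task: recall the previous input from the current reservoir state.
features, targets = states[1:], u[:-1]
readout = fit_readout(features, targets)
predictions = [sum(v * s for v, s in zip(readout, st)) for st in features]
mse = sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)
baseline = sum(y ** 2 for y in targets) / len(targets)  # error of always predicting 0
```

Only `fit_readout` does any learning, and it is a linear solve; the recurrent part is a fixed random feature factory.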
From a dynamical systems perspective, there are two main classes of RNNs.
Models from the first class are characterized by an energy-minimizing
stochastic dynamics and symmetric connections.
The best known instantiations are Hopfield networks, Boltzmann machines, and
the recently emerging Deep Belief Networks.
These networks are mostly trained in some unsupervised learning scheme.
Typical targeted network functionalities in this field are associative
memories, data compression, the unsupervised modeling of data distributions,
and static pattern classification, where the model is run for multiple time
steps per single input instance to reach some type of convergence or
equilibrium
(but see e.g., TaHR06 for extension to temporal data).
The mathematical background is rooted in statistical physics.
In contrast, the second big class of RNN models typically features a
deterministic update dynamics and directed connections.
Systems from this class implement nonlinear filters, which
transform an input time series into an output time series.
The mathematical background here is nonlinear dynamical systems.
The standard training mode is supervised.
This survey is concerned only with RNNs of this second type, and
when we speak of RNNs later on, we will exclusively refer to such systems.
Other
It’s still the wild west. Invent a category, name it and stake a claim.
Practicalities
Variable sequence length:
https://gist.github.com/evanthebouncy/8e16148687e807a46e3f
Danijar Hafner:
* Introduction to Recurrent Networks in TensorFlow
* https://danijar.com/variablesequencelengthsintensorflow/
seq2seq models with GRUs : Fun with Recurrent Neural Nets: One More Dive into CNTK and TensorFlow
Refs
 AuBM08
 Auer, P., Burgsteiner, H., & Maass, W. (2008) A learning rule for very simple universal approximators consisting of a single layer of perceptrons. Neural Networks, 21(5), 786–795. DOI.
 BeSF94
 Bengio, Y., Simard, P., & Frasconi, P. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166. DOI.
 BoBV12
 Boulanger-Lewandowski, N., Bengio, Y., & Vincent, P. (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
 BoLe06
 Bown, O., & Lexer, S. (2006) Continuous-Time Recurrent Neural Networks for Generative and Interactive Musical Performance. In F. Rothlauf, J. Branke, S. Cagnoni, E. Costa, C. Cotta, R. Drechsler, … H. Takagi (Eds.), Applications of Evolutionary Computing (pp. 652–663). Springer Berlin Heidelberg.
 BuMe05
 Buhusi, C. V., & Meck, W. H. (2005) What makes us tick? Functional and neural mechanisms of interval timing. Nature Reviews Neuroscience, 6(10), 755–765. DOI.
 CMBB14
 Cho, K., van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014) On the properties of neural machine translation: Encoder-decoder approaches. arXiv Preprint arXiv:1409.1259.
 CGCB14
 Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. In NIPS.
 CGCB15
 Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2015) Gated Feedback Recurrent Neural Networks. arXiv:1502.02367 [Cs, Stat].
 DoPo15
 Doelling, K. B., & Poeppel, D. (2015) Cortical entrainment to music and its modulation by expertise. Proceedings of the National Academy of Sciences, 112(45), E6233–E6242. DOI.
 DuPW14
 Duan, Q., Park, J. H., & Wu, Z.-G. (2014) Exponential state estimator design for discrete-time neural networks with discrete and distributed time-varying delays. Complexity, 20(1), 38–48. DOI.
 Gal15
 Gal, Y. (2015) A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv:1512.05287 [Stat].
 GeSC00
 Gers, F. A., Schmidhuber, J., & Cummins, F. (2000) Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10), 2451–2471. DOI.
 GDGR15
 Gregor, K., Danihelka, I., Graves, A., Rezende, D. J., & Wierstra, D. (2015) DRAW: A Recurrent Neural Network For Image Generation. arXiv:1502.04623 [Cs].
 GCWK09
 Grzyb, B. J., Chinellato, E., Wojcik, G. M., & Kaminski, W. A. (2009) Which model to use for the Liquid State Machine?. In 2009 International Joint Conference on Neural Networks (pp. 1018–1024). DOI.
 HaMa12
 Hazan, H., & Manevitz, L. M. (2012) Topological constraints and robustness in liquid state machines. Expert Systems with Applications, 39(2), 1597–1606. DOI.
 HDYD12
 Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI.
 HoSc97
 Hochreiter, S., & Schmidhuber, J. (1997) Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. DOI.
 JoZS15
 Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015) An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (pp. 2342–2350).
 KaDG15
 Kalchbrenner, N., Danihelka, I., & Graves, A. (2015) Grid Long Short-Term Memory. arXiv:1507.01526 [Cs].
 KaJF15
 Karpathy, A., Johnson, J., & Fei-Fei, L. (2015) Visualizing and Understanding Recurrent Networks. arXiv:1506.02078 [Cs].
 Lecu98
 LeCun, Y. (1998) Gradient-based learning applied to document recognition. Proc. IEEE, 86(11), 2278–2324. DOI.
 LeNM05
 Legenstein, R., Naeger, C., & Maass, W. (2005) What Can a Neuron Learn with Spike-Timing-Dependent Plasticity?. Neural Computation, 17(11), 2337–2382. DOI.
 LiBE15
 Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015) A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv:1506.00019 [Cs].
 LuJa09
 Lukoševičius, M., & Jaeger, H. (2009) Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149. DOI.
 MaNM04
 Maass, W., Natschläger, T., & Markram, H. (2004) Computational Models for Generic Cortical Microcircuits. In Computational Neuroscience: A Comprehensive Approach (pp. 575–605). Chapman & Hall/CRC
 Mico15
 Miconi, T. (2015) Training recurrent neural networks with sparse, delayed rewards for flexible decision tasks. arXiv:1507.08973 [QBio].
 Mnih15
 Mnih, V. (2015) Human-level control through deep reinforcement learning. Nature, 518, 529–533. DOI.
 MoDH12
 Mohamed, A.-r., Dahl, G. E., & Hinton, G. (2012) Acoustic Modeling Using Deep Belief Networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14–22. DOI.
 OIMT15
 Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2015) Visually Indicated Sounds. arXiv:1512.08512 [Cs].
 RoRS15
 Rohrbach, A., Rohrbach, M., & Schiele, B. (2015) The Long-Short Story of Movie Description. arXiv:1506.01698 [Cs].
 Schw07
 Schwenk, H. (2007) Continuous space language models. Computer Speech Lang., 21, 492–518. DOI.
 TaHR06
 Taylor, G. W., Hinton, G. E., & Roweis, S. T. (2006) Modeling human motion using binary latent variables. In Advances in neural information processing systems (pp. 1345–1352).
 ThBe15
 Theis, L., & Bethge, M. (2015) Generative Image Modeling Using Spatial LSTMs. arXiv:1506.03478 [Cs, Stat].
 VKCM15
 Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., & Bengio, Y. (2015) ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. arXiv:1505.00393 [Cs].
 Waib89
 Waibel, A. (1989) Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics Speech Signal Process., 37(3), 328–339. DOI.
 YTCB15
 Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015) Describing Videos by Exploiting Temporal Structure. arXiv:1502.08029 [Cs, Stat].
See original: Recurrent neural networks
Generalised linear models
Wed, 15/06/2016  5:38am  by dan mackinlay
Using the machinery of linear regression to predict in
somewhat more general regressions.
This means you are still doing maximum likelihood regression,
but outside the setting of homoskedastic Gaussian noise and linear response.
Not quite as fancy as generalised additive models,
but if you have to implement such models yourself,
less work. If you are using R this is not you.
To learn:
 When can we do this? e.g. must the response be from an exponential family, for really real? Wikipedia mentions the “overdispersed exponential family”, which is no such thing.
 Does anything funky happen with regularisation?
 Whether to merge this in with quasilikelihood.
 Fitting variance parameters.
Pieces of the method follow.
Response distribution
TBD. What constraints do we have here?
Linear Predictor
Link function
An invertible (and hence, if continuous, monotonic) function
relating the linear predictor to
the mean of the response distribution.
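A small sketch of two standard links, assuming the usual convention that the link g satisfies g(E[y]) = x·β, so the mean is recovered by applying the inverse link to the linear predictor. The helper names are mine, not any particular library's.

```python
import math


def logit(mu):
    """Link for a Bernoulli/binomial mean in (0, 1)."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def log_link(mu):
    """Link for a Poisson mean in (0, inf)."""
    return math.log(mu)


def inv_log_link(eta):
    return math.exp(eta)


def predict_mean(x, beta, inverse_link):
    """E[y | x]: push the linear predictor through the inverse link."""
    eta = sum(b * xi for b, xi in zip(beta, x))  # linear predictor
    return inverse_link(eta)


p = predict_mean([1.0, 2.0], beta=[-1.0, 0.5], inverse_link=inv_logit)        # eta = 0
rate = predict_mean([1.0, 2.0], beta=[-1.0, 0.5], inverse_link=inv_log_link)  # eta = 0
```

The link keeps the predicted mean inside the range the response distribution allows, whatever value the linear predictor takes.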
See original: Generalised linear models
Artificial neural networks
Tue, 14/06/2016  6:54am  by dan mackinlay
Modern computational neural network methods reascend the hype phase transition.
a.k.a. deep learning or extreme learning or double plus fancy brainbots or please can our department have a bigger computation budget it’s not to play video games I swear?
Style transfer will be familiar to anyone who has ever taken hallucinogens or watched movies made by those who have, but you can’t usually put hallucinogens or film nights on the departmental budget so we have to make do with gigantic computing clusters.
But what are “artificial neural networks”?
Either
 a collection of incremental improvements to machine learning techniques loosely inspired by real brains, which surprisingly elicit the kind of results that everyone was hoping we’d get at least 20 years ago, or,
 the state of the art in artificial kitten recognition.
Why bother?
There are many answers here.
A classic —
The ultimate regression algorithm
Common answer:
It turns out that this particular learning model (class of learning models),
while often not apparently well suited to a given problem,
does very well in general on lots of things,
and very often can keep on doing better and better the more resources you throw at it.
Why burn three grad students on a perfect regression algorithm when you can use
one algorithm to solve a whole bunch of regression problems just as well?
This is more interesting for the businessdev people.
Cool maths
Regularisation, function approximations, interesting manifold inference.
Even the stuff I’d assumed was trivial like backpropagation has a few wrinkles in practice.
See
Michael Nielsen’s chapter and
Christopher Olah’s visual summary
Insight into the mind
TBD. Maybe.
Trippy art projects
See next.
Generative art applications
 generating music
 messing with copyright lawyers’ minds by compressing films to vectors (More technical version)
The nice hack here is called “generative adversarial networks”.
Many neural networks can, loosely speaking, be inverted, giving you generative models:
run the model forwards, it recognises melodies;
run it “backwards”, it composes melodies.
It’s not quite running it backwards, but in this vein, the “deep dreaming” project does this.
See, say, the image from
Google’s tripped-out image recognition systems, or
Gatys, Ecker and Bethge’s deep art.
Neural networks do Monet quite well.
I’ve a weakness for ideas that give me plausible deniability for making
generative art while doing my maths homework.
Hip keywords for NN models
Not necessarily mutually exclusive;
some design patterns you can use.
See Tomasz Malisiewicz’s summary of Deep Learning Trends @ ICLR 2016
Adversarial
Train two networks to beat each other.
I have some intuitions why this might work, but need to learn more.
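For reference, the usual way to write “beat each other” is as a minimax game between a generator G and a discriminator D, following the original generative adversarial network formulation. The symbols (p_data for the data distribution, p_z for the noise prior fed to the generator) are notation I am assuming, not anything defined above:

\begin{equation*}
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
\end{equation*}

D is rewarded for telling real samples from generated ones; G is rewarded for fooling it.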
Convolutional
Signal processing baked in to neural networks. Not so complicated if you have ever done signal processing, apart from the abstruse use of “depth” to mean 2 different things in the literature.
Generally uses FIR filters plus some smudgy “pooling”
(which is nonlinear downsampling),
although IIR is also making an appearance by running RNN on multiple axes.
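A bare-bones sketch of those two ingredients in one dimension, with hypothetical helper names. As in most neural network libraries, the “convolution” here is really a sliding dot product with no kernel flip.

```python
def conv1d(signal, kernel):
    """'Valid' FIR filtering: slide the kernel along the signal without padding."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(values, width):
    """Nonlinear downsampling: keep the max of each non-overlapping window."""
    return [max(values[i:i + width])
            for i in range(0, len(values) - width + 1, width)]


signal = [0, 1, 3, 2, 0, -1, 4, 4]
edges = conv1d(signal, kernel=[-1, 1])  # first difference: [1, 2, -1, -2, -1, 5, 0]
pooled = max_pool1d(edges, width=2)     # [2, -1, 5]; the leftover sample is dropped
```

In a real convolutional layer the kernel weights are learned, there are many kernels per layer, and a pointwise nonlinearity usually sits between the filter and the pooling.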
Spikebased
Most simulated neural networks are based on a continuous activation potential and discrete time, unlike spiking biological ones, which are driven by discrete events in continuous time.
There are a great many other differences.
What difference does this in particular make?
I suspect it makes a difference regarding time.
Recurrent neural networks
Feedback neural networks with memory and therefore a notion of time and state.
As someone who does a lot of signal processing for music, the notion that these generalise linear systems theory is suggestive of lots of interesting DSP applications.
The connection between these and convolutional neural networks is suggestive for the same reason.
 Awesome RNN is a curated links list of implementations.
 Andrej Karpathy: The unreasonable effectiveness of RNN
 Christopher Olah: Understanding LSTM RNNs
 Jeff Donahue Long term recurrent NN
 Ross Gibson Adventures in narrated reality gives an overview of text generation using RNNs
Vanilla
The main problem here is that they are unstable in the training phase unless you are clever.
See BeSF94. One solution is LSTM; see next.
Gated Recurrent Unit (GRU)
TBD
Long Short Term Memory (LSTM)
LSTM Networks for Sentiment Analysis:
In a traditional recurrent neural network, during the gradient backpropagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that, the magnitude of weights in the transition matrix can have a strong impact on the learning process.[…]
These issues are the main motivation behind the LSTM model which introduces a new structure called a memory cell […]. A memory cell is composed of four main elements: an input gate, a neuron with a self-recurrent connection (a connection to itself), a forget gate and an output gate. […] The gates serve to modulate the interactions between the memory cell itself and its environment.
Cortical learning algorithms
Is this a real thing, or pure hype? How does it distinguish itself from other deep learning techniques aside from namechecking biomimetic engineering?
NuPIC has made a big splash with their brain-esque learning, and have open-sourced it;
on that basis alone it looks like it could be fun to explore.
 NuPIC is an open source entrant in the field
 How it works
 More How it works
Extreme learning machines
Dunno.
Autoencoding
Optimisation methods
TBD
Related questions
 Artificial neural networks are usually layers of linear projections
sandwiched between saturating nonlinear maps.
Why not more general nonlinearities?
 Can you know in advance how long it will take to fit a classifier
or regression model for data of a given sort?
The process looks so mechanical…
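For concreteness, the “linear projections sandwiched between saturating nonlinear maps” structure can be sketched as follows. The weights are made up for illustration, and tanh stands in for whichever saturating map you prefer.

```python
import math


def dense(x, weights, biases):
    """Linear projection: one output per row of `weights`, plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Alternate linear projections with a saturating nonlinearity (tanh)."""
    for weights, biases in layers:
        x = [math.tanh(v) for v in dense(x, weights, biases)]
    return x


# Hypothetical two-layer network: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[0.5, -1.0], [1.5, 0.25], [-0.75, 0.5]], [0.0, 0.1, -0.1]),
    ([[1.0, -2.0, 0.5]], [0.2]),
]
output = forward([0.5, -2.0], layers)
```

Swapping `math.tanh` for some other elementwise function is all the question about more general nonlinearities amounts to, code-wise.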
Regularisation in neural networks
L_1, L_2, dropout…
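A sketch of what each of these three looks like in code, under the common conventions (penalties added to the training loss; “inverted” dropout that rescales the surviving units at training time). Function names and the `strength` parameter are mine.

```python
import random


def l1_penalty(weights, strength):
    """L_1: sum of absolute values; tends to push weights to exactly zero."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L_2: sum of squares; shrinks all weights smoothly (weight decay)."""
    return strength * sum(w * w for w in weights)


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the rest."""
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


weights = [1.0, -2.0, 0.5]
loss_terms = (l1_penalty(weights, 0.1), l2_penalty(weights, 0.1))
noisy = dropout([1.0] * 50, drop_prob=0.5, rng=random.Random(0))
```

The rescaling by `1 / keep` means the expected activation is unchanged, so nothing needs adjusting at prediction time, when dropout is switched off.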
Compression of neural networks
It seems we should be able to do better than a gigantic network with millions of parameters;
once we have trained the graph, how can we simplify it, compress it, or prune it?
Quantizing to single bits.
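One simple scheme for single-bit quantization, sketched under my own assumptions (there are many variants): store only the sign of each weight, plus one shared scale per group of weights, chosen here as the mean absolute value.

```python
def binarize(weights):
    """Compress weights to one sign bit each plus a single shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dequantize(scale, signs):
    """Reconstruct approximate weights from the compressed form."""
    return [scale * s for s in signs]


scale, signs = binarize([0.5, -1.5, 1.0, -1.0])
approx = dequantize(scale, signs)  # [1.0, -1.0, 1.0, -1.0]
```

How much accuracy survives this depends heavily on the network and on whether it is retrained with the quantization in the loop.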
Encoding for neural networks
Neural networks take an inconvenient encoding format,
so general data has to be massaged.
Convolutional models are an important implicit encoding;
what else can we squeeze [in there/out of there]?
 Radial basis functions
 probabilities
Software stuff
Too many. Neural networks are intuitive enough that everyone builds their own library.
I use Tensorflow, plus a side order of Keras.

R/MATLAB/Python/everything: MXNET.

Lua: Torch

MATLAB/Python: Caffe claims to be a “de facto standard”

Python: Theano
 Tastes better with Lasagne
 which in turn likes nolearn
 …Or this minute’s flavour, keras. Keras is a (probably temporary) de facto standard for transporting trained neural networks to new architectures.
 Less trendy (?) — Pylearn2: Machine Learning library based on Theano and Python
 python/cuda: deepnet
 https://github.com/dmlc/cxxnet and https://github.com/tqchen/mshadow: numpy interface, multiple GPU targets.

Python/C++: tensorflow seems to be the same thing as Theano,
but it’s backed by Google so probably has better long-term prospects.
The construction of graphs is more explicit than in Theano, which I find easier to understand, although this means that you lose the near-Python syntax of Theano.
Also claims to compile to smartphones etc, although that looks buggy atm. Keras supports tensorflow as a backend too, for comfort and convenience.
 tensorflowslim eases some boring bits.
 tflearn wraps the tensorflow machine in scikitlearn

Javascript (!) inference and training: convnetjs
* plus bonus interview
* sister project for recurrent networks: recurrentjs 
synapticjs is a very full-featured JavaScript library for training, inference and visualisation of neural networks, with really good documentation. Great learning resource, with plausible examples.

javascript inference only, neocortexjt in the browser. Civilised.

brainjs is unmaintained now but looked like a nice simple JavaScript neural network library.

mindjs is a simple one where you can see the moving parts.

iphone: DeepBeliefSDK
Examples
data
precomputed/trained models
 Caffe format:
 The Caffe Zoo has lots of nice models, pretrained on their wiki
 Here’s a great CV one, Andrej Karpathy’s image captioner, Neuraltalk2
 for the NVC dataset: http://www.stat.ucla.edu/~junhua.mao/projects/child_learning.html (pretrained feature model at http://www.stat.ucla.edu/~junhua.mao/projects/child_learning_folder/NVC_v201509_image_feat_VGGnet.npy)
 Alexnet http://arxiv.org/abs/1412.2302
 For Lasagne: https://github.com/Lasagne/Recipes/tree/master/modelzoo
 For Keras:
Howtos
 Beginners guide by google staffers
 What’s wrong with deep learning? is a high speed diagrammatic introductory presentation with clickbait title, by one of the founding fathers, Yann LeCun
 Yarin Gal on uncertainty quantification
 not exactly a “deep” network, but a great generative hack in this vein:
Generating Sequences With Recurrent Neural Networks
 Memkit’s Deep learning bibliography
 deeplearning.net’s reading list…
 and their tutorials are pretty clear
 Michael Nielsen has a free online textbook with code examples in Python
 Dürr’s tutorial
 Geoffrey Hinton’s video draws the connection between Markov Random Fields and neural networks, and also links to lots of other video tutorials in the sidebar
 The cat recogniser team lead, Quoc Le, has some nice lectures
To read

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. […] The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice.

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images

Jeff Dean’s Large Scale Deep Learning at Google
The vector embedding is cool:
\begin{equation*}
E(\text{Rome}) - E(\text{Italy}) + E(\text{Germany}) \approx E(\text{Berlin})
\end{equation*}
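A toy illustration of that arithmetic, with made-up four-dimensional vectors. Real embeddings are learned and have hundreds of dimensions; every number below is hypothetical, chosen so that the axes read as (Italian, German, French, capital city).

```python
import math

# Hypothetical embeddings, not taken from any trained model.
E = {
    "Italy":   [1.0, 0.0, 0.0, 0.0],
    "Germany": [0.0, 1.0, 0.0, 0.0],
    "France":  [0.0, 0.0, 1.0, 0.0],
    "Rome":    [1.0, 0.0, 0.0, 1.0],
    "Berlin":  [0.0, 1.0, 0.0, 1.0],
    "Paris":   [0.0, 0.0, 1.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Word nearest (by cosine) to E[a] - E[b] + E[c], excluding the query words."""
    query = [x - y + z for x, y, z in zip(E[a], E[b], E[c])]
    candidates = [w for w in E if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, E[w]))


answer = analogy("Rome", "Italy", "Germany")
```

Subtracting Italy removes the “Italian” component of Rome and leaves the “capital” component, to which Germany is then added.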
Refs
 Amar98
 Amari, S. (1998) Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 251–276. DOI.
 Arau00
 Araujo, L. (2000) Evolutionary parsing for a probabilistic context free grammar. In Proc. of the Int. Conf. on Rough Sets and Current Trends in Computing (RSCTC2000), Lecture Notes in Computer Science 2005 (p. 590).
 ArRK10
 Arel, I., Rose, D. C., & Karnowski, T. P. (2010) Deep Machine Learning - A New Frontier in Artificial Intelligence Research [Research Frontier]. IEEE Computational Intelligence Magazine, 5(4), 13–18. DOI.
 AGMM15
 Arora, S., Ge, R., Ma, T., & Moitra, A. (2015) Simple, Efficient, and Neural Algorithms for Sparse Coding. arXiv:1503.00778 [cs, Stat].
 BLPB12
 Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., … Bengio, Y. (2012) Theano: new features and speed improvements. arXiv:1211.5590 [cs].
 Beng09
 Bengio, Y. (2009) Learning deep architectures for AI. Foundations and Trends® in Machine Learning, 2(1), 1–127. DOI.
 BeCV13
 Bengio, Y., Courville, A., & Vincent, P. (2013) Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Machine Intell., 35, 1798–1828. DOI.
 BeLe07
 Bengio, Y., & LeCun, Y. (2007) Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34, 1–41.
 BeSF94
 Bengio, Y., Simard, P., & Frasconi, P. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166. DOI.
 Bose91
 Boser, B. (1991) An analog neural network processor with programmable topology. J. Solid State Circuits, 26, 2017–2025. DOI.
 Bott14
 Bottou, L. (2014) From machine learning to machine reasoning. Mach. Learn., 94, 133–149. DOI.
 BoBV12
 Boulanger-Lewandowski, N., Bengio, Y., & Vincent, P. (2012) Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription. In 29th International Conference on Machine Learning.
 Cadi14
 Cadieu, C. F. (2014) Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Comp. Biol., 10, e1003963. DOI.
 CHMB15
 Choromanska, A., Henaff, Mi., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
 Ciod12
 Ciodaro, T. (2012) Online particle detection with neural networks based on topological calorimetry information. J. Phys. Conf. Series, 368, 012030. DOI.
 Cire12
 Ciresan, D. (2012) Multicolumn deep neural network for traffic sign classification. Neural Networks, 32, 333–338. DOI.
 Dahl12
 Dahl, G. E. (2012) Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process., 20, 33–42. DOI.
 DoSB14
 Dosovitskiy, A., Springenberg, J. T., & Brox, T. (2014) Learning to Generate Chairs with Convolutional Neural Networks. arXiv:1411.5928 [cs].
 Fara13
 Farabet, C. (2013) Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell., 35, 1915–1929. DOI.
 Fell91
 Felleman, D. J. (1991) Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex, 1, 1–47. DOI.
 Fuku82
 Fukushima, K. (1982) Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15, 455–469. DOI.
 Garc04
 Garcia, C. (2004) Convolutional face finder: a neural architecture for fast and robust face detection. IEEE Trans. Pattern Anal. Machine Intell., 26, 1408–1423. DOI.
 GaEB15
 Gatys, L. A., Ecker, A. S., & Bethge, M. (2015) A Neural Algorithm of Artistic Style. arXiv:1508.06576 [cs, QBio].
 GiSB14
 Giryes, R., Sapiro, G., & Bronstein, A. M. (2014) On the Stability of Deep Networks. arXiv:1412.5896 [cs, Math, Stat].
 Hads09
 Hadsell, R. (2009) Learning long-range vision for autonomous off-road driving. J. Field Robot., 26, 120–144. DOI.
 HaCL06
 Hadsell, R., Chopra, S., & LeCun, Y. (2006) Dimensionality Reduction by Learning an Invariant Mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 1735–1742). DOI.
 Helm13
 Helmstaedter, M. (2013) Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature, 500, 168–174. DOI.
 Hint10
 Hinton, G. (2010) A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade (Vol. 9, p. 926). Springer Berlin Heidelberg
 HDYD12
 Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., … Kingsbury, B. (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine, 29(6), 82–97. DOI.
 Hint95
 Hinton, G. E. (1995) The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161. DOI.
 Hint07
 Hinton, G. E. (2007) To recognize shapes, first learn to generate images. In T. D. and J. F. K. Paul Cisek (Ed.), Progress in Brain Research (Vol. 165, pp. 535–547). Elsevier.
 HiSa06
 Hinton, G. E., & Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. DOI.
 HiOT06
 Hinton, G., Osindero, S., & Teh, Y. (2006) A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7), 1527–1554. DOI.
 HoSc97
 Hochreiter, S., & Schmidhuber, J. (1997) Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. DOI.
 HuSi05
 Huang, G.-B., & Siew, C.-K. (2005) Extreme learning machine with randomly assigned RBF kernels. International Journal of Information Technology, 11(1), 16–24.
 HuWL11
 Huang, G.-B., Wang, D. H., & Lan, Y. (2011) Extreme learning machines: a survey. International Journal of Machine Learning and Cybernetics, 2(2), 107–122. DOI.
 HuZS04
 Huang, G.-B., Zhu, Q.-Y., & Siew, C.-K. (2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In 2004 IEEE International Joint Conference on Neural Networks, 2004. Proceedings (Vol. 2, pp. 985–990). DOI.
 HuZS06
 Huang, G.-B., Zhu, Q.-Y., & Siew, C.-K. (2006) Extreme learning machine: Theory and applications. Neurocomputing, 70(1–3), 489–501. DOI.
 Hube62
 Hubel, D. H. (1962) Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. J. Physiol., 160, 106–154. DOI.
 HuPC15
 Hu, T., Pehlevan, C., & Chklovskii, D. B. (2015) A Hebbian/Anti-Hebbian Network for Online Sparse Dictionary Learning Derived from Symmetric Matrix Factorization. arXiv:1503.00690 [cs, QBio, Stat].
 KaRL10
 Kavukcuoglu, K., Ranzato, M., & LeCun, Y. (2010) Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition. arXiv:1010.3467 [cs].
 KWKT15
 Kulkarni, T. D., Whitney, W., Kohli, P., & Tenenbaum, J. B. (2015) Deep Convolutional Inverse Graphics Network. arXiv:1503.03167 [cs].
 Lawr97
 Lawrence, S. (1997) Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Networks, 8, 98–113. DOI.
 Lecu98
 LeCun, Y. (1998) Gradient-based learning applied to document recognition. Proc. IEEE, 86, 2278–2324. DOI.
 LeBH15
 LeCun, Y., Bengio, Y., & Hinton, G. (2015) Deep learning. Nature, 521(7553), 436–444. DOI.
 LCHR06
 LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006) A tutorial on energy-based learning. Predicting Structured Data.
 LBRN07
 Lee, H., Battle, A., Raina, R., & Ng, A. Y. (2007) Efficient sparse coding algorithms. Advances in Neural Information Processing Systems, 19, 801.
 LGRN00
 Lee, H., Grosse, R., Ranganath, R., & Ng, A. Y. (2009) Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. In Proceedings of the 26th International Conference on Machine Learning.
 Leun14
 Leung, M. K. (2014) Deep learning of the tissue-regulated splicing code. Bioinformatics, 30, i121–i129. DOI.
 Ma15
 Ma, J. (2015) Deep neural nets as a method for quantitative structure-activity relationships. J. Chem. Inf. Model., 55, 263–274. DOI.
 Mall12
 Mallat, S. (2012) Group Invariant Scattering. Communications on Pure and Applied Mathematics, 65(10), 1331–1398. DOI.
 Mall16
 Mallat, S. (2016) Understanding Deep Convolutional Networks. arXiv:1601.04920 [cs, Stat].
 MaMD14
 Marcus, G., Marblestone, A., & Dean, T. (2014) The atoms of neural computation. Science (New York, N.Y.), 346(6209), 551–552. DOI.
 MCCD13
 Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013) Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs].
 MiLS13
 Mikolov, T., Le, Q. V., & Sutskever, I. (2013) Exploiting Similarities among Languages for Machine Translation. arXiv:1309.4168 [cs].
 Mnih15
 Mnih, V. (2015) Human-level control through deep reinforcement learning. Nature, 518, 529–533. DOI.
 Moha12
 Mohamed, A.-R. (2012) Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process., 20(1), 14–22. DOI.
 Mont14
 Montufar, G. (2014) When does a mixture of products contain a product of mixtures? J. Discrete Math., 29, 321–347. DOI.
 Ning05
 Ning, F. (2005) Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process., 14, 1360–1371. DOI.
 OlFi96a
 Olshausen, B. A., & Field, D. J. (1996a) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609. DOI.
 OlFi96b
 Olshausen, B. A., & Field, D. J. (1996b) Natural image statistics and efficient coding. Network (Bristol, England), 7(2), 333–339. DOI.
 OlFi04
 Olshausen, B. A., & Field, D. J. (2004) Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4), 481–487. DOI.
 PaVe14
 Paul, A., & Venkatasubramanian, S. (2014) Why does Deep Learning work? A perspective from Group Theory. arXiv:1412.6621 [cs, Stat].
 PeCh15
 Pehlevan, C., & Chklovskii, D. B. (2015) A Hebbian/Anti-Hebbian Network Derived from Online Non-Negative Matrix Factorization Can Cluster and Discover Sparse Features. arXiv:1503.00680 [cs, QBio, Stat].
 RaBC08
 Ranzato, M., Boureau, Y.-L., & LeCun, Y. (2008) Sparse Feature Learning for Deep Belief Networks. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in Neural Information Processing Systems 20 (pp. 1185–1192). Curran Associates, Inc.
 Ranz13
 Ranzato, M. (2013) Modeling natural images using gated MRFs. IEEE Trans. Pattern Anal. Machine Intell., 35, 2206–2222. DOI.
 Rume86
 Rumelhart, D. E. (1986) Learning representations by back-propagating errors. Nature, 323, 533–536. DOI.
 SGAL14
 Sagun, L., Guney, V. U., Arous, G. B., & LeCun, Y. (2014) Explorations on high dimensional landscapes. arXiv:1412.6615 [cs, Stat].
 Schw07
 Schwenk, H. (2007) Continuous space language models. Computer Speech Lang., 21, 492–518. DOI.
 SiOl01
 Simoncelli, E. P., & Olshausen, B. A. (2001) Natural Image Statistics and Neural Representation. Annual Review of Neuroscience, 24(1), 1193–1216. DOI.
 SDBR14
 Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014) Striving for Simplicity: The All Convolutional Net. arXiv:1412.6806 [cs].
 Tura10
 Turaga, S. C. (2010) Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Comput., 22, 511–538. DOI.
 Waib89
 Waibel, A. (1989) Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics Speech Signal Process., 37, 328–339. DOI.
 WiBö15
 Wiatowski, T., & Bölcskei, H. (2015) A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction. arXiv:1512.06293 [cs, Math, Stat].
 Xion15
 Xiong, H. Y. (2015) The human splicing code reveals new insights into the genetic determinants of disease. Science, 347, 6218. DOI.
 ZhCL14
 Zhang, S., Choromanska, A., & LeCun, Y. (2014) Deep learning with Elastic Averaging SGD. arXiv:1412.6651 [cs, Stat].
See original: Artificial neural networks