Artwork by MFG Labs

Learning to learn, or the advent of augmented data scientists

simon.benhamou
11 min read · Sep 17, 2015

The job of a data scientist involves finding patterns in data, often in order to automate or augment human decision-making. Is it possible to find patterns in the way data scientists work, in order to automate their own job?

The concept of automatic machine learning (autoML) is compelling, and we at MFG Labs pay close attention to the development of this field, because the way we work and design our processes might be disrupted by theoretical or practical breakthroughs in this area. It also prompts us to step back, deconstruct our typical workflow, and question each part of it. At ICML (the International Conference on Machine Learning), we had the chance to hear the take of some of the most brilliant minds on this subject. In this article we present a brief outlook on how algorithms might replace us data scientists or, more likely, assist us in doing our job better. Our intent is by no means to comprehensively survey the field of automatic machine learning, but rather to showcase a couple of specific topics that resonated particularly with our current interests.

The typical data scientist workflow, considered all the way from defining the problem at hand to debugging a live production system, is in our experience very intricate and certainly not linear. Thankfully, the process can easily be broken down into distinct parts. Rich Caruana (Microsoft Research) formulates the following pipeline, which feels very familiar to us:

  • Problem definition
  • Data collection
  • Data cleaning
  • Data coding (feature engineering)
  • Metric selection
  • Algorithm selection
  • Parameter optimization
  • Post-processing
  • Deployment
  • Online evaluation
  • Debug

Again, the workflow is usually not followed linearly: for example, data quality problems are often brought to light during feature engineering or model tuning, which means going back to the data collection and cleaning steps, improving them, returning to whatever you were doing before, discovering another flaw in the data or in the choice of algorithm, rinse, repeat.

Some of the necessary steps outlined above look well out of reach of automation right now: problem definition, data collection, metric selection, deployment, debugging… These tasks involve a kind of general intelligence found, so far, only in humans. We are still far from having machines define a relevant problem for us, nor are we anywhere close to having them industrialize an algorithm from a sandbox-like environment into a live production system.

Data cleaning

AutoML focuses only on those tasks that are both time-consuming and automatable. One that every data scientist should be all too familiar with is data cleaning. In the real world, datasets are messy, data specifications are not followed or do not exist, and values are missing or mistyped. Currently, the way data scientists deal with these issues is very manual: it involves first understanding the dataset and the meaning of its features, then drawing univariate plots of the data and looking for anomalies. It is, however, virtually impossible to catch every problem this way, and data quality issues are often still found late in development, for example while debugging the predictions of an algorithm. Surprisingly, research has paid little attention to this issue.

We feel, though, that data cleaning is exactly the kind of task that should be automatable, because it is so repetitive. Tools that systematically detect typical anomalies in a dataset and suggest relevant cleaning decisions to the data scientist could be huge time savers, and going beyond univariate distributions would help catch more subtle errors. So far, the available outlier-detection tools either assume a given joint distribution (like fitting an elliptic envelope), which is bound to fail on high-dimensional, complex datasets, or are actually novelty-detection algorithms (like the one-class support vector machine), which tend to overfit to the outliers present during training. To overcome these issues, research advancing the state of the art in high-dimensional density estimation (a very challenging research area in itself) in the practical context of data cleaning would yield significant benefits for data scientists.
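
To make the contrast concrete, here is a minimal sketch of those two families of tools using scikit-learn on a synthetic dataset; the contamination rate and kernel settings are illustrative assumptions, not recommendations.

```python
# Minimal sketch of the two families of outlier-detection tools mentioned
# above, on a synthetic dataset. All thresholds are illustrative.
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_clean = rng.normal(size=(500, 5))                      # "normal" rows
X_outliers = rng.uniform(low=-8, high=8, size=(10, 5))   # injected anomalies
X = np.vstack([X_clean, X_outliers])

# Assumes an elliptic (Gaussian-like) joint distribution, which struggles
# on high-dimensional or multi-modal data.
envelope = EllipticEnvelope(contamination=0.02).fit(X)

# Novelty-detection flavour: learns a frontier around the training data,
# and can overfit to the outliers since they are part of the training set.
ocsvm = OneClassSVM(nu=0.02, kernel="rbf", gamma="scale").fit(X)

print("EllipticEnvelope flags:", (envelope.predict(X) == -1).sum(), "rows")
print("OneClassSVM flags:     ", (ocsvm.predict(X) == -1).sum(), "rows")
```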

Hyperparameter optimization

Another task that data scientists either spend too much time tinkering with manually or ignore entirely is the tuning of algorithm parameters (also called hyperparameters). For example, when you run a k-NN (k nearest neighbors) classification algorithm, the hyperparameter is k, the number of neighbors considered when classifying a new data point. In this case hyperparameter optimization is straightforward, since there is only a single parameter to optimize: run a cross-validation over many values of k and keep the value that maximizes generalization performance. This works fine when there is a single hyperparameter and model fitting is cheap. With several hyperparameters, an exhaustive grid search grows exponentially with their number, which quickly becomes impractical on reasonably sized datasets.
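
For the single-hyperparameter k-NN case, that cross-validation loop fits in a few lines; this sketch uses scikit-learn with a toy dataset and an arbitrary range of candidate values for k.

```python
# Minimal sketch of the k-NN cross-validation described above, using
# scikit-learn. The dataset and the candidate values of k are placeholders.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": list(range(1, 31))}   # candidate values of k
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best k:", search.best_params_["n_neighbors"])
print("cross-validated accuracy:", search.best_score_)
```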

For example, if you work with a more sophisticated model than k-NN, say random forests, you are dealing with about 6–8 hyperparameters, which can be categorical (Gini criterion or entropy?) or numeric (proportion of features to draw for each tree, maximum depth of the trees). Depending on how you discretise the hyperparameter space, the number of combinations can easily reach hundreds of millions of possible choices. This kind of state space is absolutely overwhelming for the mere humans who constitute the majority of data scientists.
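
As a back-of-the-envelope illustration of that explosion, here is the count for a hypothetical random forest grid; the number of candidate values per hyperparameter is made up.

```python
# Grid size for a hypothetical random forest search space; the candidate
# counts per hyperparameter are purely illustrative.
from math import prod

candidate_values = {
    "n_estimators": 20,
    "criterion": 2,            # Gini or entropy
    "max_features": 10,        # proportion of features to draw
    "max_depth": 15,
    "min_samples_split": 10,
    "min_samples_leaf": 10,
    "bootstrap": 2,
    "max_samples": 10,
}
# 12,000,000 combinations for this coarse grid; a finer discretisation
# quickly reaches hundreds of millions.
print(prod(candidate_values.values()))
```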

Yet this choice is fundamental and should not be rushed. Algorithm performance varies widely with the choice of hyperparameters, and it is quite unlikely that you will stumble upon good values by chance, or that the software defaults will be the best, or even reasonable. As a result, most data scientists in industry either use sub-optimal models, for example by keeping the default parameters of the specific implementation they run, or optimize each parameter independently, which drastically reduces the hyperparameter space but ignores parameter interactions. Incidentally, comparing algorithms does not make much sense unless they are all correctly tuned, since you tend to negatively bias and discard algorithms whose hyperparameters are difficult to optimize.

Traditionally, hyperparameter tuning was seen as a kind of “black art” requiring time, experience and often unwritten rules of thumb. Automating this task would improve data scientists’ models and free up time for them to focus on other, less automatable tasks. Fortunately, recent research has applied a technique from the 1970s to this task and actually automates it completely, even performing it better than human experts: Bayesian optimization.

The algorithm models the generalization performance as a smooth function of the hyperparameters and aims at finding the maximum of this function in as few steps as possible. To do so, it chooses to evaluate the function (which means running the full machine learning algorithm) at the hyperparameter settings where the outcome is most uncertain (exploration) and most likely to take high values (exploitation). Handling that uncertainty means modeling the generalization performance as a stochastic process (in practice a Gaussian process is a good choice, because closed-form formulas are available for its marginal and conditional distributions), choosing a reasonable prior, and updating the posterior via Bayes’ formula each time a new set of hyperparameters is evaluated. The process is repeated iteratively until a convergence criterion is met.
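
Here is a toy sketch of that loop, with a Gaussian process surrogate and an expected-improvement acquisition function over a single hyperparameter; the objective function, kernel and evaluation budget are all stand-ins for a real train-and-validate run.

```python
# Toy Bayesian-optimisation loop: a Gaussian process models the validation
# score as a function of one hyperparameter, and expected improvement picks
# the next point to evaluate. "expensive_score" stands in for a full
# train/cross-validate run; everything here is illustrative.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_score(log_c):
    # Placeholder for "train with this hyperparameter and cross-validate".
    return np.exp(-(log_c - 1.0) ** 2) + 0.3 * np.sin(3 * log_c)

candidates = np.linspace(-3, 3, 200).reshape(-1, 1)   # search space
X_obs = np.array([[-2.0], [0.0], [2.0]])              # a few initial evaluations
y_obs = np.array([expensive_score(x[0]) for x in X_obs])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(10):                                   # evaluation budget
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y_obs.max()
    # Expected improvement trades off a high predicted mean (exploitation)
    # against a high predictive uncertainty (exploration).
    z = (mu - best) / np.maximum(sigma, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, expensive_score(x_next[0]))

print("best hyperparameter found:", X_obs[np.argmax(y_obs)][0])
print("best score found:", y_obs.max())
```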

The reason Bayesian optimization vastly outperforms classic grid search for a given computation budget is that it carefully chooses which set of hyperparameters to evaluate at the next iteration. Since the bottleneck of hyperparameter optimization is the evaluation time of the function (running a full machine learning algorithm at each iteration), this approach saves a lot of computation by avoiding hyperparameters that are unlikely to be optimal.

In practice this technique works very well, as has been shown in many papers and data science competitions. It can be applied to any complex machine learning algorithm, since they all involve hyperparameters. Efficient distributed implementations like Spearmint or Hyperopt are available to Python users. There is thus no reason anymore for a data scientist to spend more time on hyperparameter tuning than it takes to set up the Bayesian optimization and run it.
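
As an illustration of how little code this requires, here is what a tuning run might look like with Hyperopt; the random forest search space and evaluation budget below are illustrative choices, not a recipe.

```python
# Sketch of hyperparameter tuning with Hyperopt. The search space, dataset
# and evaluation budget are illustrative choices, not recommendations.
from hyperopt import Trials, fmin, hp, tpe
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

space = {
    "n_estimators": hp.choice("n_estimators", [50, 100, 200, 400]),
    "max_depth": hp.choice("max_depth", [None, 5, 10, 20]),
    "max_features": hp.uniform("max_features", 0.1, 1.0),
    "criterion": hp.choice("criterion", ["gini", "entropy"]),
}

def objective(params):
    model = RandomForestClassifier(random_state=0, **params)
    # Hyperopt minimises, so return the negated cross-validated accuracy.
    return -cross_val_score(model, X, y, cv=3).mean()

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=30, trials=Trials())
print("best (index-encoded) parameters:", best)
```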

Algorithm selection

The automation of hyperparameter selection is a major step forward. What if we could go even further and not even have to choose the specific machine learning algorithm? Michèle Sebag presented at ICML a novel approach to this problem: ALORS (Algorithmic Recommender System). The main idea is to formulate algorithm selection as a recommender system problem, a theme we frequently encounter at MFG Labs: a set of users can rate or “like” products, and the objective is to predict the unknown ratings of all users for all products from the existing ratings. Recommender systems are widely used today (think “you might also like…” after an Amazon purchase) and have been the subject of increasing academic research and experimentation since the Netflix Prize competition in 2006.

A convenient way to understand collaborative filtering is to view it as a matrix completion problem. The matrix has users as its rows and products as its columns, and the available ratings populate it. This matrix is usually very sparse: in practice you might only know about 0.001% of its values. The main idea for solving the problem is to assume that the matrix you wish to complete contains redundant information, i.e. that it is of low rank. That makes sense in the context of movie recommendation, for example: a user who likes one science fiction movie is likely to like other science fiction movies. In this framework collaborative filtering becomes a linear algebra problem, and classic linear algebra algorithms like the Singular Value Decomposition work very well in practice.
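
Here is a toy sketch of that low-rank completion idea with plain NumPy; the ratings matrix, the target rank and the iteration count are all made up for illustration.

```python
# Toy matrix completion via truncated SVD, as described above: fill the
# missing ratings with a crude estimate, project onto a low-rank
# approximation, and iterate. Every constant here is illustrative.
import numpy as np

rng = np.random.RandomState(0)
true_ratings = rng.rand(30, 2) @ rng.rand(2, 20)   # a genuinely rank-2 matrix
mask = rng.rand(30, 20) < 0.3                      # only ~30% of ratings observed

rank = 2
completed = np.where(mask, true_ratings, true_ratings[mask].mean())

for _ in range(50):
    U, s, Vt = np.linalg.svd(completed, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # keep the top singular values
    # Keep the observed entries, replace the missing ones with the low-rank guess.
    completed = np.where(mask, true_ratings, low_rank)

print("mean absolute error on unobserved ratings:",
      np.abs(completed - true_ratings)[~mask].mean())
```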

In the context of algorithm selection, the users are machine learning algorithms and the products are problem instances. The rating is the generalization performance of an algorithm on a problem instance (after hyperparameter optimization). Algorithm selection in this framework then consists in predicting, given a fraction of the values of the algorithm/problem-instance matrix, which algorithm will perform best on each problem. The experimental results of this strategy are promising: on a portfolio of about 30 algorithms and 200 problem instances, the algorithm recommended by the recommender system consistently outperforms the single best overall algorithm. The implications are profound: sticking to one algorithm regardless of the problem at hand is a bad strategy, even if you choose the best overall algorithm; if you are targeting optimal performance, you need to adapt the algorithm to the problem. The good news is that there is no need to run every algorithm: thanks to the recommender system, running a couple of them (2–3) is enough in many cases, which is quite practical if you run them in parallel.

Which one is your favorite algorithm?

This kind of research definitely moves in the right direction towards automating the machine learning workflow. By having an algorithm manipulate a portfolio of other algorithms, we are able to abstract away the algorithm selection phase. These meta-algorithms could be the future of data science: a future in which, once a problem and a performance metric have been defined, there are fewer choices left to make, because machines could make better choices than we do.

Meta-learning: towards a general-purpose artificial intelligence?

Algorithm selection methods require a portfolio of algorithms to begin with. Each of those algorithms has been designed by humans to solve a specific task: they are composed of logical instructions that consume data and produce a mapping from input to output within the framework of that task. For example, gradient descent, a very common technique for calibrating the weights of machine learning algorithms, implements very specific instructions to minimize a smooth loss function. But depending on the problem, other optimization algorithms can work significantly better, such as nonlinear conjugate gradient or L-BFGS. Those algorithms were also designed by humans. Would it be that far-fetched to imagine that there are better learning algorithms out there that we simply haven’t figured out yet? Can we design algorithms to find those algorithms?
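
To illustrate how interchangeable these optimizers are on the same smooth loss, here is a quick comparison with SciPy on the classic Rosenbrock function; the fixed step size and iteration budgets are arbitrary.

```python
# Gradient descent, nonlinear conjugate gradient and L-BFGS minimising the
# same smooth loss (the Rosenbrock function). Step size and budgets are
# arbitrary illustrative choices.
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0, -1.2, 1.0])

# Plain gradient descent with a fixed step size, written by hand.
x = x0.copy()
for _ in range(20000):
    x -= 5e-4 * rosen_der(x)
print("gradient descent:", rosen(x))

for method in ["CG", "L-BFGS-B"]:   # nonlinear conjugate gradient, L-BFGS
    result = minimize(rosen, x0, jac=rosen_der, method=method)
    print(method, ":", result.fun, "in", result.nit, "iterations")
```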

This problem has seen some research over the last couple of decades. The first approach is genetic programming. It consists in evolving a computer program through biology-inspired operations such as crossover and mutation. A “fitness” of each program is defined, usually its ability to perform a given task; only the fittest programs survive and breed, which leads to even better programs. As you can easily imagine, the bottleneck of this kind of approach is computing power, since you run many programs, many times. Up until the 90s, only very simple problems could be solved with genetic programming, but the exponential growth of CPU power and refinements in the algorithms have recently led to impressive achievements, for example in quantum computing or in soccer playing (see this link for more examples).
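
Genetic programming proper evolves program trees, but the evolutionary loop it relies on is easy to sketch on a simpler encoding; below is a toy genetic algorithm evolving a bit string towards a target, with every constant chosen purely for illustration.

```python
# Toy sketch of the evolutionary loop behind genetic programming: a
# population, a fitness function, selection of the fittest, crossover and
# mutation. Real genetic programming evolves program trees rather than bit
# strings; all constants here are illustrative.
import random

random.seed(0)
TARGET = [1] * 30                                  # the "task" is to match this string
POP_SIZE, GENERATIONS, MUTATION_RATE = 50, 60, 0.02

def fitness(individual):
    return sum(a == b for a, b in zip(individual, TARGET))

def crossover(parent_a, parent_b):
    cut = random.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def mutate(individual):
    return [1 - bit if random.random() < MUTATION_RATE else bit for bit in individual]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP_SIZE // 2]         # only the fittest half breeds
    children = [mutate(crossover(*random.sample(survivors, 2)))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children

print("best fitness after evolution:", fitness(max(population, key=fitness)))
```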

Jürgen Schmidhuber, who presented some of these approaches at ICML, went even further with Meta-Genetic Programming: the meta-program modifying the main program can itself be subject to mutations, and these mutations can themselves be modified… in a potentially infinite recursive loop.

A Meta-meta-meta … meta-learner can be difficult to grasp

Another approach, also the subject of substantial research by Jürgen Schmidhuber and his lab, is even more abstract: fully self-referential learners. This kind of algorithm blurs the line between the main program and the meta-program that modifies it. In this framework, the learner is conscious of its own code, is able to modify itself, and can even improve the part of its codebase responsible for the modifications. To do that, it must have a way of proving that a change to the program will lead to better results. In the Gödel machine, a theoretically optimal self-referential learner designed by Schmidhuber, the starting program basically consists of a set of axioms describing its own hardware and software (including a theorem prover) and a utility function supplied by the user. Even the utility function can be modified, if the theorem prover manages to deduce that a rewrite would be beneficial, for example if simplifying the utility function would save computing power while preserving the properties of the original utility. In this framework of self-referential algorithms, the learner can interact with data as well as with an environment, as in reinforcement learning. This kind of algorithm has huge potential, since it emulates one of the distinctive features of the human mind: consciousness.

What about the feasibility of a Gödel machine implementation? Recent work by Marcus Hutter (a former Schmidhuber student) showed that with a specific initialization of the Gödel machine, the complexity is O(n^3). However, the constant hidden in this complexity is absolutely unmanageable for any reasonable problem. If we are looking for practical algorithms, though, the closest thing we have to a self-referential learner is the very popular recurrent neural network (RNN). It has been shown that an RNN can behave as a general Turing machine, and the way it references its own weights makes it self-referential and explains many of its successes (LSTM, a variant of RNNs, has won many machine learning competitions). It is not as ideal as the Gödel machine, though, since its expressiveness is limited by the structure of the network.

What’s next

Many of the techniques mentioned in this article are not that novel, but they are not widely used because they are not well known outside academic circles or lack an easy-to-use implementation. We expect, though, that in the next decade it will be as easy for a data scientist to apply all of these techniques to their problem instance as it is today to run a random forest algorithm. For that to happen, we should increase communication between industry and academia (see the inaugural article of this series) and promote open-source development.

We are also excited about the development of general self-referential intelligence, which could change the world in so many awesome ways (and probably some appalling ones too).
