Tuesday, September 18, 2012

My Google Summer of Code (2012) Experience with scikit-learn

1) Google Summer of Code
Google Summer of Code (GSOC) is a three-month program in which selected
college students receive a stipend from Google to work on ambitious open
source software projects. The program gives students the opportunity to
participate in real-life projects, thereby learning to work
- in a non-university environment,
- in big distributed teams,
- in open source communities, with the guidance of an experienced mentor.

After the program, participants should be able to keep working on such
projects in a self-motivated and self-reliant way.
For further information on GSOC, visit the website:
http://www.google-melange.com/gsoc/homepage/google/gsoc2012


2) Scikit-learn.org
Scikit-learn is a well-established Python machine learning library that
is becoming part of the Python scientific computing ecosystem alongside NumPy and
SciPy. Scikit-learn is designed to be very user-friendly, with a simple and consistent API and extensive documentation.
Strictly enforced coding standards and high test coverage guarantee
high-quality implementations. Behind scikit-learn is a very active community,
steadily improving the library.

3) My Contribution
My goal was to implement cutting-edge research results for the penalized
linear models in scikit-learn. During the three-month period, my main implementations have been:

- Extending the coordinate descent solver for the penalized linear
models with covariance updates [0] by:
- prototyping the solver in Python,
- gaining a speedup using Cython,
- using BLAS calls, where appropriate,
- varying the ways to cache vector operations.

The final solver was, however, not faster than the current
implementation in scikit-learn. Since my time for this task was limited,
I had to move on to the next implementation. The speedup here
remains an open challenge.
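To illustrate the idea, here is a minimal, unoptimized Python sketch of coordinate descent with covariance updates in the spirit of [0]. The function names and parameters (soft_threshold, enet_cd_covariance, alpha, l1_ratio) are my own illustrative choices, not the GSOC code or the scikit-learn API:

    import numpy as np

    def soft_threshold(z, gamma):
        # S(z, gamma) = sign(z) * max(|z| - gamma, 0)
        return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

    def enet_cd_covariance(X, y, alpha, l1_ratio, n_iter=100):
        # Toy elastic-net coordinate descent with covariance updates.
        # Assumes the columns of X are centered and scaled so that
        # (1/n) * x_j^T x_j == 1, and uses the penalty
        #   alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2
        n_samples, n_features = X.shape
        w = np.zeros(n_features)
        Xty = X.T.dot(y) / n_samples      # <x_j, y> terms, computed once
        cov_cache = {}                    # <x_j, x_k> terms, cached lazily
        l1 = alpha * l1_ratio
        l2 = alpha * (1.0 - l1_ratio)
        for _ in range(n_iter):
            for j in range(n_features):
                # inner product of x_j with the partial residual, expressed
                # through cached covariances instead of the residual vector
                dot = Xty[j]
                for k in range(n_features):
                    if k != j and w[k] != 0.0:
                        if (j, k) not in cov_cache:
                            cov_cache[(j, k)] = X[:, j].dot(X[:, k]) / n_samples
                        dot -= cov_cache[(j, k)] * w[k]
                w[j] = soft_threshold(dot, l1) / (1.0 + l2)
        return w

The point of the covariance updates is that the residual vector never has to be recomputed for every coordinate; only inner products with the currently non-zero features enter each update.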

- Implementing strong rules [1], a heuristic method for estimating which
coefficients of an elastic-net penalized linear regression model will be
non-zero. The aim of strong rules is to reduce the input variables to the
relevant ones before fitting the model, thereby considerably reducing
the computing time of the solver. By implementing the strong rules
together with an active-set strategy, I achieved considerable speedups
for certain problems. For no problem was the performance worse than with
the current scikit-learn implementation, so overall the solver is
an improvement for the library.
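For the lasso case, the screening step can be sketched in a few lines. The function name and the penalty scaling below are mine; the real implementation also handles the elastic-net penalty and re-checks the KKT conditions on discarded features, since strong rules can, rarely, discard a truly active variable:

    import numpy as np

    def sequential_strong_rule(X, y, w_prev, lam_prev, lam):
        # Sequential strong rule (Tibshirani et al., 2011) for the lasso.
        # Given the solution w_prev at the previous, larger penalty lam_prev
        # on the regularization path, return the indices of the features that
        # survive the screening at the new penalty lam.  Penalties are assumed
        # to be scaled for the objective
        #   1/(2n) * ||y - X w||^2 + lam * ||w||_1
        n_samples = X.shape[0]
        residual = y - X.dot(w_prev)
        corr = np.abs(X.T.dot(residual)) / n_samples
        keep = corr >= 2.0 * lam - lam_prev
        return np.flatnonzero(keep)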

During the project, I ran into some problems and learned the following
lessons:
- Don't underestimate difficulties not explicitly mentioned in research
papers!
- High-performance solvers are much harder to implement than their
classroom counterparts ;)


4) Why I would recommend GSOC & Scikit-learn
First of all, the mentoring was really great. Many thanks to Alexandre
Gramfort, who responded to my problems very quickly and supported me
day and night ;). Furthermore, there is a very active mailing list, and
code reviews happen on GitHub.

Three reasons to participate in Scikit-learn:
- You can work on a clean code base,
- you have fast development cycles,
- you get support from a very active and friendly developer community.

I was lucky to meet some of them in person shortly after GSOC at the
EuroSciPy conference (thanks to NumFOCUS for the sponsorship).


5) Future Contribution
I intend to do some code enhancements for my Strong-Rule implementation,
so that it can be included in the next Scikit-learn release and I will
certainly continue to contribute to this great project in the future.

[0] Friedman, J., T. Hastie, and R. Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33, no. 1 (2010): 1.

[1] Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. “Strong Rules for Discarding Predictors in Lasso-type Problems.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2011).

Tuesday, August 14, 2012

Burst Of Speed

The end of GSOC is approaching with lightning speed, and there is still work to be done.

I'm currently implementing a method (strong rules [0]) to predict which coefficients of a regularized linear regression model will be zero at the solution.
Knowing this makes it possible to remove the corresponding features (variables) from the data set. The result is a much smaller problem that can be solved a lot faster than the original one.
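The mechanics of the reduction look roughly like this. It is an illustrative sketch using the public Lasso estimator rather than my solver code, and active stands for whatever index set the screening produced:

    import numpy as np
    from sklearn.linear_model import Lasso

    def fit_on_screened_features(X, y, active, alpha):
        # Fit only on the columns that survived the screening, then map the
        # coefficients back to the full feature space (discarded features get 0).
        model = Lasso(alpha=alpha)
        model.fit(X[:, active], y)
        coef = np.zeros(X.shape[1])
        coef[active] = model.coef_
        return coef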

The method is working properly according to a number of implemented tests. These tests now give valuable feedback during the ongoing refactoring. The refactoring has two goals: make the implementation integrate nicely with the existing code, and turn the potential speedup into reality.


The figure above shows that the implementation with strong-rule filtering is still slower than the version without it. My primary goal for the remaining days is to bring the red line significantly below the blue line.

[0] Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. “Strong Rules for Discarding Predictors in Lasso-type Problems.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2011).




Thursday, June 28, 2012

Time Is Flying By

Good news: the refactoring of the elastic-net and lasso classes is done and was merged into the scikit-learn master branch some time ago.

I invested some time learning about the HDF5 data format in order to upload some data sets to mldata.org. Sadly, the documentation of mldata.org is very sparse and, in the case of the HDF5 format it uses, outdated too. I have been in contact with the maintainer, but I still don't have working specs. I decided to set this task aside for the moment.
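For the record, the generic HDF5 mechanics themselves are simple. The sketch below uses h5py with a made-up layout; it is explicitly not the layout mldata.org expects, which is exactly the missing piece:

    import h5py
    import numpy as np

    X = np.random.randn(100, 20)
    y = np.random.randn(100)

    # write the arrays and a little metadata into an HDF5 file
    with h5py.File("example_dataset.h5", "w") as f:
        f.create_dataset("data/X", data=X)
        f.create_dataset("data/y", data=y)
        f.attrs["name"] = "example regression data"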

I have a Python prototype working for the new coordinate descent algorithm, which I'm now integrating step by step into scikit-learn.



This version will be written in Cython to speed things up. I hope that I can soon beat the execution time of the current implementation.

Sunday, June 10, 2012

Refactoring linear model code and first benchmark plots

During the last week, I spent some time refactoring the ElasticNet and Lasso classes in scikit-learn. The goal is a single interface for the sparse and dense implementations of these models. The format of the input data determines whether the sparse implementation is used or not. This has the advantage that the user can use the same class for sparse and dense data.
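In terms of usage, the unified interface looks roughly like this (parameter names follow the current scikit-learn API, and the toy data is of course made up):

    import numpy as np
    import scipy.sparse as sp
    from sklearn.linear_model import ElasticNet

    rng = np.random.RandomState(0)
    X = rng.randn(50, 200)
    y = rng.randn(50)

    enet = ElasticNet(alpha=0.1)
    enet.fit(X, y)                  # dense ndarray -> dense code path
    enet.fit(sp.csc_matrix(X), y)   # sparse matrix -> sparse code path, same class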

The first benchmark reports are available thanks to Vlad's work adapting vbench to work with scikit-learn (check out his blog post).


The plot shows how the execution time for fitting an elastic-net penalized logistic regression on the leukemia data set changed over the past 180 days.

Scikit-learn currently wraps the LIBLINEAR library to fit logistic regression models. For comparison, glmnet 1.7.4, used via the R package glmnet, took 188 milliseconds to fit the leukemia data set. Next week will bring more benchmark evaluations on different data sets.



Monday, May 28, 2012

 Setting up benchmark tests


This GSOC project aims to improve the execution time of some linear models implemented in scikit-learn. Before I start with any code enhancements, I want to evaluate how the current implementations compare with other state-of-the-art implementations. These evaluations will be based on real-world data sets with different characteristics. The following lists some of the ideas and the questions that will be addressed next.


1. Data sets

binary:

name         | size       | N (train/test)   | p       | #nz (train)        | used in | format
Leukemia     | 1.9M       | 72               | 3571    | dense              | [1]     | RData
Newsgroup    | 9.4M       | 11,314           | 777,811 | 0.05%              | [1]     | RData
Internet-Ad  | 49K        | 2359             | 1430    | 1.2%               | [1]     | RData
a9a          | 2.3M/1.1M  | 32,561 / 16,281  | 123     | 451,592 (11%)      | [2]     | libsvm
real-sim     | 33.6M      | 72,309           | 20,958  | 3,709,083 (0.2%)   | [2]     | libsvm
rcv1         | 13.1M/432M | 20,242 / 677,399 | 47,236  | 49,556,258 (0.15%) | [2]     | libsvm



multiclass:

name    | size      | #class | N (train/test) | p         | #nz               | used in | format
news20  | 3.6M/0.9M | 20     | 15,935 / 3,993 | 1,355,191 | 9,097,916 (0.03%) | [2]     | libsvm
Cancer  | 22M       | 14     | 144            | 16,063    | dense             | [1]     | RData




regression:

name                 | size | N  | p | #nz   | used in | format
Prostate Cancer Data | -    | 97 | 9 | dense | [3]     | RData


ref:
[1] Friedman, J., T. Hastie, and R. Tibshirani. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33, no. 1 (2010): 1. (data download: p. 20)
[2] Tibshirani, R., J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R.J. Tibshirani. “Strong Rules for Discarding Predictors in Lasso-type Problems.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2011). (data download: p. 3214)
[3] Tibshirani, R. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267–288.

Q: Which data sets should be used?
Q: Format to store them?
Q: Other regression type data sets?


2. Problems to benchmark
  • l2 loss*
  • log loss*
  • multi-logit*
* each with l1 penalty and with combined l1 & l2 penalty

Q: Which settings (penalty value etc.) should be used for benchmarking?
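For reference, the objectives I have in mind are, in the notation of [1], with penalty weight lambda and mixing parameter alpha (exact scaling constants differ between implementations, and the multi-logit case generalizes the log loss to K classes):

    % squared (l2) loss with elastic-net penalty
    \min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
        + \lambda \left[ \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right]

    % log loss (binary classification, labels y_i in {-1, +1}) with the same penalty
    \min_{\beta_0, \beta} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\left( 1 + e^{-y_i (\beta_0 + x_i^{\top}\beta)} \right)
        + \lambda \left[ \alpha \|\beta\|_1 + \frac{1-\alpha}{2} \|\beta\|_2^2 \right]

Setting alpha = 1 gives the pure l1-penalized problems; 0 < alpha < 1 gives the combined l1 & l2 case.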

2.1 External reference implementations
  • glmnet
    • glmnet-python (version?)
    • R glmnet package + rpy2 (latest version)
  • liblinear
    • liblinear + Python interface (latest version)
Q: How to time execution to achieve a fair comparison?
Q: Which glmnet interface should be used?
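One simple, if partial, answer to the timing question is to time only the fit call and take the best of several repeats. The helper below is a hypothetical sketch using a scikit-learn estimator; the glmnet (via rpy2) and liblinear calls would be wrapped in the same way:

    import time
    import numpy as np
    from sklearn.linear_model import Lasso

    def best_fit_time(estimator, X, y, n_repeats=5):
        # Take the best of several wall-clock measurements to reduce the
        # influence of system noise; data loading stays outside the timer.
        times = []
        for _ in range(n_repeats):
            start = time.time()
            estimator.fit(X, y)
            times.append(time.time() - start)
        return min(times)

    rng = np.random.RandomState(0)
    X = rng.randn(500, 2000)
    y = rng.randn(500)
    print(best_fit_time(Lasso(alpha=0.01), X, y))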


3. Tracking the speed development of the scikit-learn implementations
Track the execution times of the scikit-learn implementations of the problems in (2) on the data sets in (1), using:

vbench

Q: Is there already some example code available for scikit-learn?

Saturday, April 14, 2012

This is my proposal for the Google Summer of Code 2012.

Optimizing sparse linear models using coordinate descent and strong rules.

Abstract

Scikit-learn is a Python machine learning library that aims to be easy to use through a well-designed interface. Dependencies are kept to a minimum, and the extensive use of NumPy, SciPy and Matplotlib gives great computing power and eases understanding of the codebase.
Data sets with far more features than samples are rapidly gaining in popularity and bring the class of linear models back into focus. This project will bring additional state-of-the-art optimization routines for sparse linear models to scikit-learn and further reduce dependencies. The work includes high test coverage, documentation, examples and intensive benchmarking.

Linear models are used for regression as well as for classification tasks. In both cases, penalty terms can be used to obtain an implicit feature selection. Solving these penalized linear models efficiently is the main focus of this project.

The improvements proposed here will benefit a wide range of general problems as well as specialized domains such as gene expression analysis, compressed sensing, and sparse encoding.

Outline
During the project, I will implement glmnet [1], a cyclic coordinate descent algorithm for solving penalized linear regression and classification problems. Considerable speedups are achieved, for example, through the use of weighted updates and an active set of features.
With the implementation of strong rules [2], a large number of variables can be excluded before the optimization. This rule is especially powerful for sparse problems. A speedup factor of 20 or more over the standard glmnet implementation is reported [1] for certain problems.
As a complement to glmnet, the LARS algorithm will be extended to deal with elastic-net (L1 + L2) penalties.

References

  1. http://www.stanford.edu/~hastie/Papers/glmnet.pdf
  2. http://www-stat.stanford.edu/~jbien/jrssb2011strong.pdf
  3. http://www.jmlr.org/papers/volume11/yuan10c/yuan10c.pdf