Thursday, June 28, 2012

Time Is Flying By

Good news: the refactoring of the elastic-net and lasso classes is done, and it was merged into scikit-learn's master branch some time ago.

I invested some time learning about the HDF5 data format in order to upload some datasets to mldata.org. Sadly, the documentation of mldata.org is very sparse and, in the case of the HDF5 format it uses, outdated as well. I have been in contact with the maintainer, but I still don't have a working spec. I decided to put this task to rest for the moment.
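Writing HDF5 files themselves is straightforward with h5py, by the way; it is only the exact layout mldata.org expects that remains unclear to me. A minimal sketch, with a file name and group layout of my own choosing rather than the mldata.org spec:

    import numpy as np
    import h5py

    X = np.random.randn(100, 10)            # feature matrix, one row per sample
    y = np.random.randint(0, 2, size=100)   # labels

    # a generic layout -- NOT the (undocumented) mldata.org format
    with h5py.File('my_dataset.h5', 'w') as f:
        f.create_dataset('data/X', data=X)
        f.create_dataset('data/y', data=y)
        f['data'].attrs['name'] = 'my toy dataset'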

I have a working Python prototype of the new coordinate descent algorithm, which I'm now going to integrate into scikit-learn step by step. The idea is to fill the matrix of feature inner products (the Gram matrix) lazily and to update the gradient incrementally whenever a coefficient changes, instead of recomputing it from scratch:

import numpy as np


def enet_coordinate_descent2(w, l2_reg, l1_reg, X, y, max_iter):
    n_samples, n_features = X.shape
    norm_cols_X = (X ** 2).sum(axis=0)
    Xy = np.dot(X.T, y)
    gradient = np.zeros(n_features)
    feature_inner_product = np.zeros(shape=(n_features, n_features))
    active_set = set(range(n_features))
    # debug
    value_enet_f = 0
    for n_iter in range(max_iter):
        # sweep the features in increasing order; the first-sweep
        # bookkeeping below (j <= ii) relies on this
        for ii in sorted(active_set):
            w_ii = w[ii]
            # initial calculation: fill the Gram matrix column and the
            # gradient entry lazily on the first visit of feature ii
            if n_iter == 0:
                feature_inner_product[:, ii] = np.dot(X[:, ii], X)
                gradient[ii] = Xy[ii] - np.dot(feature_inner_product[:, ii], w)
            tmp = gradient[ii] + w_ii * norm_cols_X[ii]
            # elastic-net soft-thresholding: the l1 penalty thresholds,
            # the l2 penalty shrinks the denominator
            w[ii] = np.sign(tmp) * max(abs(tmp) - l1_reg, 0) \
                / (norm_cols_X[ii] + l2_reg)
            # update the gradients if the coefficient changed; in the
            # first sweep only the entries that are already initialized
            if w_ii != w[ii]:
                for j in active_set:
                    if n_iter >= 1 or j <= ii:
                        gradient[j] -= feature_inner_product[ii, j] * \
                            (w[ii] - w_ii)
            # debug
            #value_enet_f = check_convergence(y, X, w, value_enet_f)
            #print(value_enet_f)
        # remove features whose coefficients hit exactly zero
        tmp_s = set.copy(active_set)
        for j in tmp_s:
            if w[j] == 0:
                active_set.remove(j)
    return w
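
To check the prototype I compare it against the coefficients scikit-learn's existing lasso produces. A small smoke test, assuming the usual convention that the regularization of the unscaled objective 1/2 ||y - Xw||^2 + l1_reg ||w||_1 + l2_reg/2 ||w||^2 corresponds to scikit-learn's alpha times n_samples:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X = rng.randn(50, 20)
    y = rng.randn(50)

    alpha = 0.1
    # l2_reg = 0 reduces the elastic-net update to the plain lasso
    w = enet_coordinate_descent2(np.zeros(X.shape[1]), 0.0,
                                 alpha * X.shape[0], X, y, max_iter=200)

    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=200).fit(X, y)
    print(np.abs(w - lasso.coef_).max())  # should be close to zero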


This version will be rewritten in Cython to speed things up. I hope that I can soon beat the execution time of the current implementation.
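A rough timing harness along these lines gives me a first impression of where the prototype stands; random data, so this is just a sanity check, not a rigorous benchmark:

    from time import time

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(42)
    X = rng.randn(500, 200)
    y = rng.randn(500)

    t0 = time()
    enet_coordinate_descent2(np.zeros(200), 0.0, 0.1 * 500, X, y, max_iter=100)
    print("prototype: %.3fs" % (time() - t0))

    t0 = time()
    Lasso(alpha=0.1, fit_intercept=False, max_iter=100).fit(X, y)
    print("scikit-learn: %.3fs" % (time() - t0))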
