Thursday, June 28, 2012

Time Is Flying By

Good news: the refactoring of the elastic-net and lasso classes is done, and it was merged into scikit-learn's master branch some time ago.

I invested some time learning about the HDF5 data format in order to upload some datasets to mldata.org. Sadly, the documentation of mldata.org is very sparse and, in the case of the HDF5 format it uses, outdated as well. I have been in contact with the maintainer, but I still don't have a working spec. I decided to put this task to rest for the moment.
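Writing HDF5 files themselves is straightforward with h5py, by the way; it is only the exact layout mldata.org expects that remains unclear to me. A minimal sketch, with a file name and group layout of my own choosing rather than the mldata.org spec:

    import numpy as np
    import h5py

    X = np.random.randn(100, 10)            # feature matrix, one row per sample
    y = np.random.randint(0, 2, size=100)   # labels

    # a generic layout -- NOT the (undocumented) mldata.org format
    with h5py.File('my_dataset.h5', 'w') as f:
        f.create_dataset('data/X', data=X)
        f.create_dataset('data/y', data=y)
        f['data'].attrs['name'] = 'my toy dataset'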

I have a working Python prototype of the new coordinate descent algorithm, which I'm now going to integrate into scikit-learn step by step. The idea is to fill the matrix of feature inner products (the Gram matrix) lazily and to update the gradient incrementally whenever a coefficient changes, instead of recomputing it from scratch:

import numpy as np


def enet_coordinate_descent2(w, l2_reg, l1_reg, X, y, max_iter):
    n_samples, n_features = X.shape
    norm_cols_X = (X ** 2).sum(axis=0)
    Xy = np.dot(X.T, y)
    gradient = np.zeros(n_features)
    feature_inner_product = np.zeros(shape=(n_features, n_features))
    active_set = set(range(n_features))
    # debug
    value_enet_f = 0
    for n_iter in range(max_iter):
        # sweep the features in increasing order; the first-sweep
        # bookkeeping below (j <= ii) relies on this
        for ii in sorted(active_set):
            w_ii = w[ii]
            # initial calculation: fill the Gram matrix column and the
            # gradient entry lazily on the first visit of feature ii
            if n_iter == 0:
                feature_inner_product[:, ii] = np.dot(X[:, ii], X)
                gradient[ii] = Xy[ii] - np.dot(feature_inner_product[:, ii], w)
            tmp = gradient[ii] + w_ii * norm_cols_X[ii]
            # elastic-net soft-thresholding: the l1 penalty thresholds,
            # the l2 penalty shrinks the denominator
            w[ii] = np.sign(tmp) * max(abs(tmp) - l1_reg, 0) \
                / (norm_cols_X[ii] + l2_reg)
            # update the gradients if the coefficient changed; in the
            # first sweep only the entries that are already initialized
            if w_ii != w[ii]:
                for j in active_set:
                    if n_iter >= 1 or j <= ii:
                        gradient[j] -= feature_inner_product[ii, j] * \
                            (w[ii] - w_ii)
            # debug
            #value_enet_f = check_convergence(y, X, w, value_enet_f)
            #print(value_enet_f)
        # remove features whose coefficients hit exactly zero
        tmp_s = set.copy(active_set)
        for j in tmp_s:
            if w[j] == 0:
                active_set.remove(j)
    return w
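
To check the prototype I compare it against the coefficients scikit-learn's existing lasso produces. A small smoke test, assuming the usual convention that the regularization of the unscaled objective 1/2 ||y - Xw||^2 + l1_reg ||w||_1 + l2_reg/2 ||w||^2 corresponds to scikit-learn's alpha times n_samples:

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    X = rng.randn(50, 20)
    y = rng.randn(50)

    alpha = 0.1
    # l2_reg = 0 reduces the elastic-net update to the plain lasso
    w = enet_coordinate_descent2(np.zeros(X.shape[1]), 0.0,
                                 alpha * X.shape[0], X, y, max_iter=200)

    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=200).fit(X, y)
    print(np.abs(w - lasso.coef_).max())  # should be close to zero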


This version will be rewritten in Cython to speed things up. I hope that I can soon beat the execution time of the current implementation.
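A rough timing harness along these lines gives me a first impression of where the prototype stands; random data, so this is just a sanity check, not a rigorous benchmark:

    from time import time

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(42)
    X = rng.randn(500, 200)
    y = rng.randn(500)

    t0 = time()
    enet_coordinate_descent2(np.zeros(200), 0.0, 0.1 * 500, X, y, max_iter=100)
    print("prototype: %.3fs" % (time() - t0))

    t0 = time()
    Lasso(alpha=0.1, fit_intercept=False, max_iter=100).fit(X, y)
    print("scikit-learn: %.3fs" % (time() - t0))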
