LightGBM
Tags: machine-learning pip
The LightGBM gradient boosting framework for ML
LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:
- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support of parallel, distributed, and GPU learning.
- Capable of handling large-scale data.
Python quick start
pip install lightgbm
import lightgbm as lgb
The LightGBM Python module can load data from:
- LibSVM (zero-based) / TSV / CSV format text file
- NumPy 2D array(s), pandas DataFrame, H2O DataTable’s Frame, SciPy sparse matrix
- LightGBM binary file
- LightGBM Sequence object(s)
The data is stored in a Dataset object.
To load a LibSVM (zero-based) text file or a LightGBM binary file into Dataset:
train_data = lgb.Dataset('train.svm.bin')
To load a numpy array into Dataset:
import numpy as np
data = np.random.rand(500, 10) # 500 entities, each contains 10 features
label = np.random.randint(2, size=500) # binary target
train_data = lgb.Dataset(data, label=label)
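A pandas DataFrame can be passed the same way; a minimal sketch (column names are illustrative):
import pandas as pd
df = pd.DataFrame(np.random.rand(500, 3), columns=['f1', 'f2', 'f3'])  # 500 rows, 3 feature columns
label = np.random.randint(2, size=500)
train_data = lgb.Dataset(df, label=label)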
To load a scipy.sparse.csr_matrix array into Dataset:
import scipy.sparse
dat, row, col = np.array([1.0, 2.0, 3.0]), np.array([0, 1, 2]), np.array([0, 1, 2])  # example non-zero values and their positions
csr = scipy.sparse.csr_matrix((dat, (row, col)))
train_data = lgb.Dataset(csr)
Load from Sequence objects. We can implement the Sequence interface to read binary files. The following example shows reading an HDF5 file with h5py:
import h5py
class HDFSequence(lgb.Sequence):
    def __init__(self, hdf_dataset, batch_size):
        self.data = hdf_dataset
        self.batch_size = batch_size

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)
f = h5py.File('train.hdf5', 'r')
train_data = lgb.Dataset(HDFSequence(f['X'], 8192), label=f['Y'][:])
Saving Dataset into a LightGBM binary file will make loading faster:
train_data = lgb.Dataset('train.svm.txt')
train_data.save_binary('train.bin')
Create validation data. In LightGBM, the validation data should be aligned with training data.
validation_data = train_data.create_valid('validation.svm')
# or
validation_data = lgb.Dataset('validation.svm', reference=train_data)
Specific feature names and categorical features:
train_data = lgb.Dataset(data, label=label, feature_name=['c1', 'c2', 'c3'], categorical_feature=['c3'])
LightGBM can use categorical features as input directly. They don't need to be converted to one-hot encoding, which makes this much faster (about an 8x speed-up). But you should convert your categorical features to int type before you construct the Dataset.
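A minimal sketch of that conversion with pandas (the column names and category values are illustrative):
import pandas as pd
df = pd.DataFrame({
    'c1': np.random.rand(500),
    'c2': np.random.rand(500),
    'c3': np.random.choice(['low', 'mid', 'high'], size=500),
})
df['c3'] = df['c3'].astype('category').cat.codes  # string categories -> int codes
train_data = lgb.Dataset(df, label=label, categorical_feature=['c3'])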
Weights can be set when needed:
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)
# or
train_data = lgb.Dataset(data, label=label)
w = np.random.rand(500, )
train_data.set_weight(w)
You can use Dataset.set_init_score() to set the initial score, and Dataset.set_group() to set group/query data for ranking tasks.
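A minimal sketch for a ranking setup (the group sizes below are invented and must sum to the number of rows):
train_data = lgb.Dataset(data, label=label)
train_data.set_group([100, 200, 200])      # 3 queries; sizes must sum to the number of rows (500)
train_data.set_init_score(np.zeros(500))   # start boosting from a constant initial score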
Setting parameters:
# Booster parameters
param = {'num_leaves': 31, 'objective': 'binary'}
param['metric'] = 'auc'
# You can also specify multiple eval metrics
param['metric'] = ['auc', 'binary_logloss']
Training:
Training a model requires a parameter list and a data set:
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data])
# After training, the model can be saved to a file
bst.save_model('model.txt')
# or dumped to JSON
json_model = bst.dump_model()
# A saved model can be loaded
bst = lgb.Booster(model_file='model.txt')  # init model
Training with 5-fold CV:
lgb.cv(param, train_data, num_round, nfold=5)
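lgb.cv() returns a dict with the per-round mean and standard deviation of each metric across the folds; the exact key names vary by LightGBM version, so treat the key below as an assumption:
cv_results = lgb.cv(param, train_data, num_round, nfold=5)
# e.g. cv_results['valid auc-mean'] ('auc-mean' in older versions) holds the mean AUC per round
print(len(cv_results['valid auc-mean']))  # number of boosting rounds actually performed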
Early Stopping:
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in valid_sets. If there is more than one, it will use all of them except the training data.
bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data], early_stopping_rounds=5)
bst.save_model('model.txt', num_iteration=bst.best_iteration)
The model will train until the validation score stops improving. The validation score needs to improve at least once every early_stopping_rounds rounds to continue training.
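In recent LightGBM versions the early_stopping_rounds argument has been replaced by a callback; a sketch of the equivalent call:
bst = lgb.train(param, train_data, num_round,
                valid_sets=[validation_data],
                callbacks=[lgb.early_stopping(stopping_rounds=5)])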
Prediction:
# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)
If early stopping is enabled during training, you can get predictions from the best iteration with bst.best_iteration:
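ypred = bst.predict(data, num_iteration=bst.best_iteration)  # predict using the best number of rounds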
How LightGBM works
Parameters
Interactive description of the parameters
Format
params = {
    "monotone_constraints": [-1, 0, 1]
}
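One value per feature, in the same order as the training columns: -1 forces a decreasing relationship, 0 leaves the feature unconstrained, 1 forces an increasing one. A minimal sketch (the 3-feature layout is assumed):
data = np.random.rand(500, 3)   # 3 features -> 3 constraint values
label = np.random.randint(2, size=500)
params = {'objective': 'binary', 'monotone_constraints': [-1, 0, 1]}
bst = lgb.train(params, lgb.Dataset(data, label=label), num_boost_round=10)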
Parameters Tuning
Several approaches are available in LightGBM (see the sketch after the list):
- Tune Parameters for the Leaf-wise (Best-first) Tree
- For Faster Speed
- For Better Accuracy
- Deal with Over-fitting
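A rough illustration of how these recipes map onto parameters (typical knobs, not definitive values):
# For faster speed: coarser histograms and row/feature subsampling
fast = {'max_bin': 63, 'feature_fraction': 0.8, 'bagging_fraction': 0.8, 'bagging_freq': 5}
# For better accuracy: more leaves and a smaller learning rate (with more boosting rounds)
accurate = {'num_leaves': 63, 'learning_rate': 0.05, 'max_bin': 255}
# To deal with over-fitting: limit tree complexity and add regularization
regularized = {'num_leaves': 31, 'max_depth': 6, 'min_data_in_leaf': 50, 'lambda_l1': 0.1, 'lambda_l2': 0.1}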
Python API
GPU tutorial
Advanced Topics
- Missing Value Handle
- Categorical Feature Support
- LambdaRank
- Cost Efficient Gradient Boosting