nodegam.gams package#

Submodules#

nodegam.gams.EncodingBase module#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.

class nodegam.gams.EncodingBase.EncodingBase#

Bases: object

A base class for handling label or onehot encoding.

get_GAM_df(x_values_lookup=None, **kwargs)#
revert_dataframe(df)#
class nodegam.gams.EncodingBase.LabelEncodingClassifierMixin#

Bases: LabelEncodingFitMixin

predict_proba(X)#
class nodegam.gams.EncodingBase.LabelEncodingFitMixin#

Bases: EncodingBase

fit(X, y, **kwargs)#
my_fit(X, y)#
my_transform(X)#
revert_dataframe(df)#
class nodegam.gams.EncodingBase.LabelEncodingRegressorMixin#

Bases: LabelEncodingFitMixin

predict(X)#
class nodegam.gams.EncodingBase.OnehotEncodingClassifierMixin#

Bases: OnehotEncodingFitMixin

predict_proba(X)#
class nodegam.gams.EncodingBase.OnehotEncodingFitMixin#

Bases: EncodingBase

fit(X, y, **kwargs)#
predict(X)#
revert_dataframe(df)#

Convert the old one-hot-encoded df into a new df without one-hot encoding.

class nodegam.gams.EncodingBase.OnehotEncodingRegressorMixin#

Bases: OnehotEncodingFitMixin

nodegam.gams.MyBagging module#

Adapted from https://github.com/zzzace2000/GAMs_models/.

It implements bagging of GAM models. Unlike sklearn's implementation of BaggingClassifier, it averages the logits instead of the probabilities, so that a bagged ensemble of GAMs is still a GAM. It also implements get_GAM_df(), which automatically averages the GAMs under bagging.

Usage:
>>> from nodegam.gams.MyXGB import MyXGBClassifier
>>> from nodegam.gams.MyBagging import MyBaggingClassifier
>>> base_model = MyXGBClassifier()
>>> # Train an XGB-GAM with 10 bagged estimators
>>> bag_model = MyBaggingClassifier(base_estimator=base_model, n_estimators=10)
>>> bag_model.fit(X, y)
>>> df = bag_model.get_GAM_df()
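The claim that averaging logits keeps the ensemble a GAM can be checked numerically. The sketch below is not part of nodegam: `logit_a` and `logit_b` are made-up additive models, and `interaction` is a hypothetical helper that measures the second-order difference of a two-feature function, which is zero exactly when the function is additive.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Two hypothetical additive models on two features (illustrative only).
def logit_a(x1, x2):
    return 0.5 * x1 + math.sin(x2)


def logit_b(x1, x2):
    return -0.2 * x1 ** 2 + 0.3 * x2


def average_of_logits(x1, x2):
    # Averaging on the logit scale: a sum of additive functions stays additive.
    return 0.5 * (logit_a(x1, x2) + logit_b(x1, x2))


def logit_of_average_probability(x1, x2):
    # Averaging on the probability scale, then mapping back to a logit.
    p = 0.5 * (sigmoid(logit_a(x1, x2)) + sigmoid(logit_b(x1, x2)))
    return math.log(p / (1.0 - p))


def interaction(fn, x1, x2, ref1=0.0, ref2=0.0):
    """Second-order difference; it is zero for any additive function."""
    return fn(x1, x2) - fn(x1, ref2) - fn(ref1, x2) + fn(ref1, ref2)


gap_logits = interaction(average_of_logits, 1.5, -2.0)
gap_probs = interaction(logit_of_average_probability, 1.5, -2.0)
```

With these toy models `gap_logits` vanishes up to rounding, while `gap_probs` does not, which means the probability-averaged ensemble has picked up a feature interaction.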

class nodegam.gams.MyBagging.MyBaggingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)#

Bases: OnehotEncodingClassifierMixin, MyBaggingClassifierBase, MyCommonBase

The bagging for the base estimator GAM.

It overrides sklearn.ensemble.BaggingClassifier to (1) ensemble the logits and NOT the probabilities, so that the result is still a GAM, and (2) support GAM df extraction.

Parameters
  • base_estimator – the base estimator model.

  • n_estimators – the number of base estimators in the bagging ensemble.

  • max_samples (int or float) – The number of samples to draw from X to train each base estimator (with replacement by default, see bootstrap for more details). - If int, then draw max_samples samples. - If float, then draw max_samples * X.shape[0] samples.

  • max_features (int or float) – The number of features to draw from X to train each base estimator (without replacement by default, see bootstrap_features for more details). - If int, then draw max_features features. - If float, then draw max_features * X.shape[1] features.

  • bootstrap (bool) – Whether samples are drawn with replacement. If False, sampling without replacement is performed.

  • bootstrap_features (bool) – Whether features are drawn with replacement.

  • oob_score (bool) – Whether to use out-of-bag samples to estimate the generalization error. Only available if bootstrap=True.

  • warm_start (bool) – When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble.

  • n_jobs (int) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • random_state – random state.

  • verbose – verbose.

class nodegam.gams.MyBagging.MyBaggingClassifierBase(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)#

Bases: MyBaggingMixin, BaggingClassifier

predict_proba(X, parallel=False)#

Overridden to average the log-odds of the base estimators instead of their probabilities.

The rest of this description is inherited from sklearn's BaggingClassifier.predict_proba, where the predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the base estimators in the ensemble. If base estimators do not implement a predict_proba method, then it resorts to voting and the predicted class probabilities of an input sample represent the proportion of estimators predicting each class.

Parameters
  • X – {array-like, sparse matrix} of shape = [n_samples, n_features] The training input samples. Sparse matrices are accepted only if they are supported by the base estimator.

  • parallel – if True, predict outputs using parallel threads to speed up. But in xgboost, the base estimator already uses multiple threads so it actually slows down.

Returns

p – array of shape = [n_samples, n_classes]. The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
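As a rough illustration of the log-odds averaging described for this method, the sketch below combines binary class probabilities from several estimators. It is an assumption-laden stand-in, not the library's code: `average_log_odds_proba` and its `eps` clipping are hypothetical, and only the binary case is handled.

```python
import numpy as np


def average_log_odds_proba(probas, eps=1e-12):
    """Combine binary probabilities by averaging log-odds (hypothetical helper).

    probas: list of arrays with shape [n_samples, 2], one per base estimator.
    """
    # Positive-class probabilities, clipped to keep the logarithms finite.
    pos = np.stack([np.clip(p[:, 1], eps, 1.0 - eps) for p in probas])
    log_odds = np.log(pos) - np.log1p(-pos)
    mean_log_odds = log_odds.mean(axis=0)
    # Map the averaged log-odds back to a probability.
    pos_prob = 1.0 / (1.0 + np.exp(-mean_log_odds))
    return np.column_stack([1.0 - pos_prob, pos_prob])


# Two made-up estimators predicting 0.5 and 0.9 for the positive class.
combined = average_log_odds_proba([
    np.array([[0.5, 0.5]]),
    np.array([[0.1, 0.9]]),
])
```

Here the averaged log-odds is log(3), so the combined positive-class probability is 0.75 rather than the 0.7 that plain probability averaging would give.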

class nodegam.gams.MyBagging.MyBaggingLabelEncodingClassifier(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)#

Bases: LabelEncodingClassifierMixin, MyBaggingClassifierBase, MyCommonBase

class nodegam.gams.MyBagging.MyBaggingLabelEncodingRegressor(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)#

Bases: LabelEncodingRegressorMixin, MyBaggingRegressorBase, MyCommonBase

class nodegam.gams.MyBagging.MyBaggingMixin#

Bases: MyGAMPlotMixinBase

get_GAM_df(x_values_lookup=None, get_y_std=True)#

Get the GAM graph parameter.

Parameters
  • x_values_lookup – a dictionary mapping each feature name to its corresponding unique x values in increasing order. E.g. {‘BUN’: [1.1, 1.5, 3.1, 5.0], ‘cancer’: [0, 1]}.

  • get_y_std – whether to compute the error bar of y. It’s slower if this is set to True. Default: True

Returns

A dataframe of GAM graph.
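To make the averaging and error-bar idea concrete, here is a minimal sketch that reuses the example x_values_lookup from the parameter description. The per-model outputs in `per_model_y` are invented numbers, and the layout of `summary` is an assumption; the dataframe actually returned by get_GAM_df() may be organised differently.

```python
import numpy as np

x_values_lookup = {"BUN": [1.1, 1.5, 3.1, 5.0], "cancer": [0, 1]}

# Made-up shape-function outputs of three bagged models, one row per model.
per_model_y = {
    "BUN": np.array([[-0.2, 0.0, 0.3, 0.8],
                     [-0.4, 0.2, 0.5, 0.6],
                     [-0.3, 0.1, 0.4, 0.7]]),
    "cancer": np.array([[-0.1, 0.5],
                        [0.1, 0.7],
                        [0.0, 0.6]]),
}

# Mean across models gives the bagged graph; std across models gives an error bar.
summary = {
    name: {
        "x": x_values_lookup[name],
        "y": ys.mean(axis=0),
        "y_std": ys.std(axis=0),
    }
    for name, ys in per_model_y.items()
}
```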

property is_GAM#

Returns True if it’s a GAM.

class nodegam.gams.MyBagging.MyBaggingRegressor(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)#

Bases: OnehotEncodingRegressorMixin, MyBaggingRegressorBase, MyCommonBase

The bagging for the base estimator GAM regressor.

It overwrites the sklearn.ensemble.BaggingRegressor to support GAM df extraction.

Parameters
  • base_estimator – the base estimator model.

  • n_estimators – the number of base estimators in the bagging ensemble.

  • max_samples (int or float) – The number of samples to draw from X to train each base estimator (with replacement by default, see bootstrap for more details). - If int, then draw max_samples samples. - If float, then draw max_samples * X.shape[0] samples.

  • max_features (int or float) – The number of features to draw from X to train each base estimator (without replacement by default, see bootstrap_features for more details). - If int, then draw max_features features. - If float, then draw max_features * X.shape[1] features.

  • bootstrap (bool) – Whether samples are drawn with replacement. If False, sampling without replacement is performed.

  • bootstrap_features (bool) – Whether features are drawn with replacement.

  • oob_score (bool) – Whether to use out-of-bag samples to estimate the generalization error. Only available if bootstrap=True.

  • warm_start (bool) – When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new ensemble.

  • n_jobs (int) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • random_state – random state.

  • verbose – verbose.

class nodegam.gams.MyBagging.MyBaggingRegressorBase(base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)#

Bases: MyBaggingMixin, BaggingRegressor

nodegam.gams.MyBaselines module#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.

class nodegam.gams.MyBaselines.MyEBMPreprocessorTransformMixin(binning='uniform', **kwargs)#

Bases: object

fit(X, y)#
transform(X)#
class nodegam.gams.MyBaselines.MyIndicatorLinearRegressionCV(binning='uniform', **kwargs)#

Bases: LabelEncodingRegressorMixin, MyGAMPlotMixinBase, MyEBMPreprocessorTransformMixin, MyIndicatorTransformMixin, MyTransformRegressionMixin, MyLinearRegressionCVBase

class nodegam.gams.MyBaselines.MyIndicatorLogisticRegressionCV(binning='uniform', **kwargs)#

Bases: LabelEncodingClassifierMixin, MyGAMPlotMixinBase, MyEBMPreprocessorTransformMixin, MyIndicatorTransformMixin, MyTransformClassifierMixin, MyLogisticRegressionCVBase

class nodegam.gams.MyBaselines.MyIndicatorTransformMixin#

Bases: object

fit(X, y)#
transform(X)#
class nodegam.gams.MyBaselines.MyLinearRegressionCVBase(alphas=array([0.001, 0.00351119173, 0.0123284674, 0.0432876128, 0.151991108, 0.533669923, 1.87381742, 6.57933225, 23.101297, 81.1130831, 284.803587, 1000.0]), **kwargs)#

Bases: RidgeCV
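The default alphas in the signature above look like a log-spaced grid. As a quick check (an observation about the printed values, not a statement about how the source constructs them), twelve points from 1e-3 to 1e3 generated with np.logspace reproduce the listed array:

```python
import numpy as np

# Twelve log-spaced regularisation strengths between 1e-3 and 1e3.
alphas = np.logspace(-3, 3, 12)

# Values copied from the documented default signature.
documented = np.array([
    0.001, 0.00351119173, 0.0123284674, 0.0432876128, 0.151991108,
    0.533669923, 1.87381742, 6.57933225, 23.101297, 81.1130831,
    284.803587, 1000.0,
])

matches = np.allclose(alphas, documented, rtol=1e-6)
```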

class nodegam.gams.MyBaselines.MyLinearRegressionRidgeCV(*args, **kwargs)#

Bases: OnehotEncodingRegressorMixin, MyGAMPlotMixinBase, MyStandardizedTransformMixin, MyTransformRegressionMixin, MyLinearRegressionCVBase

class nodegam.gams.MyBaselines.MyLogisticRegressionCV(*args, **kwargs)#

Bases: OnehotEncodingClassifierMixin, MyGAMPlotMixinBase, MyStandardizedTransformMixin, MyTransformClassifierMixin, MyLogisticRegressionCVBase

class nodegam.gams.MyBaselines.MyLogisticRegressionCVBase(Cs=12, cv=5, penalty='l2', random_state=1377, solver='lbfgs', max_iter=3000, n_jobs=-1, **kwargs)#

Bases: LogisticRegressionCV

class nodegam.gams.MyBaselines.MyMarginalLinearRegressionCV(binning='uniform', **kwargs)#

Bases: LabelEncodingRegressorMixin, MyGAMPlotMixinBase, MyEBMPreprocessorTransformMixin, MyMarginalizedTransformMixin, MyTransformRegressionMixin, MyLinearRegressionCVBase

class nodegam.gams.MyBaselines.MyMarginalLogisticRegressionCV(binning='uniform', **kwargs)#

Bases: LabelEncodingClassifierMixin, MyGAMPlotMixinBase, MyEBMPreprocessorTransformMixin, MyMarginalizedTransformMixin, MyTransformClassifierMixin, MyLogisticRegressionCVBase

class nodegam.gams.MyBaselines.MyMarginalizedTransformMixin(*args, **kwargs)#

Bases: object

fit(X, y)#
transform(X)#
class nodegam.gams.MyBaselines.MyMaxMinTransformMixin(*args, **kwargs)#

Bases: object

fit(X, y)#
transform(X)#
class nodegam.gams.MyBaselines.MyRandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)#

Bases: LabelEncodingClassifierMixin, MyCommonBase, RandomForestClassifier

property is_GAM#

Returns True if it’s a GAM.

class nodegam.gams.MyBaselines.MyRandomForestRegressor(n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)#

Bases: LabelEncodingRegressorMixin, MyCommonBase, RandomForestRegressor

property is_GAM#

Returns True if it’s a GAM.

class nodegam.gams.MyBaselines.MyStandardizedTransformMixin(*args, **kwargs)#

Bases: object

fit(X, y)#
transform(X)#
class nodegam.gams.MyBaselines.MyTransformClassifierMixin#

Bases: MyTransformMixin

predict_proba(X)#
class nodegam.gams.MyBaselines.MyTransformMixin#

Bases: object

transform(X)#
class nodegam.gams.MyBaselines.MyTransformRegressionMixin#

Bases: MyTransformMixin

predict(X)#

nodegam.gams.MyEBM module#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.

class nodegam.gams.MyEBM.MyExplainableBoostingClassifier(feature_names=None, feature_types=None, max_bins=256, max_interaction_bins=32, binning='quantile', mains='all', interactions=10, outer_bags=8, inner_bags=0, learning_rate=0.01, validation_size=0.15, early_stopping_rounds=50, early_stopping_tolerance=0.0001, max_rounds=5000, min_samples_leaf=2, max_leaves=3, n_jobs=-2, random_state=42)#

Bases: LabelEncodingClassifierMixin, MyExplainableBoostingMixin, ExplainableBoostingClassifier

Explainable Boosting Classifier. The arguments will change in a future release, watch the changelog.

Parameters
  • feature_names – List of feature names.

  • feature_types – List of feature types.

  • max_bins – Max number of bins per feature for pre-processing stage.

  • max_interaction_bins – Max number of bins per feature for pre-processing stage on interaction terms. Only used if interactions is non-zero.

  • binning – Method to bin values for pre-processing. Choose “uniform”, “quantile” or “quantile_humanized”.

  • mains – Features to be trained on in main effects stage. Either “all” or a list of feature indexes.

  • interactions – Interactions to be trained on. Either a list of lists of feature indices, or an integer for number of automatically detected interactions. Interactions are forcefully set to 0 for multiclass problems.

  • outer_bags – Number of outer bags.

  • inner_bags – Number of inner bags.

  • learning_rate – Learning rate for boosting.

  • validation_size – Validation set size for boosting.

  • early_stopping_rounds – Number of rounds of no improvement to trigger early stopping.

  • early_stopping_tolerance – Tolerance that dictates the smallest delta required to be considered an improvement.

  • max_rounds – Number of rounds for boosting.

  • min_samples_leaf – Minimum number of cases for tree splits used in boosting.

  • max_leaves – Maximum leaf nodes used in boosting.

  • n_jobs – Number of jobs to run in parallel.

  • random_state – Random state.
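The interplay between early_stopping_rounds and early_stopping_tolerance can be sketched as a simple loop over validation losses. This is a generic early-stopping rule written under the assumption that "improvement" means the loss drops by more than the tolerance; `stopping_round` is a hypothetical helper and the exact criterion used inside the EBM implementation may differ.

```python
def stopping_round(val_losses, early_stopping_rounds=50,
                   early_stopping_tolerance=1e-4):
    """Return the index of the round at which boosting would stop."""
    best = float("inf")
    rounds_without_improvement = 0
    for i, loss in enumerate(val_losses):
        if best - loss > early_stopping_tolerance:
            # A large enough drop counts as an improvement and resets the counter.
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= early_stopping_rounds:
                return i
    # Never triggered: training ran through every available round.
    return len(val_losses) - 1


# The drops after round 1 are smaller than the tolerance, so they do not count.
stopped_at = stopping_round([1.0, 0.5, 0.49999, 0.49998, 0.49997],
                            early_stopping_rounds=3,
                            early_stopping_tolerance=1e-4)
```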

class nodegam.gams.MyEBM.MyExplainableBoostingMixin#

Bases: MyCommonBase

fit(X, y)#
get_GAM_df(x_values_lookup=None)#
class nodegam.gams.MyEBM.MyExplainableBoostingRegressor(feature_names=None, feature_types=None, max_bins=256, max_interaction_bins=32, binning='quantile', mains='all', interactions=10, outer_bags=8, inner_bags=0, learning_rate=0.01, validation_size=0.15, early_stopping_rounds=50, early_stopping_tolerance=0.0001, max_rounds=5000, min_samples_leaf=2, max_leaves=3, n_jobs=-2, random_state=42)#

Bases: LabelEncodingRegressorMixin, MyExplainableBoostingMixin, ExplainableBoostingRegressor

Explainable Boosting Regressor. The arguments will change in a future release, watch the changelog.

Parameters
  • feature_names – List of feature names.

  • feature_types – List of feature types.

  • max_bins – Max number of bins per feature for pre-processing stage on main effects.

  • max_interaction_bins – Max number of bins per feature for pre-processing stage on interaction terms. Only used if interactions is non-zero.

  • binning – Method to bin values for pre-processing. Choose “uniform”, “quantile”, or “quantile_humanized”.

  • mains – Features to be trained on in main effects stage. Either “all” or a list of feature indexes.

  • interactions – Interactions to be trained on. Either a list of lists of feature indices, or an integer for number of automatically detected interactions.

  • outer_bags – Number of outer bags.

  • inner_bags – Number of inner bags.

  • learning_rate – Learning rate for boosting.

  • validation_size – Validation set size for boosting.

  • early_stopping_rounds – Number of rounds of no improvement to trigger early stopping.

  • early_stopping_tolerance – Tolerance that dictates the smallest delta required to be considered an improvement.

  • max_rounds – Number of rounds for boosting.

  • min_samples_leaf – Minimum number of cases for tree splits used in boosting.

  • max_leaves – Maximum leaf nodes used in boosting.

  • n_jobs – Number of jobs to run in parallel.

  • random_state – Random state.
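The difference between the “uniform” and “quantile” binning options can be illustrated on a skewed feature. The snippet below is a generic NumPy sketch with invented data, not the preprocessing code used by the EBM classes.

```python
import numpy as np

# A skewed, made-up feature: most values are small, a few are very large.
values = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 50.0, 100.0, 1000.0])
n_bins = 4

# Uniform binning: equally wide bins between the minimum and maximum.
uniform_edges = np.linspace(values.min(), values.max(), n_bins + 1)
# Quantile binning: edges chosen so each bin holds a similar number of samples.
quantile_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))

uniform_counts = np.histogram(values, bins=uniform_edges)[0]
quantile_counts = np.histogram(values, bins=quantile_edges)[0]
```

On this data the uniform bins put seven of the eight samples into the first bin, while the quantile bins hold two samples each.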

class nodegam.gams.MyEBM.MyOnehotExplainableBoostingClassifier(feature_names=None, feature_types=None, max_bins=256, max_interaction_bins=32, binning='quantile', mains='all', interactions=10, outer_bags=8, inner_bags=0, learning_rate=0.01, validation_size=0.15, early_stopping_rounds=50, early_stopping_tolerance=0.0001, max_rounds=5000, min_samples_leaf=2, max_leaves=3, n_jobs=-2, random_state=42)#

Bases: OnehotEncodingClassifierMixin, MyFitMixin, MyExplainableBoostingMixin, ExplainableBoostingClassifier

Explainable Boosting Classifier. The arguments will change in a future release, watch the changelog.

Parameters
  • feature_names – List of feature names.

  • feature_types – List of feature types.

  • max_bins – Max number of bins per feature for pre-processing stage.

  • max_interaction_bins – Max number of bins per feature for pre-processing stage on interaction terms. Only used if interactions is non-zero.

  • binning – Method to bin values for pre-processing. Choose “uniform”, “quantile” or “quantile_humanized”.

  • mains – Features to be trained on in main effects stage. Either “all” or a list of feature indexes.

  • interactions – Interactions to be trained on. Either a list of lists of feature indices, or an integer for number of automatically detected interactions. Interactions are forcefully set to 0 for multiclass problems.

  • outer_bags – Number of outer bags.

  • inner_bags – Number of inner bags.

  • learning_rate – Learning rate for boosting.

  • validation_size – Validation set size for boosting.

  • early_stopping_rounds – Number of rounds of no improvement to trigger early stopping.

  • early_stopping_tolerance – Tolerance that dictates the smallest delta required to be considered an improvement.

  • max_rounds – Number of rounds for boosting.

  • min_samples_leaf – Minimum number of cases for tree splits used in boosting.

  • max_leaves – Maximum leaf nodes used in boosting.

  • n_jobs – Number of jobs to run in parallel.

  • random_state – Random state.

class nodegam.gams.MyEBM.MyOnehotExplainableBoostingRegressor(feature_names=None, feature_types=None, max_bins=256, max_interaction_bins=32, binning='quantile', mains='all', interactions=10, outer_bags=8, inner_bags=0, learning_rate=0.01, validation_size=0.15, early_stopping_rounds=50, early_stopping_tolerance=0.0001, max_rounds=5000, min_samples_leaf=2, max_leaves=3, n_jobs=-2, random_state=42)#

Bases: OnehotEncodingRegressorMixin, MyFitMixin, MyExplainableBoostingMixin, ExplainableBoostingRegressor

Explainable Boosting Regressor. The arguments will change in a future release, watch the changelog.

Parameters
  • feature_names – List of feature names.

  • feature_types – List of feature types.

  • max_bins – Max number of bins per feature for pre-processing stage on main effects.

  • max_interaction_bins – Max number of bins per feature for pre-processing stage on interaction terms. Only used if interactions is non-zero.

  • binning – Method to bin values for pre-processing. Choose “uniform”, “quantile”, or “quantile_humanized”.

  • mains – Features to be trained on in main effects stage. Either “all” or a list of feature indexes.

  • interactions – Interactions to be trained on. Either a list of lists of feature indices, or an integer for number of automatically detected interactions.

  • outer_bags – Number of outer bags.

  • inner_bags – Number of inner bags.

  • learning_rate – Learning rate for boosting.

  • validation_size – Validation set size for boosting.

  • early_stopping_rounds – Number of rounds of no improvement to trigger early stopping.

  • early_stopping_tolerance – Tolerance that dictates the smallest delta required to be considered an improvement.

  • max_rounds – Number of rounds for boosting.

  • min_samples_leaf – Minimum number of cases for tree splits used in boosting.

  • max_leaves – Maximum leaf nodes used in boosting.

  • n_jobs – Number of jobs to run in parallel.

  • random_state – Random state.

nodegam.gams.MySpline module#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.

class nodegam.gams.MySpline.MySplineGAM(**kwargs)#

Bases: OnehotEncodingRegressorMixin, MySplineGAMBase

Spline for Regression with one-hot encoding for cat features.

Parameters
  • search (bool) – if True, it searches the best lam penalty for the model.

  • search_lam (list or numpy array) – the range of lam penalty to search. If None, it is set to np.linspace(-3, 3, 15).

  • max_iter (int) – maximum iterations to train.

  • n_splines (int) – number of splines. Default: 50.

  • cat_features (list) – the column names of the categorical features. Default: None.

class nodegam.gams.MySpline.MySplineGAMBase(**kwargs)#

Bases: MyFitMixin, MySplineMixin

predict(X)#

Predict regression target.

Parameters

X (pandas dataframe) – inputs.

Returns

prob (numpy array) – the prediction of shape [N].

class nodegam.gams.MySpline.MySplineLogisticGAM(**kwargs)#

Bases: OnehotEncodingClassifierMixin, MySplineLogisticGAMBase

Logistic Spline for binary classification with one-hot encoding for cat features.

Parameters
  • search (bool) – if True, it searches the best lam penalty for the model.

  • search_lam (list or numpy array) – the range of lam penalty to search. If None, it is set to np.linspace(-3, 3, 15).

  • max_iter (int) – maximum iterations to train.

  • n_splines (int) – number of splines. Default: 50.

  • cat_features (list) – the column names of the categorical features. Default: None.

class nodegam.gams.MySpline.MySplineLogisticGAMBase(**kwargs)#

Bases: MyFitMixin, MySplineMixin

predict_proba(X)#

Predict Probability.

Parameters

X (pandas dataframe) – inputs.

Returns

prob (numpy array) – the probability of both classes with shape [N, 2].

class nodegam.gams.MySpline.MySplineMixin(model_cls, search=True, search_lam=None, max_iter=500, n_splines=50, fit_binary_feat_as_factor_term=False, cat_features=None, **kwargs)#

Bases: MyExtractLogOddsMixin

fit(X, y, **kwargs)#
get_lam()#

Return the lambda penalty.

get_params(*args, **kwargs)#

Return the parameters.

set_params(*args, **kwargs)#

nodegam.gams.MyXGB module#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.

class nodegam.gams.MyXGB.MyXGBClassifier(*args, **kwargs)#

Bases: MyGAMPlotMixinBase, MyXGBMixin

predict_proba(data, ntree_limit=None, validate_features=True)#
class nodegam.gams.MyXGB.MyXGBLabelEncodingClassifier(*args, **kwargs)#

Bases: LabelEncodingClassifierMixin, MyXGBClassifier

class nodegam.gams.MyXGB.MyXGBLabelEncodingRegressor(*args, **kwargs)#

Bases: LabelEncodingRegressorMixin, MyXGBRegressor

class nodegam.gams.MyXGB.MyXGBMixin(max_depth=1, random_state=1377, n_estimators=5000, n_jobs=-1, model_cls=<class 'xgboost.sklearn.XGBClassifier'>, validation_size=0.15, early_stopping_rounds=50, objective=None, **kwargs)#

Bases: object

fit(X, y, verbose=False, **kwargs)#
get_params(*args, **kwargs)#
property is_GAM#
set_params(*args, **kwargs)#
class nodegam.gams.MyXGB.MyXGBOnehotClassifier(*args, **kwargs)#

Bases: OnehotEncodingClassifierMixin, MyXGBClassifier

XGB-GAM Classifier with one-hot encoding for categorical features.

Parameters
  • max_depth=1 – The depth of each tree. Should be set to 1 for the model to remain a GAM.

  • random_state=1377 – Seed.

  • n_estimators=5000 – Maximum number of rounds to fit.

  • n_jobs=-1 – Set to -1 to use multi-thread parallel training.

  • validation_size=0.15 – The validation proportion.

  • early_stopping_rounds=50 – Early stopping rounds.

  • objective='binary:logistic' – The validation objective.
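The reason max_depth=1 keeps the boosted model a GAM is that every depth-1 tree (a stump) splits on a single feature, so the sum of stumps decomposes into one function per feature. The following pure-Python sketch uses hand-written stumps rather than xgboost itself; `STUMPS`, `boosted_score`, and `shape_function` are hypothetical names for illustration.

```python
# Each stump: (feature index, threshold, output if x <= threshold, output otherwise).
STUMPS = [
    (0, 1.0, -0.3, 0.4),
    (1, 0.0, 0.2, -0.1),
    (0, 2.5, 0.0, 0.6),
]


def boosted_score(x):
    """Score of the whole ensemble: the sum of all stump outputs."""
    total = 0.0
    for feature, threshold, left, right in STUMPS:
        total += left if x[feature] <= threshold else right
    return total


def shape_function(feature, value):
    """Contribution of one feature: the sum of the stumps that split on it."""
    total = 0.0
    for f, threshold, left, right in STUMPS:
        if f == feature:
            total += left if value <= threshold else right
    return total


x = [2.0, -1.0]
additive_sum = shape_function(0, x[0]) + shape_function(1, x[1])
```

For any input, `boosted_score(x)` equals the sum of the per-feature shape functions. A tree of depth 2 or more could split on two different features along one path, which would break this decomposition.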

class nodegam.gams.MyXGB.MyXGBOnehotRegressor(*args, **kwargs)#

Bases: OnehotEncodingRegressorMixin, MyXGBRegressor

XGB-GAM Regressor with one-hot encoding for categorical features.

Parameters
  • max_depth=1 – The depth of each tree. Should be set to 1 for the model to remain a GAM.

  • random_state=1377 – Seed.

  • n_estimators=5000 – Maximum number of rounds to fit.

  • n_jobs=-1 – Set to -1 to use multi-thread parallel training.

  • validation_size=0.15 – The validation proportion.

  • early_stopping_rounds=50 – Early stopping rounds.

  • objective='reg:squarederror' – The validation objective.

class nodegam.gams.MyXGB.MyXGBRegressor(*args, **kwargs)#

Bases: MyGAMPlotMixinBase, MyXGBMixin

predict(data, output_margin=False, ntree_limit=None, validate_features=True)#

nodegam.gams.base module#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.

class nodegam.gams.base.MyCommonBase#

Bases: object

property is_GAM#

Returns True if it’s a GAM.

property param_distributions#
class nodegam.gams.base.MyExtractLogOddsMixin#

Bases: MyCommonBase

Extract the output from the underlying model.

It uses the predict function to extract the log odds from the underlying model. This is useful for a black-box model from which the marginal plots are hard to extract directly. The GAM can then be extracted with “get_GAM_df(self, x_values_lookup=None)”.

Requirement:

the cls needs to implement one of: 1) predict(): for regression models. 2) predict_proba(): for binary classification.

get_GAM_df(x_values_lookup=None, center=True)#

Get the GAM dataframe.

Parameters
  • x_values_lookup (dict) – the unique values of X for each feature. If passed, the outputs of the GAM model w.r.t. these x values are extracted. Useful to get a coarser graph when there are too many unique values in a feature.

  • center (bool) – if True, it centers each GAM graph to 0 by moving its mean to the intercept term.

Returns

df (pandas dataframe) – a GAM dataframe where each row represents a GAM term with the inputs x, outputs y, and feature importance.
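Centering, as described for the center parameter, shifts each graph so that its mean is zero and adds that mean to the intercept, leaving predictions unchanged. The sketch below assumes the mean is weighted by how often each x value occurs in the data; that weighting, and the helper name `center_term`, are assumptions rather than documented behaviour.

```python
import numpy as np


def center_term(y, counts, intercept):
    """Center one GAM term and move its mean into the intercept (hypothetical)."""
    y = np.asarray(y, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Mean of the term over the data, weighting each x value by its frequency.
    mean = np.average(y, weights=counts)
    return y - mean, intercept + mean


# A term with two x values, seen 3 times and 1 time respectively.
y_centered, new_intercept = center_term([1.0, 3.0], counts=[3, 1], intercept=0.2)
```

The total score `intercept + y` is the same before and after centering, so only the split between intercept and graph changes.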

class nodegam.gams.base.MyFitMixin#

Bases: object

A mixin that records the feature names and value counts when fit() is called.

It overrides fit() to record self.feature_names and self.X_value_counts. It then calls super().fit() if such a function exists, or silently returns if not.
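The pattern described here, recording metadata and then delegating to the next fit() in the method resolution order if one exists, can be sketched as follows. This is an illustration, not the nodegam source: `RecordFitMixin`, `MeanModel`, and the use of a plain dict of columns in place of a DataFrame are all assumptions.

```python
from collections import Counter


class RecordFitMixin:
    """Hypothetical stand-in for a fit-recording mixin."""

    def fit(self, X, y, **kwargs):
        # X is a dict of column name -> list of values, standing in for a DataFrame.
        self.feature_names = list(X)
        self.X_value_counts = {name: Counter(col) for name, col in X.items()}
        # Delegate to the next class in the MRO only if it defines fit().
        parent_fit = getattr(super(), "fit", None)
        if callable(parent_fit):
            return parent_fit(X, y, **kwargs)
        return self


class MeanModel:
    """Toy estimator that predicts the mean of y."""

    def fit(self, X, y, **kwargs):
        self.mean_ = sum(y) / len(y)
        return self


class RecordedMeanModel(RecordFitMixin, MeanModel):
    pass


class RecordOnly(RecordFitMixin):
    pass


model = RecordedMeanModel().fit({"a": [1, 1, 2], "b": [0, 1, 1]}, [1.0, 2.0, 3.0])
```

Placing the mixin first in the base-class list is what lets its fit() run before the estimator's own fit().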

fit(X, y, **kwargs)#
class nodegam.gams.base.MyGAMPlotMixinBase#

Bases: MyFitMixin, MyExtractLogOddsMixin

nodegam.gams.general_utils module#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.

class nodegam.gams.general_utils.Timer(name, remove_start_msg=True)#

Bases: object

nodegam.gams.general_utils.output_csv(the_path, data_dict, order=None, delimiter=',')#
nodegam.gams.general_utils.vector_in(vec, names)#

nodegam.gams.model_utils module#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.

nodegam.gams.model_utils.get_ebm_model(model_name, problem, random_state=1377, **kwargs)#
nodegam.gams.model_utils.get_ilr_model(model_name, problem, random_state=1377, **kwargs)#

Get Indicator Logistic Regression

nodegam.gams.model_utils.get_lr_model(model_name, problem, random_state=1377, **kwargs)#
nodegam.gams.model_utils.get_mlr_model(model_name, problem, random_state=1377, **kwargs)#

Get Marginal Logistic Regression

nodegam.gams.model_utils.get_model(X_train, y_train, problem, model_name, random_state=1377, **kwargs)#
nodegam.gams.model_utils.get_rf_model(model_name, problem, random_state=1377, **kwargs)#
nodegam.gams.model_utils.get_spline_model(model_name, problem, random_state=1377, **kwargs)#
nodegam.gams.model_utils.get_xgb_model(model_name, problem, random_state=1377, **kwargs)#

nodegam.gams.utils module#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.

class nodegam.gams.utils.DotDict(*args, **kwargs)#

Bases: dict

dot.notation access to dictionary attributes
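A common recipe for dot-notation access on a dict is shown below. It is a generic sketch under a different name (`AttrDict`), so it should not be read as the exact implementation of DotDict.

```python
class AttrDict(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    # Attribute lookup falls back to dict.get, so missing keys yield None.
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


cfg = AttrDict(lr=0.01)
cfg.n_estimators = 10          # same as cfg["n_estimators"] = 10
learning_rate = cfg.lr         # same as cfg["lr"]
missing = cfg.not_there        # None, because dict.get has no default here
```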

class nodegam.gams.utils.Timer(name, remove_start_msg=True)#

Bases: object

nodegam.gams.utils.bin_data(X, max_n_bins=256)#

Do a quantile binning for the X.

Parameters
  • X – the pandas table or numpy array with shape as [N, D] where N is number of samples and D is number of features.

  • max_n_bins – the maximum number of bins per feature. Default: 256.

Returns

Binned X with the same input type (pandas table or numpy array)
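A minimal sketch of quantile binning for a single column follows. It assumes each value is replaced by the left edge of its quantile bin and that columns with few unique values are left untouched; both choices, and the name `quantile_bin_column`, are illustrative assumptions rather than the behaviour of bin_data itself.

```python
import numpy as np


def quantile_bin_column(col, max_n_bins=256):
    """Quantile-bin one feature column (hypothetical helper)."""
    col = np.asarray(col, dtype=float)
    if len(np.unique(col)) <= max_n_bins:
        # Already coarse enough: nothing to bin.
        return col.copy()
    edges = np.quantile(col, np.linspace(0, 1, max_n_bins + 1))
    # Index of the bin each value falls in; the maximum goes into the last bin.
    idx = np.clip(np.searchsorted(edges, col, side="right") - 1, 0, max_n_bins - 1)
    # Represent each value by the left edge of its bin.
    return edges[idx]


binned = quantile_bin_column(np.arange(100.0), max_n_bins=4)
unchanged = quantile_bin_column([0.0, 1.0, 1.0], max_n_bins=4)
```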

nodegam.gams.utils.extract_GAM(X, predict_fn, predict_type='binary_logodds', max_n_bins=None)#

Parameters
  • X – input 2d array.

  • predict_fn – the model prediction function.

  • predict_type – choose from [“binary_logodds”, “binary_prob”, “regression”]. This corresponds to which predict_fn to pass in.

  • max_n_bins – default set as None (no binning). It bins the values into this number of buckets to reduce clutter in the resulting GAM graph. Should be set large enough to not change the predictions too much.
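One way to recover per-feature graphs from a prediction function alone is to vary a single feature while holding the others at a baseline row. The sketch below does this for a predict_fn that already returns scores on the additive scale (log-odds or regression output). It is an assumed approach for illustration, and `extract_additive_terms` and `score_from_terms` are hypothetical helpers, not the algorithm used by extract_GAM.

```python
import numpy as np


def extract_additive_terms(X, predict_fn):
    """Probe predict_fn one feature at a time around a baseline row."""
    X = np.asarray(X, dtype=float)
    baseline = X[0].copy()
    base_score = float(predict_fn(baseline[None, :])[0])
    terms = {}
    for j in range(X.shape[1]):
        xs = np.unique(X[:, j])
        probe = np.tile(baseline, (len(xs), 1))
        probe[:, j] = xs  # change only feature j
        terms[j] = (xs, predict_fn(probe) - base_score)
    return base_score, terms


def score_from_terms(row, base_score, terms):
    """Rebuild a prediction from the baseline score and the extracted terms."""
    total = base_score
    for j, (xs, ys) in terms.items():
        total += float(np.interp(row[j], xs, ys))
    return total


def additive_score(X):
    # A made-up additive model used as the black box.
    return 2.0 * X[:, 0] + X[:, 1] ** 2


X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
base_score, terms = extract_additive_terms(X, additive_score)
rebuilt = score_from_terms([2.0, 3.0], base_score, terms)
```

For a truly additive model the rebuilt score matches the direct prediction; for a model with interactions the two would differ.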

nodegam.gams.utils.get_GAM_df_by_models(models, x_values_lookup=None, aggregate=True)#
nodegam.gams.utils.get_X_values_counts(X, feature_names=None)#
nodegam.gams.utils.get_x_values_lookup(X, feature_names=None)#

Get x values lookup.

Parameters

X – input features. Numpy array or pandas dataframe.

Returns

x_values_lookup – a dictionary with key as feature name and the value is all unique values of that feature.
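The returned structure can be pictured with a short NumPy sketch. The function name `build_x_values_lookup` and the fallback feature names `f0`, `f1`, ... are assumptions for illustration; the real function may name unnamed features differently.

```python
import numpy as np


def build_x_values_lookup(X, feature_names=None):
    """Map each feature name to its sorted unique values (hypothetical helper)."""
    X = np.asarray(X)
    if feature_names is None:
        # Placeholder names; an assumption, not the library's convention.
        feature_names = [f"f{i}" for i in range(X.shape[1])]
    return {name: np.unique(X[:, i]) for i, name in enumerate(feature_names)}


lookup = build_x_values_lookup(
    [[1.5, 0], [1.1, 1], [1.5, 1]],
    feature_names=["BUN", "cancer"],
)
```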

nodegam.gams.utils.my_interpolate(x, y, new_x)#

Handle edge cases for interpolation.

nodegam.gams.utils.predict_score(model, X)#
nodegam.gams.utils.predict_score_by_df(GAM_plot_df, X)#
nodegam.gams.utils.predict_score_with_each_feature(model, X)#
nodegam.gams.utils.predict_score_with_each_feature_by_df(GAM_plot_df, X, sum_directly=False)#
nodegam.gams.utils.sigmoid(x)#

Numerically stable sigmoid function.
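A standard way to make the sigmoid numerically stable is to pick, per element, the algebraic form whose exponential cannot overflow. The sketch below shows that technique under the name `stable_sigmoid`; it is not necessarily the exact code behind this function.

```python
import numpy as np


def stable_sigmoid(x):
    """Sigmoid that avoids overflow in exp for large |x|."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) is at most 1, so it cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, exp(x) is below 1, so use the equivalent form exp(x) / (1 + exp(x)).
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out


probs = stable_sigmoid([-1000.0, 0.0, 1000.0])
```

The naive form `1 / (1 + np.exp(-x))` would overflow in `exp` for `x = -1000`, while this version evaluates only exponentials of non-positive numbers.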

Module contents#

GAM baselines adapted from https://github.com/zzzace2000/GAMs_models/.