The linear_model module

The linear_model.stepwise module
- Both Method Variable Selection
- Forward Variable Selection

Module contents

Logistic Regression

class LogisticRegression(X, y, penalty, dual, tol, C, fit_intercept, intercept_scaling, class_weight, random_state, solver, max_iter, verbose, warm_start, n_jobs, l1_ratio)

This class implements a logistic regression model. It behaves like the sklearn.linear_model.LogisticRegression class, but adds methods for calculating confidence intervals, p-values, and model summaries.

__init__(X, y, penalty, dual, tol, C, fit_intercept, intercept_scaling, class_weight, random_state, solver, max_iter, verbose, warm_start, n_jobs, l1_ratio)

Parameters:

X (Union[DataFrame, ndarray, None]) – A Pandas DataFrame or a NumPy array containing the model predictors.
y (Union[Series, ndarray, None]) – A Pandas Series or a NumPy array containing the model response.
penalty (Literal['l1', 'l2', 'elasticnet']) – The type of penalty to use. Can be one of "none" (default), "l1", "l2", or "elasticnet".
dual (bool) – Whether to use the dual formulation of the problem.
tol (float) – The tolerance for convergence.
C (int) – The regularization strength.
fit_intercept (bool) – Whether to fit an intercept term.
intercept_scaling (int) – The scaling factor for the intercept term.
class_weight (Union[None, str, dict]) – None (default), “balanced” or a dictionary that maps class labels to weights.
random_state (int) – The random seed.
solver (Literal['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']) – The solver to use. Can be one of "lbfgs" (default), "liblinear", "newton-cg", "newton-cholesky", "sag", or "saga".
max_iter (int) – The maximum number of iterations.
verbose (int) – The verbosity level.
warm_start (bool) – Whether to reuse the solution of the previous call to fit as the starting point (warm start).
n_jobs (int) – The number of jobs to use for parallel processing.
l1_ratio (Union[float, None]) – The l1_ratio parameter for elasticnet regularization.
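
In scikit-learn, class_weight="balanced" weights each class by n_samples / (n_classes * count_of_class). The sketch below reproduces that rule with the standard library only. It assumes this class follows the same convention as sklearn.linear_model.LogisticRegression, which the parameter list suggests but does not confirm. The helper name balanced_class_weights is hypothetical and not part of estyp.

```python
from collections import Counter


def balanced_class_weights(y):
    """Hypothetical helper: scikit-learn style "balanced" class weights."""
    counts = Counter(y)        # number of samples per class label
    n_samples = len(y)
    n_classes = len(counts)
    # Rare classes receive proportionally larger weights.
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

With three samples of class 0 and one of class 1, the minority class gets three times the weight of the majority class.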

fit(): Fits the model to the data. Can be used like the sklearn.linear_model.LogisticRegression class or with the from_formula class method from statsmodels.

predict(new_data: DataFrame): Predicts the class labels for new data.

conf_int(conf_level=0.95): Calculates the confidence intervals for the model coefficients.

se(): Calculates the standard errors for the model coefficients.

z_values(): Calculates the z-scores for the model coefficients.

p_values(): Calculates the p-values for the model coefficients.

summary(conf_level=0.95): Prints a summary of the model.

from_formula(formula, data): Class method to create an instance from a formula.

params: Returns the estimated values for model parameters.

aic: Calculates the Akaike information criterion (AIC) for the model.

bic: Calculates the Bayesian information criterion (BIC) for the model.

cov_matrix: Returns the estimated covariance matrix for model parameters.

residuals: Returns the deviance of the model.

deviance_residuals: Returns the deviance residuals.

Examples

Example 1: Using LogisticRegression() like the statsmodels Logit class.

import numpy as np
import pandas as pd
from estyp.linear_model import LogisticRegression

np.random.seed(123)
data = pd.DataFrame({
   "y": np.random.randint(2, size=100),
   "x1": np.random.uniform(-1, 1, size=100),
   "x2": np.random.uniform(-1, 1, size=100),
})

formula = "y ~ x1 + x2"
spec = LogisticRegression.from_formula(formula, data)
model = spec.fit()

print(model.summary())

           Estimate      S.E.         z  Pr(>|z|)   [Lower,    Upper]
Intercept -0.200864  0.202894 -0.989996  0.322176 -0.598530  0.196801
x1         0.032006  0.375254  0.085292  0.932030 -0.703478  0.767490
x2         0.438665  0.344263  1.274215  0.202587 -0.236078  1.113407

Example 2: Using LogisticRegression() like the sklearn.linear_model.LogisticRegression() class.

from estyp.linear_model import LogisticRegression

X = data.drop(columns="y")
y = data["y"]

model = LogisticRegression()
model.fit(X, y)

print(model.summary())

           Estimate      S.E.         z  Pr(>|z|)   [Lower,    Upper]
Intercept -0.200864  0.202894 -0.989996  0.322176 -0.598530  0.196801
x1         0.032006  0.375254  0.085292  0.932030 -0.703478  0.767490
x2         0.438665  0.344263  1.274215  0.202587 -0.236078  1.113407
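
The columns of the summary tables above are numerically consistent with standard Wald inference: z is the estimate divided by its standard error, Pr(>|z|) is the two-sided normal tail probability, and the interval is the estimate plus or minus a normal quantile times the standard error. The sketch below recomputes the Intercept row from its printed estimate and standard error using only the Python standard library. The helper name wald_row is hypothetical and not part of estyp, and the claim that the library computes these values this way is inferred from the printed numbers, not from its source.

```python
from statistics import NormalDist


def wald_row(estimate, std_error, conf_level=0.95):
    """Hypothetical helper: Wald z, two-sided p-value, and confidence bounds."""
    normal = NormalDist()
    z = estimate / std_error
    p_value = 2.0 * (1.0 - normal.cdf(abs(z)))
    # Two-sided interval: use the (1 + conf_level) / 2 quantile of the normal.
    quantile = normal.inv_cdf(0.5 + conf_level / 2.0)
    half_width = quantile * std_error
    return z, p_value, estimate - half_width, estimate + half_width


# Estimate and S.E. copied from the Intercept row of the summary table.
z, p, lower, upper = wald_row(-0.200864, 0.202894)
print(f"z={z:.4f}  p={p:.4f}  [{lower:.4f}, {upper:.4f}]")
```

The results agree with the Intercept row to about four decimal places; the remaining differences come from the rounded inputs.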

Stepwise Selection for Linear Models

class Stepwise(formula, data, model, direction, criterion, alpha, max_iter, formula_params, fit_params, verbose)

The Stepwise class performs stepwise model selection: predictors are added to or removed from a model according to their statistical significance, AIC, or BIC.

Parameters:

formula (str) – A string representing the formula, using the patsy formula syntax.
data (DataFrame) – A pandas DataFrame that contains the data for both the dependent and independent variables.
model (Union[GLM, OLS, Logit, LogisticRegression]) – Specifies the type of model to be used.
direction (Literal["both", "forward", "backward"]) – Specifies the direction of the stepwise process.
criterion (Literal["aic", "bic", "f-test"]) – The criterion to be used for adding or removing predictors.
alpha (float) – The significance level for adding or removing predictors. It must be a value between 0 and 1.
max_iter (int) – The maximum number of iterations when direction is "both".
formula_params (Dict[str, Any]) – Additional parameters to be passed to the model’s from_formula method.
fit_params (Dict[str, Any]) – Additional parameters to be passed to the model’s fit method.
verbose (bool) – If set to False, the class will not print information about the stepwise process.

optimal_model_: The optimal model obtained after the stepwise process.

optimal_formula_: The optimal model formula after the stepwise process.

optimal_variables_: List of optimal predictor variables in the final model.

optimal_metric_: The optimal value of the chosen criterion (e.g., AIC, BIC, or F-test) for the final model.

fit()

Conducts the stepwise process based on the specified direction and criterion.

Examples:

import pandas as pd
from statsmodels.api import OLS
from estyp.linear_model import Stepwise
data = pd.DataFrame({"y": [1,2,3,4,5], "x1": [5,20,3,2,1], "x2": [6,7,8,9,10]})
stepwise = Stepwise(formula="y ~ 1", data=data, model=OLS, direction="forward", criterion="aic")
stepwise.fit()
print("Best predictors:", stepwise.optimal_variables_)

Starting AIC: 19.6551
- Term added: "x2" | AIC: -317.7430
- Term added: "x1" | AIC: -323.7311
Forward selection completed
- Obtained AIC: -323.7311
- Added terms: None
- Obtained formula: "y ~ x2 + x1"
Best predictors: ['x2', 'x1']
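
The log above follows the usual greedy pattern of forward selection: start from the base formula, try each remaining term, keep the one that improves the criterion most, and stop when no term improves it. The sketch below shows that pattern in plain Python with a user-supplied scoring function where lower is better. It is not the implementation used by Stepwise; forward_select, toy_score, and the scores in toy_scores are hypothetical and made up for illustration.

```python
def forward_select(candidates, score):
    """Greedy forward selection; score(terms) returns a value where lower is better."""
    selected = []
    best = score(selected)
    history = [best]
    remaining = list(candidates)
    while remaining:
        # Score every model obtained by adding one remaining term.
        trial_scores = {term: score(selected + [term]) for term in remaining}
        term, value = min(trial_scores.items(), key=lambda item: item[1])
        if value >= best:
            break  # no candidate improves the criterion
        selected.append(term)
        remaining.remove(term)
        best = value
        history.append(best)
    return selected, history


# Made-up criterion values keyed by the sorted set of terms in the model.
toy_scores = {(): 20.0, ("x1",): 18.0, ("x2",): 5.0, ("x1", "x2"): 4.0}


def toy_score(terms):
    return toy_scores[tuple(sorted(terms))]


selected, history = forward_select(["x1", "x2"], toy_score)
print(selected, history)
```

In this toy run, "x2" is added first because it lowers the score the most, then "x1" is added because it lowers the score further.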

plot_history(ax=None)

Plots the history of the chosen criterion during the stepwise process.

Parameters: ax (matplotlib.axes.Axes, optional) – An Axes instance for the plot. If not provided, a new figure and axes will be created.

Returns:

fig, ax (matplotlib.figure.Figure, matplotlib.axes.Axes) – The Figure and Axes instances containing the plot, if ax is not provided.

Examples:

import pandas as pd
from statsmodels.api import OLS
from estyp.linear_model import Stepwise

data = pd.DataFrame(
   {
      "y": [1, 2, 3, 4, 5],
      "x1": [5, 20, 3, 2, 1],
      "x2": [6, 7, 8, 9, 10],
      "x3": [1, 2, 40, 4, 30],
      "x4": [20, 1, 4, 5, 6],
      "x5": [90, -1, 40, 5, 26],
   }
)
stepwise = Stepwise(
   formula="y ~ x1 + x2 + x3 + x4 + x5",
   data=data,
   model=OLS,
   direction="backward",
   criterion="bic",
)
stepwise.fit()
fig, ax = stepwise.plot_history()

Starting BIC: -299.2531
- Term dropped: "x3" | BIC: -312.2742
- Term dropped: "x5" | BIC: -315.9306
- Term dropped: "x1" | BIC: -320.3508
Backward selection completed
- Obtained BIC: -320.3508
- Dropped terms: 3
- Obtained formula: "y ~ x2 + x4"

Example

import pandas as pd
from statsmodels.api import OLS
from estyp.linear_model import Stepwise
data = pd.DataFrame({"y": [1,2,3,4,5], "x1": [5,20,3,2,1], "x2": [6,7,8,9,10]})
stepwise = Stepwise(formula="y ~ 1", data=data, model=OLS, direction="forward", criterion="aic")
stepwise.fit()

Starting AIC: 19.6551
- Term added: "x2" | AIC: -317.7430
- Term added: "x1" | AIC: -323.7311
Forward selection completed
- Obtained AIC: -323.7311
- Added terms: None
- Obtained formula: "y ~ x2 + x1"

Note

The class is designed to work seamlessly with statsmodels models.
If using “both” as the direction, the “f-test” criterion is not available.
Ensure that the data provided is appropriate for the model chosen.