Scikit-learn interface and Cross Validation#

This example uses the Swissmetro data. It is based on the previous example for this dataset, which in turn follows the xlogit Mixed Logit example.

Note that this wrapper can be used with scikit-learn’s tools such as cross-validation, as in this example, but it is not a proper estimator by scikit-learn’s requirements: it does not pass sklearn.utils.estimator_checks.check_estimator. This is because information about the variables and the alternatives needs to be provided through the pandas dataframe and through the estimator's options, whereas scikit-learn’s check tool only passes in generated numpy arrays of floats as input data. The number of alternatives and variables could be inferred from such arrays, but the result would sometimes be ambiguous.
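To illustrate why inferring the layout from a bare array is ambiguous, here is a minimal sketch. The helper `candidate_shapes` is hypothetical and not part of jaxlogit; it simply lists every way a given number of columns could be split into alternatives and variables.

```python
def candidate_shapes(n_columns):
    """Return every (n_alternatives, n_variables) pair whose product is n_columns.

    Assumes at least two alternatives and at least one variable per alternative.
    """
    return [
        (n_alts, n_columns // n_alts)
        for n_alts in range(2, n_columns + 1)
        if n_columns % n_alts == 0
    ]


# The design matrix built later in this example has 12 columns.
print(candidate_shapes(12))
# [(2, 6), (3, 4), (4, 3), (6, 2), (12, 1)]
```

Only one of these candidates, 3 alternatives with 4 variables each, matches the data used here, and an unlabeled float array does not say which one.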

[1]:

import pandas as pd
import numpy as np
import jax

#  64bit precision
jax.config.update("jax_enable_x64", True)

Import Swissmetro Dataset#

The alternatives are car, train or SM (the Swissmetro). The explanatory variables are cost, travel time and alternative-specific constants for the train and car options. See the previous example for the Swissmetro dataset for a detailed explanation.

Read data#

The dataset is imported and filtered.

[2]:

df_wide = pd.read_table("http://transp-or.epfl.ch/data/swissmetro.dat", sep="\t")

# Keep only observations for commute and business purposes that contain known choices
df_wide = df_wide[(df_wide["PURPOSE"].isin([1, 3]) & (df_wide["CHOICE"] != 0))]
df_wide["CHOICE"] = df_wide["CHOICE"].map({1: "TRAIN", 2: "SM", 3: "CAR"})

df_wide["custom_id"] = np.arange(len(df_wide))  # Add unique identifier
df_wide

[2]:

	GROUP	SURVEY	SP	ID	PURPOSE	FIRST	TICKET	WHO	LUGGAGE	AGE	...	TRAIN_CO	TRAIN_HE	SM_TT	SM_CO	SM_HE	SM_SEATS	CAR_TT	CAR_CO	CHOICE	custom_id
0	2	0	1	1	1	0	1	1	0	3	...	48	120	63	52	20	0	117	65	SM	0
1	2	0	1	1	1	0	1	1	0	3	...	48	30	60	49	10	0	117	84	SM	1
2	2	0	1	1	1	0	1	1	0	3	...	48	60	67	58	30	0	117	52	SM	2
3	2	0	1	1	1	0	1	1	0	3	...	40	30	63	52	20	0	72	52	SM	3
4	2	0	1	1	1	0	1	1	0	3	...	36	60	63	42	20	0	90	84	SM	4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8446	3	1	1	939	3	1	7	3	1	5	...	13	30	50	17	30	0	130	64	TRAIN	6763
8447	3	1	1	939	3	1	7	3	1	5	...	12	30	53	16	10	0	80	80	TRAIN	6764
8448	3	1	1	939	3	1	7	3	1	5	...	16	60	50	16	20	0	80	64	TRAIN	6765
8449	3	1	1	939	3	1	7	3	1	5	...	16	30	53	17	30	0	80	104	TRAIN	6766
8450	3	1	1	939	3	1	7	3	1	5	...	13	60	53	21	30	0	100	80	TRAIN	6767

6768 rows × 29 columns

Reshape data#

This scikit-learn interface uses data in wide format. The cell below transforms the data and adds the alternative-specific constants using pandas dataframes. The column heading for each alternative and variable pair has the form alternative_variable, so the cost of the train option is TRAIN_CO.

[3]:

varnames = ["CO", "TT"]
alternatives = ["TRAIN", "CAR", "SM"]
separator = "_"
alt_is_prefix = True

for alternative in alternatives:
    # Alternative-specific constants for train and car:
    # 1 for the matching alternative, 0 otherwise
    for alternative_constant in ["TRAIN", "CAR"]:
        column = alternative + separator + "ASC" + separator + alternative_constant
        df_wide[column] = 1.0 if alternative_constant == alternative else 0.0

    # Scale time and cost
    for var in varnames:
        df_wide[alternative + separator + var] = df_wide[alternative + separator + var] / 100


varnames = ["CO", "TT", "ASC_TRAIN", "ASC_CAR"]
all_varnames = [alternative + separator + varname for alternative in alternatives for varname in varnames]
all_varnames

[3]:

['TRAIN_CO',
 'TRAIN_TT',
 'TRAIN_ASC_TRAIN',
 'TRAIN_ASC_CAR',
 'CAR_CO',
 'CAR_TT',
 'CAR_ASC_TRAIN',
 'CAR_ASC_CAR',
 'SM_CO',
 'SM_TT',
 'SM_ASC_TRAIN',
 'SM_ASC_CAR']
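
The naming convention makes it possible to recover the alternative and the variable from each column heading. How the wrapper arranges the data internally is not shown in this example, so the following is only a sketch under that caveat: it uses a small hand-made dataframe (toy values, not Swissmetro rows) and reshapes the wide columns into an array indexed by observation, alternative and variable.

```python
import numpy as np
import pandas as pd

alternatives = ["TRAIN", "CAR", "SM"]
varnames = ["CO", "TT"]

# Toy wide-format data: two observations, one column per alternative/variable pair.
toy = pd.DataFrame({
    "TRAIN_CO": [0.48, 0.40], "TRAIN_TT": [1.12, 1.03],
    "CAR_CO": [0.65, 0.52], "CAR_TT": [1.17, 0.72],
    "SM_CO": [0.52, 0.52], "SM_TT": [0.63, 0.63],
})

# Order the columns alternative by alternative, then variable by variable.
columns = [f"{alt}_{var}" for alt in alternatives for var in varnames]

# Reshape to (observations, alternatives, variables).
X3d = toy[columns].to_numpy().reshape(len(toy), len(alternatives), len(varnames))

print(X3d.shape)  # (2, 3, 2)
print(X3d[0, 1])  # CAR cost and travel time of the first observation: [0.65 1.17]
```

The column order produced by the list comprehension is the same alternative-major order as `all_varnames` above, which is what makes the reshape line up.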

Creating and fitting a model#

Options for the model are given when the estimator is created. Note that the variable names must be included here. Panel data is currently not supported.

Then the model can be fit when given the data.

[4]:

from jaxlogit.scikit_wrapper import MixedLogitEstimator

mixed_logit_estimator = MixedLogitEstimator(
    varnames=varnames,
    randvars = {'TT': 'n'},
    n_draws=1500
)
X=df_wide[all_varnames]
y=df_wide["CHOICE"]

mixed_logit_estimator.fit(X, y)

[4]:

MixedLogitEstimator(n_draws=1500, randvars={'TT': 'n'},
                    varnames=['CO', 'TT', 'ASC_TRAIN', 'ASC_CAR'])
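
For readers unfamiliar with mixed logit, the sketch below shows in plain NumPy what a random coefficient and `n_draws` mean conceptually. It assumes that `'n'` requests a normally distributed coefficient, as in the xlogit convention. It is not jaxlogit's implementation: the parameter values are hypothetical, the data are random toy values, and pseudo-random draws are used for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 observations, 3 alternatives, 2 variables (cost, travel time).
n_obs, n_alts, n_draws = 4, 3, 1000
cost = rng.uniform(0.2, 1.0, size=(n_obs, n_alts))
time = rng.uniform(0.5, 2.0, size=(n_obs, n_alts))

# Hypothetical parameter values, chosen only for illustration.
beta_cost = -1.0            # fixed coefficient
mean_tt, sd_tt = -1.5, 0.8  # mean and standard deviation of the random coefficient

# One normal draw of the travel-time coefficient per observation and per draw.
beta_tt = mean_tt + sd_tt * rng.standard_normal((n_obs, n_draws))

# Utilities with shape (observations, draws, alternatives).
utility = beta_cost * cost[:, None, :] + beta_tt[:, :, None] * time[:, None, :]

# Logit probabilities for each draw, computed with a numerically stable softmax.
utility -= utility.max(axis=2, keepdims=True)
prob_per_draw = np.exp(utility)
prob_per_draw /= prob_per_draw.sum(axis=2, keepdims=True)

# Simulated choice probabilities: average over the draws.
prob = prob_per_draw.mean(axis=1)
print(prob.shape)        # (4, 3)
print(prob.sum(axis=1))  # each row sums to 1 up to floating-point error
```

More draws reduce the simulation noise in the averaged probabilities at the cost of more computation, which is the trade-off behind the `n_draws` option.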


Scikit learn utilities#

Through this interface, scikit-learn utilities for splitting data into training and testing sets and for cross-validation can be used.

[5]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
mixed_logit_estimator.fit(X_train, y_train)

mixed_logit_estimator.predict(X_test)

[5]:

array(['SM', 'SM', 'SM', ..., 'SM', 'SM', 'SM'],
      shape=(2708,), dtype='<U3')

[6]:

mixed_logit_estimator.score(X_test, y_test)

[6]:

0.6004431314623339
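
If `score` follows the scikit-learn classifier convention, the value above is the mean accuracy on the test set, that is, the share of predicted choices that match the observed ones. This is an assumption about the wrapper rather than something stated in this example. A minimal sketch with hypothetical labels:

```python
import numpy as np

# Hypothetical labels, for illustration only.
y_true = np.array(["SM", "CAR", "SM", "TRAIN", "SM"])
y_pred = np.array(["SM", "SM", "SM", "SM", "CAR"])

# Mean accuracy: the fraction of predictions that match the observed choices.
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 0.4
```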

[7]:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(mixed_logit_estimator, X, y, cv=5)
scores

[7]:

array([0.59379616, 0.49039882, 0.59748892, 0.60458241, 0.60458241])
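
To help read per-fold scores like these, it can be useful to compare them with a trivial baseline. The sketch below runs the same `cross_val_score` call on scikit-learn's `DummyClassifier`, which always predicts the most frequent class of the training folds. The labels are hypothetical and the features are placeholders, so the numbers say nothing about the Swissmetro data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical labels: 60 SM, 25 CAR and 15 TRAIN choices; the features are ignored.
y_toy = np.array(["SM"] * 60 + ["CAR"] * 25 + ["TRAIN"] * 15)
X_toy = np.zeros((len(y_toy), 2))

# A baseline that always predicts the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent")

# cv=5: fit on four folds and score on the held-out fold, five times.
baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=5)
print(baseline_scores)  # one accuracy value per fold
```

Running the same comparison on the real `X` and `y` would show how much of the model's accuracy comes from simply predicting the most common choice.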

Parameters of the MixedLogitEstimator (from its HTML representation):

Parameter	Value
alternatives	()
varnames	['CO', 'TT', ...]
randvars	{'TT': 'n'}
weights	None
avail	None
panels	None
init_coeff	None
maxiter	2000
random_state	None
n_draws	1500
halton	True
halton_opts	None
tol_opts	None
num_hess	False
set_vars	None
optim_method	'L-BFGS-scipy'
skip_std_errs	False
include_correlations	False
force_positive_chol_diag	True
hessian_by_row	True
finite_diff_hessian	False
batch_size	None
verbose	1