Machine learning
Rampy offers three classes for performing classification, regression, or unsupervised ML exploration of a set of spectra. They automatically handle the usual scaling, train-test splitting, and training of popular ML algorithms, using scikit-learn in the background.
These classes work (rp.regressor
was used in this publication) but may still evolve in the future. For advanced ML, I suggest using scikit-learn or other ML libraries directly.
Below you will find the documentation of the relevant functions; have a look at the example notebooks too: Example notebooks
Do not hesitate to ask for new features depending on your needs!
Machine learning classification
Based on a set of spectra and their labels, the rampy.ml_classification
module allows you to perform a classification of the spectra using a supervised ML algorithm. The class will take care of splitting the data into training and test sets, scaling the data, and training the model. You can then use the trained model to predict the labels of new spectra.
- class rampy.ml_classification.mlclassificator(x, y, **kwargs)
Bases:
object
Perform automatic classification of spectral data using scikit-learn machine learning algorithms.
This class supports various classification algorithms and allows customization of hyperparameters. It also handles scaling and splitting of training and testing datasets.
- x
Training spectra organized in rows (1 row = one spectrum).
- Type:
np.ndarray
- y
Target labels for training data.
- Type:
np.ndarray
- X_test
Testing spectra organized in rows.
- Type:
np.ndarray
- y_test
Target labels for testing data.
- Type:
np.ndarray
- algorithm
Machine learning algorithm to use. Options: “Nearest Neighbors”, “Linear SVM”, “RBF SVM”, “Gaussian Process”, “Decision Tree”, “Random Forest”, “Neural Net”, “AdaBoost”, “Naive Bayes”, “QDA”.
- Type:
str
- scaling
Whether to scale the data during fitting and prediction.
- Type:
bool
- scaler
Type of scaler to use (“MinMaxScaler” or “StandardScaler”).
- Type:
str
- test_size
Fraction of the dataset to use as a testing dataset if X_test and y_test are not provided.
- Type:
float
- rand_state
Random seed for reproducibility. Default is 42.
- Type:
int
- params_
Hyperparameters for the selected algorithm.
- Type:
dict
- model
Scikit-learn model instance.
- X_scaler
Scikit-learn scaler instance for X values.
- fit(params_: dict = None)
Scale data and train or re-train the model with the specified algorithm.
This method initializes and trains the model if it hasn’t been trained yet. If a model already exists (from a previous fit), it reuses the existing model and optionally updates its hyperparameters.
- Parameters:
params_ (dict, optional) – Hyperparameters for the selected algorithm. If provided, these parameters will override any previously set parameters.
- Raises:
ValueError – If an invalid algorithm is specified or if scaling is inconsistent.
- predict(X)
Predict target values using the trained model.
- Parameters:
X (np.ndarray) – Samples to predict with shape (n_samples, n_features).
- Returns:
Predicted target values with shape (n_samples,).
- Return type:
np.ndarray
Notes
If scaling is enabled, input samples will be scaled before prediction.
- Raises:
ValueError – If the model has not been fitted yet.
- refit()
Re-train a model previously trained with fit()
- scale_data()
Scale training and testing data.
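As a quick illustration, here is a minimal usage sketch of mlclassificator. The arrays X and y are hypothetical placeholders, and passing the options as keyword arguments is an assumption based on the (x, y, **kwargs) signature above; see the example notebooks for the exact API.
>>> import numpy as np
>>> import rampy as rp
>>> X = np.random.random((50, 400))  # 50 hypothetical spectra with 400 points each
>>> y = np.random.randint(0, 3, 50)  # three hypothetical class labels
>>> classifier = rp.mlclassificator(X, y, algorithm="Random Forest", scaling=True, scaler="StandardScaler", test_size=0.3)
>>> classifier.fit()  # scales the data, splits it and trains the model
>>> labels = classifier.predict(np.random.random((5, 400)))  # predict labels of new spectra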
Machine learning exploration
The rampy.ml_exploration
module allows you to perform unsupervised ML exploration of a set of spectra. The class will take care of scaling the data and training the model. You can then use the trained model to explore the data and find patterns.
- class rampy.ml_exploration.mlexplorer(x, **kwargs)
Bases:
object
Use machine learning algorithms from scikit-learn to explore spectroscopic datasets.
Performs automatic scaling and train/test splitting before an NMF or PCA fit.
- x
Spectra; n_features = n_frequencies.
- Type:
{array-like, sparse matrix}, shape = (n_samples, n_features)
- X_test
spectra organised in rows (1 row = one spectrum) that you want to use as a testing dataset. Those spectra should not be present in the x (training) dataset. The spectra should share a common X axis.
- Type:
{array-like, sparse matrix}, shape = (n_samples, n_features)
- algorithm
“PCA”, “NMF”, default = “PCA”
- Type:
String
- scaling
True or False. If True, data will be scaled prior to fitting (see below).
- Type:
Bool
- scaler
the type of scaling performed. Choose between MinMaxScaler and StandardScaler; see http://scikit-learn.org/stable/modules/preprocessing.html for details. Default = “MinMaxScaler”.
- Type:
String
- test_size
the fraction of the dataset to use as a testing dataset; only used if X_test and y_test are not provided.
- Type:
float
- rand_state
the random seed used for reproducibility of the results. Default = 42.
- Type:
int
- model
A Scikit Learn object model, see scikit learn library documentation.
- Type:
Scikit learn model
Remarks
For details on the hyperparameters of each algorithm, please consult the documentation of scikit-learn at:
http://scikit-learn.org/stable/
Results for machine learning algorithms can vary from run to run. A way to solve that is to fix the random_state.
Example
Given an array X of n samples by m frequencies:
>>> explo = rampy.mlexplorer(X)  # X is an array of signals built by mixing two partial components
>>> explo.algorithm = 'NMF'  # using Non-Negative Matrix factorization
>>> explo.nb_compo = 2  # number of components to use
>>> explo.test_size = 0.3  # size of test set
>>> explo.scaler = "MinMax"  # scaler
>>> explo.fit()  # fitting!
>>> W = explo.model.transform(explo.X_train_sc)  # getting the mixture array
>>> H = explo.X_scaler.inverse_transform(explo.model.components_)  # components in the original space
>>> plt.plot(X, H.T)  # plot the two components
- fit()
Train the model with the indicated algorithm.
Do not forget to tune the hyperparameters.
- predict(X)
Predict using the model.
- Parameters:
X ({array-like, sparse matrix}, shape = (n_samples, n_features)) – Samples.
- Returns:
C (array, shape = (n_samples,)) – Returns predicted values.
Remark
If self.scaling is True, scaling will be performed on the input X.
- refit()
Train the model with the indicated algorithm.
Do not forget to tune the hyperparameters.
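Similarly, here is a minimal PCA sketch under the same assumptions (hypothetical array X; explained_variance_ratio_ is an attribute of the underlying scikit-learn PCA object stored in model):
>>> import numpy as np
>>> import rampy
>>> X = np.random.random((40, 500))  # 40 hypothetical spectra with 500 points each
>>> explo = rampy.mlexplorer(X)
>>> explo.algorithm = 'PCA'
>>> explo.nb_compo = 3  # number of principal components to keep
>>> explo.test_size = 0.3
>>> explo.fit()  # scaling then PCA fit
>>> explo.model.explained_variance_ratio_  # variance captured by each component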
Machine learning regression
Based on a set of spectra and their labels, the rampy.ml_regressor
module allows you to perform a regression using the spectra and a supervised ML algorithm. The class will take care of splitting the data into training and test sets, scaling the data, and training the model. You can then use the trained model to predict the new values of your target from new spectra.
- rampy.ml_regressor.chemical_splitting(Pandas_DataFrame, target, split_fraction=0.3, rand_state=42)
Split datasets depending on their chemistry.
- Parameters:
Pandas_DataFrame (Pandas DataFrame) – The input DataFrame; its target column contains the names of the different data compositions
target (string) – The column of the DataFrame according to which the dataset will be split
split_fraction (float, between 0 and 1) – The fraction of the data assigned to the second output dataset (see Returns)
rand_state (int) – The random seed used for reproducibility of the results. Default = 42.
- Returns:
frame1 (Pandas DataFrame) – A dataset with (1 - split_fraction) of the data from the input dataset, with unique chemical compositions / names
frame2 (Pandas DataFrame) – A dataset with split_fraction of the data from the input dataset, with unique chemical compositions / names
frame1_idx (ndarray) – Contains the indexes of the data picked in Pandas_DataFrame to construct frame1
frame2_idx (ndarray) – Contains the indexes of the data picked in Pandas_DataFrame to construct frame2
Notes
This function prevents data with the same chemical composition from ending up in the different training/testing/validating datasets used in ML.
Indeed, putting data from the same original dataset / with the same chemical composition in both the training and the testing/validating datasets creates an initial bias in the splitting process.
Another way of doing that would be to write:
>>> grouped = Pandas_DataFrame.groupby(by='label')
>>> k = [i for i in grouped.groups.keys()]
>>> k_train, k_valid = model_selection.train_test_split(np.array(k), test_size=0.40, random_state=100)
>>> train = Pandas_DataFrame.loc[Pandas_DataFrame['label'].isin(k_train)]
>>> valid = Pandas_DataFrame.loc[Pandas_DataFrame['label'].isin(k_valid)]
(results will vary slightly, as the variable k is sorted here while the names variable in chemical_splitting is not)
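As a direct usage sketch of chemical_splitting itself (the DataFrame and its 'composition' column are hypothetical, and calling the function as rampy.chemical_splitting assumes it is exposed at the package level; otherwise use rampy.ml_regressor.chemical_splitting):
>>> import pandas as pd
>>> import rampy
>>> df = pd.DataFrame({'composition': ['A', 'A', 'B', 'B', 'B', 'C'], 'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})
>>> train, test, train_idx, test_idx = rampy.chemical_splitting(df, 'composition', split_fraction=0.3)
>>> set(train['composition']) & set(test['composition'])  # empty: no composition appears in both frames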
- class rampy.ml_regressor.mlregressor(x, y, **kwargs)
Bases:
object
Use machine learning algorithms from scikit-learn to perform regression between spectra and an observed variable.
- x
Spectra; n_features = n_frequencies.
- Type:
{array-like, sparse matrix}, shape = (n_samples, n_features)
- y
Target values for the training data.
- Type:
array, shape = (n_samples,)
- X_test
spectra organised in rows (1 row = one spectrum) that you want to use as a testing dataset. Those spectra should not be present in the x (training) dataset. The spectra should share a common X axis.
- Type:
{array-like, sparse matrix}, shape = (n_samples, n_features)
- y_test
the target that you want to use as a testing dataset. Those targets should not be present in the y (training) dataset.
- Type:
array, shape = (n_samples,)
- algorithm
“KernelRidge”, “SVM”, “LinearRegression”, “Lasso”, “ElasticNet”, “NeuralNet”, “BaggingNeuralNet”, default = “SVM”
- Type:
String
- scaling
True or False. If True, data will be scaled during fitting and prediction with the requested scaler (see below),
- Type:
Bool
- scaler
the type of scaling performed. Choose between MinMaxScaler and StandardScaler; see http://scikit-learn.org/stable/modules/preprocessing.html for details. Default = “MinMaxScaler”.
- Type:
String
- test_size
the fraction of the dataset to use as a testing dataset; only used if X_test and y_test are not provided.
- Type:
float
- rand_state
the random seed used for reproducibility of the results. Default = 42.
- Type:
int
- param_kr
contains the values of the hyperparameters to provide to KernelRidge and GridSearch for the Kernel Ridge regression algorithm.
- Type:
Dictionary
- param_svm
contains the values of the hyperparameters to provide to SVM and GridSearch for the Support Vector regression algorithm.
- Type:
Dictionary
- param_neurons
contains the parameters for the Neural Network (MLPRegressor model in sklearn). Default = dict(hidden_layer_sizes=(3,), solver='lbfgs', activation='relu', early_stopping=True)
- Type:
Dictionary
- param_bagging
contains the parameters for the BaggingRegressor sklearn function that uses an MLPRegressor base estimator. Default = dict(n_estimators=100, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=1, random_state=rand_state, verbose=0)
- Type:
Dictionary
- prediction_train
the predicted target values for the training dataset (y).
- Type:
Array{Float64}
- prediction_test
the predicted target values for the testing dataset (y_test).
- Type:
Array{Float64}
- model
A Scikit Learn object model, see scikit learn library documentation.
- Type:
Scikit learn model
- X_scaler
A Scikit Learn scaler object for the x values.
- Y_scaler
A Scikit Learn scaler object for the y values.
Example
Given an array X of n samples by m frequencies, and Y an array of n x 1 concentrations
>>> model = rampy.mlregressor(X, y)
>>> model.algorithm = "SVM"
>>> model.user_kernel = 'poly'
>>> model.fit()
>>> y_new = model.predict(X_new)
Remarks
For details on the hyperparameters of each algorithm, please consult the documentation of scikit-learn at:
http://scikit-learn.org/stable/
For Support Vector and Kernel Ridge regressions, mlregressor performs a cross-validation search using a 5-fold KFold cross-validator.
If the results are poor with Support Vector and Kernel Ridge regressions, you will have to tune the param_kr or param_svm dictionary that records the hyperparameter space to investigate during the cross-validation.
Results for machine learning algorithms can vary from run to run. A way to solve that is to fix the random_state. For neural nets, combining the results of multiple networks (bagging technique) may also generalise better, so it may be preferable to use the BaggingNeuralNet algorithm.
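For example, here is a hedged sketch of customizing the search grids before fitting, continuing from the Example above. The grid values are illustrative assumptions, and the dictionary keys follow scikit-learn's KernelRidge and SVR estimators:
>>> import numpy as np
>>> model = rampy.mlregressor(X, y)  # X, y as in the Example above
>>> model.algorithm = "KernelRidge"
>>> model.param_kr = {"alpha": [1.0, 0.1, 0.01, 0.001], "gamma": np.logspace(-4, 4, 9)}  # hypothetical grid
>>> model.param_svm = {"C": [1.0, 10.0, 100.0, 1000.0], "gamma": np.logspace(-4, 4, 9)}  # hypothetical grid
>>> model.fit()  # cross-validation search over the requested grid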
- fit()
Scale data and train the model with the indicated algorithm.
Do not forget to tune the hyperparameters.
- Parameters:
algorithm (String) – algorithm to use. Choose between “KernelRidge”, “SVM”, “LinearRegression”, “Lasso”, “ElasticNet”, “NeuralNet”, “BaggingNeuralNet”; default = “SVM”
- predict(X)
Predict using the model.
- Parameters:
X ({array-like, sparse matrix}, shape = (n_samples, n_features)) – Samples.
- Returns:
C (array, shape = (n_samples,)) – Returns predicted values.
Remark
If self.scaling is True, scaling will be performed on the input X.
- refit()
Re-train a model previously trained with fit()
Linear mixture
This function helps you solve a simple problem: you have spectra that are obtained by a linear combination of two endmember spectra.
If you have the two endmember spectra, you can use the rampy.mixing()
function to determine the fraction of each endmember in the mixture.
If you do not know the endmember spectra, then you may be interested in using directly the PyMCR library, see the documentation here and an example notebook here. We used it in this publication, see the code here.
- rampy.mixing.mixing_sp(y_fit: ndarray, ref1: ndarray, ref2: ndarray)
Mixes two reference spectra to match given experimental signals.
This function calculates the fractions of the first reference spectrum (ref1) in a linear combination of ref1 and ref2 that best matches the provided signals (y_fit). The calculation minimizes the sum of the least absolute values of the objective function: \( \text{obj} = \sum \left| y_{\text{fit}} - (F_1 \cdot \text{ref1} + (1 - F_1) \cdot \text{ref2}) \right| \).
- Args:
- y_fit (np.ndarray): Array containing the experimental signals with shape (m, n), where m is the number of data points and n is the number of experiments.
- ref1 (np.ndarray): Array containing the first reference signal with shape (m,).
- ref2 (np.ndarray): Array containing the second reference signal with shape (m,).
- Returns:
np.ndarray: Array of shape (n,) containing the fractions of ref1 in the mix. Values range between 0 and 1.
- Notes:
The calculation is performed using cvxpy for optimization.
Ensure that y_fit, ref1, and ref2 have compatible dimensions.
Example:
>>> import numpy as np
>>> y_fit = np.array([[0.5, 0.6], [0.4, 0.5], [0.3, 0.4]])
>>> ref1 = np.array([0.5, 0.4, 0.3])
>>> ref2 = np.array([0.2, 0.3, 0.4])
>>> fractions = mixing_sp(y_fit, ref1, ref2)
>>> print(fractions)
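To make the optimization explicit, here is a minimal cvxpy sketch of the same least-absolute-value mixing problem, reusing y_fit, ref1 and ref2 from the Example above. It is an illustrative reimplementation of the stated objective, not Rampy's exact code:
>>> import cvxpy as cp
>>> m, n = y_fit.shape
>>> F1 = cp.Variable((1, n))  # one fraction of ref1 per experiment
>>> mix = ref1.reshape(m, 1) @ F1 + ref2.reshape(m, 1) @ (1 - F1)  # linear mixture model
>>> objective = cp.Minimize(cp.sum(cp.abs(y_fit - mix)))  # sum of least absolute values
>>> problem = cp.Problem(objective, [F1 >= 0, F1 <= 1])  # fractions stay between 0 and 1
>>> problem.solve()
>>> fractions = F1.value.ravel()  # comparable to the output of mixing_sp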