Machine learning
Rampy offers three classes for performing classification, regression, or unsupervised ML exploration of a set of spectra. They automatically handle the usual scaling, train-test splitting, and training of popular ML algorithms, using scikit-learn in the background.
These classes work (rp.regressor
was used in this publication) but may still evolve in the future. For advanced ML, I suggest using scikit-learn or other ML libraries directly.
Below you will find the documentation of the relevant functions; have a look at the example notebooks too: Example notebooks
Do not hesitate to ask for new features depending on your needs!
Machine learning classification
Based on a set of spectra and their labels, the rampy.ml_classification
module allows you to perform a classification of the spectra using a supervised ML algorithm. The class will take care of splitting the data into training and test sets, scaling the data, and training the model. You can then use the trained model to predict the labels of new spectra.
- class rampy.ml_classification.mlclassificator(x, y, **kwargs)
Bases:
object
Perform automatic classification of spectral data using scikit-learn machine learning algorithms.
This class supports various classification algorithms and allows customization of hyperparameters. It also handles scaling and splitting of training and testing datasets.
- x
Training spectra organized in rows (1 row = one spectrum).
- Type:
np.ndarray
- y
Target labels for training data.
- Type:
np.ndarray
- X_test
Testing spectra organized in rows.
- Type:
np.ndarray
- y_test
Target labels for testing data.
- Type:
np.ndarray
- algorithm
Machine learning algorithm to use. Options: “Nearest Neighbors”, “Linear SVM”, “RBF SVM”, “Gaussian Process”, “Decision Tree”, “Random Forest”, “Neural Net”, “AdaBoost”, “Naive Bayes”, “QDA”.
- Type:
str
- scaling
Whether to scale the data during fitting and prediction.
- Type:
bool
- scaler
Type of scaler to use (“MinMaxScaler” or “StandardScaler”).
- Type:
str
- test_size
Fraction of the dataset to use as a testing dataset if X_test and y_test are not provided.
- Type:
float
- rand_state
Random seed for reproducibility. Default is 42.
- Type:
int
- params_
Hyperparameters for the selected algorithm.
- Type:
dict
- model
Scikit-learn model instance.
- X_scaler
Scikit-learn scaler instance for X values.
- fit(params_: dict = None)
Scale data and train or re-train the model with the specified algorithm.
This method initializes and trains the model if it hasn’t been trained yet. If a model already exists (from a previous fit), it reuses the existing model and optionally updates its hyperparameters.
- Parameters:
params_ (dict, optional) – Hyperparameters for the selected algorithm. If provided, these parameters will override any previously set parameters.
- Raises:
ValueError – If an invalid algorithm is specified or if scaling is inconsistent.
- predict(X)
Predict target values using the trained model.
- Parameters:
X (np.ndarray) – Samples to predict with shape (n_samples, n_features).
- Returns:
Predicted target values with shape (n_samples,).
- Return type:
np.ndarray
Notes
If scaling is enabled, input samples will be scaled before prediction.
- Raises:
ValueError – If the model has not been fitted yet.
- refit()
Re-train a model previously trained with fit()
- scale_data()
Scale training and testing data.
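As a quick illustration, here is a minimal usage sketch of mlclassificator. The arrays X and y are hypothetical placeholders, and passing the options as keyword arguments is an assumption based on the (x, y, **kwargs) signature above; see the example notebooks for the exact API.
>>> import numpy as np
>>> import rampy as rp
>>> X = np.random.random((50, 400))  # 50 hypothetical spectra with 400 points each
>>> y = np.random.randint(0, 3, 50)  # three hypothetical class labels
>>> classifier = rp.mlclassificator(X, y, algorithm="Random Forest", scaling=True, scaler="StandardScaler", test_size=0.3)
>>> classifier.fit()  # scales the data, splits it and trains the model
>>> labels = classifier.predict(np.random.random((5, 400)))  # predict labels of new spectra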
Machine learning exploration
The rampy.ml_exploration
module allows you to perform unsupervised ML exploration of a set of spectra. The class will take care of scaling the data and training the model. You can then use the trained model to explore the data and find patterns.
- class rampy.ml_exploration.mlexplorer(x, **kwargs)
Bases:
object
Use machine learning algorithms from scikit-learn to explore spectroscopic datasets.
Performs automatic scaling and train/test splitting before an NMF or PCA fit.
- x
Spectra; n_features = n_frequencies.
- Type:
{array-like, sparse matrix}, shape = (n_samples, n_features)
- X_test
spectra organised in rows (1 row = one spectrum) that you want to use as a testing dataset. Those spectra should not be present in the x (training) dataset. The spectra should share a common X axis.
- Type:
{array-like, sparse matrix}, shape = (n_samples, n_features)
- algorithm
“PCA”, “NMF”, default = “PCA”
- Type:
String
- scaling
True or False. If True, data will be scaled prior to fitting (see below).
- Type:
Bool
- scaler
the type of scaling performed. Choose between MinMaxScaler and StandardScaler; see http://scikit-learn.org/stable/modules/preprocessing.html for details. Default = “MinMaxScaler”.
- Type:
String
- test_size
the fraction of the dataset to use as a testing dataset; only used if X_test and y_test are not provided.
- Type:
float
- rand_state
the random seed used for reproducibility of the results. Default = 42.
- Type:
int
- model
A Scikit Learn object model, see scikit learn library documentation.
- Type:
Scikit learn model
Remarks
For details on the hyperparameters of each algorithm, please consult the documentation of scikit-learn at:
http://scikit-learn.org/stable/
Results for machine learning algorithms can vary from run to run. A way to solve that is to fix the random_state.
Example
Given an array X of n samples by m frequencies:
>>> explo = rampy.mlexplorer(X)  # X is an array of signals built by mixing two partial components
>>> explo.algorithm = 'NMF'  # using Non-Negative Matrix factorization
>>> explo.nb_compo = 2  # number of components to use
>>> explo.test_size = 0.3  # size of test set
>>> explo.scaler = "MinMax"  # scaler
>>> explo.fit()  # fitting!
>>> W = explo.model.transform(explo.X_train_sc)  # getting the mixture array
>>> H = explo.X_scaler.inverse_transform(explo.model.components_)  # components in the original space
>>> plt.plot(X, H.T)  # plot the two components
- fit()
Train the model with the indicated algorithm.
Do not forget to tune the hyperparameters.
- predict(X)
Predict using the model.
- Parameters:
X ({array-like, sparse matrix}, shape = (n_samples, n_features)) – Samples.
- Returns:
C (array, shape = (n_samples,)) – Returns predicted values.
Remark
If self.scaling is True, scaling will be performed on the input X.
- refit()
Train the model with the indicated algorithm.
Do not forget to tune the hyperparameters.
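Similarly, here is a minimal PCA sketch under the same assumptions (hypothetical array X; explained_variance_ratio_ is an attribute of the underlying scikit-learn PCA object stored in model):
>>> import numpy as np
>>> import rampy
>>> X = np.random.random((40, 500))  # 40 hypothetical spectra with 500 points each
>>> explo = rampy.mlexplorer(X)
>>> explo.algorithm = 'PCA'
>>> explo.nb_compo = 3  # number of principal components to keep
>>> explo.test_size = 0.3
>>> explo.fit()  # scaling then PCA fit
>>> explo.model.explained_variance_ratio_  # variance captured by each component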
Machine learning regression
Based on a set of spectra and their labels, the rampy.ml_regressor
module allows you to perform a regression using the spectra and a supervised ML algorithm. The class will take care of splitting the data into training and test sets, scaling the data, and training the model. You can then use the trained model to predict the new values of your target from new spectra.
- rampy.ml_regressor.chemical_splitting(Pandas_DataFrame, target, split_fraction=0.3, rand_state=42)
Split datasets depending on their chemistry.
- Parameters:
Pandas_DataFrame (Pandas DataFrame) – The input DataFrame; its target column contains the names of the different data compositions
target (string) – The column of the DataFrame according to which the dataset will be split
split_fraction (float, between 0 and 1) – The fraction of the data assigned to the second output dataset (see Returns)
rand_state (int) – The random seed used for reproducibility of the results. Default = 42.
- Returns:
frame1 (Pandas DataFrame) – A dataset with (1 - split_fraction) of the data from the input dataset, with unique chemical compositions / names
frame2 (Pandas DataFrame) – A dataset with split_fraction of the data from the input dataset, with unique chemical compositions / names
frame1_idx (ndarray) – Contains the indexes of the data picked in Pandas_DataFrame to construct frame1
frame2_idx (ndarray) – Contains the indexes of the data picked in Pandas_DataFrame to construct frame2
Notes
This function prevents data with the same chemical composition from ending up in the different training/testing/validating datasets used in ML.
Indeed, putting data from the same original dataset / with the same chemical composition in both the training and the testing/validating datasets creates an initial bias in the splitting process.
Another way of doing that would be to write:
>>> grouped = Pandas_DataFrame.groupby(by='label')
>>> k = [i for i in grouped.groups.keys()]
>>> k_train, k_valid = model_selection.train_test_split(np.array(k), test_size=0.40, random_state=100)
>>> train = Pandas_DataFrame.loc[Pandas_DataFrame['label'].isin(k_train)]
>>> valid = Pandas_DataFrame.loc[Pandas_DataFrame['label'].isin(k_valid)]
(results will vary slightly, as the variable k is sorted here while the names variable in chemical_splitting is not)
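As a direct usage sketch of chemical_splitting itself (the DataFrame and its 'composition' column are hypothetical, and calling the function as rampy.chemical_splitting assumes it is exposed at the package level; otherwise use rampy.ml_regressor.chemical_splitting):
>>> import pandas as pd
>>> import rampy
>>> df = pd.DataFrame({'composition': ['A', 'A', 'B', 'B', 'B', 'C'], 'value': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})
>>> train, test, train_idx, test_idx = rampy.chemical_splitting(df, 'composition', split_fraction=0.3)
>>> set(train['composition']) & set(test['composition'])  # empty: no composition appears in both frames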
- class rampy.ml_regressor.mlregressor(x, y, **kwargs)
Bases:
object
Use machine learning algorithms from scikit-learn to perform regression between spectra and an observed variable.
- x
Spectra; n_features = n_frequencies.
- Type:
{array-like, sparse matrix}, shape = (n_samples, n_features)
- y
Target values for the training data.
- Type:
array, shape = (n_samples,)
- X_test
spectra organised in rows (1 row = one spectrum) that you want to use as a testing dataset. Those spectra should not be present in the x (training) dataset. The spectra should share a common X axis.
- Type:
{array-like, sparse matrix}, shape = (n_samples, n_features)
- y_test
the target that you want to use as a testing dataset. Those targets should not be present in the y (training) dataset.
- Type:
array, shape = (n_samples,)
- algorithm
“KernelRidge”, “SVM”, “LinearRegression”, “Lasso”, “ElasticNet”, “NeuralNet”, “BaggingNeuralNet”, default = “SVM”
- Type:
String
- scaling
True or False. If True, data will be scaled during fitting and prediction with the requested scaler (see below),
- Type:
Bool
- scaler
the type of scaling performed. Choose between MinMaxScaler and StandardScaler; see http://scikit-learn.org/stable/modules/preprocessing.html for details. Default = “MinMaxScaler”.
- Type:
String
- test_size
the fraction of the dataset to use as a testing dataset; only used if X_test and y_test are not provided.
- Type:
float
- rand_state
the random seed used for reproducibility of the results. Default = 42.
- Type:
int
- param_kr
contains the values of the hyperparameters to provide to KernelRidge and GridSearch for the Kernel Ridge regression algorithm.
- Type:
Dictionary
- param_svm
contains the values of the hyperparameters to provide to SVM and GridSearch for the Support Vector regression algorithm.
- Type:
Dictionary
- param_neurons
contains the parameters for the Neural Network (MLPRegressor model in sklearn). Default = dict(hidden_layer_sizes=(3,), solver='lbfgs', activation='relu', early_stopping=True)
- Type:
Dictionary
- param_bagging
contains the parameters for the BaggingRegressor sklearn function that uses an MLPRegressor base estimator. Default = dict(n_estimators=100, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=1, random_state=rand_state, verbose=0)
- Type:
Dictionary
- prediction_train
the predicted target values for the training dataset (y).
- Type:
Array{Float64}
- prediction_test
the predicted target values for the testing dataset (y_test).
- Type:
Array{Float64}
- model
A Scikit Learn object model, see scikit learn library documentation.
- Type:
Scikit learn model
- X_scaler
A Scikit Learn scaler object for the x values.
- Y_scaler
A Scikit Learn scaler object for the y values.
Example
Given an array X of n samples by m frequencies, and Y an array of n x 1 concentrations
>>> model = rampy.mlregressor(X, y)
>>> model.algorithm = "SVM"
>>> model.user_kernel = 'poly'
>>> model.fit()
>>> y_new = model.predict(X_new)
Remarks
For details on the hyperparameters of each algorithm, please consult the documentation of scikit-learn at:
http://scikit-learn.org/stable/
For Support Vector and Kernel Ridge regressions, mlregressor performs a cross-validation search using a 5-fold KFold cross-validator.
If the results are poor with Support Vector and Kernel Ridge regressions, you will have to tune the param_kr or param_svm dictionary that records the hyperparameter space to investigate during the cross-validation.
Results for machine learning algorithms can vary from run to run. A way to solve that is to fix the random_state. For neural nets, combining the results of multiple networks (bagging technique) may also generalise better, so it may be preferable to use the BaggingNeuralNet algorithm.
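For example, here is a hedged sketch of customizing the search grids before fitting, continuing from the Example above. The grid values are illustrative assumptions, and the dictionary keys follow scikit-learn's KernelRidge and SVR estimators:
>>> import numpy as np
>>> model = rampy.mlregressor(X, y)  # X, y as in the Example above
>>> model.algorithm = "KernelRidge"
>>> model.param_kr = {"alpha": [1.0, 0.1, 0.01, 0.001], "gamma": np.logspace(-4, 4, 9)}  # hypothetical grid
>>> model.param_svm = {"C": [1.0, 10.0, 100.0, 1000.0], "gamma": np.logspace(-4, 4, 9)}  # hypothetical grid
>>> model.fit()  # cross-validation search over the requested grid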
- fit()
Scale data and train the model with the indicated algorithm.
Do not forget to tune the hyperparameters.
- Parameters:
algorithm (String) – algorithm to use. Choose between “KernelRidge”, “SVM”, “LinearRegression”, “Lasso”, “ElasticNet”, “NeuralNet”, “BaggingNeuralNet”; default = “SVM”
- predict(X)
Predict using the model.
- Parameters:
X ({array-like, sparse matrix}, shape = (n_samples, n_features)) – Samples.
- Returns:
C (array, shape = (n_samples,)) – Returns predicted values.
Remark
If self.scaling is True, scaling will be performed on the input X.
- refit()
Re-train a model previously trained with fit()
Linear mixture
This function helps you solve a simple problem: you have spectra that are obtained by a linear combination of two endmember spectra.
If you have the two endmember spectra, you can use the rampy.mixing()
function to determine the fraction of each endmember in the mixture.
If you do not know the endmember spectra, then you may be interested in using directly the PyMCR library, see the documentation here and an example notebook here. We used it in this publication, see the code here.
- rampy.mixing.mixing_sp(y_fit: ndarray, ref1: ndarray, ref2: ndarray)
Mixes two reference spectra to match given experimental signals.
This function calculates the fractions of the first reference spectrum (ref1) in a linear combination of ref1 and ref2 that best matches the provided signals (y_fit). The calculation minimizes the sum of the least absolute values of the objective function: \( \text{obj} = \sum \left| y_{\text{fit}} - (F_1 \cdot \text{ref1} + (1 - F_1) \cdot \text{ref2}) \right| \).
- Args:
- y_fit (np.ndarray): Array containing the experimental signals with shape (m, n), where m is the number of data points and n is the number of experiments.
- ref1 (np.ndarray): Array containing the first reference signal with shape (m,).
- ref2 (np.ndarray): Array containing the second reference signal with shape (m,).
- Returns:
np.ndarray: Array of shape (n,) containing the fractions of ref1 in the mix. Values range between 0 and 1.
- Notes:
The calculation is performed using cvxpy for optimization.
Ensure that y_fit, ref1, and ref2 have compatible dimensions.
Example:
>>> import numpy as np
>>> y_fit = np.array([[0.5, 0.6], [0.4, 0.5], [0.3, 0.4]])
>>> ref1 = np.array([0.5, 0.4, 0.3])
>>> ref2 = np.array([0.2, 0.3, 0.4])
>>> fractions = mixing_sp(y_fit, ref1, ref2)
>>> print(fractions)
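To make the optimization explicit, here is a minimal cvxpy sketch of the same least-absolute-value mixing problem, reusing y_fit, ref1 and ref2 from the Example above. It is an illustrative reimplementation of the stated objective, not Rampy's exact code:
>>> import cvxpy as cp
>>> m, n = y_fit.shape
>>> F1 = cp.Variable((1, n))  # one fraction of ref1 per experiment
>>> mix = ref1.reshape(m, 1) @ F1 + ref2.reshape(m, 1) @ (1 - F1)  # linear mixture model
>>> objective = cp.Minimize(cp.sum(cp.abs(y_fit - mix)))  # sum of least absolute values
>>> problem = cp.Problem(objective, [F1 >= 0, F1 <= 1])  # fractions stay between 0 and 1
>>> problem.solve()
>>> fractions = F1.value.ravel()  # comparable to the output of mixing_sp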