MLpronto

What is MLpronto?

MLpronto supports the democratization of machine learning.

MLpronto can be used to execute some of the more common machine learning algorithms without the need to engage in any way with programming code. With the web interface, people can choose their file of data, and MLpronto will analyze the data according to the selected machine learning options.

For those users who prefer engaging with programming code, MLpronto generates code that can be used straightaway for analysis of data with machine learning algorithms, and the code can be customized and built upon for rapid development of machine learning projects.

Currently, MLpronto supports the supervised machine learning tasks of classification and regression.

Stages of MLpronto

1. A file of data and machine learning options are selected by the user.

2. MLpronto generates code to analyze the file with machine learning algorithms based on the user's selections.

3. MLpronto executes the machine learning algorithms on the data.

4. MLpronto reports the results of the analysis as well as the code it generated.

Format of Input File

MLpronto requires a file of data in one of the following formats:

.txt	Text file, either comma or tab delimited
.csv	Text file, comma separated values
.tsv	Text file, tab separated values
.xls	Excel file, older version
.xlsx	Excel file, newer version
.xlsm	Excel file, with macros
.xlsb	Excel file, binary
.ods	OpenDocument spreadsheet

Every row of the file should have the same number of columns

The file may have a header row or not

There should be one column corresponding to the label, i.e., dependent variable, output, target, response, or class (indicated in advanced parameter settings)

Every column (excepting its header) should contain either numbers or else instances of a small number of different categories. A column should not contain text where most items in the column are unique.


7	3.14	yes	Comedy	ID1	Idris	why	Non-stop acting
-123	17000	no	Drama	ID2	Basia	105	exquisite on so many levels
42	-1.5	yes	Drama	ID3	Laila	no way	well - I liked the credits anyway
1285	-826.29	yes	Action & Advntr	ID4	Zahara	???	What is this even about?
400	0.1251		Documentary	ID5	Nia	FOMO	i laughed i cried then i entered the theater
88	47.238	no	Comedy	ID6	Javier	bougie	mercurial filmmaking
45362	-9		Drama	ID7	Kamal	hmm...	kept me gussing.
-91	62.609	yes	Action & Advntr	ID8	Lina	0.33	I watched for free. It was overpriced.
-5573		no	Comedy	ID9	Azami	Ack!	intense story with a poetic ending
13	4296.8	no	Documentary	ID10	Malik	mayhaps	a magisterial portrayal

The data may contain missing values (handling of missing values is indicated in advanced parameter settings)
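
The formatting rules above can be checked before a file is uploaded. The following is a minimal sketch that uses only the Python standard library; it is not MLpronto's own validation code. The function names, the inline sample rows (four of the columns from the table above), and the 50% uniqueness threshold are assumptions chosen for illustration.

```python
import csv
import io

# Tab-delimited sample without a header row (illustrative data only).
SAMPLE = (
    "7\t3.14\tyes\tID1\n"
    "-123\t17000\tno\tID2\n"
    "42\t-1.5\tyes\tID3\n"
    "1285\t-826.29\tyes\tID4\n"
    "400\t0.1251\t\tID5\n"
    "88\t47.238\tno\tID6\n"
    "45362\t-9\t\tID7\n"
    "-91\t62.609\tyes\tID8\n"
    "-5573\t\tno\tID9\n"
    "13\t4296.8\tno\tID10\n"
)


def is_number(text):
    """Return True if the text parses as an integer or decimal number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def describe_columns(text, delimiter="\t", max_unique_fraction=0.5):
    """Classify each column and count its missing (empty) values.

    max_unique_fraction is a hypothetical cutoff: a text column whose
    share of distinct values exceeds it is flagged as mostly unique.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if len({len(row) for row in rows}) != 1:
        raise ValueError("rows have differing numbers of columns")

    report = []
    for column in zip(*rows):
        present = [value for value in column if value.strip() != ""]
        missing = len(column) - len(present)
        if not present:
            kind = "empty"
        elif all(is_number(value) for value in present):
            kind = "numeric"
        elif len(set(present)) / len(present) > max_unique_fraction:
            kind = "mostly unique text"
        else:
            kind = "categorical"
        report.append((kind, missing))
    return report


for index, (kind, missing) in enumerate(describe_columns(SAMPLE)):
    print(f"column {index}: {kind}, {missing} missing")
```

Columns flagged as mostly unique text, such as an identifier column, are the ones the rules above advise against including.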

Example Input Files

Below are some example files that can be used to test out MLpronto

Domain	Classification or Regression	Has header row	Column containing label	Contains missing values	CSV format	TSV format	XLSX format	ODS format
Penguin species	Classification	✔	-1	✔	.csv	.tsv	.xlsx	.ods
Airline flight delays	Classification	✔	-1		.csv	.tsv	.xlsx	.ods
Parkinson's disease	Classification		0		.csv	.tsv	.xlsx	.ods
Health insurance cost	Regression	✔	-1	✔	.csv	.tsv	.xlsx	.ods
Movie revenue	Regression	✔	-1		.csv	.tsv	.xlsx	.ods
Price of diamonds	Regression	✔	-1		.csv	.tsv	.xlsx	.ods

Parameter Options

Parameter	Options
Algorithm	Logistic regression (classification); K Nearest Neighbors (classification); Gradient Boosting (classification); Random Forest (classification); Gaussian Naive Bayes (classification); Quadratic Discriminant Analysis (classification); Support vector machine (classification); Neural Network (classification); Linear regression (regression); K Nearest Neighbors (regression); Gradient Boosting (regression); Lasso (regression); Bayesian Ridge (regression); Elastic Net (regression); Stochastic Gradient Descent (regression); Neural Network (regression)
How to handle any missing values in the data	Remove rows with missing values; Remove columns with missing values; Univariate imputation of missing values; Multivariate imputation of missing values
Index of column indicating the labels	An integer indicating the index of the column containing labels, i.e., dependent variable, output, target, response, or class. The rest of the columns correspond to features, i.e., independent variables, predictors, input, or explanatory variables. The index of the first column is 0, of the second column is 1, ..., of the second to last column is -2, and of the last column is -1.
Feature scaling	Yes or no. Should data be scaled so that each feature column has a mean of 0 and a standard deviation of 1?
Percent of data used for training	A number between 1 and 99
Visualize data with plot	Yes or no. Should a 2-dimensional (3-dimensional) scatter plot of the data be created via principal component analysis (PCA)? In the case of classification, this will project the feature columns along the two (three) most significant principal components and the points in the plot will be colored based on their labeled class. In the case of regression, this will project the feature columns along the one (two) most significant principal component(s) (horizontal axes) and plot the projected data along with their labeled values (vertical axis). In either case, the percentage of explained variance is reported.
Calculate feature relationships	Yes or no. Should relationships between feature columns and the label column be calculated? Correlations between every pair of columns will be calculated. Also, various dependencies will be calculated. The mutual information indicates the dependency between a feature column and the label column. The F-value and p-value for each feature column relative to the label column are based on ANOVA in the case of classification analysis and on univariate linear regression testing in the case of regression analysis.
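
Three of the parameters above (label column index, missing value handling, and feature scaling) can be illustrated with a short sketch. It uses only the Python standard library and is not the code MLpronto generates. The helper names, the use of None as the missing value marker, the choice of the column mean for univariate imputation, and the use of the population standard deviation for scaling are all assumptions made for illustration.

```python
from statistics import fmean, pstdev

MISSING = None  # hypothetical marker for a missing value


def split_label(rows, label_index):
    """Separate the label column from the feature columns.

    Indices follow the convention in the table: 0 is the first
    column and -1 is the last column.
    """
    n_cols = len(rows[0])
    if not -n_cols <= label_index < n_cols:
        raise IndexError("label index out of range")
    idx = label_index if label_index >= 0 else n_cols + label_index
    labels = [row[idx] for row in rows]
    features = [row[:idx] + row[idx + 1:] for row in rows]
    return features, labels


def drop_rows_with_missing(rows):
    """Remove every row that contains at least one missing value."""
    return [row for row in rows if MISSING not in row]


def impute_column_means(rows):
    """Univariate imputation: fill each gap with the mean of its column.

    Assumes numeric columns that each have at least one present value.
    """
    columns = list(zip(*rows))
    means = [fmean([v for v in col if v is not MISSING]) for col in columns]
    return [
        [means[j] if v is MISSING else v for j, v in enumerate(row)]
        for row in rows
    ]


def standardize(values):
    """Scale one feature column to mean 0 and standard deviation 1."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no spread
    return [(v - mean) / std for v in values]


rows = [[1.0, 10.0, "a"], [2.0, MISSING, "b"], [3.0, 30.0, "a"]]
features, labels = split_label(rows, -1)   # the last column holds the labels
filled = impute_column_means(features)     # the gap becomes the column mean
scaled_columns = [standardize(list(col)) for col in zip(*filled)]
print(labels, filled, scaled_columns)
```

In this sketch the label column is split off first so that imputation and scaling touch only the numeric feature columns.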

Classification or Regression

In supervised machine learning, classification problems have labels drawn from a small number of categories, whereas regression problems have numeric labels (integers or decimal numbers) that take many different values.

Training and Testing Data

Normally, in supervised machine learning, data are split into two groups: training and testing. The training data are used to build a machine learning model and the testing data are used to evaluate how well the model performs on new data (i.e., data that did not influence the construction of the model).

In general, the majority of the data are used for training (MLpronto uses 80% by default) and a minority for testing (MLpronto uses 20% by default).
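
The split described above can be sketched in a few lines. This is an illustration using only the Python standard library, not the code MLpronto generates; the function name, the rounding rule for the training set size, and the default seed are assumptions.

```python
import random


def train_test_split(rows, train_percent=80, seed=0):
    """Shuffle the rows reproducibly and split them into two groups."""
    if not 1 <= train_percent <= 99:
        raise ValueError("train_percent must be between 1 and 99")
    indices = list(range(len(rows)))
    # A dedicated, seeded generator gives the same shuffle on every run.
    random.Random(seed).shuffle(indices)
    n_train = round(len(rows) * train_percent / 100)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


data = list(range(10))
train, test = train_test_split(data)  # 80% training, 20% testing
print(len(train), len(test))
```

Because the shuffle is seeded, calling the function again with the same arguments returns the same two groups, so no testing row ever leaks into training between runs.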

Analysis and Output

MLpronto performs a number of analyses (listed below). It also outputs the code (as a Python file and as a Jupyter Notebook) that it generates and executes along with the parameters (as a JSON file) that it uses for the specified dataset.

Analysis	Description
Data visualization with plots	A 2-dimensional (3-dimensional) scatter plot of the data is created via principal component analysis (PCA). In the case of classification, the feature columns are projected along the two (three) most significant principal components and the points in the plot are colored based on their labeled class. In the case of regression, the feature columns are projected along the one (two) most significant principal component(s) (horizontal axes) and the projected data are plotted along with their labeled values (vertical axis). In either case, the percentage of explained variance is reported.
Feature relationships	Relationships between feature columns and the label column are calculated. Correlations between every pair of columns are calculated. Also, various dependencies are calculated. The mutual information indicates the dependency between a feature column and the label column. The F-value and p-value for each feature column relative to the label column are based on ANOVA in the case of classification analysis and on univariate linear regression testing in the case of regression analysis.
Training metrics	The size of (number of points in) the training data is reported along with various measures of the machine learning model's performance on the training data. For classification problems, the performance measures include the accuracy, F1 score, precision, recall, and area under the ROC curve. For regression problems, the performance measures include the R² score, adjusted R² score, mean squared error (MSE), and mean absolute error (MAE).
Testing metrics	The size of (number of points in) the testing data is reported along with various measures of the machine learning model's performance on the testing data. For classification problems, the performance measures include the accuracy, F1 score, precision, recall, and area under the ROC curve. For regression problems, the performance measures include the R² score, adjusted R² score, out of sample R² score, mean squared error (MSE), and mean absolute error (MAE).
Other analyses	For classification problems, a confusion matrix, classification report, receiver operating characteristic (ROC) curve, and precision recall curve (PRC) are shown for the testing data. For regression problems, a plot is generated showing values predicted by the model as compared to actual values, and two plots relating to residuals are generated.
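
Several of the performance measures named above follow directly from their standard textbook definitions. The sketch below computes them with the Python standard library only; it is not MLpronto's implementation, and it leaves out the area under the ROC curve and the out of sample R² score. The function names and the two-class setup with an explicit positive class are assumptions for illustration.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 score for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def regression_metrics(y_true, y_pred, n_features):
    """R², adjusted R², mean squared error, and mean absolute error."""
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("adjusted R² needs more points than features + 1")
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R² penalizes the score for the number of features used.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mse = ss_res / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": r2, "adjusted_r2": adjusted_r2, "mse": mse, "mae": mae}


print(classification_metrics(["yes", "yes", "no", "no", "yes"],
                             ["yes", "no", "no", "yes", "yes"], "yes"))
print(regression_metrics([1, 2, 3, 4], [1, 2, 3, 5], n_features=1))
```

Comparing these values between the training data and the testing data indicates how well a model generalizes: a large gap in favor of the training data suggests overfitting.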

Fidelity of Results

In general, code produced by MLpronto will yield the same results each time it is executed. Many machine learning algorithms employ randomization, and MLpronto seeds random number generation to ensure consistent results. However, there may be exceptional cases where results differ, e.g., if code is executed on the MLpronto webserver using one version of libraries and the same code is then executed on a user's local machine using a different version of libraries.

Libraries used by MLpronto

MLpronto uses the following libraries and versions:

Python	3.9.18
numpy	1.23.2
pandas	1.4.3
sklearn	1.1.2
matplotlib	3.5.3
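
Since differing library versions are one reason results may differ between the webserver and a local machine, it can help to compare a local installation against the versions listed above. The following is a minimal sketch using the standard library's importlib.metadata; the function name and the status strings are assumptions. Note that the package imported as sklearn is distributed under the name scikit-learn.

```python
import platform
from importlib import metadata

# Versions listed above, keyed by distribution name.
EXPECTED = {
    "numpy": "1.23.2",
    "pandas": "1.4.3",
    "scikit-learn": "1.1.2",  # imported as sklearn
    "matplotlib": "3.5.3",
}
EXPECTED_PYTHON = "3.9.18"


def compare_versions(expected=EXPECTED):
    """Report, for each library, the expected and installed versions."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found is None:
            status = "not installed"
        elif found == wanted:
            status = "match"
        else:
            status = "differs"
        report[name] = (wanted, found, status)
    return report


print(f"Python: expected {EXPECTED_PYTHON}, found {platform.python_version()}")
for name, (wanted, found, status) in compare_versions().items():
    print(f"{name}: expected {wanted}, found {found} ({status})")
```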

Source Code

MLpronto source code is available on GitHub

Citing MLpronto

MLpronto: A tool for democratizing machine learning. Tjaden J, Tjaden B. PLoS ONE, 18(11):e0294924, 2023.