A tool for rapid, robust, and
reproducible machine learning.
No ML background needed whatsoever.
MLpronto supports the democratization of machine learning.
MLpronto runs several of the more common machine learning algorithms without requiring users to write or read any programming code. Through the web interface, users choose a data file, and MLpronto analyzes the data according to the selected machine learning options.
For those users who prefer engaging with programming code, MLpronto generates code that can be used straightaway for analysis of data with machine learning algorithms, and the code can be customized and built upon for rapid development of machine learning projects.
Currently, MLpronto supports the supervised machine learning tasks of classification and regression.
1. A file of data and machine learning options are selected by the user.
2. MLpronto generates code to analyze the file with machine learning algorithms based on the user's selections.
3. MLpronto executes the machine learning algorithms on the data.
4. MLpronto reports the results of the analysis as well as the code it generated.
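The four steps above can be sketched in a few lines of Python. This is only an illustration of the workflow using pandas and scikit-learn, not the code MLpronto actually generates. The option names (has_header, label_column, test_fraction, seed), the choice of a decision tree, and the synthetic data are all assumptions made for the example.

```python
import io

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: a data file and options chosen by the user.
# The data are synthetic and the option names are hypothetical.
CSV_TEXT = "length,width,kind\n" + "".join(
    f"{i},{i % 7},{'big' if i >= 20 else 'small'}\n" for i in range(40)
)
options = {"has_header": True, "label_column": -1, "test_fraction": 0.2, "seed": 0}


def analyze(csv_text, options):
    """Steps 2-4: load the data, train a model, and report results."""
    header = 0 if options["has_header"] else None
    df = pd.read_csv(io.StringIO(csv_text), header=header)

    # Separate the label column from the feature columns.
    label_name = df.columns[options["label_column"]]
    y = df[label_name]
    X = df.drop(columns=label_name)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=options["test_fraction"], random_state=options["seed"]
    )
    model = DecisionTreeClassifier(random_state=options["seed"])
    model.fit(X_train, y_train)

    return {
        "n_train": len(X_train),
        "n_test": len(X_test),
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
    }


results = analyze(CSV_TEXT, options)
print(results)
```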
MLpronto requires a file of data in one of the following formats:
.txt | Text file, either comma or tab delimited |
.csv | Text file, comma separated values |
.tsv | Text file, tab separated values |
.xls | Excel file, older version |
.xlsx | Excel file, newer version |
.xlsm | Excel file, with macros |
.xlsb | Excel file, binary |
.ods | OpenDocument spreadsheet |
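For users working with the data locally, the formats above can all be read with pandas. The following is a minimal sketch, assuming pandas is installed. The function name load_table and its has_header parameter are hypothetical and are not part of MLpronto. Reading spreadsheet formats also requires optional pandas engines (for example openpyxl for .xlsx or odfpy for .ods).

```python
from pathlib import Path

import pandas as pd

# Spreadsheet extensions handled by pandas.read_excel (hypothetical grouping).
SPREADSHEET_EXTENSIONS = {".xls", ".xlsx", ".xlsm", ".xlsb", ".ods"}


def load_table(path, has_header=True):
    """Read a delimited text file or a spreadsheet into a DataFrame."""
    suffix = Path(path).suffix.lower()
    header = 0 if has_header else None
    if suffix == ".csv":
        return pd.read_csv(path, sep=",", header=header)
    if suffix == ".tsv":
        return pd.read_csv(path, sep="\t", header=header)
    if suffix == ".txt":
        # sep=None asks pandas to detect the delimiter (comma or tab).
        return pd.read_csv(path, sep=None, engine="python", header=header)
    if suffix in SPREADSHEET_EXTENSIONS:
        return pd.read_excel(path, header=header)
    raise ValueError(f"Unsupported file extension: {suffix}")
```

For example, load_table("penguins.csv") would return a DataFrame whose column names come from the first row, while load_table("data.tsv", has_header=False) would number the columns 0, 1, 2, and so on. Both file names are placeholders.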
Every row of the file should have the same number of columns
The file may have a header row or not
There should be one column corresponding to the label, i.e., dependent variable, output, target, response, or class (indicated in advanced parameter settings)
Every column (excepting its header) should contain either numbers or else instances of a small number of different categories. A column should not contain text where most items in the column are unique.
7 | 3.14 | yes | Comedy | ID1 | Idris | why | Non-stop acting |
-123 | 17000 | no | Drama | ID2 | Basia | 105 | exquisite on so many levels |
42 | -1.5 | yes | Drama | ID3 | Laila | no way | well - I liked the credits anyway |
1285 | -826.29 | yes | Action & Advntr | ID4 | Zahara | ??? | What is this even about? |
400 | 0.1251 | | Documentary | ID5 | Nia | FOMO | i laughed i cried then i entered the theater |
88 | 47.238 | no | Comedy | ID6 | Javier | bougie | mercurial filmmaking |
45362 | -9 | | Drama | ID7 | Kamal | hmm... | kept me gussing. |
-91 | 62.609 | yes | Action & Advntr | ID8 | Lina | 0.33 | I watched for free. It was overpriced. |
-5573 | | no | Comedy | ID9 | Azami | Ack! | intense story with a poetic ending |
13 | 4296.8 | no | Documentary | ID10 | Malik | mayhaps | a magisterial portrayal |
The data may contain missing values (handling of missing values is indicated in advanced parameter settings)
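The column requirements above can be checked before uploading a file. The sketch below uses four columns from the example table and flags any non-numeric column in which most values are unique. The column names, the function name, and the 0.5 threshold are assumptions chosen for illustration; MLpronto may apply different criteria.

```python
import pandas as pd

# Four columns taken from the example table above; the column names are hypothetical.
df = pd.DataFrame({
    "rating": [3.14, 17000, -1.5, -826.29, 0.1251, 47.238, -9, 62.609, None, 4296.8],
    "recommended": ["yes", "no", "yes", "yes", None, "no", None, "yes", "no", "no"],
    "genre": ["Comedy", "Drama", "Drama", "Action & Advntr", "Documentary",
              "Comedy", "Drama", "Action & Advntr", "Comedy", "Documentary"],
    "viewer_id": [f"ID{i}" for i in range(1, 11)],
})


def flag_unsuitable_columns(df, max_unique_fraction=0.5):
    """Return names of non-numeric columns where most values are unique."""
    flagged = []
    for name in df.columns:
        column = df[name].dropna()
        if pd.api.types.is_numeric_dtype(column):
            continue  # numeric columns are acceptable
        if len(column) > 0 and column.nunique() / len(column) > max_unique_fraction:
            flagged.append(name)
    return flagged


print(flag_unsuitable_columns(df))  # columns that look like free text or identifiers
print(df.isna().sum())              # number of missing values per column
```

Here the identifier column is flagged because every value in it is different, whereas the yes/no and genre columns contain only a few repeated categories.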
Below are some example files that can be used to test out MLpronto:
Domain | Classification or Regression | Has header row | Column containing label | Contains missing values | CSV format | TSV format | XLSX format | ODS format |
Penguin species | Classification | ✔ | -1 | ✔ | .csv | .tsv | .xlsx | .ods |
Airline flight delays | Classification | ✔ | -1 | | .csv | .tsv | .xlsx | .ods |
Parkinson's disease | Classification | | 0 | | .csv | .tsv | .xlsx | .ods |
Health insurance cost | Regression | ✔ | -1 | ✔ | .csv | .tsv | .xlsx | .ods |
Movie revenue | Regression | ✔ | -1 | | .csv | .tsv | .xlsx | .ods |
Price of diamonds | Regression | ✔ | -1 | | .csv | .tsv | .xlsx | .ods |
In general, in supervised machine learning, the labels in a classification problem belong to a small number of categories. In a regression problem, the labels are numbers (integers or decimal numbers) that typically take many different values.
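As a rough illustration of this distinction, the following heuristic guesses the task from a label column. It is a sketch only: the function name and the cutoff of 20 distinct values are assumptions, and it does not describe how MLpronto itself decides.

```python
import pandas as pd


def suggest_task(labels, max_classes=20):
    """Guess whether a label column suits classification or regression."""
    labels = pd.Series(labels).dropna()
    if not pd.api.types.is_numeric_dtype(labels):
        return "classification"  # text labels are categories
    if labels.nunique() <= max_classes:
        return "classification"  # only a few distinct numeric values
    return "regression"          # many distinct numeric values


print(suggest_task(["Adelie", "Gentoo", "Adelie"]))
print(suggest_task([i * 1.5 for i in range(100)]))
```

A heuristic like this can be wrong, for instance when a regression target happens to take only a few distinct values, so the final choice should rest with the user.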
Normally, in supervised machine learning, data are split into two groups: training and testing. The training data are used to build a machine learning model, and the testing data are used to evaluate how well the model performs on new data (i.e., data that did not influence the construction of the model).
In general, the majority of the data are used for training (MLpronto uses 80% by default) and a minority for testing (MLpronto uses 20% by default).
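An 80%/20% split of this kind can be made with scikit-learn's train_test_split function. The sketch below uses synthetic data and a seed of 0; both are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows with 2 features each, and a binary label.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# test_size=0.2 holds out 20% of the rows for testing; the other 80% are for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

No row appears in both groups, so the testing data have no influence on a model fitted to the training data.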
MLpronto performs a number of analyses (listed below). It also outputs the code (as a Python file and as a Jupyter Notebook) that it generates and executes along with the parameters (as a JSON file) that it uses for the specified dataset.
In general, code produced by MLpronto will yield the same results each time it is executed. Many machine learning algorithms employ randomization, and MLpronto seeds random number generation to ensure consistent results. However, there may be exceptional cases where results differ, e.g., if code is executed on the MLpronto webserver using one version of libraries and the same code is then executed on a user's local machine using a different version of libraries.
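The effect of seeding can be demonstrated with a short experiment: running the same randomized analysis twice with the same seed gives the same score. This is a sketch using scikit-learn with synthetic data. The seed value, the function name, and the choice of a random forest are assumptions, not a description of MLpronto's generated code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 0  # hypothetical seed value


def run_once(seed):
    """Generate data, split it, train a random forest, and return test accuracy."""
    X, y = make_classification(n_samples=200, n_features=8, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # random_state fixes the randomness used when building the forest.
    model = RandomForestClassifier(n_estimators=25, random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


first = run_once(SEED)
second = run_once(SEED)
print(first == second)
```

Omitting random_state in any of these calls would let the results vary from one run to the next.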
MLpronto uses the following libraries and versions:
Python | 3.9.18 |
numpy | 1.23.2 |
pandas | 1.4.3 |
sklearn | 1.1.2 |
matplotlib | 3.5.3 |
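To see whether a local environment matches the versions listed above, the installed versions can be compared using Python's standard library. This is a minimal sketch; the function name is hypothetical. Note that the library imported as sklearn is distributed under the package name scikit-learn.

```python
import platform
from importlib import metadata

# Versions listed above; sklearn is installed under the name "scikit-learn".
EXPECTED_PYTHON = "3.9.18"
EXPECTED = {
    "numpy": "1.23.2",
    "pandas": "1.4.3",
    "scikit-learn": "1.1.2",
    "matplotlib": "3.5.3",
}


def compare_versions(expected=EXPECTED):
    """Map each package to (expected version, installed version, whether they match)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed locally
        report[package] = (wanted, installed, installed == wanted)
    return report


print("Python", EXPECTED_PYTHON, platform.python_version())
for package, (wanted, installed, matches) in compare_versions().items():
    print(package, wanted, installed, "match" if matches else "differs")
```

A mismatch does not necessarily change the results, but it is the first thing to check if local results differ from those reported by the webserver.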
MLpronto source code is available on GitHub
MLpronto: A tool for democratizing machine learning. Tjaden J, Tjaden B. PLoS ONE, 18(11):e0294924, 2023.