When a data scientist has chosen a target variable (the "column" in a spreadsheet they wish to predict) and has done the prerequisite work of transforming the data and building a model, one of the most important steps in the process is evaluating the model's performance.

Choosing a performance metric often depends on the business problem being solved. Let’s say you have 100 examples in your data and you’ve fed each one to your model and received a classification. The predicted vs. actual classification can be charted in a table called a confusion matrix.

| | Negative (predicted) | Positive (predicted) |
|---|---|---|
| Negative (actual) | 98 | 0 |
| Positive (actual) | 1 | 1 |

The table above describes an output of negative vs. positive. These two outcomes are the "classes" of each example. Because there are only two classes, the model used to generate the confusion matrix can be described as a *binary classifier*.

To better interpret the table, you can also see it in terms of true positives, false negatives, etc.

| | Negative (predicted) | Positive (predicted) |
|---|---|---|
| Negative (actual) | true negative | false positive |
| Positive (actual) | false negative | true positive |

Overall, how often is our model correct?

As a heuristic, accuracy can immediately tell us whether a model is being trained correctly and how it may perform generally. However, it does not give detailed information regarding its application to the problem.
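As a concrete sketch, accuracy can be computed directly from the counts in the example confusion matrix above (the variable names are illustrative, not from any particular library):

```python
# Counts from the example confusion matrix above
tn, fp = 98, 0   # actual negatives: correctly rejected vs. false alarms
fn, tp = 1, 1    # actual positives: missed vs. caught

# Accuracy = (TP + TN) / total predictions
accuracy = (tp + tn) / (tn + fp + fn + tp)
print(accuracy)  # 0.99 -- high, even though the model missed half the actual positives
```

Note how a 99% accuracy here masks the fact that the model caught only one of the two actual positives, which is why the more targeted metrics below matter.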

When the model predicts positive, how often is it correct?

Precision is the metric to watch when the cost of false positives is high. Let's assume the business problem involves the detection of skin cancer. A model with very low precision would tell many patients they have melanoma when they do not, resulting in unnecessary follow-up tests and stress.
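Using the counts from the example matrix above, precision can be sketched as (illustrative variable names):

```python
# Counts from the example confusion matrix above
tp, fp = 1, 0

# Precision = TP / (TP + FP): of all positive predictions, how many were right?
precision = tp / (tp + fp)
print(precision)  # 1.0 -- this model raised no false alarms in the sample
```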

Recall is the metric to watch when the cost of false negatives is high. What if our problem requires checking for a fatal virus such as Ebola? If many patients are told they don't have Ebola when they actually do, the likely result is widespread infection and an epidemiological crisis.
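Again using the counts from the example matrix above, recall can be sketched as:

```python
# Counts from the example confusion matrix above
tp, fn = 1, 1

# Recall = TP / (TP + FN): of all actual positives, how many did we catch?
recall = tp / (tp + fn)
print(recall)  # 0.5 -- half of the actual positives were missed
```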

F1 is a helpful single measure of a test's accuracy: it is the harmonic mean of precision and recall. An F1 score of `1` is perfect, and `0` is a total failure.
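A minimal sketch of the F1 computation, assuming the precision and recall values derived from the example matrix above:

```python
# Precision and recall from the example confusion matrix above
precision, recall = 1.0, 0.5

# F1 = harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.667
```

The harmonic mean punishes imbalance: despite perfect precision, the low recall drags F1 well below a naive average of the two.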