scikit-learn Associate study guide

Fundamental ML

Proficiency in fundamental machine learning algorithms, including identifying when and how to use each kind of model. This includes recognizing when classical machine learning models are sufficient and deep learning would be overkill.

Programming skills

Proficiency in Python, particularly in using libraries such as scikit-learn, Pandas, and NumPy.

Data manipulation

Ability to clean, manipulate, and preprocess data using Python libraries.

Data visualization

Using Python plotting tools and interpreting the resulting plots effectively to build robust data-driven solutions.

Statistical knowledge

Basic understanding of statistics, probability, and hypothesis testing to interpret model results.

Model evaluation

Familiarity with techniques for evaluating model performance, such as cross-validation, confusion matrices, and ROC curves.

Attention to detail

Strong attention to detail to ensure data accuracy and model reliability.

Problem solving

Basic problem-solving skills with a logical approach to analyzing and addressing issues. This includes making design choices for data pipelines and for how they are evaluated.

Machine learning concepts

Types of Machine Learning: Supervised, Unsupervised, and Semi-supervised learning.
Model Families: Tree-based, Linear, Ensemble, Neighbors.
Key concepts (features, labels, training and test sets)
Model overfitting and underfitting
Bias/variance trade-off
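
As an illustration of overfitting and underfitting, the sketch below trains decision trees of increasing depth on synthetic, deliberately noisy data and compares training accuracy with test accuracy. The dataset, the depths, and the noise level are assumptions chosen for the example, not part of the syllabus.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so a deep tree can memorize it.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth.
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])
```

A very shallow tree tends to score poorly on both sets (underfitting, high bias), while the fully grown tree fits the training set perfectly but scores lower on the test set (overfitting, high variance).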

Data preprocessing

Model building and evaluation

Splitting datasets into training and testing sets using train_test_split
Training ML models using the fit() method
Making predictions using the predict() method
Evaluating model performance with the most common metrics (accuracy, precision, recall, F1 score, confusion matrix, mean squared error, R-squared)
Interpreting scores relative to dummy (baseline) models
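
The steps listed above can be sketched end to end as follows. This is a minimal example, assuming the built-in breast cancer dataset and a scaled logistic regression as the model; any other estimator would follow the same fit/predict pattern. For regression tasks, mean_squared_error and r2_score from sklearn.metrics are used in the same way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the data for testing; stratify keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Train the model on the training set only, then predict on the test set.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

model_acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("baseline accuracy:", baseline_acc)
print("model accuracy:   ", model_acc)
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("F1 score: ", f1_score(y_test, y_pred))
print("confusion matrix:\n", cm)
```

Comparing against the dummy model puts the score in context: an accuracy figure only means something once it is clearly above what a trivial predictor achieves on the same split.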

Model selection and validation

Understanding and implementing cross-validation techniques (KFold, ShuffleSplit, etc.)
Learning and validation curves
Performing hyperparameter tuning using GridSearchCV and RandomizedSearchCV
Stability of learned coefficients across splits
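
The topics above can be illustrated with one short sketch. It assumes the built-in breast cancer dataset and a scaled logistic regression; the parameter grid, the sampling distribution for C, and the number of splits are arbitrary choices made for the example.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    RandomizedSearchCV,
    ShuffleSplit,
    cross_val_score,
    cross_validate,
    learning_curve,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Two cross-validation strategies: disjoint folds vs. repeated random splits.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)
shuffle = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
shuffle_scores = cross_val_score(model, X, y, cv=shuffle)
print("KFold:", kfold_scores.mean(), "ShuffleSplit:", shuffle_scores.mean())

# Learning curve: scores as a function of the training set size.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=kfold, train_sizes=[0.2, 0.5, 1.0]
)

# Exhaustive search over a small grid of regularization strengths.
grid = GridSearchCV(model, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=kfold)
grid.fit(X, y)
print("grid search best:", grid.best_params_, grid.best_score_)

# Randomized search samples C from a log-uniform distribution instead.
random_search = RandomizedSearchCV(
    model,
    {"logisticregression__C": loguniform(1e-3, 1e2)},
    n_iter=8,
    cv=kfold,
    random_state=0,
)
random_search.fit(X, y)
print("random search best:", random_search.best_params_)

# Coefficient stability: keep the fitted model from each split and compare.
cv_results = cross_validate(model, X, y, cv=kfold, return_estimator=True)
coefs = np.array([est[-1].coef_.ravel() for est in cv_results["estimator"]])
coef_std = coefs.std(axis=0)  # one standard deviation per feature
print("largest coefficient std across splits:", coef_std.max())
```

A coefficient whose value or sign changes substantially from split to split should be interpreted with caution, even when the cross-validated score itself looks stable.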

Interpretation of results & communication

Visualizing model results using basic plotting techniques (matplotlib, seaborn)
Interpreting and communicating model outputs and performance metrics to non-technical stakeholders

Recommended training and resources