The framework comprises six steps.

Steps in Data Modelling:

  1. Problem Definition: "What problem are we trying to solve?"

    • Supervised.
    • Unsupervised.
    • Classification.
    • Regression.
  2. Data: "What data do we have?"

    • Structured.
    • Unstructured.
  3. Evaluation: "What defines success for us?"

  4. Features: "What do we already know about the data?"

  5. Modelling: "Based on our problem and data, what model should we use?"

  6. Experimentation: "How could we improve? / What else can we try?"

1. Types of Machine Learning problems

Supervised Learning:

It is called supervised learning because we have both data and labels. A machine learning algorithm tries to use the data to predict a label.

If it guesses the label wrong, the algorithm corrects itself and tries again. This active correction is why it is called supervised.

Main types of Supervised Learning problems:

  1. Classification:

    • "Is this example one thing or another?"
    • Binary classification = two options
    • Multi-class classification = more than two options
  2. Regression:

    • "How much will this house sell for?"
    • "How many people will buy this app?"

Unsupervised Learning

It has data but no labels.

Main types of Unsupervised Learning problems:

  1. Clustering:
    • Grouping similar samples together (no labels needed).

Transfer Learning

Use a pre-built model from a similar problem and customize (fine-tune) it for your own problem, so it handles cases the pre-built model doesn't cover.

Reinforcement Learning

  • +1 reward if the model acts correctly.
  • -1 penalty if the model acts incorrectly.
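
A toy sketch of the +1/-1 reward idea (my own illustration; real reinforcement learning libraries are far more involved):

```python
import random

# The "model" keeps a score per action and favours actions that earned +1.
scores = {"left": 0, "right": 0}
correct_action = "right"   # hidden from the model

for _ in range(20):
    # Mostly pick the best-scoring action, sometimes explore at random.
    if random.random() < 0.3:
        action = random.choice(list(scores))
    else:
        action = max(scores, key=scores.get)
    scores[action] += 1 if action == correct_action else -1  # +1 correct, -1 wrong

print(scores)   # "right" typically ends up with the higher score
```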

2. Data

Types of Data

  1. Structured data:

    • CSV (comma-separated values).
    • Excel sheets.
  2. Unstructured data:

    • Images.
    • Audio/voice.
    • Natural language text.
  • Static data: data that does not change over time, e.g. a CSV file.

  • Streaming data: data that changes over time, e.g. stock prices, news headlines.

Workflow

Static/streaming data => Jupyter Lab => data analysis (pandas) => visualisation (matplotlib/plotly) => machine learning model (TensorFlow/scikit-learn) => results
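
A minimal sketch of that pipeline in Python, assuming a hypothetical `heart-disease.csv` with an `age` feature and a `target` column (all names are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("heart-disease.csv")   # static data in, pandas for analysis
print(df.describe())                    # quick data analysis

df["age"].plot(kind="hist")             # matplotlib for visualisation
plt.show()

X, y = df.drop("target", axis=1), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier().fit(X_train, y_train)   # scikit-learn model
print(model.score(X_test, y_test))                       # results
```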

3. Evaluation: "What defines success for us?"

Different types of metrics:

| Classification | Regression | Recommendation |
| --- | --- | --- |
| Accuracy | Mean absolute error (MAE) | Precision at K |
| Precision | Mean squared error (MSE) | |
| Recall | Root mean squared error (RMSE) | |
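
All of these classification and regression metrics are available in scikit-learn; a quick sketch with toy values:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             mean_absolute_error, mean_squared_error)

# Classification metrics
y_true, y_pred = [1, 0, 1, 1], [1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))    # 0.75 (3 of 4 correct)
print(precision_score(y_true, y_pred))   # 1.0  (no false positives)
print(recall_score(y_true, y_pred))      # ~0.67 (one positive missed)

# Regression metrics
y_true_r, y_pred_r = [3.0, 5.0, 2.0], [2.5, 5.5, 2.0]
print(mean_absolute_error(y_true_r, y_pred_r))           # MAE
print(mean_squared_error(y_true_r, y_pred_r))            # MSE
print(np.sqrt(mean_squared_error(y_true_r, y_pred_r)))   # RMSE
```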

4. Features: "What do we already know about the data?"

Use feature variables (weight, heart rate) to predict the target variable (heart disease?).

  • Numerical feature: a number, like body weight.

  • Categorical feature: one of a limited set of options, like yes/no.

  • Derived feature: look at the data and create a new feature from the existing ones.

    Feature Engineering: Looking at different features of data and creating new ones/altering existing ones.


Feature Coverage: How many samples have different features? Ideally, every sample has the same features.

What are the features of your problem?
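
A small pandas sketch of a derived feature and a feature-coverage check (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"weight_kg": [70, 95, 62], "height_m": [1.75, 1.80, 1.60]})

# Derived feature: create BMI from two existing features (feature engineering).
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Feature coverage: what fraction of samples actually has each feature?
print(df.notna().mean())   # 1.0 per column means 100% coverage
```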


5. Modelling:

I. Choosing and training a model. [Training data]

II. Tuning a model. [Validation data]

III. Model comparison. [Test data]

The most important concept in machine learning.

(The training, validation and test sets, or the $\boxed{\text{3 sets}}$.)

Split data into :

  • Training [Train your model on this]: 70-80%
  • Validation [Tune your model on this]: 10-15%
  • Test [Test and compare on this]: 10-15%

All the splits are different (no samples are shared between them).
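
One common way to get the 3 sets is to call scikit-learn's `train_test_split` twice (a sketch; `X` and `y` as in the features section, and the exact percentages are a convention, not a rule):

```python
from sklearn.model_selection import train_test_split

# First carve off 15% of the data as the test set...
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15)

# ...then split the remainder so validation is ~15% of the original data,
# leaving ~70% for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85)
```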

When do things go wrong?

When the test set is revealed to the model before the final evaluation.

Modelling: Picking the model

Problem-1 => Model-1

Problem-2 => Model-2

Structured Data

  • CatBoost
  • dmlc XGBoost
  • Random Forest

Unstructured Data

  • Deep learning.
  • Transfer learning.

Training a Model

X (data):

- Weight
- Sex
- Heart rate
- Chest pain

y (label):

- Heart disease?

Goal: minimise the time between experiments.

Experiments:

  1. Input => Model-1 => Accuracy | Training-Time | Prediction-Time
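
A sketch of such an experiment loop (the models are illustrative; `X_train`, `y_train`, `X_val`, `y_val` come from the split sketch above):

```python
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

for name, model in [("RandomForest", RandomForestClassifier()),
                    ("LogisticRegression", LogisticRegression(max_iter=1000))]:
    start = time.time()
    model.fit(X_train, y_train)            # training time
    train_time = time.time() - start

    start = time.time()
    accuracy = model.score(X_val, y_val)   # accuracy on validation data
    pred_time = time.time() - start

    print(f"{name}: accuracy={accuracy:.2f}, "
          f"train={train_time:.2f}s, predict={pred_time:.2f}s")
```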

Things to remember

  • Some models work better than others on different problems.
  • Don't be afraid to try things.
  • Start small and build up (add complexity) as you need.

Modelling: Tuning

Random Forest:

Allows you to adjust the number of trees.

Neural network:

Allows you to adjust the number of hidden layers.
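
For example, a sketch of tuning the number of trees in a random forest against the validation set (variables as in the earlier sketches):

```python
from sklearn.ensemble import RandomForestClassifier

# Try a few values of the n_estimators hyperparameter (number of trees)
# and keep whichever scores best on the validation set, not the test set.
for n_trees in [10, 100, 500]:
    model = RandomForestClassifier(n_estimators=n_trees).fit(X_train, y_train)
    print(n_trees, model.score(X_val, y_val))
```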

Things to remember:

  • Machine learning models have hyperparameters you can adjust.
  • A model's first results aren't its last.
  • Tuning can take place on training or validation data sets.

Modelling: Comparison

Testing a model

Right Way:

  • Balanced [Goldilocks Zone]

| Data set | Performance |
| --- | --- |
| Training | 98% |
| Test | 96% |

Wrong Way:

  • Underfitting [Potential]

| Data set | Performance |
| --- | --- |
| Training | 64% |
| Test | 47% |

  • Overfitting [Potential]

| Data set | Performance |
| --- | --- |
| Training | 93% |
| Test | 99% |
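
A sketch of this comparison in code (the thresholds are illustrative; `model` and the splits come from the earlier sketches):

```python
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

if train_score - test_score > 0.1:
    print("Likely overfitting: great on training, poor on test.")
elif train_score < 0.7:
    print("Likely underfitting: poor even on training data.")
else:
    print(f"Balanced: train={train_score:.2f}, test={test_score:.2f}")
```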

Overfitting and underfitting

Data Leakage:

This happens when test data leaks into the training data, which produces misleadingly good (overfit) test results.

Data Mismatch:

This happens when the training data is different from the test data, which leads to underfitting.

Fixes for overfitting and underfitting

Underfitting:

  • Try a more advanced model.
  • Increase model hyperparameters.
  • Reduce amount of features.
  • Train longer.

Overfitting:

  • Collect more data.
  • Try a less advanced model.

Things to remember

  • Want to avoid overfitting and underfitting (head towards generality).
  • Keep the test set separate at all costs.
  • Compare apples to apples.
  • One best performance metric does not equal the best model.

Overfitting and Underfitting Definitions

All experiments should be conducted on different portions of your data.

Training data set — Use this set for model training, 70–80% of your data is the standard.

Validation/development data set — Use this set for model hyperparameter tuning and experimentation evaluation, 10–15% of your data is the standard.

Test data set — Use this set for model testing and comparison, 10–15% of your data is the standard.

These amounts can fluctuate slightly, depending on your problem and the data you have.

Poor performance on training data means the model hasn't learned properly and is underfitting. Try a different model, improve the existing one through hyperparameter tuning, or collect more data.

Great performance on the training data but poor performance on test data means your model doesn't generalize well. Your model may be overfitting the training data. Try using a simpler model or making sure the test data is of the same style as the data your model is training on.

Another form of overfitting can come in the form of better performance on test data than training data. This may mean your testing data is leaking into your training data (incorrect data splits) or you've spent too much time optimizing your model for the test set data. Ensure your training and test datasets are kept separate at all times and avoid optimizing a model's performance on the test set (use the training and validation sets for model improvement).

Poor performance once deployed (in the real world) means there’s a difference in what you trained and tested your model on and what is actually happening. Ensure the data you're using during experimentation matches up with the data you're using in production.


Experimentation: "What have we tried? / What else can we try?"

Keep experimenting with inputs and hyperparameters until you hit the sweet spot.