
Introduction to Machine Learning-2

Continuing from the previous post, Introduction-to-machine-learning-1.

We now examine the training of Machine Learning Algorithms.

Algorithms are trained with Training Samples (Training Sets) and then tested with Testing Samples (Testing Sets).

Consider the following (fictitious) data:

City          Temperature    Ice Cream Price
Varanasi      40°C           ₹ 100
Varanasi      50°C           ₹ 200
Varanasi      46°C           ₹ 200
Varanasi      44°C           ₹ 100

The Algorithm will quickly learn that a temperature of 44°C or less means an Ice Cream price of ₹ 100, and a temperature of 46°C or more means ₹ 200.

We can create Testing Sets and validate the Algorithm.
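
As a rough sketch of this training-and-testing idea, the fictitious table above can be fed to a small decision tree. scikit-learn and its DecisionTreeRegressor are assumptions of this sketch; the post itself does not name a library.

    # A minimal sketch: train on the fictitious table above, then test.
    # scikit-learn is an assumption here; the post names no particular library.
    from sklearn.tree import DecisionTreeRegressor

    # Training Set: temperature (°C) -> Ice Cream price (₹)
    X_train = [[40], [50], [46], [44]]
    y_train = [100, 200, 200, 100]

    model = DecisionTreeRegressor()
    model.fit(X_train, y_train)

    # Testing Set: temperatures near the learned boundary
    X_test = [[43], [47]]
    print(model.predict(X_test))  # expected: roughly [100, 200]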

Underfitting

Underfitting happens when the Training Set is inadequate and the Algorithm has no proper answer for certain situations.

What happens if our Algorithm is asked the Ice Cream price at 45°C? The Algorithm has no proper answer but can make an educated guess: the average of ₹ 100 and ₹ 200 would be a good answer. What answer should the Algorithm give for 10°C?

The Algorithm isn’t well trained for this. This is high bias and low variance.
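
The educated guess described above can be sketched without any library; the guess_price helper below is purely illustrative and not part of the original post.

    # A rough sketch of the educated guess: average the prices of the nearest
    # known temperatures on either side of the query. guess_price is illustrative.
    train = {40: 100, 44: 100, 46: 200, 50: 200}  # °C -> ₹ (fictitious data)

    def guess_price(temp):
        below = [t for t in train if t <= temp]
        above = [t for t in train if t >= temp]
        if below and above:
            return (train[max(below)] + train[min(above)]) / 2
        # outside the training range (e.g. 10 °C) we can only extrapolate
        nearest = min(train, key=lambda t: abs(t - temp))
        return train[nearest]

    print(guess_price(45))  # 150.0: the average of ₹ 100 and ₹ 200
    print(guess_price(10))  # 100: a poor guess, the Algorithm is underfit here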

The other case is Overfitting.

Consider the following Training Set.

  1. Who is the highest scorer in Maths?
    A. Pappu 100, B. Appu 34, C. Tappu 76
    Answer: Pappu
  2. Who is the lowest scorer in Physics?
    A. Appu 76, B. Tappu 86, C. Pappu 33
  3. Who is the highest scorer in Chemistry?
    A. Tappu 23, B. Pappu 79, C. Appu 10

The answer in each of these cases is Pappu.
So, how does the Algorithm answer the following question?

  4. Who is the highest scorer in Geography?
    A. Tappu 30, B. Pappu 25, C. Appu 77

The Algorithm answers Pappu, even though the correct answer is Appu (77). It has learned a spurious link with the name rather than with the marks.

This is Overfitting. One of the solutions is cross validation. Divide the Training Set into parts, train on one, and then validate with the others. A data set with different names would uncover the problem in this case.
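
A minimal sketch of cross validation follows, assuming scikit-learn's KFold; the eight-sample Training Set is just a stand-in.

    # A minimal sketch of cross validation (scikit-learn's KFold is an assumption).
    # The Training Set is divided into parts; each part is held out once for validation.
    from sklearn.model_selection import KFold

    data = list(range(8))  # stand-in for a Training Set of 8 samples
    kf = KFold(n_splits=4)
    for fold, (train_idx, val_idx) in enumerate(kf.split(data)):
        print(f"fold {fold}: train on {train_idx.tolist()}, validate on {val_idx.tolist()}")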

Regularization is another solution.

This simply means simplification: remove the names here, divide each mark by 100, and reduce everything to values between 0 and 1.

In regression, this amounts to constraining (regularizing, or shrinking) the coefficient estimates towards zero. In other words, the technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting.
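
As one concrete (assumed) instance of this idea, Ridge regression in scikit-learn adds such a penalty and pulls the coefficient estimates towards zero; the numbers below are made up.

    # A sketch of regularization: Ridge regression shrinks coefficients towards zero.
    # scikit-learn and the fictitious numbers are assumptions of this sketch.
    from sklearn.linear_model import LinearRegression, Ridge

    X = [[1, 2], [2, 1], [3, 4], [4, 3]]
    y = [1.1, 0.9, 2.1, 1.9]

    plain = LinearRegression().fit(X, y)
    regularized = Ridge(alpha=10.0).fit(X, y)  # alpha controls how hard we shrink

    print(plain.coef_)        # unconstrained coefficient estimates
    print(regularized.coef_)  # the same model, with coefficients pulled towards zero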

Feature Selection and Dimensionality Reduction

Essentially, this means reducing the number of features put into the Training Set (the example below is fictitious).

Think about this situation where we try and pick a laptop.

Model Name    Manufacturer    Price    Installed Memory    Color    Warranty
MMX-1         Menovo          ₹ 100    5GB                 Black    1 year
…             …               …        …                   …        …

We will need to make decisions for all these features. Suppose you decide the Manufacturer (2 options), then pick the Price (2 options); that already makes 4 combinations. For a product with n features, each with 2 options, you have 2ⁿ combinations to consider.

We can reduce the number of features under consideration, that is, reduce the dimensionality, to achieve a faster, better tuned Algorithm.
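
A rough sketch of the combinatorics and of dropping features; the laptop row and the selected features below are invented for illustration.

    # A rough sketch: with 2 options per feature there are 2^n combinations,
    # so dropping features (reducing dimensionality) shrinks the search quickly.
    # The laptop data and the selected features are purely illustrative.
    laptops = [
        {"Model Name": "MMX-1", "Manufacturer": "Menovo", "Price": 100,
         "Installed Memory": "5GB", "Color": "Black", "Warranty": "1 year"},
    ]

    n_features = len(laptops[0])
    print(2 ** n_features)  # 2^6 = 64 combinations with all six features

    # Feature selection: keep only the features we believe drive the decision
    selected = ["Manufacturer", "Price", "Installed Memory"]
    reduced = [{k: row[k] for k in selected} for row in laptops]
    print(2 ** len(selected))  # only 2^3 = 8 combinations now
    print(reduced)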

 

Steps

So, how would you go about implementing a Machine Learning Solution?

  1. Define the Machine Learning Problem to be solved. Let us say we have an e-commerce company:
    how many items of how many types should we stock?
  2. Initial Information input. Get initial information from an expert, or go for exploration, a market survey, etc., and gather some data.
  3. Get Data from the Information. Process the information: rectify, classify, discover features and dimensions, and create Training Sets and Testing Sets.
  4. Machine Learning Modeling. Create a Machine Learning Algorithm and train it.
  5. Machine Learning Algorithm Testing. Test the Algorithm, go back to previous steps if necessary, and test again (a rough sketch of steps 3-5 follows this list).
  6. Deploy the solution.
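
Here is a rough sketch of steps 3-5, assuming scikit-learn; the demand data is invented purely to make the skeleton runnable.

    # A rough sketch of steps 3-5: create Training and Testing Sets, train, then test.
    # scikit-learn and the invented demand data are assumptions of this sketch.
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Fictitious data: [temperature °C, city population in lakhs] -> demand level
    X = [[40, 12], [50, 12], [30, 3], [25, 3], [45, 8], [20, 5]]
    y = ["high", "high", "low", "low", "high", "low"]

    # Step 3: create Training Sets and Testing Sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    # Step 4: train the Machine Learning Algorithm
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 5: test, and go back to earlier steps if the score is poor
    print(model.score(X_test, y_test))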

 

Some more steps.

  1. Supply Missing Values
    How do we deal with missing values? Suppose there are no rainfall values for a city for some duration. You can get the data from another source and rectify the gap. If not, try to deduce it from the available data, for example by taking the mean of the neighbouring readings (see the sketch after this list).
  2. Encoding of Labels
    Go back to our example with Pappu, Tappu, etc.: use roll numbers instead of names.
    Another type of encoding is the following.
    Again the data here is fictitious, and the pass mark is 40.
    Rollno    Physics    Chemistry    Maths
    1         45         34           77
    2         33         76           98
    Marking each subject as pass (1) or fail (0), Roll 1 becomes 101 and Roll 2 becomes 011; with the roll numbers prefixed, these read 1101 and 2011.
  3. Scaling
    Scale all values to a single range: percentages, or values between 0 and 1.
  4. Specializing or Partitioning
    Go back to our e-commerce example. We might need different Machine Learning Algorithms for different geographical areas or products.
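
The sketch below walks through the first three of these steps on the fictitious marks table (pass mark 40) and an invented rainfall series; no library is assumed.

    # A rough sketch of steps 1-3: fill a missing value with the mean of its
    # neighbours, encode each mark as pass/fail, and scale marks to 0-1.
    # The rainfall series is invented; the marks and pass mark 40 are from the post.
    marks = {1: [45, 34, 77], 2: [33, 76, 98]}  # Rollno -> [Physics, Chemistry, Maths]
    rainfall = [12.0, None, 18.0]               # a missing rainfall reading

    # 1. Supply Missing Values: take the mean of the neighbouring readings
    rainfall[1] = (rainfall[0] + rainfall[2]) / 2
    print(rainfall)  # [12.0, 15.0, 18.0]

    # 2. Encoding of Labels: pass mark is 40, so each mark becomes 1 (pass) or 0 (fail)
    encoded = {roll: [1 if m >= 40 else 0 for m in row] for roll, row in marks.items()}
    print(encoded)   # {1: [1, 0, 1], 2: [0, 1, 1]}

    # 3. Scaling: divide by 100 so every mark lies between 0 and 1
    scaled = {roll: [m / 100 for m in row] for roll, row in marks.items()}
    print(scaled)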

 
