Machine Learning


https://spark.apache.org/docs/latest/mllib-guide.html

Algorithms:

Classification & Regression

  1. Linear Regression
  2. Decision Trees
  3. Naive Bayes
  4. Logistic Regression
  5. SVM
Clustering

  1. k-Means
  2. Gaussian Mixture
  3. Power Iteration Clustering
  4. Latent Dirichlet Allocation  
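The clustering algorithms follow the same train/predict pattern described further below. A minimal k-Means sketch, assuming an existing SparkContext `sc` and toy 2-D points (both are assumptions for illustration):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assume `sc` is an existing SparkContext; the points here are toy 2-D vectors.
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    // Train with k = 2 clusters and at most 20 iterations.
    val model = KMeans.train(points, 2, 20)

    // Assign a new point to its nearest cluster center.
    val cluster = model.predict(Vectors.dense(8.9, 9.2))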

Backbone of Spark MLlib:

  1. RDD - aggregate, treeAggregate
  2. Optimizer (see the composition sketch after this list)
    • Stochastic Gradient Descent
    • L-BFGS
  3. Gradient
    • Logistic
    • Least Squares
    • Hinge
  4. Updater/Regularizer
    • Squared L2
    • L1
    • Simple
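These pieces compose: the optimizer drives the iterations, calling a Gradient for the loss and an Updater for the regularization step. A minimal sketch against the developer API in org.apache.spark.mllib.optimization, assuming `data` is an existing RDD[(Double, Vector)] of (label, features) pairs and 10 features (both assumptions):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

    // `data` is assumed to be an RDD[(Double, Vector)] of (label, features) pairs.
    val numFeatures = 10
    val initialWeights = Vectors.dense(new Array[Double](numFeatures))

    val (weights, lossHistory) = LBFGS.runLBFGS(
      data,
      new LogisticGradient(),   // Gradient: logistic loss
      new SquaredL2Updater(),   // Updater: squared-L2 regularization
      10,                       // numCorrections (L-BFGS history size)
      1e-4,                     // convergence tolerance
      100,                      // max iterations
      0.01,                     // regularization parameter
      initialWeights)

Swapping the Gradient and Updater (e.g., Hinge + L1) yields a different linear model from the same optimizer.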

General Usage & Architecture:

Linear Models:

Usage

  1. Prepare the data and wrap it in "LabeledPoint" instances, each of which groups the target/output label together with the input features
  2. Call the "train" method of the respective algorithm's singleton object, passing the algorithm-specific parameters
  3. This returns the model you have trained
  4. Your model is now ready for use: pass new features to its "predict" method to guess their output (see the sketch after the flow below)

Data -> Cleaning -> LabeledPoint -> Algorithm.train -> AlgorithmModel -> predict -> Guessed target
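A minimal sketch of that flow with logistic regression, assuming an existing SparkContext `sc` and toy data:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // 1. Wrap each (label, features) pair in a LabeledPoint.
    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
      LabeledPoint(0.0, Vectors.dense(-1.0, -2.0))))

    // 2. Call `train` on the singleton object with the algorithm-specific parameters.
    val model = LogisticRegressionWithSGD.train(training, 100)  // 100 iterations

    // 3./4. The returned model is ready; `predict` guesses the target for new features.
    val guess = model.predict(Vectors.dense(0.5, 1.5))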

Implementation

  1. Repeat the steps below until the weights converge or the given number of iterations is reached
  2. The driver broadcasts the current "weights" to each worker
  3. Each task (running on the workers) samples a mini-batch of B records from its data partition
  4. Each task computes the gradient over its sampled records
  5. The gradients are aggregated back to the driver (currently "RDD.treeAggregate" is used)
  6. The driver updates the weights
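A conceptual sketch of that loop over plain RDD operations (broadcast, sample, treeAggregate). The helper name, the least-squares gradient, and the fixed step size are simplifying assumptions for illustration, not MLlib's exact implementation:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Conceptual driver loop for distributed mini-batch gradient descent (hypothetical helper).
    def miniBatchSGD(data: RDD[LabeledPoint], numFeatures: Int, numIterations: Int,
                     miniBatchFraction: Double, stepSize: Double): Array[Double] = {
      val sc = data.sparkContext
      var weights = new Array[Double](numFeatures)         // initialized weights
      for (i <- 1 to numIterations) {                      // step 1: iteration budget
        val bcWeights = sc.broadcast(weights)              // step 2: broadcast weights
        val (gradSum, count) = data
          .sample(false, miniBatchFraction, 42 + i)        // step 3: sample a mini-batch
          .treeAggregate((new Array[Double](numFeatures), 0L))(
            { case ((grad, n), p) =>                       // step 4: per-record gradient
              var margin = 0.0
              for (j <- 0 until numFeatures) margin += bcWeights.value(j) * p.features(j)
              val err = margin - p.label                   // least-squares residual
              for (j <- 0 until numFeatures) grad(j) += err * p.features(j)
              (grad, n + 1)
            },
            { case ((g1, n1), (g2, n2)) =>                 // step 5: merge partial gradients
              for (j <- 0 until numFeatures) g1(j) += g2(j)
              (g1, n1 + n2)
            })
        if (count > 0)                                     // step 6: driver updates weights
          weights = weights.zip(gradSum).map { case (w, g) => w - stepSize * g / count }
      }
      weights
    }

treeAggregate merges the partial gradients in a tree pattern across executors instead of sending every partition's result straight to the driver, which is why MLlib prefers it over plain aggregate for wide gradients.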


