https://spark.apache.org/docs/latest/mllib-guide.html
Algorithms:
Classification & Regression
- Linear Regression
- Decision Trees
- Naive Bayes
- Logistic Regression
- SVM
Clustering
- k-Means
- Gaussian Mixture
- Power Iteration Clustering
- Latent Dirichlet Allocation
Backbone of Spark MLlib:
- RDD - aggregate, treeAggregate
- Optimizer
- Stochastic Gradient Descent
- L-BFGS
- Gradient
- Logistic
- Least Squares
- Hinge
- Updater/Regularizer
- Squared L2
- L1
- Simple
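The Gradient and Updater pieces above are small, composable functions: a gradient computes the per-example loss gradient for a linear model, and an updater applies the step plus regularization. A plain-Python sketch of those roles (these function names are illustrative, not the MLlib API):

```python
import math

# Per-example gradients for a linear model w.x, sketching the roles of
# MLlib's Gradient classes (Logistic, Least Squares, Hinge) and the
# Squared-L2 Updater. Plain Python for illustration, not Spark code.

def logistic_gradient(w, x, y):
    # gradient of log-loss, label y in {0, 1}: (sigmoid(w.x) - y) * x
    margin = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-margin))
    return [(p - y) * xi for xi in x]

def least_squares_gradient(w, x, y):
    # gradient of 0.5 * (w.x - y)^2:  (w.x - y) * x
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def hinge_gradient(w, x, y):
    # subgradient of max(0, 1 - y * w.x), label y in {-1, +1}
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1.0:
        return [-y * xi for xi in x]
    return [0.0] * len(x)

def l2_updater(w, grad, step, reg):
    # Squared-L2 regularizer: shrink the weights, then take a gradient step
    return [(1.0 - step * reg) * wi - step * gi for wi, gi in zip(w, grad)]
```

Keeping loss gradients separate from the updater is what lets MLlib mix and match, e.g. hinge loss with L2 regularization gives a linear SVM.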
General Usage & Architecture :
Linear Models:
Usage
- Prepare the data and turn it into "LabeledPoint" instances, each of which groups the target/output label and the input feature vector together
- Call the "train" method of the respective algorithm's singleton object, passing the data along with the algorithm-specific parameters
- That returns the model you have trained
- Your model is now ready for use: call the "predict" method of the returned model object, passing it a new feature vector, and it guesses the output
Data -> Cleaning -> LabeledPoint -> Algorithm.train -> AlgorithmModel -> predict -> Guessed target
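The pipeline above can be sketched end to end. The names below (LabeledPoint, train, predict) mirror the MLlib pattern, but this is a toy single-machine least-squares trainer written in plain Python, not Spark:

```python
from collections import namedtuple

# Toy version of the Data -> LabeledPoint -> train -> Model -> predict flow.
# Not the MLlib API; a plain-Python sketch of the same usage pattern.

LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

class LinearModel:
    def __init__(self, weights):
        self.weights = weights

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

def train(points, iterations=100, step=0.1):
    n = len(points[0].features)
    w = [0.0] * n
    for _ in range(iterations):
        # full-batch gradient of the mean squared error
        grad = [0.0] * n
        for p in points:
            err = sum(wi * xi for wi, xi in zip(w, p.features)) - p.label
            for j, xj in enumerate(p.features):
                grad[j] += err * xj
        w = [wi - step * gi / len(points) for wi, gi in zip(w, grad)]
    return LinearModel(w)

# data follows y = 2x, so the model should learn a weight close to 2
data = [LabeledPoint(2.0, [1.0]), LabeledPoint(4.0, [2.0])]
model = train(data)
prediction = model.predict([3.0])  # close to 6.0
```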
Implementation
- Repeat the steps below until the weights converge or the given number of iterations is reached
- The driver broadcasts the initialized "weights" to each worker
- Each task (running on the workers) samples records from its data partition
- Each worker computes the gradient for its next batch of B records from the training set
- The partial gradients are aggregated; currently "RDD.treeAggregate" is used for this
- The driver updates the weights
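The driver/worker loop above can be simulated in a single process. The sketch below is plain Python, not Spark: the list-of-lists stands in for an RDD's partitions, the plain sum stands in for treeAggregate, and the function names are illustrative.

```python
import random

# Single-process simulation of distributed mini-batch SGD: the driver
# "broadcasts" the weights, each "task" samples records from its
# partition and computes a partial gradient, the partials are
# aggregated (MLlib uses RDD.treeAggregate here), and the driver
# updates the weights. A sketch of the pattern, not Spark code.

def partial_gradient(weights, partition, fraction, rng):
    # each task samples a fraction of its partition (the mini-batch)
    sample = [r for r in partition if rng.random() < fraction]
    grad = [0.0] * len(weights)
    for features, label in sample:
        err = sum(w * x for w, x in zip(weights, features)) - label
        for j, xj in enumerate(features):
            grad[j] += err * xj
    return grad, len(sample)

def run_sgd(partitions, iterations=200, step=0.1, fraction=1.0, seed=0):
    rng = random.Random(seed)
    weights = [0.0] * len(partitions[0][0][0])  # driver initializes weights
    for _ in range(iterations):
        # driver broadcasts weights; each partition computes a partial gradient
        partials = [partial_gradient(weights, p, fraction, rng)
                    for p in partitions]
        # aggregate the partials (stand-in for RDD.treeAggregate)
        total = [sum(g[j] for g, _ in partials) for j in range(len(weights))]
        count = sum(n for _, n in partials) or 1
        # driver updates the weights with the averaged gradient
        weights = [w - step * g / count for w, g in zip(weights, total)]
    return weights

# two "partitions" of (features, label) records following y = 3x
parts = [[([1.0], 3.0), ([2.0], 6.0)], [([3.0], 9.0)]]
w = run_sgd(parts)  # w[0] converges to about 3.0
```

treeAggregate matters at scale because it combines partial gradients in a multi-level tree instead of sending every partition's gradient straight to the driver, which keeps the driver from becoming a bottleneck.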