*What is Machine Learning?…*

Machine learning is a **subfield of artificial intelligence.** Its main objective is to create systems that can learn automatically, i.e. systems that can discover complex patterns buried in large data sets on their own, **without the need for human intervention**.

*Machine Learning algorithms*

Machine learning has generated a lot of excitement in the tech world as it has proved itself applicable across a wide range of use cases. The technology includes a variety of methods, each suited to answering different business questions. These are the Machine Learning algorithms, which we can divide into **two groups**: supervised learning and unsupervised learning.

*Supervised learning*

In supervised learning, the machine is **taught by example**. The learned model can be thought of as a function whose inputs are the characteristics being analyzed and whose output is the variable to be predicted. The output is numerical in regression problems and categorical in classification problems. The training data is already labeled, which allows the algorithm to use the positions of the data points to infer a relationship between the variables. The name casts the data scientist as a teacher who guides the algorithm toward the conclusions it should draw: the algorithm **applies what has been learned from historical data**, so it knows it is looking for relationships between predefined parameters.

*Supervised learning methods*

Now that we have separated these algorithms into groups based on how they work and whether their training data is labeled, we can look at the two methods of supervised learning, which differ in the format of their outputs: **regression and classification**. Regression fits the data, and classification separates the data.

*Regression*

Regression is the Machine Learning technique that aims to predict a numerical output value. It is useful for tasks such as predicting the price of a product or a property, or the value of a stock, to name a few. It is useful to companies for **predicting outputs that are continuous**, i.e. the answer to a business’s question is a quantity that varies freely with the inputs of the model and is not confined to a set of potential labels. As with classification models, there are several regression algorithms, including simple linear regression (Scala, Python, Spark-Python), multiple linear regression, logistic regression (Python), neural networks (R with Keras), ANOVA and MANOVA.
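To make the idea of a continuous output concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares, using only the Python standard library. The function names and the floor-area and price figures are hypothetical and chosen purely for illustration; in practice a library would normally do the fitting.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized,
    # since the common factor cancels in the ratio).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: floor area in square metres -> price in thousands.
areas = [50, 60, 80, 100, 120]
prices = [150, 180, 240, 300, 360]

model = fit_simple_linear_regression(areas, prices)
predicted_price = predict(model, 90)  # a continuous value, not a label
```

The prediction for an unseen area is a number on a continuous scale, which is what distinguishes regression from classification.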

*Classification*

Classification is the technique used when the desired output is a **discrete label**, and it is therefore useful for businesses whose question or problem has a **finite set of possible outcomes.** This type of model draws conclusions from observed values: given one or more inputs, it tries to predict the value of one or more outcomes. An example of this would be filtering emails to see if they are spam or not. Another is fraud detection, where there are only two possible outcomes: when we analyze transaction data, each transaction falls into one of two categories, either “fraudulent” or “authorized”. There are a number of classification algorithms, including the Naive Bayes classifier, logistic regression, discriminant analysis, AdaBoost, decision trees (Spark-Python), random forest, bagging and SVM.
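As an illustration of the spam example, the following is a small sketch of a Naive Bayes text classifier with add-one (Laplace) smoothing, written with the standard library only. The messages, labels and function names are invented for the example, and words never seen in training are simply skipped, which is one simplifying assumption among several possible ones.

```python
import math
from collections import Counter


def train_naive_bayes(messages, labels):
    """Count how often each word appears under each label."""
    word_counts = {label: Counter() for label in set(labels)}
    label_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_messages)
        words_in_label = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # assumption: ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (words_in_label + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training data.
messages = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(messages, labels)
first_prediction = classify(model, "claim your free money")
second_prediction = classify(model, "team meeting tomorrow")
```

Whatever the input, the output is always one of the labels seen during training, here "spam" or "ham".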

*Unsupervised learning*

Unsupervised machine learning algorithms infer patterns from a data set **without labeled outcomes**, unlike supervised learning. With this technique we don’t know what the values of the output data or outcomes should be: the data is unlabeled, and the algorithm acts on the information without guidance. As with supervised learning, there are two main methods.

*Cluster analysis*

Clustering is a method of unsupervised learning. It is a **common technique for statistical data analysis** which draws inferences from datasets consisting of input data without labeled responses. Its algorithms include simple and multiple correspondence analysis, multidimensional scaling, and hierarchical and partitioning cluster analysis (R, Spark-Python).
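One well-known partitioning method is k-means, sketched below in plain Python. This is an illustrative example rather than the specific implementation the article has in mind: the points, the starting centroids and the function names are all assumptions, and the starting centroids are passed in explicitly to keep the run deterministic.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def k_means(points, initial_centroids, max_iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    cluster_ids = [nearest_centroid(point, centroids) for point in points]
    return centroids, cluster_ids


# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, cluster_ids = k_means(points, initial_centroids=[(1, 1), (8, 8)])
```

No labels are supplied at any point: the grouping comes only from the distances between the points.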

*Reduction of dimensionality*

Dimensionality reduction is the **process of reducing the number of random variables by obtaining a set of principal variables**. It can be divided into feature selection and feature extraction. Its algorithms include principal component analysis and factor analysis.
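As a rough sketch of feature extraction, the code below finds the first principal component of a small data set using power iteration on the covariance matrix, then projects the two-dimensional rows onto that single direction. The data and function names are hypothetical, and power iteration is just one simple way to obtain the leading component; libraries typically use a full eigendecomposition or SVD instead.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector of the direction of greatest variance)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize; the vector converges to the leading eigenvector.
    vector = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    return means, vector


def project(rows, means, component):
    """Reduce each row to one coordinate along the given component."""
    return [
        sum((row[j] - means[j]) * component[j] for j in range(len(component)))
        for row in rows
    ]


# Hypothetical data in which the second feature is exactly twice the first,
# so a single component captures all of the variance.
rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
means, component = first_principal_component(rows)
reduced = project(rows, means, component)  # one value per row instead of two
```

Each row is now described by one number instead of two, which is the sense in which the number of variables has been reduced.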

Written by our Data Scientist expert, Diego Calvo. Check out his blog here.