Affinity analysis, classification, and clustering are all data analysis methods that rely on machine learning algorithms. In this article, we present three of them and explain how they work.
Big Data projects share a prime motivation: the possibility for a company to set up an advanced analytics program that improves operations or opens up new marketing and commercial opportunities. To that end, machine learning algorithms and automation have arrived on companies' radar, precisely to support them in their analysis of big data.
It should be noted that this interest in Machine Learning techniques is closely tied to their integration into analytics tools, which in turn eases their adoption and use. It is easy to get carried away by enthusiasm without really understanding what these algorithms do, why they were designed, or which methods they apply.
Here are three examples of Machine Learning methods, along with what is at stake and how each one works.
1 – Classification puts data in its place
Let’s start with classification. Its primary purpose is to formulate a prediction by creating separate classes in a dataset. Typical uses of this method include spam detection and risk analysis in healthcare. For the first, after scanning the text of an email and tagging certain words and phrases, the message’s “signature” can be fed to a classification algorithm to determine whether or not it is spam. In the second case, a patient’s vital statistics, health history, activity levels, and demographics can be cross-referenced to produce a risk score and assess the likelihood of illness.
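As a minimal sketch of the spam example, the snippet below trains a naive Bayes classifier with scikit-learn; the tiny corpus and its labels are invented purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy corpus: each message is labeled 1 (spam) or 0 (legitimate).
messages = [
    "win a free prize now",
    "cheap loans click here",
    "meeting rescheduled to monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Turn each message into a word-count vector: its "signature".
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Train the classifier, then score an unseen message.
model = MultinomialNB()
model.fit(X, labels)
print(model.predict(vectorizer.transform(["free prize click here"])))  # [1] -> spam
```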
Another classification method is the decision tree. It resembles a flowchart that provides a hierarchical sequence of tests. Tests are performed on several feature parameters: they can be a yes/no question or involve a larger set of distinct values. At each level of a decision tree, these tests are applied to the data to refine the classification, from the root down to the leaves, where the entities end up separated into different classes.
A decision tree is built using a machine learning algorithm. Starting from a predefined set of classes, the algorithm iteratively searches for the variable that best separates the classified entities. Once that variable is identified and the decision rules are determined, the dataset is split into several groups according to these rules. The analysis then recurses on each subset until all the key classification rules have been identified.
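A minimal sketch of this recursive splitting, using scikit-learn’s DecisionTreeClassifier on the classic Iris dataset (the dataset and depth limit are chosen here only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small labeled dataset: four features, three predefined classes.
iris = load_iris()

# Fit the tree; at each node the algorithm picks the feature split
# that best separates the remaining entities into the classes.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned decision rules as a hierarchical sequence of tests.
print(export_text(tree, feature_names=iris.feature_names))
```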
2 – Clustering gathers data
Clustering is another example of an ML method. The goal of a cluster analysis algorithm is to take a single large pool of entities and form smaller groups that share similar characteristics. For example, a cable TV provider that wants to determine the demographic distribution of its viewers across networks can do so by building clusters from available subscriber data and what those subscribers watch. A restaurant chain can group its customers by geographic location according to the dishes they order, then adjust its menus accordingly.
In general, clustering algorithms examine a defined number of data characteristics and map each data entity to a corresponding point in a multidimensional space. The algorithms then seek to group the elements according to their relative proximity to one another in that space.
A commonly used variant is the k-means clustering algorithm. It divides a set of data entities into k groups, where k is the number of clusters to create. The algorithm refines the assignment of entities to clusters by iteratively computing the mean midpoint, or centroid, of each cluster. The centroids become the focal points of the iterations: their positions are refined at each pass, and the data entities are reassigned to match the new positions. The process repeats until the groupings are optimized and the centroids no longer move.
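A minimal sketch of this loop with scikit-learn’s KMeans, on synthetic two-dimensional points (the data and the choice of k=3 are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D entities: three loose blobs around different centers.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# k=3: iterate centroid computation and reassignment until stable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

print(kmeans.cluster_centers_)   # final centroid positions
print(kmeans.labels_[:10])       # cluster assigned to each entity
```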
3 – Affinity analysis builds relationships
Affinity analysis is another approach to data discovery and analysis where Machine Learning plays a role. It uncovers correlations between data attributes and events. Large retailers, for example, use it for market basket analysis: identifying the items bought in the same basket. An e-commerce player could use these results to guide the placement of products on its site.
Cybersecurity also makes use of this approach: network transactions that precede attacks are analyzed to identify patterns, and the correlated events can then feed prescriptive analytics applications that anticipate the same type of attack.
One of the main algorithms used for affinity analysis is Apriori. It looks for correlations, called association rules, among attributes and their values in database transaction records. Like the other algorithms in this article, Apriori works iteratively: at each step, the number of attributes considered increases.
To discover affinities, the algorithm relies on two metrics: “support”, which divides the number of records exhibiting certain properties by the total number of records; and “confidence”, which measures the probability that a record includes one of the defined properties given that it contains the others.
Correlations for which these two measures exceed a given threshold are identified as association rules: for example, “95% of the time, when a consumer buys beer, they also buy chips.” Once an iteration is finished, the records not covered by the association rules are removed, the set of attributes is extended, and the process resumes until no further association is detected.
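A minimal sketch of the two metrics on a handful of invented transactions (pure Python, with made-up baskets; a real project would typically reach for a dedicated library):

```python
# Invented market-basket transactions.
transactions = [
    {"beer", "chips", "bread"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "bread"},
    {"beer", "chips", "diapers"},
]

def support(itemset):
    """Fraction of all transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(basket contains consequent | basket contains antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

beer, chips = {"beer"}, {"chips"}
print(support(beer | chips))      # 0.6  -> {beer, chips} appears in 3 of 5 baskets
print(confidence(beer, chips))    # 0.75 -> 3 of the 4 beer baskets also hold chips
```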
In addition to these examples, many other algorithms can achieve similar results. So don’t stop here: building a comprehensive inventory of available algorithms will help your data scientists and analysts choose the right methods for their analytics applications.