Clustering: Platform for big data analysis and knowledge extraction
MUSCAT: Satish Chander, assistant professor in the Department of Computer Science and Engineering at Waljat College of Applied Sciences, gives a briefing on data clustering.
Advances in modern technology have caused data from the internet, imaging and video surveillance to grow at an alarming rate. One study reports that the volume of data increases by about 281 exabytes every 10 years.
This data is, for the most part, stored on the World Wide Web in electronic digital form. The increase is seen not only in the amount of data but also in its variety (text, image and video). Besides, billions of emails, blogs, transaction records and web pages, amounting to terabytes, are created every day, which increases the volume of datasets further. Understanding this growth is of utmost significance in the world's changing information landscape. Devising techniques that automatically analyse, classify and retrieve such a huge volume of unstructured data can seem highly infeasible.
Plenty of applications depend on the analysis of enormous datasets for their success. Data analysis procedures are categorised as exploratory or confirmatory, depending on what is known about the available data source. Structuring a high-dimensional dataset without any assumptions or pre-specified models constitutes exploratory analysis. In contrast, structuring high-dimensional data under assumptions or pre-specified models constitutes confirmatory analysis. Existing data analysis techniques include linear regression, discriminant analysis, canonical correlation analysis, factor analysis, principal component analysis, multidimensional scaling, cluster analysis and many more.
The study states that a key element, called grouping, is required in any kind of data analysis procedure. Grouping may be based either on a postulated model or on natural groupings (i.e. clustering). Conventional data analysis techniques often overlook useful information in bulk databases and, as a consequence, the potential benefits of increased computational and data-gathering capabilities are only partially realised.
An exploratory study reports that the term ‘data clustering’ first appeared in an article on anthropological data in 1954. Data clustering is one of the data mining techniques; it structures data so that useful and related information can be extracted effectively from the bulk of a data corpus. Data clustering emerged as a field of practice in its current form between World War I and World War II in the discipline of ecology, where scientists attempted to describe the territorial structure of bird species.
Data clustering, also known as cluster analysis, aims to discover the natural groupings of data patterns, points or objects. Cluster analysis can be defined as “a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics”. Data clustering groups high-dimensional data according to a similarity measure, such that objects within the same group are alike and objects belonging to different groups are not. Depending on the data, clusters can differ in shape, size and density. Clustering algorithms are classified into two types, namely hierarchical clustering and partitional clustering. Hierarchical clustering operates in either an agglomerative mode or a divisive mode. In the agglomerative mode, every data point starts as a cluster of its own, and the most similar pairs of clusters are merged successively. In the divisive mode, all the data points start in a single cluster, which is then partitioned into smaller clusters. Partitional clustering algorithms, on the other hand, do not impose a hierarchical structure; they find all the clusters at once.
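The difference between the two families can be illustrated with a minimal Python sketch. The article does not name a specific algorithm or similarity measure, so everything here is an assumption chosen for illustration: Euclidean distance as the (dis)similarity measure, single linkage for the agglomerative mode, k-means as the partitional method, and the hypothetical function names `agglomerative` and `kmeans`. It is a teaching example, not an implementation suited to large datasets.

```python
import math


def agglomerative(points, num_clusters):
    """Agglomerative mode (single linkage, assumed for illustration).

    Every point starts as its own cluster; the two closest clusters
    are merged repeatedly until num_clusters remain.
    """
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def kmeans(points, centroids, iterations=10):
    """Partitional clustering (k-means, assumed for illustration).

    No hierarchy is built: every point is assigned to its nearest
    centroid, so all clusters are produced together in each pass.
    """
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return groups


# Two visibly separate groups of 2-D points (made-up sample data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

hier = agglomerative(points, num_clusters=2)
part = kmeans(points, centroids=[points[0], points[3]])
print(hier)
print(part)
```

On this small, well-separated sample both approaches recover the same two groups. A divisive method would reach a similar result from the opposite direction, by starting with all six points in one cluster and splitting it.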
Looking at the range of applications, cluster analysis is widespread in any discipline that involves multivariate data analysis. It is impractical to list exhaustively the scientific fields and applications in which data clustering is used. A few well-known applications of data clustering are image segmentation, document clustering and character recognition. On the application side, data clustering supports: i) storing, organising and integrating massive data, ii) processing and analysing data, and iii) extracting knowledge and insights from data to predict the future.
The development of clustering methodology is a truly interdisciplinary endeavour. The reason may be that many people who rely on real-time data collection and processing, such as social scientists, engineers, computer scientists, medical researchers and taxonomists, need to apply clustering methods. In all, clustering seems to have a growing impact on the environment in which we live and in which systems that acquire and understand knowledge from texts evolve.