Times of Oman

Clustering: Platform for big data analysis and knowledge extraction

MUSCAT: Satish Chander, assistant professor in the Department of Computer Science and Engineering at Waljat College of Applied Sciences, briefs on data clustering.

Modern technology and its advancements have caused data from the internet, imaging and video surveillance to grow at an alarming rate. One study reports an increase of about 281 exabytes of data over every 10-year span.

Most of this data is stored on the World Wide Web in electronic digital form. The increase is seen not only in the amount of data but also in its variety (text, image and video). In addition, billions of emails, blogs, transaction records and web pages, amounting to terabytes, are created every day, which increases the volume of datasets still further. Understanding this growth is of utmost significance in the world's changing information scenario. Devising techniques that automatically analyse, classify and retrieve such a huge body of unstructured data seems highly infeasible.

Many applications depend on the analysis of enormous datasets for their success. A data analysis procedure is categorised as exploratory or confirmatory, depending on whether a model of the data source is available. Exploratory analysis structures a high-dimensional dataset without any assumptions or pre-specified models. Confirmatory analysis, in contrast, structures high-dimensional data with assumptions or pre-specified models. Existing data analysis techniques include linear regression, discriminant analysis, canonical correlation analysis, factor analysis, principal component analysis, multidimensional scaling, cluster analysis and many more.

The study states that a key element called grouping, which may be based either on a postulated model or on natural groupings (i.e. clustering), is essential in either kind of data analysis procedure. Conventional data analysis techniques often overlook useful information in bulk databases and, as a consequence, the potential benefits of increased computational and data-gathering capabilities are only partially realised.

An exploratory study notes that the term ‘data clustering’ first appeared in 1954, in an article dealing with anthropological data. Data clustering is one of the data mining techniques; it structures data so that useful and related information can be extracted effectively from the bulk of a data corpus. Data clustering emerged as a field of practice in its current form between World War I and World War II in the discipline of ecology, where scientists attempted to address the territorial structure of bird species.

Data clustering, also known as cluster analysis, aims to discover the natural groupings of data patterns, points or objects. Cluster analysis can be defined as “a statistical classification technique for discovering whether the individuals of a population fall into different groups by making quantitative comparisons of multiple characteristics”. Data clustering organises high-dimensional data according to a similarity measure, such that objects within the same group are alike and objects belonging to different groups are not. Depending on the data, clusters can differ in shape, size and density.

A study on clustering algorithms classifies them into two types, namely hierarchical clustering and partitional clustering. Hierarchical clustering works in either an agglomerative mode or a divisive mode. In the agglomerative mode, every single data point starts as its own cluster, and the most similar pairs of clusters are merged successively. The divisive mode differs in that all the data points are initially placed in a single group, which is then partitioned into smaller clusters. Partitional clustering algorithms, on the other hand, do not impose a hierarchical structure. Instead, they find all the clusters in one shot.
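To make the two families concrete, the following is a minimal Python sketch and not the method of any particular study. It assumes Euclidean distance as the similarity measure, single linkage for the agglomerative merges, and k-means as the example of a partitional algorithm; the function names, the farthest-point seeding and the sample points are all illustrative choices that do not come from the article. The divisive mode is not shown.

```python
import math


def euclidean(a, b):
    """Assumed similarity measure: smaller distance means more alike."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Hierarchical, agglomerative mode: start with one cluster per point
    and repeatedly merge the closest pair (single linkage, an assumption)."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters


def kmeans(points, k, max_iterations=50):
    """Partitional: no hierarchy, all k clusters are found together.
    Seeding uses a simple farthest-point heuristic (an illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        )
    groups = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:  # converged: centroids stopped moving
            break
        centroids = updated
    return groups


# Hypothetical sample data: two well-separated groups of 2-D points.
sample = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
print(agglomerative(sample, 2))
print(kmeans(sample, 2))
```

On data this clearly separated, both approaches recover the same two groups. The practical difference is that the agglomerative routine builds its answer through a sequence of merges that could be stopped at any level, whereas k-means needs the number of clusters up front and produces a single flat partition.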

In terms of its scale of application, cluster analysis is widespread in any discipline that involves multivariate data analysis. It is impractical to list exhaustively the scientific fields and applications in which data clustering can be applied. A few well-known applications include image segmentation, document clustering and character recognition. On the application side, data clustering supports: i) storing, organising and integrating massive data; ii) processing and analysing data; and iii) extracting knowledge and insights from data to predict the future.

The development of clustering methodology is a truly interdisciplinary endeavour. The reason may be that people who rely on real-time data collection and processing, such as social scientists, engineers, computer scientists, medical researchers and taxonomists, all need to apply clustering. In all, clustering seems to have a greater impact on the environment in which we live and in which systems that acquire and understand knowledge from text evolve.
