Friday, November 13, 2015

Every road leads to Data

Data by itself is quite useful. Data in context with other data is even more useful. And data that answers our questions is invaluable. When we try to operationalize data, there are four things that need to be considered. They are as follows:

  • Data Engineering
  • Data Integration
  • Data Quality
  • Data Security

Data Engineering

This starts from the acquisition of data and goes all the way to deployment. Data that is acquired or already available has to be made fit for the models that will run over it, which means it has to be cleansed before it is used. The standard procedure is ETL (Extract, Transform, Load): extract the data under consideration, transform it so that it becomes usable, and finally load it into the setup. Data scientists run different models against this setup and arrive at some conclusion, typically an insight of the kind a data analyst would derive. Unfortunately, a deluge of data can easily force the data scientist to spend more time cleaning the data than focusing on their area of work; typically 80% of the time is spent just making the data usable. And then there are many sources of data, each with a different profile.
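To make the ETL flow concrete, here is a minimal sketch in Python. It is only an illustration under assumed inputs: the file name sales.csv, the customer_id and amount columns, and the cleansing rules are all made up for the example.

    import csv
    import sqlite3

    def extract(path):
        """Extract: read raw records from a CSV source."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        """Transform: cleanse the raw rows so they become usable.
        Here we drop rows missing an ID, trim whitespace and coerce
        the amount to a number -- typical cleansing steps."""
        clean = []
        for row in rows:
            if not row.get("customer_id"):
                continue  # unusable record, discard it
            clean.append({
                "customer_id": row["customer_id"].strip(),
                "amount": float(row.get("amount") or 0),
            })
        return clean

    def load(rows, db_path="warehouse.db"):
        """Load: write the cleansed rows into the analytical setup."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
        con.executemany("INSERT INTO sales VALUES (:customer_id, :amount)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("sales.csv")))  # the whole ETL pipeline in one line

Each stage is a plain function, so the pipeline composes as load(transform(extract(...))); in a real setup, of course, every stage is far more elaborate.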

Data Integration

It is essential to understand that data comes to us from various channels: from company systems, from individual machines and from the external world. All this data has to be accommodated in the setup. Mind you, data can be structured, unstructured or semi-structured. Collecting and analysing all of it is not an easy task, and the more data you collect, the more ETL programs you have to run. Duplication of data should be avoided. For this we need MDM (Master Data Management) – a place where each piece of data is unique and addressable by a key value. Consider your customer: in the system, every customer should have exactly one unique ID, from which all his or her transactional and master data can be pulled up. Data integration also involves the creation of metadata (data about data).
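As a rough sketch of the master-data idea, here is a toy MDM registry in Python. The matching rule (a normalised e-mail address) and the field names are assumptions chosen just to show how two channels resolve to one unique customer ID.

    import uuid

    class MasterDataRegistry:
        """A toy MDM registry: one golden record per customer,
        addressable by a single unique ID."""

        def __init__(self):
            self._by_email = {}   # matching key -> master ID
            self._records = {}    # master ID -> golden record

        def register(self, name, email):
            key = email.strip().lower()        # normalise the matching key
            if key in self._by_email:          # duplicate: reuse the existing ID
                return self._by_email[key]
            master_id = str(uuid.uuid4())      # brand new customer, mint one ID
            self._by_email[key] = master_id
            self._records[master_id] = {"id": master_id, "name": name, "email": key}
            return master_id

        def lookup(self, master_id):
            return self._records[master_id]

    registry = MasterDataRegistry()
    a = registry.register("Asha Rao", "asha@example.com")
    b = registry.register("A. Rao", " ASHA@example.com ")  # same person, other channel
    assert a == b  # both channels resolve to the one unique customer ID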

Data Quality

Quality determines whether the data gleaned from multiple sources can yield trustworthy insights. Quality is often associated with veracity or accuracy. Is the data you are working with fresh or stale? Is it duplicated? Does it support a single version of the truth? Is it timely, and is it getting continuously updated? How many data sources are being looked into? That last question matters because it drives the number of ETL processes. Only authentic data should be collected, and the best place to look for it is your own operational data. If you read a lot of external data, say from the internet, journals or other sources, keep in mind that its quality may be suspect. And finally, how do you merge all this data with the rest? Bad data is like a rotten apple on an apple tree: the whole tree may get infected. So before you introduce an unknown data source, beware!
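These questions translate naturally into automated checks. The following is a small Python sketch; the 30-day freshness threshold, the key column and the sample records are invented for illustration.

    from datetime import datetime, timedelta, timezone

    # Hypothetical records pulled from two sources; the fields are assumptions.
    records = [
        {"customer_id": "C1", "updated_at": datetime(2015, 11, 12, tzinfo=timezone.utc)},
        {"customer_id": "C2", "updated_at": datetime(2015, 6, 1, tzinfo=timezone.utc)},
        {"customer_id": "C1", "updated_at": datetime(2015, 11, 12, tzinfo=timezone.utc)},
    ]

    def quality_report(rows, key="customer_id", max_age_days=30):
        """Answer the basic questions: is the data fresh? duplicated?"""
        now = datetime(2015, 11, 13, tzinfo=timezone.utc)  # 'today' for the example
        stale = [r for r in rows
                 if now - r["updated_at"] > timedelta(days=max_age_days)]
        seen, duplicates = set(), []
        for r in rows:
            if r[key] in seen:
                duplicates.append(r[key])  # same key twice: no single version of truth
            seen.add(r[key])
        return {"total": len(rows), "stale": len(stale), "duplicates": duplicates}

    print(quality_report(records))
    # -> {'total': 3, 'stale': 1, 'duplicates': ['C1']}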

Data Security

Is your data safe? What access control mechanisms ensure its safety? Who can access your data lake, and what actions can they take? For example, do people have delete rights on the data? Can somebody modify it? Who can run ETL jobs? Structured systems are much less prone to hacking than unstructured ones. You have to be very careful when you set up users, groups, etc. with access control to the data. This applies not just at the database level but very much at the analytical tool somebody is running. You must configure the tool so that people get access to insights based on what they need; for example, some visualizations may not be available to some users. The idea is to give users access to the data they need rather than throwing the doors open.
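One simple way to reason about this is role-based access control: map roles to permissions and guard every action. The sketch below is a hypothetical Python illustration, not any real tool's API; the roles and permission names are assumptions.

    # Hypothetical role-to-permission mapping for the example.
    ROLE_PERMISSIONS = {
        "analyst":  {"read", "run_etl"},
        "engineer": {"read", "run_etl", "modify"},
        "admin":    {"read", "run_etl", "modify", "delete"},
    }

    def require(permission):
        """Decorator: allow the action only if the user's role grants it."""
        def wrap(action):
            def guarded(user, *args, **kwargs):
                if permission not in ROLE_PERMISSIONS.get(user["role"], set()):
                    raise PermissionError(
                        f"{user['name']} ({user['role']}) may not {permission}")
                return action(user, *args, **kwargs)
            return guarded
        return wrap

    @require("delete")
    def delete_dataset(user, name):
        print(f"{user['name']} deleted {name}")

    @require("run_etl")
    def run_etl_job(user, job):
        print(f"{user['name']} ran ETL job {job}")

    analyst = {"name": "Ravi", "role": "analyst"}
    run_etl_job(analyst, "nightly-load")      # allowed for an analyst
    try:
        delete_dataset(analyst, "sales")      # not allowed for an analyst
    except PermissionError as err:
        print(err)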

‘Trust in God, but bring data to the table’ is an old cliché. Data is very important in today’s world. Data-fying companies like Google have an intrinsic edge over their competition because they extract value from data. There are 4.4 zettabytes (a zettabyte is a trillion gigabytes) of information in the world today, and only 0.5% of that has been analysed. Imagine the kind of insights we could have if we were to analyse just half of it. Insights are not top-down; they come bottom-up, right from the operational and tactical levels. Correlations are a very useful concept, and we can expect to get at causality as well in the near future. As machines become more and more powerful and AI finds its way into software, we will make leaps of progress towards the final destination. As mounds of experience cover our daily activities, the shortcut to God is evident through this one path – the path to Data.

Best wishes

Guru30
