Dr. Parshotam S. Manhas
We’re entering a new world in which data may be more important than software
Data science has emerged as one of the most popular fields of the 21st century, driven by the rise of artificial intelligence and deep learning. It employs scientific methods, processes, algorithms and systems to extract knowledge and useful insights from structured and unstructured data in various forms. In essence, it unifies statistics, data analysis, machine learning and their related methods in order to analyze actual phenomena with data. Data-driven discovery is regarded as a ‘fourth paradigm’ of science, after the empirical, theoretical and computational paradigms, and everything about science is changing because of the impact of information technology and the explosion of data. Companies employ data scientists to help them gain insights about the market and to improve their products. Data scientists inform decision making and are mainly responsible for handling and analyzing large amounts of data.
Data science makes use of several statistical procedures ranging from data transformations, data modeling, statistical operations to machine learning modeling. Statistics is the primary asset of every data scientist. Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make predictions with minimal human intervention.
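To make the idea of a system that learns from data concrete, here is a minimal sketch in plain Python. It fits a straight line to a handful of points by ordinary least squares and then predicts an unseen value. The data (hours studied versus exam score) and the function names fit_line and predict are invented for illustration only; real projects would normally rely on a dedicated library rather than hand-written code.

```python
# Hypothetical toy data, invented for illustration: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new input."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(hours, scores)    # "learning": estimate parameters from data
print(predict(model, 6.0))         # "prediction": apply them to an unseen input
```

The pattern is the same in far larger systems: parameters are estimated from past observations and then applied to new cases without a person writing the rule by hand.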
Data science is related to data mining, deep learning and big data. Data mining is a subset of data science that involves analyzing large amounts of data to discover patterns and other useful information. Deep learning is an advanced form of ML that requires substantial computational resources and special frameworks such as TensorFlow. Prominent applications include the recommendations on Netflix and virtual assistants like Alexa. Big data is a pool of data characterized by the 3Vs (volume, variety and velocity) that demands innovative information processing for enhanced insight and decision making. Data science plays the role of big brother and encompasses the entire scope of data collection and processing.
Data scientists are responsible for breaking down big data into usable information and for creating software and algorithms that help companies and organizations determine optimal operations. In recent years the Internet of Things (IoT) has grown enormously, and by some estimates about 90 percent of the world’s data has been generated in that period. Over 2.5 exabytes of data are generated every single day, and this figure will rise further as the IoT grows. The data comes from every possible source: sensors used in shopping malls to gather shoppers’ information, posts on social media platforms, digital pictures and videos captured on our phones, and purchase transactions made through e-commerce. Companies are flooded with colossal amounts of data to visualize, analyze and utilize. It is in this context that data science comes into the picture. Data science brings together many skills, including statistics, mathematics, visualization, optimization, programming and business domain knowledge. It helps an organization gauge the effectiveness of a marketing campaign, tap into different demographics, plan new business strategy and launch a product. Regardless of the industry vertical, therefore, data science is likely to play a key role in an organization’s success.
Data science is also attracting a great deal of attention in academia. Schools, colleges and universities have an enormous amount of student data to handle, such as academic records, results, grades, personal interests and cultural interests. Analyzing this data with machine learning algorithms such as random forests, logistic regression, decision trees and support vector machines can help in finding better methods of enhancing student learning. Data science provides a broad platform for research activities and takes current scientific research methodologies to the next level. The vast accumulation of data provides the opportunity to sift out the considerable portion that is useful for a concrete outcome. Scientific data are more structured, which makes it easier to extract knowledge and makes analysis simpler, more precise and more accurate. Many research areas, such as medicine and astrophysics, depend heavily on data science. Special institutes for data science, conferences, workshops, data science journals, etc. will enhance awareness and understanding of this newly emerging discipline among learners.
Data Science Programming Languages and Tools: Data science requires a vast array of tools. Some popular programming languages and tools used to analyze data and generate predictions are as follows:
Python is one of the most dominant languages for data science in the industry today because of its ease of use, flexibility and open-source nature. It has gained rapid popularity and acceptance in the ML community. A number of Python libraries are used in data science, including NumPy, pandas and SciPy.
‘R’ is another very commonly used programming language in data science. It has a thriving and incredibly supportive community and it comes with a plethora of packages and libraries that support most machine learning tasks.
Julia is a high-level, high-performance, dynamic programming language well suited to numerical analysis and computational science. It is being touted by some as a possible successor to Python.
SQL and NoSQL are the two most common families of database technology. SQL is the query language of relational databases, which were the market-dominant players for a number of years before NoSQL (non-relational) databases emerged. Examples of SQL databases are Oracle, MySQL and SQLite, whereas popular NoSQL databases include MongoDB, Cassandra, etc. NoSQL databases are seeing huge adoption because of their ability to scale and to handle dynamic data.
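As a small illustration of what querying a SQL database looks like, the sketch below uses SQLite (one of the databases named above) through Python’s built-in sqlite3 module. The purchases table, its columns and the sample rows are hypothetical, made up purely for this example.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of e-commerce purchases, invented for illustration.
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("asha", 120.0), ("ravi", 80.0), ("asha", 30.0)],
)

# A typical analytical query: total spend per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM purchases "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

The query states what result is wanted (a total per customer) and leaves the database to work out how to compute it, which is a large part of why SQL has remained a staple of data analysis.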
SAS (Statistical Analysis System) is a tool designed specifically for statistical operations that turn data into insights. It is closed-source proprietary software used by large organizations to analyze data. SAS is used to mine, alter, manage and retrieve data from a variety of sources. Paired with SQL, SAS becomes an extremely efficient tool for data access and analysis.
Apache Spark, or simply Spark, is an open-source unified analytics engine for big data processing and machine learning. Spark is designed to handle both batch and stream processing. It offers APIs that are programmable in Scala, Python, Java and R. Spark integrates efficiently with cluster management systems, which allows it to process applications at high speed.
BigML provides a fully interactive, cloud-based GUI environment that can be used for running ML algorithms. It offers standardized software, delivered through cloud computing, for industry requirements and specializes in predictive modelling. It supports a wide variety of ML tasks such as clustering, classification, time-series forecasting, etc. It allows interactive visualization of data and provides the ability to export visual charts to your mobile or IoT devices.
DataRobot is a platform for automated machine learning aimed at the AI-driven enterprise. It can be used by data scientists, executives, software engineers and IT professionals for easy model deployment and optimization. It allows parallel processing.
RapidMiner provides an integrated environment for data preparation, machine learning, deep learning, text mining and predictive analytics. It is used for a wide range of purposes including commercial applications, research, education, training and application development, and it supports every step of the machine learning process.
MATLAB is a multi-paradigm numerical computing environment widely used in academia and research. It is closed-source software that facilitates matrix functions, algorithmic implementation and statistical modeling of data. MATLAB adapts well to data science and offers toolboxes to access and preprocess data, build machine learning and predictive models, and deploy models to enterprise IT systems or the cloud. MATLAB is not as popular as Python and R in data science, perhaps because of its high cost, but many people still consider it for learning data science.
Tableau is one of the most popular data visualization tools used by data science and business intelligence (BI) professionals today. It is packed with powerful graphics for making interactive visualizations. Tableau also targets industry through its BI solutions. Its most important feature is the ability to interface with databases, spreadsheets and OLAP cubes. In addition, Tableau can visualize geographical data and maps.
Jupyter is a popular open-source web application for writing live code, visualizations and presentations in an interactive environment. Jupyter supports multiple languages such as Julia, Python and R. Jupyter Notebooks can be used to perform data cleaning, statistical computation and visualization, and to create predictive machine learning models.
TensorFlow is an open-source library that provides an interface and framework for working with neural networks and deep learning. TensorFlow can run on both CPUs and GPUs and, more recently, on the more powerful TPU platforms. This has given impetus to the processing power available to advanced machine learning algorithms. Owing to its high processing ability, TensorFlow has a variety of applications such as speech recognition, image classification, perception, drug discovery, prediction and creation, etc.
Apache Hadoop is an open-source framework that can manage data processing over large distributed systems and store very large volumes of data. It provides distributed computing on massive data sets over a large cluster of computers. It is used for high-level computations and data processing.
Data science, in a nutshell, is a multi-disciplinary field comprising mainly data inference, algorithm development and technology, used to solve analytically complex problems with practical knowledge of computational science, statistics and mathematics. New technologies are emerging to deal with massive amounts of data in every field, and the benefits are percolating to almost every aspect of life that generates data, from health care to telecommunication. At the same time, the data should be handled cautiously to ensure that the respondents’ information is not exploited. In the near future data science will explore many frontiers that help people improve their lives in every aspect.
(The author is Associate Professor of Physics at GDC, Samba and can be reached at firstname.lastname@example.org)