The Basics of Data Analytics
To become a professional data scientist working in data mining or business intelligence, you have to understand the fundamentals of data analytics. The goal of this article is to help you firm up the key concepts in data analytics.
By the end of the article, you should be able to describe the different types of analytics, common terminology used in analytics, the tools and basic prerequisites for analytics, and the data analytics workflow. Without further ado, let’s dive in to explore the basics of data analytics.
Types of analytics
Raw data is not much different from crude oil. These days, any person or institution with a moderate budget can collect large volumes of raw data, but collection in itself shouldn’t be the end goal. Organizations that can extract meaning from the raw data they collect are the ones that can compete in today’s complex and unpredictable environment.
At the core of any data refinement process sits what is commonly referred to as “analytics”. But different people use the word “analytics” to mean different things. If you’re in marketing and would like to understand data analytics, you should first understand its different types. The main types of analytics are:
· Descriptive
· Diagnostic
· Prescriptive
· Exploratory
· Predictive
· Mechanistic
· Causal
· Inferential
While most types of analytics provide general insights on a subject, prescriptive analytics gives you a “laser-like” focus to answer precise questions. For instance, in the healthcare industry, you can use prescriptive analytics to manage a patient population by measuring the number of patients who are clinically obese.
Prescriptive analytics then allows you to add filters to the obesity data, such as obesity with diabetes and obesity with high cholesterol, to find out where treatment should be focused.
Causal analytics allows data scientists to figure out what is likely to happen if one variable is changed. With this approach, you normally rely on randomized studies to determine what’s likely to happen next, although you can also use non-random (observational) studies to infer causation. Causal analytics is appropriate when you’re dealing with large volumes of data.
Inferential analytics, by contrast, takes different theories about the world into account to determine certain aspects of a large population. When you use inferential analytics, you take a smaller sample of information from the population and use it as the basis for inferring parameters about the larger population.
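To make the idea concrete, here is a minimal Python sketch (the sample values are invented for illustration) that takes a small sample and infers an interval estimate for the population mean:

    import math
    import statistics

    # A small sample assumed to be drawn at random from a much larger
    # population (the values are hypothetical).
    sample = [102, 98, 110, 95, 105, 99, 101, 97, 104, 100]

    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)   # sample standard deviation (divides by n - 1)
    sem = sd / math.sqrt(n)         # standard error of the mean

    # 95% confidence interval using the normal approximation (z = 1.96).
    low, high = mean - 1.96 * sem, mean + 1.96 * sem
    print(f"Estimated population mean: {mean:.1f} (95% CI: {low:.1f} to {high:.1f})")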
Common terminologies used in data analytics
As you plan to begin using data analytics to improve your bottom line, there are terms you must learn. Below is a list of common terminology and meanings:
· Anonymization. The severing of links between people in a database and their records to prevent the discovery of the source of the records.
· Behavioral analytics. It involves using data about people’s behavior to infer their
intent and predict their future actions.
· Big Data Scientist. A professional who can develop algorithms that make sense of big data.
· Data exhaust. The by-product data created by a person’s use of a database system.
· Data feed. A means for any person to receive a stream of data such as RSS.
· Data governance. A set of processes that promotes the integrity of the data
stored in a database system.
· Data integration. The act of combining data from diverse and disparate sources
and presenting it in a single coherent and unified view.
· Data migration. The process of moving data from one storage location or server
to another while maintaining its format.
· Data mining. The process of deriving patterns or knowledge from large data sets.
· Data science. A discipline that incorporates statistics, data visualization, machine learning, computer programming and data mining to solve complex problems in organizations.
· Machine learning. Using algorithms to allow computers to analyze data for the
purpose of extracting information to take specific actions based on specific
events or patterns.
· Quartiles. The lower quartile (Q1) is the value below which the bottom 25% of sampled data lies; the upper quartile (Q3) is the value above which the top 25% of sampled data lies.
· Representative. The extent to which sampled data accurately reflect the characteristics of the selected population in an experiment.
· Sample. A subset (n) that is selected from the entire population (N).
· The standard error of the mean. A measure of the accuracy of the sample mean as an estimate of the entire population mean (the short sketch after this list shows how to compute it).
· Independent variable. The variable that determines the values of another, dependent (response) variable. For instance, blood pressure can be deemed to change in response to changes in age.
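To make the quartile and standard-error entries above concrete, here is a minimal Python sketch using NumPy; the data values are invented for illustration:

    import numpy as np

    data = np.array([4, 7, 9, 12, 15, 18, 21, 25, 30, 42])  # hypothetical sample

    q1 = np.percentile(data, 25)                  # 25% of the values lie below Q1
    q3 = np.percentile(data, 75)                  # 25% of the values lie above Q3
    sem = data.std(ddof=1) / np.sqrt(data.size)   # standard error of the mean

    print(f"Q1 = {q1}, Q3 = {q3}, SEM = {sem:.2f}")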
By now, you may be wondering, “Where should I start to become a professional data analyst?”
Well, to become a professional data scientist, here is what you should learn:
· Mathematics
· Excel
· Basic SQL
· Web development
Let’s see how these fields are important in data analytics.
#1: Mathematics
Data analytics is all about numbers. If you relish working with numbers and algebraic functions, then you’ll love data analytics. If you don’t like numbers, you should begin to cultivate a positive attitude and a willingness to learn new ideas. Truth be told, the world of data analytics is fast-paced and unpredictable, so you can’t afford to be complacent. You should be ready to learn the new technologies that keep springing up to deal with changes in data management.
#2: Excel
Excel is the most versatile and common business application for data analytics. While many data scientists graduate with specific functional skills, such as data mining, visualization and statistical applications, almost all of these skills can be practiced in Excel. You can start by learning the basic concepts of Excel such as workbooks, worksheets, the formula bar and the ribbon.
Once you’ve familiarized yourself with these concepts, you can proceed to learn basic formulas such as SUM, AVERAGE, IF, COUNT, VLOOKUP, DATE, MAX, MIN and GETPIVOTDATA. As you become more comfortable with the basic formulas, you can try out more complex formulas for regression and chi-square distributions.
Excel provides you with tools to slice and dice your data. However, it assumes you already have the data stored on your computer. What about data collection and storage? As you’ll learn from seasoned data scientists, the best approach to dealing with data is pulling it directly from its source, and Excel doesn’t provide that functionality.
#3: Basic SQL
This is where SQL comes in: it lets you pull data straight from databases at the source. To fast-track your mastery of SQL, you should understand how the following statements/commands are used:
· Select
· From
· Where
· Group By
· Having
· Order By
Besides mastering the basic SQL commands, you should also understand the reasons behind the use of primary keys, foreign keys and candidate keys in these database management systems.
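To see all six clauses working together, here is a minimal, self-contained Python sketch using the standard-library sqlite3 module; the sales table and its contents are hypothetical:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("East", 120.0), ("East", 80.0), ("West", 200.0),
         ("West", 40.0), ("North", 30.0)],
    )

    # SELECT ... FROM ... WHERE ... GROUP BY ... HAVING ... ORDER BY in one query.
    query = """
        SELECT region, SUM(amount) AS total
        FROM sales
        WHERE amount > 25
        GROUP BY region
        HAVING SUM(amount) >= 100
        ORDER BY total DESC
    """
    for region, total in conn.execute(query):
        print(region, total)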
#4: Web development
I know you’re thinking that web development is an oddball with regard to data analytics. But trust me, mastery of web development will be an added bonus to your data science career. If you want to work for consumer internet companies or for cloud and IoT providers such as IBM, Amazon (AWS) and Microsoft (Azure), you have to be good with web programming tools such as HTML, JavaScript and PHP.
#5: R programming
Everyone who starts the journey of data science faces the common problem of selecting the best programming language. Today, there are several programming languages that can perform data analytics, and each has its own fair share of pros and cons. However, R is a proven language that you can try out.
R is very useful for data analytics because of its versatility, especially in the field of statistics. It is open-source software that provides data scientists with a variety of features for analyzing data:
· It has data-handling and storage facilities, so data analysts can handle their data effectively.
· It has graphical facilities that data analysts can use to display processed data.
#6: Python
Python is another strong option. It has been used to develop popular, scalable web applications such as YouTube. Because of its popularity, Python gives you tools for big data and business analytics in science, engineering and other areas of scalable computing. You can use Python libraries such as pandas and NumPy to help you with data analytics.
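As a quick taste of what pandas and NumPy offer, here is a minimal sketch; the column names and values are made up for the example:

    import numpy as np
    import pandas as pd

    # A tiny, hypothetical data set of customer orders.
    df = pd.DataFrame({
        "customer": ["Ann", "Bob", "Ann", "Cai", "Bob"],
        "order_value": [25.0, 40.0, 15.0, 60.0, 35.0],
    })

    # Slice, aggregate and summarize in a few lines.
    totals = df.groupby("customer")["order_value"].sum()
    print(totals)
    print("Mean order value:", np.mean(df["order_value"]))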
#7: MatLab
MatLab is a very powerful and flexible language that is easy to learn and use, with powerful libraries for data manipulation, management and analysis. Its simple syntax resembles C or C++, so if you have basic skills in those languages, you’ll have no problem with MatLab.
MatLab provides tools for big data and business analytics in science, engineering and other areas of scalable computing, in part because it integrates well with existing independent IT infrastructure systems.
#8: Perl
Perl is a dynamic, high-level programming language that you can use for data analytics. Originally developed by Larry Wall as a scripting language for UNIX, Perl combines UNIX-like features with the flexibility to develop robust and scalable systems.
With the advent of the internet in the 1990s, Perl usage exploded. Besides its dominant role in CGI programming, Perl has also become a key language for data analysis because of its rich set of analysis libraries.
#9: Java
Java and its Java-based frameworks are found deep in the skeletons of virtually all the biggest Silicon Valley tech companies. When you look at Twitter, LinkedIn or Facebook, you’ll find that Java is the backbone programming language of their data engineering infrastructure. While Java doesn’t provide the same data analytics features as Python and R, it stands out when it comes to the performance of systems at large scale.
Java’s speed makes it one of the best languages for developing large-scale systems. While Python is significantly faster than R, Java provides even greater performance than Python. It is for this reason that Twitter, Facebook and LinkedIn have picked Java as the backbone of their systems. However, Java may not be appropriate for statistical modeling.
#10: Julia
Today, most data analysts use R, Java, MatLab and Python for data analysis. However, there is still a gap to be filled, since no single language is a one-stop shop for data analysis needs. Julia is a new programming language that can fill that gap with improved visualizations and libraries for data analysis.
Even though the Julia programming community is still in its infancy, more and more programmers will soon realize its potential in data analysis and adopt it.
The data analytics workflow
A typical data analytics workflow consists of four phases:
· Preparation phase
· Analysis phase
· Reflection phase
· Dissemination phase
Let’s dive in to explore these phases.
#1: Preparation phase
In the preparation phase, you acquire and organize the data you intend to analyze. The data can come from several sources:
· Data files from online repositories such as public websites, for instance the U.S. Census data sets.
· Data files streamed on demand through APIs, for instance the Bloomberg financial data stream.
· Physical apparatus, such as scientific lab equipment attached to computers.
· Data from computer software, such as log files from a web server.
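As a minimal illustration of this phase, the sketch below loads and inspects a raw data file with pandas; the file name is hypothetical:

    import pandas as pd

    # Load a raw data file collected from one of the sources above
    # ("census_sample.csv" is a hypothetical file name).
    df = pd.read_csv("census_sample.csv")

    # A quick first inspection before any analysis begins.
    print(df.shape)         # number of rows and columns
    print(df.head())        # the first few records
    print(df.isna().sum())  # missing values per column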
During the analysis and reflection phases that follow, you process the data and inspect your set of output files, taking notes whether the experiment you’re dealing with is in physical or digital format.
#4: Dissemination phase
Dissemination is the final phase of the data analysis workflow. You can present your results in written reports such as internal memos, PowerPoint presentations or business white papers. If you’re in academia, you can publish an academic paper.
Statistical process
The process of data analysis begins with identifying the population from which you’ll obtain data. Because it’s practically impossible to get data on every subject in the population, you should use an appropriate sampling technique to obtain a sample size that’s representative. Determining that sample size is a four-step process:
· Estimate the expected proportion of the population that you want to study. The proportion must be of interest to the study. If you have an agreed benchmark from a literature review or prior studies, you can use it as the basis for your expected proportion. If in doubt, consult experts in the field to get a sound estimate.
· Determine the confidence interval to use in your analysis. Think of the confidence interval as the “margin of error” around your estimate: every empirical estimate based on a sample carries a degree of uncertainty, and you must specify the desired total width of that interval.
· Set the value of the confidence level. This fixes the precision, or level of uncertainty, of the analysis. A 95% confidence level is typically used; a higher confidence level such as 99% makes the estimate as representative of the population as possible, at the cost of a larger sample.
· Use a statistical table to estimate your sample size. If the required number is too large, you can recalculate with a lower confidence level or a wider interval to arrive at a smaller sample size.
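These four steps correspond to the standard sample-size formula for a proportion, n = z^2 * p * (1 - p) / e^2. Here is a minimal Python sketch of that calculation; the input values are examples only:

    import math

    def sample_size(p, conf_z, margin):
        """Estimate the required sample size for a proportion.

        p      -- expected proportion of the population (step 1)
        conf_z -- z-score for the chosen confidence level (step 3),
                  e.g. 1.96 for 95%
        margin -- desired margin of error, i.e. half the total width of
                  the confidence interval (step 2)
        """
        return math.ceil(conf_z ** 2 * p * (1 - p) / margin ** 2)

    # Example: expect 30% prevalence, 95% confidence, +/-5% margin of error.
    print(sample_size(p=0.30, conf_z=1.96, margin=0.05))  # about 323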
Along the way, you’ll also encounter statistical techniques such as:
· Chi-square distributions
· ANOVA
· ANCOVA
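If you want to experiment with the first two techniques, SciPy provides ready-made functions; the sketch below uses invented data (ANCOVA would need an additional package such as statsmodels):

    from scipy import stats

    # Chi-square goodness-of-fit test on hypothetical observed counts
    # (compared against equal expected counts by default).
    observed = [18, 22, 20, 40]
    chi2, p_chi = stats.chisquare(observed)
    print(f"chi-square: stat={chi2:.2f}, p={p_chi:.3f}")

    # One-way ANOVA across three hypothetical groups.
    g1, g2, g3 = [5, 6, 7, 9], [8, 9, 10, 11], [4, 5, 5, 6]
    f_stat, p_anova = stats.f_oneway(g1, g2, g3)
    print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.3f}")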