Step-by-step, from-the-ground-up guide to becoming an ultra-hireable data analyst.
As you can imagine, data science is one of the most promising disciplines
in our data-driven world. The stakes are high when it comes to addressing
and mitigating our most critical concerns in business, health, security,
sustainable development, living and working conditions, and more.
In this context, we mostly need to know how to think and how to ask the
right questions: how to nurture an intuition about which things are
important and which are not, and how to sense the “question behind the
question” and discover the exact issues driving the need to analyze the data.
Now let us reflect on the post below, from Cheng Han Lee, about the up-front work of learning
and sharpening the necessary skills.
‘’Programming
Programming is an integral aspect of data analysis. It’s the core skill that sets data analysts
apart from business analysts. You’ll need to be able to program well in one or
more programming languages—start with Python or R—and to have a good grasp of the landscape of the most commonly used data
science libraries and packages (such as ggplot2, reshape2, numpy, pandas, and
scipy).
Statistics
What good is all that programming prowess without the ability to interpret
the data? An understanding of statistics, including statistical tests,
distributions, and maximum likelihood estimators, is essential in data
analysis.
Acquaint yourself with both descriptive and inferential
statistics. The former refers to
quantitative measures that describe the properties of a sample; the latter, to
predictive measures that infer properties of the larger population by
interpreting the sample.
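To make the distinction concrete, here is a minimal sketch using only Python's standard library. The sample values are hypothetical, and the confidence interval uses a normal approximation for brevity (an assumption; for a sample this small a t-interval would usually be preferred).

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample: daily order counts observed over ten days.
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

# Inferential statistics: estimate the population mean from the sample.
# Normal approximation for brevity; a t-interval suits small samples better.
z = NormalDist().inv_cdf(0.975)
std_error = stdev / math.sqrt(len(sample))
ci_low, ci_high = mean - z * std_error, mean + z * std_error

print(f"mean={mean}, median={median}, stdev={stdev:.3f}")
print(f"approx. 95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first three numbers describe only the ten observed days; the interval is a statement about the larger population those days were drawn from.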
You’ll need to know the basics, many of which will
sound familiar from high school or college (mean, median, mode; standard
deviation and variance; hypothesis testing), onto which you will layer more
complex statistical skills as well (different types of data distributions:
standard normal, exponential/Poisson, binomial, chi-squared; and tests for
significance: Z-test, t-test, Mann-Whitney U, chi-squared, ANOVA).
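As one illustration of a significance test, the sketch below runs a two-proportion Z-test with the standard library. The function name `two_proportion_z_test` and the conversion counts are hypothetical, chosen only for the example.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 100 of 1000 users converted in group A, 130 of 1000 in B.
z_stat, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference between the two groups is unlikely to be due to chance alone, under the assumptions of the test (independent samples, reasonably large counts).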
Beyond descriptive and inferential stats, data analysts need to be adept at
statistical experimental design. That’s the systematic process of selecting
parameters in order to make results both valid and significant. For example,
you’ll need to determine how many samples to collect, how different factors
should be interwoven, how to choose good control and testing groups, and the
like. To execute strong experimental design using tools like A/B testing and
concepts like statistical power, a useful yardstick is the idea of
“SMART (Specific, Measurable, Actionable, Realistic, Timely) experiments.”
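The question of how many samples to collect can be sketched with a textbook sample-size approximation for comparing two conversion rates. The function name `sample_size_per_group`, the default significance level of 0.05, and the default power of 0.80 are assumptions made for this example, not prescriptions.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect p_control vs. p_variant."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical A/B test: baseline 10% conversion, hoping to detect a lift to 12%.
n = sample_size_per_group(0.10, 0.12)
print(f"roughly {n} users per group")
```

Smaller expected effects push the required sample size up quickly, which is why it helps to decide on the minimum effect worth detecting before the experiment starts.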
Math
The language of data analysts is numbers, so it follows that a strong
foundation in math is an essential building block on the path to becoming a
data analyst.
At a basic level, you should be comfortable with college algebra. You’ll
have to translate what you once knew as “word problems” (real-world
equivalent: business problems) into mathematical expressions; you’ll need to be
able to manipulate algebraic expressions and solve equations; and you’ll need
to be able to graph different types of functions, with a deep understanding of
the relationship between a function’s graph and its equation.
Beyond that, a solid grasp of multivariable calculus and linear algebra
will serve you well as a data analyst. Think: matrix manipulations, dot product,
eigenvalues and eigenvectors, and multivariable derivatives.
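A minimal sketch of these ideas in plain Python follows. The helper names (`dot`, `mat_vec`, `eigenvalues_2x2`, `gradient`) are hypothetical, and in practice a library such as numpy would do this work; the hand-rolled versions are only meant to show what the operations compute.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mat_vec(m, v):
    """Matrix-vector product: one dot product per matrix row."""
    return [dot(row, v) for row in m]

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix with real eigenvalues (characteristic polynomial)."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    disc = math.sqrt(trace ** 2 - 4 * det)
    return (trace + disc) / 2, (trace - disc) / 2

def gradient(f, point, h=1e-5):
    """Numerical gradient via central differences, one coordinate at a time."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(*up) - f(*down)) / (2 * h))
    return grad

A = [[2, 1], [1, 2]]
lam_big, lam_small = eigenvalues_2x2(A)
v = [1, 1]  # an eigenvector of A: multiplying by A only rescales it
print(lam_big, lam_small, mat_vec(A, v))
print(gradient(lambda x, y: x ** 2 + 3 * x * y, [1.0, 2.0]))
```

Multiplying `A` by `v` returns `v` scaled by the larger eigenvalue, which is the defining property of an eigenvector.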
Machine learning
Multivariable calculus and linear algebra, along with statistics, make up the
basic foundation of machine learning, which enables data professionals to make
predictions or calculated suggestions based on huge amounts of data. For a
career as a data analyst, you won’t need to invent new machine-learning
algorithms (advanced skills like that qualify you to become a data scientist), but you should know the most common of them. A few examples include
principal component analysis, neural networks, support vector machines, and
k-means clustering. Note that you may not need to know the theory and
implementation details behind these algorithms, but you should understand the
pros and cons, as well as when to (and when not to) apply them to a dataset.
There are three main types of machine learning
that data analysts need to know: supervised learning, unsupervised learning, and reinforcement learning.
In supervised learning, the “learner” (computer program) is provided
with two sets of data, a training set and a test set. The computer “learns”
from a set of labeled examples in the training set so that it can identify
unlabeled examples in the test set accurately. The goal is for the learner to
develop a rule that can identify the elements in the test set. It is supervised
learning that makes it possible for your phone to recognize your voice, and
your email to filter spam. Specific tools you’ll use include:
- decision trees
- Naive Bayes classification
- Ordinary Least Squares regression
- logistic regression
- neural networks
- support vector machines
- and ensemble methods.
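The train-then-test idea can be sketched with Ordinary Least Squares regression, one of the tools listed above. The data is hypothetical and `fit_ols` is an illustrative helper, not a library function; real projects would typically rely on an established package.

```python
def fit_ols(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical labeled data: (hours of practice, test score).
train_x = [1, 2, 3, 4, 5, 6]
train_y = [55, 61, 64, 71, 74, 81]
test_x = [7, 8]          # held-out examples the model never saw
test_y = [85, 90]

# "Learn" a rule from the training set, then check it on the test set.
slope, intercept = fit_ols(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
mse = sum((p - y) ** 2 for p, y in zip(predictions, test_y)) / len(test_y)
print(f"score = {slope:.2f} * hours + {intercept:.2f}; test MSE = {mse:.3f}")
```

Evaluating on examples that were held out of training is what tells you whether the learned rule generalizes rather than merely memorizes.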
Unsupervised learning is what you’ll use when faced with the challenge of discovering implicit
relationships, and thus hidden structure, in a given “unlabeled” dataset.
Unsupervised learning makes it possible for Netflix to recommend movies you’d
enjoy, and Amazon to predict products you’ll like. Specific tools you’ll use include:
- clustering algorithms
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- and Independent Component Analysis (ICA).
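As a small illustration of clustering, here is a sketch of k-means (Lloyd's algorithm) in plain Python. The points and the starting centroids are hypothetical; real implementations choose starting centroids more carefully and stop once assignments no longer change.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain Lloyd's algorithm; `centroids` holds the initial guesses."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = k_means(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

No labels were supplied; the grouping comes entirely from the distances between points, which is the "hidden structure" idea in miniature.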
Lastly, reinforcement learning applies to situations that fall
between the two extremes of supervised and unsupervised, i.e., when there is
some form of feedback available for each predictive step or action, but no
precise label or error measure. You can
apply reinforcement learning when you want to figure out how to maximize
rewards, for instance in arenas like robot control, chess, backgammon,
checkers, and other activities that a software agent can learn. Specific tools you’ll use include:
- Q-Learning
- TD-Learning
- and genetic algorithms.
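Here is a minimal sketch of tabular Q-Learning on a toy problem. The five-cell corridor environment, the reward of 1.0 at the goal, and the learning-rate and discount values are all assumptions made for illustration.

```python
import random

N_STATES = 5            # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)      # index 0 = step left, index 1 = step right
ALPHA, GAMMA = 0.5, 0.9 # learning rate and discount factor (assumed values)

def step(state, action_index):
    """Deterministic toy environment: reward 1.0 only on reaching the goal."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

random.seed(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
for _ in range(500):                       # episodes
    state, done = 0, False
    while not done:
        # Random behavior policy; Q-Learning is off-policy, so it still
        # learns the values of the greedy policy.
        action = random.randrange(2)
        next_state, reward, done = step(state, action)
        target = reward + (0.0 if done else GAMMA * max(q[next_state]))
        q[state][action] += ALPHA * (target - q[state][action])
        state = next_state

policy = ["left" if q[s][0] > q[s][1] else "right" for s in range(N_STATES - 1)]
print(policy)
```

The agent is never told which move is correct; it only receives a reward at the goal, and the update rule propagates that reward backward through the earlier cells.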
Data wrangling
Still with us? The last three abilities crucial to your development as a
data analyst pertain to manipulating, displaying, and interpreting data.
To transform raw material into useful, organized datasets, data wrangling
(also known as “data
munging”) comes into play. This is the process of collecting and cleaning data
so it can be easily explored and analyzed.
You’ll need to equip yourself with knowledge of database systems (both
SQL-based and NoSQL-based) that act as a central hub to store information.
It’ll be useful to be familiar with relational databases such as PostgreSQL,
MySQL, Netezza, and Oracle, as well as tools such as Hadoop, Spark, and MongoDB.
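To get a feel for SQL without installing a database server, you can use the `sqlite3` module that ships with Python. The `orders` table and its rows below are hypothetical; the same `GROUP BY` query pattern carries over to the larger systems named above, with dialect differences.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 20.0), ("ben", 35.5), ("ana", 14.5), ("cy", 50.0)],
)

# Total spend per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

The `?` placeholders let the database driver handle quoting, which is safer than building SQL strings by hand.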
Other concepts and tools essential to data wrangling include regular
expressions, mathematical transformations, and Python's built-in string
methods for string manipulation.
You’ll also need to know
how to parse common file formats such as CSV and XML and how to bring a
skewed, non-normal distribution closer to normal with a log-10 transformation.
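Several of these pieces can be combined in one short sketch: parsing CSV, cleaning strings with regular expressions, and applying a log-10 transformation. The raw data and the `clean_row` helper are hypothetical, and a log transformation only helps when the values are positive and right-skewed.

```python
import csv
import io
import math
import re

# Hypothetical messy export: stray whitespace, mixed case, currency formatting.
RAW = """name,revenue
 Acme Corp ,"$1,200"
globex,$95
Initech  ,"$48,000"
"""

def clean_row(row):
    """Normalize one parsed CSV row into tidy, typed fields."""
    name = re.sub(r"\s+", " ", row["name"]).strip().title()
    revenue = float(re.sub(r"[^\d.]", "", row["revenue"]))  # drop "$" and ","
    return {
        "name": name,
        "revenue": revenue,
        "log10_revenue": math.log10(revenue),  # compresses large values
    }

records = [clean_row(row) for row in csv.DictReader(io.StringIO(RAW))]
for record in records:
    print(record)
```

After the transformation, revenues that differ by orders of magnitude sit on a comparable scale, which is often what downstream statistical methods expect.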
It may all sound overwhelming right now, especially if you’re brand new not
only to the skills involved, but to some of the terms themselves. Remember that
all of these skills are stackable: each one you master will help you build the
next, and the next after that, until you’re a fully equipped data analyst ready
to kick butt and take some names.
Building on a Programming Background
Did some, or a lot, of that content overview sound familiar to you? Have
you been trained as a software engineer, or perhaps studied programming in
college, yet lack the solid mathematical foundation required to become a
data analyst?
No sweat. You’re in a great position to launch a learning journey, at the
culmination of which you’ll be situated for maximum data analysis success.
Here’s what you’ll need to learn next, in order, on the road to clicking
“apply” on a data analyst job opening.
Foundational topics
- Statistics: You’ll need to be able to rigorously interpret, make inferences, and compare different types of data by applying the right approach, technique, or statistical tests to different types of distributions. Check out the above breakdown for specific tools and skills.
- Probability: In order to draw accurate conclusions, data analysts need to be able to reason about the likelihood that an event could have happened or that it will happen. Check out the above breakdown for specific tools and skills.
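Reasoning about the likelihood of an event given some evidence often comes down to Bayes' theorem. The sketch below uses hypothetical numbers for a fraud alert, and the function name `posterior` is an illustrative choice.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(event | evidence) via Bayes' theorem."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: 1% of transactions are fraudulent, the alert fires on
# 95% of fraudulent ones and on 5% of legitimate ones.
p_fraud = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(fraud | alert) = {p_fraud:.3f}")
```

Even with an accurate-sounding alert, most flagged transactions here are legitimate, because fraud is rare to begin with. That gap between intuition and calculation is why probability deserves deliberate practice.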
Advanced topics
- Multivariable calculus/linear algebra: These advanced math skills are less important to know than statistics and probability, but will definitely be useful if you want to understand how machine learning actually works. In addition, if you envision wanting to leverage your data analyst chops into a career as a data scientist at some point, multivariable calc and linear algebra will provide the foundational knowledge to build your own algorithms.
Building on a Mathematical Background
OK, so maybe you’re a math whiz, but have no knowledge of programming.
Here’s a step-by-step guide to building that programming knowledge that’s so
crucial to becoming a data analyst.
Foundational topics
- Variables, control flow, loops, functions: These are the basic building blocks of programming. Know them and love them.
- Debugging: Your code will probably not work correctly the first time around, or could break when unexpected situations occur. When that happens, you’ll need to be able to figure out what the problem is and why it’s happening. This is where debugging skills will come in handy.
- Object-oriented programming: Learn how to structure your code into object-oriented design patterns, so it can be easily reused, tested, and shared with other people.
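The foundational topics above can be seen together in one small sketch. The `Basket` class and its items are hypothetical, chosen only to show variables, a loop, functions, a simple object, and an early error check that makes debugging easier.

```python
class Basket:
    """Tiny class bundling data (items) with behavior (add, total)."""

    def __init__(self):
        self.items = []  # variable holding the object's state

    def add(self, name, price, quantity=1):
        # Fail early with a clear message; this makes bugs easier to locate.
        if price < 0 or quantity < 1:
            raise ValueError(f"bad line item: {name!r}, {price}, {quantity}")
        self.items.append((name, price, quantity))

    def total(self):
        amount = 0.0
        for _name, price, quantity in self.items:  # loop over stored items
            amount += price * quantity
        return amount

basket = Basket()
basket.add("notebook", 2.5, quantity=4)
basket.add("pen", 1.0)
assert basket.total() == 11.0  # quick sanity check while debugging
print(basket.total())
```

Checks like the `assert` line are a cheap debugging habit: when an assumption stops holding, the program tells you where instead of silently producing a wrong number.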
Advanced topics
- Data structures: For extra credit, familiarize yourself with Stacks, Queues, Lists, Arrays, Hashmaps, Priority Queues, Tries, and Graphs. There are certain situations in which one data structure will be superior to others (in terms of memory usage and runtime efficiency), and if you understand these relationships, you can optimize your program to run faster and more efficiently. That’ll impress your team, and set you apart among other data professionals.
- Algorithms: Knowing which algorithm to apply in which situation can reduce the running time of your program from a few days to a few hours, or the memory requirement from a few gigabytes to a few hundred megabytes. Work towards understanding divide and conquer (D&C) algorithms, greedy algorithms, dynamic programming, linear programming, and graph algorithms (depth-first vs. breadth-first traversal, minimum spanning trees, and shortest path between two nodes).
- Software design patterns: Want to make your code robust, reusable, and testable? Many pioneering software engineers and computer scientists have developed software design patterns to help you do so. Become comfortable with them so you can excel at your data analysis.
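As one example of how data structures and algorithms work together, the sketch below finds a shortest path in an unweighted graph with breadth-first traversal. It uses a queue (`collections.deque`) and a hashmap (`dict`); the graph itself is hypothetical.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search on an unweighted graph; returns a node list or None."""
    queue = deque([start])   # queue: nodes waiting to be explored, oldest first
    parent = {start: None}   # hashmap: how each visited node was reached
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk the parent links back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical directed graph stored as an adjacency list.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"],
         "D": ["F"], "E": ["F"], "F": []}
path = shortest_path(graph, "A", "F")
print(path)
```

Using a `deque` keeps removal from the front of the queue cheap, whereas popping from the front of a plain list takes time proportional to its length. That is the kind of choice the data structures bullet is pointing at.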
The Bottom Line
Data analysis is a fast-growing field, and there are a lot of voices out
there sharing what you need to learn, in what order. The variety of information
can be confusing, overwhelming, and discouraging.
Know that you can rely on this breakdown as the definitive guide to what
you really do need to learn in order to land that first data analyst job, along
with prescriptions for where to start, depending on your specific background.
The investment in a career as a data analyst is huge, no matter if you’re
just starting out or if you’re expanding on existing abilities. But the payoff,
we promise, is even bigger.’’