Step-by-step, from-the-ground-up guide to becoming an ultra-hireable data analyst.



As you can imagine, data science is among the most promising disciplines in our data-driven world. The stakes are high when it comes to addressing and mitigating our most critical issues in business, health, security, sustainable development, living and working conditions, and more.

In this context, what we need most is to know how to think and how to ask the right questions: how to nurture an intuition about which things are important and which aren’t, how to sense the “question behind the question,” and how to discover the exact issues driving the need to analyze the data.

Let us now consider the post below, from Cheng Han Lee, on the up-front work of learning and sharpening the necessary skills.

‘’Programming
Programming is an integral aspect of data analysis. It’s the core skill that sets data analysts apart from business analysts. You’ll need to be able to program well in one or more programming languages—start with Python or R—and to have a good grasp of the landscape of the most commonly used data science libraries and packages (such as ggplot2, reshape2, numpy, pandas, and scipy).

Statistics

What good is all that programming prowess without the ability to interpret the data? An understanding of statistics, including statistical tests, distributions, and maximum likelihood estimators, is essential in data analysis.

Acquaint yourself with both descriptive and inferential statistics. The former refers to quantitative measures that describe the properties of a sample; the latter, to predictive measures that infer properties of the larger population by interpreting the sample. 

You’ll need to know the basics, many of which will sound familiar from high school or college (mean, median, mode; standard deviation and variance; hypothesis testing), onto which you will layer more complex statistical skills (different types of data distributions: standard normal, exponential/Poisson, binomial, chi-square; and tests for significance: Z-test, t-test, Mann-Whitney U, chi-squared, ANOVA).
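
To make the basics concrete, here is a minimal sketch using only Python's standard library. The sample values, the hypothesized mean of 15, and the "known" population standard deviation of 4 are all made up for illustration; a real analysis would more often reach for a t-test, since the population standard deviation is rarely known.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample: daily order counts (made-up numbers for illustration).
sample = [12, 15, 15, 18, 20, 22, 15, 19, 24, 20]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
median = statistics.median(sample)
mode = statistics.mode(sample)
variance = statistics.variance(sample)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(sample)       # sample standard deviation


def z_test_two_sided(values, mu0, sigma):
    """Two-sided one-sample Z-test, assuming the population sigma is known."""
    n = len(values)
    z = (statistics.mean(values) - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Inferential statistics: is the population mean plausibly 15?
z, p = z_test_two_sided(sample, mu0=15, sigma=4)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, std_dev={std_dev:.2f}")
print(f"z={z:.2f}, p={p:.4f}")
```

The first half only describes the ten numbers in hand; the Z-test is the inferential step, using them to say something about the population they came from.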

Beyond descriptive and inferential stats, data analysts need to be adept at statistical experimental design. That’s the systematic process of selecting parameters in order to make results both valid and significant. For example, you’ll need to determine how many samples to collect, how different factors should be interwoven, how to choose good control and testing groups, and the like. To execute strong experimental design using tools like A/B testing and concepts like statistical power, best practice is to use as a barometer the idea of “SMART (Specific, Measurable, Actionable, Realistic, Timely) experiments.”
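
One common way to answer "how many samples to collect" for an A/B test is a power calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions; the function name, the baseline and variant conversion rates, and the defaults (5% significance, 80% power) are illustrative assumptions, not values from the post.

```python
import math
from statistics import NormalDist


def samples_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for significance
    z_beta = NormalDist().inv_cdf(power)            # critical value for desired power
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / effect ** 2)


# Hypothetical scenario: detect a lift from a 10% to a 12% conversion rate.
n = samples_per_group(0.10, 0.12)
print(f"about {n} users per group")
```

The useful intuition is the inverse-square relationship: halving the effect you want to detect roughly quadruples the number of samples you need.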

Math

The language of data analysts is numbers, so it follows that a strong foundation in math is an essential building block on the path to becoming a data analyst.
At a basic level, you should be comfortable with college algebra. You’ll have to translate what you once knew as “word problems” (real-world equivalent: business problems) into mathematical expressions; you’ll need to be able to manipulate algebraic expressions and solve equations; and you’ll need to be able to graph different types of functions, with a deep understanding of the relationship between a function’s graph and its equation.

Beyond that, a solid grasp of multivariable calculus and linear algebra will serve you well as a data analyst. Think: matrix manipulations, dot product, eigenvalues and eigenvectors, and multivariable derivatives.
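
These ideas are easier to hold onto once you have computed them by hand. Below is a small pure-Python sketch of a dot product, a matrix-vector product, a dominant eigenvalue found by power iteration, and a numerical gradient. The helper names and the example matrix are illustrative assumptions; in practice you would use numpy rather than hand-rolled loops.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in m]


def dominant_eigenpair(m, iterations=100):
    """Power iteration: estimate the largest-magnitude eigenvalue and its eigenvector."""
    v = [1.0] + [0.0] * (len(m) - 1)
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    eigenvalue = dot(v, mat_vec(m, v))   # Rayleigh quotient (v has unit length)
    return eigenvalue, v


def numerical_gradient(f, point, h=1e-6):
    """Central-difference estimate of the partial derivatives of f at point."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


matrix = [[2.0, 1.0],
          [1.0, 2.0]]   # symmetric example matrix with eigenvalues 3 and 1
eigenvalue, eigenvector = dominant_eigenpair(matrix)
gradient = numerical_gradient(lambda p: p[0] ** 2 + 3 * p[1], [2.0, 1.0])
```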

Machine learning

Multivariable calculus and linear algebra, along with statistics, make up the basic foundation of machine learning, which enables data professionals to make predictions or calculated suggestions based on huge amounts of data. For a career as a data analyst, you won’t need to invent new machine-learning algorithms (advanced skills like that qualify you to become a data scientist), but you should know the most common of them. A few examples include principal component analysis, neural networks, support vector machines, and k-means clustering. Note that you may not need to know the theory and implementation details behind these algorithms, but you should understand the pros and cons, as well as when to (and when not to) apply them to a dataset.

There are three main types of machine learning that data analysts need to know: supervised learning, unsupervised learning, and reinforcement learning.

In supervised learning, the “learner” (computer program) is provided with two sets of data, a training set and a test set. The computer “learns” from a set of labeled examples in the training set so that it can identify unlabeled examples in the test set accurately. The goal is for the learner to develop a rule that can identify the elements in the test set. It is supervised learning that makes it possible for your phone to recognize your voice, and your email to filter spam. Specific tools you’ll use include:
  • decision trees
  • Naive Bayes classification
  • Ordinary Least Squares regression
  • logistic regression
  • neural networks
  • support vector machines
  • and ensemble methods.
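
As a rough illustration of the train/test idea, here is Ordinary Least Squares regression with a single feature, written in plain Python. The data is made up, and the function names are hypothetical. The learner never sees the test labels; they are used afterwards only to score its predictions.

```python
def fit_ols(xs, ys):
    """Closed-form ordinary least squares for one feature: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    intercept, slope = model
    return intercept + slope * x


# Hypothetical labeled data: (hours studied, exam score), made up for illustration.
train_x, train_y = [1, 2, 3, 4, 5], [52, 54, 61, 64, 69]
test_x, test_y = [6, 7], [73, 78]

model = fit_ols(train_x, train_y)                     # learn from the training set
predictions = [predict(model, x) for x in test_x]     # apply to unseen examples
mean_abs_error = sum(abs(p - y) for p, y in zip(predictions, test_y)) / len(test_y)
```

A low error on the held-out examples, rather than on the training data, is what suggests the learned rule generalizes.
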
Unsupervised learning is what you’ll use when faced with the challenge of discovering implicit relationships, and thus hidden structure, in a given “unlabeled” dataset. Unsupervised learning makes it possible for Netflix to recommend movies you’d enjoy, and Amazon to predict products you’ll like. Specific tools you’ll use include:
  • clustering algorithms
  • Principal Component Analysis (PCA)
  • Singular Value Decomposition (SVD)
  • and Independent Component Analysis (ICA).
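
To show what "discovering hidden structure" can look like, here is a bare-bones k-means clustering sketch (Lloyd's algorithm) in plain Python. The points and the choice of starting centroids are illustrative assumptions; note that no labels are supplied anywhere.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


# Hypothetical unlabeled data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
```

The algorithm recovers the two groups purely from distances between points, which is the sense in which the structure is "implicit" in the data.
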
Lastly, reinforcement learning applies to situations that fall between the two extremes of supervised and unsupervised, i.e., when there is some form of feedback available for each predictive step or action, but no precise label or error measure. You can apply reinforcement learning when you want to figure out how to maximize rewards, for instance in arenas like robot control, chess, backgammon, checkers, and other activities that a software agent can learn. Specific tools you’ll use include:
  • Q-Learning
  • TD-Learning
  • and genetic algorithms.
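
A tiny Q-Learning sketch makes the "feedback but no label" idea more tangible. The environment here is a hypothetical five-cell corridor invented for illustration: the agent starts at the left end and is rewarded only on reaching the right end. The learning rate, discount factor, and exploration rate are arbitrary choices.

```python
import random

N_STATES = 5          # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Hypothetical toy environment: reward 1.0 only when the goal is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore sometimes (and on ties), otherwise exploit.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move toward reward plus discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train()
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)
```

Nobody tells the agent that "right" is correct in each cell; it infers that from the delayed reward propagating backwards through the table.
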
Data wrangling

Still with us? The last three abilities crucial to your development as a data analyst pertain to manipulating, displaying, and interpreting data. To transform raw material into useful, organized datasets, data wrangling (also known as “data munging”) comes into play. This is the process of collecting and cleaning data so it can be easily explored and analyzed.

You’ll need to equip yourself with knowledge of database systems (both SQL-based and NoSQL-based) that act as a central hub to store information. It’ll be useful to be familiar with relational databases such as PostgreSQL, MySQL, Netezza, and Oracle, as well as Hadoop, Spark, and MongoDB.
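
If you have never written SQL, Python's built-in sqlite3 module is a low-friction way to practice. The sketch below uses an in-memory SQLite database as a stand-in for the systems named above; the table, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Aggregate per customer, largest total first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

The same SELECT, GROUP BY, and ORDER BY ideas carry over to PostgreSQL or MySQL, though each system has its own dialect details.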

Other concepts and tools essential to data wrangling include regular expressions, mathematical transformations, and Python’s string library for string manipulation.
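
Here is a small sketch of how regular expressions and ordinary string methods combine to clean messy records. The record layout, the patterns, and the helper name are assumptions made up for this example; real data will need patterns tuned to its own quirks.

```python
import re

# Hypothetical messy contact records (made up for illustration).
raw_records = [
    "  Alice SMITH <alice.smith@example.com>  tel: (555) 010-1234 ",
    "bob jones <BOB@Example.org> tel: 555.010.9876",
]

EMAIL = re.compile(r"<([^<>\s]+@[^<>\s]+)>")   # text between angle brackets containing "@"
NON_DIGITS = re.compile(r"\D+")                # any run of non-digit characters


def clean(record):
    """Normalize one record into a dict with tidy name, email, and phone fields."""
    name_part = record.split("<")[0]
    name = " ".join(name_part.split()).title()       # collapse whitespace, fix casing
    email = EMAIL.search(record).group(1).lower()
    phone = NON_DIGITS.sub("", record.split("tel:")[1])
    return {"name": name, "email": email, "phone": phone}


cleaned = [clean(r) for r in raw_records]
```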

You’ll also need to know how to parse common file formats such as CSV and XML files, and how to bring a skewed, non-normal distribution closer to normal with a log-10 transformation.
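
The sketch below parses a small CSV snippet and a small XML snippet with the standard library, then applies a log-10 transformation. The data and the crude skew measure are assumptions for illustration. A log transform only helps with right-skewed, strictly positive data, and it does not guarantee a normal result.

```python
import csv
import io
import math
import statistics
import xml.etree.ElementTree as ET

# Hypothetical inputs: page-view counts spanning several orders of magnitude.
csv_text = "page,views\nhome,100000\npricing,10000\nblog,1000\n"
xml_text = "<pages><page name='faq' views='100'/><page name='legal' views='10'/></pages>"

views = [float(row["views"]) for row in csv.DictReader(io.StringIO(csv_text))]
views += [float(node.get("views")) for node in ET.fromstring(xml_text).iter("page")]

log_views = [math.log10(v) for v in views]   # requires strictly positive values


def skew_gap(values):
    """Crude skew indicator: (mean - median) measured in standard deviations."""
    return (statistics.mean(values) - statistics.median(values)) / statistics.stdev(values)


print(f"raw skew gap: {skew_gap(views):.3f}")
print(f"log skew gap: {skew_gap(log_views):.3f}")
```

On the raw scale the mean is dragged far above the median by the largest value; on the log scale the two coincide for this particular made-up data.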

It may all sound overwhelming right now, especially if you’re brand new not only to the skills involved, but to some of the terms themselves. Remember that all of these skills are stackable: each one you master will help you build the next, and the next after that, until you’re a fully equipped data analyst ready to kick butt and take some names.

Building on a Programming Background
Did some, or a lot, of that content overview sound familiar to you? Have you been trained as a software engineer, or perhaps studied programming in college, but still lack the solid mathematical foundation required to become a data analyst?
No sweat. You’re in a great position to launch a learning journey, at the culmination of which you’ll be situated for maximum data analysis success.
Here’s what you’ll need to learn next, in order, on the road to clicking “apply” on a data analyst job opening.
Foundational topics
  • Statistics: You’ll need to be able to rigorously interpret, make inferences, and compare different types of data by applying the right approach, technique, or statistical tests to different types of distributions. Check out the above breakdown for specific tools and skills.
  • Probability: In order to draw accurate conclusions, data analysts need to be able to reason about the likelihood that an event could have happened or that it will happen. Check out the above breakdown for specific tools and skills.
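
A classic exercise in reasoning about likelihood is Bayes' rule. In the minimal sketch below, the function name and all three rates are hypothetical numbers chosen for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' rule: P(event | evidence) for a binary event."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% of transactions are fraudulent; the detector flags
# 95% of fraudulent transactions and 5% of legitimate ones.
p_fraud_given_flag = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(fraud | flagged) = {p_fraud_given_flag:.3f}")
```

Even with a detector that sounds accurate, most flagged transactions are legitimate here, because fraud is rare to begin with. That is the kind of conclusion intuition tends to get wrong.
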
Advanced topics
  • Multivariable calculus/linear algebra: These advanced math skills are less important to know than statistics and probability, but will definitely be useful if you want to understand how machine learning actually works. In addition, if you envision wanting to leverage your data analyst chops into a career as a data scientist at some point, multivariable calc and linear algebra will provide the foundational knowledge to build your own algorithms.
Building on a Mathematical Background

OK, so maybe you’re a math whiz, but have no knowledge of programming. Here’s a step-by-step guide to building that programming knowledge that’s so crucial to becoming a data analyst.
Foundational topics
  • Variables, control flow, loops, functions: These are the basic building blocks of programming. Know them and love them.
  • Debugging: Your code will probably not work correctly the first time around, or could break when unexpected situations occur. When that happens, you’ll need to be able to figure out what the problem is and why it’s happening. This is where debugging skills will come in handy.
  • Object-oriented programming: Learn how to structure your code into object-oriented design patterns, so it can be easily reused, tested, and shared with other people.
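
The three foundational topics above fit into one short sketch: a small class (object-oriented structure), a loop feeding it values, and explicit checks that fail loudly when something unexpected happens, which makes debugging far easier. The class name and the readings are hypothetical.

```python
class RunningStats:
    """Small reusable object that tracks a count and a running mean."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        # Fail fast on unexpected input instead of silently corrupting the total.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"expected a number, got {value!r}")
        self.count += 1
        self.total += value

    @property
    def mean(self):
        if self.count == 0:          # the "unexpected situation": no data yet
            raise ValueError("mean is undefined for an empty series")
        return self.total / self.count


stats = RunningStats()
for reading in [3, 5, 10]:           # variables, a loop, and method calls
    stats.add(reading)
print(stats.mean)
```

An error message that names the bad value tells you what went wrong and where, which is most of the work of debugging.
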
Advanced topics
  • Data structures: For extra credit, familiarize yourself with Stacks, Queues, Lists, Arrays, Hashmaps, Priority Queues, Tries, and Graphs. There are certain situations in which one data structure will be superior to others (in terms of memory usage and runtime efficiency), and if you understand these relationships, you can optimize your program to run faster and more efficiently. That’ll impress your team, and set you apart among other data professionals.
  • Algorithms: Knowing which algorithm to apply in which situation can reduce the running time of your program from a few days to a few hours, or the memory requirement from a few gigabytes to a few hundred megabytes. Work towards understanding divide and conquer (D&C) algorithms, greedy algorithms, dynamic programming, linear programming, and graph algorithms (depth-first vs. breadth-first traversal, minimum spanning trees, and shortest path between two nodes).
  • Software design patterns: Want to make your code robust, reusable, and testable? Many pioneering software engineers and computer scientists have developed software design patterns to help you do so. Become comfortable with them so you can excel at your data analysis.
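
To see a data structure and an algorithm working together, here is a sketch of breadth-first search finding the shortest path between two nodes. It uses a queue (collections.deque) and a hashmap (dict); the graph is a made-up example, and the approach assumes all edges have equal weight.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: shortest path by edge count in an unweighted graph."""
    queue = deque([start])
    parent = {start: None}           # doubles as the "visited" set
    while queue:
        node = queue.popleft()       # O(1) pop from the left, unlike list.pop(0)
        if node == goal:
            path = []
            while node is not None:  # walk parent links back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None                      # goal is unreachable from start


# Hypothetical directed graph as an adjacency list.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"], "E": ["F"], "F": []}
path = shortest_path(graph, "A", "F")
print(path)
```

Swapping the deque for a plain list would still give the right answer, but each pop from the front would cost time proportional to the queue's length. That is the kind of trade-off the data structures bullet refers to.
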
The Bottom Line

Data analysis is a fast-growing field, and there are a lot of voices out there sharing what you need to learn, in what order. The variety of information can be confusing, overwhelming, and discouraging.
Know that you can rely on this breakdown as the definitive guide to what you really do need to learn in order to land that first data analyst job, along with prescriptions for where to start, depending on your specific background.
The investment in a career as a data analyst is huge, no matter if you’re just starting out or if you’re expanding on existing abilities. But the payoff, we promise, is even bigger.’’


