
8 Questions You Should Know Before Starting a Data Science Career

  • Writer: Ajay Sharma
  • Dec 2, 2020
  • 4 min read

It is all well and good to learn the technical skills that you need to become a data scientist. I think that it is also extremely important to learn to think like a data scientist. That means always questioning…basically everything.



Obviously every data science problem will require you to question your methods and your data in different ways, but there are a few things that I think are important to consider whenever you embark on a new data science project. In this story, I will go through those questions and explain why I think they matter for being a responsible data scientist.

My questions for any new data science project are:


  • What is the question you are trying to answer?

  • Do you know exactly what you are trying to measure?

  • Do you have the right data to answer your question?

  • Do you know enough about how your data was collected?

  • Are there any ethical considerations?

  • Who is going to read your analysis and how much do they understand statistics?

  • Do you need to be able to interrogate your methods?


1. What is the question you are trying to answer?

It is extremely important to at least have an idea of what question you are trying to answer before you interrogate any data set.


You don’t want to test multiple hypotheses and just see which ones come out as significant. If you did that, you would run into the multiple hypothesis testing problem. We will go into this in more detail when we do our lessons on statistics, but in brief, it arises when you test several hypotheses at the same time.


When we talk about a significant result, in general, we are referring to a result that we are fairly confident differs from the ‘control’ because of a real effect rather than random chance. The most commonly used threshold is a 5% significance level (p < 0.05), often described as 95% confidence.


That leaves a false positive rate of 5%: when there is no real effect, we will still label a result as significant about 1 time in 20. The problem with testing multiple hypotheses at the same time is that the likelihood of making this type of error for at least one of the hypotheses increases with every test you add. Thus, by indiscriminately testing many hypotheses at once, you increase your chance of making a false discovery.
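
To put a number on that, here is a minimal sketch. It assumes that the tests are independent and that each one uses the same 5% threshold, in which case the chance of at least one false positive across m tests is 1 - (1 - 0.05)^m. The function name is my own, chosen for illustration.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests.

    Assumes every test uses the same threshold `alpha` and that there is
    no real effect in any of them.
    """
    return 1 - (1 - alpha) ** num_tests


# Show how the overall error rate grows as more tests are added.
for num_tests in (1, 6, 20):
    rate = family_wise_error_rate(num_tests)
    print(f"{num_tests:>2} tests: {rate:.1%} chance of at least one false positive")
```

Under these assumptions, the chance climbs from 5% for a single test to roughly 26% for 6 tests and roughly 64% for 20. Real tests on the same data set are rarely fully independent, so treat these figures as a guide rather than an exact prediction.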

So rather than testing at random and seeing what sticks, it is much better practice to use statistical tests strategically, when you have a thought-out and well-researched hypothesis.


For example:

If you had a data set with measurements on 4 different groups and ran a t-test on every pair of groups (6 tests in all) to see whether any of them came out as significant, then you would run into the multiple hypothesis testing problem. You would have an increased likelihood of reaching an erroneous conclusion.
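
A small simulation can make this concrete. The sketch below is only an illustration under my own assumptions: it draws 4 groups from the same distribution, so there is no real difference to find, and then tests every pair. To stay within the Python standard library it swaps the t-test for a z-test with a known standard deviation of 1; in a real analysis you would more likely reach for a t-test from a statistics library. The function names, group size, and number of trials are all made up for the example.

```python
import itertools
import random
from statistics import NormalDist, mean


def two_sided_z_test(a, b, sigma=1.0):
    """p-value for a difference in means, assuming a known common sigma."""
    standard_error = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def any_false_positive(rng, num_groups=4, group_size=30, alpha=0.05):
    """Draw identical groups, test every pair, report if any test 'hits'."""
    groups = [
        [rng.gauss(0, 1) for _ in range(group_size)]
        for _ in range(num_groups)
    ]
    return any(
        two_sided_z_test(a, b) < alpha
        for a, b in itertools.combinations(groups, 2)
    )


def simulate(num_trials=2000, seed=0):
    """Share of simulated data sets with at least one false positive."""
    rng = random.Random(seed)
    hits = sum(any_false_positive(rng) for _ in range(num_trials))
    return hits / num_trials


share = simulate()
print(f"Share of data sets with at least one 'significant' pair: {share:.2f}")
```

Because every group comes from the same distribution, any significant pair here is a false positive. The printed share should come out well above the 5% you would expect from a single test, even though nothing real is going on in the data.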


It would be much better practice to state a single null hypothesis in advance (for example, that all four groups share the same mean) and test that instead, with one overall test such as a one-way ANOVA.

In addition to avoiding the multiple hypothesis testing problem, having clarity of thought about what question you are trying to answer will keep you from getting side-tracked.


Sometimes there are so many different shiny and interesting insights to be gleaned from a dataset that it can be easy to fall down the wrong rabbit hole. You may end up doing a lot of work to solve interesting problems but have no answers to your original questions.


That may not be such a big deal if data science is a hobby for you, but it may be much more important if you are trying to work to a deadline or solve a specific problem for the company you work for.


2. Do you know exactly what you are trying to measure?

Once you know what problem you want to solve, you need to know how you are going to solve it. Part of that is deciding what you are trying to measure.


There are often multiple ways to approach the same question. However, if you choose to measure the wrong effect or variable, then you may not be able to solve your problem effectively. So it is extremely important to consider carefully whether what you are trying to measure is the most effective way to answer your question of interest.


Similar to number 1 above, if you choose the wrong effect or variable to measure, then you can spend a lot of time working on an analysis that does not really meet your needs.


For example:

I once needed to create a prediction model where the outcome variable was whether an individual had been treated with a specific medical procedure. The data I was using was healthcare claims data, where procedures are indicated by codes. Many of the codes were extremely specific, so I needed to group a selection of codes together to create my binary outcome variable.


Whilst I was investigating my procedure of interest, I came across a subset of these procedures which captured my attention. I so wanted to go down the rabbit hole and focus on this subset of the procedure, but that is not what I had been asked to research by the company that I was working for. So I had to table that for a later date and stay on task.
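
To illustrate the grouping step described above, here is a minimal sketch of turning a set of procedure codes into a binary outcome per patient. Everything in it is hypothetical: the codes are placeholders rather than real billing codes, and the field names and the `build_outcome` helper are my own inventions for the example, not the ones from the original project.

```python
# Hypothetical placeholder codes that together define the procedure of interest.
PROCEDURE_OF_INTEREST = {"X100", "X101", "X105"}

# Hypothetical claims records: one row per claim, several rows per patient.
claims = [
    {"patient_id": 1, "procedure_code": "X100"},
    {"patient_id": 1, "procedure_code": "Z900"},
    {"patient_id": 2, "procedure_code": "Z900"},
    {"patient_id": 3, "procedure_code": "X105"},
]


def build_outcome(claims, codes_of_interest):
    """Map each patient to 1 if any claim carries a code of interest, else 0."""
    outcome = {}
    for claim in claims:
        patient = claim["patient_id"]
        treated = claim["procedure_code"] in codes_of_interest
        # Once a patient is flagged as treated, later claims cannot unflag them.
        outcome[patient] = int(outcome.get(patient, 0) or treated)
    return outcome


print(build_outcome(claims, PROCEDURE_OF_INTEREST))  # {1: 1, 2: 0, 3: 1}
```

Which codes go into the set is exactly the "what are you trying to measure" decision: a different grouping produces a different outcome variable and therefore a different model.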


Deciding which approach to take can be the most important step in any data science project.

It may take a bit of research to find out which variable is the right one to measure for your purpose, but it will be worth it. If you are confident that you have correctly chosen what to measure, then you can be much more confident in the results of your analysis.


Further, when you are communicating your findings to others you can do so with the assurance that you have appropriately considered which problem you want to solve and how to measure it correctly.



For more articles, visit: https://insideaiml.com/article


 
 
 


©2020 by InsideAIML
