What is Data Science Methodology Emails?

What is Data Science Methodology?

Methodology can be defined as a system of methods used in a particular area of study or activity. It gives a structured way to investigate the key points of a problem. Now let's look at methodology in terms of data science, using a case study: classifying emails.

You are given a problem. What should your approach be?

  1. What is the problem that you are trying to solve?
  2. How can you use the data to answer the question? 

Now work with the data:

  1. What is the data you need to solve the problem?
  2. Where is the data coming from? Identify all the sources of the data and work out how you will get it.
  3. Will the data you have collected help in solving the problem?
  4. What additional work must you do to manipulate and work with the data?

The final part: getting the answers

  1. In what way can the data be visualized to get the answer that is required?
  2. Does the model developed from this approach really answer our question, or does it have to be adjusted?
  3. Can you use the model to successfully solve the problem?
  4. Can you get constructive feedback on how well the question was answered?

The above can be summarized as follows:

  1. Business Understanding
  2. Analytic Approach
  3. Data Requirements
  4. Data Collection
  5. Data Understanding
  6. Data Preparation
  7. Data Modelling 
  8. Evaluation
  9. Deployment
  10. Feedback

The above is the basic approach in data science methodology. Now let us analyze it with a small example.

I shall select email classification as my case study for applying the data science methodology.

1. Business Understanding 

We receive emails in our mailbox. Some come from our friends, subscriptions and so on, but sometimes we receive lots of mails that look fishy: they try to steal our password, offer discounts, or ask for personal information in exchange for a lucky draw prize. Can we automatically detect such spam mails and put them in the spam folder?

2. Analytic Approach 

Here we must classify the mails we receive at our email address as spam or non-spam. So, the basic question is whether a received mail is spam or not. We shall use a classification model, as it gives us a yes or no answer. If the result is yes, the mail goes to the spam folder; if no, it is delivered as normal mail to our Inbox.
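
The yes/no decision described above can be sketched in a few lines of Python. This is only an illustration: `route_email` and `toy_rule` are hypothetical names, and the keyword rule is a placeholder standing in for a trained classification model.

```python
from typing import Callable


def route_email(text: str, is_spam: Callable[[str], bool]) -> str:
    """Return the folder for an email, given any yes/no spam classifier."""
    return "Spam" if is_spam(text) else "Inbox"


def toy_rule(text: str) -> bool:
    """Hypothetical placeholder rule; a real project would use a trained model."""
    return "lucky draw" in text.lower()


print(route_email("You won our LUCKY DRAW prize!", toy_rule))  # Spam
print(route_email("Lunch tomorrow?", toy_rule))                # Inbox
```

Keeping the routing separate from the classifier means the model can be swapped later without touching the delivery logic.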

3. Data Requirements

For this we need all the mails we have received, both spam and non-spam. Identifying this data fulfils the data requirements stage of the data science methodology.

4. Data Collection

Now that we have identified our data requirements, we shall start collecting the data. We shall collect all the mail from our mailboxes. But that is not enough, so we shall also collect more mail samples from our friends or the Internet to enlarge our data set. The data thus collected can be structured, unstructured or semi-structured.
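
As a rough sketch of this step, a raw email can be turned into a labelled record with Python's standard `email` module. The sample message, the `to_record` helper and the "ham" label for non-spam are assumptions made for illustration, and the sketch only handles simple single-part text mails.

```python
from email import message_from_string

# Hypothetical raw email, as it might be exported from a mailbox.
RAW = """From: friend@example.com
Subject: Lunch tomorrow?

Are you free at noon?
"""


def to_record(raw: str, label: str) -> dict:
    """Parse one raw email into a labelled record (single-part mails only)."""
    msg = message_from_string(raw)
    payload = msg.get_payload()
    # Multipart mails return a list of parts; this sketch leaves their body empty.
    body = payload.strip() if isinstance(payload, str) else ""
    return {
        "sender": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        "body": body,
        "label": label,
    }


record = to_record(RAW, "ham")
print(record)
```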

5. Data Understanding and Preparation

Now that we have the data, we will understand its content, assess its quality, discover any preliminary insights and determine whether additional data is necessary to fill in the gaps. We see that some mails have deliberate spelling mistakes like med1icine, w4tches and so on. Some mails have punctuation errors. Some offer us deals and discounts. Some ask us for our personal information and passwords. All these mails look fishy and could be spam.
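
The warning signs listed above can be checked with a few simple rules. The following is a minimal sketch, assuming hand-picked keyword lists and a regular expression for digits hidden inside words; none of these names come from a library, and plain substring matching will produce false positives (for example, "deal" also matches "ideal").

```python
import re

# Words with digits in the middle, such as "med1icine" or "w4tches".
DIGIT_IN_WORD = re.compile(r"\b[A-Za-z]+\d+[A-Za-z]+\b")

# Hypothetical keyword lists chosen for this example only.
SENSITIVE = ("password", "personal information")
OFFERS = ("discount", "deal", "prize")


def suspicious_signals(text: str) -> dict:
    """Collect simple signs that an email might be spam."""
    lower = text.lower()
    return {
        "obfuscated_words": DIGIT_IN_WORD.findall(text),
        "asks_for_sensitive_info": any(k in lower for k in SENSITIVE),
        "mentions_offer": any(k in lower for k in OFFERS),
    }


print(suspicious_signals("Cheap med1icine and w4tches, send your password"))
```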

We then try to establish relationships between these observations. We plot histograms and other plots to see how the variables are distributed, and look at their maximum, minimum, mean and other parameters. We see that the word discount occurs as discounts, discounted and discounting. Should we consider them as one word? We must decide and start preparing the data accordingly.
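
To make the summary statistics concrete, here is a small sketch that uses the number of words per email as the variable. The three sample emails are made up, and a real project would more likely plot the histogram with a charting library than print it as text.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample emails.
emails = [
    "Win a discount now",
    "Lunch tomorrow?",
    "Discounted watches just for you today",
]

word_counts = [len(e.split()) for e in emails]
summary = {
    "min": min(word_counts),
    "max": max(word_counts),
    "mean": mean(word_counts),
}
print(summary)

# A text histogram: one '#' per email with that word count.
histogram = Counter(word_counts)
for count in sorted(histogram):
    print(f"{count:>2} words: {'#' * histogram[count]}")
```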

Data preparation, which includes data cleaning, is often said to take 70% to 90% of a project's time, and doing it properly gives the model a much better chance of being accurate. We remove all duplicate mails, that is, mails with the same content. We treat words such as discount, discounts, discounted and discounting as one word. We do feature engineering, adding or deleting columns as per our findings.
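
Two of these preparation steps, removing duplicates and treating word variants as one, can be sketched as follows. The suffix stripping in `normalize_word` is a deliberately crude assumption standing in for a proper stemmer, and all function names are hypothetical.

```python
import re


def normalize_word(word: str) -> str:
    """Strip a few common suffixes so that word variants collapse together."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least four letters so short words like "this" are untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> str:
    """Lowercase the text, drop punctuation and normalize every word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(normalize_word(w) for w in words)


def deduplicate(emails: list) -> list:
    """Keep the first copy of each email, ignoring case and extra spaces."""
    seen = set()
    unique = []
    for email in emails:
        key = " ".join(email.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(email)
    return unique


print(normalize("Discounts, discounted and discounting"))
print(deduplicate(["Big  discount", "big discount", "Hello"]))
```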

Once we have removed, added and carried out all the other operations on our data, we merge all the data into one table, called a data frame. With this our data preparation is complete.

6. Modelling and Evaluation

There are many algorithms and libraries which we can download, install and use to build models with our data frame. Try different algorithms and choose the one which gives you the best accuracy. To do this well, you have to keep in mind the question you are trying to answer, which in our case is whether a mail is spam or not, and select an analytic approach or method that suits it.
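
As one example of such an algorithm, here is a tiny Naive Bayes classifier written from scratch. Naive Bayes is my own choice for illustration, not the only option, and the class name, the "ham" label for non-spam and the four training emails are all hypothetical. In practice you would more likely use a tested library implementation.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        return self

    def _log_score(self, words, label):
        total = sum(self.word_counts[label].values())
        prior = self.class_counts[label] / sum(self.class_counts.values())
        score = math.log(prior)
        for w in words:
            if w in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[label][w] + 1  # add-one smoothing
                score += math.log(count / (total + len(self.vocab)))
        return score

    def predict(self, text):
        words = text.lower().split()
        # "ham" is listed first so that ties go to the Inbox, not the spam folder.
        return max(("ham", "spam"), key=lambda label: self._log_score(words, label))


# Hypothetical training data.
texts = [
    "win a discount prize now",
    "claim your lucky draw prize",
    "lunch tomorrow at noon",
    "meeting notes for tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayesSpamFilter().fit(texts, labels)
print(model.predict("lucky prize discount"))
print(model.predict("lunch tomorrow"))
```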

Then we evaluate our model, that is, we check its quality. We split our data set into a training set and a test set. We build our model using the training set. Then we test the model on the test set and compare the emails that the model predicts as spam with the emails that actually are spam.
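
The split and the comparison can be sketched as below. The eight labelled emails are made up, `build_model` is a deliberately simple stand-in for a real algorithm (it flags words seen only in spam training mails), and accuracy, precision and recall are common ways to compare predicted and actual labels, chosen here as an assumption.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and return (training set, test set)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def build_model(train):
    """Stand-in model: flag any word that appeared only in spam training mails."""
    spam_words, ham_words = set(), set()
    for text, label in train:
        (spam_words if label == "spam" else ham_words).update(text.split())
    spam_only = spam_words - ham_words
    return lambda text: "spam" if spam_only & set(text.split()) else "ham"


def evaluate(predicted, actual):
    """Compare predicted labels with actual labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == "spam" and a == "spam" for p, a in pairs)
    fp = sum(p == "spam" and a == "ham" for p, a in pairs)
    fn = sum(p == "ham" and a == "spam" for p, a in pairs)
    correct = sum(p == a for p, a in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labelled data: (text, label).
data = [
    ("win a prize now", "spam"),
    ("big discount today", "spam"),
    ("claim your prize", "spam"),
    ("discount on watches", "spam"),
    ("lunch tomorrow", "ham"),
    ("meeting notes", "ham"),
    ("see you at noon", "ham"),
    ("project update", "ham"),
]

train, test = train_test_split(data)
predict = build_model(train)
report = evaluate([predict(text) for text, _ in test], [label for _, label in test])
print(report)
```

With a data set this small the numbers mean very little; the point is only that the model never sees the test set while it is being built.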

7. Deployment

This is the actual application of the model. We deploy our model to classify mail as spam or non-spam, and the spam mail goes into the spam folder.

8. Feedback

The client tests whether the model is giving the right output or not. If it is giving the right results, we set it up for the client. If it is not, we start the modelling process again using the feedback given by the client. If we need to collect new data, we collect it. It is an iterative process and continues until the client gets the right result.

I hope this case study on emails has helped you understand the data science methodology.

