What is Data Science Methodology?
A methodology is a system of methods used in a particular area of study or activity. It gives us a structured way to investigate a problem. Now let's look at methodology in terms of data science, using a case study on emails as our example.
Figure: Data Science Methodology Emails
You are given a problem. What should your approach be?
- What is the problem that you are trying to solve?
- How can you use the data to answer the question?
Now work with the data:
- What is the data you need to solve the problem?
- Where is the data coming from? Identify all the sources of the data and find out how you will get it.
- Will the data you have collected help you solve the problem?
- What additional work must you do to manipulate and work with the data?
The Final Part: Getting the answers
- In what way can the data be visualized to get the answer that is required?
- Does the model developed from this approach really answer the question, or does it have to be adjusted?
- Can you use the model to successfully solve the problem?
- Can you get constructive feedback in answering the question?
The above can be summarized as follows:
- Business Understanding
- Analytic Approach
- Data Requirements
- Data Collection
- Data Understanding
- Data Preparation
- Data Modelling
- Evaluation
- Deployment
- Feedback
The above is the basic approach in data science methodology. Now let's analyze it with a small example. I shall use email classification as my case study for applying the data science methodology.
1. Business Understanding
We receive emails in our mailbox from friends, subscriptions and so on. But sometimes we receive lots of fishy mail that tries to steal our passwords, offers discounts, or asks for personal information in exchange for a lucky draw prize. Can we automatically detect such spam mails and put them in the spam folder?
2. Analytic Approach
Here we must classify the mails we receive as spam or non-spam. The basic question is whether a received mail is spam or not, so we shall use a classification model, as it gives us a yes or no answer. If the result is yes, the mail goes to the spam folder; if no, it is delivered as normal mail to our inbox.
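The yes or no framing above can be sketched in a few lines of Python. This is only an illustration: the function names, the folder names and the keyword rule standing in for a real classifier are all assumptions made for this example.

```python
def route(mail, is_spam):
    """Return the folder for a mail, given any yes/no classifier function."""
    return "Spam" if is_spam(mail) else "Inbox"


def placeholder_classifier(mail):
    # Hypothetical stand-in for a trained model: answers yes (True)
    # when the mail mentions a lucky draw.
    return "lucky draw" in mail.lower()


print(route("You won our LUCKY DRAW, send your details", placeholder_classifier))
print(route("Are we still on for lunch?", placeholder_classifier))
```

Whatever model we end up building later only has to play the role of `is_spam`: take a mail and answer yes or no.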
3. Data Requirements
For this we need all the mails we have received. Identifying this data fulfils the data requirements stage of the data science methodology.
4. Data Collection
Now that we have identified our data requirements, we shall start collecting the data. We shall collect all the mail from our mailboxes. But that is not enough, so we shall also collect more mail samples from our friends or the Internet to enlarge our data set. The data thus collected can be structured, unstructured or semi-structured.
5. Data Understanding and Preparation
Now that we have the data, we will understand its content, assess its quality, discover any preliminary insights and determine whether additional data is necessary to fill in the gaps. We see that some mails have deliberate spelling mistakes, like med1icine, w4tches and so on. Some mails have punctuation errors. Some offer us deals and discounts. Some ask us for our personal information and passwords. All these mails look fishy and could be spam.
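As a rough sketch of how such observations could be turned into something we can count, the Python snippet below flags words with digits wedged between letters and a few other signals. The regular expression, the function names and the chosen signals are assumptions for illustration; a harmless word like "h2o" would also be flagged, so treat this as a first look at the data, not a spam detector.

```python
import re

# Assumed heuristic: letters, then digits, then letters again (e.g. w4tches).
OBFUSCATED = re.compile(r"\b[A-Za-z]+\d+[A-Za-z]+\b")


def obfuscated_words(text):
    """Return the words that have digits sandwiched between letters."""
    return OBFUSCATED.findall(text)


def suspicious_signals(text):
    """Collect a few simple signals that we noticed while reading the mails."""
    lowered = text.lower()
    return {
        "obfuscated_words": len(obfuscated_words(text)),
        "mentions_discount": "discount" in lowered,
        "asks_for_password": "password" in lowered,
        "exclamation_marks": text.count("!"),
    }


print(obfuscated_words("Buy med1icine and w4tches now!!!"))
print(suspicious_signals("Confirm your password to get a discount!"))
```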
We then try to establish relationships between these observations. We plot histograms and other plots to see how the variables are distributed, and look at their maximum, minimum, mean and other statistics. We see the word discount occurring as discounts, discounted and discounting. Should we consider them as one word? We must decide and start preparing the data accordingly.
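To make this step concrete, here is a minimal sketch that summarizes one variable and draws a text histogram using only the Python standard library. The variable (number of words per mail), the values and the function names are made up for this example.

```python
from collections import Counter
from statistics import mean


def summarize(values):
    """Minimum, maximum and mean of one numeric variable."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def text_histogram(values, bin_width):
    """Group values into bins of equal width and draw one '#' per value."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(bins):
        end = start + bin_width - 1
        lines.append(f"{start:>4}-{end:<4} {'#' * bins[start]}")
    return "\n".join(lines)


# Made-up word counts for a handful of mails.
words_per_mail = [12, 45, 7, 33, 8, 51, 9, 14]
print(summarize(words_per_mail))
print(text_histogram(words_per_mail, bin_width=10))
```

In a real project we would more likely use a plotting library, but the idea is the same: look at how each variable is distributed before deciding how to prepare it.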
Data preparation, also known as data cleaning, is often said to take 70% to 90% of a project's time, and doing it properly gives us a much better chance of building a correct model. We remove all duplicate mails, that is, mails having the same content. We treat the words discount, discounts, discounted and discounting as one. We do feature engineering, adding or deleting columns as per our findings.
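The two cleaning steps above, removing duplicates and treating the variants of discount as one word, could look roughly like this. The suffix stripping is a deliberately crude stand-in for a real stemmer, and all the names here are hypothetical.

```python
import re

SUFFIXES = ("ing", "ed", "s")  # checked in this order


def crude_stem(word):
    """Strip one common suffix, keeping at least four letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase the text, split it into words and stem each word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(word) for word in words]


def drop_duplicates(mails):
    """Keep the first copy of each mail, ignoring case and extra spaces."""
    seen = set()
    unique = []
    for mail in mails:
        key = " ".join(mail.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(mail)
    return unique


print(normalize("Discounted watches, discounting daily, more discounts"))
print(drop_duplicates(["Big discount!", "big   discount!", "Hello"]))
```

A rule this simple will mangle some words, so in practice we would check its output on our own mails or switch to an established stemming library.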
Once we have removed, added and performed all the other operations on our data, we merge all the data into one table, called a data frame. With this, our data preparation is complete.
6. Modelling and Evaluation
There are different algorithms and libraries which we can download, install and use to build models with our data frame. Try different algorithms and choose the one that gives the best accuracy. To do this, you have to understand the question you are trying to answer, which in our case is whether a mail is spam or not, and then select an analytic approach or method to solve the problem.
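Choosing between candidates by accuracy can be sketched as below. The two "models" are simple hand-written rules standing in for real algorithms, and the mails and labels are made up; only the selection logic is the point here.

```python
def accuracy(model, mails, labels):
    """Fraction of mails for which the model's answer matches the label."""
    correct = sum(1 for mail, label in zip(mails, labels) if model(mail) == label)
    return correct / len(mails)


def keyword_model(mail):
    # Hypothetical candidate 1: flags mails mentioning discounts or passwords.
    lowered = mail.lower()
    return "discount" in lowered or "password" in lowered


def exclamation_model(mail):
    # Hypothetical candidate 2: flags mails with two or more exclamation marks.
    return mail.count("!") >= 2


def best_model(candidates, mails, labels):
    """Return the name of the candidate with the highest accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], mails, labels))


# Made-up labelled mails (True means spam).
mails = [
    "Huge discount on w4tches!!",
    "Lunch tomorrow?",
    "Please confirm your password",
    "Meeting notes attached!!",
]
labels = [True, False, True, False]
candidates = {"keyword": keyword_model, "exclamation": exclamation_model}
print(best_model(candidates, mails, labels))
```

The accuracy used for this comparison should be measured on mails the models were not built from, otherwise the winner may simply be the one that memorized the data best.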
Then we evaluate our model, that is, check its quality. We split our data set into a training set and a test set. We build our model using the training set. Then we test the model on the test set and compare the mails that the model predicts as spam with the mails that actually are spam.
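Here is a minimal sketch of that split, train, test and compare cycle in plain Python. The toy learner (which just remembers words seen only in spam), the 25% test fraction, the sample mails and every function name are assumptions made for illustration, not a recommended model.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(examples):
    """Toy learner: remember words seen in spam but never in normal mail."""
    spam_words, normal_words = set(), set()
    for text, is_spam in examples:
        target = spam_words if is_spam else normal_words
        target.update(text.lower().split())
    return spam_words - normal_words


def predict(spam_only_words, text):
    """Answer yes (True) if the mail contains any spam-only word."""
    return any(word in spam_only_words for word in text.lower().split())


def evaluate(predicted, actual):
    """Compare predicted spam flags with the actual ones."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)       # spam caught
    fp = sum(1 for p, a in pairs if p and not a)   # normal mail flagged
    fn = sum(1 for p, a in pairs if not p and a)   # spam missed
    correct = sum(1 for p, a in pairs if p == a)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labelled mails (True means spam).
examples = [
    ("win a discount prize now", True),
    ("send your password to claim prize", True),
    ("lunch at noon tomorrow", False),
    ("project notes attached", False),
    ("discount on w4tches today", True),
    ("see you at the meeting tomorrow", False),
    ("claim your lucky draw prize", True),
    ("can you review my notes", False),
]
train_set, test_set = train_test_split(examples)
model = train(train_set)
predicted = [predict(model, text) for text, _ in test_set]
actual = [label for _, label in test_set]
print(evaluate(predicted, actual))
```

Besides accuracy, precision tells us how many of the mails sent to the spam folder really were spam, and recall tells us how much of the spam we caught. With a data set this tiny the numbers mean little; the sketch only shows the mechanics.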
7. Deployment
This is the actual application of the model. We deploy our model to classify mail as spam or non-spam, and the spam mail goes into the spam folder.
8. Feedback
The client tests whether the model is giving the right output. If it is giving the right results, we set it up for the client. If not, we restart the modelling process using the feedback given by the client. If we need to collect new data, we collect it. It is an iterative process that continues until the client gets the right result.
I hope this email case study helped you understand the data science methodology.