
ASSESSMENT 2 BRIEF

Subject Code and Title 

BDA601—Big Data and Analytics

Assessment 

Visualisation and Model Development

Individual/Group 

Individual

Length 

Source Code and Report 1,000 words (+/-10%)

Learning Outcomes 

The Subject Learning Outcomes demonstrated by the successful  completion of the task below include:  

c) Apply data science principles to the cleaning, manipulation, and  visualisation of data 

d) Design analytical models based on a given problem; and 

e) Effectively report and communicate findings to an appropriate  audience.

Submission 

Due by 11.55 pm AEST on the Sunday at the end of Module 8.

Weighting 

30%

Total Marks 

100 marks

Task Summary 

Customer churn, also known as customer attrition, refers to the movement of customers from one  service provider to another. It is well known that attracting new customers costs significantly more than retaining existing customers. Additionally, long-term customers are found to be less costly to  serve and less sensitive to competitors’ marketing activities. Thus, predicting customer churn is  valuable to telecommunication industries, utility service providers, paid television channels, insurance  companies and other business organisations providing subscription-based services. Customer-churn  prediction allows for targeted retention planning. 

In this Assessment, you will build a machine learning (ML) model to predict customer churn using the  principles of ML and big data tools. 

As part of this Assessment, you will write a 1,000-word report that will include the following: 

a) A predictive model from a given dataset that follows data mining principles and techniques; 

b) Explanations as to how to handle missing values in a dataset; and 

c) An interpretation of the outcomes of the customer churn analysis. 

Please refer to the Task Instructions (below) for details on how to complete this task.

Task Instructions 

1. Dataset Construction 

The Kaggle Telco churn dataset is a sample dataset from IBM containing 21 attributes for 7,043 telecommunication customers. In this Assessment, you are required to work with a modified version of this dataset (the dataset can be found at the URL provided below). Modify the dataset by removing the following five attributes: MonthlyCharges, OnlineSecurity, StreamingTV, InternetService and Partner.

As the dataset is in .csv format, any spreadsheet application, such as Microsoft Excel or Open  Office Calc, can be used to modify it. You will use your resulting dataset, which should  comprise 7,043 observations and 16 attributes, to complete the subsequent tasks. The ‘Churn’ attribute (i.e., the last attribute in the dataset) is the target of your churn analysis. 
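As an alternative to a spreadsheet application, the attribute removal can also be scripted. The following is a minimal sketch using only Python's standard `csv` module. The function name `drop_columns` and the sample rows are hypothetical, and the sketch assumes that MonthlyCharges and OnlineSecurity are two separate attributes whose names match the CSV header exactly.

```python
import csv
import io

# Attributes to remove (assumed to match the CSV header exactly).
COLUMNS_TO_DROP = {"MonthlyCharges", "OnlineSecurity", "StreamingTV",
                   "InternetService", "Partner"}


def drop_columns(src, dst, columns=COLUMNS_TO_DROP):
    """Copy CSV rows from src to dst, omitting the given columns.

    Returns the list of column names that were kept.
    """
    reader = csv.DictReader(src)
    kept = [name for name in reader.fieldnames if name not in columns]
    # extrasaction="ignore" silently skips the dropped keys in each row.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "customerID,Partner,tenure,MonthlyCharges,Churn\n"
    "0001,Yes,1,29.85,No\n"
    "0002,No,34,56.95,Yes\n"
)
out = io.StringIO()
kept_columns = drop_columns(sample, out)
print(kept_columns)
```

With the real file, the same function can be called on two file objects opened with `newline=""`. Checking that the returned list has 16 names is a quick way to confirm the result matches the expected attribute count.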

2. Model Development 

From the dataset constructed in the previous step, present appropriate data visualisation and descriptive statistics, then develop a ‘decision-tree’ model to predict customer churn. The model can be developed in Jupyter Notebook using Python and Spark’s Machine Learning Library (PySpark MLlib). You can use any other platform if you find it more efficient. The notebook should include the following sections: 

a) Problem Statement 

In this section, briefly state the context and the problem you will solve in the  notebook. 

b) Exploratory Data Analysis 

In this section, perform both a visual and statistical exploratory analysis to gain  insights about the dataset. 
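The statistical part of the exploratory analysis can be sketched as follows. This is an illustration only: it uses Python's standard `statistics` module rather than Spark, and the `tenure` and `churn` values are made up, not taken from the dataset.

```python
import statistics
from collections import Counter

# Made-up values for illustration only; not taken from the real dataset.
tenure = [1, 34, 2, 45, 2, 8, 22, 10]
churn = ["No", "No", "Yes", "No", "Yes", "Yes", "No", "No"]

# Measures of central tendency
mean_tenure = statistics.mean(tenure)
median_tenure = statistics.median(tenure)

# Measures of dispersion
stdev_tenure = statistics.stdev(tenure)   # sample standard deviation
value_range = max(tenure) - min(tenure)

# Class balance of the target attribute
churn_counts = Counter(churn)
churn_rate = churn_counts["Yes"] / len(churn)

print(mean_tenure, median_tenure, value_range, churn_rate)
```

Checking the class balance early is useful because it affects how the model's accuracy should later be interpreted.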

c) Data Cleaning and Feature Selection 

In this section, perform data pre-processing and feature selection for the model,  which you will build in the next section. 

d) Model Building 

In this section, use the pre-processed data and the selected features to build a  ‘decision-tree’ model to predict customer churn. 

In the notebook, the code should be well documented, the graphs and charts should be neatly labelled, the narrative text should clearly state the objectives and a logical justification for  each of the steps should be provided. 
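To help with interpreting the decision-tree model in step d), the sketch below illustrates how a single split on a numeric attribute can be chosen by minimising the weighted Gini impurity, which is one of the split criteria Spark MLlib's decision trees support. It is a plain-Python illustration of the idea, not the library's implementation; the function names and the data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    Returns (None, parent_gini) if no split reduces the impurity.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, gini(labels))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Made-up tenure values (months) and churn labels, for illustration only.
tenure = [1, 2, 3, 24, 36, 48]
churn = ["Yes", "Yes", "Yes", "No", "No", "No"]
threshold, impurity = best_threshold(tenure, churn)
print(threshold, impurity)
```

An attribute that produces large impurity reductions near the top of the tree is typically the one reported as most important.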

3. Handling Missing Values 

The given dataset has very few missing values; however, in a real-world scenario, data scientists often need to work with datasets that have many missing values. If an attribute is important for building an effective model and has a significant number of missing values, then data scientists need to devise strategies to handle them.  

From the ‘decision-tree’ model, built in the previous step, identify the most important  attribute. If a significant number of values were missing in the most important attribute column, implement a method to replace the missing values and describe that method in your  report.  
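One simple replacement method is imputation: filling gaps in a numeric attribute with the median of the observed values, or gaps in a categorical attribute with the most frequent category. The sketch below is a minimal plain-Python illustration under those assumptions; the function names and sample columns are hypothetical, and other strategies (for example, model-based imputation) may suit the most important attribute better.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def impute_mode(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


# Made-up columns with gaps, for illustration only.
tenure = [1, None, 12, 40, None, 5]
contract = ["Month-to-month", None, "Two year", "Month-to-month"]

tenure_filled = impute_median(tenure)
contract_filled = impute_mode(contract)
print(tenure_filled)
print(contract_filled)
```

When the data is split for training and testing, the fill value should be computed from the training portion only, so that information from the test data does not leak into the model.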

4. Interpretation of Churn Analysis 

Modelling churn is difficult because there is inherent uncertainty when measuring churn.  Thus, it is important not only to understand any limitations associated with a churn analysis  but also to be able to interpret the outcomes of a churn analysis. 

In your report, interpret and describe the key findings that you were able to discover as part  of your churn analysis. Describe the following facts with supporting details: 

• The effectiveness of your churn analysis: In what percentage of cases was your analysis able to correctly identify churn? Can this be considered a satisfactory outcome? Explain why or why not; 

• Who is churning: Describe the attributes of the customers who are churning and explain what is driving the churn; and 

• Improving the accuracy of your churn analysis: Describe the effects that your previous steps (model development and handling of missing values) had on the outcome of your churn analysis and how the accuracy of your churn analysis could be improved. 
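The effectiveness question above can be quantified from the model's predictions on held-out data. The sketch below computes accuracy, precision and recall for the churn class in plain Python; the function name `churn_metrics` and the label lists are hypothetical, and the labels are assumed to be the strings "Yes" and "No".

```python
def churn_metrics(actual, predicted, positive="Yes"):
    """Return accuracy, precision and recall for the positive (churn) class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted/present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels for illustration only.
actual = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "No", "No", "No", "Yes", "Yes", "No"]
metrics = churn_metrics(actual, predicted)
print(metrics)
```

Because churners are usually the minority class, accuracy alone can look satisfactory even when few churners are detected. Recall (the share of actual churners that the model identified) speaks most directly to how often churn was correctly identified.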

Submission Instructions 

• Zip the following files and submit the .zip files via the Assessment link in the main  navigation menu in BDA601—Big Data and Analytics

o Modified dataset (.csv file) constructed in Task 1; 

o Notebook (.ipynb file) from Task 2; and 

o Report (.pdf file) from Tasks 3 and 4. 

The Learning Facilitator will provide feedback via the Grade Centre in the LMS portal. Feedback can  be viewed in My Grades. 

Academic Integrity Declaration 

I declare that except where referenced, the work I am submitting for this assessment task is my own  work. I have read and am aware of the Academic Integrity Policy and Procedure of Torrens University, Australia, viewable online at

I am also aware that I need to keep a copy of all submitted material and any drafts and I agree to do  so. 

 

Assessment Rubric

Assessment Attributes and Grade Bands

Fail (Yet to Achieve Minimum Standard): 0–49%

Pass (Functional): 50–64%

Credit (Proficient): 65–74%

Distinction (Advanced): 75–84%

High Distinction (Exceptional): 85–100%

Knowledge and understanding of exploratory data analysis (15%)

Fail (0–49%): Demonstrates partial or unsatisfactory knowledge and understanding of the exploratory data analysis. Demonstrates unsatisfactory skills in:
• Exploring the data using both the measure of central tendency and the measure of dispersions; and/or
• Exploring the data using various visual representations, such as a histogram, scatter plot, box plot, heatmap, pair plot or probability distribution plot.

Pass (50–64%): Demonstrates functional knowledge and understanding of the exploratory data analysis. Demonstrates satisfactory skills in:
• Exploring the data using both the measure of central tendency and the measure of dispersions; and
• Exploring the data using various visual representations, such as a histogram, scatter plot, box plot, heatmap, pair plot or probability distribution plot.

Credit (65–74%): Demonstrates solid knowledge and understanding of the exploratory data analysis. Demonstrates solid skills in:
• Exploring the data using both the measure of central tendency and the measure of dispersions; and
• Exploring the data using various visual representations, such as a histogram, scatter plot, box plot, heatmap, pair plot or probability distribution plot.
• Only selective statistics were produced from the above-mentioned visuals.

Distinction (75–84%): Demonstrates advanced knowledge and understanding of the exploratory data analysis. Demonstrates advanced skills in:
• Exploring the data using both the measure of central tendency and the measure of dispersions; and
• Exploring the data using various visual representations, such as a histogram, scatter plot, box plot, heatmap, pair plot or probability distribution plot.
• Appropriate statistics were produced from the above-mentioned visuals.

High Distinction (85–100%): Demonstrates exceptional knowledge and understanding of the exploratory data analysis. Demonstrates exemplary skills in:
• Exploring the data using both the measure of central tendency and the measure of dispersions; and
• Exploring the data using various visual representations, such as a histogram, scatter plot, box plot, heatmap, pair plot or probability distribution plot.
• Appropriate statistics were produced from the above-mentioned visuals.
• Gained unique insights about the dataset through the statistical observations.

Analytical design for data pre-processing and feature selection (15%)

Fail (0–49%): Demonstrates partial or unsatisfactory knowledge and understanding of data pre-processing and feature selection. Completed less than 50% of the following tasks and the tasks completed were unsatisfactory in terms of quality, accuracy and completeness:
• Handling data anomalies;
• Conducting the redundancy and correlation analysis; and/or
• Selecting the feature for model building.

Pass (50–64%): Demonstrates satisfactory knowledge and understanding of data pre-processing and feature selection. Completed most of the following tasks with accuracy and completeness to a satisfactory quality:
• Handling data anomalies;
• Conducting the redundancy and correlation analysis; and/or
• Selecting the feature for model building.

Credit (65–74%): Demonstrates solid knowledge and understanding of data pre-processing and feature selection. Completed most of the following tasks with accuracy and completeness to a good quality:
• Handling data anomalies;
• Conducting the redundancy and correlation analysis;
• Selecting the feature for model building; and
• Correctly interpreted 2 of the above tasks.

Distinction (75–84%): Demonstrates advanced knowledge and understanding of data pre-processing and feature selection. Completed all of the following tasks with accuracy and completeness to a high quality:
• Handling data anomalies;
• Conducting the redundancy and correlation analysis;
• Selecting the feature for model building; and
• Correctly interpreted all 3 of the above tasks.

High Distinction (85–100%): Demonstrates exceptional knowledge and understanding of data pre-processing and feature selection. Completed all of the following tasks with accuracy and completeness to an exceptionally high quality:
• Handling data anomalies;
• Conducting the redundancy and correlation analysis;
• Selecting the feature for model building;
• Correctly interpreted all 3 of the above tasks; and
• Relevant analytical insights were presented as part of the interpretation.

Predictive model building (20%)

Fail (0–49%): Demonstrates partial or unsatisfactory knowledge and understanding of predictive model building. Completed less than 50% of the following tasks and the tasks completed were unsatisfactory in terms of quality, accuracy and completeness:
• Appropriately used the data for training, validation and testing;
• Built a ‘decision-tree’ model using Spark’s MLlib library;
• Graphically represented the decision-tree model; and/or
• Correctly interpreted the decision-tree model.

Pass (50–64%): Demonstrates satisfactory knowledge and understanding of predictive model building. Completed most of the following tasks with accuracy and completeness to a satisfactory quality:
• Appropriately used the data for training, validation and testing;
• Built a ‘decision-tree’ model using Spark’s MLlib library;
• Graphically represented the decision-tree model;

Credit (65–74%): Demonstrates solid knowledge and understanding of predictive model building. Completed most of the following tasks with accuracy and completeness to a good quality:
• Appropriately used the data for training, validation and testing;
• Built a ‘decision-tree’ model using Spark’s MLlib library;
• Graphically represented the decision-tree model; and/or
• Produced an ambiguous interpretation of the decision-tree model.

Distinction (75–84%): Demonstrates advanced knowledge and understanding of predictive model building. Completed all of the following tasks with accuracy and completeness to a high quality:
• Appropriately used the data for training, validation and testing;
• Built a ‘decision-tree’ model using Spark’s MLlib library;
• Graphically represented the decision-tree model; and
• Correctly interpreted the decision-tree model.

High Distinction (85–100%): Demonstrates exceptional knowledge and understanding of predictive model building. Completed all of the following tasks with accuracy and completeness to an exceptionally high quality:
• Appropriately used the data for training, validation and testing;
• Built a ‘decision-tree’ model using Spark’s MLlib library;
• Graphically represented the decision-tree model; and
• Correctly interpreted the decision-tree model.
• Discovered unique observations through the interpretation of the model.

Clarity and presentation of the notebook (10%)

Fail (0–49%):
• Lacks overall organisation.
• Codes are documented unsatisfactorily.
• Charts and graphs are of unsatisfactory quality.
• Narrative texts difficult to follow.

Pass (50–64%):
• Not well organised for the most part.
• Codes are documented satisfactorily.
• Charts and graphs are of satisfactory quality.
• Narrative texts are not cohesive but can still be followed.

Credit (65–74%):
• Organised for the most part.
• Code is very well documented.
• Charts and graphs are neat and are of good quality.
• Narrative texts are mostly cohesive.

Distinction (75–84%):
• Well organised.
• Code is very well documented.
• Charts and graphs are neat and of high quality.
• Narrative texts are highly cohesive and easy to follow.

High Distinction (85–100%):
• Exceptionally organised.
• Code is exceptionally well documented.
• Charts and graphs are neat and of exceptional quality.
• Narrative texts are highly cohesive and easy to follow.

Knowledge and understanding of missing value handling strategy (10%)

Fail (0–49%): Demonstrates partial or unsatisfactory knowledge and understanding of a missing value handling strategy.
• Does not correctly identify the most important attribute from the decision tree.
• The formulated strategies are unsatisfactory in terms of accuracy and completeness.
• The overall organisation and presentation of the report is unsatisfactory.

Pass (50–64%): Demonstrates satisfactory knowledge and understanding of a missing value handling strategy.
• Correctly identifies the most important attribute from the decision tree.
• The formulated strategies are satisfactorily accurate and complete.
• The overall organisation and presentation of the report is satisfactory.

Credit (65–74%): Demonstrates solid knowledge and understanding of a missing value handling strategy.
• Correctly identifies the most important attribute from the decision tree.
• The formulated strategies are mostly accurate and complete.
• The overall organisation and presentation of the report is good.

Distinction (75–84%): Demonstrates advanced knowledge and understanding of a missing value handling strategy.
• Correctly identifies the most important attribute from the decision tree.
• The formulated strategies are accurate and mostly complete.
• The overall organisation and presentation of the report is exceptionally good.

High Distinction (85–100%): Demonstrates exceptional knowledge and understanding of a missing value handling strategy.
• Correctly identifies the most important attribute from the decision tree.
• The formulated strategies are accurate and complete.
• The overall organisation and presentation of the report is exemplary.

Interpretation of data analysis (30%)

Fail (0–49%): The outcomes and discussions were not focused and missed all the following points:
• A very limited number of outcomes were produced and the related discussions were poor;
• The analysis produced hardly any insights; and
• Any possible performance improvements were entirely missed.

Pass (50–64%): The outcomes and discussions were limited in focus and missed at least two of the following points:
• The outcomes were measured, but the related discussions were only basic;
• The analysis produced some basic insights; and/or
• Possible performance improvements were evaluated at a basic level.

Credit (65–74%): The outcomes and discussions were focused and missed at least one of the following points:
• The outcomes were measured and the related discussions were solid;
• The analysis produced some solid insights; and/or
• Possible performance improvements were evaluated.

Distinction (75–84%): The outcomes and discussions were mostly focused and included all of the following points:
• The outcomes were measured and the related discussions were advanced;
• The analysis produced advanced insights; and
• Possible performance improvements were mostly evaluated.

High Distinction (85–100%): The outcomes and discussions were well focused and included all of the following points:
• The outcomes were measured and the related discussions were exceptional;
• The analysis produced thought-provoking insights; and
• Possible performance improvements were fully and correctly evaluated.

The following Subject Learning Outcomes are addressed in this assessment 

SLO c) Apply data science principles to the cleaning, manipulation and visualisation of data.

SLO d) Design analytical models based on a given problem.

SLO e) Effectively report and communicate findings to an appropriate audience.