CB9165 – Big Data Analytics and Visualisation: Individual project instruction


1. Assessment structure

The individual project accounts for 80% of the module grade. Please choose one data set from the list below for your project. The report should be 2000 words long, excluding references. All the relevant literature and resources for your project should be properly cited in the Harvard referencing style (you will find this website helpful). For your convenience, a template is provided here.

Marks allocated to criteria:

20%  1. Introduction to data and research question (700 words)
Please introduce the data set used and its background. The relevant literature (e.g., academic journal articles and textbooks) should be surveyed and properly cited in the Harvard referencing style. More importantly, please identify a problem to be addressed with this data set (i.e., the research question). Please note that the problem should be specific (i.e., relevant in the application domain and linked to the variables available from the data set).

15%  2. Data processing and exploration (300 words)
Please explain: Which variables are available from the data set? Which variables have been selected for the analysis and why? What data transformations have been done and why?

25%  3. Data visualisation and interpretation (600 words)
Please provide at least three data visualisations as descriptive analytical results (e.g., properties of the variables selected) and advanced analytical results (e.g., relationships between the variables selected, machine learning results). Please follow the best practices taught in the module regarding data visualisation. Importantly, please interpret the results and findings in detail. Note that the data visualisations should be nontrivial representations of information, yet easy to interpret.

20%  4. Data insights and conclusions (400 words)
Please provide the insights drawn from the analytics and summarise the findings. In particular, is the problem (i.e., research question) identified at the beginning addressed by the analytics? How?

20%  5. Writing, styling and references
The clarity, logic and presentation of the report, including spelling, grammar and punctuation. The general styling and references should be clear and consistent.

2. Recommended data sets

Please find a list of recommended data sets below. All of them have significant textual content (a major type of unstructured data). Therefore, text analytics tools should be employed. Your analysis could build on existing code shared by the online community (e.g., from Kaggle.com). If so, please cite the original sources (links or relevant publications) properly in the Harvard referencing style.

• [Business] Amazon review data: This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs), covering 29 product categories from Amazon. You may focus on a single category for your individual project; please don’t use software or magazine, as they were already used as examples in class. Please note that you will be asked to complete a short form regarding proper use of the data when downloading it for the first time.
• [Society] COVID19 tweets: The tweets have the #covid19 hashtag. Collection started on 25/7/2020, with an initial 17k batch.
• [Finance] Daily news for stock market prediction: A combination of historical news headlines from the Reddit WorldNews Channel and stock data of the Dow Jones Industrial Average (DJIA).
• [Society] US Election 2020 Tweets: Tweets containing the hashtags of the candidates’ names, collected during the election period.
• [Business] Women's e-commerce clothing reviews: This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers.

You may choose another data set not listed here. If so, please contact the module convenor for approval before conducting the project.

[Project title]

1. Introduction to data and research question [700 words]
[Please introduce the data set used and its background. The relevant literature (e.g., academic journal articles and textbooks) should be surveyed and properly cited in the Harvard referencing style. More importantly, please identify a problem to be addressed with this data set (i.e., the research question). Please note that the problem should be specific (i.e., relevant in the application domain and linked to the variables available from the data set).]

2. Data processing and exploration [300 words]
[Please explain: Which variables are available from the data set? Which variables have been selected for the analysis and why? What data transformations have been done and why?]

3. Data visualisation and interpretation [600 words]
[Please provide at least three data visualisations as descriptive analytical results (e.g., properties of the variables selected) and advanced analytical results (e.g., relationships between the variables selected, machine learning results). Please follow the best practices taught in the module regarding data visualisation. Importantly, please interpret the results and findings in detail. Note that the data visualisations should be nontrivial representations of information, yet easy to interpret.]

4. Data insights and conclusions [400 words]
[Please provide the insights drawn from the analytics and summarise the findings. In particular, is the problem (i.e., research question) identified at the beginning addressed by the analytics? How?]

5. References
[Please add a list of references here corresponding to the citations used in the text above. Again, please follow the Harvard referencing style. References will NOT be included in the word count.]

CB9165 – Big Data Analytics and Visualisation
Lecture 1 (Week 13)
Topic 1 – Introduction
Dr Zhen Zhu
Z.Zhu@kent.ac.uk

Agenda of Lecture 1
1. Module expectations
2. Definition of big data

1. Module expectations: a curious mind

1. Module expectations: theoretical side
HDFS, YARN
LBB (2018)

1. Module expectations: engineering side
SQL, NoSQL…
Hadoop, MapReduce…
HDFS, HBase, MongoDB…

1. Module expectations: practical hands

1. Module expectations
2012, by Matt Turck
https://mattturck.com/a-chart-of-the-big-data-ecosystem/

1. Module expectations
2020, by Matt Turck
https://mattturck.com/data2020/

1. Module expectations
Simon Walkowiak (2016) Big Data Analytics with R. Packt Publishing.
Code available @ https://github.com/PacktPublishing/BigData-Analytics-with-R

1. Module expectations
Wilfried Lemahieu, Seppe vanden Broucke, Bart Baesens (2018) Principles of Database Management. Cambridge University Press.
Slides available @ https://www.pdbmbook.com/

2. Definition of big data
How big is too big?
1 KB (kilobyte) = 1000^1 B
1 MB (megabyte) = 1000^2 B
1 GB (gigabyte) = 1000^3 B
1 TB (terabyte) = 1000^4 B
1 PB (petabyte) = 1000^5 B
1 EB (exabyte) = 1000^6 B
1 ZB (zettabyte) = 1000^7 B

2. Definition of big data: 3Vs
VOLUME, VELOCITY, VARIETY

2. Definition of big data: 3Vs
VOLUME, VERACITY, VELOCITY, VALENCE, VARIETY, VALUE

Recap of Lecture 1
1. Module expectations: a curious mind + practical hands
2. Definition of big data: 3Vs

CB9165 – Big Data Analytics and Visualisation
Lecture 2 (Week 14)
Topic 1 – Introduction
Dr Zhen Zhu
Z.Zhu@kent.ac.uk

Agenda of Lecture 2
1. Where is big data from?
2. Why now?
3. Analytical process

1. Where is big data from?
business, human, machine
By HITACHI
https://www.slideshare.net/hdscorp/capitalize-on-big-data-through-hitachi-innovation

1. Where is big data from? Structured vs. unstructured
structured, unstructured
20% 10%

… RAM
2. R is slow compared to other languages
• Acceptable at small-scale computations
• Generally lags behind C and even Python

3. R solutions from within
1. Data must fit within the available RAM
• Within-R solution: RAM-HDD mapping
• R packages such as ff, ffbase and ffbase2
2. R is slow compared to other languages
• Within-R solution: parallel computing
• R packages such as parallel, foreach and doParallel

3. R solutions from within: RAM-HDD mapping
• Import the data with ff (import by chunks) and time the process.
• Import the data as usual and time the process. The usual way is faster.
• Where is the advantage of using ff? Much less RAM is used with ff: 0.3% of the original size in this example!

3. R solutions from within: RAM-HDD mapping
• Manipulation with ffbase, timing the process.
• Manipulation as usual, timing the process. The usual way is faster.
• The resulting object is the same, but much less RAM is used with ff.
• Further functions are available from ffbase.
• Operations as usual.
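The trade-off described above (slower processing in exchange for a much smaller RAM footprint) can be illustrated with a minimal sketch. It is written in Python rather than R and does not use or reproduce the ff package; it only mimics the "import by chunks" idea by streaming a CSV file from disk so that a single chunk is held in memory at a time. The file layout, the column name `value` and all function names are assumptions made for illustration.

```python
import csv
import os
import tempfile
from itertools import islice


def iter_chunks(path, chunk_size):
    """Yield lists of at most chunk_size rows (as dicts) read from a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def chunked_mean(path, column, chunk_size=1000):
    """Mean of one numeric column, holding only one chunk in RAM at a time."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(path, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")


def write_demo_csv(n_rows):
    """Write a small, made-up data set to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        for i in range(n_rows):
            writer.writerow([i, i % 10])
    return path


if __name__ == "__main__":
    demo_path = write_demo_csv(10_000)
    try:
        print(chunked_mean(demo_path, "value", chunk_size=500))
    finally:
        os.remove(demo_path)
```

As with ff, the chunked version is usually slower than loading everything at once, but its memory use is bounded by `chunk_size` rather than by the size of the file.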
The start of Topic 2: Big Data Storage

Recap of Lecture 3
1. Memory and storage basics: workbench (RAM) & shelves (HDD)
2. R limitations: data must fit within RAM & slow computing
3. R solutions from within: RAM-HDD mapping & parallel computing

CB9165 – Big Data Analytics and Visualisation
Lecture 4 (Week 16)
Topic 2 – Big Data Storage
Main Reference: Walkowiak (ch. 5), LBB (ch. 1,2,6,7)
Dr Zhen Zhu
Z.Zhu@kent.ac.uk

Agenda of Lecture 4
1. Fundamental concepts of database
2. SQL examples

1. Fundamental concepts of DB: Key definitions
A database can be defined as a collection of related data items within a specific business process or problem setting.
• It has a target group of users and applications.
A Database Management System (DBMS) is the software package used to define, create, use and maintain a database.
• It consists of several software modules.
The combination of a DBMS and a database is then often called a database system.
Ch. 1, LBB (2018)

1. Fundamental concepts of DB: File vs. database
File approach
• Invoicing: CustomerNr, CustomerName, VATcode
• CRM: CustomerNr, CustomerName, Turnover
• GIS: CustomerNr, CustomerName, ZipCode
Duplicate data!
Ch. 1, LBB (2018)

1. Fundamental concepts of DB: File vs. database
File approach
• Duplicate or redundant information will be stored
• Danger of inconsistent data
• Strong coupling between applications and data
• Hard to manage concurrency control
• Hard to integrate applications providing cross-company services
Ch. 1, LBB (2018)

1. Fundamental concepts of DB: File vs. database
Database approach
[Diagram labels: Invoicing, CRM, GIS; DBMS; raw data, catalog]
Ch. 1, LBB (2018)

1. Fundamental concepts of DB: File vs. database
Database approach
• Superior in terms of efficiency, consistency and maintenance
• Loose coupling between applications and data
• Facilities provided for data querying and retrieval
Ch. 1, LBB (2018)

1. Fundamental concepts of DB: File vs. database
File approach:
    Procedure FindCustomer;
    begin
      open file Customer.txt;
      Read(Customer)
      While not EOF(Customer)
        If Customer.name='Bart' then
          display(Customer);
        EndIf
        Read(Customer);
      EndWhile;
    End;
Database approach:
    SELECT *
    FROM Customer
    WHERE name = 'Bart'
Ch. 1, LBB (2018)

1. Fundamental concepts of DB: Data model
A conceptual data model provides a high-level description of the data items with their characteristics and relationships.
• It is usually represented using an Enhanced Entity Relationship (EER) model.
A logical data model is a translation or mapping of the conceptual data model towards a specific implementation environment.
• It can be a hierarchical, relational, or NoSQL model.
The logical data model can be mapped to an internal data model that represents the data’s physical storage details.
• It clearly describes which data is stored where.
An external data model contains various subsets of the data items in the logical model, also called views, tailored towards the needs of specific applications or groups of users.
Ch. 1, LBB (2018)

1. Fundamental concepts of DB: 3-layer architecture
Framework
Ch. 1, LBB (2018)

1. Fundamental concepts of DB: 3-layer architecture
A business example
Ch. 1, LBB (2018)

1. Fundamental concepts of DB: Database languages
Data Definition Language (DDL) is used by the DBA to express the database's external, logical and internal data models.
• The definitions are stored in the catalog.
Data Manipulation Language (DML) is used to retrieve, insert, delete, and modify data.
• DML statements can be embedded in a programming language, or entered interactively through a front-end querying tool.
Structured Query Language (SQL) offers both DDL and DML statements for relational database systems.
Ch. 1, LBB (2018)
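The contrast between the file approach and the database approach, and the split between DDL and DML, can be tried out with a short sketch. It uses Python's built-in sqlite3 module and an in-memory database purely for illustration; the table layout and the three customer rows are made up, and only the query itself (`SELECT * FROM Customer WHERE name = 'Bart'`) comes from the slides.

```python
import sqlite3

# Hypothetical records: (CustomerNr, name, ZipCode)
CUSTOMERS = [
    (1, "Bart", "3000"),
    (2, "Anna", "1000"),
    (3, "Chen", "9000"),
]


def find_in_file_style(records, name):
    """File approach: the application scans every record itself."""
    matches = []
    for record in records:  # plays the role of the Read / While-not-EOF loop
        if record[1] == name:
            matches.append(record)
    return matches


def find_with_sql(records, name):
    """Database approach: state what is wanted and let the DBMS retrieve it."""
    conn = sqlite3.connect(":memory:")
    try:
        # DDL: define the table structure.
        conn.execute(
            "CREATE TABLE Customer ("
            "CustomerNr INTEGER PRIMARY KEY, "
            "name TEXT NOT NULL, "
            "ZipCode TEXT)"
        )
        # DML: insert the data, then query it.
        conn.executemany("INSERT INTO Customer VALUES (?, ?, ?)", records)
        cursor = conn.execute("SELECT * FROM Customer WHERE name = ?", (name,))
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(find_in_file_style(CUSTOMERS, "Bart"))
    print(find_with_sql(CUSTOMERS, "Bart"))
```

Both functions return the same rows. The difference is where the search logic lives: in the application code for the file approach, and inside the DBMS for the database approach, which is what makes the coupling between applications and data looser.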
2. SQL examples: DDL
Column constraints
• PRIMARY KEY constraint defines the primary key of the table
• FOREIGN KEY constraint defines a foreign key of a table
• UNIQUE constraint defines an alternative key of a table
• NOT NULL constraint prohibits NULL values for a column
• DEFAULT constraint sets a default value for a column
• CHECK constraint defines a constraint on the column values
Ch. 7, LBB (2018)

2. SQL examples: DDL
DROP & ALTER
• The DROP command can be used to drop or remove database objects; it can be combined with CASCADE and RESTRICT
• The ALTER statement can be used to modify table column definitions
Ch. 7, LBB (2018)

2. SQL examples: DML
SELECT, INSERT, DELETE, UPDATE
    SELECT component
    FROM component
    [WHERE component]
    [GROUP BY component]
    [HAVING component]
    [ORDER BY component]
Ch. 7, LBB (2018)

Recap of Lecture 4
1. Fundamental concepts of database: file vs. database approach; data model, 3-layer architecture; database languages
2. SQL examples: DDL & DML

CB9165 – Big Data Analytics and Visualisation
Lecture 5 (Week 17)
Topic 3 – Big Data Processing
Main Reference: Walkowiak (ch. 4,7), LBB (ch. 19)
Dr Zhen Zhu
Z.Zhu@kent.ac.uk

The end of Topic 2: Big Data Storage
The start of Topic 3: Big Data Processing

Agenda of Lecture 5
1. Hadoop
2. The Hadoop stack
3. Apache Spark

1. Hadoop
• Open-source software framework used for distributed storage and processing of big data sets
• Can be set up over a cluster of computers built from normal, commodity hardware
• Many vendors offer their implementation of a Hadoop stack (e.g. Amazon, Cloudera, Dell, Oracle, IBM, Microsoft)
Ch. 19, LBB (2018)
1. Hadoop: History
Key building blocks:
• Google File System: a file system that could be easily distributed across commodity hardware, whilst providing fault tolerance
• Google MapReduce: a programming paradigm to write programs that can be automatically parallelised and executed across a cluster of different computers
Nutch web crawler prototype developed by Doug Cutting
• Later renamed to Hadoop
In 2008, Yahoo! open-sourced Hadoop as “Apache Hadoop”
Ch. 19, LBB (2018)

1. Hadoop: The Hadoop stack
Four modules:
• Hadoop Common: a set of shared programming libraries used by the other modules
• Hadoop Distributed File System (HDFS): a Java-based file system to store data across multiple machines
• MapReduce framework: a programming model to process large sets of data in parallel
• YARN (Yet Another Resource Negotiator): handles the management and scheduling of resource requests in a distributed environment
Ch. 19, LBB (2018)

2. The Hadoop stack: HDFS
• Distributed file system to store data across a cluster of commodity machines
• High emphasis on fault tolerance
• An HDFS cluster is composed of a NameNode and various DataNodes
Ch. 19, LBB (2018)

2. The Hadoop stack: HDFS
NameNode
• a server which holds all the metadata regarding the stored files
• manages incoming file system operations
• maps data blocks (parts of files) to DataNodes
DataNodes
• handle file read and write requests
• create, delete and replicate data blocks amongst their disk drives
• continuously loop, asking the NameNode for instructions
Ch. 19, LBB (2018)

2. The Hadoop stack: HDFS
An illustration of HDFS
Ch. 19, LBB (2018)
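The division of labour between the NameNode (metadata only) and the DataNodes (the actual blocks) can be made concrete with a toy simulation. This is a sketch in plain Python and not the Hadoop API: the class name, the tiny block size, and the round-robin placement of replicas are all simplifying assumptions, and real HDFS placement and recovery are considerably more involved.

```python
from itertools import cycle


class ToyNameNode:
    """Keeps only metadata: which DataNodes hold each block of each file."""

    def __init__(self, datanodes, block_size=4, replication=2):
        if replication > len(datanodes):
            raise ValueError("replication cannot exceed the number of DataNodes")
        # Each simulated DataNode is a dict of block_id -> bytes.
        self.datanodes = {name: {} for name in datanodes}
        self.block_size = block_size
        self.replication = replication
        self.block_map = {}  # filename -> [(block_id, [node names])]
        self._next_node = cycle(datanodes)

    def write(self, filename, data):
        """Split data into blocks and store each block on several DataNodes."""
        placements = []
        for offset in range(0, len(data), self.block_size):
            block_id = f"{filename}#{offset // self.block_size}"
            block = data[offset:offset + self.block_size]
            targets = []
            while len(targets) < self.replication:
                node = next(self._next_node)  # simplistic round-robin placement
                if node not in targets:
                    targets.append(node)
            for node in targets:
                self.datanodes[node][block_id] = block
            placements.append((block_id, targets))
        self.block_map[filename] = placements

    def read(self, filename, failed=()):
        """Reassemble a file, ignoring DataNodes that are listed as failed."""
        parts = []
        for block_id, nodes in self.block_map[filename]:
            alive = [n for n in nodes if n not in failed]
            if not alive:
                raise IOError(f"all replicas of {block_id} are unavailable")
            parts.append(self.datanodes[alive[0]][block_id])
        return b"".join(parts)


if __name__ == "__main__":
    namenode = ToyNameNode(["dn1", "dn2", "dn3"])
    namenode.write("notes.txt", b"hello big data")
    print(namenode.block_map["notes.txt"])
    print(namenode.read("notes.txt", failed=("dn1",)))
```

Because every block is stored on two nodes, the file can still be read when one DataNode is unavailable, which is the fault-tolerance idea emphasised in the slides.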
2. The Hadoop stack: MapReduce
• Programming paradigm made popular by Google and subsequently implemented by Apache Hadoop
• Focus on scalability and fault tolerance
• A map-reduce pipeline starts from a series of values and maps each value to an output using a given mapper function
Ch. 19, LBB (2018)

2. The Hadoop stack: MapReduce
• A MapReduce pipeline in Hadoop starts from a list of key-value pairs, and maps each pair to one or more output elements
• The output elements are also key-value pairs
• Next, the output entries are grouped so that all output entries belonging to the same key are assigned to the same worker (e.g. a physical machine)
• These workers then apply the reduce function to each group, producing a new list of key-value pairs
• The resulting final outputs can then be sorted
Ch. 19, LBB (2018)

2. The Hadoop stack: MapReduce
An example of MapReduce
Ch. 4, Walkowiak (2016)

2. The Hadoop stack: YARN
Yet Another Resource Negotiator (YARN) distributes a MapReduce program across different nodes and takes care of coordination.
Three important services:
• ResourceManager: a global YARN service that receives and runs applications (e.g., a MapReduce job) on the cluster
• JobHistoryServer: keeps a log of all finished jobs
• NodeManager: responsible for overseeing resource consumption on a node
Ch. 19, LBB (2018)

2. The Hadoop stack: YARN
An illustration of YARN
Ch. 19, LBB (2018)
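The MapReduce pipeline described in this section (map each key-value pair, group the outputs by key, reduce each group, sort the results) can be sketched in a few lines of plain Python. This is a single-machine illustration and not Hadoop code: the word-count task, the function names and the sample lines are assumptions chosen for the example, and nothing here is distributed across workers.

```python
from collections import defaultdict


def mapper(key, value):
    """Map one (line_number, line) pair to zero or more (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reducer(key, values):
    """Reduce all counts that share the same word into one (word, total) pair."""
    yield key, sum(values)


def map_reduce(pairs, mapper, reducer):
    """Run map, group-by-key (shuffle), reduce and sort on a single machine."""
    # Map phase: every input pair may produce several output pairs.
    mapped = []
    for key, value in pairs:
        mapped.extend(mapper(key, value))

    # Shuffle phase: group output values by key, as if sending each key
    # to the worker responsible for it.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: apply the reducer to each group.
    reduced = []
    for key, values in groups.items():
        reduced.extend(reducer(key, values))

    # Final outputs can then be sorted.
    return sorted(reduced)


if __name__ == "__main__":
    lines = [(0, "big data is big"), (1, "data is everywhere")]
    print(map_reduce(lines, mapper, reducer))
```

In a real cluster the map and reduce calls would run on different nodes, with YARN handling scheduling and coordination; the grouping step is what guarantees that all values for one key reach the same reducer.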
3. Apache Spark
• Open-source alternative to MapReduce
• New programming paradigm centred on a data structure called the resilient distributed dataset (RDD), which can be distributed across a cluster of machines and is maintained in a fault-tolerant way
• RDDs enable the construction of iterative programs that have to visit a data set multiple times, as well as more interactive or exploratory programs
• 10 – 100 times faster than MapReduce implementations
• Rapidly adopted by many Big Data vendors
Ch. 7, LBB (2018)

Recap of Lecture 5
1. Hadoop: the elephant in the room
2. The Hadoop stack: four modules (Hadoop Common, HDFS, MapReduce, YARN)
3. Apache Spark: faster than MapReduce

CB9165 – Big Data Analytics and Visualisation
Lecture 6 (Week 18)
Topic 3 – Big Data Processing
Main Reference: Walkowiak (ch. 8), LBB (ch. 20)
Dr Zhen Zhu
Z.Zhu@kent.ac.uk

Agenda of Lecture 6
1. Data preprocessing
2. Unsupervised machine learning
3. Supervised machine learning

1. Data preprocessing: Data types
Continuous
• Defined on a continuous interval
• For example: income, sales, RFM variables
Categorical
• Nominal: no ordering between values. For example: marital status
• Ordinal: implicit ordering between values. For example: credit rating (AAA > AA, AA > A, …)
• Binary. For example: Amazon review verified or not
Ch. 20, LBB (2018)
1. Data preprocessing: Exploratory analysis
• Histogram [Figure: histogram of Age, binned from 0-5 up to 70-200]
• Descriptive statistics: mean, median, mode, standard deviation, percentile values
Ch. 20, LBB (2018)

2. Unsupervised machine learning
Unsupervised: without target variables
Main methods:
• Association rules
• Sequence rules
• Clustering
Ch. 20, LBB (2018)

2. Unsupervised machine learning
Clustering algorithms
Ch. 20, LBB (2018)

2. Unsupervised machine learning
Hierarchical methods: divisive, agglomerative
Ch. 20, LBB (2018)

2. Unsupervised machine learning
Distance between points: Euclidean, Manhattan
Ch. 20, LBB (2018)

2. Unsupervised machine learning
Distance between clusters: single, complete, average, centroid
Ch. 20, LBB (2018)

2. Unsupervised machine learning
K-means clustering
• Step 1: Select K observations as initial cluster centroids (seeds)
• Step 2: Assign each observation to the cluster that has the closest centroid (for example, in the Euclidean sense)
• Step 3: When all observations have been assigned, recalculate the positions of the K centroids
• Step 4: Repeat until the cluster centroids no longer change
Ch. 20, LBB (2018)

3. Supervised machine learning
Supervised: with target variables
Main methods:
• Linear Regression
• Logistic Regression
• Decision Trees
• Support Vector Machines
• Neural Networks
Ch. 20, LBB (2018)

3. Supervised machine learning
Linear regression
Ch. 20, LBB (2018)

3. Supervised machine learning
Logistic regression (classification)
[Figure: S-shaped logistic curve; vertical axis from 0 to 1, horizontal axis from -7 to 7]
Ch. 20, LBB (2018)
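The four K-means steps listed in the unsupervised learning part of this lecture can be followed line by line in a small sketch. It is written in plain Python with Euclidean distance; the function names, the `seed` and `max_iter` parameters, the handling of empty clusters and the six sample points are assumptions added for illustration, not part of the lecture material.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points given as tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, assignment) for a list of points given as tuples."""
    rng = random.Random(seed)
    # Step 1: select K observations as initial cluster centroids (seeds).
    centroids = rng.sample(points, k)
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign each observation to the cluster with the closest centroid.
        assignment = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Step 3: recalculate the positions of the K centroids.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: repeat until the cluster centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    final_centroids, labels = k_means(data, k=2)
    print(final_centroids)
    print(labels)
```

The result depends on the initial seeds chosen in Step 1, which is why the sketch exposes a `seed` parameter; on well-separated data such as the sample above, different seeds lead to the same two groups.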
Recap of Lecture 6
1. Data preprocessing: data types (continuous/categorical); exploratory analysis
2. Unsupervised machine learning: clustering, hierarchical (divisive/agglomerative) and nonhierarchical (k-means)
3. Supervised machine learning: linear regression (continuous target variable); logistic regression (categorical target variable; a classification method)

CB9165 – Big Data Analytics and Visualisation
Lecture 7 (Week 20)
Topic 4 – Big Data Visualisation
Main Reference: Knaflic (ch. 1,2)
Dr Zhen Zhu
Z.Zhu@kent.ac.uk

The end of Topic 3: Big Data Processing
The start of Topic 4: Big Data Visualisation

Agenda of Lecture 7
1. Introduction to data visualisation
2. The importance of context (lesson 1)
3. Choosing an effective visual (lesson 2)

1. Introduction to data visualisation
Knaflic (2015) Storytelling with Data. Wiley.
Companion website @ http://www.storytellingwithdata.com/

1. Introduction to data visualisation
Example 1: original and improved
Intro., Knaflic (2015)

1. Introduction to data visualisation
Example 2: original and improved
Intro., Knaflic (2015)

1. Introduction to data visualisation
Six lessons
1. Understand the context
2. Choose an appropriate visual display
3. Eliminate clutter
4. Focus attention where you want it
5. Think like a designer
6. Tell a story
Intro., Knaflic (2015)

2. The importance of context
Exploratory vs. explanatory analysis
Exploratory analysis
• The process of figuring out what might be noteworthy
• Hunting for pearls in oysters (e.g. finding 2 pearls in 100 oysters)
Explanatory analysis
• Explain a specific thing and tell a specific story (e.g. the 2 pearls)
• Data visualisation should present explanatory analysis
• Show the 2 pearls, not the 100 oysters!
Ch. 1, Knaflic (2015)

2. The importance of context
Communication mechanism continuum
Ch. 1, Knaflic (2015)

2. The importance of context
Who, what and how: example
Imagine you are a fourth grade science teacher. You just wrapped up an experimental pilot summer learning programme on science that was aimed at giving kids exposure to the unpopular subject. You surveyed the children at the onset and end of the programme to understand whether and how perceptions toward science changed. You believe the data shows a great success story. You would like to continue to offer the summer learning programme on science going forward…
Ch. 1, Knaflic (2015)

2. The importance of context
Who, what and how: example
• Who: The budget committee that can approve funding for continuation of the summer learning programme.
• What: The summer learning programme on science was a success; please approve a budget of £X to continue.
• How: Illustrate success with data collected through the survey conducted before and after the pilot programme.
Ch. 1, Knaflic (2015)

3. Choosing an effective visual
Visual types
Ch. 2, Knaflic (2015)

3. Choosing an effective visual
Text: original and improved
Ch. 2, Knaflic (2015)

3. Choosing an effective visual
Tables: minimise borders
Ch. 2, Knaflic (2015)

3. Choosing an effective visual
Points: plain and highlighted
Ch. 2, Knaflic (2015)

3. Choosing an effective visual
Lines
Ch. 2, Knaflic (2015)
3. Choosing an effective visual
Lines: plain and highlighted
Ch. 2, Knaflic (2015)

3. Choosing an effective visual
Bars: a bad-practice example, followed by further examples
Ch. 2, Knaflic (2015)

3. Choosing an effective visual
Area: bad-practice examples
Ch. 2, Knaflic (2015)

Recap of Lecture 7
1. Introduction to data visualisation: six lessons
2. The importance of context (lesson 1): exploratory vs. explanatory; who, what and how
3. Choosing an effective visual (lesson 2): text, tables, points, lines, bars, area

CB9165 – Big Data Analytics and Visualisation
Lecture 8 (Week 21)
Topic 4 – Big Data Visualisation
Main Reference: Knaflic (ch. 1,2)
Dr Zhen Zhu
Z.Zhu@kent.ac.uk

Agenda of Lecture 8
1. Clutter is your enemy! (lesson 3)
2. Focus your audience’s attention (lesson 4)

1. Clutter is your enemy!
Cognitive load
• The mental effort required to learn new information
• Closely related to short-term memory
The data-ink or signal-to-noise ratio
• Tufte: “The larger share of a graphic’s ink devoted to data, the better.”
• Duarte: “Maximising the signal-to-noise ratio.”
Ch. 3, Knaflic (2015)

1. Clutter is your enemy!
An example: maximise the data-ink ratio by minimising borders
Ch. 3, Knaflic (2015)

1. Clutter is your enemy!
Gestalt principles of visual perception
• Originated from the Gestalt School of Psychology in the early 1900s
• Help to understand how people perceive order in the world around them
Six principles: proximity, similarity, enclosure, closure, continuity, connection
Ch. 3, Knaflic (2015)

1. Clutter is your enemy!
Illustrations of each Gestalt principle: proximity, similarity, enclosure, closure, continuity, connection
Ch. 3, Knaflic (2015)

1. Clutter is your enemy!
An example of decluttering:
• Remove chart border
• Remove gridlines
• Remove point markers
• Tidy up axis labels
• Label data directly
• Leverage consistent colour
• Before vs. after
Ch. 3, Knaflic (2015)

2. Focus your audience’s attention
A simplified picture of how you see
Ch. 4, Knaflic (2015)

2. Focus your audience’s attention
Iconic memory
• It happens super fast without you consciously realising it
• Implication for visualisation: use preattentive attributes
Short-term memory
• We can only process limited information with short-term memory
• Implication for visualisation: reduce cognitive burden
Long-term memory
• Works better with both visual and verbal hints
• Implication for visualisation: combine visual with verbal
Ch. 4, Knaflic (2015)

2. Focus your audience’s attention
An example: count the 3s
756395068473
658663037576
860372658602
846589107830
Ch. 4, Knaflic (2015)

2. Focus your audience’s attention
More examples of preattentive attributes
Ch. 4, Knaflic (2015)

2. Focus your audience’s attention
Create a visual hierarchy of information with preattentive attributes
Ch. 4, Knaflic (2015)

2. Focus your audience’s attention
Another example: Top 10 design concerns
Engine power is less than expected        12.9
Tires make excessive noise while…         12.3
Engine makes abnormal/excessive…          11.6
Seat material concerns                    11.6
Excessive wind noise                      11.0
Hesitation or delay when shifting         10.3
Bluetooth system has poor sound…          10.0
Steering system/wheel has too much…        8.8
Bluetooth system is difficult to use       8.6
Front seat…                                8.2
Ch. 4, Knaflic (2015)
Focus your audience’s attention Another example: Top 10 design concerns Engine power is less than expected 12.9 Tires make excessive noise while… 12.3 Engine makes abnormal/excessive… 11.6 Seat material concerns 11.6 Excessive wind noise 11.0 Hesitation or delay when shifting 10.3 Bluetooth system has poor sound… Complaints about engine noise commonly cited after the car had not been driven for a while. 10.0 Steering system/wheel has too much… 8.8 Bluetooth system is difficult to use 8.6 Front seat… Comments indicate that noisy tire issues are most apparent in the rain. 8.2 Excessive wind noise is noted primarily in freeway driving at high speeds. Ch. 4, Knaflic (2015) 2. Focus your audience’s attention Revisit the tickets example 300 250 200 Received 150 Processed 100 50 0 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Ch. 4, Knaflic (2015) 2. Focus your audience’s attention Revisit the tickets example 300 250 200 Received 150 Processed 100 50 0 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Ch. 4, Knaflic (2015) 2. Focus your audience’s attention Revisit the tickets example 300 250 200 Received 150 Processed 100 50 0 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Ch. 4, Knaflic (2015) 2. Focus your audience’s attention Revisit the tickets example 300 241 250 202 237 184 200 160 184 150 160 180 149 148 177 161 181 160 139 132 156 150 140 123 100 Received 149 126 Processed 124 104 50 0 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Ch. 4, Knaflic (2015) 2. Focus your audience’s attention Revisit the tickets example 300 250 202 200 177 160 139 150 Received 149 156 140 126 100 Processed 124 104 50 0 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Ch. 4, Knaflic (2015) Recap of Lecture 8 1. Clutter is your enemy! (lesson 3) 2. Focus your audience’s attention (lesson 4) Recap of Lecture 8 1. Clutter is your enemy! (lesson 3) The data-ink or signal-to-noise ratio Gestalt principles of visual perception 2. 
Focus your audience’s attention (lesson 4) Iconic, short-term, long-term memories Preattentive attributes CB9165 – Big Data Analytics and Visualisation Lecture 9 (Week 22) Topic 4 – Big Data Visualisation Main Reference: Knaflic (ch. 5,7,8) Dr Zhen Zhu Z.Zhu@kent.ac.uk Recap of Lecture 8 1. Clutter is your enemy! (lesson 3) 2. Focus your audience’s attention (lesson 4) Recap of Lecture 8 1. Clutter is your enemy! (lesson 3) The data-ink or signal-to-noise ratio Gestalt principles of visual perception 2. Focus your audience’s attention (lesson 4) Iconic, short-term, long-term memories Preattentive attributes Agenda of Lecture 9 1. Think like a designer (lesson 5) 2. Tell a story (lesson 6) 3. Pulling it all together 1. Think like a designer 1. Think like a designer Affordances The aspects inherent to the design that make it obvious how the product is to be used. Examples: A knob affords turning, a button affords pushing Visualisation implications: • Highlight the important stuff • Eliminate distractions • Create a clear hierarchy of information Ch. 5, Knaflic (2015) 1. Think like a designer Affordances Ch. 5, Knaflic (2015) 1. Think like a designer Accessibility Designs should be usable by people of diverse abilities Examples: London underground tube map Visualisation implications: • Don’t overcomplicate • Text is your friend Ch. 5, Knaflic (2015) 1. Think like a designer Accessibility Ch. 5, Knaflic (2015) 1. Think like a designer Aesthetics People perceive more aesthetic designs easier to use than less aesthetic designs – whether they actually are or not Visualisation implications: • Be smart with colour • Pay attention to alignment • Leverage white space Ch. 5, Knaflic (2015) 1. Think like a designer Aesthetics Ch. 5, Knaflic (2015) 2. Tell a story 2. Tell a story The beginning Introduce the plot, build the context for your audience 1. 2. 3. 4. 5. The setting: When and where does the story take place? The main character: Who is driving the action? 
The imbalance: Why is it necessary, what has changed? The balance: What do you want to see happen? The solution: How will you bring about the changes? Ch. 7, Knaflic (2015) 2. Tell a story The middle Convince your audience of the need for action • • • • • • • • Further develop the situation/problem by covering background Incorporate external context or comparison points Give examples that illustrate the issue Include data that demonstrate the problem Articulate what will happen if no action is taken Discuss potential options for addressing the problem Illustrate the benefits of your recommended solution Make it clear to your audience that they can drive action Ch. 7, Knaflic (2015) 2. Tell a story The end A call to action One classical way to end a story is to tie it back to the beginning: • Recapping the problem • The resulting need for action • Reiterating any sense of urgency • Sending your audience off ready to act Ch. 7, Knaflic (2015) 2. Tell a story The power of repetition Ch. 7, Knaflic (2015) 2. Tell a story Horizontal logic Ch. 7, Knaflic (2015) 2. Tell a story Vertical logic Ch. 7, Knaflic (2015) 3. Pulling it all together 3. Pulling it all together Lesson 1: Understand the context Who: VP of Product, the primary decision maker in establishing our product’s price. What: Understand how competitors’ pricing has changed over time and recommend a price range. How: Show average retail price over time for Products A, B, C, D, and E. Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 2: Choose an appropriate display Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 2: Choose an appropriate display Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 2: Choose an appropriate display Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 2: Choose an appropriate display Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 2: Choose an appropriate display Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 3: Eliminate clutter Ch. 
8, Knaflic (2015) 3. Pulling it all together Lesson 4: Draw attention where you want it Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 4: Draw attention where you want it Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 4: Draw attention where you want it Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 5: Think like a designer Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 6: Tell a story Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 6: Tell a story Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 6: Tell a story Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 6: Tell a story Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 6: Tell a story Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 6: Tell a story Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 6: Tell a story Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 6: Tell a story Ch. 8, Knaflic (2015) 3. Pulling it all together Lesson 6: Tell a story Ch. 8, Knaflic (2015) 3. Pulling it all together Before v.s. after Ch. 8, Knaflic (2015) Recap of Lecture 9 1. Think like a designer (lesson 5) 2. Tell a story (lesson 6) 3. Pulling it all together Recap of Lecture 9 1. Think like a designer (lesson 5) Affordances, accessibility, aesthetics 2. Tell a story (lesson 6) The beginning, the middle, the end The power of repetition, horizontal/vertical logic 3. 
Pulling it all together Review the 6 lessons Introduction Big Data Storage R Limitations: Memory Speed Definition: 3Vs Within R Solution Key R Libraries: ff, ffbase parallel Big Data Processing Big Data Solution Hadoop Stack: Common HDFS MapReduce YARN Apache Spark Big Data Visualisation Six Lessons: Context Visual Declutter Attention Designer Story All Together Text Analytics: Corpus Tokenisation Stop Words Sentiment N-Gram Visualisation Key R Libraries: sparklyr Relational DBs SQL Driving Forces: High Demand Open Software Cheap Hardware Key R Libraries: rsqlite Machine Learning: Supervised Unsupervised Key R Libraries: ggplot2 Key R Libraries: tidytext sentimentr # Created in Jan 2021 by Zhen Zhu # CB9165 Big Data Analytics and Visualisation # Seminar 2 (Week 14) # Topic 1 Introduction # 1. Explore the interface of RStudio --------------------------------------------------# # # # # Menu bar Panes/Windows Ctrl and + to zoom in Ctrl and - to zoom out Adjust panes positions # 2. RefresheR -----------------------------------------------------------# Check the location of your working directory getwd() # Change the location if you want to (for easier loading and saving) setwd('../Desktop/spring-2021/CB9165/practiceR/') # But note that wd will be reset to the default every time you restart RStudio. # If you want to change your working directory permanently, from RStudio menu bar go to "Tools" --> "Global Options" to change it. # Use R like a calculator ---#addition 1+1 #multiplication 2*2 #raise to power 3^3 # Data types in R ---#numeric myNum
