RevoU Mini Course — Data Analytics
I have completed RevoU's 7-day mini course on data analytics. I think it is important to be able to understand data, and to understand data we have to learn data analytics, so I took the opportunity to spend 7 days in the RevoU mini course. It was very interesting and gave me a lot of insight.
Data — Big Data
Data — numbers, characters, images or other recorded material that can be processed to support decisions or actions. Data by itself is meaningless, but once processed it can provide information or insight.
Data Types
- Structured data: data that is processed, stored and retrieved in a fixed, neat format. Example: employee details such as job position and salary.
- Unstructured data: data with no specific structure; it often contains free text together with dates, numbers or facts. Example: email.
- Semi-structured data: a combination of structured and unstructured data. Example: CSV and JSON documents.
Qualitative and Quantitative Data
- Qualitative data (classifies objects based on attributes and characteristics, e.g. skin softness)
- The data set is unstructured
- Asks "why?"
- Cannot be computed (because it is not statistical)
- Requires building an initial understanding of the problem
- Quantitative data (data that can be measured and presented numerically, e.g. height or shoe size)
- The data set is structured
- Asks "how much?" or "how many?"
- The data is statistical and contains numbers
- Recommends a course of action at the end
Common Tools for Data
- SQL
- Sheets/Excel
- Python
- Tableau
Big Data — data that is so large, messy and unstructured that it is too complex to handle with traditional methods alone, so extra processing is needed to get value from it.
Why is Big Data important? According to McKinsey (2020), Big Data is very important for 93% of companies.
3V’s of Big Data — Plus 2
Volume -> The amount of data is large
Velocity -> The data moves or updates quickly
Variety -> The data comes in many types, structured and unstructured
Variability -> The data flow is inconsistent and keeps changing
Veracity -> The quality of the data varies
Big Data Implementation in Industry:
- E-Commerce: Consumer data collected from interactions on the e-commerce website can be analyzed to optimize sales strategies.
- Service Applications: Companies can collect data from drivers, consumers and merchants to optimize the application user experience or personalize it for drivers.
Real case example: the GoJek application. At first it did not have the GoFood service, only GoShop. After analyzing the data, GoJek saw that a large share of consumer transactions were food orders, so it created a new feature, GoFood, dedicated to food delivery.
- Financial Industry: Credit scoring is very important in finance, and data is used to speed up the process and manage risk.
- Health Services: Public health data is crucial for improving health services across different places.
- Education: Student data helps improve educational operations, monitor learning performance and preferences, and support distance education.
- Media and Entertainment: Analyzing consumer habits lets a company tailor its strategy to each consumer personally.
Big Data Value Chain (the process that turns Big Data into insights)
- Big Data -> capture raw data
- Data Warehousing (data management & storage) -> store and manage the data; engines include Hadoop and Apache technologies such as Spark
- Data Analytics -> prepare the data for analytics (ETL), run the analytics (real-time analytics, dashboards) and deliver insights (smart decisions, visualization)
Real case: Tokopedia
Opening the app generates activity data that is all captured -> stored in the data source -> loaded into the data warehouse, which processes it into structured data -> visualized and turned into insight by data analysts.
Basic Statistical Parameters
- Measuring frequency -> used when you want to know how often a response occurs.
- Count, percent, frequency
- Sums up how often something happens
- Measuring central tendency -> used when you want to know the average or most common response.
- Mean, median and mode
- Locates the center of the distribution at various points
- Measuring spread or variation -> used to show how the data is spread out; this is useful for seeing how scattered values affect the average and for finding outliers.
- Identifies the distribution of values across an interval
- Range = max point - min point
- Variance or standard deviation = how far observed values differ from the average
- Measuring position -> used to compare values against standard values.
- Percentile ranks, quartile ranks (Q1, Q2, Q3)
- Describes how values fall in relation to each other, based on standard values.
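To make these measures concrete, here is a minimal sketch using Python's built-in statistics module on a small made-up sample (the numbers are only for illustration):

```python
import statistics

# A small made-up sample of shoe sizes (quantitative data).
sizes = [38, 39, 39, 40, 41, 41, 41, 42, 43, 47]

# Frequency: how often each value occurs.
counts = {value: sizes.count(value) for value in set(sizes)}

# Central tendency: mean, median, mode.
mean = statistics.mean(sizes)      # average value
median = statistics.median(sizes)  # middle value
mode = statistics.mode(sizes)      # most frequent value

# Spread: range, variance, standard deviation.
data_range = max(sizes) - min(sizes)
variance = statistics.variance(sizes)
std_dev = statistics.stdev(sizes)

# Position: quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(sizes, n=4)

print(counts, mean, median, mode, data_range, variance, std_dev, q1, q2, q3)
```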
Normal Distribution and Hypothesis Testing
A statistical hypothesis is a claim that is tested against sample data, where the data is modeled as a collection of random variables; many common tests assume the data roughly follow a normal distribution (the bell curve).
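As a rough illustration (not part of the course material), a simple one-sample t-test in Python with SciPy could look like this; the heights and the expected mean of 170 cm are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulate a sample from a roughly normal distribution:
# heights in cm with mean 172 and standard deviation 7.
heights = rng.normal(loc=172, scale=7, size=100)

# Null hypothesis: the true mean height is 170 cm.
t_stat, p_value = stats.ttest_1samp(heights, popmean=170)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (for example below 0.05) would lead us to reject the null hypothesis.
```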
Correlation Analysis
Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together, i.e. how strongly they are related.
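A small sketch of correlation in Python; the advertising and sales numbers are invented purely to show the calculation:

```python
from scipy import stats

# Made-up monthly figures: advertising spend vs. sales.
ad_spend = [10, 12, 15, 17, 20, 22, 25, 30]
sales = [110, 118, 130, 142, 155, 160, 178, 200]

# Pearson's r: close to +1 means the variables rise together,
# close to -1 means one falls as the other rises, near 0 means little linear relation.
r, p_value = stats.pearsonr(ad_spend, sales)
print(f"r = {r:.2f}, p = {p_value:.4f}")
```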
Exploratory Data Analysis (EDA)
- An approach to analyzing a dataset in order to summarize its main characteristics.
- Data visualization in EDA is an initial stage before modeling.
- EDA mainly helps you understand the data before moving on to modeling.
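A minimal EDA sketch with pandas, assuming a recent pandas version and a hypothetical sales.csv file:

```python
import pandas as pd

# Load the dataset (sales.csv is a hypothetical file).
df = pd.read_csv("sales.csv")

# First look at the data.
print(df.head())      # first few rows
print(df.info())      # column types and missing values
print(df.describe())  # mean, std, min, quartiles, max for numeric columns

# Simple checks before modeling.
df.hist(figsize=(10, 6))            # distribution of each numeric column
print(df.corr(numeric_only=True))   # correlations between numeric columns
```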
Personal Bias
Personal bias is a general prejudice or a tendency toward one particular conclusion; it can be unfair because it favors what we already want or believe rather than what the data shows.
Database — SQL
Database -> a collection of information or data stored together; from this database we can later search for or filter the data we want to use.
What is a database? A set of data. Databases support data storage and manipulation and make data management easier. A database stores tables much like Excel does, but we communicate with it using SQL.
Example: a telephone directory uses a database to store people's names, telephone numbers and contact details.
Database types:
- SQL (Relational Database Management System/RDBMS and Online Analytical Processing/OLAP cube/multi-dimensional array)
- NoSQL (key-value, graph, document/file (usually what is meant when talking about a data lake), column storage).
The main difference is that SQL stores data in tables, while NoSQL does not.
Data Warehousing Concept
Relational Database Management System/RDBMS
Example RDBMS -> free and application-based (MySQL, PostgreSQL); free within certain limits and commonly used for analytics (Amazon Redshift, Google BigQuery).
RDBMS is made up of 3 layers:
- Database: the database itself
- Schema : grouping tables
- Table: the place where the data is stored
Analogy: the database is the laptop, a schema is a folder, and a table is a file (Excel, Word, etc.).
How is an RDBMS organized? All tables in an RDBMS have relations, and those relations are described by an ERD (Entity Relationship Diagram).
Primary Key & Relational Database -> a primary key is a unique value that identifies each row in a table and is used to relate that table to other tables (see the sketch below).
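A small sketch of a primary key and a relation between two tables, using Python's built-in sqlite3 module; the table and column names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# employees.id is the primary key; salaries.employee_id refers to it,
# which is exactly the relation an ERD would draw between the two tables.
cur.execute("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        position TEXT
    )
""")
cur.execute("""
    CREATE TABLE salaries (
        employee_id INTEGER REFERENCES employees(id),
        amount INTEGER
    )
""")

cur.execute("INSERT INTO employees VALUES (1, 'Andi', 'Data Analyst')")
cur.execute("INSERT INTO salaries VALUES (1, 9000000)")

# Join the two tables through the key.
cur.execute("""
    SELECT e.name, e.position, s.amount
    FROM employees e JOIN salaries s ON s.employee_id = e.id
""")
print(cur.fetchall())
```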
SQL (Structured Query Language)
What is SQL? SQL (Structured Query Language) is the language used to communicate with the Relational Database Management System.
Why do you need SQL? It lets you communicate with the database, manage millions of rows of data, and it is easy to learn and use.
Basic SQL Query
- SELECT statement: used to retrieve data.
Syntax: SELECT ColumnName FROM TableName; or SELECT * FROM TableName; (to print all table contents)
- WHERE statement: used to filter rows.
Syntax: SELECT ColumnName FROM TableName WHERE condition;
- ORDER BY statement: used to sort data from smallest to largest or vice versa.
Syntax: SELECT ColumnName FROM TableName ORDER BY ColumnName ASC|DESC;
Clause order: WHERE comes before ORDER BY (a runnable example follows below).
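To make the syntax runnable end to end, here is a sketch with Python's sqlite3 module; the employees table and its rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (name TEXT, position TEXT, salary INTEGER)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Andi", "Analyst", 9), ("Budi", "Engineer", 12), ("Citra", "Analyst", 10)],
)

# SELECT retrieves columns; WHERE filters rows; ORDER BY sorts the result.
cur.execute(
    "SELECT name, salary FROM employees WHERE position = 'Analyst' ORDER BY salary DESC"
)
print(cur.fetchall())  # [('Citra', 10), ('Andi', 9)]
```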
One of the many platforms you can use is Google BigQuery -> an enterprise data warehouse that runs SQL queries quickly on Google's infrastructure. A nice feature is that BigQuery can flag errors in our queries before we run them.
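For reference, running a query against BigQuery from Python looks roughly like this, assuming the google-cloud-bigquery package is installed and a project with credentials is already configured; the dataset and table names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your configured project and credentials

query = """
    SELECT customer_id, COUNT(*) AS total_orders
    FROM `my_project.my_dataset.orders`  -- hypothetical table
    GROUP BY customer_id
    ORDER BY total_orders DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row["customer_id"], row["total_orders"])
```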
Data Analytics
Data Analytics is how we analyze or process big data to get value from it. Data must be processed before it becomes useful, so data analytics is the process of exploring and processing data to generate insights or solve problems.
Data Analytics is the science of extracting trends, patterns and relevant information from raw data in order to make decisions.
Data Analytics helps make effective decisions for business operations: it can be used to analyze data, generate profit, create new resources and improve managerial performance.
Why data analytics? Analytical skills, especially the ability to process Big Data, are currently in high demand.
What does a data analyst do on a daily basis?
- Clean and Organize raw data.
- Verify the source and relevance of the raw data used.
- Utilize descriptive statistics to provide interpretation of data.
- Identify and analyze trends.
- Create a visual representation of the data.
Data Analytics Cycle (This work is flexible and iterative or agile)
- Discovery -> study the business and assess or appraise existing data sources
- Data preparation -> execute ETL (Extract, Transform & Load); a toy ETL sketch follows this list
- Model planning -> identify the techniques and data needed to understand the relationships between variables
- Model building -> build data sets for testing, training and production
- Communicate results -> identify findings and business value, and create a narrative for stakeholders
- Operationalize -> deliver final reports, briefings, code and technical documents
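A toy ETL sketch in Python, assuming a hypothetical orders.csv with price, quantity and order_date columns:

```python
import sqlite3
import pandas as pd

# Extract: read the raw data.
raw = pd.read_csv("orders.csv")

# Transform: clean and reshape it.
raw = raw.drop_duplicates()
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["revenue"] = raw["price"] * raw["quantity"]

# Load: write the cleaned table into a database for analysis.
conn = sqlite3.connect("analytics.db")
raw.to_sql("orders_clean", conn, if_exists="replace", index=False)
```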
Data Analytics Process
1 — Business Question
This includes finding problems, project objectives and required solutions from a business perspective.
A good problem statement or business question:
- Has a limited scope (is not too general)
- Is specific enough to be answered completely
- Leads to the root cause of the problem
- Can be answered, possibly with multiple solutions: who, what, where, when and how?
Example of a good business question: how can we increase sales by 10% in 3 months using social media?
2 — Get data
This step includes data acquisition, data wrangling (modifying data), data analysis and data modeling.
3 — Data exploration
At this step we explore the data and select what is needed, including cleaning the data, eliminating duplicates, formatting data from various sources, and transforming it into more useful variables.
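A small cleaning sketch with pandas; customers.csv and its columns are made up for illustration:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical raw export

# Eliminate duplicate rows.
df = df.drop_duplicates()

# Fix missing values and inconsistent formatting from different sources.
df["city"] = df["city"].fillna("unknown").str.strip().str.title()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# Transform into a more useful variable for analysis.
df["signup_year"] = df["signup_date"].dt.year
```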
4 — Prepare data
At this step we perform data mining, work with structured and unstructured data, use various tools and software (Google Sheets, Excel, Python, SQL, R) for data transformation, and integrate data from various sources.
5 — Data analysis
Using Exploratory Data Analysis (EDA), data analysts examine how the data behaves, while data scientists test various algorithms to find the best model for the data set.
6 — Presentation of findings
Data visualization is the graphic representation of data using attractive charts, graphs or maps. Why is it needed? According to research, about 80% of the information the brain takes in is visual, and the human brain is said to process images 60,000 times faster than text.
Data Analytics — Data Scientist
Data analysts look for insights through visualization, while data scientists focus more on machine learning and predictive modeling, i.e. using data to predict outcomes or provide recommendations to users.
- Data Analytics
- Descriptive Analysis: What happened?
- Descriptive analysis -> analyzes what has already happened; the aim is to summarize the findings and focus on the facts. Tools: Excel, SPSS, Matlab.
Example: a company report. If you work in e-commerce and want to know business performance during months X and Y, you already have the data or facts, which are then summarized into a report.
- Diagnostic Analysis: Why did this happen? -> helps find or identify why something happened by looking deeper into the data to understand the root cause and the relationships involved.
Techniques: data mining, data discovery, correlation, drill-down.
Example: why sales decreased in month X.
- Data Scientist
- Predictive Analysis: What will happen?
- Predictive analysis -> predicts future results or outcomes that have not yet occurred.
Tools: machine learning algorithms (Random Forest, SVM), Python, R (see the sketch after this list).
Example: analyzing sentiment on social media, identifying target markets for campaigns, forecasting, and building recommendation systems.
- Prescriptive Analysis: How can we make it happen?
- Prescriptive analysis -> provides solutions or suggestions based on the predictions generated.
Example: promoters and staff can be more confident in their campaigns and reach more customers.
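A minimal predictive-analysis sketch with scikit-learn's Random Forest; the features and labels are randomly generated, just to show the shape of the workflow:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Made-up features (say, past spend and visit frequency) and a churn-like label.
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```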
Data Visualization
Visualization is a representation of data that makes the data easier to understand, helping us gain insight from it and make decisions.
Why do you need to visualize data?
- Retrieve information quickly
- Identify patterns and relationships
- Determine trends
- Communicate stories to people
- Help make better decisions
Tools in data visualization
- R/Python -> can process larger data, can perform more complicated statistical methods and statistical modeling, can process data from multiple sources.
- Spreadsheet/Excel -> can process less data, can perform simple statistical methods, can process data from limited sources.
- BI Tools/Data Viz Tool (Tableau, Google Data Studio, PowerBI, etc.) -> can process larger data, can perform simple statistical methods but can be integrated with Python/R, can process data from limited sources.
Dashboard Visualization
A dashboard is a tool that combines graphs from multiple sources and helps business owners monitor business health by visually tracking, analyzing and displaying key data points. It usually shows a real-time representation of the data.
Good visualization ?
4 Keys to successful data visualization
- What story is your data trying to tell?
- What chart type will be most efficient?
- What type of data do you want to explain?
- Who is the audience that will hear your story?
Determine what to present:
- Comparison
- Relationships
- Distribution
- Connection
- Composition
- Location
The frequently used visualizations:
- Line Chart
When is it used?
- To see a trend
- To compare values
- To see progress
When should it be avoided?
- Categorical data
- Too much data
- Horizontal Bar Chart
When is it used?
- To sort values
- To compare categories
When should it be avoided?
- When you want to see development over time, for example monthly trends
- Stacked Bar Chart
When is it used?
- To see composition
- To see data that changes over a period of time
When should it be avoided?
- Too many categories
- Pie Chart
When is it used?
- To see the composition of a single whole
When should it be avoided?
- Too many categories/slices
- Scatter Plot
When is it used?
- To see correlation
- To find outliers
- Huge datasets
When should it be avoided?
- Uncorrelated metrics
- Small datasets
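A quick matplotlib sketch of these chart types, using made-up numbers:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Line chart: a trend over time.
axes[0, 0].plot(months, sales)
axes[0, 0].set_title("Line: monthly sales trend")

# Horizontal bar chart: comparison across categories.
axes[0, 1].barh(["Product A", "Product B", "Product C"], [300, 220, 180])
axes[0, 1].set_title("Bar: comparison by product")

# Pie chart: composition of a single whole (avoid with many slices).
axes[1, 0].pie([45, 30, 25], labels=["Online", "Store", "Reseller"], autopct="%1.0f%%")
axes[1, 0].set_title("Pie: sales channel share")

# Scatter plot: correlation between two metrics and spotting outliers.
axes[1, 1].scatter([10, 12, 15, 17, 20, 40], [110, 118, 130, 142, 155, 120])
axes[1, 1].set_title("Scatter: ad spend vs. sales")

plt.tight_layout()
plt.show()
```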