AWS SageMaker

ML Pipeline

Section1:

Problem Formulation

Formulate the ML problem based on the business problem
Section2:
Collect and Label Data
Collect data according to the problem and, if needed, label the data
Section3:
Evaluate Data
EDA (Exploratory Data Analysis)
Section4:
Feature Engineering
Section5:
Select and train model
Section6:
Meets Business goal?
Deploy model
Section7:
Evaluate model
If the model needs more refinement, continue to Section8
Section8:
Tune model
Change the model's network structure or adjust the model's parameters
Additional process
Section4 -> Section5 -> Section7 -> Section8 form a cycle

If the model is tuned, the feature engineering, training, and evaluation steps should be repeated.
If the model meets the business goal, the model is deployed.
If the model doesn’t meet the goal, an augmentation process may be needed
=> Augmentation
Feature Augmentation

The current features cannot produce a satisfactory level of inference
=> Add features, or combine the original features to make new features
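
A minimal sketch of feature augmentation with pandas, assuming a hypothetical transactions table. The column names (amount, account_avg_amount, hour) and the two derived features are illustrative assumptions and do not come from these notes.

```python
import pandas as pd

# Hypothetical transaction data; column names are assumptions for illustration.
df = pd.DataFrame({
    "amount": [20.0, 500.0, 35.0],
    "account_avg_amount": [25.0, 50.0, 70.0],
    "hour": [14, 3, 23],
})

# Combine two original features into a new ratio feature.
df["amount_to_avg_ratio"] = df["amount"] / df["account_avg_amount"]

# Derive a new binary feature from an existing one (night: 22:00 to 05:59).
df["is_night"] = ((df["hour"] >= 22) | (df["hour"] < 6)).astype(int)
```

Whether such derived features actually improve inference has to be checked by retraining and re-evaluating the model.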
Data Augmentation

A satisfactory level of inference is not reached because there is a lack of data, or because the evaluation scores on the training and test data are very different
Solution: Collect and label additional data, or apply data augmentation (transform the original data and add the transformed data to the data set)
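
A minimal sketch of data augmentation with NumPy, assuming a hypothetical set of small grayscale images. A left-to-right flip is used as the transform only for illustration; it is valid only when flipping does not change the label.

```python
import numpy as np

# Hypothetical dataset: 4 grayscale "images" of 2x3 pixels with labels.
images = np.arange(24, dtype=np.float32).reshape(4, 2, 3)
labels = np.array([0, 1, 0, 1])

# Transform the originals: flip each image left to right.
flipped = images[:, :, ::-1]

# Add the transformed copies to the original data; a flip leaves the labels unchanged.
aug_images = np.concatenate([images, flipped], axis=0)
aug_labels = np.concatenate([labels, labels], axis=0)
```

The augmented set is twice the size of the original. Augmentation should be applied to the training data only, so that the test data still reflects real observations.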

Define Business Goal
How will the business measure success?
How will the solution be used?
Do similar solutions exist that you might learn from?
What assumptions have been made?
Who are the domain experts?

How would you frame this problem?
Is this a machine learning problem?
Is the problem supervised or unsupervised?
What is the target to predict?
Do you have access to the data?
What is the minimum performance?
How would you solve the problem manually?
What’s the simplest solution?

Example: You want to identify fraudulent credit card transactions so that you can stop the transaction before it processes.
Why?
Reduce the number of customers who end their membership because of fraud
Can you measure it?
Move from qualitative statements to quantitative statements that can be measured.
10% reduction in fraud claims in retail
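
A small sketch of how a quantitative goal like the one above could be checked. The claim counts and the helper name percent_reduction are hypothetical and used only for illustration.

```python
def percent_reduction(baseline: float, current: float) -> float:
    """Return the percentage reduction from baseline to current."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) * 100.0 / baseline

# Hypothetical counts of fraud claims before and after the solution is in use.
baseline_claims = 1200
current_claims = 1080

reduction = percent_reduction(baseline_claims, current_claims)
goal_met = reduction >= 10.0  # business goal: at least a 10% reduction
```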

Collect and label Data

Collecting and securing data
What data do you need?
How much data do you have, and where is it?
Do you have access to that data?
What solution can you use to bring all of this data into one centralized repository?
Data sources
Private data:
Data that customers create
Commercial data:
AWS Data Exchange, AWS Marketplace, and other external providers
Open-source data:
Data that is publicly available (check for limits on usage)
Kaggle, World Health Organization, US Census Bureau, National Oceanic and Atmospheric Administration (US), UC Irvine Machine Learning Repository

Observations
ML also needs a lot of data (feature/target data), called observations, where the target answer or prediction is already known
Get a domain expert
Storing data in AWS
Amazon S3 is the most commonly used store. It is widely used for big data analysis and as a deep learning model repository because of its high data I/O speed, security, and stability.

Extract, transform, load (ETL)
Original data in data stores (the data can be in different formats and places) -> Bring the data in -> Catalog the data -> Write the transform script that transforms the data source -> Write the schedule script that runs the transform script => The transform script runs -> The data source is transformed -> The refined data is stored again as the final table data set.
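
The flow above can be sketched in plain Python. This is not AWS Glue code: the two in-memory sources, the transform function, and the in-memory SQLite table are stand-ins assumed for illustration, and the cataloging and scheduling steps are left out.

```python
import csv
import io
import json
import sqlite3

# Extract: two hypothetical sources in different formats (CSV text and JSON text).
csv_source = "id,amount\n1,10.50\n2,n/a\n"
json_source = '[{"id": 3, "amount": "7.25"}]'

rows = list(csv.DictReader(io.StringIO(csv_source)))
rows += json.loads(json_source)

def transform(row):
    """Cast fields to consistent types; return None for rows that cannot be parsed."""
    try:
        return int(row["id"]), float(row["amount"])
    except (TypeError, ValueError):
        return None

# Transform: apply the transform function and drop unparseable rows.
clean = [t for t in (transform(r) for r in rows) if t is not None]

# Load: store the refined rows in a final table (in-memory SQLite stands in for the data store).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)", clean)
conn.commit()

loaded = conn.execute("SELECT id, amount FROM transactions ORDER BY id").fetchall()
```

The row with amount "n/a" is dropped during the transform step, so only the two parseable rows reach the final table.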
ETL with AWS Glue
-Service to do ETL
-Simplifies the complicated ETL process
-Runs the ETL process
-ML functionality
-Crawls data sources to create catalogs that other systems can query
-Brings in data sources, transforms the data, and exposes a single endpoint

Data Encryption
Amazon S3 encryption
Amazon RDS encryption
AWS CloudTrail for Audit
AWS CloudTrail tracks user activity and application programming interface (API) usage
It records what kind of access was made in the past, by whom, and under what role

Section3: Evaluate Data

-Format Data
-Examine Data types
-Perform descriptive statistics
-Visualize data

You must understand your data (Format Data, Examine Data types)
Ensure that it’s in the right format for your analysis
-For example, whether the data is a table, and whether an attribute that is supposed to contain floats is stored as a string
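
A minimal sketch of the data type check described above, using pandas. The table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical table where a numeric attribute was read in as strings.
df = pd.DataFrame({"price": ["1.5", "2.0", "oops"], "qty": [1, 2, 3]})

# Examine the data types: "price" is not a numeric column yet.
price_is_numeric_before = pd.api.types.is_numeric_dtype(df["price"])

# Convert to float; values that cannot be parsed become NaN instead of raising.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
n_bad = int(df["price"].isna().sum())
```

Counting the NaN values after the conversion shows how many entries were not valid numbers and need a closer look.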

-Perform descriptive statistics

Load the data into a pandas DataFrame
Reformats data into a tabular representation
Converts common data formats: CSV, JSON, Excel, pickle, and others
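
A short sketch of loading two of these formats into a DataFrame. The data is held in memory here as an assumption for illustration; in practice the readers would be given file paths or other sources.

```python
import io
import pandas as pd

# Hypothetical raw data held in memory.
csv_text = "id,amount\n1,10.5\n2,3.0\n"
json_text = '[{"id": 1, "amount": 10.5}, {"id": 2, "amount": 3.0}]'

# Each reader reformats its input into the same tabular representation.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text), orient="records")
```

pandas provides similar readers for the other formats, such as pd.read_excel and pd.read_pickle.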

Data Analysis using pandas
DataFrame schema, descriptive statistics (overall statistics, multivariate statistics, attribute statistics)
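
A minimal sketch of these checks with pandas, assuming a small hypothetical data set. The mapping of each call to a category of statistics is one reasonable reading of the notes, not the only one.

```python
import pandas as pd

# Hypothetical data set; column names are assumptions for illustration.
df = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "items": [1, 2, 3, 4],
    "channel": ["web", "store", "web", "web"],
})

schema = df.dtypes                              # DataFrame schema: one dtype per column
overall = df.describe()                         # overall statistics for the numeric columns
attribute = df["channel"].value_counts()        # attribute statistics for one categorical column
multivariate = df[["amount", "items"]].corr()   # multivariate statistics: pairwise correlation
```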

Section4: Feature Engineering