Collecting and securing data
What data do you need?
How much data do you have, and where is it?
Do you have access to that data?
What solution can you use to bring all of this data into one centralized repository?
Data sources
Private data:
Data that customers create
Commercial data:
AWS Data Exchange, AWS Marketplace, and other external providers
Open-source data:
Data that is publicly available (check for limits on usage)
Kaggle, World Health Organization, US Census Bureau, National Oceanic and Atmospheric Administration (US), UC Irvine Machine Learning Repository
Observations
ML also needs a lot of data (feature and target data), called observations, where the target answer or prediction is already known
Get a domain expert
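To make the idea of observations concrete, here is a minimal sketch that splits labeled rows into features and a target with pandas. The column names and values are hypothetical and only illustrate the shape of the data.

```python
import pandas as pd

# Hypothetical labeled observations: each row is one observation whose
# target ("churned") is already known.
observations = pd.DataFrame(
    {
        "monthly_spend": [20.0, 55.5, 12.3, 80.1],
        "support_calls": [0, 3, 1, 5],
        "churned": [0, 1, 0, 1],  # target: the known answer
    }
)

TARGET_COLUMN = "churned"  # assumed name, chosen for this example

# Features are every column except the target.
features = observations.drop(columns=[TARGET_COLUMN])
target = observations[TARGET_COLUMN]

print(features.shape)
print(target.tolist())
```

A domain expert would typically help decide which columns are meaningful features and whether the target is labeled correctly.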
Storing data in AWS
Amazon S3 is the most commonly used storage service. It is widely used for big data analysis and as a deep learning model repository because of its high data I/O speed, security, and stability.
Extract, transform, load (ETL)
Original data in data stores (data can be in different formats and places) -> Bring the data in -> Catalog the data -> Write the transform script that converts the source data -> Write the schedule script that runs the transform script -> The transform script runs -> The source data is converted -> The refined data is stored again as the final table dataset.
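The flow above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not how any AWS service implements it: the two in-memory sources, the `readings` table, and the function names are hypothetical, and the scheduling step is left out.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw sources in two different formats.
CSV_SOURCE = "id,temperature_c\n1,21.5\n2,19.0\n"
JSON_SOURCE = '[{"id": 3, "temperature_c": "23.1"}]'


def extract():
    """Read rows from each source into plain dictionaries."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(json.loads(JSON_SOURCE))
    return rows


def transform(rows):
    """Convert every field to a consistent type."""
    return [(int(r["id"]), float(r["temperature_c"])) for r in rows]


def load(records, connection):
    """Store the refined records in the final table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(id INTEGER PRIMARY KEY, temperature_c REAL)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract()), connection)
stored = connection.execute(
    "SELECT id, temperature_c FROM readings ORDER BY id"
).fetchall()
print(stored)
```

In a real pipeline a scheduler would call the same three steps on a timer or trigger.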
ETL with AWS Glue
-Service to do ETL
-Simplifies the complicated ETL service process
-Runs the ETL process
-ML functionality
-Crawls data sources to create catalogs that other systems can query
-Brings data sources together, transforms the data, and exposes the result through a single endpoint
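As a rough sketch of the crawl-then-transform sequence, the function below drives a Glue client that is passed in. It assumes a boto3 Glue client (for example `boto3.client("glue")`) and that the crawler, database, and job already exist. The function name and its parameters are hypothetical.

```python
import time


def crawl_then_transform(glue, crawler_name, database_name, job_name, poll_seconds=15):
    """Catalog the sources with a crawler, then start the ETL job."""
    glue.start_crawler(Name=crawler_name)

    # A crawler returns to the READY state once its run has finished.
    while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # Tables the crawler registered in the Data Catalog (first page only).
    response = glue.get_tables(DatabaseName=database_name)
    tables = [table["Name"] for table in response["TableList"]]

    # Start the transform job and return its run ID for later status checks.
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    return tables, run_id
```

The client is injected rather than created inside the function so the sequence can be exercised with a stand-in object when no AWS credentials are available.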
Data Encryption
Amazon S3 encryption
Amazon RDS encryption
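For the Amazon S3 case, a minimal sketch of requesting server-side encryption on upload is shown here. It only builds the arguments for the S3 `put_object` call. The helper name, bucket, and key are hypothetical, and sending the request assumes boto3 and configured credentials.

```python
def encrypted_put_args(bucket, key, body, kms_key_id=None):
    """Build keyword arguments for an S3 put_object call with server-side encryption."""
    args = {"Bucket": bucket, "Key": key, "Body": body}
    if kms_key_id is None:
        # SSE-S3: Amazon S3 manages the encryption keys.
        args["ServerSideEncryption"] = "AES256"
    else:
        # SSE-KMS: encrypt with a key held in AWS KMS.
        args["ServerSideEncryption"] = "aws:kms"
        args["SSEKMSKeyId"] = kms_key_id
    return args


# Hypothetical bucket and object names.
request = encrypted_put_args("example-ml-bucket", "raw/train.csv", b"id,label\n1,0\n")
print(request["ServerSideEncryption"])

# With AWS credentials configured, the request could be sent with boto3:
#   import boto3
#   boto3.client("s3").put_object(**request)
```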
AWS CloudTrail for Audit
AWS CloudTrail tracks user activity and application programming interface (API) usage
It shows what access was made in the past, by whom, and under which role
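To show how the "what, by whom, under which role" questions map onto a log entry, here is a small sketch that summarizes one CloudTrail record. The sample record is hypothetical and heavily trimmed, and `summarize` is an illustrative helper, not part of any AWS SDK.

```python
import json

# Hypothetical, trimmed CloudTrail record (real records carry many more fields).
SAMPLE_RECORD = json.loads("""
{
  "eventTime": "2024-01-15T09:30:00Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "CreateBucket",
  "userIdentity": {
    "type": "AssumedRole",
    "arn": "arn:aws:sts::111122223333:assumed-role/DataScientist/alice",
    "sessionContext": {
      "sessionIssuer": {"type": "Role", "userName": "DataScientist"}
    }
  }
}
""")


def summarize(record):
    """Reduce a record to when, what, who, and under which role."""
    identity = record.get("userIdentity", {})
    issuer = identity.get("sessionContext", {}).get("sessionIssuer", {})
    is_role_session = identity.get("type") == "AssumedRole"
    return {
        "when": record.get("eventTime"),
        "what": f'{record.get("eventSource")}:{record.get("eventName")}',
        "who": identity.get("arn"),
        # Only assumed-role sessions carry a role name.
        "role": issuer.get("userName") if is_role_session else None,
    }


print(summarize(SAMPLE_RECORD))
```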
-Format Data
-Examine Data types
-Perform descriptive statistics
-Visualize data
You must understand your data (format data, examine data types)
Ensure that it’s the right data format for your analysis
-Check whether the data is tabular and whether an attribute that should contain floats is stored as a string
-Perform descriptive statistics
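The data type check described above can be sketched with pandas. The data is hypothetical: a `price` column that should hold floats but arrived as text, with one unparseable value.

```python
import pandas as pd

# Hypothetical data where "price" should be numeric but arrived as text.
df = pd.DataFrame({"item": ["a", "b", "c"], "price": ["1.50", "2.25", "n/a"]})

# Before conversion, "price" is reported with a string/object dtype, not float64.
print(df.dtypes)

# Convert to float; values that cannot be parsed become NaN instead of raising.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)
print(df["price"].isna().sum())  # number of values that failed to parse

# Descriptive statistics only make sense once the column is numeric.
print(df["price"].describe())
```

Using `errors="coerce"` is one design choice among several. It keeps the rows and marks bad values as missing, so they can be counted and handled explicitly.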
Load data into a pandas DataFrame
Reformats data into a tabular representation
Converts common data formats: CSV, JSON, Excel, Pickle, and others
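As a small illustration, the sketch below loads the same hypothetical records from CSV text and from JSON text. In-memory strings stand in for files so the example runs without any data on disk.

```python
import io

import pandas as pd

# Hypothetical records expressed in two formats.
CSV_TEXT = "city,population\nLima,10000000\nOslo,700000\n"
JSON_TEXT = (
    '[{"city": "Lima", "population": 10000000},'
    ' {"city": "Oslo", "population": 700000}]'
)

from_csv = pd.read_csv(io.StringIO(CSV_TEXT))
from_json = pd.read_json(io.StringIO(JSON_TEXT))

# Both sources end up in the same tabular form.
print(from_csv)
print(from_json)
```

`pd.read_excel` and `pd.read_pickle` follow the same pattern for Excel and Pickle files. Reading Excel files needs an additional engine package to be installed.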
Data Analysis using pandas
DataFrame schema, descriptive statistics (overall statistics, multivariate statistics, attribute statistics)
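One way to read the items above in pandas terms is sketched here on hypothetical data. The mapping of each item to a specific call is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical dataset with two numeric attributes and one categorical one.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0],
        "weight_kg": [50.0, 58.0, 66.0, 74.0],
        "group": ["a", "b", "a", "b"],
    }
)

# Schema: column names, dtypes, and non-null counts.
df.info()

# Overall statistics for the numeric columns (count, mean, std, quartiles).
overall = df.describe()

# Multivariate statistics: pairwise correlation between numeric columns.
correlation = df.corr(numeric_only=True)

# Attribute statistics: the distribution of a single column.
group_counts = df["group"].value_counts()

print(overall)
print(correlation)
print(group_counts)
```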