Version: 1.0

datalake-archi

Best practices

Use Event-Sourcing to facilitate backups (store now, analyze later)
Layer data based on users' skills (data analytics, engineering, ...)
Keep the Datalake open (avoid vendor lock-in or over-reliance on a single tool or database)
Plan for performance
See the link

Datalake
- Pros:
  - Store raw data
  - Store different types of data (structured, semi-structured and unstructured)
  - Allows flexibility
- Cons:
  - Multiple data quality issues
Lambda
- Batch layer
- Realtime layer
- Serving layer
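A minimal sketch of how the three layers fit together, assuming a toy page-view counter. The function names and the event shape are hypothetical and chosen only for illustration; in a real system each layer would be a separate system.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute counts from the full, immutable history."""
    return Counter(event["page"] for event in master_dataset)

def realtime_view(recent_events):
    """Realtime layer: count only events not yet covered by a batch run."""
    return Counter(event["page"] for event in recent_events)

def serve(batch, realtime):
    """Serving layer: merge both views to answer a query."""
    return batch + realtime

# Hypothetical sample data
history = [{"page": "home"}, {"page": "home"}, {"page": "docs"}]
recent = [{"page": "docs"}, {"page": "pricing"}]

merged = serve(batch_view(history), realtime_view(recent))
print(dict(merged))
```

The batch view is accurate but stale, the realtime view is fresh but partial, and the serving layer combines the two.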
Kappa
- Simplifies the Lambda architecture by merging the batch and realtime layers
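A minimal sketch of the idea, assuming the event log is a plain Python list and using the same toy page-view counter. `process_stream` and `from_offset` are hypothetical names. There is a single processing path, and recomputing a view means replaying the log through that same path.

```python
from collections import Counter

def process_stream(log, from_offset=0):
    """Single code path: fold events from the log into a view."""
    view = Counter()
    for event in log[from_offset:]:
        view[event["page"]] += 1
    return view

# Hypothetical append-only event log
log = [{"page": "home"}, {"page": "docs"}, {"page": "home"}]

live_view = process_stream(log)

# Reprocessing: replay the log from the start through the same job,
# instead of maintaining a separate batch implementation.
rebuilt_view = process_stream(log, from_offset=0)
```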
Partitioning
- Enables efficient data filtering and retrieval, as queries can skip irrelevant partitions during processing, which is also referred to as data pruning.
- Choose columns that are frequently used in query filters as partition keys
- Two types
  - Static
  - Dynamic
- Partition size should be balanced to avoid data skew
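The points above can be sketched in plain Python, assuming an in-memory dict stands in for one directory per partition value. `write_partitioned`, `query` and the sample rows are hypothetical. The write derives each row's partition from a column value, the read touches only the matching partition, and the size check at the end is a crude way to spot skew.

```python
from collections import defaultdict

def write_partitioned(rows, key):
    """Group rows by the partition column's value (one 'directory' per value)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)

def query(partitions, wanted_key):
    """Read only the matching partition; all others are skipped (pruned)."""
    scanned = [wanted_key] if wanted_key in partitions else []
    rows = partitions.get(wanted_key, [])
    return rows, scanned

# Hypothetical sample data, partitioned by day
rows = [
    {"day": "2024-01-01", "amount": 10},
    {"day": "2024-01-01", "amount": 5},
    {"day": "2024-01-02", "amount": 7},
]
partitions = write_partitioned(rows, key="day")
result, scanned = query(partitions, "2024-01-02")

# Rows per partition: large differences here indicate data skew
sizes = {k: len(v) for k, v in partitions.items()}
```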
Bucketing
- Bucketing improves query performance by grouping similar data together and reducing the number of files to scan during processing
- Improves data locality, because rows with the same bucketing key are stored together
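A minimal sketch of hash-based bucketing, assuming a fixed bucket count and CRC32 as the hash. Both are illustrative choices, not what any particular engine uses, and `bucket_of`, `write_bucketed` and `lookup` are hypothetical names. A lookup by key reads a single bucket instead of all of them.

```python
import zlib

NUM_BUCKETS = 4  # assumed fixed bucket count

def bucket_of(key, num_buckets=NUM_BUCKETS):
    """Map a key to a bucket with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def write_bucketed(rows, column):
    """Place each row in the bucket chosen by hashing its key column."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[column])].append(row)
    return buckets

def lookup(buckets, column, value):
    """Scan only the one bucket that can contain the value."""
    candidate = buckets[bucket_of(value)]
    return [row for row in candidate if row[column] == value]

# Hypothetical sample data, bucketed by user
rows = [
    {"user": "alice", "amount": 10},
    {"user": "bob", "amount": 5},
    {"user": "alice", "amount": 7},
]
buckets = write_bucketed(rows, "user")
alice_rows = lookup(buckets, "user", "alice")
```

Because the same key always hashes to the same bucket, all rows for a given user end up together.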

indexing_system