Skip to main content
Version: 1.0

architecture

  1. How to choose storage tool
  2. Managing risks
  3. SQL and NOSQL

How to choose storage tool

  • Data model

    • Key/value
    • Semi-structured
      • Column-oriented
      • Document-oriented
  • Storage model

    • In-memory
    • Persistent
  • Consistency model

    • Strictly
    • Eventually consistent
  • Physical model

    • Distributed
    • Single machine
  • Read/write performance

    • Read / write equals ? equal ?
    • Does it support range scans or is it better suited doing random reads
  • Secondary indexes

    • Does it support secondary indexes ?
    • How strong it supports secondary indexes
  • Failure handling

    • How it handles servers failure ?
    • It is able to continue operating ?
    • How it backups data in server replacement ?
    • How it performs server decommissioning
  • Compression

    • Is the compression method pluggable?
    • What types are available?
  • Load balancing

    • Does it accept load balancing
    • How it performs load balancing in case of high throughput
  • Atomic read-modify-write

  • Locking, waits and dead-locks

    • Can it be free for wait and therefore deadlocks ?
  • Security

    • ??

Questions to ask

  • Types of data to be stored
  • How the data will be transformed
  • Who should access the data
  • What are the typical access patterns

Steps to mount cluster

  • Chose OS
  • Select the FS to use with disk
    • ext3
    • ext4
    • XFS

Managing risk

Categories of risks

  • Technology risk
    • Individual component risk
    • Interation between components
    • Unfamiliarity with na technology used in designing the system
  • Team risk
    • Knowledge level and strength of team
    • Dependency on external team
    • Potentially disruptive team member
  • Requirements
    • Poorly defined requirements or poorly defined problem
    • New technologies on which team member didn't worked on

How to manage risks

  • Categorize risk in the architecture: break the architecture into pieces. Allows the risk of each component to be contained within that component.
    • Data ingest
    • Data serving
    • Data processing (compute)
    • Access pattern (JDBC, ODBC, REST, ...)
    • Data storage
  • Assign team members to work on each component

SQL and NOSQL

  • CAP
    • Consistency: data source should provide the most recent data in every copy after successfully write
    • Availability: no downtime and data system is available every time
    • Partition: system should continue to serve data even system is down in some partitions

Data Sources

  • API
  • OLTP Databases
  • OLAP
  • CDC
  • Logs

Logs

Logs should capture who, what and when

  • Who: The human, system, or service account associated with the event (e.g., a web browser user agent or a user ID)
  • What happend: the event and the related metadata
  • When: the timestamp of the even