GCP - preparing for your cloud architect journey
2023-08-05
Yeah, 考過了!
接下來要準備 OCPJP SE 11
VPC
Projects, networks, subnetworks
- A project
- Associates objects and servicing with billing
- Contains networks
- A network
3 VPC network types
- Default
- Every project
- One subnet per region
- Default firewall rules
- Auto Mode
- Default network
- One subnet per region
- Regional IP allocation
- Fixed /20 subnetwork per region
- Expandable up to /16
- Custom Mode
- No default subnets created
- Full control of IP ranges
- Regional IP allocation
- Expandable to IP ranges you specify
Expand subnets without re-creating instances
-
Cannot overlap with other subnets
-
IP range must be a unique valid CIDR block
-
New subnet IP ranges have to fall within valid IP ranges
-
Can expand but not shrink
-
Auto mode can be expanded from /20 to /16 ip range
-
Avoid creating large subnets
Do not scale your subnet beyond what you actually need.
IP Addresses
-
Internal IP
- Allocated from subnet range to VMs by DHCP
- DHCP lease is renewed every 24 hours
- VM name + IP is registered with network-scoped DNS
-
External IP
- Assigned from pool (ephemeral)
- Reserved (static)
- Bring Your Own IP address (BYOIP)
- VM doesn’t know external IP; it is mapped to the internal IP
External IPs are mapped to internal IPs
DNS resolution for internal addresses
-
Each instances has a hostname that can be resolved to an internal IP address:
-
The hostname is the same as the instance name
-
FQDN is [hostname].[zone].c.[project-id].internal
Example: my-server.us-central1-a.c.guestbook-151617.internal
-
-
Name resolution is handled by internal DNS resolver:
- Provided as part of Compute Engine (169.254.168.254).
- Configured for use on instance via DHCP
- Provides answer for internal and external addresses
DNS resolution for external addresses
- Instances with external IP address can allow connections from hosts outside the project.
- Users connect directly using external IP address
- Admins can also publish public DNS records pointing to the instance
- Public DNS records are not published automatically
- DNS records for external addresses can be published using existing DNS servers (outside of Google Cloud)
- DNS zones can be hosted using Cloud DNS
Host DNS zones using Cloud DNS
- Google’s DNS service
- Translate domain names into IP address
- Low latency
- High available (100% uptime SLA)
- Create and update millions of DNS records
- UI, command line or API
Assign a range of IP addresses as aliases to a VM’s network interface using alias IP ranges
Routes and firewall rules
A route is a mapping of an IP range to a destination
Every network has:
- Routes that let instances in a network send traffic directly to each other
- A default route that directs packets to destinations that are outside the network
Firewall rules must also allow the packet.
Routes map traffic to destination networks
- Apply to traffic egressing a VM.
- Forword traffic to most specific route.
- Are created when a subnet is created.
- Enable VMs on same network to communicate.
- Destination is in CIDR notation.
- Traffic is delivered only if it also matches a firewall rule.
Firewall rules protect your VM instances from unapproved connections
- VPC network functions as a distributed firewall
- Firewall rules are applied to the network as a whole.
- Connections are allowed or denied at the instance level.
- Firewall rules are stateful.
- Implied deny all ingress and allow all egress rule
Google Cloud firewall use case: Egress
- Conditions:
- Destination CIDR ranges
- Protocols
- Ports
- Action:
- Allow: permit the matching egress connection
- Deny: block the matching egress connection
Virtual machines
x Compute Engine GKE App Engine std env App Engine flexible env Cloud Functions Cloud Run Language support Any Any Py, Node.js, Go, Java, Ruby, PHP Py, Node.js, Go, Java, PHP, Ruby, .NET, Custom … Py, Node.js, Go, Java Any Usage model IaaS IaaS,PaaS PaaS PaaS Microservices architecture PaaS Scaling Server autoscaling Cluster Autoscaling managed servers Autoscaling managed servers Serverless Serverless Primary use case General workloads Container workloads Scalable web applications,Mobile backend applications Scalable web applications,Mobile backend applications Lightweight event actions Deploy & scale containerized apps Infrastructure as a Service
- Predefined or custom machine types
- vCPUs (cores) and Memory (RAM)
- Storage
- Zonal or regional persistent disk (HDD or SSD)
- Local SSD
- Cloud Storage
- Networking
- Linux or Windows
Compute Engine features
- Machine rightsizing
- Recommendation engine for optimum machine size
- Cloud Monitoring statistics
- New recommendation 24 hrs after VM create or resize
- (-)
- Instance metadata
- Startup scripts
- Availability policies
- Live migrate
- Auto restart
- Global load balancing
- Multiple regions for availability
- OS patch management
- Create patch approvals
- Set up flexible scheduling
- Apply advanced patch configuration settings
- (-)
- Per-second billing
- Sustained use discounts
- Committed use discounts
- Preemptible and Spot VMs:
- Up to 91% discount
- No SLA
VM access
Linux: SSH
The creator has SSH capability and can use the Cloud Console to generate a unsername and password.
- SSH from the Google Cloud console or Cloud Shell via the Google Cloud SDK
- SSH from computer or third-party client and generate key pair
- Requires firewall rule to allow
tcp:22
Windows: RDP
- RDP clients
- Powershell terminal
- Requires setting the Windows password
- Requires firewall to allow
tcp:3389
[Machine family]
\ General-Purpose
\ Compute-Optimized
\ Memory-Optimized
\ Accelerator-Optimized
[Machine series]
\
[Machine type]
Availability policy : Automatic changes
Called “scheduling options” in SDK/API
Automatic restart
- Automatic VM restart due to crash or maintenance event
- Not preemption or a user-initiated terminate
On host maintenance
- Determines whether host is live-migrated or terminated due to a maintenance event. Live migration is the default
Live migration
- During maintenance event, VM is migrated to different hardware without interruption
- Metadata indicates occurrence of live migration
Patching management is an essential part of managing an infrastructure
Manage OSes easily through Google Cloud
- Keep infrastructures up-to-date
- Redue the risk of security vulnerabilities
Two main components of OS patch management service:
- Patch compliance reporting
- reports insights on the patch status of your VM instances across Windows and Linux distributions
- Patch deployment
- automates the Operating System and software patch update process
There are several tasks that can be performed with patch management
You can … Create patch approval Set up flexible scheduling Apply advance patch configuration (e.g. pre/post patch scripts) Manage these patch jobs or updates from a centralized location
Charges for stopped (terminated) VM
No charges | Charges ($$) |
---|---|
Memory | Attached disks |
CPU resources | Reserved IP addresses |
Virtual Machines Note: Notice that you cannot change the machine type, the CPU platform, or the zone.
You can add network tags and allow specific network traffic from the internet through firewalls. Some properties of a VM are integral to the VM, are established when the VM is created, and cannot be changed. Other properties can be edited.
You can add additional disks and you can also determine whether the boot disk is deleted when the instance is deleted.
Normally the boot disk defaults to being deleted automatically when the instance is deleted. But sometimes you will want to override this behavior. This feature is very important because you cannot create an image from a boot disk when it is attached to a running instance.
So you would need to disable Delete boot disk when instance is deleted to enable creating a system image from the boot disk.
Availability policies
Note: You cannot convert a non-preemptible instance into a preemptible one. This choice must be made at VM creation. A preemptible instance can be interrupted at any time and is available at a lower cost.
If a VM is stopped for any reason, (for example an outage or a hardware failure) the automatic restart feature will start it back up. Is this the behavior you want? Are your applications idempotent (written to handle a second startup properly)?
During host maintenance, the VM is set for live migration. However, you can have the VM terminated instead of migrated.
If you make changes, they can sometimes take several minutes to be implemented, especially if they involve networking changes like adding firewalls or changing the external IP.
Compute options
Four Compute Engine machine families
-
General purpose
- the best price performance; the most flexible vCPU to memory ratios
series workload applications E2 (Cost-optimized) Day-to-day computing at a lower cost Web/App serving; Back office apps; Small-medium DBs; Microservices; Virtual desktops; Dev environments N2, N2D, N1 (Balanced) Balanced price/performance across a wide range of VM shapes Web/App serving; Back office apps; Medium-large DBs; Cache; Media/streaming Tau T2D (Scale-out optimized) Best performance/ cost for scale-out workloads Scale-out workloads; Web serving; Containerized microservices; Media transcoding; Large-scale Java apps
- the best price performance; the most flexible vCPU to memory ratios
-
Compute-optimized
Series Workload Applications C2 Ultra high performance for compute-intensive workloads Compute-bond workloads; High-performance web serving; Gaming (AAA game servers); Ad serving; High-performance computing (HPC); Media transcoding; AI/ML C2D Ultra high performance for compute-intensive workloads Memory-bound workloads; Gaming (AAA game servers); High-performance computing (HPC); High-performance DBs; Electronic Design Automation (EDA); Media transcoding -
Memory-optimized machine family
Series Workload Applications M1 Ultra high-memory workloads Medium in-memory DBs such as SAP HANA; Tasks that require intensive use of memory with hight memory-to-vCPU ratios than the general-purpose high-memory machine types; In-memory DBs and in-memory analytics, business warehousing (BW) workloads, genomics analysis, SQL analysis services; MS SQL Server and similar DBs M2 Ultra high-memory workloads Large in-memory DBs such as SAP HANA; In-memory DBs and in-memory analytics, business warehousing (BW) workloads, genomics analysis, SQL analysis services -
Accelerator-optimized
Series Workload Applications A2 Optimized for high-performance computing workloads CUDA-enabled ML training and interface; HPC; Massive parallelized computation
Create custom machine types
When:
- Requirements fit between the predefined types
- Need more memory or more CPU
Customize the amount of memory and vCPU for your machine:
- Either 1vCPU or even number of vCPU
- 0.9 GB per vCPU, up to 6.5 GB per vCPU (default)
- Total memory must be multiple of 256 MB
Disk options
Boot disk
- VM comes with a single root persistent disk
- Image is loaded onto root disk during first boot:
- Bootable: you can attach to a VM and boot from it
- Durable: can survive VM terminate
- Some OS images are customized for Compute Engine
- Can survive VM deletion if “Delete boot disk when instance is deleted” is disabled
[1] Persistent disks
Network storage appearing as a block device
- Attached to a VM through the network interface
- Durable storage: can survive VM terminate
- Bootable: you can attach to a VM and boot from it
- Snapshots: incremental backups
- Performance: Scales with size
- HDD (magnetic) or SSD (faster solid-state) options
- Disk resizing: even running and attached!
- Can be attached in read-only mode to multiple VMs
- Zonal or Regional
- Zonal - offer efficient, reliable block storage
- Regional - provide active-active disk replication across two zones in the same region
pd-standard
pd-ssd
pd-balanced
pd-extreme
(zonal only)
- Encryption keys
- Google-managed
- Customer-managed : Cloud KMS
- Customer-supplied
[2] Local SSD disks
Physically attached to the VM
- More IOPS, lower latency, higher throughput than persistent disk
- 375 GB in size; can attach up to 24 SSD partitions, total of 9TB per instance
- Data survives a reset, but not a VM stop or terminate
- VM-specific : CANNOT be reattached to a different VM
[3] RAM disk
tmpfs
- Fastest type of performance available if you need small data structures
- Faster than local disk, slower than memory
- Use when your application expects a file system structure and cannot directly store its data in memory
- Fast scratch disk, or fast cache
- Very volatile; erase on stop or restart
- May need a larger machine type if RAM was sized for the application
- Consider using a persistent disk to back up RAM disk data
- | Persistent disk HDD | Persistent disk SSD | Local SSD disk | RAM disk |
---|---|---|---|---|
Data redundancy | Yes | Yes | No | No |
Encryption at rest | Yes | Yes | Yes | N/A |
Snapshotting | Yes | Yes | No | No |
Bootable | Yes | Yes | No | No |
Use case | General, bulk file storage | Very random IOPS | High IOPS and low latency | Low latency and risk of data loss |
just need capacity | high performance need |
Compute pricing
Pricing
- Per-second billing, with minimum of 1 minute
- vCPUs, GPUs, and GB of memory
- Resource-based pricing:
- Each vCPU and each GB of memory is billed separately
- Discounts:
- Sustained use
- Committed use
- Preemptible VM instances
- Recommendation Engine:
- Notifies you of underutilitzed instances
- Free usage limits:
Special compute configurations
Preemptible
- Lower price for interruptible service (up to 91%)
- VM might be terminated at any time
- No charge if terminated in the first minute
- 24 hours max
- 30-second terminate warning, but not guaranteed
- Time for a shutdown script
- No live migrate; no auto restart
- You can request that CPU quota for a region be split between regular and preemption
- Default: preemptible VMs count against region CPU quota
Spot VMs
- the latest version of preemptible VMs
- Spot VM and preemptible VM share the same pricing model
- No minimum or maximum runtime
- Spot VMs are finite Compute Engine Resources, so they might not always be available
- No live migrate; no auto restart
- Best practice use cases help you get the most of using Spot VMs
Common network designs
Increased availability with multiple zones
- Get improved availability: Place 2 vm into multiple zone but use a single-sub-network
Globalization w/ multiple regions
- Putting resource in different regions: better failure independence
Common Comput Engine actions
Metadata and scripts
# Boot -> Run -> Maintenance -> Shutdown
startup-script-url=URL
shutdown-script-url=URL
Move an instance to a new zone
gcloud compute instance move
- Automated process (moving within region)
gcloud compute instances move
- Update references to VM; not automatic
- Manual process (moving between regions):
- Snapshot all persistent disks on the source VM
- Create new persistent disks in destination zone restored from snapshots
- Create new VM in the destination zone and attach new persistent disks
- Assign static IP to new VM
- Update references to VM
- Delete the snapshots, original disks, and original VM
Quizzes
GCP infrastructure core services - IAM #3
Which of the following is not a type of IAM member?
A. Organization Account
B. Service Account
C. Google Workspace domain
D. Google Group
E. Cloud identity domain
Cloud VPN securely connects your on-premises network to your Google Cloud VPC network
- Useful for low-volume data connections
- Classic: 99.9% SLA
- HA VPN: 99.99% SLA
- Supports
- Site-to-site VPN
- Static routes (Classic VPN only)
- Dynamic routes (Cloud Router)
- IKEv1 and IKEv2 ciphers
In order to create a connection between two VPN gateways, you must establish two VPN tunnels.
When using Cloud VPN, the maximum transmission unit (MTU) for your on-premises VPN gateway cannot be greater than 1460 bytes.
HA VPN overview
-
Provides 99.99% service availability.
-
Google Cloud automatically chooses two external IP addresses
-
Each of the HA VPNs supports multiple tunnels
You can only configure an HA VPN gateway with only ONE active interface and one external IP address.
-
VPN tunnels connected to HA VPN gateways must use dynamic (BGP) routing
-
-
HA VPN supports site-to-site VPN for different topologies/configuration scenarios:
- An HA VPN gateway to peer VPN devices
- An HA VPN gateway to an Amazon Web Services (AWS) virtual private gateway
- Two HA VPN gateways connected to each other
Three typical peer gateway configuration for HA VPN:
- An HA VPN gateway to two separate peer VPN devices, each with its own IP address
- An HA VPN gateway to one peer VPN device that uses two separate IP addresses
- An HA VPN gateway to one peer VPN device that uses one IP address.
Use either transit gateway or a virtual private gateway, when configuring an HA VPN external VPN gateway to AWS
-
Only the transit gateway supports equal-cost multipath (ECMP) routing
ECMP 等價多路徑路由:
To distribute network traffic evenly across multiple available paths with the same cost
Cloud VPN supports both static routes and dynamic routes.
You need to configure Cloud Routers in order to use dynamic routes.
Cloud Router can manage routes for a Cloud VPN tunnel using Border Gateway Protocol.
BGP routing method allows for routes to be updated and exchanged without changing the tunnel configuration.
To set up BGP, an additional IP address has to be assigned to each end of the VPN tunnel.
To guarantee 99.99% availability SLA for HA VPN connections, properly configure two or four tunnels from your HA VPN gateway to your peer VPN gateway or to another HA VPN gateway.
Use Cloud Interconnect when a dedicated high-speed connection is required between networks
- Dedicated Interconnect provides a direct connection to a colocation facility.
- From 10 to 200 Gbps
- Partner Interconnect provides a connection through a service provider
- Can purchase less bandwidth from 50 Mbps
- Allows access to VPC resources using internal IP address space
- Private Google Access allows on-premises hosts to access Google services using private IPs
Cloud Interconnect and Peering
* | Dedicated | Shared |
---|---|---|
Layer 3 | Direct Peering [Cloud | VPN] Carrier Peering |
Layer 2 | Dedicated Interconnect | Partner Interconnect |
Dedicated Interconnect provides direct physical connections.
- 99.9% or 99.99% up-time SLA
- If your place is no where near any of the colocation facilities locations, consider partner interconnect.
Partner Interconnect provides connectivity through a supported service provider
- Useful when your data center is not close to colocation facility locations, or if your data needs don’t warrant a dedicated interconnect.
Comparison of Interconnect options
Connection | Provides | Capacity | Requirements | Access Type |
---|---|---|---|---|
IPsec VPN tunnel | Encrypted tunnel to VPC networks through the public internet | 1.5-3 Gbps per tunnel | On-premises VPN gateway | Internal IP addresses |
Dedicated Interconnect | Dedicated, direct connection to VPC networks | 10 Gbps or 100 Gbps per link | Connection in colocation facility | Internal IP addresses |
Partner Interconnect | Dedicated bandwidth, connection to VPC network through a service provider | 50 Mbps - 10 Gbps per connection | Service provider | Internal IP addresses |
Direct Peering provides a direct connection between your business network and Google’s
- Broad-reaching edge network locations
- Exchange BGP routes
- Reach all of Google’s services
- No SLA
- Peering requirements
Edge Points of Presence (PoPs)
- PoPs 邊緣據點 - where Google’s network connects to the rest of the Internet via peering
- 如果你的 data center 還是離 PoPs 很遠,可以考慮 Carrier Peering
Carrier Peering provides connectivity through a supported partner
- Carrier peering partner
- Reach all of Google’s services
- Partner requirements
- No SLA
Comparision of Peering options
Connection | Provides | Capacity | Requirements | Access Type |
---|---|---|---|---|
Direct Peering | Dedicated, direct connection to Google’s network | 10 Gbps per link | Connection in Google Cloud PoPs | Public IP addresses |
Carrier Peering | Peering through service provider to Google’s public network | Varies based on partner offering | Service provider | Public IP addresses |
Choosing a network connection option
interconnect | peering |
---|---|
Direct access to RFC1918 IPs in your VPC - with SLA | Access to Google public IPs only - without SLA |
1. Dedicated Interconnect2. Partner Interconnect3. Cloud VPN | 1. Direct Peering2. Carrier Peering |
Managed instance groups
- Deploy identical instances based on instance template
- Instance group can be resized
- Manager ensures all instances are RUNNING
- Typically used with autoscaler
- Can be single zone or regional
Managed instance groups offer autoscaling capabilities
Dynamically add/remove instances:
- Increases in load
- Decreases in load
Autoscaling policy:
- CPU utilization
- Load balancing capacity
- Monitoring metrics
- Queue-based workload
HTTP(S) load balancing
HTTP(s) load balancing
- Global load balancing
- Anycast IP address
- HTTP or port 80 or 8080
- HTTPs on port 443
- IPv4 or IPv6
- Autoscaling
- URL maps
HTTP(s) load balancing
- Target HTTP(S) proxy
- One signed SSL certificate installed (minimum)
- Client SSL session terminates at the load balancer
- Support the QUIC transport layer protocol
Network endpoint groups (NEGs)
A network endpoint group (NEG) is a configuration object that specifies a group of backend endpoints or services.
There are four types of NEGs:
- Zonal
- Internet
- Hybrid connectivity
- Serverless
Cloud CDN
content delivery network
Why? Cloud CDN caches contents. You can enable it whiling setting up backend service load balancer.
Cloud CDN cache modes
- Cache modes control the factor that determine whether or not Cloud CDN caches your content.
- Cloud CDN offers three cache modes:
- USER_ORIGIN_HEADERS
- Requires origin responses to set valid cache directives and valid caching header
- CACHE_ALL_STATIC
- Automatically caches static content that doesn’t have the no-store, private or no-cache directive
- Origin responses that set valid caching directives are also cached.
- FORCE_CACHE_ALL
- Unconditionally caches responses, overriding any cache directives set by the origin.
- Careful not to caches private per-user content if using a share backend with this mode configured.
- USER_ORIGIN_HEADERS
SSL proxy load balancing
- Global load balancing for encrypted, non-HTTP traffic
- Terminates SSL session at load balancing layer
- IPv4 or IPv6 clients
- Benefits
- Intelligent routing
- Certificate management
- Security patching
- SSL policies
Network load balancing
Network load balancing
- Regional, non-proxide load balancer
- Forwarding rules (IP protocol data)
- Traffic
- UDP
- TCP/SSL ports
- Architecture
- Backend service-based
- Target pool-based
Backend service-based architecture
- Regional backend service
- Defines the behavior of the load balancer and how it distributes traffic to its backend instance groups
- Backend services enable new features not supported with legacy target pools
- Non-legacy health checks
- Auto-scaling with managed instance groups
- Connection draining
- Configurable failover policy
Target pool-based architecture
- Forwarding rules (TCP and UDP)
- Each project can have up to 50 target pools
- Each target pool can only have one health check
- All instances in a target pool must be in the same region (same limitation as the load balancer)
Internal load balancing
Internal TCP/UDP load balancing
- Regional, private load balancing
- VM instances in same region
- RFC 1918 IP address
- TCP/UDP traffic
- Reduced latency, simpler configuration
- Software-defined, fully distribute load balancing
Andromeda: directly deliver client traffic to backend instance
Internal HTTP(s) load balancing
- Regional, private loading balancing
- VM instances in same region
- RFC 1918 IP addresses
- HTTP, HTTPS, or HTTP/2 protocols
- Based on open source Envoy proxy
Choosing a load balancer
- Support for IPv6: only HTTPS, SSL proxy and TCP proxy support IPv6.
-
Google Cloud console:
Recommanded when you are new to using a service or if you prefer UI
-
Cloud Shell:
When you are confortable using a specific service, or
you want to quickly create resources using the command line
Terraform
Infrastructure as code(IaC) allows quick provisioning and removing of infractures
- Build an infrastructure when needed.
- Destroy the infrastructure when not in use.
- Create identical infrastructures for dev, test, and prod.
- Can be part of a CI/CD pipeline
- Templates are the building blocks for disaster recovery procedures
- Manage resource dependencies and complexity
- Google Cloud supports many IaC tools
- Terraform, Ansible, Chef, Packer, puppet
Terraform is an infrastructure automation tool
-
Repeatable deployment process
-
Declarative language
- HashCorp Language (HCL)
-
Focus on the application
-
Parallel deployment
-
Template-driven
Terraform language
-
Terraform language is the interface to declare resources
-
Resources are infrastructure objects (containers, compute engine, etc)
-
The configuration file guides the management of the resource
- Blocks: Represent objects
- Argument: to assign values to names
resource "google_compute_network" "default" { name = "${var.network_name}" auto_create_subnetworks = false } <BLOCK TYPE> "<BLOCK LABEL" "<BLOCK LABEL>" { # Block body <IDENTIFIER> = <EXPRESSION> # Argument }
Terraform can be used on multiple public and private clouds
- Considered a first-class tool in Google Cloud
- Already installed in Cloud Shell
|
|
Deploying infrastructure with Terraform
terraform init
# to initialize the new Terraform configuratio
# make sure that the Google provider plug-in is downloaded and installed in a subdirectory of the current working directory along with various other bookkeeping files
terraform plan
# to perfrom a refresh unless explicitly disabled and then determines what actions are necessary to achieve the desired state specified in the configuration files
# convenient for checking whether the execution plan for a set of changes matches your expectations, without making any changes to real resources or to the state
terraform apply
# to create the infrastructure defined in the main.tf file
# once completed, you will be able to access the defined infrastructure
Google Cloud Marketplace
Google Cloud Marketplace
- Deploy production-grade solutions
- Single bill for Google Cloud and third party services
- Manage solutions using Terraform
- Notifications when a security updated is available
- Direct access to partner support
Managed Services
BigQuery
BigQuery is Google Cloud’s serverless, highly scalable and cost-effective cloud data warehouse
- Fully managed
- Petabyte scale
- SQL interface
- Very fast
You can access BigQuery by using the cloud console, by using a command line tool, or by making calls to the BigQuery rest API using a variety of client libraries such as Java, .net, or Python.
Query example
|
|
Dataflow
Use Dataflow to execute a wide variety of data processing patterns
-
Serverless, fully managed data processing
-
Batch and stream processing with autoscale
-
Open source programming using
beam
Cloud Dataflow supports fast, simplified pipeline development via expressive SQL, Java and Python APIs in the Apache Beam SDK.
-
Intelligently scale to millions of QPS
Dataprep
Use Dataprep to visually explore, clean and prepare data for analysis and machine learning
- Serverless, works at any scale
- no infrastructure to deploy or manage
- Suggests ideal data transformation
- Focus on data analysis (skip data-profiling)
- Integrated partner service operated by Trifecta
Dataproc
Dataproc is a service for running Apache Spark and Apache Hadoop clusters
-
Low cost (per-second billing, preemptible)
-
Super fast to start, scale, and shut down
- 90 secs vs 5~30 mins
-
Integrated with Google Cloud, e.g.
(BigQuery, Cloud Storage, Cloud Bigtable, Stackdriver Logging, Stackdriver Monitoring)
- Provides complete data platform rather than just a Spark/Hadoop cluster
-
Managed service
-
Simple and familiar
Dataflow versus Dataproc
[ Dependencies on specific ]
+------[ tools/packages in the Apache ]-----+
| [ Hadoop/Spark ecosystem? ] |
[Yes] [No]
| |
| [ Do you prefer the manual or ]
| +--- [ automatic provisioning of clusters? ] ---+
| | |
| [Manual] [Automatic]
| | |
↓ | ↓
[Dataproc] ← ------------+ [Dataflow]
Serverless:
Servers or Compute Engine instances are obfuscated so that you don’t have to worry about the infrastructure.
Design and Process
Microservices:
- Allow a large application to be decomposed into independent constituent parts with each part having its own area of responsibility
- To serve a single user or API requests, a microservice’s base application can call many internal microservices to compose its response
Requirements, Analysis and Design
Qualitative requirements define systems from the user’s point of view
Ws | Examples |
---|---|
Who | Who are the users?Who are the developers?Who are the stakeholders? |
What | What does the system do?What are the main features? |
Why | Why is the system needed? ⭐️⭐️⭐️ |
When | When do the users need and/or want the solution?When can the developers be done? |
How | How will the system work?How many users will there be?How much data will there be? |
Roles represent the goal of a user at some point
Roles are not people or job titles | Roles should describe a users objective | Examples of Roles |
---|---|---|
1. People can play multiple roles2. A single role can be played by multiple people | 1. What does the user want to do?2. “User” is not a good role (everyone is a user) | 1. Shopper2. Account holder3. Customer4. Adminstrator5. Manager |
Using persona can provide further insights.
Evaluate user stories with the INVEST criteria
- Indepedent
- Negotiable
- Valuable
- Estimatable
- Small
- Testable
Quantitative requirements are things that are measurable
Given the contraints:
- Time
- Finance
- People
What can be achieved:
- How many users are there
- How much data is there
- What are the rewards and risk
DevOps Automation
Continuous intregration pipelines automate building applications
-
Developers check-in code
Use a Git repo for each microservice and branches for versions.
-
Run unit tests
If the tests don’t pass, stop.
-
Build deployment package
Create a Docker image.
-
Deploy
Save your new Docker image in a container registry.
Google provides the components required for a continuous integration pipeline
-
Continuous Integration
-
Cloud Source Repositories
Developers push to a central repository when they want a build to occur
-
Cloud Build
Build system executes the steps required to make a deployment package or Docker image
-
Build triggers
Watches for changes in the Git repo and starts the build
-
Container Registry
Store your Docker images or deployment packages in a central location for deployment
-
Cloud Source Repositories provides managed Git repositories
Control access to your repos using IAM within your Google Cloud projects.
Cloud Build lets you build software quickly across all languages
-
Google-hosted Docker build services
- Alternative to using Docker build command
-
Use the CLI to submit a build
gcloud builds submit --tag gcr .io/your-project-id/image-name
Build triggers watch a repository and build a container whenever code is pushed
-
Supports Maven, custom builds, and Docker
[Select source] –> [Select repository] –> [Trigger settings]
Container Registry is a Google Cloud-hosted Docker repository
-
Images built using Cloud Build are automatically saved in Container Registry
- Tag images with the prefix gcr.io/your-project-id/image-name
-
Can use Docker push and pull commands with Container Registry
docker push gcr.io/your-project-id/image-name docker pull gcr.io/your-project-id/image-name
Binary authorization allows you to enforce deploying only trusted containers into GKE
- Enable binary authorization on GKE cluster
- Add a policy that requires signed images
- When an image is built by Cloud Build an “attestor” verifies that it was from a trusted repository (Source Repositories, for example)
- Container Registry includes a vulnerability scanner that scans containers
Infrastructure as Code
Moving to the cloud requires a mindset change
On-Premises
- Buy machines
- Keep machines running for years
- Prefer fewer big machines
- Machines are capital expenditures
Cloud
- Rent machines
- Turn machines off as soon as possible
- Prefer lots of small machines
- Machines are monthly expenses
In the cloud, all infrastructure needs to be disposable
- Don’t fix broken machines
- Don’t install patches
- Don’t upgrade machines
- If you need to fix a machine, delete it and re-create a new one
- To make infrastructure disposable, automate everything with code:
- Can automate using scripts
- Can use declarative tools to define infrastructure
//TODO: Move section Terraform here
Key Storage Characteristics
Google Cloud storage and database portfolio
Relational | Relational | NoSQL | NoSQL | Object | Warehouse | In memory |
---|---|---|---|---|---|---|
Cloud SQL | Cloud Spanner | Firestore | Cloud Bigtable | Cloud Storage | BigQuery | Memorystore |
Good for:Web frameworks | Good for:RDBMS+scale, HA, HTAP | Good for:Hierarchical, mobile, web | Good for:Heavy read + write, events | Good for: Binary object data | Good for:Enterprise data warehouse | Good for:Caching for Web/Mobile apps |
Such as:CMS, eCommerce | Such as:User metadata, Ad/Fin/MarTech | Such as:User profiles, Game State | Such as:AdTech, financial, IoT | Such as:Images, media serving, backups | Such as:Analytics, dashboards | Such as:Game state, user sessions |
Scales to 30 TB. MySQL, PostgreSQL, SQL Server | Scales infinitely Regional or multi-regional | Completely managed Document database | Scales infinitely Wide-column NoSQL | Completely managed Infinitely scalable | Completely Managed SQL analysis | Managed Redis DB |
Fixed schema | Fixed schema | Schemaless | Schemaless | Schemaless | Fixed schema | Schemaless |
|
|
Different data storage services have different availability SLAs
Storage Choice | Availability SLA % |
---|---|
Cloud Storage (multi-region bucket) | >= 99.95 |
Cloud Storage (regional bucket) | 99.9 |
Cloud Storage (coldline) | 99.0 |
Spanner (multi-region) | 99.999 |
Spanner (single region) | 99.99 |
Firestore (multi-region) | 99.999 |
Firestore (single region) | 99.99 |
Durability represents the odds of losing data
Preventing data loss is a shared responsibility
Storage Choice | Google Cloud Provides | What you should do |
---|---|---|
Cloud Storage | 11 9’s durability Versioning (optional) | Turn versioning on |
Disks | Snapshots | Schedule snapshot jobs |
Cloud SQL | Automated machine backupsPoint-in-time recoveryFailover server (optional) | Run SQL database backups |
Spanner | Automatic replication | Run export jobs |
Firestore | Automatic replication | Run export jobs |
The amount of data and number of read and writes is important when selecting a data storage servic
- Scale horizontally by adding nodes
- Bigtable
- Spanner
- Scale vertically by making machine larger
- Cloud SQL
- Memorystore
- Scale automatically with NO limits
- Cloud Storage
- BigQuery
- Firestore
Do you need strong consistency?
Strongly consistent databases update all copies of data within a transaction.
Ensures everyone gets the latest copy of the data on reads:
- Storage
- Cloud SQL
- Spanner
- Firestore
Eventually consistent databases update one copy of the data and the rest asynchronously.
Can handle a large volume of writes:
- Bigtable
- Memorystore replicas
Calculate the total cost per GB when choosing a storage service
-
Bigtable and Spanner would be too expensive for storing smaller amounts of data
-
Firestore is less expensive per GB, but you also pay for reads and writes
-
Cloud Storage is relatively cheap, but you can’t run a database in storage
-
BigQuery storage is relatively cheap, but doesn’t provide fast access to records and you have to pay for running queries
You need to choose the right storage solutions for each of your microservices based on their requirements
Choosing Google Cloud Storage and Data Solutions
Storage Transfer Service
Import online data to Cloud Storage
- Amazon S3
- HTTP/HTTPS Location
- Transfer data between Cloud Storage buckets
Scheduled jobs
- One time or recurring, import at a scheduled time of day
- Options for delete objects not in source or after transfer
- Filter on file name, creation date
Use the Storage Transfer Service for on-premises data for large-scale uploads from your data center
-
Install on-premises agent on your servers
-
Agent runs in a Docker container
-
Set up a connection to Google Cloud
-
Requires a minimum of 300 Mbps bandwidth
To use the STS for on-premises:
- A Posix-compliance source
- At least 300 megabits per second network connection
- A Docker supported Linux server, ports 80 and 443 open for outbound connections
-
Scales to billions of files and 100s of TBs
-
Secure
-
Automatic retires
-
Logged
-
Easy to monitor via the Cloud Console
Transfer Appliance
Suitable for data upto 1PB; rackable device up to 1PB shipped to Google
Use Transfer Appliance if uploading your data would take too long
AES256 encrypted the moment of capture ; data erased per NSIT 800-88 standard
Shipped in tamper-evident seals
Steps:
-
Request Transfer Appliance
-
Encrypt and copy your data
-
Ship it back to Google
-
Google loads the data
-
You decrypt your data
(You control the encryption key.)
Designing Google Cloud Networks
In Google Cloud, VPC networks are global
- When creating networks, create subnets for the regions you want to operate in
- Resources across regions can reach each other without any added interconnect
- If you are a global company, choose regions around the world
- If your users are close together, choose the region closest to them plus a backup region
- A project can have multiple networks
When creating custom subnets, specify the region and the internal IP address range
- IP address ranges cannot overlap
- Machines in the same VPC can communicate via their internal IP address regardless of the subnet region
- Subnets don’t need to be derived from a single CIDR block
- Subnets are expandable without down time
- IP Aliasing or Secondary range can be set on the subnet
A single VM can have multiple network interfaces connecting to different networks
- Each network must have a subnet in the region the VM is created in
- Each interface must be attached to a different VPC
- Maximum of 8 interfaces per VM
A Shared VPC is created in one project, but can be shared and used by other projects
Requires an organization
- Create the VPC in the host project
- Share VPC adin shares the VPC with other service projects
Allows centralized control over network configuration
- Network admins configure subnets, firewall rules, routes, etc.
- Remove network admin rights from developers
- Developers focus on machine creation and configuration in the shared network
- Disable the creation of the default network using an organizational policy
Designing Google Cloud Load Balancers
Use a global load balancer to provide access to services deployed in multiple regions
- Global load balancing supported by HTTP load balancer and TCP and SSL proxies
- HTTP load balancer routes requests to the region closest to the user
- Uses a global, anycast IP address
Use a regional load balancer to provide access to services deployed in a single region
- Supported by HTTP, TCP and UDP load balancers
- Can have a public or private IP address
- Can use any TCP or UDP port
If your load balancers have public IPs, secure them using SSL
- Supported by HTTP and TCP load balancers
- Self-managed and Google-managed SSL certificates
For lower-latency and decreased egress cost leverage Cloud CDN
- Can be enable when configuring the HTTP global load balancer
- Caches static content worldwide using Google Cloud edge-caching locations
- Cache static data from web servers in Compute Engine instances, GKE pods, or Cloud Storage Buckets
Connecting Networks
- Peering
- Cloud VPN
- Cloud Interconnect
Use VPC peering to connect networks when they are both in Google Cloud
- Can be the same or different organizations
- Subnet ranges cannot overlap
- Network admins for each VPC must approve the peering requests
+-Organization---------+ +-Organization---------+
| Google Cloud Project | | Google Cloud Project |
| +-----------------+ | | +-----------------+ |
| | VPC | | | | VPC | |
| | +-------------+ | | VPC Peering | | +-------------+ | |
| | | Subnet |-----------------------| Subnet | | |
| | | 10.1.1.0/24 | | | | | | 10.1.2.0/24 | | |
| | +-------------+ | | | | +-------------+ | |
| +-----------------+ | | +-----------------+ |
+----------------------+ +----------------------+
Cloud VPN securely connects your on-premises network to your Google
When designing for reliability, consider these key performance metrics
-
Availability (可用性 - 系統可接收請求的時間%)
The percent of time a system is running and able to process requests
- Achieved with fault tolerance
- Create backup systems
- Use health checks
- Use clear box metrics to count real traffic success and failure
-
Durability (因為硬體系統錯誤而遺失資料的可能性)
The odds of losing data because of a hardware or system failure
- Achieved by replicating data in multiple zones
- Do regular backups
- Practice restoring from backups
-
Scalability (可擴張性)
The ability of a system to continue to work as user load and data grow
- Monitor usage
- Use capacity auto-scaling to add and remove servers in response to changes in load
Designing for Reliability
Avoid single points of failure
A spare spare, N+2
- Define your unit of deployment
- N+2L: Plan to have one unit out for upgrade or testing and survive another failing
- Make sure that each unit can handle the extra load
- Don’t make any single unit too large
- Try to make units interchangeable stateless clones
Beware of correlated failures
Correlated failures occur when related items fail at the same time
- If a single machine fails, all requests served by machine fail
- If a top-of-rack switch fails, entire rack fails
- If a zone or region is lost, all the resources in it fail
- Servers on the same software run into the same issue
- If a global configuration system fails, and multiple systems depend on it, they potentially fail too
The group of related items that could fail together is a failure domain
To avoid correlated failures…
Decouple servers and use microservices distributed among multiple failure domains
- Divide business logic into services based on failure domains
- Deploy to multiple zones and/or regions
- Split responsibility into components and spread over multiple processes
- Design independent, loosely coupled but collaborating services
Beware of cascading failures
Cascading failures occur when one system fails, causing others to be overloaded, such as a message queue becoming overloaded because of a failing backend
+-[600]->[Server A, max 1000 qps]
|
[Cloud Load Balancing]-+-[600]->[Server B, max 1000 qps]
# Server B fails, causing A to be overloaded [1200->A]
To avoid cascading failures
- Use health checks in Compute Engine or readiness and liveliness probes in Kubernetes to detect and then repair unhealthy instances
- Ensure that new server instances start fast and ideally don’t rely on other backends/systems to start up
+-[300]->[Server A, max 500 qps]
|
+-[300]->[Server B, max 500 qps]
|
[Cloud Load Balancing]-+-[300]->[Server C, max 500 qps]
|
+-[300]->[Server D, max 500 qps]
# Server D fails, sending [400] respectively to Server A/B/C
Plan against Query of death overload
- Where a request made to a service causes a failure in the service
- Reason: The error manifests itself as overconsumption of resource, but in reality is due to an error in business logic itself (overloads of service)
- Solution: To monitor query performance. Ensure that notification of these issues gets back to the developers
Plan against Positive feedback cycle overload failure
- Where a problem is caused by trying to prevent problems
- Problem: You try to make the system more reliable by adding retries, and instead you create the potential for an overload
- Solution: Prevent overload by carefully considering overload conditions whenever you are trying to improve reliability with feedback mechanisms to invoke retries (use intelligent retries - two strategies)
Use (1) truncated exponential backoff pattern to avoid positive feedback overload at the client
- If service invocation fails, try again:
- Continue to retry, but wait a while between attempts
- Wait a little longer each time the request fails
- Set a maximum length of time and a maximum number of requests
- Eventually, give up
- Example:
- Request fails: wait 1 sec + random_number_milliseconds and retry
- Req fails: wait 2 sec + random_number_milliseconds and retry
- Req fails: wait 4 sec + random_number_milliseconds and retry
- And so on, up to a maximum_backoff time
- Continue waiting and retrying up to some maximum number of retries
Use the (2) circuit breaker pattern to protect the service from too many retries
- Plan for degraded state operations
- If a service is down and all it’s clients are retrying, the increasing number of requests can make matters worse
- Protect the service behind a proxy that monitors service health (the circuit breaker)
- If the service is not healthy, don’t forward requests to it
- If using GKE, leverage istio to automatically implement circuit breakers
Use lazy deletion to reliably recover when users delete data by mistake
Disaster Planning
High availability can be achieved by deploying to multiple zones in a region
- Deploy multiple servers
- Orchestrate servers with a regional managed instance group
- Create a failover database in another zone, or use a distributed database like Firestore or Spanner
Kubernetes clusters can also be deployed to single or multiple zones
- Kubernetes cluster consist of a collection of node pools
- Selecting Regional location type replicates node pools in multiple zones in the region specified
Create a health check when creating instance gorups to enable auto healing
- Create a test endpoint in your service
- Test endpoint needs to verify that the service is up, and also that it can communicate with dependent backend database and services
- If health check fails, the instance group will create a new server and delete the broken one
- Load balancers also use health checks to ensure that they send requests only to healthy instances
If using Cloud SQL, create a failover replica for high availability
- Replica will be created in another zone in the same region as the database
- Will automatically switch to the failover if the primary instance is unavailable
- Doubles the cost of the database
Disaster recovery: Cold standby
- Create snapshots, machine images, and data backups in multi-region storage
- If main region fails, spin up servers in backup region
- Route requests to new region
- Document and test recovery procedure regularly
Disaster recovery: Hot standby
- Create instance groups in multiple regions
- Use a global load balancer
- Store unstructured data in multi-region buckets
- For structured data, use a multi-region database such as Spanner or Firestone
When disaster planning, brainstorm scenarios that might cause data loss and/or service failure
- What could happen that would cause a failure?
- What is the Recovery Point Objective (amount of data that would be acceptable to lose)?
- What is the Recovery Time Objective (amount of time it can take to be back up and running)?
Service | Scenario | Recovery Point Objective | Recovery Time Objective | Priority |
---|---|---|---|---|
Product Rating Service | Programmer deleted all ratings accidentally | 24 hours | 1 hour | Med |
Orders service | Database server crashed | 0 | 1 minutes | High |
Based on your disaster scenarios, formulate a plan to recover
- Devise a backup strategy based on risk and recovery point and time objectives
- Communicate the procedure for recovering from failures
- Test and validate the procedure for recovering from failures regularly
- Ideally, recovery becomes a streamlined process, part of daily operations
Resource | Backup Strategy | Backup Location | Recovery Procedure |
---|---|---|---|
Ratings MySQL database | Daily automated backups | Multi-regional Cloud Storage bucket | Run Restore Script |
Orders Spanner database | Multi-region deployment | us-east1 backup region | Snapshot and backup at regular intervals, outside of the serving infrastructure; e.g. Cloud Storage |
Network Security and Encryption
Network Security
Remove external IPs to prevent access to machines outside their network
-
Use a bastion host to provide access to private machines
-
Can also SSH into internal machines using Identity-Aware Proxy from the console and CLI
-
Use Cloud NAT to provide egress to the internet from internal machines
All internal traffic should terminate at a load balancer, third-party firewall(proxy or WAF) API Gateway, or IAP. That way, internal services cannot be launched and get public IP addresses.
Private access allows access to Google Cloud services using an internal address
- Enabled when creating subnets
- Allow access to Google Cloud services from VMs that only have internal IPs
- For example, a machine with only an internal IP would be able to reach a Cloud Storage bucket
gcloud compute networks subnets
update subnet-b \
--enable-private-ip-google-access
Configure firewall rules to allow access to VMs
- By default, ingress on all ports is denied
- Add firewall rules to control which clients have access to which VMs on which ports
- Application level security is the responsibility of the customer
Control access to APIs using Cloud Endpoints
- Protect and monitor your public APIs
- Control who has access to your API
- Validate every call with JSON Web Tokens and Google API keys
- Integrates with Identity Platform
Restrict access to your services to TLS only
- All Google Cloud service endpoints use HTTPS
- It’s up to you to configure your service endpoints
- In the load balancer setup, only create a secure frontend
Leverage Google Cloud network services for DDoS protection
- Global load balancers detect attacks and drop them
- Enabling the CDN will protect backend resources
Use Google Cloud Armor to create network security policies
- Can allow or deny access to your Google Cloud resources using IP addresses or ranges
- Create allow lists to allow known addresses
- Create deny lists to block known attackers
Cloud Armor supports layer 7 web application firewall (WAF) rules
-
Predefined rules for preventing common attacks like SQL injection and cross-site scripting
-
Flexible rules language allows you to allow or deny traffic using request headers, geographic location, ip addresses, cookies, etc.
-
Examples:
1 2 3 4 5 6
inIpRange(origin.ip, '9.9.9.0/24') request.headers['cookie'].contains('80=BLAH') origin.region_code == 'AU' inIpRange(origin.ip, '1.2.3.4/32') && request.headers['user-agent'].contains('WordPress') evaluatePreconfiguredExpr('xss-canary')
Encryption
Google Cloud provides server-side encryption of data at rest by default
- Data Encryption Key (DEK) uses AES-256 symmetric key
- Keys are encrypted by Key Encryption Keys (KEK)
- Google controls root keys in Cloud KMS
- Keys are automatically periodically rotated
- On-the-fly decryption by authorized user access with no visible performance impact
For compliance reasons, you may need to manage your own keys
- Customer-managed encryption keys are created in the cloud using Cloud Key Management Service(KMS)
- You create the keys and specify the rotation frequency
- You can then select your keys when creating storage resources like bucket and disks
Customer-supplied encryption keys are created in your environment and provided to Google Cloud
- Use your own keys with Google Cloud services
- CSEK are supplied by the calling application per-API call
- Only cached in RAM by Google
- They decrypt a single payload (or column) or block of returned data
- Supported by Compute Engine (persistent disks) and Cloud Storage
The Data Loss Prevention API can be used to protect sensitive data by finding it and redacting it
- Scans data in Cloud Storage, BigQuery, or Firestore
- Can also scan images
- Detects many different types of sensitive data, including:
- Emails
- Credit cards
- Tax IDs
- You can add your own information types
- Cloud DLP API can delete, mask, tokenize, or just identify the location of the sensitive data
Manage Versions and Cost Planning
Managing Versions
In a microservice architecture, be careful not to break clients when services are updated
- Include version in URI
- If you deploy a breaking change, you need to change the version
- Need to deploy new versions with zero downtime
- Need to effectively test versions prior to going live
Rolling updates allow you to deploy new versions with no downtime
- Typicall, you have multiple instances of a service behind a load balancer
- Update each instance one at a time
- Rolling updates work when it is ok to have 2 different versions running simultaneously during the update
- Rolling updates are a feature of instance groups; just change the instance template
- Rolling updates are the default in Kubernetes; just change the Docker image
- Completely automated in App Engine
Use a blue/green deployment when you don’t want multiple versions of a service running simultaneously
When you want to test a new software version → deploy it to the GREEN environment
Once testing complete, the workload is shifted from the current, i.e. BLUE env to the GREEN env.
- In Compute Engine, you can use DNS to migrate requests from one load balancer to another
- In Kubernetes, configure your service to route to the new pods using labels
- Simple configuration change
- In App Engine, use the Traffic Splitting feature
Canary releases can be used prior to a rolling update to reduce the risk
- The current service version continues to run
- Deploy an instance of the new version and give it a portion of requests
- Monitor for errors
- In Compute Engine, you can create a new instance group and add it as an additional backend in your load balancer
- In Kubernetes, create a new pod with the same labels as the existing pods; the service will automatically route a portion of requests to it
- In App Engine, use the Traffic Splitting feature
Cost Planning
Capacity planning is a continuous, iterative cycle
[Continuous Integration : F -> A -> A -> D -> F … ]
-
Forecast
Estimate capacity needed Monitor Repeat
-
Allocate
Determine resources required to meet forecasted capacity
-
Approve
Cost estimation versus risks and rewards
-
Deploy Monitor to see how accurate your forecasts were
Optimizing cost of compute
- Start with small VMs, and test to see whether they work
- Consider more small machines with auto scalng turned on
- Consider committed use discounts
- Consider at least some preemptible instances:
- 80% discount
- Use auto healing to recreate VMs when they are preempted
- Google Cloud rightsizing recommendations will alert you when VMs are underutilized
Optimizing disk cost
- Don’t over-allocate disk space
- Determine what performance characteristics your applications require:
- I/O Pattern: small reads and writes or large reads and writes
- Configure your instances to optimize storage performance
- Depending on I/O requirements, consider Standard over SSD disks
To optimize network costs, keep machines close to your data
- Egress in the same zone is free
- Egress to a different Google Cloud service within the same region using an external IP address or an internal IP address is free
- Except for some services such as Memorystore for Redis
- Egress between zones in the same region is charged
- All internet egress is charged
GKE usage metering can prevent over-provisioning Kubernetes clusters
- Agent collects consumption metrics in addition to the resource requests by polling PodMetrics objects from the metrics server
- The resource request records and resource consumption records are exported to two separate tables in a BigQuery dataset that you specify
- Comparing requested with consumed resources makes it easy to spot waste and take corrective measures
Consider alternative services to save cost rather than allocating more resources
- CDN - for static content
- Caching - Memorystore as a cache
- Messaging / Queuing -
- Instead of using datastore between two apps, use messaging or queuing with Pub/Sub to decouple communicating services and reduce storage needs
- etc.
Use the Google Cloud Pricing Calculator to estimate costs
-
Base your cost estimates on your forecasting and capacity planning
-
Compare the costs of different compute and storage services
Billing reports provide detailed cost breakdowns
For advanced cost analysis, export billing data to BigQuery
Visualize spend with Google Data Studio
Billing Dashboard:
- Daily View / Monthy View / Overall
Set budgets and alerts to keep your team aware of how much they are spending
- The alerts send emails to Billing Admins after spend exceeds a percent of the budget
- In addition to receiving an email, you can use Pub/Sub notifications to programmatically receive spend updates about this budget.
- You can even create a Cloud Function that listens to the Pub/Sub topic to automate cost management
Monitoring Dashboards
Monitoring Dashboard
Google Cloud unifies the tools you need to monitor your service SLOs and SLAs
Monitoring, Logging, Trace, Debugger, Error Reporting, Profiler
Monitoring dashboards monitor your services
- Monitor the things you pay for:
- CPU use
- Storage capacity
- Reads and writes
- Network egress
- etc.
- Monitor your SLIs to determine whether you are meeting your SLOs
k8s
Cloud Computing and Google Cloud
Cloud computing - 5 fundamental attributes
-
On-demand self-service
No human intervention needed to get resources
-
Broad network access Access from anywhere
-
Resource pooling Provider shares resources to customers
-
Rapid elasticity Get more resources quickly as needed
-
Measured service Pay only for what you consume
Google Cloud offers a range of services
- Compute Engine
- run VM on demand
- maximum flexibility
- GKE
- run containerization application
- package code, highly portable
- App Engine
- platform as service
- resources managed by Google Cloud
- Cloud Function
- only pay for the service while your code runs
Managed service
Storage
- Cloud Bigtable
- Cloud Storage
- Cloud SQL
- Cloud Spanner
- Datastore
Big Data
- BigQuery
- Pub/Sub
- Dataflow
- Dataproc
- Notebooks
Machine Learning
- Vision API
- Vertex AI
- Speech-to-Text API
- Cloud Translation API
- Cloud Natural Language API
Resource Management
Multi-Region
-
the Americas
-
Asia-Pacific
-
Europe
-
Region : e.g.
europe-west2
-
Zone : e.g.
europe-west2-a
europe-west2-b
europe-west2-c
-
-
PoPs and network
- designed to provide the highest possible throughput, lowest possible latencies
Quotas
All resources are subject to project quotas or limits
- How many resources you can create per project
- 15 VPC networks/project
- How quickly you can make API requests in a project: rate limits
- 5 admin actions/sec (Cloud Spanner)
- How many resources you can create per region
- 24 CPUs region/project
Why use project quotas?
- Prevent runaway consumption in case of an error or malicious attack
- Prevent billing spikes or surprises
- Forces sizing consideration and periodic review
Labels
Labels are a utility for organizing Google Cloud resources
- Attached to resources : VM, disk, snapshot, image
- Google Cloud console,
gcloud
, or API
- Google Cloud console,
- Example uses of labels:
- Inventory
- Filter resources
- In scripts
- Help analyze costs
- Run bulk operations
Use labels for
-
Team or Cost Center
team:marketing team:research
-
Components
component:redis component:frontend
-
Environment or stage
environment:prod environment:test
-
Owner or contact
owner:steve contact:hon
-
State
state:inuse state:readyfordeletion
Billing
How billing works
- Billing account pays for project resources
- A billing account is linked to one or more projects
- Charged automatically or invoiced every month or at threshold limit
- Subaccounts can be used for separate billing for projects
How to keep your billing under control
- Budgets and alerts
- to set up a web hook, or to trigger shut down script
- Billing export
- to send billing data to BigQuery dataset
- Reports
Quotas are helpful limits
- Rate Quota
- Reset after an amount of time
- Example: 1,000 requests per 100 seconds
- Allocation Quota
- Govern the resource you can have in your projects
- Example: 5 networks per project
Many quotas are changeable - by requesting an increase from Google Cloud support
Interacting with Google Cloud
-
Google Cloud Console
Web user interface (GUI)
-
Cloud SDK and Cloud Shell
Command-line interface
gcloud
kubectl
gsutil
bq
- Cloud Shell Editor
-
Cloud Console mobile app
For iOS and Android
- Start, stop and use SSH to connect to Compute Engine instances
- Billing info and alert
- Set up customized stuff
-
REST-based API
For custom applications
Containers and Container Images
Introduction to Containers
We used to build apps on individual servers
Deployment ~months, low utilization, not portable
+--------------------+
| Dedicated server |
| +----------------+ |
| |Application Code| |
| +----------------+ |
| +----------------+ |
| | Dependencies | |
| +----------------+ |
| +----------------+ |
| | Kernel | |
| +----------------+ |
| +----------------+ |
| | Hardware | |
| +----------------+ |
+--------------------+
Hypervisors create and manage virtual machines
- Hypervisor
- the software layer that breaks the dependencies of an operating system with its underlying hardware, and allow several virtual machines to share that same hardware.
- e.g. KVM
- Deployment ~days (mins), improved utilization, Hypervisor-specific
- Dependencies and OS are still bundled together
- Not very easy to move from a VM from one hypervisor product to another
- Applications with shared dependencies are NOT isolated from each other
- The resources requirements from one application can starve out other applications of the resources that they need (搶奪資源)
- A dependency upgrade for one app might cause another app to stop working
+--------------------+
| Dedicated server |
| +----------------+ |
| |Application Code| |
| +----------------+ |
| +----------------+ |
| | Dependencies | |
| +----------------+ |
| +----------------+ |
| | Kernel | |
| +----------------+ |
| +----------------+ |
| | Hardware + | |
| | Hypervisor | |
| +----------------+ |
+--------------------+
The VM-centric way to solve problems
(Resource-starved-out and Dependency-upgrade)
- Each application maintains its own dependencies
- The kernel is isolated
Containers and Container Images
Why developers like containers
-
Image: application and its dependencies
-
Docker: build and run container images
-
Dockerfile: Docker formatted container image
- Each instruction in the Dockerfile specifies a layer inside the container image. Each layer is read only
- When a container runs from this image, it will also have a writable ephemeral top most layer
-
When you write your Dockerfile:
- Organize from layer least likely to change through, to layers most likely to change
|
|
Intro to Kubernetes
What is Kubernetes?
- Open source
- container-centric management environment
- Automation
- Container management
- Declarative configuration
- Describe a desired state you want to achieve, instead of command lines
- Imperative configuration
- For quick temporary fixes
Kubernetes features
- Supports both stateful and stateless applications
- Autoscaling
- Resource limits
- Extensibility
Intro to Google K8s Engine
Kubernetes is powerful, but managing the infrastructure is a full-time job.
Google Kubernetes Engines to the rescue! It helps to deploy, manage and scale K8S environment for your containerized application
Explaining GKE features
- Fully managed
- Container-optimized OS
- Auto upgrade
- Auto repair
- Cluster scaling
- Seamless integration
- Identity and Access Management
- Integrated logging and monitoring (integrated w/Stackdriver)
- Integrated networking
- Cloud Console
Computing Options Detail
Compute Engine
- Fully customizable virtual machines
- Persistent disks and optional local SSDs
- Global load balancing and autoscaling
- Per-second billing
Compute Engine use cases
- Complete control over the OS and virtual hardware
- Well suited for lift-and-shift migrations to the cloud
- Most flexible compute solution, often used when a managed solution is too restrictive
App Engine
- Provides a fully managed, code-first platform
- Streamlines application deployment and scalability
- Provides support for popular programming languages and application runtimes
- Supports integrated monitoring, logging and diagnostics
- Simplifies version control, canary testing, and rollbacks
App Engine use cases
- Websites
- Mobile app and gaming backends
- RESTful APIs
Google Kubernetes Engine
- Fully managed Kubernetes platform
- Supports cluster scaling, persistent disks, automated upgrades, and auto node repairs
- Built-in integration with Google Cloud services
- Cloud Build, Container Registry, Stackdriver Monitoring, Stackdriver Logging
- Portability across multiple environments
- Hybrid computing
- Multi-cloud computing
GKE use cases
- Containerized applications
- Cloud-native distributed systems
- Hybrid application
Cloud Run
- Enables stateless containers
- Build, deploy and manage modern serverless workloads
- Abstracts away infrastructure management
- Automatically scales up and down
- Never have to pay for over-provisioned resources
- only charged for the resources you use calculated down to the nearest 100 milliseconds
- Open API and runtime environment
Cloud Run use cases
- Deploy stateless containers that listen for requests or events
- Build applications in any language using any frameworks and tools
Cloud Functions
- Event-driven, serverless compute service
- Automatic scaling with highly available and fault-tolerant design
- Charges apply only when your code runs
- Each function, invocation, memory, CPU use is measured in the 100 milliseconds increments rounded
- Cloud Functions also provides a perpetual free tier
- Triggered based on events in Google Cloud services, HTTP endpoints, and Firebase
Cloud Functions use cases
- Supporting microservice architecture
- Serverless application backends
- Mobile and IoT backends
- Integrate with third-party services and APIs
- Intelligent applications
- Virtual assistant and chat bots
- Video and image analysis
K8s Concepts
There are two elements to Kubernetes objects
- Kubernetes objects
- Persistent entities representing the state of the cluster
- Object spec
- Desired state described by us
- Object status
- Current state described by Kubernetes
Containers in a Pod share resources
A pod embodies the environment where containers live and that environment can accommodate one or more containers.
If there’s more than one container in a Pod, they are tightly coupled and they share resources including networking and storage.
+---------------+ +-------------------------------------------+
| Pod | | Pod |
| | | +---------------------------------------+ |
| | | | Shared networking | |
| | | +---------------------------------------+ |
| +-----------+ | | +-----------+ +-----------+ +-----------+ |
| | Container | | | | Container | | Container | | Container | |
| +-----------+ | | +-----------+ +-----------+ +-----------+ |
| | | +---------------------------------------+ |
| | | | Shared storage | |
| | | +---------------------------------------+ |
+---------------+ +-------------------------------------------+
The K8S Control Plane
Cooperating processes make a Kubernetes cluster work
-
Cluster
-
Control plane
-
kube-APIserver:
any query or changes towards the cluster’s state must be addressed to the cube-APIserver
-
etcd: the cluster’s database, to reliably store the state of the cluster
-
all of the cluster configuration data
-
more dynamic information
- what nodes are part of the cluster
- what pods should be running and where they should be running
-
-
kube-scheduler:
-
Responsible for scheduling pods onto nodes
-
It discovers a pod object that doesn’t yet have an assignment to a node, it chooses a node and simply writes the name of that node into the pod object
-
-
kube-controller-manager
- continuously monitor the state of the cluster through kube-apiserver
-
kub-cloud-manager
-
manages controllers that interact with the underlying cloud providers
-
if you manually launch a K8S cluster on Google Compute Engine, kube-cloud-manager would be responsible for bringing in Google Cloud features like load balancers and storage volumes when you need them
-
Each node runs a small family of control plan components. e.g. a kubelet
Kubelet :
- Kubernetes agent on each node
- When the kube-apiserver wants to start a pod on a node, it connects to that nodes kubelet
-
-
-
-
GKE Concepts
kubeadm
: An open source command that can automate much of the initial set up of a cluster
GKE manages all the control plane components
- GKE still exposes an IP address to which we send all of our K8S API requests
- GKE takes charge of managing all of the control plane infrastructure behind it
GKE : More about nodes
❌ Kubernetes doesn’t create nodes. Cluster admins create nodes and add them to K8S
✅ GKE manages this by deploying and registering Compute Engine instances as nodes
Use node pools to manage different kinds of nodes
- Node Pool :
- A subset of nodes within a cluster that share a configuration, such as their amount of memory or their CPU generation
- Also provide an easy way to ensure that workloads run on the right hardware within your cluster
- Node pools are a GKE feature rather than a K8S feature
Zonal versus regional clusters
-
Zonal Cluster
-
By default, a cluster launches on a single Google Cloud Compute Zone with 3 identical nodes, all in one node pool
-
The number of nodes can be changed during or after the creation of cluster
-
Adding more nodes and deploying multiple replicas of an application will improve an app’s availability, but only up to a point
-
-
Region Cluster
- A regional cluster is spread across 3 zones, each containing 1 control plane and 3 nodes
- A zonal cluster cannot be converted to regional cluster
A regional or zonal GKE cluster can also be set up as a private cluser
- Hidden from the public internet
- Cluster control plances can be accessed
- by GC products through an internal IP address
- By authorized networks through external IP address
K8s Object Management
Incomplete :(
Migrate for Anthos
Incomplete :(
App Engine
With App Engine, you can choose from popular coding languages, libraries, and frameworks to develop apps with tools you’re familiar with.
Then automatically provision servers in scale app instances based on demand.
This means you can upload your code and Google will manage your apps availability.
Coding options include Eclipse, IntelliJ, Maven, Git, Jenkins, and PyCharm.
With App Engine, there are no servers to provision or maintain.
App Engine provides built-in services and APIs like NoSQL data stores, memcache, load balancing, health checks, application logging and a user authentication API that’s common to most applications.
App Engine also offer software development kits or SDKs to help you develop, deploy, and manage your apps on your local machine.
Each SDK includes all of the APIs and libraries available to App Engine, the simulated secure sandbox environment that emulates all of the App Engine services on your local computer and deployment tools to upload your application to the Cloud and manage different versions.
The SDK manages your application locally, and the Google Cloud Console manages your application in production.
You can use the Cloud Console web-based interface to create new applications, configure domain names, change which version of your application is live, examine access and error logs and much more.
From a security perspective, the Security Command Center, Google Cloud security and risk management platform, keeps web applications safe.
Through the Cloud Console, you can use the Security Command Center to automatically scan and detect common web application vulnerabilities.
App Engine environment
Flexible environment
- Instances are health-checked, healed, and co-located
- Critical, backward-compatible updates are automatically applied to the underlying operating system
- VM instances are automatically located by geographical region according to the settings in your project
- VM instances are restarted on a weekly basis
The flexible environment supports …
- Microservices
- Authorization
- SQL & NoSQL databases
- Traffic splitting
- Logging
- Search
- Versioning
- Security scanning
- Memcache
- CDN
App Engine flexible allows users to …
- Benefit from custom configurations and libraries, while focusing on writing code
- Customize the runtime and the operating system of your virtual machine. Standard runtimes include Python, Java, Go, Node.js, PHP, .NET, and Ruby
- Customize or provide runtimes by supplying a custom Docker image or Dockerfile
Standard vs Flexible
- | Standard environment | Flexible environment |
---|---|---|
Instance startup | Seconds fast | Minutes |
SSH access | No | Yes (although not by default) |
Write to local disk | No (some runtimes have read and write access to the /tmp directory) |
Yes, ephemeral (disk initialized on each VM startup) |
Support for 3rd party binaries | For certain languages | Yes |
Network access | Via App Engine services | Yes |
Pricing model | After free tier usage, pay per instance class, with automatic shutdown | Pay for resource allocation per hour; no automatic shutdown |
App Engine vs GKE …
- App Engine standard environment is for people who want the service to take maximum control of their web and mobile applications deployment and scaling
- Google Kubernetes Engine - gives the application owner the full flexibility of Kubernetes
- App Engine flexible environment - somewhere between the two
Google Cloud API management tools
- Cloud Endpoints
- Distributed API management system
- Provides an API console, hosting, logging, monitoring and other features
- Use with any APIs that support the OpenAPI Specification
- Supports applications running in App Engine, Google Kubernetes Engine and Compute Engine
- Clients include Android, iOS and Javascript
- API Gateway
- Backend implementation can vary for a single service provider
- Provide secure access to your backend services through a well-defined REST API
- Client consume your REST APIS to implement standalone apps
- Apigee API management
Cloud Run
- A managed compute platform that can run stateless containers
- Serverless, removing the need for infrastracture management
- Built on Knative, an open API and runtime environment built on Kubernetes
- Can automatically scale up and down from zero almost instantaneously, charging only for the resources used
Three step:
- Write your code
- Build and package
- Deploy to Cloud Run
Cloud Run then starts your container on demand to handle requests, and ensures that all incoming requests are handled by dynamically adding and removing containers
Development in the cloud
Cloud Source Repositories
- provide full-featured git repositories hosted on Google Cloud that support the collaborative development of any application or service, including those that run on App Engine and Compute Engine
- capable of having any number of private git repositories
- Diagnostic Tools : debugger, error reporting
- Can be migrated from github or gitlab
Cloud Function
- Lightweight, event-based, asynchronous compute solution
- Allows you to create small, single-purpose functions that respond to cloud events without the need to manage a server or a runtime environment
- Use these functions to construct applications from bite-sized business logic and connect and extend cloud services
- Billed to the nearest 100 milliseconds, and only while your code is running
- Supports writing source code in a number of programming languages, including Node.js, Python, Go, Java, .Net Core, Ruby, and PHP
- Events from Cloud Storage and Pub/Sub can trigger Cloud Functions asynchronously, or use HTTP invocation for synchronous execution
Deployment : Infrastructure as Code
Creating an environment in Google Cloud can mean lots of work like setting up a compute network and storage resources and then keeping track of their configurations.
This process can be done manually by writing the commands you need to set up your environment the way you want.
However, this is labor-intensive and requires updating commands if you want to change the environment or manually writing new commands if you want to clone an environment.
It’s more efficient to use a template.
Using a template allows you to write the specification to your application environment in the same way you’d write a configuration file.
A Go template can then be deployed in a scout environment to quickly create as many identical
application environments as needed.
Terraform:
- Create a template file using HashiCorp Configuration Language (HCL) that describes what the components of the environment should look like
- Terraform uses that template to determine the actions needed to create the environment your template describes
- Use Terraform to update the environment to match the change
- Store and version-control Terraform templates in Cloud Source Repositories
Virtual Private Cloud
VPC
You can provision your GCP resources, connect them to each other, and isolate them from each other in a virtual private cloud.
You can also define fine-grained network and policies within GCP and between GCP and on-premises or other public Clouds.
VPC objects
- Projects
- Networks
- default, auto mode, custom mode
- Subnetworks - divide or segregate your environment
- Regions
- Zones
- IP addresses
- Internal, external, ranges
- Virtual machines
- Routes
- Fire rules
Projects, networks, subnetworks
A project:
- Associates objects and services with billing
- Contains networks (up to 15) that can be shared/peered
A network:
- Has no IP address range
- Is global and spans all available regions (spreads across multi-regions)
- Contains subnetworks (regional)
- Is available as default, auto, or custom
3 VPC network types
Default
- Every project
- One subnet per region
- Default firewall rules
Auto Mode
- Default network
- One subnet per region
- Regional IP allocation
- Fixed /20 mask subnetwork per region
- Expandable up to /16 mask
- all of these subnets fit within the
10.128.0.0/9
CIDR block
- all of these subnets fit within the
Custom Mode
- No default subnets created
- Full control of IP ranges
- Regional IP allocation
- Expandable to IP ranges you specify
- These IP ranges cannot overlap between subnets of the same network
- Custom mode network cannot be changed to auto mode network (one-way)
Subnetworks cross zones
- VMs can be on the same subnet but in different zones
- A single firewall rule can apply to both VMs even though they are in different zones
- Every subnet has four reserved IP addresses in its primary IP range
- Because a region contains several zones, subnetworks can cross zones
Expand subnets without re-creating instances
- The new subnet must NOT overlap with other subnets in the same VPC network in any region
- IP range must be a unique valid CIDR block
- New subnet IP ranges have to fall within valid IP ranges
- Can expand but not shrink
- Auto mode can be expanded from /20 to /16
- Avoid creating overly large subnets
- Do not scale your subnet beyond what you actually need
IP addresses
VMs can have internal and external IP addresses
Internal IP
- Allocated from subnet range to VMs by DHCP (dynamic host configuration protocol)
- DHCP lease is renewed every 24 hours
- VM name + IP is registered with netowork-scoped DNS
External IP
- Assigned from pool (ephemeral)
- Reserved (static)
- Bring Your Own IP address (BYOIP)
- VM doesn’t know external IP; it is mapped to the internal IP
Mapping IP addresses
External IPs are mapped to internal IPs
…
DNS resolution for internal addresses
Each instance has a hostname that can be resolved to an internal IP address:
- The hostname is the same as the instance name
- FQDN is [hostname].[zone].c.[project-id].internal
- Example:
my-server.us-central1-a.c.guestbook-151617.internal
- Example:
Name resolution is handled by internal DNS resolver:
- Provided as part of Compute Engine (169.254.169.254)
- Configured for use on instance via DHCP
- Provides answer for internal and external addresses
DNS resolution for external addresses
-
Instances with external IP addresses can allow connections from hosts outside the project
-
Users connect directly using external IP address
-
Admins can also publish public DNS records pointing to the instance
-
Public DNS records are not published automatically,
but admins can publish these using existing DNS servers.
-
-
-
DNS records for external addresses can be published using existing DNS servers (outside of Google Cloud)
-
DNS zones can be hosted using Cloud DNS
Host DNS zones using Cloud DNS
Cloud DNS is a scalable, reliable and managed authoritive Domain Name System
-
Google’s DNS service
-
Cloud DNS translate domain names into IP address
Example:
www.google.com
→74.125.29.101
-
Low latency
-
High availability (100% uptime SLA)
-
Create and update millions of DNS records
-
UI, command line, or API
Alias IP ranges
Assign a range of IP addresses as aliases to a VM’s network interface using alias IP ranges
Routes & firewall rules
A route is a mapping of an IP range to a destination
Every network has:
- Routes that let instances in a network send traffic directly to each other
- A default route that directs packets to destinations that are outside the network
Firewall rules must also allow the packet.
Routes map traffic to destination networks
A route is created when a network is created, enabling traffic delivery from “anywhere”.
Also a route is created when a subnet is created, this is what enables VMs on the same network to communicate.
- Apply to traffic egressing a VM
- Forward traffic to most specific route
- Are created when a subnet is created
- Enable VMs on same network to communicate
- Destination is in CIDR notation
- Traffic is delivered only if it also matches a firewall rule
Firewall rules protect your VM instances from unapproved connections
- VPC network functions as a distributed firewall
- Firewall rules are applied to the network as a whole
- Connections are allowed or denied at the instance level
- Firewall rules are stateful
- If a connection is allowed between a source and a target or a target at a destination, all subsequent traffic in either direction will be allowed
- In other words, firewall rules allow bidirectional communication once a session is established
- Implied deny all incress and allow all egress (if all firewall rules in a network are deleted)
A firewall rule is composed of…
Parameter | Details |
---|---|
direction |
Inbound connections are matched against ingress rules only. Outbound connections are matched against egress rules only. |
source or destination |
For the ingress direction, sources can be specified as part of the rule with IP addresses, source tags or a source service account. For the egress direction, destinations can be specified as part of the rule with one or more ranges of IP addresses. |
protocol and port |
Any rule can be restricted to apply to specific protocols only or specific combinations of protocols and ports only. |
acion |
To allow or deny packets that match the direction, protocol, port and source or destination of the rule |
priority |
Governs the order in which rules are evaluated; the first matching rule is applied |
Rule assignment |
All rules are assigned to all instances, but you can assign certain rules to certain instances only |
Google Cloud firewall use case : Egress
Conditions:
- Destination CIDR ranges
- Protocols
- Ports
Action:
- Allow: permit the matching egress connection
- Deny: block the matching egress connection
Google Cloud firewall use case : Ingress
Conditions:
- Source CIDR ranges
- Protocols
- Ports
Action:
- Allow: permit the matching ingress connection
- Deny: block the matching ingress connection
Pricing
Network pricing (subject to change)
Traffic type | Price |
---|---|
ingress | No charge |
egress to the same zone (internal IP address) | No charge |
egress to Google products (YouTube, Maps, Drive) | No charge |
egress to a different Google Cloud service (within same region; exceptions) | No charge |
egress between zones in the same region (per GB) | $0.01 |
egress to the same zone (external IP address, per GB) | $0.01 |
egress between regions within the US and Canada (per GB) | $0.01 |
Egress between regions, not including traffic between US regions | Varies by region |
Common network designs
Common network designs
Cloud NAT
- Google’s manged network address translation service
- let you provision your application instances without public IP address, while also allowing them to access the internet in a controlled and efficient manner
- Outbound NAT:
- Enable two private isntances to access an update server on the Internet
- Cloud NAT does NOT implement inbound NAT
- Hosts outside your VPC network cannot directly access any of the private instances behind the cloud NAT gateway
- This helps you keep your VPC network isolated and secure
SLOs, SLIs, and SLAs
KPIs and SLIs
Key performance indicators (KPIs) are metrics that can be used to measure success
In business, common KPIs include:
- Return on investment (ROI)
- Earnings before interest and taxes (EBIT)
- Employee turnover
- Customer churn
In software, common KPIs include:
- Page views
- User registrations
- Clickthroughs
- Checkouts
For KPIs to be effective, they must be SMART
- Specific “User Friendly” is not as specific as “Section 508 Accessible”
- Measurable You have to find an obejctive way to test whether you’re meeting your KPIs
- Achievable: “100% Availability” might sound good, but it’s not really possible
- Relevant: Does it really matter to the user? Will it help achieve application goals?
- Time-bound: 99% available: Per year? Per month? Per day? If we don’t know, how can we measure?
SLOs and SLAs
- SLI (indicators)
- An SLI is a measurable attribute of a service. A KPI
- Example: Availability
- SLO (objectives)
- The SLO is the number or the goal you want to achieve for a given SLI for a given duration.
- Do you want 95%, 99%, or 99.99% availability?
- Agreements (agreements)
- An SLA is a binding contract providing the customer compensation if the service doesn’t meet specific expectations
- The SLA is a more restrictive version of the SLO.
SLIs must be time-bound and measurable
❌ Fast response time
✅ HTTP GET requests respond within 400 ms aggregated per minutes
❌ Highly available
✅ Percentage of successful requests over all requests aggregated per minute
SLOs must be achievable and relevant
SLO | SLO (%) | SLO |
---|---|---|
HTTP POST photo uploads complete within 100ms aggregated per minute | 99% | ❌ If our users are using mobile phones, maybe this is overkill |
🔝 | 80% | ✅ This might be good enough |
Available as measured with an uptime check every 10 seconds aggregated per minute | 100% | ❌ Sounds good, but not practical |
🔝 | 99.999% | ❌ Possible, but maybe too expensive |
🔝 | 99% | ✅ Maybe good enough and easier and more cost-effective |
Tips for determining SLOs
- The goal isn’t to make SLOs as high as possible; the goal is to make them as low as you can get away with, while still making users happy. That’s why it’s important to understand your users
- The higher you set the SLO, the higher the cost in compute resources (redundancy) and operations effort (people time)
- Applications should not significantly outperform their SLOs, because users come to expect the level of reliability you usually give them
An SLA is a business contract between the provider and the customer
The SLA stipulates that:
- A penalty will apply to the provider if the service doesn’t maintain certain availability and/or performance thresholds
- If the SLA is broken, the customer will receive compensation from the provider
Not all services have an SLA, but all services should have an SLO
Your SLO thresholds should be stricter than your SLA
Example:
- SLI : The latency of successful HTTP responses (HTTP-200)
- SLO : The latency of 99% of the responses must be ≤ 200 ms (較嚴格)
- SLA : The user is compensated if 99th percentile latency exceeds 300 ms (較寬鬆)
Examples
User Story | SLO | SLI |
---|---|---|
Search Hotel and Flight | Available 99.95% | Fraction of 200 vs 500 HTTP responses from API endpint measured per month |
Search Hotel and Flight | 95% of requests will complete in under 200 ms | Time to last byte GET requests measured every 15 seconds aggregated per 5 minutes |
Supply Hotel Inventory | Error rate of < 0.00001% | Upload errors measured as a percentage of bulk uploads per day by custom metric |
Supply Hotel Inventory | Available 99.9% | Fraction of 200 vs 500 HTTP responses from API endpoint measured per month |
Analyze sales performance | 95% of queries will complete in under 10s | Time to last byte GET requests measured every 60 seconds aggregated per 10 minutes |
Microservices
Microservices
Microservices divide a large program into multiple smaller, independent services
- Monolithic applications implement all features in a single code base with a database for all data
- Microservices have multiple code bases, and each service manages its own data
- Each services have their own databases
Pros and cons of microservice architectures
[Pros]
- Easier to develop and maintain
- Reduced risk when deploying new versions
- Service scale independently to optimize use of infrastructure
- Faster to innovate and add new features
- Can use different languages and frameworks for different services
- Choose the runtime appropriate to each service
[Cons]
- Increased complexity when communicating between services
- Increased latency across service boundaries
- Concerns about securing inter-service traffic
- Multiple deployments
- Need to ensure that you don’t break clients as versions change
- Must maintian backward compatibility with clients as the microservice evolves
The key to architecting microservice applications is recognizing service boundaries
I. Decompose app by feature to minimize dependencies | II. Organize services by architectural layer | III. Isolate services that provide shared functionality |
---|---|---|
1. Reviews service2. Orders service3. Products service4. Etc. | 1. Web, Android, and iOS user interfaces2. Data access services | 1. Authentication service2. Reporting service3. Etc. |
Stateful services have different challenges than stateless ones
Stateful services manage stored data over time
- Harder to scale
- Harder to upgrade
- Need to back up
Stateless services get their data from the environment or other stateful services
- Easy to scale by adding instances
- Easy to migrate to new versions
- Easy to administer
Avoid storing shared state in-memory on your servers
- Requires sticky sessions(session affinity) to be set up in the load balancer
- Hinders elastic autoscaling
Store state using backend storage services shared by the frontend server
- Cache state data for faster access
- Take advantage of Google Cloud-managed data services
- Firestore, Cloud SQL, etc. for state
- Memorystore for Redis for caching
Microservices Best Practices
The Twelve-Factor App is a set of best practices for building web or software-as a-service applications
Helps to decouple components of the application
- Maximize portability
- Deploy to the cloud
- Enable continuous deployment
- Scale easily
The 12 factors
-
Codebase
One codebase tracked in revision control, many deploys
-
Use a version control system like Git
-
Each app has one code repo and vice versa
Cloud source repository
-
-
Dependencies Explicitly declare and isolate dependencies
- Use a package manager like Maven, Pip, NPM to install dependencies
- Declare dependencies in your code base
Container Registry (for storing image)
-
Config
Store config in the environment
- Don’t put secrets, connection strings, endpoints, etc., in source code
- Store those as environment variables
-
Backing Services
Treat backing services as attached resources
- Databases, caches, queues, and other services are accessed via URLs
- Should be easy to swap one implementation for another
-
Build, release, run Strictly separate build and run stages
- Build creates a deployment package from the source code
- Release combines the deployment with configuration in the runtime environment
- Run executes the application
-
Processes Execute the app as one or more stateless processes
- Apps run in one or more processes
- Each instance of the app gets its data from a separate database service
-
Port binding
Export services via port binding
- Apps are self-contained and expose a port and protocol internally
- Apps are not injected into a separate server like Apache
Such apps can be deployed on platform services such as Compute Engine, GKE, App Engine, or Cloud Run
-
Concurrency
Scale out via the process model
- Because apps are self-contained and run in separate process, they scale easily by adding instances
-
Disposability
Maximize robustness with fast startup and graceful shutdown
- App instances should scale quickly when needed
- If an instance is not needed, you should be able to turn it off with no side effects
-
Dev/prod parity
Keep development, staging, and production as similar as possible
- Container systems like Docker makes this easier
- Leverage infrastructure as code to make environments easy to create
To build workflow & keep the environments consistent: Cloud Source Repositories Cloud Storage Container Registry Terraform
-
Logs
Treat logs as event streams
- Write log messages to standard output and aggregate all logs to a single source
To help with collection, processing, and structured analysis of log: Stackdriver Logging Cloud Logging Cloud Logging Log-based Metrics
Cloud Pub/Sub BigQuery Cloud Dataflow
-
Admin processes
Run admin/management tasks as one-off processes
- Admin tasks should be repeatable processes, not one-off manual tasks
- Admin tasks shouldn’t be a part of the application
Many options depending on your deployment on Google Cloud:
Cron jobs in GKE
Cloud tasks on App Engine
Cloud Scheduler
REST and APIs
REST
A good microservice design is loosely coupled
-
Clients shouldn’t need to know too many details of services they use
-
Services communicate via HTTPS using text-based payloads
- Client makes GET, POST, PUT or DELETE request
- Body of the request is formatted as JSON or XML
- Results returned as JSON, XML, or HTML
-
Services should add functionality without breaking existing clients
- Add, but don’t remove, items from responses
If microservices aren’t loosely coupled, you’ll end up with a really complicated monolith
REST architecture supports loose coupling
- REST stands for Representational State Transfer
- Protocol independent
- HTTP is most common
- Others possible like gRPC
- Service endpoints supporting REST are called RESTful
- Client and Server communicate with Request - Response processing
RESTful services communicate over the web using HTTP(S)
- URIs (or endpoints) identify resources
- Responses return an immutable reprentation of the resource information
- REST applications provide consistent, uniform interfaces
- Representation can have links to additional resources
- Caching of immutable representations is appropriate
Resources and representations
- Resource is an abstract notion of information
- Representation is a copy of the resource information
- Representations can be single items or a collection of items
Passing representations between services is done using standard text-based formats
- JSON, HTML, XML, CSV
HTTP
Clients access services using HTTP requests
- VERB: GET, PUT, POST, DELETE
- URI: Uniform Resource Identifier (endpoint)
- Request Header: metadata about the message
- Preferred representation formats (e.g. JSON, XML)
- Request Body: (Optional) Request state
- Representation (JSON, XML) of resource
HTTP requests are simple and text-based
GET / HTTP/1.1
Host: jingle.ysatshei.com
POST /add HTTP/1.1
Host: jingle.ysatshei.com
Content-Type: json
Content-Length: 35
{"name":"Noir","breed":"Schnoodle"}
The HTTP verb tells the server what to do
-
GET
is used to retrieve data -
POST
is used to create data- Generates entity ID and returns it to the client
-
PUT
is used to create data or alter existing data- Entity ID must be known
PUT
should be idempotent, which means that whether the request is made once or multiple times, the effects on the data are exactly the same
-
DELETE
is used to remove data
Services return HTTP responses
- Response Code: 3-digit HTTP status code
- 200 - success
- 400 codes - client errors
- 500 codes - server errors
- Response Body: contains resource representation
- JSON, XML, HTML, etc.
All services need URIs (Uniform Resource Identifiers)
- Plural nouns for sets (collections)
- Singular nouns for individual resources
- Strive for consistent naming
- URI is case-insensitive
- Don’t use verbs to identify a resource
- Include version information
APIs
OpenAPI is an industry standard for exposing APIs to clients
- Standard interface description format for REST APIs
- Language independent
- Open-source (based on Swagger)
- Allows tools and humans to understand how to use a service without needing its source code
gRPC is a lightweight protocol for fast, binary communication between services or devices
- Developed at Google
- Supports many languages
- Easy to implement
- gRPC is supported by Google services
- Global load balancer (HTTP/2)
- Cloud Endpoints
- Can expose gRPC services using an Envoy Proxy in GKE
Google Cloud provides two tools for managing APIs : Cloud Endpoints and Apigee
Both provide tools for:
- User authentication
- Monitoring
- Securing APIs
- etc.
Both support OpenAPI and gRPC