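Book Information
Site Reliability Engineering (网站运维工程, reprint edition)
- Editors: Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
- Publisher: Southeast University Press, Nanjing
- ISBN: 9787564172961
- Publication year: 2018
- Stated page count: 528
- File size: 75 MB
- File page count: 554
- Subject headings: Website construction (English); Website maintenance (English)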
Table of Contents
Part I. Introduction
1. Introduction
The Sysadmin Approach to Service Management
Google’s Approach to Service Management: Site Reliability Engineering
Tenets of SRE
The End of the Beginning
2. The Production Environment at Google, from the Viewpoint of an SRE
Hardware
System Software That “Organizes” the Hardware
Other System Software
Our Software Infrastructure
Our Development Environment
Shakespeare: A Sample Service
Part II. Principles
3. Embracing Risk
Managing Risk
Measuring Service Risk
Risk Tolerance of Services
Motivation for Error Budgets
4. Service Level Objectives
Service Level Terminology
Indicators in Practice
Objectives in Practice
Agreements in Practice
5. Eliminating Toil
Toil Defined
Why Less Toil Is Better
What Qualifies as Engineering?
Is Toil Always Bad?
Conclusion
6. Monitoring Distributed Systems
Definitions
Why Monitor?
Setting Reasonable Expectations for Monitoring
Symptoms Versus Causes
Black-Box Versus White-Box
The Four Golden Signals
Worrying About Your Tail (or, Instrumentation and Performance)
Choosing an Appropriate Resolution for Measurements
As Simple as Possible, No Simpler
Tying These Principles Together
Monitoring for the Long Term
Conclusion
7. The Evolution of Automation at Google
The Value of Automation
The Value for Google SRE
The Use Cases for Automation
Automate Yourself Out of a Job: Automate ALL the Things!
Soothing the Pain: Applying Automation to Cluster Turnups
Borg: Birth of the Warehouse-Scale Computer
Reliability Is the Fundamental Feature
Recommendations
8. Release Engineering
The Role of a Release Engineer
Philosophy
Continuous Build and Deployment
Configuration Management
Conclusions
9. Simplicity
System Stability Versus Agility
The Virtue of Boring
I Won’t Give Up My Code!
The “Negative Lines of Code” Metric
Minimal APIs
Modularity
Release Simplicity
A Simple Conclusion
Part III. Practices
10. Practical Alerting from Time-Series Data
The Rise of Borgmon
Instrumentation of Applications
Collection of Exported Data
Storage in the Time-Series Arena
Rule Evaluation
Alerting
Sharding the Monitoring Topology
Black-Box Monitoring
Maintaining the Configuration
Ten Years On...
11. Being On-Call
Introduction
Life of an On-Call Engineer
Balanced On-Call
Feeling Safe
Avoiding Inappropriate Operational Load
Conclusions
12. Effective Troubleshooting
Theory
In Practice
Negative Results Are Magic
Case Study
Making Troubleshooting Easier
Conclusion
13. Emergency Response
What to Do When Systems Break
Test-Induced Emergency
Change-Induced Emergency
Process-Induced Emergency
All Problems Have Solutions
Learn from the Past. Don’t Repeat It.
Conclusion
14. Managing Incidents
Unmanaged Incidents
The Anatomy of an Unmanaged Incident
Elements of Incident Management Process
A Managed Incident
When to Declare an Incident
In Summary
15. Postmortem Culture: Learning from Failure
Google’s Postmortem Philosophy
Collaborate and Share Knowledge
Introducing a Postmortem Culture
Conclusion and Ongoing Improvements
16. Tracking Outages
Escalator
Outalator
17. Testing for Reliability
Types of Software Testing
Creating a Test and Build Environment
Testing at Scale
Conclusion
18. Software Engineering in SRE
Why Is Software Engineering Within SRE Important?
Auxon Case Study: Project Background and Problem Space
Intent-Based Capacity Planning
Fostering Software Engineering in SRE
Conclusions
19. Load Balancing at the Frontend
Power Isn’t the Answer
Load Balancing Using DNS
Load Balancing at the Virtual IP Address
20. Load Balancing in the Datacenter
The Ideal Case
Identifying Bad Tasks: Flow Control and Lame Ducks
Limiting the Connections Pool with Subsetting
Load Balancing Policies
21. Handling Overload
The Pitfalls of “Queries per Second”
Per-Customer Limits
Client-Side Throttling
Criticality
Utilization Signals
Handling Overload Errors
Load from Connections
Conclusions
22. Addressing Cascading Failures
Causes of Cascading Failures and Designing to Avoid Them
Preventing Server Overload
Slow Startup and Cold Caching
Triggering Conditions for Cascading Failures
Testing for Cascading Failures
Immediate Steps to Address Cascading Failures
Closing Remarks
23. Managing Critical State: Distributed Consensus for Reliability
Motivating the Use of Consensus: Distributed Systems Coordination Failure
How Distributed Consensus Works
System Architecture Patterns for Distributed Consensus
Distributed Consensus Performance
Deploying Distributed Consensus-Based Systems
Monitoring Distributed Consensus Systems
Conclusion
24. Distributed Periodic Scheduling with Cron
Cron
Cron Jobs and Idempotency
Cron at Large Scale
Building Cron at Google
Summary
25. Data Processing Pipelines
Origin of the Pipeline Design Pattern
Initial Effect of Big Data on the Simple Pipeline Pattern
Challenges with the Periodic Pipeline Pattern
Trouble Caused By Uneven Work Distribution
Drawbacks of Periodic Pipelines in Distributed Environments
Introduction to Google Workflow
Stages of Execution in Workflow
Ensuring Business Continuity
Summary and Concluding Remarks
26. Data Integrity: What You Read Is What You Wrote
Data Integrity’s Strict Requirements
Google SRE Objectives in Maintaining Data Integrity and Availability
How Google SRE Faces the Challenges of Data Integrity
Case Studies
General Principles of SRE as Applied to Data Integrity
Conclusion
27. Reliable Product Launches at Scale
Launch Coordination Engineering
Setting Up a Launch Process
Developing a Launch Checklist
Selected Techniques for Reliable Launches
Development of LCE
Conclusion
Part IV. Management
28. Accelerating SREs to On-Call and Beyond
You’ve Hired Your Next SRE(s), Now What?
Initial Learning Experiences: The Case for Structure Over Chaos
Creating Stellar Reverse Engineers and Improvisational Thinkers
Five Practices for Aspiring On-Callers
On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
Closing Thoughts
29. Dealing with Interrupts
Managing Operational Load
Factors in Determining How Interrupts Are Handled
Imperfect Machines
30. Embedding an SRE to Recover from Operational Overload
Phase 1: Learn the Service and Get Context
Phase 2: Sharing Context
Phase 3: Driving Change
Conclusion
31. Communication and Collaboration in SRE
Communications: Production Meetings
Collaboration within SRE
Case Study of Collaboration in SRE: Viceroy
Collaboration Outside SRE
Case Study: Migrating DFP to F1
Conclusion
32. The Evolving SRE Engagement Model
SRE Engagement: What, How, and Why
The PRR Model
The SRE Engagement Model
Production Readiness Reviews: Simple PRR Model
Evolving the Simple PRR Model: Early Engagement
Evolving Services Development: Frameworks and SRE Platform
Conclusion
Part V. Conclusions
33. Lessons Learned from Other Industries
Meet Our Industry Veterans
Preparedness and Disaster Testing
Postmortem Culture
Automating Away Repetitive Work and Operational Overhead
Structured and Rational Decision Making
Conclusions
34. Conclusion
A. Availability Table
B. A Collection of Best Practices for Production Services
C. Example Incident State Document
D. Example Postmortem
E. Launch Coordination Checklist
F. Example Production Meeting Minutes
Bibliography
Index