Book Information

网站运维工程 (Site Reliability Engineering), Reprint Edition

Edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy
Publisher: Southeast University Press, Nanjing
ISBN: 9787564172961
Publication year: 2018
Listed page count: 528
Subject headings: Website construction (English); Website maintenance (English)

Table of Contents

Part I. Introduction

1. Introduction

The Sysadmin Approach to Service Management

Google’s Approach to Service Management: Site Reliability Engineering

Tenets of SRE

The End of the Beginning

2. The Production Environment at Google, from the Viewpoint of an SRE

Hardware

System Software That “Organizes” the Hardware

Other System Software

Our Software Infrastructure

Our Development Environment

Shakespeare: A Sample Service

Part II. Principles

3. Embracing Risk

Managing Risk

Measuring Service Risk

Risk Tolerance of Services

Motivation for Error Budgets

4. Service Level Objectives

Service Level Terminology

Indicators in Practice

Objectives in Practice

Agreements in Practice

5. Eliminating Toil

Toil Defined

Why Less Toil Is Better

What Qualifies as Engineering?

Is Toil Always Bad?

Conclusion

6. Monitoring Distributed Systems

Definitions

Why Monitor?

Setting Reasonable Expectations for Monitoring

Symptoms Versus Causes

Black-Box Versus White-Box

The Four Golden Signals

Worrying About Your Tail (or, Instrumentation and Performance)

Choosing an Appropriate Resolution for Measurements

As Simple as Possible, No Simpler

Tying These Principles Together

Monitoring for the Long Term

Conclusion

7. The Evolution of Automation at Google

The Value of Automation

The Value for Google SRE

The Use Cases for Automation

Automate Yourself Out of a Job: Automate ALL the Things!

Soothing the Pain: Applying Automation to Cluster Turnups

Borg: Birth of the Warehouse-Scale Computer

Reliability Is the Fundamental Feature

Recommendations

8. Release Engineering

The Role of a Release Engineer

Philosophy

Continuous Build and Deployment

Configuration Management

Conclusions

9. Simplicity

System Stability Versus Agility

The Virtue of Boring

I Won’t Give Up My Code!

The “Negative Lines of Code” Metric

Minimal APIs

Modularity

Release Simplicity

A Simple Conclusion

Part III. Practices

10. Practical Alerting from Time-Series Data

The Rise of Borgmon

Instrumentation of Applications

Collection of Exported Data

Storage in the Time-Series Arena

Rule Evaluation

Alerting

Sharding the Monitoring Topology

Black-Box Monitoring

Maintaining the Configuration

Ten Years On...

11. Being On-Call

Introduction

Life of an On-Call Engineer

Balanced On-Call

Feeling Safe

Avoiding Inappropriate Operational Load

Conclusions

12. Effective Troubleshooting

Theory

In Practice

Negative Results Are Magic

Case Study

Making Troubleshooting Easier

Conclusion

13. Emergency Response

What to Do When Systems Break

Test-Induced Emergency

Change-Induced Emergency

Process-Induced Emergency

All Problems Have Solutions

Learn from the Past. Don’t Repeat It.

Conclusion

14. Managing Incidents

Unmanaged Incidents

The Anatomy of an Unmanaged Incident

Elements of Incident Management Process

A Managed Incident

When to Declare an Incident

In Summary

15. Postmortem Culture: Learning from Failure

Google’s Postmortem Philosophy

Collaborate and Share Knowledge

Introducing a Postmortem Culture

Conclusion and Ongoing Improvements

16. Tracking Outages

Escalator

Outalator

17. Testing for Reliability

Types of Software Testing

Creating a Test and Build Environment

Testing at Scale

Conclusion

18. Software Engineering in SRE

Why Is Software Engineering Within SRE Important?

Auxon Case Study: Project Background and Problem Space

Intent-Based Capacity Planning

Fostering Software Engineering in SRE

Conclusions

19. Load Balancing at the Frontend

Power Isn’t the Answer

Load Balancing Using DNS

Load Balancing at the Virtual IP Address

20. Load Balancing in the Datacenter

The Ideal Case

Identifying Bad Tasks: Flow Control and Lame Ducks

Limiting the Connections Pool with Subsetting

Load Balancing Policies

21. Handling Overload

The Pitfalls of “Queries per Second”

Per-Customer Limits

Client-Side Throttling

Criticality

Utilization Signals

Handling Overload Errors

Load from Connections

Conclusions

22. Addressing Cascading Failures

Causes of Cascading Failures and Designing to Avoid Them

Preventing Server Overload

Slow Startup and Cold Caching

Triggering Conditions for Cascading Failures

Testing for Cascading Failures

Immediate Steps to Address Cascading Failures

Closing Remarks

23. Managing Critical State: Distributed Consensus for Reliability

Motivating the Use of Consensus: Distributed Systems Coordination Failure

How Distributed Consensus Works

System Architecture Patterns for Distributed Consensus

Distributed Consensus Performance

Deploying Distributed Consensus-Based Systems

Monitoring Distributed Consensus Systems

Conclusion

24. Distributed Periodic Scheduling with Cron

Cron

Cron Jobs and Idempotency

Cron at Large Scale

Building Cron at Google

Summary

25. Data Processing Pipelines

Origin of the Pipeline Design Pattern

Initial Effect of Big Data on the Simple Pipeline Pattern

Challenges with the Periodic Pipeline Pattern

Trouble Caused By Uneven Work Distribution

Drawbacks of Periodic Pipelines in Distributed Environments

Introduction to Google Workflow

Stages of Execution in Workflow

Ensuring Business Continuity

Summary and Concluding Remarks

26. Data Integrity: What You Read Is What You Wrote

Data Integrity’s Strict Requirements

Google SRE Objectives in Maintaining Data Integrity and Availability

How Google SRE Faces the Challenges of Data Integrity

Case Studies

General Principles of SRE as Applied to Data Integrity

Conclusion

27. Reliable Product Launches at Scale

Launch Coordination Engineering

Setting Up a Launch Process

Developing a Launch Checklist

Selected Techniques for Reliable Launches

Development of LCE

Conclusion

Part IV. Management

28. Accelerating SREs to On-Call and Beyond

You’ve Hired Your Next SRE(s), Now What?

Initial Learning Experiences: The Case for Structure Over Chaos

Creating Stellar Reverse Engineers and Improvisational Thinkers

Five Practices for Aspiring On-Callers

On-Call and Beyond: Rites of Passage, and Practicing Continuing Education

Closing Thoughts

29. Dealing with Interrupts

Managing Operational Load

Factors in Determining How Interrupts Are Handled

Imperfect Machines

30. Embedding an SRE to Recover from Operational Overload

Phase 1: Learn the Service and Get Context

Phase 2: Sharing Context

Phase 3: Driving Change

Conclusion

31. Communication and Collaboration in SRE

Communications: Production Meetings

Collaboration within SRE

Case Study of Collaboration in SRE: Viceroy

Collaboration Outside SRE

Case Study: Migrating DFP to F1

Conclusion

32. The Evolving SRE Engagement Model

SRE Engagement: What, How, and Why

The PRR Model

The SRE Engagement Model

Production Readiness Reviews: Simple PRR Model

Evolving the Simple PRR Model: Early Engagement

Evolving Services Development: Frameworks and SRE Platform

Conclusion

Part V. Conclusions

33. Lessons Learned from Other Industries

Meet Our Industry Veterans

Preparedness and Disaster Testing

Postmortem Culture

Automating Away Repetitive Work and Operational Overhead

Structured and Rational Decision Making

Conclusions

34. Conclusion

A. Availability Table

B. A Collection of Best Practices for Production Services

C. Example Incident State Document

D. Example Postmortem

E. Launch Coordination Checklist

F. Example Production Meeting Minutes

Bibliography

Index