SlideShare feed: Slideshows by user MichaelKehoe3

eBPF Workshop
/slideshow/ebpf-workshop/149231197
Tue, 11 Jun 2019 15:23:10 GMT
eBPF workshop at Velocity Conf 2019, San Jose.

eBPF Basics
/slideshow/ebpf-basics-149201150/149201150
Tue, 11 Jun 2019 05:15:59 GMT
An introduction to eBPF (and cBPF). Topics covered include history, implementation, program types, and maps. Also gives a brief introduction to XDP and DPDK.

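The deck itself is not reproduced in this feed; as a rough, hedged illustration of the "programs plus maps" model the abstract above mentions, the sketch below uses the BCC Python frontend to count clone() syscalls per PID in a BPF hash map (assumes the bcc package and root privileges; not taken from the slides):

    from time import sleep
    from bcc import BPF

    # Restricted-C eBPF program: count clone() syscalls per PID in a BPF hash map.
    prog = r"""
    BPF_HASH(clone_counts, u32, u64);

    int trace_clone(struct pt_regs *ctx) {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        clone_counts.increment(pid);
        return 0;
    }
    """

    b = BPF(text=prog)
    b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="trace_clone")

    sleep(5)
    for pid, count in b["clone_counts"].items():
        print(f"pid {pid.value}: {count.value} clone() calls")

The kernel-side program is verified and JIT-compiled at load time; the Python side only attaches it and reads the shared map. The same user/kernel split applies to XDP programs attached at the driver level.
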
Code Yellow: Helping operations top-heavy teams the smart way
/slideshow/code-yellow-helping-operations-topheavy-teams-the-smart-way-138320342/138320342
Tue, 26 Mar 2019 19:50:36 GMT
We will look at Code Yellow, the term we use for the process of "righting the ship," and discuss how to identify teams that are struggling. Through a look at three separate experiences, we will examine some of the root causes, what steps were taken, and how the engineering organization as a whole supports the process.

QConSF 2018: Building Production-Ready Applications
/slideshow/qconsf-2018-building-productionready-applications/122343206
Wed, 07 Nov 2018 22:03:30 GMT
In 2016, Susan Fowler released the 'Production-Ready Microservices' book, which sets an industry benchmark for explaining how microservices should be conceived, all the way through to documentation. So how does this translate into actionable items? This session will explore how to expertly deploy your microservice to production. The audience will learn best practices for designing, deploying, monitoring, and documenting applications. By the end of the session, attendees should feel confident that they have the knowledge to deploy a service that will be reliable and scalable.

Helping operations top-heavy teams the smart way
/slideshow/helping-operations-topheavy-teams-the-smart-way-121080456/121080456
Mon, 29 Oct 2018 15:56:24 GMT
All engineering teams run into trouble from time to time. Alert fatigue, caused by technical debt or a failure to plan for growth, can quickly burn out SREs, overloading both development and operations with reactive work. Layer in the potential for communication problems between teams, and we can find ourselves in a place so troublesome we cannot easily see a path out. At times like this, our natural instinct as reliability engineers is to double down and fight through the issues. Often, however, we need to step back, assess the situation, and ask for help to put the team back on the road to success. We will look at Code Yellow, the term we use for this process of righting the ship, and discuss how to identify teams that are struggling. Through a look at three separate experiences, we will examine some of the root causes, what steps were taken, and how the engineering organization as a whole supports the process.

AllDayDevops: What the NTSB teaches us about incident management & postmortems
/slideshow/alldaydevops-what-the-ntsb-teaches-us-about-incident-management-postmortems/119799478
Wed, 17 Oct 2018 22:02:44 GMT
The National Transportation Safety Board is one of the most widely known government bodies in the world. It's their role to run into an incident, secure the scene, and understand everything that happened. Given the important and unpredictable nature of their work, they have an extensive manual that sets out how incidents should be attended to and how the investigation should progress. This session will detail how the NTSB's approach to its work, and the procedure that drives it, is transferable to us as incident responders. We'll talk about the NTSB's pre-incident preparation, incident notification, attending the incident, collecting information in the field, writing up a report, and holding hearings. We'll consistently draw parallels to IT incident management and how to create applicable processes and procedures that mimic those of the NTSB.

Linux Container Basics
/slideshow/linux-container-basics/117450345
Sun, 30 Sep 2018 15:05:24 GMT
A primer on the bare bones of Linux containers, with a look back at similar technologies.

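Not from the deck, but as a minimal companion sketch: containers are assembled from kernel namespaces (plus cgroups and a filesystem image), and procfs already shows which namespaces a process belongs to:

    import os

    # Each entry in /proc/<pid>/ns is a symlink whose target encodes the
    # namespace type and inode, e.g. "pid:[4026531836]". Processes that share
    # a namespace show the same inode; a containerized process shows its own.
    ns_dir = "/proc/self/ns"
    for name in sorted(os.listdir(ns_dir)):
        target = os.readlink(os.path.join(ns_dir, name))
        print(f"{name:12s} -> {target}")

Running the same loop against a process inside a container (by PID) would show different inode numbers for its pid, mount, and network namespaces.
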
Papers We Love Sept. 2018: 007: Democratically Finding The Cause of Packet Drops
/slideshow/papers-we-love-sept-2018-007-democratically-finding-the-cause-of-packet-drops/115690345
Fri, 21 Sep 2018 02:50:14 GMT
Network failures continue to plague datacenter operators, as their symptoms may not have direct correlation with where or why they occur. The paper introduces 007, a lightweight, always-on diagnosis application that can find problematic links and also pinpoint problems for each TCP connection. 007 is completely contained within the end host. During its two-month deployment in a tier-1 datacenter, it detected every problem found by previously deployed monitoring tools while also finding the sources of other problems previously undetected.

What the NTSB teaches us about incident management & postmortems
/slideshow/what-the-ntsb-teaches-us-about-incident-management-postmortems/112948363
Tue, 04 Sep 2018 14:32:38 GMT
The National Transportation Safety Board is one of the most widely known government bodies in the world. It's their role to run into an incident, secure the scene, and understand everything that happened. Given the important and unpredictable nature of their work, they have an extensive manual that sets out how incidents should be attended to and how the investigation should progress. This session will detail how the NTSB's approach to its work, and the procedure that drives it, is transferable to us as incident responders. We'll talk about the NTSB's pre-incident preparation, incident notification, attending the incident, collecting information in the field, writing up a report, and holding hearings. We'll consistently draw parallels to IT incident management and how to create applicable processes and procedures that mimic those of the NTSB.

PyBay 2018: Production-Ready Python Applications
/slideshow/pybay-2018-productionready-python-applications/110576008
Sun, 19 Aug 2018 17:59:50 GMT
In 2016, Susan Fowler released the 'Production-Ready Microservices' book, which sets an industry benchmark for explaining how microservices should be conceived, all the way through to documentation. So how does this translate for Python applications? This session will explore how to expertly deploy your Python microservice to production.

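As one small, hedged illustration of the production-readiness plumbing such a session typically covers (the framework choice and endpoint name are assumptions, not taken from the slides), a Python microservice usually exposes at least a health-check endpoint for load balancers and orchestrators:

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/healthcheck")
    def healthcheck():
        # Liveness/readiness probe; a real service would also verify its
        # critical dependencies (database, caches, downstream APIs) here.
        return jsonify(status="ok"), 200

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)
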
Helping operations top-heavy teams the smart way
/slideshow/helping-operations-topheavy-teams-the-smart-way/97422576
Thu, 17 May 2018 18:42:00 GMT
SRE teams can sometimes run into periods where they have staff burnout, technical debt, or poor reliability. As SREs, we're programmed to keep fighting through the issues, when sometimes it's best to step back, assess the situation, and ask for help to put the team back on a successful path. This talk will discuss three separate experiences where teams needed some extra help to stabilize their services and on-call. We'll discuss how to identify struggling teams, get the right assistance, and build a strategy for the team to succeed.

The Next Wave of Reliability Engineering
/slideshow/the-next-wave-of-reliability-engineering/95795362
Thu, 03 May 2018 07:03:13 GMT
In 2018, Site Reliability Engineering (SRE) will turn 15 years old. Since Google's inception of the term SRE, companies across the world have adopted a new operations mindset along with automation, deployment, and monitoring principles. Most of what SRE does now is well established throughout the industry, so what is the next wave of reliability principles and automation frameworks? This session will dive into what the future holds for reliability engineering as a field and what will be the next areas of investment and improvement for reliability teams.

Building Production-Ready Microservices: DevopsExchangeSF
/slideshow/building-productionready-microservices-devopsexchangesf/95137636
Thu, 26 Apr 2018 17:43:37 GMT
Michael Kehoe talks about what production-ready means, how to build production-ready microservices, and how to measure their readiness.

SF Chaos Engineering Meetup: Building Disaster Recovery via Resilience Engineering
/slideshow/sf-chaos-engineering-meetup-building-disaster-recovery-via-resilience-engineering/92137734
Wed, 28 Mar 2018 05:16:48 GMT
How often have you heard stories where someone thought they had a disaster strategy, never tested it, and it failed when they needed it most? LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from four data centers at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users. As LinkedIn transitioned from big monolithic applications to microservices, it was difficult to determine the capacity constraints of individual services to handle extra load during disaster scenarios. Stress testing individual services using artificial load in a complex microservices architecture wasn't sufficient to provide enough confidence in data center capacity. To solve this problem, LinkedIn stresses services site-wide with live traffic by shifting traffic between data centers to simulate a disaster every business day!

SRECon-Europe-2017: Reducing MTTR and False Escalations: Event Correlation at LinkedIn
/slideshow/sreconeurope2017-reducing-mttr-and-false-escalations-event-correlation-at-linkedin/79328388
Thu, 31 Aug 2017 16:29:58 GMT
LinkedIn's production stack is made up of over 900 applications, 2,200 internal APIs, and hundreds of databases. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner. To combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service. We'll discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide a better quality of life to LinkedIn's on-call engineers.

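The correlation engine itself is not shown in this feed; purely as an illustrative sketch of the idea (service names and the escalation rule are assumptions, not LinkedIn's implementation), dependency-aware escalation can be reduced to "page the owners of unhealthy services whose own dependencies are all healthy":

    # Map each service to its downstream dependencies.
    deps = {
        "frontend": ["profile-api", "feed-api"],
        "profile-api": ["profile-db"],
        "feed-api": ["feed-db"],
    }

    def escalation_targets(deps, unhealthy):
        targets = set()
        for svc in unhealthy:
            # A service whose dependencies are all healthy cannot blame anything
            # downstream, so its owning SRE team is the right escalation point.
            if not any(dep in unhealthy for dep in deps.get(svc, [])):
                targets.add(svc)
        return targets

    print(escalation_targets(deps, unhealthy={"frontend", "profile-api", "profile-db"}))
    # -> {'profile-db'}: the frontend and profile-api alerts are suppressed.
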
SRECon-Europe-2017: Networks for SREs
/slideshow/sreconeurope2017-networks-for-sres/79287521
Wed, 30 Aug 2017 13:30:40 GMT
All of us depend on the underlying network to be stable, whether in the data center or in the cloud. We all have a basic knowledge of how traditional networks run; however, in the past 10 years we've moved to building redundant physical topologies in our networks, optimized the routing methodologies accordingly, moved into the cloud, and gained greater visibility and more tunables in the Linux kernel network stack. A lot has changed! However, the way we troubleshoot the network in relation to the applications we support hasn't adapted. In this session, we'll review the progress that network infrastructure has made, look at specific examples where traditional troubleshooting responses fail us, and demonstrate our need to rethink our approach to making applications and the network interact harmoniously.

Velocity San Jose 2017: Traffic shifts: Avoiding disasters at scale
/slideshow/velocity-san-jose-2017-traffic-shifts-avoiding-disasters-at-scale/77188294
Thu, 22 Jun 2017 20:56:11 GMT
LinkedIn has evolved from serving live traffic out of one data center to four data centers spread geographically. Serving live traffic from four data centers at the same time has taken the company from a disaster recovery model to a disaster avoidance model, where an unhealthy data center can be taken out of rotation and its traffic redistributed to the healthy data centers within minutes, with virtually no visible impact to users. As LinkedIn transitioned from big monolithic applications to microservices, it was difficult to determine the capacity constraints of individual services to handle extra load during disaster scenarios. Stress testing individual services using artificial load in a complex microservices architecture wasn't sufficient to provide enough confidence in data center capacity. To solve this problem, LinkedIn leverages live traffic to stress services site-wide by shifting traffic to simulate a disaster load. Michael Kehoe and Anil Mallapur discuss how LinkedIn uses traffic shifts to mitigate user impact by migrating live traffic between its data centers and to stress test site-wide services for improved capacity handling and member experience.

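As a hedged, back-of-the-envelope sketch of the redistribution step described above (site names and the proportional policy are assumptions, not LinkedIn's tooling):

    def redistribute(weights, unhealthy):
        # Zero out unhealthy sites and renormalize so the healthy ones absorb
        # the load proportionally (assumes at least one site stays healthy).
        healthy_total = sum(w for dc, w in weights.items() if dc not in unhealthy)
        return {dc: (0.0 if dc in unhealthy else w / healthy_total)
                for dc, w in weights.items()}

    # Four sites each carrying 25% of traffic; one is taken out of rotation.
    weights = {"dc-a": 0.25, "dc-b": 0.25, "dc-c": 0.25, "dc-d": 0.25}
    print(redistribute(weights, unhealthy={"dc-b"}))
    # -> dc-a, dc-c, dc-d each get ~0.333; dc-b gets 0.0

In practice the shift is gradual and monitored, but taking one site's share to zero and spreading it across the healthy sites is the arithmetic core of the disaster-avoidance model the abstract describes.
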
Reducing MTTR and False Escalations: Event Correlation at LinkedIn
/slideshow/reducing-mttr-and-false-escalations-event-correlation-at-linkedin-73177586/73177586
Wed, 15 Mar 2017 15:09:01 GMT
LinkedIn's production stack is made up of over 900 applications and over 2,200 internal APIs. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner. To combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service. We'll discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide a better quality of life to LinkedIn's on-call engineers.

LinkedIns production stack is made up of over 900 applications and over 2200 internal APIs. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner. In order to combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service. Well discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide better quality of life to LinkedIns oncall engineers.]]>
Wed, 15 Mar 2017 15:09:01 GMT /slideshow/reducing-mttr-and-false-escalations-event-correlation-at-linkedin-73177586/73177586 MichaelKehoe3@slideshare.net(MichaelKehoe3) Reducing MTTR and False Escalations: Event Correlation at LinkedIn MichaelKehoe3 LinkedIns production stack is made up of over 900 applications and over 2200 internal APIs. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner. In order to combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service. Well discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide better quality of life to LinkedIns oncall engineers. <img style="border:1px solid #C3E6D8;float:right;" alt="" src="https://cdn.slidesharecdn.com/ss_thumbnails/eventcorrelationsrecon17americas-170315150901-thumbnail.jpg?width=120&amp;height=120&amp;fit=bounds" /><br> LinkedIns production stack is made up of over 900 applications and over 2200 internal APIs. With any given application having many interconnected pieces, it is difficult to escalate to the right person in a timely manner. In order to combat this, LinkedIn built an Event Correlation Engine that monitors service health and maps dependencies between services to correctly escalate to the SREs who own the unhealthy service. Well discuss the approach we used in building a correlation engine and how it has been used at LinkedIn to reduce incident impact and provide better quality of life to LinkedIns oncall engineers.
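The correlation approach in the abstract above — follow service-to-service dependencies from the alerting service toward the unhealthy component that is actually responsible, and escalate to that owner — can be illustrated roughly as follows. The dependency graph, health data, and ownership mapping are all hypothetical, not LinkedIn's engine:

```python
from typing import Dict, List, Set

# Hypothetical dependency graph: service -> services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "frontend": ["profile-api", "feed-api"],
    "profile-api": ["profile-db"],
    "feed-api": ["profile-api", "feed-store"],
}
UNHEALTHY: Set[str] = {"frontend", "profile-api", "profile-db"}  # from monitoring
OWNER: Dict[str, str] = {
    "profile-db": "data-sre",
    "profile-api": "profile-sre",
    "frontend": "edge-sre",
}

def escalation_target(alerting_service: str) -> str:
    """Escalate to the owner of the most downstream unhealthy dependency,
    rather than paging every team whose service merely looks unhealthy
    because something it depends on is down."""
    for dep in DEPENDS_ON.get(alerting_service, []):
        if dep in UNHEALTHY:
            return escalation_target(dep)  # the root cause is further downstream
    return OWNER[alerting_service]         # no unhealthy dependencies: page this owner

print(escalation_target("frontend"))  # -> "data-sre", not the frontend on-call
```

The goal of the real engine is the same: fewer false escalations and a faster path to the team that can actually fix the incident.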
APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale /slideshow/apricot-2017-trafficshifting-avoiding-disasters-improving-performance-at-scale-72682406/72682406 apricotmichaelkehoelinkedin-170301060923
LinkedIn serves traffic for its 467 million members from four data centers and multiple PoPs spread geographically around the world. Serving live traffic from many places at the same time has taken us from a disaster recovery model to a disaster avoidance model, where we can take an unhealthy data center or PoP out of rotation and redistribute its traffic to a healthy one within minutes, with virtually no visible impact to users. The geographical distribution of our infrastructure also allows us to optimize the end user's experience by geo-routing users to the best possible PoP and data center. This talk provides details on how LinkedIn shifts traffic between its PoPs and data centers to deliver the best possible performance and availability for its members. We will also touch on the complexities of performance in APAC, how IPv6 is helping our members, and how LinkedIn stress tests its data centers to verify its disaster recovery capabilities. (A rough sketch of the PoP-selection idea follows this entry.)
Wed, 01 Mar 2017 06:09:23 GMT /slideshow/apricot-2017-trafficshifting-avoiding-disasters-improving-performance-at-scale-72682406/72682406 MichaelKehoe3@slideshare.net(MichaelKehoe3) APRICOT 2017: Trafficshifting: Avoiding Disasters & Improving Performance at Scale MichaelKehoe3
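The geo-routing decision described above can be pictured as: among the PoPs currently in rotation, send the member to the one with the best measured latency for their region, and fail over automatically when a PoP is drained. A minimal sketch with hypothetical PoP names and latency figures (not LinkedIn's routing logic):

```python
from typing import Dict

# Hypothetical round-trip times (ms) from a member's region to each PoP.
RTT_MS: Dict[str, Dict[str, float]] = {
    "ap-southeast": {"pop-sin": 35.0, "pop-hkg": 55.0, "pop-sjc": 160.0},
    "us-west": {"pop-sjc": 12.0, "pop-hkg": 150.0, "pop-sin": 170.0},
}
IN_ROTATION = {"pop-sin", "pop-sjc"}  # say pop-hkg has been drained for maintenance

def pick_pop(region: str) -> str:
    """Route the member to the lowest-latency PoP that is still in rotation."""
    candidates = {pop: rtt for pop, rtt in RTT_MS[region].items() if pop in IN_ROTATION}
    if not candidates:
        raise RuntimeError(f"no PoP in rotation for region {region}")
    return min(candidates, key=candidates.get)

print(pick_pop("ap-southeast"))  # -> "pop-sin"; if it were drained too, traffic fails over to "pop-sjc"
```

Draining a PoP is then just removing it from the in-rotation set: members are re-routed to the next-best healthy PoP within minutes, which is the disaster-avoidance behaviour the abstract describes.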
Couchbase Connect 2016: Monitoring Production Deployments The Tools LinkedIn /slideshow/couchbase-connect-2016-monitoring-production-deployments-the-tools-linkedin/68521788 connect2016-monitoring-final-161109221346
Good monitoring can be the difference between a great night's sleep and hearing your phone go off at 2:37 a.m. because of a production outage. Couchbase Server provides a large number of metrics, which can be overwhelming if you do not know the critical things to focus on or how to expose that information to your monitoring system. In this talk we will look at example production incidents, go in depth on specific things to monitor, and show how this information can be used to find issues, work out root cause, and discover trends. (A rough threshold-alerting sketch follows this entry.)
Wed, 09 Nov 2016 22:13:45 GMT /slideshow/couchbase-connect-2016-monitoring-production-deployments-the-tools-linkedin/68521788 MichaelKehoe3@slideshare.net(MichaelKehoe3) Couchbase Connect 2016: Monitoring Production Deployments The Tools LinkedIn MichaelKehoe3
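The monitoring advice above comes down to choosing a small set of critical metrics and exposing them to your alerting system with sensible thresholds. A minimal threshold-check sketch; the metric names and limits are hypothetical and not a recommended Couchbase configuration:

```python
from typing import Dict, List

# Hypothetical per-node metrics and alerting thresholds.
THRESHOLDS: Dict[str, float] = {
    "memory_used_pct": 90.0,        # node approaching its bucket memory quota
    "disk_write_queue": 1_000_000,  # items waiting to be persisted to disk
    "replication_lag_s": 30.0,      # seconds behind on intra-cluster replication
}

def evaluate(metrics: Dict[str, float]) -> List[str]:
    """Return a human-readable alert for every metric over its threshold."""
    return [
        f"{name}={value} exceeds threshold {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

snapshot = {"memory_used_pct": 93.2, "disk_write_queue": 120_000, "replication_lag_s": 2.0}
for alert in evaluate(snapshot):
    print(alert)  # in production this would be routed to the paging/monitoring system
```

How the snapshot is collected is deliberately left out of this sketch; the value is in knowing which handful of metrics to watch and wiring them into the alerting pipeline you already trust.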
I am an engineer who prides himself on architecting reliable, scalable infrastructure. I specialise in maintaining large-scale system infrastructure, as demonstrated by my work at LinkedIn (applications) and at The University of Queensland (networks). I possess high-level skills in maintaining Linux and Windows servers and their respective infrastructure services, and my interpersonal and communication skills allow me to work with clients and colleagues in a professional manner. michael-kehoe.io