Erlang at Facebook
Eugene Letuchy
Apr 30, 2009

Agenda
1. Facebook ... and Erlang
2. Story of Facebook Chat
3. Facebook Chat Architecture
4. Key Erlang Features
5. Then and Now

Facebook ... and Erlang

The Facebook Environment
• The Site
  • More than 200 million active users
  • More than 3.5 billion minutes are spent on Facebook each day
  • Fewer than 900 employees
• The Engineering Team
  • Fast iteration: code gets out to production within a week
  • Polyglot programming: interoperability is key
  • Practical: high-leverage tools win

Erlang Projects
• Chat: the biggest and best-known user
• AIM Presence: a JSONP validator
• Chat Jabber support (ejabberd)

Facebook Chat

2007: Facebook needs Chat
Messages, Wall, Links aren't enough

Enter a Hackathon (Jan 2007)
• Chat started in one night of coding
• Floating conversation windows
• No buddy list
• One server (no distribution)
• Erlang was there!

Enter Eugene (Feb 2007)
• I joined Facebook after the Chat Hackathon
• What is this Erlang?
• Spring 2007:
  • Learning Erlang from Joe Armstrong's thesis
  • Lots of prototyping
  • Evaluating infrastructure needs
• Summer 2007:
  • Chris Piro works on Erlang Thrift bindings

Let's do this!
• Mid-Fall 2007: Chat becomes a "real" project
  • 4 engineers, 0.5 designer
  • Infrastructure components get built and improved
• Feb 2008: "Dark launch" testing begins
  • Simulates load on the Erlang servers ... they hold up
• Apr 6, 2008: First real Chat message sent
• Apr 23, 2008: 100% rollout (Facebook has 70M users at the time)

Launch: April 2008
• Apr 6, 2008: gradual live rollout starts
  • First message: "msn chat?"
• Apr 23, 2008: 100% rollout (to Facebook's 70M users)
• Graph of sends in the first days of launch
  [Graph: millions of sends per hour (y-axis, 0 to 15) over time (x-axis, Tue 00:00 to Wed 12:00)]

Chat ... one year later
• Facebook has 200M active users
• 800+ million user messages / day
• 7+ million active channels at peak
• 1GB+ in / sec at peak
• 100+ channel machines
• ~9-10 times the work at launch; ~2 times as many machines

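A quick back-of-envelope reading of these figures, as a sketch only. The inputs are the slide's own numbers; the per-second average assumes traffic spread evenly over 24 hours, which real chat traffic is not, and the variable names are illustrative.

```python
# Back-of-envelope arithmetic using only the figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 800_000_000  # "800+ million user messages / day" (a lower bound)
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY

work_growth = (9, 10)   # "~9-10 times the work at launch"
machine_growth = 2      # about twice as many machines as at launch
per_machine_growth = tuple(w / machine_growth for w in work_growth)

print(f"average messages/sec: {avg_messages_per_sec:,.0f}")
print(f"work per machine vs. launch: {per_machine_growth[0]}x to {per_machine_growth[1]}x")
```

Under these assumptions each machine handles roughly 4.5 to 5 times the work it did at launch, with an average (not peak) rate of a little over 9,000 messages per second.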
Chat Architecture

System challenges
• How does synchronous messaging work on the Web?
• "Presence" is hard to scale
• Need a system to queue and deliver messages
• Millions of connections, mostly idle
• Need logging, at least between page loads
• Make it work in Facebook's environment

System overview
[Diagram]

System overview - User Interface
Chat in the browser?
• Chat bar affixed to the bottom of each Facebook page
• Mix of client-side JavaScript and server-side PHP
  • Works around transport errors, browser differences
• Regular AJAX for sending messages, fetching conversation history
• Periodic AJAX polling for list of online friends
• AJAX long-polling for messages (Comet)

System overview - Back End
How does the back end service requests?
• Discrete responsibilities for each service
  • Communicate via Thrift
• Channel (Erlang): message queuing and delivery
  • Queue messages in each user's "channel"
  • Deliver messages as responses to long-polling HTTP requests
• Presence (C++): aggregates online info in memory (pull-based presence)
• Chatlogger (C++): stores conversations between page loads
• Web tier (PHP): serves our vanilla web requests

System overview
[Diagram]

Message send
[Diagram: "Me" sends "Lunch?", which then appears in Eugene's window. Step labels: 1 - AJAX; 2a - Thrift; 2b - Thrift; 3 - long poll]

Channel servers (Erlang)

Channel servers
Architectural overview
• One channel per user
  • Web tier delivers messages for that user
  • Channel State: short queue of sequenced messages
• Long poll for streaming (Comet)
  • Clients make an HTTP request
  • Server replies when a message is ready
  • One active request per browser tab

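The channel idea above can be sketched in a few lines. This is a Python illustration, not the Erlang channel server: the class name `Channel`, the `max_len` bound, sequence numbers starting at 0, and threads standing in for Erlang processes are all assumptions made for the example.

```python
import threading
from collections import deque


class Channel:
    """One user's channel: a short queue of sequenced messages (illustrative)."""

    def __init__(self, max_len=50):
        self._messages = deque(maxlen=max_len)  # oldest entries fall off the end
        self._next_seq = 0
        self._cond = threading.Condition()

    def publish(self, text):
        """Queue a message for this user and wake any parked long polls."""
        with self._cond:
            seq = self._next_seq
            self._next_seq += 1
            self._messages.append((seq, text))
            self._cond.notify_all()  # one parked request per browser tab
            return seq

    def long_poll(self, after_seq, timeout):
        """Block until a message newer than after_seq exists, or time out.

        Returns a list of (seq, text) pairs; the list is empty on timeout.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._next_seq > after_seq + 1, timeout)
            return [(s, t) for (s, t) in self._messages if s > after_seq]


channel = Channel()
# Simulate the web tier delivering a message shortly after the client polls.
threading.Timer(0.05, channel.publish, args=("Lunch?",)).start()
print(channel.long_poll(after_seq=-1, timeout=2.0))  # returns once the message arrives
print(channel.long_poll(after_seq=0, timeout=0.05))  # nothing newer, so it times out
```

The sequence numbers let a client that reconnects ask only for what it has not yet seen, which is why a short queue per user is enough.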
[Diagram: channel application. Labels: messages, authentication, online list, messages]

Channel servers
Architectural details
• Distributed design
  • User id space is partitioned (division of labor)
  • Each partition is serviced by a cluster (availability)
• Presence aggregation
  • Channel servers are authoritative
  • Periodically shipped to presence servers
• Open source: Erlang, Mochiweb, Thrift, Scribe, fb303, et al.

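A minimal sketch of partitioned routing, under stated assumptions: the slides do not say how the user id space is split, so the modulo scheme, the `CLUSTERS` table and its node names, and the uniform random pick are all hypothetical.

```python
import random

# Hypothetical layout: three partitions, each served by a small cluster of nodes.
CLUSTERS = [
    ["channel-a1", "channel-a2"],
    ["channel-b1", "channel-b2"],
    ["channel-c1", "channel-c2"],
]


def partition_for(user_id, num_partitions=len(CLUSTERS)):
    """Map a user id to a partition (assumed scheme: simple modulo)."""
    return user_id % num_partitions


def pick_node(user_id, rng=random):
    """Pick any node in the cluster that owns this user's partition."""
    cluster = CLUSTERS[partition_for(user_id)]
    return rng.choice(cluster)  # naive balancing: uniform random choice


print(partition_for(7))  # partition index for user 7
print(pick_node(7))      # some node from that partition's cluster
```

Partitioning divides the labor, while having several nodes per partition means one node going down does not take a slice of users offline.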
Key Erlang Features we love

Concurrency
• Cheap parallelism at massive scale
• Simplifies modeling concurrent interactions
  • Chat users are independent and concurrent
  • Mapping onto traditional OS threads is unnatural
• Locality of reference
• Bonus: carries over to non-Erlang concurrent programming

Distribution
• Connected network of nodes
  • Remote processes look like local processes
• Any node in a channel server cluster can route requests
  • Naive load balancing
• Distributed Erlang works out-of-the-box (all nodes are trusted)

Fault Isolation
• Bugs in the initial versions of Chat:
  • Process leaks in the Thrift bindings
  • Unintended multicasting of messages
  • Bad return state for presence aggregators
• (Horrible) bugs don't kill a mostly functional system:
  • C/C++ segfault takes down the OS process and your server state
  • Erlang badmatch takes down an Erlang process
  • ... and notifies linked processes

Error logging (Crash Reports)
• Any proc_lib-compliant process generates crash reports
• Error reports can be handled out of band (not where generated)
• Stacktraces point the way to bugs (functional languages win big here)
  • ... but they could be improved with source line numbers
• Writing error_logger handlers is simple:
  • gen_event behavior
  • Allows for massaging of the crash and error messages (binaries!)
  • Thrift client in the error log
• WARNING: error logging can OOM the Erlang node

Hot code swapping
• Restart-free upgrades are awesome (!)
  • Pushing new functional code for Chat takes ~20 seconds
  • No state is lost
• Test on a running system
• Provides a safety net ... rolling back bad code is easy
• NOTE: we don't use the OTP release/upgrade strategies

Monitoring and Error Recovery
• Supervision hierarchies
  • Organize (and control) processes
  • Organize thoughts
  • Systematize restarts and error recovery
  • simple_one_for_one for dynamic child processes
• net_kernel (Distributed Erlang)
  • sends nodedown, nodeup messages
  • any process can subscribe
• heart: monitors and restarts the OS process

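The "systematize restarts" idea can be illustrated outside Erlang with a restart-on-crash loop. This is a loose Python analogy, not an OTP supervisor: the function `supervise` and its `max_restarts` budget are hypothetical, and real supervisors also bound restarts within a time window, which this sketch leaves out.

```python
def supervise(child, max_restarts=3):
    """Run child(); restart it after a crash, up to max_restarts times.

    Returns (result, restarts_used). When the restart budget is exhausted,
    the last error is re-raised, escalating the failure to the caller.
    """
    restarts = 0
    while True:
        try:
            return child(), restarts
        except Exception as exc:
            if restarts >= max_restarts:
                raise
            restarts += 1
            print(f"child crashed ({exc!r}); restart {restarts}/{max_restarts}")


attempts = {"n": 0}


def flaky():
    """Illustrative child that fails twice and then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ValueError("badmatch")
    return "ok"


print(supervise(flaky))  # two restarts are logged, then the result is returned
```

Keeping the restart policy in one place, rather than scattering retry logic through the workers, is the part of the design this is meant to show.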
Remote Shell
• To invoke:
    > erl -name hidden -hidden -remsh <node_name> -setcookie <cookie>
    Eshell V5.7.1 (abort with ^G)
    (<node_name>)1>
• Ad-hoc inspection of a running node
• Command-and-control from a console
• Combines with hot code loading

Erlang top (etop)
• Shows Erlang processes, sorted by reductions, memory and message queue
• OS functionality ... for free

Hibernation
• Drastically shrink memory usage with erlang:hibernate/3
  • Throws away the call stack
  • Minimizes the heap
  • Enters a wait state for new messages
  • "Jumps" into a passed-in function for a received message
• Perfect for a long-running, idling HTTP request handler
• But ... not compatible with gen_server:call (and gen_server:reply)
  • gen_server:call has its own receive() loop
  • hibernate() doesn't support an explicit timeout
  • Fixed with a few hours and a look at gen.erl

Symmetric MultiProcessing (SMP)
• Take advantage of multi-core servers
• erl -smp runs multiple scheduler threads inside the node
• SMP is emphasized in recent Erlang development
  • Added to Erlang R11B
  • Erlang R12B-0 through R13B include fixes and perf boosts
• Smart people have been optimizing our code for a year (!)
  • Upgraded to R13B last night with about 1/3 less load

hipe_bifs
Cheating single assignment
• Erlang is opinionated:
  • Destructive assignment is hard because it should be
• hipe_bifs:bytearray_update() allows for destructive array assignment
  • Necessary for aggregating Chat users' presence
• Don't tell anyone!

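Why in-place updates matter for presence can be shown with a small Python sketch. It does not reproduce hipe_bifs: the one-byte-per-user layout, the `ONLINE`/`OFFLINE` values, and the function names are assumptions for illustration only.

```python
ONLINE, OFFLINE = 1, 0


def make_presence_table(num_users):
    """One byte per user id in a (hypothetical) partition; everyone starts offline."""
    return bytearray(num_users)


def set_presence(table, user_id, state):
    """Destructive update: overwrites a single byte in place, with no copy."""
    table[user_id] = state


def online_friends(table, friend_ids):
    """Pull-based lookup: which of these friends are currently online?"""
    return [uid for uid in friend_ids if table[uid] == ONLINE]


table = make_presence_table(1_000_000)
set_presence(table, 42, ONLINE)
set_presence(table, 7, ONLINE)
set_presence(table, 7, OFFLINE)
print(online_friends(table, [7, 42, 99]))  # [42]
```

With an immutable structure, each status flip would build a modified copy; a mutable byte array keeps a frequently changing table for millions of users cheap to update.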
Then and now: Erlang in Progress

Then ... a steep learning curve
• Start of 2007:
  • Few industry-focused English-language resources
  • Few blogs (outside of Yariv's and Joel Reymont's)
  • Code examples spread out and disorganized
  • U.S. Erlang community limited in number and visibility

Now ...
• Programming Erlang (Jun 2007)
• Erlang Programming (upcoming...)
• More blogs and blog aggregators:
  • Planet Erlang, Planet TrapExit
• Erlang Factory aggregates Erlang developments
• More code available:
  • GitHub, CEAN
  • More general-purpose Open Source Libraries
• U.S.-located conference and ErlLounges

(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
