This document discusses Facebook's use of Erlang in developing its chat feature. It describes how Facebook had over 200 million active users in 2009 who spent over 3.5 billion minutes on the site daily. It needed a chat feature and initially developed one in a hackathon using Erlang due to its concurrency and distribution capabilities. Over time, the chat architecture was developed further using Erlang's channel servers to queue and deliver messages across many nodes. The document discusses key Erlang features like concurrency, distribution, fault isolation, and hot code swapping that helped Facebook scale the chat feature to support over 800 million messages per day.
5. The Facebook Environment
• The Site
  • More than 200 million active users
  • More than 3.5 billion minutes are spent on Facebook each day
  • Fewer than 900 employees
• The Engineering Team
  • Fast iteration: code gets out to production within a week
  • Polyglot programming: interoperability is key
  • Practical: high-leverage tools win
6. Erlang Projects
• Chat: the biggest and best-known user
• AIM Presence: a JSONP validator
• Chat Jabber support (ejabberd)
9. Enter a Hackathon (Jan 2007)
• Chat started in one night of coding
• Floating conversation windows
• No buddy list
• One server (no distribution)
• Erlang was there!
10. Enter Eugene (Feb 2007)
• I joined Facebook after the Chat Hackathon
• What is this Erlang?
• Spring 2007:
  • Learning Erlang from Joe Armstrong's thesis
  • Lots of prototyping
  • Evaluating infrastructure needs
• Summer 2007:
  • Chris Piro works on Erlang Thrift bindings
11. Let's do this!
• Mid-Fall 2007: Chat becomes a "real" project
  • 4 engineers, 0.5 designer
  • Infrastructure components get built and improved
• Feb 2008: "Dark launch" testing begins
  • Simulates load on the Erlang servers ... they hold up
• Apr 6, 2008: First real Chat message sent
• Apr 23, 2008: 100% rollout (Facebook has 70M users at the time)
12. Launch: April 2008
• Apr 6, 2008: gradual live rollout starts
  • First message: "msn chat?"
• Apr 23, 2008: 100% rollout (to Facebook's 70M users)
• Graph of sends in the first days of launch
  [Figure: millions of sends per hour (y-axis 0 to 15), Tuesday 00:00 through Wednesday 12:00]
13. Chat ... one year later
• Facebook has 200M active users
• 800+ million user messages / day
• 7+ million active channels at peak
• 1GB+ in / sec at peak
• 100+ channel machines
• ~9-10 times the work at launch; ~2 times as many machines
15. System challenges
• How does synchronous messaging work on the Web?
• "Presence" is hard to scale
• Need a system to queue and deliver messages
• Millions of connections, mostly idle
• Need logging, at least between page loads
• Make it work in Facebook's environment
17. System overview - User Interface
Chat in the browser?
• Chat bar affixed to the bottom of each Facebook page
• Mix of client-side JavaScript and server-side PHP
• Works around transport errors, browser differences
• Regular AJAX for sending messages, fetching conversation history
• Periodic AJAX polling for list of online friends
• AJAX long-polling for messages (Comet)
18. System Overview - Back End
How does the back end service requests?
• Discrete responsibilities for each service
  • Communicate via Thrift
• Channel (Erlang): message queuing and delivery
  • Queue messages in each user's "channel"
  • Deliver messages as responses to long-polling HTTP requests
• Presence (C++): aggregates online info in memory (pull-based presence)
• Chatlogger (C++): stores conversations between page loads
• Web tier (PHP): serves our vanilla web requests
22. Channel servers
Architectural overview
• One channel per user
  • Web tier delivers messages for that user
  • Channel State: short queue of sequenced messages
• Long poll for streaming (Comet)
  • Clients make an HTTP request
  • Server replies when a message is ready
  • One active request per browser tab
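The channel described above (a short queue of sequenced messages plus a parked long-poll request) can be sketched as a single Erlang process. This is a minimal illustration, not Facebook's channel server: the module name `channel_sketch`, the message shapes, and the queue cap of 20 are all assumptions made for the example.

```erlang
%% channel_sketch.erl
%% Hypothetical sketch of one user's channel. All names and the queue cap
%% are assumptions; the slides only say "short queue of sequenced messages".
-module(channel_sketch).
-export([start/0, send/2, poll/3]).

-define(MAX_QUEUE, 20).

%% Start a channel process for one user.
start() ->
    spawn(fun() -> loop(1, [], none) end).

%% Called on behalf of the web tier: enqueue a message for this user.
send(Chan, Text) ->
    Chan ! {send, Text},
    ok.

%% Called by a long-poll request: wait until a message with sequence
%% number >= Seq exists, or give up after Timeout milliseconds.
%% (A reply racing with the timeout may be left in the caller's mailbox;
%% a real handler would have to deal with that.)
poll(Chan, Seq, Timeout) ->
    Ref = make_ref(),
    Chan ! {poll, self(), Ref, Seq},
    receive
        {Ref, Msgs} -> {ok, Msgs}
    after Timeout ->
        Chan ! {cancel, Ref},
        timeout
    end.

%% State: next sequence number, queue (oldest first), parked poller.
loop(NextSeq, Queue, Waiter) ->
    receive
        {send, Text} ->
            Queue1 = trim(Queue ++ [{NextSeq, Text}]),
            loop(NextSeq + 1, Queue1, maybe_reply(Waiter, Queue1));
        {poll, Pid, Ref, Seq} ->
            %% A newer poll replaces any older parked one.
            loop(NextSeq, Queue, maybe_reply({Pid, Ref, Seq}, Queue));
        {cancel, Ref} ->
            case Waiter of
                {_, Ref, _} -> loop(NextSeq, Queue, none);
                _ -> loop(NextSeq, Queue, Waiter)
            end
    end.

%% Reply to the parked poller if the queue holds anything it has not seen.
maybe_reply(none, _Queue) ->
    none;
maybe_reply({Pid, Ref, Seq} = Waiter, Queue) ->
    case [M || {S, _} = M <- Queue, S >= Seq] of
        [] -> Waiter;
        Msgs -> Pid ! {Ref, Msgs}, none
    end.

%% Keep the queue short by dropping the oldest entry.
trim(Queue) when length(Queue) > ?MAX_QUEUE -> tl(Queue);
trim(Queue) -> Queue.
```

In this sketch the client passes the next sequence number it expects, so a reconnecting long poll picks up whatever arrived between requests.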
24. Channel servers
Architectural details
• Distributed design
  • User id space is partitioned (division of labor)
  • Each partition is serviced by a cluster (availability)
• Presence aggregation
  • Channel servers are authoritative
  • Periodically shipped to presence servers
• Open source: Erlang, Mochiweb, Thrift, Scribe, fb303, et al.
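One way to picture "user id space is partitioned, each partition is serviced by a cluster" is a two-step lookup: user id to partition, then partition to a node in that partition's cluster. The slides do not say how the partitioning was actually done, so the modulo scheme, the hash-based node choice, and the `Clusters` map below are assumptions for illustration only.

```erlang
%% partition_sketch.erl
%% Hypothetical mapping from user id to partition to node.
-module(partition_sketch).
-export([partition/2, node_for/2]).

%% Map a non-negative user id onto one of NumPartitions partitions
%% (assumed scheme: plain modulo).
partition(UserId, NumPartitions)
  when is_integer(UserId), UserId >= 0,
       is_integer(NumPartitions), NumPartitions > 0 ->
    UserId rem NumPartitions.

%% Pick a node inside the partition's cluster. Clusters is an assumed
%% map from partition index (0..N-1) to a non-empty list of node names.
node_for(UserId, Clusters) ->
    Nodes = maps:get(partition(UserId, map_size(Clusters)), Clusters),
    lists:nth(1 + erlang:phash2(UserId, length(Nodes)), Nodes).
```

Because every node in a cluster can serve the partition, losing one node only shrinks the list for that partition rather than making its users unreachable.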
26. Concurrency
• Cheap parallelism at massive scale
• Simplifies modeling concurrent interactions
  • Chat users are independent and concurrent
  • Mapping onto traditional OS threads is unnatural
• Locality of reference
• Bonus: carries over to non-Erlang concurrent programming
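The "one lightweight process per independent user" idea can be tried directly. The sketch below spawns one process per simulated user, sends each a message, and counts the acknowledgements; the module and message names are made up for the example.

```erlang
%% users_sketch.erl
%% Hypothetical demo: one Erlang process per simulated chat user.
-module(users_sketch).
-export([run/1]).

%% Spawn N user processes, send each one a message, and return how
%% many acknowledged delivery.
run(N) ->
    Parent = self(),
    Pids = [spawn(fun() -> user_loop(Parent, Id) end)
            || Id <- lists:seq(1, N)],
    [Pid ! {msg, <<"hello">>} || Pid <- Pids],
    collect(N, 0).

%% Each user process waits for one message, acknowledges it, and exits.
user_loop(Parent, Id) ->
    receive
        {msg, _Text} -> Parent ! {delivered, Id}
    end.

%% Gather N acknowledgements.
collect(0, Acc) -> Acc;
collect(N, Acc) ->
    receive
        {delivered, _Id} -> collect(N - 1, Acc + 1)
    end.
```

Calling `users_sketch:run(10000)` creates ten thousand processes, which is far more than is comfortable with one OS thread per user.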
27. Distribution
• Connected network of nodes
• Remote processes look like local processes
  • Any node in a channel server cluster can route requests
  • Naive load balancing
• Distributed Erlang works out-of-the-box (all nodes are trusted)
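"Remote processes look like local processes" means the same send syntax works for either. In the sketch below, a request goes to a registered process addressed as `{Name, Node}`; passing `node()` keeps everything local, while passing the name of a connected node would route the request there. The registered name `route_sketch` and the reply shape are assumptions for the example.

```erlang
%% route_sketch.erl
%% Hypothetical router showing location-transparent message sends.
-module(route_sketch).
-export([start/0, route/3]).

%% Start and register the router on the current node.
start() ->
    Pid = spawn(fun loop/0),
    true = register(route_sketch, Pid),
    Pid.

%% Ask the router on Node to handle UserId. The code is identical
%% whether Node is this node or a remote, connected one.
route(Node, UserId, Timeout) ->
    Ref = make_ref(),
    {route_sketch, Node} ! {route, self(), Ref, UserId},
    receive
        {Ref, Reply} -> Reply
    after Timeout ->
        timeout
    end.

%% Reply with the name of the node that actually served the request.
loop() ->
    receive
        {route, From, Ref, UserId} ->
            From ! {Ref, {handled_by, node(), UserId}},
            loop()
    end.
```

Any node that knows the registered name can forward a request this way, which is what makes naive load balancing across a cluster straightforward.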
28. Fault Isolation
• Bugs in the initial versions of Chat:
  • Process leaks in the Thrift bindings
  • Unintended multicasting of messages
  • Bad return state for presence aggregators
• (Horrible) bugs don't kill a mostly functional system:
  • C/C++ segfault takes down the OS process and your server state
  • Erlang badmatch takes down an Erlang process
  • ... and notifies linked processes
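The badmatch behavior above is easy to reproduce. In this sketch a linked worker hits a failed pattern match and dies, while the parent, which traps exits, simply receives a notification and carries on. The module, the `lookup/2` helper, and the return values are invented for the example.

```erlang
%% isolation_sketch.erl
%% Hypothetical demo: a badmatch kills one process and notifies its link.
-module(isolation_sketch).
-export([run/0]).

run() ->
    %% Trap exits so the worker's crash arrives as a message.
    Old = process_flag(trap_exit, true),
    Pid = spawn_link(fun() ->
                             %% Bug: assumes the key is always present.
                             {ok, _Value} = lookup(missing, []),
                             ok
                     end),
    Result = receive
                 {'EXIT', Pid, {{badmatch, error}, _Stack}} ->
                     {worker_crashed, badmatch};
                 {'EXIT', Pid, Other} ->
                     {worker_exited, Other}
             after 1000 ->
                 no_exit_seen
             end,
    process_flag(trap_exit, Old),
    Result.

%% Returns {ok, Value} or the atom error.
lookup(Key, Table) ->
    case lists:keyfind(Key, 1, Table) of
        {Key, Value} -> {ok, Value};
        false -> error
    end.
```

The calling process survives and keeps its own state; only the worker is lost, and the runtime also prints an error report for the crash.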
29. Error logging (Crash Reports)
• Any proc_lib-compliant process generates crash reports
• Error reports can be handled out of band (not where generated)
• Stacktraces point the way to bugs (functional languages win big here)
  • ... but they could be improved with source line numbers
• Writing error_log handlers is simple:
  • gen_event behavior
  • Allows for massaging of the crash and error messages (binaries!)
  • Thrift client in the error log
• WARNING: error logging can OOM the Erlang node
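A handler of the kind described above is an ordinary gen_event callback module. The sketch below counts error reports and keeps the latest one as a size-capped binary, which is one possible guard against the memory growth the warning mentions. The module name, the 1024-byte cap, and the `stats` call are assumptions; the Thrift forwarding from the slide is not shown.

```erlang
%% crash_counter.erl
%% Hypothetical gen_event handler for error_logger-style error reports.
-module(crash_counter).
-behaviour(gen_event).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

-define(MAX_BYTES, 1024).  % assumed cap on the stored report size

init(_Args) ->
    {ok, #{errors => 0, last => undefined}}.

%% error_logger delivers error reports in this shape; proc_lib crash
%% reports arrive with Type =:= crash_report.
handle_event({error_report, _GroupLeader, {_Pid, Type, Report}},
             State = #{errors := N}) ->
    Bin = iolist_to_binary(io_lib:format("~p: ~p", [Type, Report])),
    Short = binary:part(Bin, 0, min(byte_size(Bin), ?MAX_BYTES)),
    {ok, State#{errors := N + 1, last := Short}};
handle_event(_Other, State) ->
    {ok, State}.

%% Let callers read the counters.
handle_call(stats, State) ->
    {ok, State, State}.

handle_info(_Msg, State) -> {ok, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

The handler can be exercised against any gen_event manager, or attached to the node's error logger with `error_logger:add_report_handler(crash_counter)`. On OTP 21 and later, error_logger is a legacy interface layered on the newer logger application, so the second route is mainly relevant to older releases such as the one in the talk.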
30. Hot code swapping
• Restart-free upgrades are awesome (!)
• Pushing new functional code for Chat takes ~20 seconds
  • No state is lost
• Test on a running system
• Provides a safety net ... rolling back bad code is easy
• NOTE: we don't use the OTP release/upgrade strategies
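Without OTP release handling, the usual manual pattern is a process loop that re-enters itself through a fully qualified call, so it picks up whichever version of the module is currently loaded. The sketch below shows that pattern; it is not the Chat team's deployment tooling, and the module name and message shapes are assumptions.

```erlang
%% swap_sketch.erl
%% Hypothetical counter process that can switch to newly loaded code
%% without losing its state.
-module(swap_sketch).
-export([start/0, bump/1, upgrade/1, loop/1]).

start() ->
    spawn(fun() -> loop(0) end).

%% Increment the counter and return the new value.
bump(Pid) -> call(Pid, bump).

%% Ask the process to re-enter loop/1 in the newest loaded version.
upgrade(Pid) -> call(Pid, upgrade).

call(Pid, Request) ->
    Ref = make_ref(),
    Pid ! {Request, self(), Ref},
    receive
        {Ref, Reply} -> Reply
    after 1000 ->
        timeout
    end.

%% loop/1 is exported so the fully qualified call below can reach it.
loop(Count) ->
    receive
        {bump, From, Ref} ->
            From ! {Ref, Count + 1},
            loop(Count + 1);
        {upgrade, From, Ref} ->
            From ! {Ref, Count},
            %% Fully qualified call: continues in the latest code,
            %% carrying the current state along.
            ?MODULE:loop(Count)
    end.
```

After compiling a changed version, `code:purge(swap_sketch)` followed by `code:load_file(swap_sketch)` loads it, and `swap_sketch:upgrade(Pid)` moves the running process over. Loading the previous beam file the same way is the rollback. Note that purging kills any process still executing the old version, so processes need to make the qualified call before the next purge.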
31. Monitoring and Error Recovery
• Supervision hierarchies
  • Organize (and control) processes
  • Organize thoughts
  • Systematize restarts and error recovery
  • simple_one_for_one for dynamic child processes
• net_kernel (Distributed Erlang)
  • sends nodedown, nodeup messages
  • any process can subscribe
• heart: monitors and restarts the OS process
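A simple_one_for_one supervisor starts identical children on demand, which suits per-user processes. The sketch below is a minimal version with a placeholder child; the module name, restart limits, and the child loop are assumptions, and it uses the map-based supervisor flags available in OTP 18 and later rather than the tuple form current at the time of the talk.

```erlang
%% chan_sup_sketch.erl
%% Hypothetical simple_one_for_one supervisor for per-user processes.
-module(chan_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, start_channel/1, init/1, channel_start_link/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Start one child; UserId is appended to the child's start arguments.
start_channel(UserId) ->
    supervisor:start_child(?MODULE, [UserId]).

init([]) ->
    SupFlags = #{strategy => simple_one_for_one,
                 intensity => 5,   % assumed: at most 5 restarts ...
                 period => 10},    % ... within 10 seconds
    Child = #{id => channel,
              start => {?MODULE, channel_start_link, []},
              restart => transient},  % restart only after abnormal exits
    {ok, {SupFlags, [Child]}}.

%% Placeholder child, started through proc_lib so crashes produce
%% crash reports.
channel_start_link(UserId) ->
    Pid = proc_lib:spawn_link(fun() -> channel_loop(UserId) end),
    {ok, Pid}.

channel_loop(UserId) ->
    receive
        stop -> ok;
        _Other -> channel_loop(UserId)
    end.
```

Killing a child makes the supervisor start a replacement with the same arguments, while a child that receives `stop` exits normally and is not restarted. For the net_kernel bullet, a process that calls `net_kernel:monitor_nodes(true)` receives `{nodeup, Node}` and `{nodedown, Node}` messages.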
32. Remote Shell
• To invoke:
> erl -name hidden -hidden -remsh <node_name> -setcookie <cookie>
Eshell V5.7.1 (abort with ^G)
(<node_name>)1>
• Ad-hoc inspection of a running node
• Command-and-control from a console
• Combines with hot code loading
33. Erlang top (etop)
• Shows Erlang processes, sorted by reductions, memory and message queue
• OS functionality ... for free
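The kind of ranking etop displays can be approximated from a shell with `process_info/2`. This sketch is not how etop is implemented; it only shows that the same per-process figures are available to any Erlang code. The module and function names are invented for the example.

```erlang
%% top_sketch.erl
%% Hypothetical helper: rank processes by one process_info item.
-module(top_sketch).
-export([top/2]).

%% Return the N processes with the largest value for Key, largest first.
top(Key, N) when Key =:= reductions;
                 Key =:= memory;
                 Key =:= message_queue_len ->
    %% process_info/2 returns undefined for a process that has already
    %% exited; the generator pattern silently skips those.
    Rows = [{Pid, Value} || Pid <- erlang:processes(),
                            {_, Value} <- [process_info(Pid, Key)]],
    lists:sublist(lists:reverse(lists:keysort(2, Rows)), N).
```

For example, `top_sketch:top(message_queue_len, 5)` from a remote shell lists the five processes with the longest mailboxes.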
34. Hibernation
• Drastically shrink memory usage with erlang:hibernate/3
  • Throws away the call stack
  • Minimizes the heap
  • Enters a wait state for new messages
  • "Jumps" into a passed-in function for a received message
• Perfect for a long-running, idling HTTP request handler
• But ... not compatible with gen_server:call (and gen_server:reply)
  • gen_server:call has its own receive() loop
  • hibernate() doesn't support an explicit timeout
  • Fixed with a few hours and a look at gen.erl
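A plain process (outside gen_server) can hibernate between messages as follows. This is a minimal sketch of `erlang:hibernate/3` usage, not the patched gen_server the slide alludes to; the module name and message shapes are assumptions.

```erlang
%% hibernate_sketch.erl
%% Hypothetical idle handler that hibernates between requests.
-module(hibernate_sketch).
-export([start/0, ping/1, wake/1]).

start() ->
    spawn(fun() -> idle(0) end).

%% Send a request and return how many requests have been served so far.
ping(Pid) ->
    Ref = make_ref(),
    Pid ! {ping, self(), Ref},
    receive
        {Ref, Served} -> Served
    after 1000 ->
        timeout
    end.

%% Park the process. hibernate/3 never returns: the call stack is
%% discarded and the heap is shrunk, and the next message resumes
%% execution in wake/1, which is why wake/1 must be exported.
idle(Served) ->
    erlang:hibernate(?MODULE, wake, [Served]).

wake(Served) ->
    receive
        {ping, From, Ref} ->
            From ! {Ref, Served + 1},
            idle(Served + 1);
        stop ->
            ok
    end.
```

All state has to travel through the argument list, since nothing on the stack survives hibernation. There is also no timeout argument, which is the limitation the slide mentions.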
35. Symmetric MultiProcessing (SMP)
• Take advantage of multi-core servers
• erl -smp runs multiple scheduler threads inside the node
• SMP is emphasized in recent Erlang development
  • Added to Erlang R11B
  • Erlang R12B-0 through R13B include fixes and perf boosts
• Smart people have been optimizing our code for a year (!)
• Upgraded to R13B last night with about 1/3 less load
36. hipe_bifs
Cheating single assignment
• Erlang is opinionated:
  • Destructive assignment is hard because it should be
• hipe_bifs:bytearray_update() allows for destructive array assignment
  • Necessary for aggregating Chat users' presence
• Don't tell anyone!
38. Then ... a steep learning curve
• Start of 2007:
  • Few industry-focused English-language resources
  • Few blogs (outside of Yariv's and Joel Reymont's)
  • Code examples spread out and disorganized
  • U.S. Erlang community limited in number and visibility
39. Now ...
• Programming Erlang (Jun 2007)
• Erlang Programming (upcoming...)
• More blogs and blog aggregators:
  • Planet Erlang, Planet TrapExit
• Erlang Factory aggregates Erlang developments
• More code available:
  • GitHub, CEAN
• More general-purpose Open Source Libraries
• U.S.-located conference and ErlLounges
40. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0