Design Facebook Messenger / WhatsApp

Step 1. Requirements

  • one-on-one conversations between users, including delivery acknowledgements
  • keep track of users' online/offline status
  • support persistent storage of chat history (NOTE: not all chat apps store chat histories; some do not store them at all, so clarify this requirement)
  • group chat
  • push notification
  • minimum latency
  • highly consistent: a user should see the same history across different platforms/devices
  • availability: ideally high, but can be sacrificed for consistency (NOTE!!! this is the opposite of many systems, which sacrifice consistency for availability)

Step 2. Data Estimation

Assumption: 500 M DAU / each user sends 40 msg per day / 100 bytes per message

Calculation:

  • 20 billion messages/day (~230 K messages/s) -> 2 TB/day -> ~3.65 PB over 5 years
  • 230 K msg/s -> 23 MB/s upload and 23 MB/s download, since each message comes in once and goes out once (sanity-checked in the snippet below)
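
A quick sanity check of these numbers (a throwaway Python sketch; the DAU, message rate, and message size are just the assumptions above):

    DAU = 500_000_000           # daily active users (assumption from above)
    MSGS_PER_USER = 40          # messages per user per day
    MSG_SIZE = 100              # bytes per message

    msgs_per_day = DAU * MSGS_PER_USER            # 20 billion
    msgs_per_sec = msgs_per_day / 86_400          # ~231 K msg/s
    bytes_per_day = msgs_per_day * MSG_SIZE       # ~2 TB/day
    bytes_5_years = bytes_per_day * 365 * 5       # ~3.65 PB
    bandwidth = msgs_per_sec * MSG_SIZE           # ~23 MB/s each direction

    print(f"{msgs_per_sec:,.0f} msg/s")                   # 231,481 msg/s
    print(f"{bytes_per_day / 1e12:.1f} TB/day")           # 2.0 TB/day
    print(f"{bytes_5_years / 1e15:.2f} PB over 5 years")  # 3.65 PB
    print(f"{bandwidth / 1e6:.1f} MB/s in and out")       # 23.1 MB/s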

Step 3. High level design

a chat server to 1) receive messages, 2) send messages, 3) acknowledge message delivery and track user status, 4) store messages

Workflow:

  1. user1 sends msg1 to user2 via the chat server
  2. the server receives msg1 and sends an acknowledgement to user1
  3. the server stores msg1 in the database
  4. the server sends msg1 to user2
  5. user2 receives msg1 and sends an acknowledgement to the server
  6. the server notifies user1 that msg1 has been delivered to user2 (sketched below)
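
A minimal in-memory sketch of this workflow (Python; the connections dict and store list are hypothetical stand-ins for the users' open connections and the message database):

    connections = {}   # userID -> connection (here: a plain callback)
    store = []         # stand-in for the message database

    def ack(user, note):
        print(f"[to {user}] {note}")

    def send_message(sender, receiver, msg):
        ack(sender, f"server received {msg}")          # step 2: ack the sender
        store.append((sender, receiver, msg))          # step 3: persist the msg
        if receiver in connections:
            connections[receiver](sender, msg)         # step 4: push to receiver
            # steps 5-6: receiver's client acks, server notifies the sender
            ack(sender, f"{msg} delivered to {receiver}")

    connections["user2"] = lambda s, m: print(f"[user2] {s} says: {m}")
    send_message("user1", "user2", "msg1")             # step 1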

Step 4. Detailed component design

  • message receiving and sending
  • message storing and retrieving
  • managing users' online/offline status

Message Receiving and Sending

Push vs. pull when receiving messages

After user1 sends msg1 to the server, how does user2 get msg1 from the server? Pull vs. push in chat apps:

Pull

  • user2 can pull new messages from the chat server periodically.
    • - the server needs to keep track of which messages are pending delivery
    • - user2 cannot get a message until the next pull, so messages are not received in a timely manner unless user2 pulls frequently
    • - frequent pulling mostly gets empty responses when there are no new messages, wasting resources

Push

  • once the server gets a message from a sender, it pushes the message to the receiver.
    • all active/online users need to keep a connection open with the server
    • + the server does not need to keep track of pending messages for online users (failed delivery is a separate topic)
    • + minimal latency: the message is delivered instantly over the open connection

We should choose PUSH

Push techniques: long polling and WebSocket

long polling

  • a technique where the server elects to hold a client's connection open for as long as possible, delivering a response only after data becomes available or a timeout threshold has been reached.
  • + implemented on top of XMLHttpRequest, so it is near-universally supported by devices
  • - requires more implementation work on the server

WebSocket - would be a better choice

  • a thin transport layer built on top of the device's TCP/IP stack
  • exposes an as-close-to-raw-as-possible TCP communication layer to web application developers
  • + faster than long polling, with less data in transit
  • - does not automatically recover when the connection is terminated; you need to implement that yourself, though many client libraries handle it (see the reconnect sketch after this list)
  • - old browsers do not support it
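
Since WebSocket connections don't recover on their own, the client usually wraps the connection in a reconnect loop. A rough sketch using the third-party websockets package (the endpoint URL and backoff values are made up):

    import asyncio
    import websockets  # third-party: pip install websockets

    async def listen(url):
        backoff = 1
        while True:
            try:
                async with websockets.connect(url) as ws:
                    backoff = 1                    # reset after a successful connect
                    async for msg in ws:           # messages pushed by the server
                        print("received:", msg)
            except (OSError, websockets.ConnectionClosed):
                await asyncio.sleep(backoff)       # wait, then reconnect
                backoff = min(backoff * 2, 60)     # exponential backoff, capped

    asyncio.run(listen("wss://chat.example.com/ws"))   # hypothetical endpoint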

How does long polling work?

  • the client issues a request, starting a long-polling connection
  • if there are no updates, the server does not reply; it keeps the long-polling connection open until it times out
  • as soon as the server has updates, it sends them to the client immediately
  • upon receiving the response, the client immediately issues another request so that a connection stays open for future updates (see the client loop sketched below)
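
The client side of that loop might look roughly like this (Python requests; the endpoint is hypothetical, and the server is assumed to hold the request open for up to ~30 s):

    import requests

    def long_poll(url):
        while True:
            try:
                # The server holds this request open until a message arrives
                # or its own ~30 s timeout fires; we allow slightly longer.
                resp = requests.get(url, timeout=35)
                if resp.status_code == 200:
                    for m in resp.json():          # deliver new messages
                        print("new message:", m)
                # any other status (e.g. 204): no updates, just re-issue
            except requests.exceptions.Timeout:
                pass                               # timed out, reconnect

    long_poll("https://chat.example.com/poll?user=user2")  # hypothetical endpoint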

How does the server track long-polling connections for a user?

A hash table. key: userID, value: connection object
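
In code, that table is just a dict guarded against concurrent access (a sketch; the connection object is whatever handle the server can write to):

    import threading

    class ConnectionRegistry:
        """The hash table above: userID -> open connection."""

        def __init__(self):
            self._lock = threading.Lock()
            self._conns = {}                   # key: userID, value: connection

        def register(self, user_id, conn):
            with self._lock:
                self._conns[user_id] = conn

        def unregister(self, user_id):
            with self._lock:
                self._conns.pop(user_id, None)

        def get(self, user_id):
            with self._lock:
                return self._conns.get(user_id)   # None: the user is offline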

Sending a message to an offline user

  • have the sender's UI resend the message when the receiver comes back online
  • or have the server store the message for a while and retry sending it once the receiver comes back online: when the user reconnects, check whether there are any pending messages (should the client pull them? should they sit in a queue such as SQS? or in a database?); the store-and-retry option is sketched after this list
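
The store-and-retry option, sketched with a plain in-memory dict standing in for the queue or database (reuses the ConnectionRegistry sketch above; conn.send is a hypothetical method on the connection object):

    from collections import defaultdict

    pending = defaultdict(list)      # userID -> messages awaiting delivery

    def deliver(registry, user_id, msg):
        conn = registry.get(user_id)
        if conn is not None:
            conn.send(msg)                       # receiver online: push now
        else:
            pending[user_id].append(msg)         # receiver offline: park the msg

    def on_user_connect(registry, user_id, conn):
        registry.register(user_id, conn)
        for msg in pending.pop(user_id, []):     # flush anything queued while away
            conn.send(msg)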

Chat servers needed: assuming 500 M connections are maintained at the same time, and a modern server can handle ~50 K concurrent connections, we need 500 M / 50 K = 10 K servers, which in turn requires load balancers.

how do we know which server holds which user's connection?

A load balancer that maps each userId to a server and redirects the request accordingly (LB stickiness).
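
A sketch of the mapping the load balancer could keep. The simplest version hashes the userId onto one of the chat servers; consistent hashing would reduce reshuffling when servers are added or removed, but plain modulo shows the idea (server names are made up):

    import hashlib

    SERVERS = [f"chat-{i}.example.com" for i in range(10_000)]   # the ~10 K servers

    def server_for(user_id: str) -> str:
        # Stable hash so the same user always lands on the same server.
        # (Python's built-in hash() is salted per process, so use hashlib.)
        digest = hashlib.md5(user_id.encode()).digest()
        return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]

    print(server_for("user1"))   # every lookup for user1 returns the same server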

sent acknowledgement

When the server receives a message from the sender, it 1) sends the message to the receiver and then sends an acknowledgement to the sender, and 2) stores the message in the database (the acknowledgement does not need to wait for the message to be stored in the DB; see the sketch below).
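
A sketch of point 2 with asyncio: the DB write is fired off as a background task so delivery and the ack don't wait on it (db_write, push_to_receiver, and send_ack are hypothetical stand-ins):

    import asyncio

    async def on_message(sender, receiver, msg):
        asyncio.create_task(db_write(msg))        # 2) fire-and-forget persistence
        await push_to_receiver(receiver, msg)     # 1) deliver to the receiver...
        await send_ack(sender, msg)               # ...then ack the sender right away

    async def db_write(msg):
        await asyncio.sleep(0.05)                 # stand-in for a slow DB insert
        print(f"stored: {msg}")

    async def push_to_receiver(receiver, msg):
        print(f"push to {receiver}: {msg}")

    async def send_ack(sender, msg):
        print(f"ack to {sender}: {msg} received")

    async def main():
        await on_message("user1", "user2", "M1")
        await asyncio.sleep(0.1)                  # demo only: let the DB write finish

    asyncio.run(main())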

message sequence

  1. user1 sends M1 to user2 at 1:00 PM / user2 sends M2 to user1 at 1:00 PM
  2. M1 arrives at the server at 1:01 PM / M2 arrives at the server at 1:02 PM
  3. user1 sees: M1, M2 / user2 sees: M2, M1

Message Database

  date/time | sender | receiver | message content
  1:01 PM   | user1  | user2    | M1
  1:02 PM   | user2  | user1    | M2

On the same device, user1's and user2's messages would be laid out like this (assuming the sender's messages appear on the right):

User1's UI - consistent with server time order:

  (right) User1: M1 (1:01 PM server time)
  (left)  User2: M2 (1:02 PM server time)

User2's UI - not consistent with server time order:

  (right) User2: M2 (1:02 PM server time)
  (left)  User1: M1 (1:01 PM server time)

For user2, if he logs in on another device and re-fetches the message history from the database, his UI will look different:

User2's UI - new device (ordered by server time):

  (left)  User1: M1 (1:01 PM server time)
  (right) User2: M2 (1:02 PM server time)

So do we need to maintain a separate sequence for every user??? TODO: I don't know how to handle this yet. ref: https://stackoverflow.com/questions/65953525/how-does-the-messenger-maintain-the-sequencing-of-the-messages-during-chat-and-w

data storage and retrieval

  • use HBase
  • use pagination
    • use Limit/Offset with SQL databases, which already have LIMIT and OFFSET as part of the SQL SELECT syntax. A request would look like GET /items?limit=20&offset=100 and would return 20 rows starting after the 100th row (rows 101-120)
    • + easiest to implement; almost no coding required other than passing the parameters directly to the SQL query
    • + stateless on the server
    • + works regardless of custom sort_by parameters
    • - not performant for large offset values. Say you perform a query with a large offset value of 1000000: the database needs to scan and count rows starting from 0, and will skip (i.e. throw away) the first 1000000 rows
    • - not consistent when new items are inserted into the table (i.e. page drift), as shown below

Page drift is especially noticeable when we order items by newest first. Consider the following, which orders by decreasing id:

Query GET /items?offset=0&limit=15

10 new items are added to the table

Query GET /items?offset=15&limit=15. The second query returns only 5 new items, because adding 10 new items moved the offset window back by 10 items. To fix this, the client would really need to offset by 25 for the second query (GET /items?offset=25&limit=15), but the client can't possibly know about other objects being inserted into the table.
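
The drift is easy to reproduce against an in-memory table (sqlite3 here, ordering by decreasing id as in the example):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
    db.executemany("INSERT INTO items (id) VALUES (?)",
                   [(i,) for i in range(1, 31)])              # 30 existing items

    def page(offset):
        rows = db.execute(
            "SELECT id FROM items ORDER BY id DESC LIMIT 15 OFFSET ?", (offset,))
        return [r[0] for r in rows]

    first = page(0)                                   # ids 30..16
    db.executemany("INSERT INTO items (id) VALUES (?)",
                   [(i,) for i in range(31, 41)])     # 10 new items arrive
    second = page(15)                                 # ids 25..11, not 15..1
    print(sorted(set(first) & set(second), reverse=True))   # 10 ids shown twice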

