oneOokay posted: " How to store location based data when place's location does not change frequently - Quad tree Design location based systems: yelp, POI (point of interests), nearby friends. takes in a place's location, find nearby restaurants / POI / friends Step 1" How to store location based data when place's location does not change frequently - Quad tree Design location based systems: yelp, POI (point of interests), nearby friends. takes in a place's location, find nearby restaurants / POI / friends Step 1. Requirements - add/delete/update places
- given latitude/longitude, user should find all nearby places within a given radius
- add feedback/review for places, can have pictures, text and a rating
- low latency search experience
- heavy search load
- Assumption: 500 M places / 100 K QPS
Step 2. Data Schema & API - place: placeId, name, latitude, longitude, description, category
- review: reviewId, placeId, reviewText, Rating
addPlace(dev_api_key, placeName, description, lat_long) search(dev_api_key, search_terms, lat_long, radius, limit, category, order, page_token)
Step 3. Algorithm - place location does not change that frequently (in contrast to Uber system)
SQL solution - store latitude and longitude, have two indexes on them
select * from place where longitude between X + d and X - d and latitude between Y + d and Y - d - - two indexes would return a lot of data then perform intersection is very slow
- - the query is not completely accurate for a definition of radius (is a circle)
Grid - Idea: divide map in to grids, store places in those grids. all grids are the same size
- for each place, it will also have a grid id
- 用一个4 bytes number来表示, 前2byte表示top left, 后2byte表示bottom right
- why 2 bytes ? 因为1 byte = 8 bits = 255 但是地球的最大维度是左180+右180 = 360. 1 byte太小, 所以2bytes
- for queries:
select * from place where longitude between X + d and X - d and latitude between Y + d and Y - d and gridId in (gridID1, gridID2, ..., gridID9 ) - 把每个grid里面的place存在cache里面 for fast query -> store in cache
Dynamic size grids - remote locations have less places while busy places has a lot more places within the same size grid
- for a grid if found there are more than a certain numbers of POIs what to do ? => divid grid to smaller grids. how to divid? divide one into FOUR
Quad Tree - 一个node, node有4个儿子, 分别为这个quad的左上, 左下, 右上和右下
- Implementation: https://www.geeksforgeeks.org/quad-tree/
- 给一个location, 如何找到quad tree node ? 找到一个leaf node即为target node
- 找到了target node, 如何从这个target node找到它的neighbors?
- 一个node的4个儿子都doubly-linked to each other -> todo ? 这个感觉找不齐target node周围的8个node啊
- 每个儿子都存一个指向parent的指针 -> 这个我感觉比较好
Search steps with Quad Tree - 找到含有目标lat_long的quad tree node
- 如果这个node里面有足够的places, 直接返回他们
- 如果没有继续step2
- expand to neighbor nodes to find enough places or stop the search when we reach the maximum radius
Quad Tree Data partitioning - sharding by region: hot region & region has lots places hard to maintain uniform data distribution (NO)
- sharding by locationId: (YES)
- query all the servers gets all the places
- aggregation server to aggregate and sort results and return
Replication and Fault Tolerance - replicas of QuadTree
- master and slaves -> slaves read only
- build a reverse index to recover a quad tree
- key is the quadtree serverID
- value is a hashset contains all the places with location ID and lat_long info stored in that server
- query this reverse index to rebuild a quad tree
- should also have a replica for this QuadTreeIndex server
Step 4. Improvements - having cache in front of the database with Memcache
- using LRU eviction polity
- adding LB at proper places
- Ranking: store in database and the quadtree
- having a cron job to update quadtree every couple of hrs
|
|
|
|
|
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.