How to store location based data when place's location does not change frequently - Quad tree

Design location based systems: yelp, POI (point of interests), nearby friends. takes in a place's location, find nearby restaurants / POI / friends

Step 1. Requirements

  • add/delete/update places
  • given latitude/longitude, user should find all nearby places within a given radius
  • add feedback/review for places, can have pictures, text and a rating
  • low latency search experience
  • heavy search load
  • Assumption: 500 M places / 100 K QPS

Step 2. Data Schema & API

  • place: placeId, name, latitude, longitude, description, category
  • review: reviewId, placeId, reviewText, Rating
  • addPlace(dev_api_key, placeName, description, lat_long)
  • search(dev_api_key, search_terms, lat_long, radius, limit, category, order, page_token)

Step 3. Algorithm

  • place location does not change that frequently (in contrast to Uber system)

SQL solution

  • store latitude and longitude, have two indexes on them
  • select * from place where longitude between X + d and X - d and latitude between Y + d and Y - d
  • - two indexes would return a lot of data then perform intersection is very slow
  • - the query is not completely accurate for a definition of radius (is a circle)

Grid

  • Idea: divide map in to grids, store places in those grids. all grids are the same size
  • for each place, it will also have a grid id
    • 用一个4 bytes number来表示, 前2byte表示top left, 后2byte表示bottom right
    • why 2 bytes ? 因为1 byte = 8 bits = 255 但是地球的最大维度是左180+右180 = 360. 1 byte太小, 所以2bytes
  • for queries: select * from place where longitude between X + d and X - d and latitude between Y + d and Y - d and gridId in (gridID1, gridID2, ..., gridID9 )
    • 一共有9个grid!
  • 把每个grid里面的place存在cache里面 for fast query -> store in cache

Dynamic size grids

  • remote locations have less places while busy places has a lot more places within the same size grid
  • for a grid if found there are more than a certain numbers of POIs what to do ? => divid grid to smaller grids. how to divid? divide one into FOUR

Quad Tree

  • 一个node, node有4个儿子, 分别为这个quad的左上, 左下, 右上和右下
  • Implementation: https://www.geeksforgeeks.org/quad-tree/
  • 给一个location, 如何找到quad tree node ? 找到一个leaf node即为target node
  • 找到了target node, 如何从这个target node找到它的neighbors?
    • 一个node的4个儿子都doubly-linked to each other -> todo ? 这个感觉找不齐target node周围的8个node啊
    • 每个儿子都存一个指向parent的指针 -> 这个我感觉比较好

Search steps with Quad Tree

  1. 找到含有目标lat_long的quad tree node
    1. 如果这个node里面有足够的places, 直接返回他们
    2. 如果没有继续step2
  2. expand to neighbor nodes to find enough places or stop the search when we reach the maximum radius

Quad Tree Data partitioning

  1. sharding by region: hot region & region has lots places hard to maintain uniform data distribution (NO)
  2. sharding by locationId: (YES)
    1. query all the servers gets all the places
    2. aggregation server to aggregate and sort results and return

Replication and Fault Tolerance

  • replicas of QuadTree
  • master and slaves -> slaves read only
  • build a reverse index to recover a quad tree
    • key is the quadtree serverID
    • value is a hashset contains all the places with location ID and lat_long info stored in that server
    • query this reverse index to rebuild a quad tree
    • should also have a replica for this QuadTreeIndex server

Step 4. Improvements

  • having cache in front of the database with Memcache
  • using LRU eviction polity
  • adding LB at proper places
  • Ranking: store in database and the quadtree
    • having a cron job to update quadtree every couple of hrs

This free site is ad-supported. Learn more