
Commit 2a89f56

feat: hive_posts_cache_temp for hot queries (90-day window) (#364)
* feat: hive_posts_cache_temp for hot queries (trending/hot/payout)
1 parent 808cb02 commit 2a89f56

18 files changed

Lines changed: 482 additions & 36 deletions
Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
1+
# PR: hive_posts_cache_temp for hot queries (90-day window)
2+
3+
## Summary
4+
5+
This branch introduces a second cache table **`hive_posts_cache_temp`** that holds only the last ~90 days of post cache data. Hot list APIs (trending, hot, created, promoted, payout, payout_comments) are routed to this table to reduce query load on the main `hive_posts_cache` table.
6+
7+
## What’s in this branch
8+
9+
### 1. Temp table and schema
10+
- **`hive/db/schema.py`**: New table `hive_posts_cache_temp` with the same columns as `hive_posts_cache` (plus optional `_synced_at`), and indexes tuned for hot sorts.
11+
- **`hive/db/db_state.py`**: Migration (v24→25) creates the temp table if missing and runs a **one-time cold-start backfill** (INSERT from main table, 90-day window). Uses `SET LOCAL statement_timeout = '0'` for the long-running insert; adapter whitelist updated to allow `SET LOCAL`.
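The cold-start backfill described above can be sketched as follows. This is a minimal illustration, not the migration itself: it uses an in-memory SQLite database with only two of the table's columns, and the helper name `backfill_temp` is hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta


def backfill_temp(conn, now, window_days=90):
    """Copy rows created inside the window from main into temp.

    Hypothetical helper for illustration; returns the number of rows copied.
    """
    cutoff = (now - timedelta(days=window_days)).isoformat(sep=' ')
    cur = conn.execute(
        "INSERT INTO hive_posts_cache_temp (post_id, created_at) "
        "SELECT post_id, created_at FROM hive_posts_cache "
        "WHERE created_at >= :cutoff",
        {"cutoff": cutoff})
    return cur.rowcount


# Reduced two-column schema (assumption: the real tables have many more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hive_posts_cache (post_id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE TABLE hive_posts_cache_temp (post_id INTEGER PRIMARY KEY, created_at TEXT)")

now = datetime(2024, 6, 1)
ages_in_days = {1: 10, 2: 89, 3: 200}
conn.executemany(
    "INSERT INTO hive_posts_cache VALUES (?, ?)",
    [(pid, (now - timedelta(days=age)).isoformat(sep=' '))
     for pid, age in ages_in_days.items()])

copied = backfill_temp(conn, now)
```

Only the posts created 10 and 89 days before `now` land in the temp table; the 200-day-old post stays in the main table only.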
12+
13+
### 2. Query routing
14+
- **`hive/db/cache_router.py`**: `CacheRouter.get_table(sort)` returns `hive_posts_cache_temp` for hot sorts and `hive_posts_cache` otherwise.
15+
- **Bridge / condenser / hive_api**: List and cursor code use `CacheRouter.get_table(sort)` so hot sorts hit the temp table; `comments_by_id` queries temp first and falls back to main for missing IDs; `posts_by_id` remains main-only.
16+
17+
### 3. Writes (dual-write)
18+
- **`hive/indexer/cached_post.py`**: Every INSERT/UPDATE to `hive_posts_cache` is applied to **both** the main and temp table in the **same batch/transaction** (dual-write). Undelete path updated to write to both tables.
19+
- **Logging**: `[DUAL-WRITE] batch: N posts written to main+temp` per batch; `[PREP] posts cache process (main+temp): ...` for large flushes, so operators can confirm dual-write in logs.
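As a rough sketch of the dual-write idea, assuming an in-memory SQLite database and a reduced two-column schema: the helper name `dual_write` and the SQLite-specific `INSERT OR REPLACE` are illustrative stand-ins, whereas the branch itself emits separate INSERT and UPDATE statements for each table.

```python
import sqlite3

TABLES = ("hive_posts_cache", "hive_posts_cache_temp")


def dual_write(conn, post_id, payout):
    """Write one row to both tables inside a single transaction.

    Hypothetical helper; `with conn` commits on success and rolls back
    both writes if either statement raises.
    """
    with conn:
        for table in TABLES:
            conn.execute(
                "INSERT OR REPLACE INTO %s (post_id, payout) VALUES (?, ?)" % table,
                (post_id, payout))


conn = sqlite3.connect(":memory:")
for table in TABLES:
    conn.execute("CREATE TABLE %s (post_id INTEGER PRIMARY KEY, payout REAL)" % table)

dual_write(conn, 7, 1.5)   # initial insert
dual_write(conn, 7, 2.0)   # later update to the same post
```

Because both statements share one transaction, a reader never sees the main table updated while the temp table still holds the old value.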
20+
21+
### 4. Deletes
22+
- **`hive/indexer/cache_sync.py`**: Every **60s** (triggered from listen when `num % 20 == 0`), a background thread runs `DELETE FROM hive_posts_cache_temp WHERE created_at < :cutoff` (90-day cutoff). No INSERT/sync from main—temp is fed only by dual-write and cold-start backfill.
23+
- **`hive/indexer/cached_post.py`**: `CachedPost.delete()` deletes from both main and temp when a post is removed (e.g. delete_comment).
24+
- **`hive/indexer/blocks.py`**: Fork rollback (`_pop()`) deletes affected post_ids from both main and temp.
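The prune trigger and the delete itself can be sketched like this. The 3-second block interval is an assumption used to show why every 20th block corresponds to roughly 60s; `should_prune` and `prune_temp` are hypothetical names, and the demo runs against an in-memory SQLite table rather than the real database.

```python
import sqlite3
from datetime import datetime, timedelta

BLOCK_INTERVAL_S = 3          # assumed block time
PRUNE_EVERY_N_BLOCKS = 20     # matches the `num % 20 == 0` trigger


def should_prune(block_num):
    """True on every 20th block, i.e. about every 60 seconds."""
    return block_num % PRUNE_EVERY_N_BLOCKS == 0


def prune_temp(conn, now, window_days=90):
    """Delete temp rows older than the window; returns the number deleted."""
    cutoff = (now - timedelta(days=window_days)).isoformat(sep=' ')
    with conn:
        cur = conn.execute(
            "DELETE FROM hive_posts_cache_temp WHERE created_at < :cutoff",
            {"cutoff": cutoff})
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hive_posts_cache_temp (post_id INTEGER PRIMARY KEY, created_at TEXT)")
now = datetime(2024, 6, 1)
conn.executemany(
    "INSERT INTO hive_posts_cache_temp VALUES (?, ?)",
    [(1, (now - timedelta(days=10)).isoformat(sep=' ')),
     (2, (now - timedelta(days=91)).isoformat(sep=' '))])

deleted = prune_temp(conn, now) if should_prune(40) else 0
```

The prune only deletes; nothing is copied from the main table, so the temp table stays populated purely through dual-write and the one-time backfill.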
25+
26+
### 5. Documentation
27+
- **`docs/hive_posts_cache_temp-90day-boundary.md`**:
28+
- Describes routing and 90-day boundary behavior (list APIs do *not* auto-fallback to main at 90 days; only comment-by-ID falls back).
29+
- **§6** documents all write/delete locations for the temp table (cold-start, dual-write, 60s prune, delete at delete time, fork rollback).
30+
31+
## Design notes
32+
33+
- **Dual-write** was chosen instead of a periodic sync (e.g. chunked INSERT from main every 60s) so that temp and main are updated in the same transaction and write timing is consistent; the previous 60s chunked-sync approach was dropped for performance and simplicity.
34+
- Hot lists are intentionally 90-day only; “load more” does not automatically switch to the main table beyond that window.
35+
36+
## Testing
37+
38+
- New tests: `tests/db/test_cache_router.py`, `tests/db/test_cache_sync.py`.
39+
- Cache sync tests updated (e.g. no INSERT batch size assertions after sync became delete-only).
40+
41+
## Related code (quick ref)
42+
43+
| Area | Files |
44+
|-----------------|--------|
45+
| Schema / migration | `hive/db/schema.py`, `hive/db/db_state.py` |
46+
| Routing | `hive/db/cache_router.py` |
47+
| Dual-write | `hive/indexer/cached_post.py` |
48+
| 90-day prune | `hive/indexer/cache_sync.py`, `hive/indexer/sync.py` |
49+
| Delete at delete / fork | `hive/indexer/cached_post.py`, `hive/indexer/blocks.py` |
50+
| API usage | `hive/server/bridge_api/cursor.py`, `hive/server/condenser_api/cursor.py`, `hive/server/hive_api/objects.py`, `hive/server/hive_api/thread.py` |
51+
| Docs | `docs/hive_posts_cache_temp-90day-boundary.md` |
Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
1+
# hive_posts_cache_temp and 90-Day Boundary Evaluation
2+
3+
This document records the evaluation of the temp table (`hive_posts_cache_temp`) routing logic and whether list APIs automatically fall back to `hive_posts_cache` when "load more" crosses the ~90-day boundary.
4+
5+
## Summary
6+
7+
- **List APIs (including "load more")**: They do **not** automatically read from `hive_posts_cache` at the 90-day boundary. Each request uses a single table (temp or main) chosen by `CacheRouter.get_table(sort)`; there is no cursor-based fallback to the main table.
8+
- **By-ID object APIs**: `comments_by_id` queries temp first, then fills missing IDs from main, so it **does** automatically read `hive_posts_cache`. `posts_by_id` uses only the main table.
9+
10+
So: **"Load more" does not automatically read `hive_posts_cache`**; only comment-by-ID resolution falls back to the main table when data is missing from temp.
11+
12+
---
13+
14+
## 1. Temp Table and 90-Day Boundary
15+
16+
- **Temp table**: `hive_posts_cache_temp` holds only the last ~90 days of hot data. Rows older than 90 days are pruned by `CacheSync` (see `hive/indexer/cache_sync.py`).
17+
- **Routing**: In `hive/db/cache_router.py`, `CacheRouter.get_table(sort)` returns the temp table for hot sorts (`trending`, `hot`, `created`, `promoted`, `payout`, `payout_comments`) and the main table otherwise. Routing is based only on `sort`; there is no logic that considers cursor position or proximity to the 90-day cutoff.
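A condensed, standalone restatement of that routing rule is shown below. It mirrors the behavior of `CacheRouter.get_table` as a plain function so it can run on its own; the function form and the constant names are illustrative, not the module's actual layout.

```python
TEMP_TABLE = 'hive_posts_cache_temp'
MAIN_TABLE = 'hive_posts_cache'

# Sorts that are served from the 90-day temp table.
HOT_SORTS = {'trending', 'hot', 'payout', 'payout_comments', 'created', 'promoted'}


def get_table(sort=None):
    """Return the table name for a sort; the decision depends on `sort` only."""
    if sort and sort in HOT_SORTS:
        return TEMP_TABLE
    return MAIN_TABLE


# Hot sorts go to temp; anything else (or no sort at all) goes to main.
routes = {sort: get_table(sort) for sort in ('trending', 'created', 'blog', None)}
```

Note that nothing about the cursor or the age of the requested rows enters the decision, which is why pagination never changes tables mid-list.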
18+
19+
---
20+
21+
## 2. List + Pagination ("Load More") Implementation
22+
23+
List and pagination are implemented in `hive/server/bridge_api/cursor.py` and `hive/server/condenser_api/cursor.py`:
24+
25+
- `table = CacheRouter.get_table(sort)` is used once per request; the same table is used for the entire list query and for the cursor/seek condition.
26+
- Pagination uses `seek_id` / `last_id` on that single table with `ORDER BY ... LIMIT`; there is no "query temp then fall back to main" logic.
27+
28+
Example from `pids_by_category` (bridge):
29+
30+
```python
31+
table = CacheRouter.get_table(sort)
32+
# ...
33+
sql = ("""SELECT post_id FROM %s WHERE %s
34+
ORDER BY %s DESC, post_id LIMIT :limit
35+
""" % (table, ' AND '.join(where), field))
36+
return await db.query_col(sql, tag=tag, last_id=last_id, limit=limit)
37+
```
38+
39+
So for sorts that use the temp table (e.g. trending, hot, created), the entire list (first page and all "load more" pages) is read only from the temp table. When the user scrolls near the 90-day boundary, the temp table simply has no older rows (they were pruned), so the next page returns fewer or no rows; the code does **not** switch to `hive_posts_cache` for older data.
40+
41+
- If the product expectation is "hot lists only show the last 90 days", the current behavior is correct and does not wrongly read the main table.
42+
- If the product expectation is "load more should continue to show posts older than 90 days", then the current implementation does **not** satisfy that; you would need to add logic to use the main table (or a combined strategy) when the cursor is near or past the 90-day boundary.
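If the second expectation applied, one possible shape for cursor-based table selection is sketched below. Nothing like this exists in the branch: `table_for_page` and its parameters are hypothetical, and the approach only makes sense for sorts ordered by `created_at` (such as `created`). Score-based sorts like `trending` would more likely need a combined temp+main query, since the cursor's age says nothing about where the next rows live.

```python
from datetime import datetime, timedelta

TEMP_TABLE = 'hive_posts_cache_temp'
MAIN_TABLE = 'hive_posts_cache'
HOT_SORTS = {'trending', 'hot', 'payout', 'payout_comments', 'created', 'promoted'}


def table_for_page(sort, cursor_created_at, now, window_days=90):
    """Hypothetical: pick a table per page based on the cursor's age.

    - Non-hot sorts always use main.
    - The first page (no cursor) of a hot sort uses temp.
    - Later pages use temp while the cursor is inside the window, else main.
    """
    if sort not in HOT_SORTS:
        return MAIN_TABLE
    if cursor_created_at is None:
        return TEMP_TABLE
    cutoff = now - timedelta(days=window_days)
    return TEMP_TABLE if cursor_created_at >= cutoff else MAIN_TABLE


now = datetime(2024, 6, 1)
first_page = table_for_page('created', None, now)
recent_page = table_for_page('created', now - timedelta(days=10), now)
old_page = table_for_page('created', now - timedelta(days=120), now)
```

A page that straddles the cutoff would still come back short under this scheme, so a real implementation would also have to top up from the main table when temp returns fewer rows than the limit.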
43+
44+
---
45+
46+
## 3. APIs That Do Automatically Read hive_posts_cache
47+
48+
Only the **comment-by-ID** path does a temp-then-main fallback. In `hive/server/hive_api/objects.py`, `comments_by_id`:
49+
50+
- Queries the temp table first with the requested IDs.
51+
- Collects which IDs were not found.
52+
- If there are missing IDs, runs the same query against `hive_posts_cache` and appends those rows.
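The three steps above can be sketched as follows. This is a simplified, synchronous illustration against an in-memory SQLite database with a two-column schema; the helper name `rows_by_id` is hypothetical and the real `comments_by_id` selects many more columns.

```python
import sqlite3


def rows_by_id(conn, ids):
    """Query temp first, then fetch any missing ids from main."""
    def fetch(table, wanted):
        if not wanted:
            return []
        marks = ",".join("?" * len(wanted))
        sql = "SELECT post_id, body FROM %s WHERE post_id IN (%s)" % (table, marks)
        return conn.execute(sql, list(wanted)).fetchall()

    rows = fetch("hive_posts_cache_temp", ids)
    found = {row[0] for row in rows}
    missing = [i for i in ids if i not in found]
    rows += fetch("hive_posts_cache", missing)   # fallback for older rows
    return rows


conn = sqlite3.connect(":memory:")
for table in ("hive_posts_cache", "hive_posts_cache_temp"):
    conn.execute("CREATE TABLE %s (post_id INTEGER PRIMARY KEY, body TEXT)" % table)
conn.executemany("INSERT INTO hive_posts_cache VALUES (?, ?)",
                 [(1, "recent"), (2, "recent too"), (3, "older than 90 days")])
conn.executemany("INSERT INTO hive_posts_cache_temp VALUES (?, ?)",
                 [(1, "recent"), (2, "recent too")])

result = rows_by_id(conn, [1, 3])
```

Here ID 1 is answered by the temp table and ID 3, which has been pruned from temp, is filled in from the main table.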
53+
54+
So comments older than 90 days are still returned by reading from the main table when they are missing from temp. List APIs do not have this fallback.
55+
56+
---
57+
58+
## 4. Whether the 90-Day Boundary Is a "Bug"
59+
60+
- **Correctness**: When only the temp table is used for a list, the results are consistent and do not mix in unintended rows from the main table. At the boundary, "load more" simply runs out of rows in temp; there is no automatic read from `hive_posts_cache`.
61+
- **Product expectation**: If the design is "hot lists are 90-day only", there is no bug. If the design is "load more should include data beyond 90 days", then the current list implementation does not do that and would need to be extended (e.g. use main table or union when the cursor is near the 90-day cutoff).
62+
63+
---
64+
65+
## 5. Quick Reference
66+
67+
| Scenario | Automatically reads hive_posts_cache? | Notes |
68+
|----------|--------------------------------------|--------|
69+
| List first page + load more (trending / hot / created, etc.) | **No** | Only temp is used; no fallback at boundary |
70+
| Comment by ID (`comments_by_id`) | **Yes** | Temp first, then main for missing IDs |
71+
| Post by ID (`posts_by_id`) | N/A (main only) | Uses main table only, not temp |
72+
73+
**Direct answer**: "Load more" does **not** automatically read `hive_posts_cache`. Only the comment-by-ID resolution falls back to the main table when IDs are missing from temp. If you want list APIs to continue beyond the 90-day boundary using the main table, that logic would need to be added explicitly (e.g. cursor-based table selection or a combined temp+main query).
74+
75+
---
76+
77+
## 6. Temp Table: Write and Delete Locations
78+
79+
Where `hive_posts_cache_temp` is written to and deleted from (for verification and operations).
80+
81+
### Writes (INSERT / UPDATE)
82+
83+
| Location | Description |
84+
|----------|-------------|
85+
| **`hive/db/db_state.py`** | **Create table + cold-start backfill.** On migration, if the temp table does not exist: create it, then run a one-time `INSERT INTO hive_posts_cache_temp (...) SELECT ... FROM hive_posts_cache WHERE created_at >= :cutoff` (90-day window) to backfill. |
86+
| **`hive/indexer/cached_post.py`** | **Dual-write with main table.** Every write to `hive_posts_cache` also writes to the temp table in the same batch: `_insert()` produces both main and temp INSERTs; `_update()` produces both main and temp UPDATEs. Used in listen (per-block), `from_steemd` (batch sync), and undelete placeholder writes. |
87+
88+
### Deletes
89+
90+
| Location | Description |
91+
|----------|-------------|
92+
| **`hive/indexer/cache_sync.py`** | **Every 60s:** `CacheSync._sync()` runs in a background thread and executes `DELETE FROM hive_posts_cache_temp WHERE created_at < :cutoff` (cutoff = now − 90 days). Triggered from `hive/indexer/sync.py` when `num % 20 == 0` (every ~60s in listen). |
93+
| **`hive/indexer/cached_post.py`** | **On post delete:** `CachedPost.delete()` runs `DELETE FROM hive_posts_cache_temp WHERE post_id = :id` when a post is removed (e.g. delete_comment op), in sync with the main table delete. |
94+
| **`hive/indexer/blocks.py`** | **Fork rollback:** `_pop()` deletes from the main cache for affected blocks and also runs `DELETE FROM hive_posts_cache_temp WHERE post_id IN :ids` so temp stays in sync. |
95+
96+
### Summary
97+
98+
- **Writes to temp:** (1) One-time cold-start backfill in `db_state.py`; (2) Dual-write in `cached_post.py` (same transaction as `hive_posts_cache`).
99+
- **Deletes from temp:** (1) 90-day prune every 60s in `cache_sync.py`; (2) Single-post delete in `CachedPost.delete()`; (3) Bulk delete on fork rollback in `blocks._pop()`.
100+
101+
---
102+
103+
## Related Code
104+
105+
- `hive/db/cache_router.py` – table selection by `sort`
106+
- `hive/indexer/cache_sync.py` – 90-day pruning of `hive_posts_cache_temp`
107+
- `hive/db/db_state.py` – migration that creates and backfills temp (v24→25)
108+
- `hive/server/bridge_api/cursor.py` – list + pagination (e.g. `pids_by_category`, `pids_by_community`)
109+
- `hive/server/condenser_api/cursor.py` – `pids_by_query`
110+
- `hive/server/hive_api/objects.py` – `comments_by_id` (temp + main), `posts_by_id` (main only)

hive/db/adapter.py

Lines changed: 1 addition & 1 deletion
@@ -189,6 +189,6 @@ def _is_write_query(sql):
189189
return False
190190
if action in ['DELETE', 'UPDATE', 'INSERT', 'COMMIT', 'START',
191191
'ALTER', 'TRUNCA', 'CREATE', 'DROP I', 'DROP T',
192-
'ANALYZ']: # ANALYZE command
192+
'ANALYZ', 'SET LO']: # ANALYZE command; SET LOCAL for migration
193193
return True
194194
raise Exception("unknown action: {}".format(sql))

hive/db/cache_router.py

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
1+
"""Query routing for hive_posts_cache vs hive_posts_cache_temp."""
2+
3+
4+
class CacheRouter:
5+
"""Route hot-data queries to temp table, others to main table."""
6+
7+
TEMP_TABLE = 'hive_posts_cache_temp'
8+
MAIN_TABLE = 'hive_posts_cache'
9+
10+
HOT_QUERIES = {
11+
'trending', 'hot', 'payout', 'payout_comments', 'created', 'promoted'
12+
}
13+
14+
@classmethod
15+
def get_table(cls, query_type=None):
16+
"""Return table name for the given query type."""
17+
if query_type and query_type in cls.HOT_QUERIES:
18+
return cls.TEMP_TABLE
19+
return cls.MAIN_TABLE
20+
21+
@classmethod
22+
def get_temp_sql(cls, base_sql):
23+
"""Replace main table name with temp table in SQL."""
24+
return base_sql.replace(cls.MAIN_TABLE, cls.TEMP_TABLE)

hive/db/db_state.py

Lines changed: 35 additions & 1 deletion
@@ -4,10 +4,12 @@
44

55
import time
66
import logging
7+
from datetime import datetime, timedelta
78

89
from hive.db.schema import (setup, reset_autovac, build_metadata,
910
build_metadata_community, teardown, DB_VERSION,
10-
build_metadata_blacklist, build_trxid_block_num)
11+
build_metadata_blacklist, build_trxid_block_num,
12+
build_temp_cache_metadata)
1113
from hive.db.adapter import Db
1214

1315
log = logging.getLogger(__name__)
@@ -383,6 +385,38 @@ def _check_migrations(cls):
383385

384386
log.info("[HIVE] idx_posts_id_author_deleted_depth_community index created")
385387
cls._set_ver(24)
388+
if cls._ver == 24:
389+
# Hot-data temp table for hive_posts_cache (90-day window; fed by dual-write, pruned about every 60s)
390+
if not cls.db().query_col(
391+
"SELECT EXISTS(SELECT 1 FROM information_schema.tables WHERE table_name='hive_posts_cache_temp')"
392+
)[0]:
393+
log.info("[HIVE] Creating hive_posts_cache_temp table...")
394+
build_temp_cache_metadata().create_all(cls.db().engine())
395+
log.info("[HIVE] hive_posts_cache_temp created")
396+
# One-time cold-start backfill (90-day window); disable statement_timeout for long run
397+
cls.db().query("SET LOCAL statement_timeout = '0'")
398+
cutoff = datetime.now() - timedelta(days=90)
399+
backfill_sql = """
400+
INSERT INTO hive_posts_cache_temp (
401+
post_id, author, permlink, category, community_id, depth, children,
402+
author_rep, flag_weight, total_votes, up_votes, title, preview, img_url,
403+
payout, promoted, created_at, payout_at, updated_at, is_paidout,
404+
is_nsfw, is_declined, is_full_power, is_hidden, is_grayed,
405+
rshares, sc_trend, sc_hot, body, votes, json, raw_json, _synced_at
406+
)
407+
SELECT post_id, author, permlink, category, community_id, depth, children,
408+
author_rep, flag_weight, total_votes, up_votes, title, preview, img_url,
409+
payout, promoted, created_at, payout_at, updated_at, is_paidout,
410+
is_nsfw, is_declined, is_full_power, is_hidden, is_grayed,
411+
rshares, sc_trend, sc_hot, body, votes, json, raw_json,
412+
NOW() as _synced_at
413+
FROM hive_posts_cache
414+
WHERE created_at >= :cutoff
415+
"""
416+
result = cls.db().query(backfill_sql, cutoff=cutoff)
417+
n = result.rowcount if hasattr(result, 'rowcount') else 0
418+
log.info("[HIVE] hive_posts_cache_temp cold-start backfill done, rows=%s", n)
419+
cls._set_ver(25)
386420

387421
reset_autovac(cls.db())
388422

hive/db/schema.py

Lines changed: 57 additions & 1 deletion
@@ -10,7 +10,7 @@
1010

1111
#pylint: disable=line-too-long, too-many-lines, bad-whitespace
1212

13-
DB_VERSION = 24
13+
DB_VERSION = 25
1414

1515
def build_metadata():
1616
"""Build schema def with SqlAlchemy"""
@@ -247,8 +247,64 @@ def build_metadata():
247247

248248
metadata = build_trxid_block_num(metadata)
249249

250+
metadata = build_temp_cache_metadata(metadata)
251+
252+
return metadata
253+
254+
255+
def build_temp_cache_metadata(metadata=None):
256+
"""Build hive_posts_cache_temp table for hot-data queries (90-day window)."""
257+
if not metadata:
258+
metadata = sa.MetaData()
259+
260+
sa.Table(
261+
'hive_posts_cache_temp', metadata,
262+
sa.Column('post_id', sa.Integer, primary_key=True, autoincrement=False),
263+
sa.Column('author', VARCHAR(16), nullable=False),
264+
sa.Column('permlink', VARCHAR(255), nullable=False),
265+
sa.Column('category', VARCHAR(255), nullable=False, server_default=''),
266+
sa.Column('community_id', sa.Integer, nullable=True),
267+
sa.Column('depth', SMALLINT, nullable=False, server_default='0'),
268+
sa.Column('children', SMALLINT, nullable=False, server_default='0'),
269+
sa.Column('author_rep', sa.Float(precision=6), nullable=False, server_default='0'),
270+
sa.Column('flag_weight', sa.Float(precision=6), nullable=False, server_default='0'),
271+
sa.Column('total_votes', sa.Integer, nullable=False, server_default='0'),
272+
sa.Column('up_votes', sa.Integer, nullable=False, server_default='0'),
273+
sa.Column('title', sa.String(255), nullable=False, server_default=''),
274+
sa.Column('preview', sa.String(1024), nullable=False, server_default=''),
275+
sa.Column('img_url', sa.String(1024), nullable=False, server_default=''),
276+
sa.Column('payout', sa.types.DECIMAL(10, 3), nullable=False, server_default='0'),
277+
sa.Column('promoted', sa.types.DECIMAL(10, 3), nullable=False, server_default='0'),
278+
sa.Column('created_at', sa.DateTime, nullable=False, server_default='1990-01-01'),
279+
sa.Column('payout_at', sa.DateTime, nullable=False, server_default='1990-01-01'),
280+
sa.Column('updated_at', sa.DateTime, nullable=False, server_default='1990-01-01'),
281+
sa.Column('is_paidout', BOOLEAN, nullable=False, server_default='0'),
282+
sa.Column('is_nsfw', BOOLEAN, nullable=False, server_default='0'),
283+
sa.Column('is_declined', BOOLEAN, nullable=False, server_default='0'),
284+
sa.Column('is_full_power', BOOLEAN, nullable=False, server_default='0'),
285+
sa.Column('is_hidden', BOOLEAN, nullable=False, server_default='0'),
286+
sa.Column('is_grayed', BOOLEAN, nullable=False, server_default='0'),
287+
sa.Column('rshares', sa.BigInteger, nullable=False, server_default='0'),
288+
sa.Column('sc_trend', sa.Float(precision=6), nullable=False, server_default='0'),
289+
sa.Column('sc_hot', sa.Float(precision=6), nullable=False, server_default='0'),
290+
sa.Column('body', TEXT),
291+
sa.Column('votes', TEXT),
292+
sa.Column('json', sa.Text),
293+
sa.Column('raw_json', sa.Text),
294+
sa.Column('_synced_at', sa.DateTime, nullable=True),
295+
sa.Index('hive_posts_cache_temp_ix6a', 'sc_trend', 'post_id',
296+
postgresql_where=sql_text("is_paidout = '0'")),
297+
sa.Index('hive_posts_cache_temp_ix7a', 'sc_hot', 'post_id',
298+
postgresql_where=sql_text("is_paidout = '0'")),
299+
sa.Index('hive_posts_cache_temp_ix9a', 'depth', 'payout', 'post_id',
300+
postgresql_where=sql_text("is_paidout = '0'")),
301+
sa.Index('hive_posts_cache_temp_ix30', 'community_id', 'sc_trend', 'post_id',
302+
postgresql_where=sql_text("community_id IS NOT NULL AND is_grayed = '0' AND depth = 0")),
303+
sa.Index('hive_posts_cache_temp_created', 'created_at'),
304+
)
250305
return metadata
251306

307+
252308
def build_metadata_community(metadata=None):
253309
"""Build community schema defs"""
254310
if not metadata:

hive/indexer/blocks.py

Lines changed: 1 addition & 0 deletions
@@ -223,6 +223,7 @@ def _pop(cls, blocks):
223223
# remove posts: core, tags, cache entries
224224
if post_ids:
225225
DB.query("DELETE FROM hive_posts_cache WHERE post_id IN :ids", ids=post_ids)
226+
DB.query("DELETE FROM hive_posts_cache_temp WHERE post_id IN :ids", ids=post_ids)
226227
DB.query("DELETE FROM hive_post_tags WHERE post_id IN :ids", ids=post_ids)
227228
DB.query("DELETE FROM hive_posts WHERE id IN :ids", ids=post_ids)
228229
