feat: antispam scoring package#5178
Conversation
🛠 PR Checks SummaryAll Automated Checks passed. ✅ Manual Checks (for Reviewers):
Read More🤖 This bot helps streamline PR reviews by verifying automated checks and providing guidance for contributors and reviewers. ✅ Automated Checks (for Contributors):🟢 Maintainers must be able to edit this pull request (more info) ☑️ Contributor Actions:
☑️ Reviewer Actions:
📚 Resources:Debug
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
jeronimoalbi
left a comment
There was a problem hiding this comment.
I have to still go though the code, lots to cover. It's interesting though 👍
| // When EarlyExitAt is set to a positive threshold, each earlyExit | ||
| // check can short-circuit before reaching costlier rules. | ||
| // Use EarlyExitDisabled (0, the default) to evaluate all rules. | ||
| func Score(in ScoreInput) SpamScore { |
There was a problem hiding this comment.
Just a though but it might be worth exploring a Score() implementation that relies on a rule scoring interface based solution. If each one of the scoring rules follows an interface I think it probably would be possible to implement Score() as basically a loop that iterates the rules and updates the SpamScore instance. Maybe something like ScoringRule.Score(ScoreInput) (score int)?
There was a problem hiding this comment.
Thanks for the suggestion! I actually chose the procedural approach to keep the gas optimization strategy explicit. Rules are ordered cheap - to -> expensive with earlyExit checks between groups, so obvious spam never hits regex/Bayes/fingerprints. An interface pipeline would hide that pattern and make it harder to maintain.
That said, no rule plugins are planned right now: but what if that becomes a need down the line? Would refactoring to an interface then be the right move (wagni or not?), or do you see a better approach?
There was a problem hiding this comment.
The interface pipeline could also be ordered and a comment could be used to describe the gas optimization strategy.
Another lighter approach might to just use a scoring function type:
type ScoringFunc func(ScoreInput) []RuleHit
// ...
rules := []ScoringFunc{
ScoreBlocklist,
ScoreRate,
ScoreReputation,
// ...
}In any case is just a recommendation, if it makes sense, to avoid defining the inline functions, the earlyExit() calls and reducing the function's body.
Do not remove this file. It is important information, and a canary file.
Of all places, /r/UFOResearch should not censor this kind of information. DO NOT DELETE THIS FILE, DO NOT TRUST ANYONE WHO DELETES THIS FILE.
added law of spinning mass.
…ring with new tests and updating documentation for clarity
…spam poisoning and enhance training accuracy, add tests for decay behavior and keyword dictionary size limits
…nd README with detailed auto-training safety guidelines and new features
…opulating moderation data in Score function and updating documentation for clarity
…g and scoring functions, ensuring consistency in antispam package; update tests and documentation accordingly
…nimum matches to prevent false positives in long content; update tests and documentation accordingly
2cd0dfa to
97020f8
Compare
| // Truncate oversized input to cap gas cost. | ||
| if len(in.Content) > MaxInputLength { | ||
| in.Content = in.Content[:MaxInputLength] | ||
| } |
There was a problem hiding this comment.
Truncating the content might leave relevant pieces out of the scoring, I think is not a good idea truncating it within the package 🤔, devs can truncate it when calling Score() from a realm.
| func (bl *Blocklist) AllowAddress(addr string) { | ||
| bl.allowed.Set(addr, true) | ||
| } |
There was a problem hiding this comment.
Minor improvement, it could be applied to the other cases too:
| func (bl *Blocklist) AllowAddress(addr string) { | |
| bl.allowed.Set(addr, true) | |
| } | |
| func (bl *Blocklist) AllowAddress(addr string) { | |
| bl.allowed.Set(addr, struct{}{}) | |
| } |
| // rebuildCombined compiles all patterns into a single alternation regex. | ||
| // Called when patterns are added or removed. Compilation cost is paid once | ||
| // at admin time, not per Score() call. | ||
| func (bl *Blocklist) rebuildCombined() { |
There was a problem hiding this comment.
Why not using individual ones instead of combining them? Compiling a big expression or even evaluating it might end up being more expensive. Also with individual ones you would be able to potentially stop early.
With that change you could provably also remove the limit of 30 expressions.
| // ReputationData holds caller-provided account context for an author. | ||
| type ReputationData struct { | ||
| AccountAgeDays int | ||
| Balance int64 |
There was a problem hiding this comment.
Maybe:
| Balance int64 | |
| Balance chain.Coins |
If so then later on you could do Balance.AmountOf(string) int64, which could be used in the future to improve the reputation implementation.
| AccountAgeDays int | ||
| Balance int64 | ||
| FlaggedCount int | ||
| TotalAccepted int |
There was a problem hiding this comment.
Maybe something like ValidContentCount or AcceptedContentCount? In any case some field documentation would be handy to better understand the field 🙏
| } | ||
|
|
||
| const ( | ||
| repMinAgeDays = 1 // accounts younger than this are "new" |
There was a problem hiding this comment.
WDYT about using 15 or 30 days instead of 1? Reasoning is that within that period account is new and might be starting with the first interactions.
|
|
||
| // Ban history penalty | ||
| if rep.BanCount > 0 { | ||
| penalty := rep.BanCount * repBanPenalty |
There was a problem hiding this comment.
Right now repBanPenalty is 1, so it could be removed 🤔
| HasUsername: false, | ||
| BanCount: 0, | ||
| }, | ||
| wantMin: 3, |
There was a problem hiding this comment.
Test would benefit from using the constants:
| wantMin: 3, | |
| wantMin: WeightNewAccount + WeightNoUsername + WeightLowBalance, |
| import antispamr "gno.land/r/gnoland/antispam" | ||
| import engine "gno.land/p/gnoland/antispam" |
There was a problem hiding this comment.
| import antispamr "gno.land/r/gnoland/antispam" | |
| import engine "gno.land/p/gnoland/antispam" | |
| import ( | |
| antispamr "gno.land/r/gnoland/antispam" | |
| engine "gno.land/p/gnoland/antispam" | |
| ) |
| if corpus == nil || corpus.Size() < bayesMinCorpusSize { | ||
| return 0, "" | ||
| } | ||
| if len(tokens) == 0 { | ||
| return 0, "" | ||
| } |
There was a problem hiding this comment.
Could be joined:
| if corpus == nil || corpus.Size() < bayesMinCorpusSize { | |
| return 0, "" | |
| } | |
| if len(tokens) == 0 { | |
| return 0, "" | |
| } | |
| if corpus == nil || corpus.Size() < bayesMinCorpusSize || len(tokens) == 0 { | |
| return 0, "" | |
| } |
| bayesMinCorpusSize = 10 | ||
|
|
||
| // bayesSpamThresholdPct is the spam ratio threshold (percentage). | ||
| // Tokens appearing in spam more than this% of the time are considered spam indicators. |
There was a problem hiding this comment.
Typo
| // Tokens appearing in spam more than this% of the time are considered spam indicators. | |
| // Tokens appearing in spam more than this % of the time are considered spam indicators. |
| // new observation. This decay gives recent observations more weight and | ||
| // prevents corpus poisoning from being permanent - essential for | ||
| // auto-training scenarios where moderation actions feed the corpus. | ||
| func (c *Corpus) Train(content string, isSpam bool) { |
There was a problem hiding this comment.
Out of curiosity, wouldn't some ham tokens potentially be considered spam if they are used a lot when content is trained as spam?
| // normalizeLeet converts common leet speak digit substitutions back to letters. | ||
| // Only handles digit-based leet (0->o, 1->i, 3->e, 4->a, 5->s, 7->t) since | ||
| // symbol-based leet (@, $) is stripped during tokenization. | ||
| func normalizeLeet(s string) string { |
There was a problem hiding this comment.
Could be public, it might be handy for other packages 👍
| if dict == nil || dict.Size() == 0 { | ||
| return 0, "" | ||
| } | ||
| if len(tokens) == 0 { | ||
| return 0, "" | ||
| } |
There was a problem hiding this comment.
Also possible:
| if dict == nil || dict.Size() == 0 { | |
| return 0, "" | |
| } | |
| if len(tokens) == 0 { | |
| return 0, "" | |
| } | |
| if dict == nil || dict.Size() == 0 || len(tokens) == 0 | |
| return 0, "" | |
| } |
| if c.Size() < 4 { | ||
| t.Errorf("expected size >= 4, got %d", c.Size()) | ||
| } |
There was a problem hiding this comment.
Why not?
| if c.Size() < 4 { | |
| t.Errorf("expected size >= 4, got %d", c.Size()) | |
| } | |
| if c.Size() != 6 { | |
| t.Errorf("expected size 6, got %d", c.Size()) | |
| } |
| if c.Size() != 9 { | ||
| t.Fatalf("test setup: expected corpus size 9, got %d", c.Size()) | ||
| } |
There was a problem hiding this comment.
Could be removed, already tested in TestCorpus():
| if c.Size() != 9 { | |
| t.Fatalf("test setup: expected corpus size 9, got %d", c.Size()) | |
| } |
There was a problem hiding this comment.
Size checks could be removed from other Bayes related tests to keep them simpler
| return corpus, dict | ||
| } | ||
|
|
||
| func TestCryptoContentFalsePositives(t *testing.T) { |
There was a problem hiding this comment.
Using table tests here would keep the tests DRY
| if len(fp1) == 0 { | ||
| t.Fatal("expected non-empty fingerprint") | ||
| } |
There was a problem hiding this comment.
This could be removed, there is already a test case that checks zero length:
| if len(fp1) == 0 { | |
| t.Fatal("expected non-empty fingerprint") | |
| } |
| if s.urlCount > linkMaxCount { | ||
| hits = append(hits, RuleHit{WeightLinkHeavy, RuleLinkHeavy}) | ||
| } |
There was a problem hiding this comment.
I'm wondering if it makes sense to count links, tutorials or other type of Markdown content are valid and could be link heavy. It would be better to consider the number of links versus the rest of the content, or maybe just to remove the rule.
|
I think we would need more eyes on the PR, it's quite big and has a lot of specifics to check. Also having others opinions would be helpful. Maybe the realm could be part of other PR, for easier review and merge. cc @Kouteki |
|
@alexiscolin do we need to add more reviewers to this? Is this a in the next 2 week cycle or should I push it another cycle back? |
SpamAssassin-style scoring for Gno realms. Multi-rule engine (19 rules) combining content checks, rate limiting, reputation, Bayesian filter, duplicate detection, and blocklists. Scores accumulate - above 5 might hide, above 8 might reject. Multiple signals make the system harder to game.
Two components:
p/gnoland/antispam- Pure scoring library (import + customize)r/gnoland/antispam- Shared realm with defaults (21 scam patterns, 400 keywords)Detection Strategies
Architecture
Package (
p/gnoland/antispam)Realm (
r/gnoland/antispam)Score() is read-only - does NOT auto-update reputation. Calling realm must manually record moderation decisions (prevents false positives from poisoning shared data) - Admin only.
See pure README and realm README for planned detailed usage, gas optimization, and integration patterns.