Belief-Guided Neural ISMCTS

Last updated: 2026-06-28

This document is a design memo for Belief-Guided Neural ISMCTS. The detailed configuration is kept in current-method.md; this file gives a concise map of how the proposed method is laid out in code.

Approach

This method is not NN-only: the Policy/Value Network and ISMCTS are combined in actual play as well.

Belief Model
+ Policy/Value Model
+ PublicKnowledgeTracker
+ Information Set MCTS
= Belief-Guided Neural ISMCTS Agent

v12 does not put rollouts on the main path. Following AlphaGo Zero, the tree stops at the leaf, and search is driven by the NN prior/value plus a light progress value.

Module Map

CABT observation / legal options
  |
  v
src/pca/features/encoder.py
  - observation tokenization
  - legal action tokenization
  |
  +--> src/pca/models/policy_value.py
  |      - ActionConditionedPolicyValueNet
  |      - policy logits per legal option
  |      - scalar value
  |
  +--> src/pca/models/belief.py
         - BeliefNet
         - opponent hand/deck/prize logits
         - threat heads

src/pca/search/belief.py
  - BeliefPrior
  - PublicKnowledgeTracker
  - hidden state sampling

src/pca/search/ismcts/
  - public information tree
  - determinization schedule
  - PUCT with NN prior
  - progressive widening

src/pca/search/mcts.py
  - terminal/progress value
  - v12_prize_race profile
  - no-progress / deck-out / seed-out shaping

src/pca/training/selfplay/
  - self-play collection
  - full observation targets for belief training
  - incremental JSONL output

src/pca/training/train.py
  - policy/value training
  - low-progress policy downweighting

src/pca/evaluation/tournament/
  - online ISMCTS evaluation
  - deck-out / pokemon-out / normal-win metrics

State and Action Encoding

features/encoder.py tokenizes the CABT observation and select.option.

Information included:

own active / bench / hand / discard / prize
opponent's public active / bench / discard / prize count / hand count
deck count, turn, turnActionCount
usage state of supporter / stadium / energy attachment / retreat
HP, damage, energy, tool, pre-evolution
special conditions
recent logs
legal option type / context / target / card id

Policy / Value Model

ActionConditionedPolicyValueNet is not a fixed-action classifier; it directly scores the legal options that CABT returns for each position.

state_tokens -> Transformer state encoder -> state_vec
action_tokens -> action encoder -> action_vec
[state_vec, action_vec] -> policy_head
state_vec -> value_head
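
The scoring path in the diagram above can be illustrated with a minimal, dependency-free sketch. It assumes the policy head is a small MLP applied to the concatenation [state_vec, action_vec]; the function names, the single tanh layer, and the toy weights are all hypothetical and are not taken from src/pca/models/policy_value.py. The point it shows is that the output length follows the number of legal options rather than a fixed action space.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def policy_logit(state_vec, action_vec, w_hidden, w_out) -> float:
    """Score one legal option from the concatenation [state_vec, action_vec]."""
    x = list(state_vec) + list(action_vec)
    hidden = [math.tanh(dot(row, x)) for row in w_hidden]
    return dot(w_out, hidden)


def policy_over_legal_options(state_vec, action_vecs, w_hidden, w_out) -> List[float]:
    """Softmax over however many legal options this position has."""
    logits = [policy_logit(state_vec, a, w_hidden, w_out) for a in action_vecs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy weights (hypothetical): one hidden unit that only reads the action feature.
W_HIDDEN = [[0.0, 0.0, 1.0]]
W_OUT = [1.0]
state = [1.0, 0.0]
two_options = policy_over_legal_options(state, [[1.0], [-1.0]], W_HIDDEN, W_OUT)
three_options = policy_over_legal_options(state, [[1.0], [0.0], [-1.0]], W_HIDDEN, W_OUT)
```

The same weights handle a position with two options and one with three, which is the property a fixed-action classifier lacks.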

The deck context supports none | mean | set_transformer. The Set Transformer and card static features are already implemented and were used in the v7-series checkpoints. The v10/v11 checkpoints temporarily used deck_context_mode=none to prioritize stabilizing the teacher search and isolating structural differences, but from 2026-06-28 onward the v12 training default returns to deck_context_mode=set_transformer.

Belief Model

BeliefNet estimates the card distribution of the hidden zones from the public observation.

state_tokens
  -> Transformer encoder
  -> opponent_hand_logits
  -> opponent_deck_logits
  -> opponent_prize_logits
  -> next_threat_logits
  -> knockout_threat

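At inference time, hidden states are sampled by combining BeliefNet's soft prior with the hard constraints from PublicKnowledgeTracker.

Notes:

BeliefNet does not look at the opponent's hand directly.
--full-observation-targets exists to generate belief training labels and is not used as input at submission time.
When the hidden active is not public, only Pokemon card ids are used as active candidates.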
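
One way to combine a soft prior with hard constraints is sketched below. It assumes the belief model supplies one logit per card id and the tracker supplies how many copies of each card are still unaccounted for; the function name, the dictionary inputs, and the choice to weight each card by its remaining copies are illustrative assumptions, not the actual API of src/pca/search/belief.py.

```python
import math
import random
from typing import Dict, List


def sample_hidden_hand(
    hand_logits: Dict[str, float],
    remaining_copies: Dict[str, int],
    hand_size: int,
    rng: random.Random,
) -> List[str]:
    """Draw a hand card by card; a card with no remaining copies is never drawn."""
    remaining = {c: n for c, n in remaining_copies.items() if n > 0}
    if sum(remaining.values()) < hand_size:
        raise ValueError("hard constraints leave too few cards for this hand size")
    hand: List[str] = []
    for _ in range(hand_size):
        cards = sorted(remaining)
        peak = max(hand_logits.get(c, 0.0) for c in cards)
        # Soft prior (exp of logit) times the hard-constraint copy count.
        weights = [
            remaining[c] * math.exp(hand_logits.get(c, 0.0) - peak) for c in cards
        ]
        card = rng.choices(cards, weights=weights, k=1)[0]
        hand.append(card)
        remaining[card] -= 1
        if remaining[card] == 0:
            del remaining[card]
    return hand


# "C" has a huge logit but zero copies left, so it can never be sampled.
example_hand = sample_hidden_hand(
    {"A": 0.0, "B": 1.0, "C": 100.0},
    {"A": 1, "B": 2, "C": 0},
    hand_size=3,
    rng=random.Random(0),
)
```

Calling the function several times with different seeds yields the multiple determinizations that the search consumes.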

ISMCTS

ISMCTS does the following:

Shares nodes by public information key.
Samples multiple belief-guided hidden states as determinizations.
Uses the Policy/Value model's policy logits as the PUCT prior.
Uses the NN value and the progress value at the leaf.
Root noise and visit temperature are for self-play only.
Noise is turned off for evaluation / submission.
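
The selection step can be sketched with the AlphaZero-style PUCT rule, Q + c_puct * P * sqrt(N) / (1 + n). This memo does not state which PUCT variant or constant the project uses, so the formula, the c_puct default, and the Edge class below are assumptions. The ISMCTS-specific part is that only actions legal in the current determinization compete.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Edge:
    prior: float            # NN policy probability for this action
    visits: int = 0
    value_sum: float = 0.0

    @property
    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def puct_select(edges: Dict[str, Edge], legal: List[str], c_puct: float = 1.5) -> str:
    """Pick the legal action that maximizes Q + U."""
    parent_visits = sum(edges[a].visits for a in legal)
    sqrt_n = math.sqrt(parent_visits + 1)

    def score(action: str) -> float:
        e = edges[action]
        return e.q + c_puct * e.prior * sqrt_n / (1 + e.visits)

    return max(legal, key=score)


edges = {"attack": Edge(prior=0.7), "end": Edge(prior=0.3)}
first_pick = puct_select(edges, ["attack", "end"])   # prior dominates when unvisited

# After ten bad outcomes for "attack", exploration shifts to "end".
edges["attack"].visits = 10
edges["attack"].value_sum = -10.0
later_pick = puct_select(edges, ["attack", "end"])
```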

Added in v12:

root / non-root candidate cap
progressive widening
pruning protection for attack / choice / forced actions
pruning protection for empty-bench Pokemon play
reach stats logging
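
A candidate cap with progressive widening and pruning protection might look like the sketch below. The widening schedule (base_cap + floor(visits^alpha)), the default constants, and the option kind labels are all hypothetical; the memo only names the mechanisms, not their parameters.

```python
from typing import List, Sequence, Tuple

Option = Tuple[str, str, float]  # (name, kind, prior): illustrative layout


def widened_candidates(
    options: Sequence[Option],
    visits: int,
    base_cap: int = 4,
    alpha: float = 0.5,
    protected_kinds: Tuple[str, ...] = ("attack", "choice", "forced"),
) -> List[str]:
    """Keep the top-prior options up to a cap that grows with node visits.

    Options whose kind is protected are kept even when they fall below the cap.
    """
    cap = base_cap + int(visits ** alpha)
    ranked = sorted(options, key=lambda o: o[2], reverse=True)
    kept = list(ranked[:cap])
    kept += [o for o in ranked[cap:] if o[1] in protected_kinds]
    return [o[0] for o in kept]


options = [
    ("play_item", "play", 0.50),
    ("retreat", "retreat", 0.30),
    ("end", "end", 0.15),
    ("attack_a", "attack", 0.05),
]
narrow = widened_candidates(options, visits=0, base_cap=1)
wide = widened_candidates(options, visits=4, base_cap=1)
```

With no visits only the top option passes the cap, yet the low-prior attack survives because its kind is protected; after four visits the cap has grown enough to admit every option.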

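Value Shaping

The v12_prize_race profile is a light shaping that steers search toward normal wins and taking prizes.

Included:

prize difference
actual prize-taking delta
damage progress on the opponent's active
avoiding deck-out losses
penalty for a no-progress END
empty bench seed-out risk
small safety bonus for playing a Pokemon from an empty bench

Excluded:

card-role heuristics that hard-code a support Pokemon as an attacker
a flat bonus simply for having more Pokemon in play
weak rollouts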
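
The included terms could be combined roughly as in the sketch below. Every field name, weight, and threshold here is a placeholder chosen for illustration; the real v12_prize_race profile lives in src/pca/search/mcts.py and its coefficients are not given in this memo.

```python
from dataclasses import dataclass

# Hypothetical weights, not the project's actual coefficients.
WEIGHTS = {
    "prize_diff": 0.10,
    "prize_delta": 0.15,
    "active_damage": 0.05,
    "deck_out_risk": 0.30,
    "no_progress_end": 0.10,
    "empty_bench_risk": 0.20,
    "empty_bench_play_bonus": 0.05,
}


@dataclass(frozen=True)
class ProgressSnapshot:
    my_prizes_taken: int
    opp_prizes_taken: int
    prizes_taken_delta: int            # prizes taken since the search root
    opp_active_damage_frac: float      # damage / max HP, in [0, 1]
    my_deck_count: int
    my_bench_count: int
    ended_without_progress: bool
    played_pokemon_from_empty_bench: bool


def progress_value(s: ProgressSnapshot, low_deck_threshold: int = 2) -> float:
    v = WEIGHTS["prize_diff"] * (s.my_prizes_taken - s.opp_prizes_taken)
    v += WEIGHTS["prize_delta"] * s.prizes_taken_delta
    v += WEIGHTS["active_damage"] * s.opp_active_damage_frac
    if s.my_deck_count <= low_deck_threshold:
        v -= WEIGHTS["deck_out_risk"]
    if s.ended_without_progress:
        v -= WEIGHTS["no_progress_end"]
    if s.my_bench_count == 0:
        v -= WEIGHTS["empty_bench_risk"]
    if s.played_pokemon_from_empty_bench:
        v += WEIGHTS["empty_bench_play_bonus"]
    return max(-1.0, min(1.0, v))      # keep the shaping on the value scale


ahead = ProgressSnapshot(2, 1, 1, 0.5, 20, 2, False, False)
stalled = ProgressSnapshot(0, 0, 0, 0.0, 1, 0, True, False)
```

Keeping the weights small relative to the terminal value range is what makes this a light shaping rather than a replacement for the NN value.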

Training

In self-play, both the actual actions and the teacher policy are produced from the public observation + belief-guided ISMCTS.

Main targets:

search_policy: root visit distribution
selected_action: index of the option actually chosen
final_result: final win/loss
turn_value: optional auxiliary value
belief target: hidden hand/deck/prize labels built from the full observation
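
As an illustration of how search_policy can be derived, the sketch below normalizes root visit counts with an optional temperature and serializes one sample as a JSONL line. The target names come from the list above, but the record layout, the encoding of final_result as 1.0, and the helper name are assumptions rather than the collector's actual schema.

```python
import json
from typing import List


def visit_distribution(visit_counts: List[int], temperature: float = 1.0) -> List[float]:
    """Turn root visit counts into a policy target."""
    if temperature <= 0:
        # Greedy limit: all mass on the most visited option.
        best = max(range(len(visit_counts)), key=lambda i: visit_counts[i])
        return [1.0 if i == best else 0.0 for i in range(len(visit_counts))]
    scaled = [c ** (1.0 / temperature) for c in visit_counts]
    total = sum(scaled)
    if total == 0:
        return [1.0 / len(visit_counts)] * len(visit_counts)
    return [s / total for s in scaled]


root_visits = [30, 10, 0]
record = {
    "search_policy": visit_distribution(root_visits),
    "selected_action": 0,
    "final_result": 1.0,   # assumed encoding: +1 win, -1 loss
    "turn_value": None,    # optional auxiliary value
}
jsonl_line = json.dumps(record)  # one line per decision, appended incrementally
```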

In policy/value training, policy imitation is weakened for weak teacher trajectories.

A deck-out win is not treated as a strong win.
A deck-out loss is strongly bad.
Passive deck-out wins can be excluded.
Unfinished / long games get a lower policy weight.
Low-quality trajectories that take no prizes get a lower policy weight.
For a seed-out loss (reason=3), the value stays a loss while policy imitation is weakened.
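
These rules amount to a per-trajectory multiplier on the policy loss, with the value target left untouched. The sketch below shows one possible shape; the multipliers, the field names, and the value encoding are invented for illustration and do not reflect the actual settings in src/pca/training/train.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectorySummary:
    won: bool
    finished: bool
    deck_out: bool      # game ended by deck-out (win or loss)
    seed_out: bool      # game ended by seed-out (reason=3 in the memo)
    prizes_taken: int


def policy_weight(t: TrajectorySummary) -> float:
    """Down-weight policy imitation for weak teacher trajectories (hypothetical factors)."""
    w = 1.0
    if not t.finished:
        w *= 0.25
    if t.deck_out:
        w *= 0.5
    if t.seed_out and not t.won:
        w *= 0.25
    if t.prizes_taken == 0:
        w *= 0.5
    return w


def value_target(t: TrajectorySummary) -> float:
    """The value label keeps the real outcome; 0.0 for unfinished games is an assumption."""
    if not t.finished:
        return 0.0
    return 1.0 if t.won else -1.0


clean_win = TrajectorySummary(won=True, finished=True, deck_out=False, seed_out=False, prizes_taken=3)
seed_out_loss = TrajectorySummary(won=False, finished=True, deck_out=False, seed_out=True, prizes_taken=1)
```

Separating the two functions mirrors the seed-out rule: the outcome still trains the value head at full strength while the moves that led there are imitated less.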

Evaluation

Evaluation is based on online ISMCTS against online ISMCTS rather than checkpoint-only play.

Metrics tracked:

normal wins
prizes taken
attacks
first attack step
first prize step
deck-out losses
pokemon-out losses
unfinished
attack/prize reached rate
action quality in replays

deck-out and pokemon-out are handled separately from normal wins.
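
A small aggregation over per-game results shows how these outcomes can be kept apart. The game record fields and end_reason strings below are hypothetical stand-ins; the tournament module's actual log format is not described in this memo.

```python
from typing import Dict, List, Optional, Union

Game = Dict[str, Optional[Union[str, int]]]


def summarize(games: List[Game], agent: str) -> Dict[str, float]:
    """Count outcomes for one agent, separating end reasons from normal wins."""
    summary: Dict[str, float] = {
        "normal_wins": 0,
        "deck_out_losses": 0,
        "pokemon_out_losses": 0,
        "unfinished": 0,
    }
    attack_reached = 0
    prize_reached = 0
    for g in games:
        if g["winner"] is None:
            summary["unfinished"] += 1
        elif g["winner"] == agent:
            if g["end_reason"] == "normal":
                summary["normal_wins"] += 1
        elif g["end_reason"] == "deck_out":
            summary["deck_out_losses"] += 1
        elif g["end_reason"] == "pokemon_out":
            summary["pokemon_out_losses"] += 1
        # Step fields describe the evaluated agent; None means never reached.
        attack_reached += g["first_attack_step"] is not None
        prize_reached += g["first_prize_step"] is not None
    n = len(games)
    summary["attack_reached_rate"] = attack_reached / n if n else 0.0
    summary["prize_reached_rate"] = prize_reached / n if n else 0.0
    return summary


games = [
    {"winner": "a", "end_reason": "normal", "first_attack_step": 5, "first_prize_step": 9},
    {"winner": "b", "end_reason": "deck_out", "first_attack_step": 7, "first_prize_step": None},
    {"winner": "b", "end_reason": "pokemon_out", "first_attack_step": None, "first_prize_step": None},
    {"winner": None, "end_reason": None, "first_attack_step": 3, "first_prize_step": None},
]
report = summarize(games, agent="a")
```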

Canonical Reference

For the detailed specification, recommended commands, and known issues, see current-method.md.