LSSPT
Introduction
conventional deep learning-based methods
- LSTM, GCN
- explore group activity representations under supervised or weakly supervised modes
- require manually annotated personal action labels (costly data annotation)
background
- NLP: unsupervised learning
- self-supervised learning (SSL) develops
- in self-supervised representation learning (SSRL), the temporal evolution of group activities has not yet been explicitly exploited
- predictive coding scheme
group activities
- more complex state dynamics
- lead to failure of SSRL using RNNs (modeling complex sequential relations is difficult)
- LSTM-based models lack attention to long-range dependencies in the history sequence
- Transformer networks in NLP are restricted to regular sequential data
- humans repeat certain motions over long periods within a group activity
- exploiting multiple ranges of historical information
LSSPT
encoder-decoder framework
- encoder: summarize group state
- decoder: anticipate the state in the future
- based on a relation graph and a causal Transformer
sparse graph Transformer
- spatial state context in short time
causal temporal Transformer (CTT)
- long range temporal dynamics
Approach
predictive coding
- spatio-temporal encoding function
- prediction function
- optimization function (scheme sketched below)
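A minimal PyTorch sketch of this encode-predict-optimize scheme; the module names (`f_enc`, `f_pred`), dimensions, and the MSE comparison are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Sketch of the predictive-coding scheme (all names/dims are assumptions).
# f_enc: spatio-temporal encoding function mapping observed frames to group states.
# f_pred: prediction function anticipating future states from past states.
# The optimization function compares predicted states with encoded future states.

class PredictiveCoding(nn.Module):
    def __init__(self, feat_dim=256, state_dim=128):
        super().__init__()
        self.f_enc = nn.Linear(feat_dim, state_dim)                    # placeholder encoder
        self.f_pred = nn.GRU(state_dim, state_dim, batch_first=True)   # placeholder predictor

    def forward(self, past_feats, future_feats):
        # past_feats: (B, T_past, feat_dim), future_feats: (B, T_future, feat_dim)
        past_states = self.f_enc(past_feats)        # encode observed frames
        future_states = self.f_enc(future_feats)    # encode future frames (targets)
        _, h = self.f_pred(past_states)             # summarize the history
        pred = h[-1].unsqueeze(1).expand_as(future_states)  # naive roll-out of the prediction
        # optimization function: make predicted states match encoded future states
        return nn.functional.mse_loss(pred, future_states)

# toy usage
model = PredictiveCoding()
loss = model(torch.randn(2, 8, 256), torch.randn(2, 4, 256))
```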
Architecture
- feature extraction
  - a pre-trained I3D model extracts person features
- long-short state encoding
- long-short state decoding
- training and inference (loss combination sketched below)
  - reconstruction loss
  - contrastive loss
  - adversarial loss
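A hedged sketch of how the three losses might be combined during training; the InfoNCE form of the contrastive term, the generator-side adversarial term, and the loss weights are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Sketch of combining the three training losses (exact formulations and weights are assumptions).

def reconstruction_loss(pred_state, true_state):
    # predicted future group state vs. the state actually encoded from future frames
    return F.mse_loss(pred_state, true_state)

def contrastive_loss(pred_state, true_state, temperature=0.1):
    # InfoNCE-style: the matching (pred, true) pair is the positive,
    # all other samples in the batch serve as negatives
    pred = F.normalize(pred_state, dim=-1)
    true = F.normalize(true_state, dim=-1)
    logits = pred @ true.t() / temperature                # (B, B) similarity matrix
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)

def adversarial_loss(discriminator, pred_state):
    # generator-side term: the predictor tries to make predicted states look "real"
    score = discriminator(pred_state)
    return F.binary_cross_entropy_with_logits(score, torch.ones_like(score))

def total_loss(pred_state, true_state, discriminator, w_rec=1.0, w_con=1.0, w_adv=0.1):
    return (w_rec * reconstruction_loss(pred_state, true_state)
            + w_con * contrastive_loss(pred_state, true_state)
            + w_adv * adversarial_loss(discriminator, pred_state))

# toy usage
disc = torch.nn.Linear(128, 1)
loss = total_loss(torch.randn(4, 128), torch.randn(4, 128), disc)
```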
Long-Short State Encoder
sparse graph transformer
building
\(\{p_i^t\}_{i=1}^{N}\), where \(p_i^t \in \mathbb{R}^d\) denotes the feature of the \(i\)-th person at frame \(t\)
sparse graph \(G^t = \{V^t, E^t\}\), where \(V^t = \{p_i^t\}_{i=1}^{N}\) is the node set and \(E^t = \{(i,j) \mid p_i^t \text{ and } p_j^t \text{ are connected at time } t\}\) is the edge set
neighbors of node \(i\): \(\mathrm{Nei}(i,t) = \{p_j^t\}_{j=1}^{M}\), where each \(p_j^t\) satisfies \((i,j) \in E^t\)
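The notes do not specify the connectivity criterion behind \(E^t\), so the sketch below assumes each person is linked to its \(M\) nearest neighbors in feature space, purely for illustration.

```python
import torch

# Sketch of building the sparse graph G^t = {V^t, E^t} at one frame.
# Connecting each person to its M nearest neighbors is an assumption, not the paper's rule.

def build_sparse_graph(p_t, num_neighbors=3):
    """p_t: (N, d) person features at frame t -> (N, M) neighbor indices per node."""
    dist = torch.cdist(p_t, p_t)                        # (N, N) pairwise feature distances
    dist.fill_diagonal_(float('inf'))                   # a node is not its own neighbor
    nei = dist.topk(num_neighbors, largest=False).indices  # Nei(i, t): M closest nodes
    return nei

p_t = torch.randn(12, 256)        # N = 12 persons, d = 256
neighbors = build_sparse_graph(p_t)
```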
update
each node's representation is updated from \(h_i\) to \(\hat{h}_i\) using the keys passed from its neighbor nodes and the node's own query
$$
\hat{h}_i = \mathrm{softmax}\!\left(\left[\frac{q_i\,k_j^{\top}}{\sqrt{d}}\right]_{j=1}^{N}\right)\,[v_j]_{j=1}^{N}
$$

where \(q_i\) denotes the query of node \(i\), \(k_j\) the key and \(v_j\) the value of node \(j\)
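A sketch of this attention-based node update over the sparse neighborhood; the linear projections and the \(1/\sqrt{d}\) scaling are standard choices assumed here.

```python
import torch
import torch.nn as nn

# Sketch of the node update: node i forms a query from its own feature and attends
# over the keys/values of its neighbors, producing the updated embedding \hat{h}_i.

class SparseGraphAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, h, nei):
        # h: (N, dim) node features, nei: (N, M) neighbor indices from the sparse graph
        q = self.q(h)                      # query of each node i
        k = self.k(h)[nei]                 # (N, M, dim) keys of the neighbors
        v = self.v(h)[nei]                 # (N, M, dim) values of the neighbors
        attn = torch.softmax((q.unsqueeze(1) * k).sum(-1) * self.scale, dim=-1)  # (N, M)
        h_hat = (attn.unsqueeze(-1) * v).sum(1)   # weighted sum of neighbor values
        return h_hat                       # updated node embeddings \hat{h}_i
```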
group state modeling
$$
g_t = P_{\max}\big(\mathrm{Norm}(f_o(\hat{h}_1^t), \dots, f_o(\hat{h}_N^t))\big)
$$

where \(g_t\) is the group state at frame \(t\), \(P_{\max}\) is a max-pooling layer, \(\mathrm{Norm}\) denotes layer normalization, and \(f_o\) is a fully connected layer
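A small sketch of this aggregation step, with `LayerNorm` standing in for \(\mathrm{Norm}\) and max pooling over the person dimension; the dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the group-state aggregation g_t = P_max(Norm(f_o(h_1), ..., f_o(h_N))).

class GroupStateAggregator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.f_o = nn.Linear(dim, dim)    # fully connected layer applied per person
        self.norm = nn.LayerNorm(dim)     # Norm

    def forward(self, h_hat):
        # h_hat: (N, dim) updated person embeddings at frame t
        x = self.norm(self.f_o(h_hat))    # per-person projection + normalization
        g_t = x.max(dim=0).values         # P_max: max pooling over the person dimension
        return g_t                        # (dim,) group state of frame t
```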
causal temporal transformer
- masked Transformer
- add temporal positional encoding based on the absolute frame index
- stack multiple CTT layers, each consisting of masked multi-head attention, LayerNorm (layer normalization), and an MLP (feed-forward block)
- the mask ensures the model attends only to the allowed part of the input (similar to causal attention in LLMs, where later tokens cannot influence the attention given to earlier ones); see the sketch below
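A sketch of one CTT layer built from PyTorch's `nn.MultiheadAttention` with a causal (upper-triangular) mask; layer sizes and the residual arrangement are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of one causal temporal Transformer (CTT) layer over the sequence of group states.
# The causal mask lets frame t attend only to frames <= t, as in decoder-style attention.

class CausalTemporalLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, g_seq):
        # g_seq: (B, T, dim) group states with temporal positional encoding already added
        T = g_seq.size(1)
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=g_seq.device), diagonal=1
        )                                               # True = attention blocked (future frames)
        x, _ = self.attn(g_seq, g_seq, g_seq, attn_mask=causal_mask)
        g_seq = self.norm1(g_seq + x)                   # masked multi-head attention + residual
        g_seq = self.norm2(g_seq + self.mlp(g_seq))     # MLP (feed-forward) + residual
        return g_seq
```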
Long-Short State Decoder
- state attention modules: build dependencies between the long-term and short-term states
- state update modules: output the fused long- and short-term information (see the sketch below)
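A hedged sketch of the decoder: cross-attention stands in for the state attention module and a small projection for the state update module; the exact module boundaries are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the long-short state decoder: short-term states query the long-term history
# (state attention), then a projection emits the fused representation (state update).

class LongShortStateDecoder(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.state_attention = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.state_update = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, short_states, long_states):
        # short_states: (B, T_s, dim) recent group states; long_states: (B, T_l, dim) full history
        dep, _ = self.state_attention(short_states, long_states, long_states)  # state attention
        return self.state_update(short_states + dep)                           # state update
```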
https://meteor041.git.io/2024/10/20/LSSPT/