"The quadratic resource requirements of the attention mechanism are the main roadblock in scaling up transformers to long sequences. This paper replaces the full quadratic attention mechanism by a combination of random attention, window attention, and global attention."


Personal server of Lukáš Lánský