Typical transformers apply self-attention between tokens that vary across time, so each entry of the resulting attention (correlation) matrix is essentially a dot product between a pair of moments in time.
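To make that layout concrete, here's a minimal sketch (PyTorch, with made-up sizes and random weights, not taken from any particular implementation) of the standard arrangement: one token per time step, so the attention matrix is T x T over moments in time.

```python
import torch

T, D = 96, 64                      # T time steps, D embedding channels (illustrative sizes)
x = torch.randn(T, D)              # one token per time step

Wq, Wk, Wv = (torch.randn(D, D) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv   # each still (T, D)

attn = torch.softmax(q @ k.T / D ** 0.5, dim=-1)  # (T, T): one score per pair of time steps
out = attn @ v                     # each output token is a weighted mix over time
```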
The iTransformer authors seem to be saying that, for certain time series forecasting tasks, it's not correct to lump all the variates (channels) at a single time step into one token, because that assumes the data were collected at precisely the same moment and with similar instruments. In reality, the different varieties of data in a data set are often not precisely aligned in time and can have very different distributions depending on how the data was collected.
So the iTransformer model proposes to apply self-attention across variates (channels) instead of across time. Self-attention otherwise seems to work the same way. Query and key matrices are still calculated, but they project each variate's whole series as its own token instead of projecting the collection of channel values at a single moment. The query-key calculation then measures the degree to which the individual time series (variates) are correlated with one another, and those correlations are used to weight the value vectors, producing new variate tokens that are weighted averages of the old ones.
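Here's my reading of that inverted layout as a minimal sketch (again PyTorch, with illustrative sizes, not the official iTransformer code): each variate's entire look-back window is embedded into one token, and attention produces an N x N matrix over variates instead of a T x T matrix over time steps.

```python
import torch

T, N, D = 96, 7, 64                 # T time steps, N variates, D model dim (illustrative)
series = torch.randn(N, T)          # each row holds one variate's whole look-back window

embed = torch.nn.Linear(T, D)       # per-variate embedding: a whole series becomes one token
tokens = embed(series)              # (N, D): one token per variate

Wq, Wk, Wv = (torch.nn.Linear(D, D) for _ in range(3))
q, k, v = Wq(tokens), Wk(tokens), Wv(tokens)

attn = torch.softmax(q @ k.T / D ** 0.5, dim=-1)  # (N, N): correlations between variates
mixed = attn @ v                    # each variate token is now a weighted average over variates
```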
Then the feed-forward layer projects each variate token independently. In a standard transformer the token it processes holds all the variates at a single time step, so the variates get mixed together; here the mixing stays entirely within one variate's representation.
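A sketch of that feed-forward step, under the same assumed shapes as above: the MLP is applied to each variate token on its own, so nothing crosses between variates here.

```python
import torch

N, D = 7, 64                        # N variates, D model dim (same illustrative sizes)
tokens = torch.randn(N, D)          # output of the attention block: one token per variate

ffn = torch.nn.Sequential(          # position-wise MLP, applied row by row
    torch.nn.Linear(D, 4 * D),
    torch.nn.GELU(),
    torch.nn.Linear(4 * D, D),
)
out = ffn(tokens)                   # (N, D): each variate transformed independently
```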
Also, since layer normalization now acts within a single variate's token, they claim this reduces the noise that would come from normalizing across variates that were collected using different methods. Each variate's distribution characteristics stay within that variate instead of bleeding across variates and potentially erasing information.
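And the normalization step, sketched the same way: LayerNorm over the last dimension of a variate token uses only that variate's own statistics, so one variate's mean and variance never touch another's.

```python
import torch

N, D = 7, 64                        # N variates, D model dim (illustrative)
tokens = torch.randn(N, D)

norm = torch.nn.LayerNorm(D)        # statistics computed per row, i.e. per variate token
normed = norm(tokens)               # each variate normalized with its own mean/variance
```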
They lay out more of their reasoning for this approach in the paper, and I feel like I agree with their intuitions. But the paper needs some serious proofreading; the verbiage is very hard to parse.