At Borealis AI, we firmly believe that the development of responsible ML requires diverse views, research and talent. And we are committed to encouraging greater diversity and inclusion in our actions, our research and our collaborative partnerships.

That is why Borealis AI is proud to support the 2021 Women in Machine Learning (WiML) Workshop. This important event gives female-identified faculty, research scientists, and graduate students in the machine learning community an opportunity to meet, exchange ideas and learn from each other. In doing so, WiML is on a mission to increase gender diversity in ML, help women-identified individuals in ML to succeed professionally, and increase their impact within their communities.

“At Borealis AI, we are committed to empowering and engaging female-identified researchers in the field of ML,”

noted Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI.

“Alongside our range of other diversity and inclusion initiatives, we hope our support of the 2021 WiML Workshop at NeurIPS provides those researchers – and those aspiring to join the field of ML – with the role models, ideas and inspiration to drive their career in ML forward.”

Hosted virtually within the 2021 Conference on Neural Information Processing Systems (NeurIPS), this year’s event builds on 15 years of programs designed around substantive technical and professional conversations held within positive, supportive environments. To learn more about WiML and the WiML Workshop, click here.

The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) is the preeminent forum for collaboration around computational linguistics and natural language processing. This year’s conference is expected to attract around 4,000 attendees, both physical and virtual. But, for a wide variety of reasons, forums like this can often be difficult to access for some researchers. And that directly impacts diversity.

As a Diversity and Inclusion sponsor of the EMNLP, we aim to support researchers facing various types of hardships. We are helping provide accommodations for researchers with disabilities. We are helping to subsidize attendance for those dealing with financial hardship, those with family or childcare responsibilities, and those first-time attendees from underrepresented regions or groups. And we are helping to enable remote participation for researchers unable to travel to the conference.

“Borealis AI is dedicated to growing, strengthening and diversifying the global machine learning talent pool through innovative and smart partnerships like our Diversity and Inclusion sponsorship of EMNLP 2021,”

“We look forward to meeting the attendees at our virtual booth and we are excited to see what new ideas, models and technologies will emerge from the event.”


In contrast to single-task learning (STL), multi-task learning (MTL) optimizes a single model to perform multiple related tasks simultaneously, aiming to improve generalization and parameter efficiency across tasks. In this case, two or more output targets are associated with the same input data. Effective multi-task learning typically requires task balancing to prevent one or more tasks from dominating the optimization, to decrease negative transfer, and to avoid overfitting. Standard MTL settings usually assume a homogeneous set of tasks (for example, all tasks are classification tasks or all are regression tasks), usually on non-sequential data. This scenario can greatly benefit MTL approaches with strong shared representations. In contrast, heterogeneous multi-task learning is defined by multiple classes of tasks, such as classification and regression with single- or multi-label characteristics and temporal data, being optimized simultaneously. The latter setting is more realistic but remains underexplored. In this post, we share a novel method that we recently developed for heterogeneous MTL.

Hard-parameter sharing networks [1], shown in Figure 1.b, are one of the pillars of multi-task learning. These networks are composed of a shared bottom and task-specific branches. Ma et al. [2] suggested that a single shared bottom might not be enough to generalize across all tasks in an application, and proposed using several shared bottoms, which they call experts. The experts are combined using gate functions, and their combination is forwarded to the towers. The final architecture is called Multi-gate Mixture-of-Experts (MMoE), and is shown in Figure 1.c. MMoE generalizes better than its traditional hard-parameter sharing counterpart, but it has two weaknesses: first, it lacks a task-balancing mechanism; second, the only source of diversity among the experts is random initialization. Although the experts can indeed be diverse enough if they specialize in different tasks, there are no guarantees that this will happen in practice. We propose the Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) (Figure 1.d), a model that induces more diversity among the experts and has a task-balancing component.

Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) takes its inspiration from ensemble learning, where ensembles with more diverse learners tend to generalize better. MMoEEx can be divided into three parts: gates, experts and towers. Considering an application with $K$ tasks and input data $x \in \mathbb{R}^d$, the gate function $g^k()$ is defined as:

\begin{equation}

\label{eq:g}

g^k(x) = \text{softmax}(W^k x), \forall k \in \{0,...,K\} \tag{1}

\end{equation}

where $W^k \in \mathbb{R}^{E \times d}$ are learnable weights and $E$ is the number of experts, defined by the user. The gates control the contribution of each expert to each task.

The experts are denoted $f_e(),\forall e\in \{0,...,E\}$. Our implementation accepts several expert architectures, which is essential for applications with different data types. For example, if working with temporal data, the experts can be LSTMs, GRUs, RNNs; for non-temporal data, the experts can be dense layers. The experts' and gates' outputs are combined as follows:

\begin{equation}

\label{eq:f}

f^k(x) = \sum_{e=0}^E g^k_e(x) f_e(x), \forall k \in \{0,...,K\} \tag{2}

\end{equation}

where $g^k_e(x)$ is the $e$-th component of the gate output $g^k(x)$. The $f^k()$ are input to the towers, the task-specific part of the architecture. Their design depends on the data type and tasks. The towers $h^k$ output the task predictions as follows:

\begin{equation}

y^k = h^k(f^k(x)), \forall k \in \{0,...,K \} \tag{3}

\end{equation}
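As a rough illustration of Equations 1 to 3, the forward pass can be sketched in plain Python. This is a minimal sketch and not the implementation used in our experiments: the experts are assumed to be single dense layers with a `tanh` activation, the towers are assumed to be linear, and all names (`mmoe_forward`, `random_matrix`, the dimensions) are hypothetical.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def mmoe_forward(x, gate_weights, expert_weights, tower_weights):
    # Experts f_e: toy dense layers with tanh activation (illustrative choice).
    expert_out = [[math.tanh(v) for v in matvec(W_e, x)] for W_e in expert_weights]
    hidden = len(expert_out[0])
    gates, predictions = [], []
    for W_k, w_tower in zip(gate_weights, tower_weights):
        # Eq. 1: one softmax gate per task, one weight per expert.
        g = softmax(matvec(W_k, x))
        # Eq. 2: gate-weighted sum of the expert outputs.
        f_k = [sum(g[e] * expert_out[e][j] for e in range(len(expert_out)))
               for j in range(hidden)]
        # Eq. 3: task-specific tower (assumed linear here).
        y_k = sum(w * v for w, v in zip(w_tower, f_k))
        gates.append(g)
        predictions.append(y_k)
    return predictions, gates

rng = random.Random(0)
d, E, K, hidden_size = 4, 3, 2, 5
x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
gate_weights = [random_matrix(E, d, rng) for _ in range(K)]
expert_weights = [random_matrix(hidden_size, d, rng) for _ in range(E)]
tower_weights = [[rng.uniform(-1.0, 1.0) for _ in range(hidden_size)] for _ in range(K)]
predictions, gates = mmoe_forward(x, gate_weights, expert_weights, tower_weights)
```

Each task gets its own gate vector over the same set of experts, so two tasks can weight the shared experts differently for the same input.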

Previous Mixture of Experts models like [2] leverage several experts to make their final predictions; however, they rely on indirect approaches, such as random initialization, to foster diversity among the experts, and on the expectation that the gate function will learn how to combine these experts. Here we propose a mechanism to induce diversity among the experts, defined as $\textit{exclusivity}$.

**Exclusivity**: We set $\alpha E$ experts to be exclusively connected to one task. The value $\alpha\in[0,1]$ controls the proportion of experts that will be $\textit{exclusive}$. If $\alpha=1$, all experts are exclusive, and if $\alpha=0$, all experts are shared (same as MMoE). An exclusive expert is randomly assigned to one of the tasks $T_k$, but the task $T_k$ can still be associated with other exclusive experts and shared experts.

MMoEEx, similarly to MMoE, relies on the expectation that gate functions will learn how to combine the experts. Our approach induces more diversity by forcing some of these gates to be 'closed' to some experts, and the exclusivity mechanism is used to close part of the gates. The remaining non-closed gates learn to combine the output of each expert based on the input data, according to Equation 1.
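One possible way to picture the exclusivity mechanism is as a binary mask over the gates. The sketch below is an assumption-laden illustration: the helper names (`exclusivity_mask`, `masked_softmax`) are hypothetical, and renormalizing the softmax over the open gates is one plausible way to "close" a gate, not necessarily the exact mechanism in our implementation.

```python
import math
import random

def exclusivity_mask(num_experts, num_tasks, alpha, rng):
    """Return mask[k][e] = 1 if task k may use expert e, else 0."""
    num_exclusive = int(alpha * num_experts)
    mask = [[1] * num_experts for _ in range(num_tasks)]
    # Pick which experts are exclusive, then assign each to one random task.
    for e in rng.sample(range(num_experts), num_exclusive):
        owner = rng.randrange(num_tasks)
        for k in range(num_tasks):
            if k != owner:
                mask[k][e] = 0  # gate of task k is closed for expert e
    return mask

def masked_softmax(logits, mask_row):
    """Softmax over the open gates only; closed gates get weight 0."""
    open_logits = [z for z, m in zip(logits, mask_row) if m]
    if not open_logits:
        # Can happen for large alpha: a task with no reachable expert.
        return [0.0] * len(logits)
    top = max(open_logits)
    exps = [math.exp(z - top) if m else 0.0 for z, m in zip(logits, mask_row)]
    total = sum(exps)
    return [v / total for v in exps]

rng = random.Random(0)
mask = exclusivity_mask(num_experts=4, num_tasks=2, alpha=0.5, rng=rng)
gate_task0 = masked_softmax([0.3, -1.2, 0.8, 0.1], mask[0])
```

With $\alpha = 0.5$ and four experts, two experts stay shared by both tasks and two are reachable from a single task each.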

Competing task optimization is another challenge of optimizing heterogeneous tasks. The goal of the MAML-MTL optimization is to balance the tasks on the gradient level. Finn et al. [3] proposed Model-Agnostic Meta-Learning (MAML), a two-step optimization approach originally intended to be used with transfer learning and few-shot learning due to its fast convergence. Initial attempts to apply MAML to MTL show that MAML can balance the tasks on the gradient level and yield better results than some existing task-balancing approaches [4]. The core idea is that MAML's temporary update yields smoothed losses, which also smooth the gradients in direction and magnitude. However, unlike [4], we do not freeze task-specific layers during the intermediate (inner) update. The MAML-MTL approach is shown in Figure 2. It first evaluates each task loss. Each task loss is then used to temporarily update the network, the temporarily updated network is re-evaluated, and the resulting task-specific temporary losses are aggregated to form the final loss, which provides the actual network update.
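The two-step update can be sketched on a toy problem. The example below is only an illustration under strong assumptions: each task loss is a hypothetical quadratic $\frac{s_k}{2}\lVert\theta - c_k\rVert^2$, the temporarily updated parameters are re-evaluated on the same task that produced them, and the gradient through the inner step is written out by hand (for a quadratic it is just a factor $1 - \eta s_k$). A real network would rely on automatic differentiation instead.

```python
def task_loss(theta, center, scale):
    return 0.5 * scale * sum((t - c) ** 2 for t, c in zip(theta, center))

def task_grad(theta, center, scale):
    return [scale * (t - c) for t, c in zip(theta, center)]

def maml_mtl_step(theta, tasks, inner_lr, outer_lr):
    """One MAML-MTL style update; returns new parameters and aggregated loss."""
    total_grad = [0.0] * len(theta)
    total_loss = 0.0
    for center, scale in tasks:
        # 1) Evaluate the task loss (gradient) at the current parameters.
        grad = task_grad(theta, center, scale)
        # 2) Temporary (inner) update using only this task.
        theta_tmp = [t - inner_lr * g for t, g in zip(theta, grad)]
        # 3) Re-evaluate the task on the temporary parameters.
        total_loss += task_loss(theta_tmp, center, scale)
        # Chain rule through the inner step: d(theta_tmp)/d(theta) = 1 - inner_lr * scale.
        chain = 1.0 - inner_lr * scale
        grad_tmp = task_grad(theta_tmp, center, scale)
        total_grad = [tg + chain * g for tg, g in zip(total_grad, grad_tmp)]
    # 4) Actual update from the aggregated temporary losses.
    new_theta = [t - outer_lr * g for t, g in zip(theta, total_grad)]
    return new_theta, total_loss

# Two toy tasks with very different loss scales (1 vs 10).
tasks = [([1.0, 0.0], 1.0), ([0.0, 1.0], 10.0)]
theta = [0.0, 0.0]
losses = []
for _ in range(200):
    theta, loss = maml_mtl_step(theta, tasks, inner_lr=0.05, outer_lr=0.05)
    losses.append(loss)
```

In this toy setup, and for these particular learning rates, the task with the 10x larger loss scale ends up with an effective weight of only about 2.8x in the aggregated loss, which gives a feel for how the temporary update can soften task imbalance.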

Heterogeneous MTL Experiments

The Medical Information Mart for Intensive Care (MIMIC-III) database was proposed by [5] as a benchmark dataset for MTL on time-series data. It contains metrics of patients from over 40,000 intensive care unit (ICU) stays. This dataset has 4 tasks: two binary tasks, one temporal multi-label task, and one temporal classification task. Figure 3 shows the neural network adopted in our work and where each task is calculated.

The full set of results for the MIMIC-III dataset is presented in Table 1. We compared our approach, MMoEEx, with the multitask channel-wise LSTM (MCW-LSTM) [6], a single-task trained network, shared bottom, and MMoE [2].

MMoEEx outperforms all the compared approaches except on the Phenotype (Pheno) task. For both time series tasks (LOS and Decomp) our approach outperforms all baselines. It is worth noting that for the LOS task, which is the hardest task on MIMIC-III, we present a relative improvement of more than $40\%$ when compared to the multitask channel-wise LSTM [6] and of more than $16\%$ when compared to MMoE.

Method | Pheno | LOS | Decomp | Ihm | $\Delta$ |

MCW-LSTM [6] | 77.4 | 45.0 | 90.5 | 87.0 | +0.28% |

Single Task [6] | 77.0 | 45.0 | 91.0 | 86.0 | - |

Shared Bottom | 73.36 | 30.60 | 94.12 | 82.71 | -9.28% |

MMoE | 75.09 | 54.48 | 96.20 | 90.44 | +7.36% |

MMoEEx | 72.44 | 63.45 | 96.82 | 90.73 | +11.74% |
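The $\Delta$ column is not defined in this post. Assuming it is the average relative change of each method with respect to the single-task baseline across the four tasks, the short check below reproduces the reported values to within rounding, along with the relative LOS improvements quoted in the text. The function names are hypothetical.

```python
single_task = {"Pheno": 77.0, "LOS": 45.0, "Decomp": 91.0, "Ihm": 86.0}
results = {
    "MCW-LSTM": {"Pheno": 77.4, "LOS": 45.0, "Decomp": 90.5, "Ihm": 87.0},
    "Shared Bottom": {"Pheno": 73.36, "LOS": 30.60, "Decomp": 94.12, "Ihm": 82.71},
    "MMoE": {"Pheno": 75.09, "LOS": 54.48, "Decomp": 96.20, "Ihm": 90.44},
    "MMoEEx": {"Pheno": 72.44, "LOS": 63.45, "Decomp": 96.82, "Ihm": 90.73},
}

def relative_change(new, base):
    """Relative change of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

def average_delta(scores, baseline):
    """Mean relative change over all tasks (assumed definition of the Delta column)."""
    changes = [relative_change(scores[task], baseline[task]) for task in baseline]
    return sum(changes) / len(changes)

deltas = {name: average_delta(scores, single_task) for name, scores in results.items()}

# Relative LOS improvement of MMoEEx over MCW-LSTM and over MMoE.
los_vs_mcw = relative_change(results["MMoEEx"]["LOS"], results["MCW-LSTM"]["LOS"])
los_vs_mmoe = relative_change(results["MMoEEx"]["LOS"], results["MMoE"]["LOS"])
```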

We measured how diverse MMoEEx experts are compared to traditional MMoE.

The diversity among experts can be scored by the distance between the experts' outputs $f_e, \forall e\in\{0,..., E\}$. Considering a pair of experts $i$ and $j$, the distance between them is defined as:

\begin{equation}

d_{i,j} = \sqrt{\sum_{n=0}^N(f_i(x_n)-f_j(x_n))^2} \tag{4}

\end{equation}

where $N$ is the number of samples in the dataset, $d_{i,j} = d_{j,i}$, and a matrix $D \in \mathbb{R}^{E\times E}$ is used to keep all the distances. To scale the distances into $d_{i,j}\in [0,1]$, we divide the raw entries in the distance matrix $D$ by the maximum distance observed, $\text{max}(D)$. A pair of experts $i,j$ with $d_{i,j} = 0$ is considered identical, and expert pairs with $d_{i,j}$ close to 0 are considered very similar; analogously, expert pairs with $d_{i,j}$ close to 1 are considered very dissimilar. To compare the overall distance between the experts of a model, we define the $\textit{diversity score}$ $\bar{d}$ as the mean entry in $D$.
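The diversity score can be sketched as follows. This is a minimal illustration that assumes each expert produces one scalar per sample (vector outputs would first need to be flattened) and that the mean is taken over all $E \times E$ entries, including the zero diagonal; the function name `diversity_score` is hypothetical.

```python
import math

def diversity_score(expert_outputs):
    """expert_outputs[e][n] is the output of expert e on sample n."""
    num_experts = len(expert_outputs)
    D = [[0.0] * num_experts for _ in range(num_experts)]
    for i in range(num_experts):
        for j in range(i + 1, num_experts):
            # Eq. 4: Euclidean distance between the outputs of experts i and j.
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(expert_outputs[i], expert_outputs[j])))
            D[i][j] = D[j][i] = dist
    # Scale all distances into [0, 1] by the largest observed distance.
    max_dist = max(max(row) for row in D)
    if max_dist > 0:
        D = [[v / max_dist for v in row] for row in D]
    score = sum(sum(row) for row in D) / (num_experts * num_experts)
    return D, score

# Experts 0 and 2 are identical; expert 1 differs from both.
outputs = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
D, score = diversity_score(outputs)
```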

We analyze the diversity score of the MMoE and MMoEEx experts on MIMIC-III. The MMoE and MMoEEx models compared have the same neural network structure, but MMoEEx uses the MAML-MTL optimization and enforces diversity through exclusivity. The MMoEEx model in Figure 4 was created with $\alpha = 0.5$. In other words, half of the experts in the MMoEEx model were randomly assigned to be exclusive to one of the tasks, while in the MMoE model all experts are shared among all tasks. Figure 4 shows a heatmap of the distances $D^{MMoE}$ and $D^{MMoEEx}$ calculated on the MIMIC-III testing set with 12 experts. MMoE's heatmap has overall lighter colors, indicating smaller diversity scores, compared with MMoEEx. Quantitatively, MMoEEx produces a relative lift of $43\%$ in diversity score.

We presented a novel multi-task learning approach called Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx), which extends previous methods by introducing an exclusivity mechanism that induces more diversity among experts, allowing the network to learn representations that are more effective for heterogeneous MTL. We also introduce a two-step optimization approach called MAML-MTL, which balances tasks at the gradient level and enhances MMoEEx's capability to optimize imbalanced tasks.

MTL has achieved critical mass in multiple areas like natural language processing [7, 8, 9], computer vision [10, 11, 12], reinforcement learning [13, 14] and multi-modal learning [15, 16]. Standard soft/hard parameter sharing approaches are well-established techniques for handling multiple tasks. While they show improvements over single-task learning for tasks with similar characteristics, it is not fully explored how MTL can further improve heterogeneous task scenarios. Hybrid approaches like mixture of experts can mitigate several limitations of standard approaches and further extend their capabilities when coupled with specialized optimization methods. Optimization methods for MTL are in their infancy, and more research on meta-learning task balancing can greatly benefit MTL research. We hope this work inspires the community to further investigate multi-task learning at the network architecture and optimization levels.

^{[1] } Rich Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, 1993.

^{[2] } Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.

^{[3] } Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.

^{[4] } Sungjae Lee and Youngdoo Son. Multitask learning with single gradient step update for task balancing. arXiv preprint arXiv:2005.09910, 2020.

^{[5] } Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific data, 6(1):1–18, 2019.

^{[6] } Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.

^{[7] } Victor Sanh, Thomas Wolf, and Sebastian Ruder. A hierarchical multi-task approach for learning embeddings from semantic tasks. In AAAI Conference on Artificial Intelligence, volume 33, 2019.

^{[8] } Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Annual Meeting of the Association for Computational Linguistics, 2019.

^{[9] } Cagla Aksoy, Alper Ahmetoglu, and Tunga Güngör. Hierarchical multitask learning approach for BERT. arXiv preprint arXiv:2011.04451, 2020.

^{[10] } Ximeng Sun, Rameswar Panda, Rogerio Feris, and Kate Saenko. AdaShare: Learning what to share for efficient deep multi-task learning. In Advances in Neural Information Processing Systems, 2020.

^{[11] } Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

^{[12] } Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. MTI-Net: Multi-scale task interaction networks for multi-task learning. In European Conference on Computer Vision, 2020.

^{[13] } Lerrel Pinto and Abhinav Gupta. Learning to push by grasping: Using multiple tasks for effective learning. arXiv preprint arXiv:1609.09025, 2016.

^{[14] } Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with popart. Technical report, DeepMind, 2019.

^{[15] } Subhojeet Pramanik, Priyanka Agrawal, and Aman Hussain. OmniNet: A unified architecture for multi-modal multi-task learning. arXiv preprint arXiv:1907.07804, 2019.

^{[16] } Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.

Reinforcement learning (RL) has moved from toy domains to real-world applications such as navigation [4], software engineering [2], industrial design [11], and finance [10]. Each of these applications has inherent difficulties which are long-standing fundamental challenges in RL, such as limited training time, partial observability, large action or state spaces, costly exploration and safety considerations, among others. Similar problems occur when using RL in trading markets; however, we focus on three main aspects that we consider highly relevant for financial applications: risk-awareness, variance reduction, and robustness.

Risk is such a common term that it can have many definitions in different scenarios. Our first question then is, *what is risk?* In the context of trading, risk is the potential that your chosen investments may fail to deliver your anticipated outcome. That could mean getting lower returns than expected, or losing your original investment – and in certain forms of trading, it can even mean a loss that exceeds your deposit.

Our second question is, *how to measure risk?* Risk assessment is a cornerstone in financial applications. A well-known approach is to consider risk while assessing the performance (profit) of a trading strategy. Here, risk is a quantity related to the variance (or standard deviation) of the profit and it is commonly referred to as "volatility". In particular, the Sharpe ratio [15], a commonly used measure in trading markets, considers both the generated profit and the risk (variance) associated with a trading strategy. The Sharpe ratio is commonly defined as the asset return divided by the standard deviation of the asset return.
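As a quick illustration of that definition, the sketch below computes the ratio of the mean return to the sample standard deviation of the returns. The `risk_free_rate` parameter is an assumption added for convenience (with its default of 0 the function matches the definition above), and the return series are made-up numbers.

```python
import statistics

def sharpe_ratio(returns, risk_free_rate=0.0):
    """Mean excess return divided by the sample standard deviation of the returns."""
    excess = [r - risk_free_rate for r in returns]
    volatility = statistics.stdev(excess)
    if volatility == 0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return statistics.mean(excess) / volatility

# Two hypothetical strategies with the same average return but different volatility.
steady = [0.010, 0.012, 0.009, 0.011]
volatile = [0.050, -0.030, 0.040, -0.018]
steady_sharpe = sharpe_ratio(steady)
volatile_sharpe = sharpe_ratio(volatile)
```

Both series average the same profit per period, yet the steadier one scores a much higher Sharpe ratio because its returns vary less.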

Traditional RL aims to optimize the expected return, usually without considering risk. However, *risk-averse RL* is a recent area that proposes optimizing an objective function that takes risk into account.

**Risk-averse Q-learning (RAQL):** Shen et al. [16] proposed a Q-learning algorithm that is shown to converge to the optimum of a risk-sensitive objective function:

\begin{align}

\label{eq:Risk_Averse_Objective}

\tilde{J}_{\pi}= \frac{1}{\beta}\log\mathbb{E}_{\pi}\left[\exp\left(\beta\sum_{t=0}^{\infty}\gamma^t r_t\right)\right]=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right] + \frac{\beta}{2}\mathbb{V}\text{ar}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right] + O(\beta^2).

\end{align}

The training scheme is the same as Q-learning, except that in each iteration, a *utility function* is applied to the TD-error. A utility function is a monotonically increasing function. A concave utility function is applied when we want to optimize a risk-averse objective function; in contrast, a convex utility function is applied when we want to optimize a risk-seeking objective function. To summarize, applying a utility function to Q-learning is a concise way to consider risk in RL.
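A tabular sketch of this idea is shown below. It is an illustration under assumptions rather than the exact algorithm of [16]: the concave utility $u(x) = (1 - e^{-\beta x})/\beta$ is just one convenient monotonically increasing choice with $u(0) = 0$, and the function names and hyperparameters are hypothetical.

```python
import math

def risk_averse_utility(x, beta=0.5):
    """Concave, monotonically increasing utility with u(0) = 0 (illustrative choice)."""
    return (1.0 - math.exp(-beta * x)) / beta

def raql_update(Q, state, action, reward, next_state,
                lr=0.1, gamma=0.9, utility=risk_averse_utility):
    """Q-learning step where the utility is applied to the TD error."""
    td_error = reward + gamma * max(Q[next_state]) - Q[state][action]
    Q[state][action] += lr * utility(td_error)
    return td_error

# Same magnitude of surprise, opposite sign: a gain of +1 versus a loss of -1.
Q_gain = [[0.0, 0.0], [0.0, 0.0]]
Q_loss = [[0.0, 0.0], [0.0, 0.0]]
raql_update(Q_gain, state=0, action=0, reward=1.0, next_state=1)
raql_update(Q_loss, state=0, action=0, reward=-1.0, next_state=1)
```

Because the utility is concave, the negative surprise moves the Q-value further than the positive one of the same size, which is what makes the learned values pessimistic about risky actions.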

In trading markets, we do not only care about the expected return, but also how 'safe' a strategy is. In RL, one common approach to measure 'safety' is by measuring the variance of the return [7]. Here we mention two recent works.

**Averaged-DQN [1]:** This approach reduces training variance by training multiple Q tables in parallel and averaging previously learned Q-value estimates, which leads to a more stable training procedure and improved performance by reducing approximation error variance in the target values. Averaged-DQN is theoretically shown to reduce the training variance, but there is no convergence guarantee.
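The averaging step can be illustrated in a tabular setting. The sketch below is a simplification, assuming small Q tables stored as nested lists instead of neural networks; the names `averaged_q` and `averaged_target` are hypothetical.

```python
from collections import deque

def averaged_q(snapshots, state):
    """Average the action values for `state` over the stored Q-table snapshots."""
    num_actions = len(snapshots[0][state])
    return [sum(q[state][a] for q in snapshots) / len(snapshots)
            for a in range(num_actions)]

def averaged_target(snapshots, reward, next_state, gamma=0.9):
    """Bootstrapped target built from the averaged estimate instead of a single one."""
    return reward + gamma * max(averaged_q(snapshots, next_state))

# Keep the three most recent Q-table snapshots (one state, two actions each).
snapshots = deque(maxlen=3)
snapshots.append([[1.0, 2.0]])
snapshots.append([[1.4, 1.6]])
snapshots.append([[0.6, 2.4]])
q_avg = averaged_q(snapshots, state=0)
target = averaged_target(snapshots, reward=1.0, next_state=0)
```

The individual snapshots disagree noticeably, but their average is steadier, so the target built from it fluctuates less from one update to the next.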

**Variance reduced Q-learning (V-QL):** Wainwright [18] proposed a variance-reduced Q-learning algorithm which can be seen as a variant of the SVRG algorithm in stochastic optimization [9]. Given an algorithm that converges to $Q^*$, one of its iterates $\bar{Q}$ can be used as a proxy for $Q^*$, and the ordinary Q-learning updates are then recentered by a quantity $-\hat{\mathcal{T}}_k(\bar{Q}) + \mathcal{T}(\bar{Q})$, where $\hat{\mathcal{T}}_k$ is an empirical Bellman operator and $\mathcal{T}$ is the population Bellman operator. The latter is not computable, but an unbiased approximation of it can be used instead. This algorithm is shown to converge to the optimum of the expected return and enjoys minimax optimality up to a logarithmic factor.

**Novel proposed algorithm: RA2-Q [6]:** Since RAQL converges to the optimum of a risk-averse objective function, we can use it as a building block and design novel risk-averse RL algorithms based on it. The idea of training multiple Q tables in parallel can be integrated with the utility function technique. More specifically, we train $k$ Q tables in parallel using the update rules in RAQL, estimate the variance by the sample variance of those $k$ Q tables, and then compute a risk-averse $\hat{Q}$ table and select actions according to it, which favors more stable actions. We name this algorithm RA2-Q; it preserves the convergence property of RAQL.
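The action-selection step can be pictured with a small sketch. The exact way RA2-Q combines the $k$ tables is not spelled out in this post, so the form used here (mean of the estimates minus a weighted sample variance) and the `risk_weight` parameter are assumptions made purely for illustration.

```python
import statistics

def risk_averse_q(q_tables, state, risk_weight=1.0):
    """Assumed form: mean of the k estimates minus risk_weight times their sample variance."""
    num_actions = len(q_tables[0][state])
    values = []
    for action in range(num_actions):
        estimates = [q[state][action] for q in q_tables]
        values.append(statistics.mean(estimates)
                      - risk_weight * statistics.variance(estimates))
    return values

def select_action(q_tables, state, risk_weight=1.0):
    q_hat = risk_averse_q(q_tables, state, risk_weight)
    return max(range(len(q_hat)), key=q_hat.__getitem__)

# k = 3 Q tables, one state, two actions.
# Action 0: estimates agree (about 1.0). Action 1: higher on average but very spread out.
q_tables = [
    [[1.0, 0.2]],
    [[1.1, 1.2]],
    [[0.9, 2.2]],
]
risk_neutral_choice = select_action(q_tables, state=0, risk_weight=0.0)
risk_averse_choice = select_action(q_tables, state=0, risk_weight=1.0)
```

With no variance penalty the agent picks the action with the higher average estimate; once the disagreement between tables is penalized, it switches to the action whose estimates are more consistent.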

**Novel proposed algorithm: RA2.1-Q [6]:** We can also combine the 'recenter' technique of V-QL with the utility function technique in a novel algorithm, RA2.1-Q. For each empirical Bellman operator $\hat{\mathcal{T}}_k$, we apply a risk-averse utility function to the TD error. Although we cannot show any convergence guarantee of RA2.1-Q, empirically, RA2.1-Q obtained better results than RA2-Q and RAQL in a multi-agent evaluation.

*What is robustness?* We usually say an algorithm is robust if it is stable under different challenging scenarios. Recent works have improved the robustness of algorithms with *adversarial learning* by assuming two opposing learning processes: one that aims to disturb the most and another one that tries to control the perturbations [12].

**Risk-Averse Robust Adversarial Reinforcement Learning (RARL):** The same concept has been adapted with neural networks in the context of deep RL [14], and in particular RARL [13] extended this idea by combining it with Averaged DQN. RARL trains two agents, a protagonist and an adversary, in parallel. Their respective goals are to maximize/minimize the expected return and to minimize/maximize the variance of the expected return. RARL showed good experimental results, enhancing stability and robustness, without providing theoretical guarantees.

**Novel proposed algorithm: RA3-Q [6]:** The idea of having a protagonist and adversary in the same environment lends itself to multi-agent learning algorithms. In this context, Nash Q-learning [8] is a well-known multi-agent algorithm that can obtain the optimal strategy when there exists a unique Nash equilibrium in general-sum stochastic games. Our last proposal takes inspiration from multi-agent learning algorithms and adversarial RL. In order to achieve a robust risk-averse agent, we combine the idea of adversarial learning with RA2-Q. We assume two opposing learning processes: a protagonist that aims to maximize the expected reward and minimize the variance, and an adversary that aims to disturb the protagonist by minimizing the expected reward and maximizing the variance. We name this adversarial learning algorithm RA3-Q. Although RA3-Q does not have a convergence guarantee, empirically it shows better results in terms of robustness compared to RA2-Q.

How do we measure the superiority of RL agents in trading markets? We use game theory and treat each agent as a player in a stochastic game. In empirical game theory, a meta-game payoff table can be seen as a combination of two matrices $(N|R)$, where each row $N_i$ contains a discrete distribution of $p$ players over $k$ strategies, and each row yields a discrete profile $(n_{\pi_1}, ..., n_{\pi_k})$ indicating exactly how many players play each strategy, with $\sum_{j}n_{\pi_j} = p$. The corresponding strategy profile is $\mathbf{u} = \left(\frac{n_{\pi_1}}{p}, ..., \frac{n_{\pi_k}}{p}\right)$. Each row $R_i$ captures the rewards corresponding to the rows in $N$. Once we have a meta-game payoff table, to view the dominance of different strategies, one can plot a directional field of the payoff tables, where arrows in the strategy space indicate the direction of flow of the population composition over the strategies [17].
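To make the $N$ part of the table concrete, the sketch below enumerates every discrete profile for a small meta-game and converts each row into its strategy profile $\mathbf{u}$. It only builds the counts; the reward matrix $R$ would have to come from simulation. The function names are hypothetical.

```python
def discrete_profiles(num_players, num_strategies):
    """All tuples (n_1, ..., n_k) of non-negative counts that sum to num_players."""
    if num_strategies == 1:
        return [(num_players,)]
    profiles = []
    for n in range(num_players, -1, -1):
        for rest in discrete_profiles(num_players - n, num_strategies - 1):
            profiles.append((n,) + rest)
    return profiles

def strategy_profile(row):
    """Turn player counts into the fraction of players using each strategy."""
    p = sum(row)
    return tuple(n / p for n in row)

# p = 2 players choosing among k = 3 strategies.
rows_N = discrete_profiles(num_players=2, num_strategies=3)
profiles_u = [strategy_profile(row) for row in rows_N]
```

For two players and three strategies this yields six rows, from both players using the first strategy, `(2, 0, 0)`, to both using the last, `(0, 0, 2)`.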

In our first experiment with the open-sourced ABIDES [5] market simulator, our setting consisted of one non-learning agent that replays the market deterministically [3] and learning agents. The learning agents considered are RAQL, RA2-Q and RA2.1-Q. The measure we use is the Sharpe ratio, which is a commonly used risk-averse measure in financial markets. The results are shown in the Figure below.

Our second experiment tested robustness. As a first step, we trained RA2-Q and RA3-Q agents under the same conditions. Then, in the testing phase, we added two types of perturbations: an adversarial agent (trained within RA3-Q) or noise (a.k.a. zero-intelligence) agents in the environment. In both cases, the agents act in a perturbed environment. The results presented in Table 1 show that RA3-Q obtained better results than RA2-Q, highlighting its robustness.

Algorithm/Setting | Adversarial Perturbation | ZI Agents Perturbation |

RA2-Q | 0.5269 | 0.9538 |

RA3-Q | 0.9347 | 1.0692 |

We have argued that risk-awareness, variance reduction and robustness are relevant characteristics for RL agents since those can be used as building blocks to construct algorithms. For example, by using utility functions, parallel training of Q tables, and adversarial learning, different algorithms can be constructed, as shown in Fig. 2.

Table 2 presents a summary of the properties of the algorithms mentioned in this post; those in bold typeface are our novel algorithms [6].

Algorithm | Risk-awareness | Variance reduction | Robustness |

RAQL | ● | ||

Averaged DQN | ● | ||

V-QL | ● | ||

RARL | ● | ● | |

RA2-Q | ● | ● | |

RA2.1-Q | ● | ● | |

RA3-Q | ● | ● | ● |

^{[1] } Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pages 176–185. PMLR, 2017.

^{[2] } Mojtaba Bagherzadeh, Nafiseh Kahani, and Lionel Briand. Reinforcement learning for test case prioritization. arXiv preprint arXiv:2011.01834, 2020.

^{[3] } Tucker Hybinette Balch, Mahmoud Mahfouz, Joshua Lockhart, Maria Hybinette, and David Byrd. How to evaluate trading strategies: Single agent market replay or multiple agent interactive simulation? arXiv preprint arXiv:1906.12010, 2019.

^{[4] } Marc G Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C Machado, Subhodeep Moitra, Sameera S Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836):77–82, 2020.

^{[5] } David Byrd, Maria Hybinette, and Tucker Hybinette Balch. Abides: Towards high-fidelity market simulation for ai research. arXiv preprint arXiv:1904.12066, 2019.

^{[6] } Yue Gao, Kry Yik Chau Lui, and Pablo Hernandez-Leal. Robust Risk-Sensitive Reinforcement Learning Agents for Trading Markets. In Reinforcement Learning for Real Life (RL4RealLife) Workshop at ICML, 2021.

^{[7] } Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

^{[8] } Junling Hu and Michael P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, pages 242–250, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

^{[9] } Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.

^{[10] } Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

^{[11] } Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, et al. Chip placement with deep reinforcement learning. arXiv preprint arXiv:2004.10746, 2020.

^{[12] } Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural computation, 17(2):335–359, 2005.

^{[13] } Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. Risk averse robust adversarial reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8522–8528. IEEE, 2019.

^{[14] } Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817–2826. PMLR, 2017.

^{[15] } William F Sharpe. The sharpe ratio. Journal of portfolio management, 21(1):49–58, 1994.

^{[16] } Yun Shen, Michael J Tobia, Tobias Sommer, and Klaus Obermayer. Risk-sensitive reinforcement learning. Neural computation, 26(7):1298–1328, 2014.

^{[17] } Karl Tuyls, Julien Perolat, Marc Lanctot, Edward Hughes, Richard Everett, Joel Z Leibo, Csaba Szepesvári, and Thore Graepel. Bounds and dynamics for empirical game theoretic analysis. Autonomous Agents and Multi-Agent Systems, 34(1):1–30, 2020.

^{[18] } Martin J Wainwright. Variance-reduced q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.

*The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.*

**Nassim Abdi (NA):**

Businesses have been talking about diversity and inclusion for a long time. Unfortunately, most of the time, they were doing little more than completing a list of ‘checkbox’ items. More recently, however, we have started to see companies thinking more clearly about the business case behind diversity and inclusion. Businesses are starting to realize that they are losing high quality talent and slowing their rate of innovation and creativity simply because they do not have diverse voices at the table. You can’t create the most innovative tool if you aren’t looking at it from all angles and from different perspectives.

The data proves it. Recent research shows that companies with an inclusive workplace enjoy a 42% increase in collaboration. Those with diverse teams report a 1.7x increase in innovation. And there is emerging data that suggests lack of diversity leads to a US$8 billion loss in productivity. Those are numbers businesses can’t simply ignore.

**(NA):**

I think Microsoft’s experience with Tay demonstrated that AI is only as good as the data you put into it. Recent studies suggest that facial recognition tends to be much, much more accurate when it comes to white males than other groups – particularly women of colour – largely because there is much more test data for white males than other populations. So I would argue that bias is already in the system and in the data before we even start.

That is why it is so critically important that AI developers and researchers pay attention to this issue. You must be able to bring that diversity of views or that empathetic mindset approach that only comes from understanding other people’s perspectives and walking in their shoes. Otherwise, you are allowing your decisions to be influenced by implicit biases.

**(NA):**

It really comes down to the relationship between bias and privilege. As developers, you have the privilege of being the decision-makers – you are the ones that control whether you are creating something that will ultimately be inclusive or exclusive. And that influences the ultimate structure of power. It’s not easy. But it starts with understanding that privilege and really checking our biases.

Machine learning models love binary decisions. Yet diversity does not lend itself well to binary thinking. How can we tackle this idea of ‘intersectionality’ as we think about our models?

That is one of the big challenges when it comes to how we use and define technology. For now, it really comes down to ensuring you have real diversity and inclusion in your teams and that you define standards that are more aligned to the human world we actually live in. That means that the governance of these platforms is going to become much more important.

**(NA):**

The first thing is just being aware that we all have implicit bias. It’s human nature; for millions of years, humans were trained to avoid things that were unfamiliar. So the big question is how we can avoid it, particularly in a world of social media echo chambers. The real challenge here is to help people walk in someone else’s shoes – to really start to understand their situation and the world from their perspective.

**(NA):**

The first step for executives and business leaders is to be willing to address the problem. It isn’t always easy to change the status quo. The good news is that the new generation of workers is really starting to change the conversation. They care about where they work and the vision of the organization. They want to see the bigger picture and they want to have a positive impact.

One of the more successful approaches that businesses are adopting is ‘reverse mentorship’. As a practice, it’s been around for a while. But now we are seeing a lot of success from companies using reverse mentorship to create safe spaces for conversations about diversity, equity and inclusion.

**(NA):**

I believe the key is in helping people walk in other people’s shoes and get a real understanding of their perspectives and lived experiences. And that’s really the foundation point for the company I founded – StoryBolt. Simply put, we use the power of storytelling to help organizations unpack implicit bias, gender equality, mental health and more.

As a teacher, I realized very quickly that people learn better from visual stories. Different parts of the brain start working; language comprehension, visual cues and sounds all fire up the neurons. And it creates an experience that can stay with us for the rest of our lives. Rather than just showing smart documentaries that unpack an issue, we also invite the filmmaker to come into the room for a Q&A and discussion on the issues. It’s amazing how that sparks a kind of awareness in people that simply does not go away.

Nassim Abdi, Ph.D., is the CEO and Co-founder of StoryBolt. She is a storyteller and evangelist on finding the intersection of entertainment and learning in the area of diversity, equity, and inclusion. She has 12 years of academic experience in the field of intersectionalities of gender, race, and other identities as they relate to systems of discrimination or disadvantage. Nassim is also the leading actress of a Netflix-featured film, Secret Ballot (by Sony Pictures). Her vision for StoryBolt was shaped by the life-changing experience of the film as it engaged her in Q&A sessions and exposed her to the power of movies and how candid human connections could change perspectives and facilitate courageous conversations in the workplace.

]]>If you’ve got an idea and passion to explore what you can do with it, we want to help you solve it.

Let’s SOLVE it is a new Borealis AI mentorship program that aims to help undergraduate students use AI to make a difference and solve real problems in their communities. You bring the idea and the team, we’ll provide the industry exposure, mentorship, contacts and training you need to make the project a reality.

We’re looking for teams of 3 to 5 undergraduate students (enrolled at Canadian universities) with ideas on how AI and ML could be used to tackle a specific community problem. With this program, we hope to reach as many communities as possible, and so we’re encouraging teams from across the country and from every walk of life to apply.

The mentorship program is free and will be conducted virtually. It runs from October 1st to December 2nd, so it falls within the school semester, and you’d need to allocate about 10 hours per week during that time.

In return, you’ll get all the support you need to turn your idea into a proof-of-concept implementation. You’ll get valuable experience, skills and ecosystem contacts that will help you consider launching your career in AI. Think of it as a ‘fast-track’ accelerator for your idea, and your skills and capabilities.

This program is open to all undergraduate students at all Canadian universities. You don’t need to be enrolled in a Computer Science program – each team member should have some basic programming knowledge, but specific experience using AI or ML isn’t necessary. What you do need, however, is a strong sense of curiosity, a passion for solving problems using AI and a burning desire to accelerate your personal development.

Let’s SOLVE it is one of a handful of initiatives that Borealis AI and the Royal Bank support in order to encourage diversity, skills development and innovation at Canadian universities. Along with initiatives like our Borealis AI Fellowships program (aimed at post-grad students) and our Internships program (focused on Masters-level students), our goal is to help nurture the AI leaders of tomorrow.

If you are an undergraduate student with dreams of solving real problems in your community using AI, we want to help you get there. We look forward to seeing your team’s application!

]]>In this final part, we discuss challenges with transformer training dynamics and introduce some of the tricks that practitioners use to get transformers to converge. This discussion will be suitable for researchers who already understand the transformer architecture, and who are interested in training transformers and similar models from scratch.

Despite their broad applications, transformers are surprisingly difficult to train from scratch. One of the contributions of the original transformer paper was to use four tricks that collectively allow stable training:

**1. Residual connections:** Each transformer layer takes the $I\times D$ data matrix $\mathbf{X}$ where $I$ is the number of inputs and $D$ the dimensionality of those inputs and returns an object of the same size. It performs the following operations:

\begin{eqnarray}

\mathbf{X} &\leftarrow& \mathbf{X} + \bf{MhSa}[\mathbf{X}] \nonumber \\

\mathbf{X} &\leftarrow& \bf{Layernorm}[\mathbf{X}] \hspace{3cm}\nonumber\\

\mathbf{x}_{i} &\leftarrow& \mathbf{x}_{i}+\bf{mlp}[\mathbf{x}_{i}] \hspace{3.6cm}\forall\; i\in\{1\ldots I\}\nonumber\\

\mathbf{X} &\leftarrow& \bf{Layernorm}[\mathbf{X}], \tag{1}

\end{eqnarray}

which include two residual connections around the multi-head self-attention $\bf{MhSa}[\bullet]$ and multi-layer perceptron $\bf{mlp}[\bullet]$ components (figure 1).

**2. Layer normalization:** After each residual connection, a layer normalization procedure is applied:

\begin{equation}

\bf Layernorm[\mathbf{X}] = \gamma\cdot \frac{\mathbf{X}-\mu}{\sigma}+\beta, \tag{2}

\end{equation}

where $\mu$ and $\sigma$ are the mean and standard deviation of the elements of $\mathbf{X}$ (but are separate for each member of the batch), and $\gamma$ and $\beta$ are learned parameters.

**3. Learning rate warm-up:** The learning rate is increased linearly from $0$ to $R$ over the first $T_{R}$ time steps so that:

\begin{equation}

\mbox{lr}[t] = R \cdot \frac{t}{T_{R}}. \tag{3}

\end{equation}

**4. Adaptive optimizers:** Transformers need to be trained with adaptive optimizers like *Adam*, which recursively estimates the momentum and the learning rate separately for each parameter at each time-step. In practice, relatively large batch sizes of $>1,000$ are usually employed.

Removing any of these tricks makes training unstable and often leads to complete training failures. However, they have been employed without a full understanding of why they are required.

As transformers are applied more widely, it is increasingly important that we have a better understanding of transformer training. To this end, a number of recent papers have been devoted to demystifying this topic and exploring better training methods. In the rest of this blog post, we will connect these separate efforts to form a comprehensive overview of this topic.

In this section we will review these tricks and see that there are complex dependencies between them: some tricks cause problems, which are in turn resolved by others. In the subsequent section we will discuss improvements to the training process that follow from this understanding.

Learning rate warm-up (in which the learning rate is gradually increased during the early stages of training) is particularly puzzling. This is not required for most deep learning architectures. However, training fails for transformers if we just start with a typical learning rate. If we start with a very small learning rate, then the training is stable, but then it takes an impractically long time.
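For concreteness, the linear warm-up of equation 3 can be sketched in a few lines. This is a minimal illustration rather than a recommended schedule: holding the rate constant at $R$ after warm-up is an assumption made here for simplicity (in practice the rate is usually decayed afterwards), and the function name `warmup_lr` and its default values are hypothetical.

```python
def warmup_lr(t, peak_lr=1e-3, warmup_steps=4000):
    """Linear warm-up: lr[t] = R * t / T_R for t <= T_R, then constant at R.

    peak_lr plays the role of R and warmup_steps the role of T_R.
    """
    if t < 0:
        raise ValueError("step index must be non-negative")
    return peak_lr * min(t, warmup_steps) / warmup_steps


# Learning rate at a few points during training.
schedule = [warmup_lr(t) for t in (0, 1000, 2000, 4000, 8000)]
```

The only extra hyper-parameter is `warmup_steps`, which is exactly the quantity that training turns out to be sensitive to.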

Xiong *et* al., 2020 explored this phenomenon by conducting experiments on a machine translation task with different optimizers and learning rate schedules. Their results (figure 2) show that learning rate warm-up is essential for both Adam and SGD, and that the training process is sensitive to the warm-up steps.

Although learning rate warm-up works, it has some obvious disadvantages. It introduces an extra hyper-parameter (the number of warm-up steps) and it initializes the learning rate to zero which slows the training down. Hence, it's important that we understand why it is necessary.

To help answer this question, Huang *et* al., 2020 visualized the gradient of the loss $\mathcal{L}$ with respect to the input embeddings $\mathbf{X}$, and the size of the Adam updates during the first 100 steps of training (figure 3). They found that without warm-up, the gradients vanish very quickly, and the Adam updates also rapidly become much smaller. Diminishing gradients at lower layers in the transformer model without warm-up have also been observed by Liu *et* al., 2020.

To understand why learning rate warm-up is required, and why the gradients vanish without it, we will first need to understand the reasons for, and the consequences of using residual connections, layer normalization, and Adam.

Residual networks were developed in computer vision; they make networks easier to optimize and allow deeper networks to be trained. In computer vision, the additive residual connections are usually placed around convolutional layers and combined with batch normalization. In the transformer, they are placed around the self-attention and feed-forward networks and combined with layer normalization (figure 1). From this perspective, the transformer architecture could be considered a "deep residual self-attention network".

Zhang *et* al., 2019 show that the output variance of residual networks grows exponentially with depth. Hence, normalization is used to prevent *gradient explosion* for deep residual networks. Layer normalization is used in the transformer because the statistics of language data exhibit large fluctuations across the batch dimension, and this leads to instability in batch normalization.

Transformers also differ from convolutional networks in that stochastic gradient descent does not work well for training (figure 2) and adaptive optimizers like Adam are required. Liu *et* al., 2020 observed that differentiating through the self-attention mechanism creates *unbalanced gradients*. In particular, the gradients for the query $\boldsymbol\Phi_{q}$ and key $\boldsymbol\Phi_{k}$ parameters were much smaller than those for the value parameters $\boldsymbol\Phi_{v}$, and so the former parameters change much more slowly. This is a direct consequence of the mathematical expression for self-attention. The Adam optimizer fixes this problem by essentially having different learning rates for each parameter.
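To see why per-parameter scaling helps, here is a minimal sketch of a single Adam step for two scalar parameters whose gradients differ by three orders of magnitude (standing in, loosely, for value versus query/key parameters). The function `adam_step` and the hyper-parameter defaults are illustrative choices and are not taken from the papers above.

```python
import math


def adam_step(grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of scalar gradients.

    Returns (updates, new_m, new_v), where t is the 1-based step index.
    """
    updates, new_m, new_v = [], [], []
    for g, m_i, v_i in zip(grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment (momentum) estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias-corrected estimates
        v_hat = v_i / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return updates, new_m, new_v


grads = [1.0, 1e-3]                                 # one large, one small gradient
adam_updates, _, _ = adam_step(grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
sgd_updates = [-1e-3 * g for g in grads]            # plain SGD with the same rate
```

With SGD the second parameter moves a thousand times less than the first, whereas the two Adam updates have almost the same magnitude because each gradient is divided by its own running scale.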

To conclude, we've seen that residual connections are needed to allow us to train deep networks. These cause gradient explosion, which is resolved by using layer normalization. The self-attention computation causes unbalanced gradients, which necessitates the use of Adam (figure 4). In the next section, we'll see that layer normalization and Adam themselves cause more problems, which ultimately result in the need for learning rate warm-up.

Xiong *et* al., 2020 found that the magnitude of the gradients through layer normalization is inversely proportional to the magnitude of the input. Specifically, the gradient has the following property:

\begin{equation}

\left\lVert \frac{\partial \bf Layernorm[\mathbf{X}]}{\partial \mathbf{X}} \right\rVert=\mathcal{O}\left(\frac{\sqrt{D}}{\lVert\mathbf{X}\rVert}\right), \tag{4}

\end{equation}

where $\mathbf{X}$ is the input to layer normalization and $D$ is the embedding dimension. If the input norm $\lVert \mathbf{X} \rVert$ is larger than $\sqrt{D}$ then back-propagating through layer normalization reduces the gradient magnitude in lower layers. As this effect compounds through multiple layers, it causes the gradient to vanish at lower layers for deep models. We will term this the *gradient shrinkage effect*.
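The inverse dependence on the input norm is easy to check numerically. The sketch below makes several simplifying assumptions: it normalizes a single vector rather than a full data matrix, it omits the learned parameters (so $\gamma=1$ and $\beta=0$), it measures the Jacobian with the Frobenius norm, and it estimates the Jacobian by central finite differences. The helper names are hypothetical.

```python
import math


def layernorm(x):
    """Layer normalization of one vector with gamma = 1 and beta = 0."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d)
    return [(v - mu) / sigma for v in x]


def jacobian_norm(x, h=1e-6):
    """Frobenius norm of d layernorm(x) / d x, estimated by central differences."""
    total = 0.0
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += h
        x_minus[j] -= h
        y_plus, y_minus = layernorm(x_plus), layernorm(x_minus)
        total += sum(((a - b) / (2 * h)) ** 2 for a, b in zip(y_plus, y_minus))
    return math.sqrt(total)


x = [0.3, -1.2, 0.8, 2.1, -0.5, 1.4, -0.9, 0.1]
norm_small = jacobian_norm(x)
norm_large = jacobian_norm([10.0 * v for v in x])   # same direction, 10x the norm
ratio = norm_small / norm_large                      # close to 10
```

Making the input ten times larger makes the gradient through the normalization roughly ten times smaller, which is the shrinkage that compounds from layer to layer.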

Layer normalization also causes *unbalanced dependencies* between the two branches of the residual connection around the self-attention module. In other words, the output of $\bf LayerNorm[\mathbf{X}+\bf Sa[\mathbf{X}]]$ depends much more on the self-attention computation $\bf Sa[\mathbf{X}]$ than the skip connection $\mathbf{X}$. This means that the outputs depend much more on later layers than earlier layers. Liu *et* al., 2019 show that this happens empirically in practice.

Moreover, they show that this leads to *amplified output perturbations*; small changes to the network parameters cause large output fluctuations. More precisely, they proved that for a transformer network $\bf T_{N}[\mathbf{X},\boldsymbol\theta]$ with parameters $\boldsymbol\theta$, the output variance scales with the number of layers $N$ when we randomly perturb the parameters to $\boldsymbol\theta^{*} = \boldsymbol\theta+\boldsymbol\delta$:

\begin{equation}

\mbox{Var}\left[\bf T_{N}[\mathbf{X};\boldsymbol\theta] - \bf T_{N}[\mathbf{X};\boldsymbol\theta^{*}]\right] = \mathcal{O}(N). \tag{5}

\end{equation}

They also show that this happens empirically with both random parameter changes and Adam updates (figure 5). The result is that the output changes more and more when we update the parameters, which destabilizes transformer training.

Furthermore, using adaptive optimizers like Adam aggravates both the gradient shrinkage effect and the amplified output perturbations. Liu *et* al., 2019 show that the variance of the Adam updates is unbounded at the start of training, and these updates are also known to exhibit high variance in the early stages of training.

This can lead to problematically large updates early on, which can make the input norm $\lVert \mathbf{X} \rVert$ to each layer increase as we move through the network and thus increase the gradient shrinkage predicted by equation 4. Moreover, the output fluctuation, which is already amplified by the network structure, will be even greater for these large parameter updates.

To summarize, residual connections are required in the transformer architecture for the ease of optimization, which further requires layer normalization to avoid gradient explosion and adaptive optimizers like Adam to address unbalanced gradients in the self-attention blocks. On the flip side, the use of layer normalization causes the gradients to shrink in the early layers and also amplifies the output perturbations. Moreover, the instability of Adam in the early stages of training exacerbates both of these effects (figure 6).

This is where learning rate warm-up comes in: it effectively stabilizes the Adam updates during the early stages of training by making the parameter changes much smaller. Consequently, Adam no longer aggravates gradient shrinkage and amplification of output perturbations and training becomes relatively stable.

In the previous section, we argued that the transformer architecture, and the statistics of language data require us to use layer normalization and train with adaptive optimizers like Adam. These choices in turn cause other problems that are resolved by using learning rate warm-up. In this section, we consider alternative methods for training deep transformers that don't require learning rate warm-up.

We'll consider three approaches that respectively remove the normalization from the network, attempt to re-balance the dependency on the two paths of the residual networks, and reduce the variance of the optimizer updates.

Since both the problems of gradient shrinkage and unbalanced dependencies are directly connected to layer normalization, it is natural to question whether we can train deep transformer models without this step. It has indeed been demonstrated that this is possible, and that we can achieve even better generalization without layer normalization.

Recall that the normalization mechanism is introduced to prevent gradients exploding in deep residual networks. It follows that if we can stabilize the gradient updates $\Delta \boldsymbol\theta$, then we can remove layer normalization. Zhang *et* al., 2019 demonstrated that the gradient updates $\Delta \pmb \theta$ can be bounded when using the SGD optimizer to train residual MLP or convolution blocks by appropriately initializing the weights. Based on this work, Huang *et* al., 2020 derived an analogous initialization scheme for residual self-attention blocks.

Although the theoretical derivations are for SGD updates, these results hold well for adaptive optimizers like Adam in practice. Furthermore, it follows from the Taylor expansion:

\begin{equation}\label{eq:taylor}

\Delta\bf T[\mathbf{X},\boldsymbol\theta] \approx \frac{\partial \bf T[\mathbf{X},\boldsymbol\theta]}{\partial \pmb\theta} \Delta \pmb\theta, \tag{6}

\end{equation}

that the output fluctuation $\Delta\bf T[\mathbf{X},\boldsymbol\theta]$ is also bounded by bounding the gradient updates $\Delta\boldsymbol\theta$. As a result, both the gradient vanishing and the amplified output perturbations are resolved with stable gradient updates.

The proposed initialization scheme is known as *T-Fixup* and is easy to implement. Consider a multi-head self-attention block where the $h^{th}$ head computes

\begin{equation}

{\bf Sa}_{h}[\mathbf{X}] =\bf Softmax\left[\frac{(\mathbf{X}\boldsymbol\Phi_{qh})(\mathbf{X}\boldsymbol\Phi_{kh})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{vh}, \tag{7}

\end{equation}

where $\mathbf{X}$ is the input data matrix containing word embeddings in its rows and $\boldsymbol\Phi_{qh}$, $\boldsymbol\Phi_{kh}$ and $\boldsymbol\Phi_{vh}$ are the weight parameters for the queries, keys, and values respectively. The outputs of these self-attention mechanisms are concatenated and another linear transform $\boldsymbol\Phi_{c}$ is applied to combine them:

\begin{equation}

{\bf MhSa}[\mathbf{X}] = \left[{\bf Sa}_{1}[\mathbf{X}]\;{\bf Sa}_{2}[\mathbf{X}]\;\ldots\;{\bf Sa}_{H}[\mathbf{X}] \right]\boldsymbol\Phi_{c}. \tag{8}

\end{equation}

The T-Fixup scheme for encoder-decoder attention is then as follows:

- Apply Xavier initialization for all parameters excluding input embeddings. Use Gaussian initialization $\mbox{Norm}_{\theta}[0, D^{-\frac12}]$ for input embeddings where $D$ is the embedding dimension.
- Scale $\boldsymbol\Phi_{vh}$ and $\boldsymbol\Phi_{c}$ in each encoder attention block and weight matrices in each encoder MLP block by $0.67N_{e}^{-\frac14}$ where $N_{e}$ is the number of transformer blocks (i.e., self-attention + MLP) in the encoder. Scale the input embeddings to the encoder by $(9N_{e})^{-\frac14}$.
- Scale parameters $\boldsymbol\Phi_{vh}$ and $\boldsymbol\Phi_{c}$ in each decoder attention block, weight matrices in each decoder MLP block and input embeddings in the decoder by $(9N_d)^{-\frac14}$ where $N_d$ is the number of transformer blocks in the decoder.
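The scale factors in the list above are simple functions of the network depth and can be computed as in the following sketch. The function and key names are hypothetical, and treating $D^{-\frac12}$ as the standard deviation (rather than the variance) of the Gaussian is an assumption; please check the original paper before relying on it.

```python
def tfixup_scales(num_encoder_blocks, num_decoder_blocks):
    """Multiplicative factors applied after the initial Xavier/Gaussian draw."""
    return {
        # value and output-projection weights, plus MLP weights, in the encoder
        "encoder_weights": 0.67 * num_encoder_blocks ** -0.25,
        # input embeddings fed to the encoder
        "encoder_embeddings": (9 * num_encoder_blocks) ** -0.25,
        # the same weight groups and the input embeddings in the decoder
        "decoder_weights_and_embeddings": (9 * num_decoder_blocks) ** -0.25,
    }


def embedding_std(embedding_dim):
    """D^(-1/2), used here as the standard deviation of the embedding Gaussian."""
    return embedding_dim ** -0.5


shallow = tfixup_scales(6, 6)
deep = tfixup_scales(60, 12)
```

Every factor shrinks as the number of blocks grows, so deeper models start with smaller weights in each residual branch.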

In practice, *T-Fixup* is able to train significantly deeper transformer models with improved performance on the task of machine translation. For the detailed derivation of this method, we refer the readers to the original paper.

An alternative approach is to balance the residual dependencies, which in turn will limit the output perturbations $\Delta \bf T[\mathbf{X}]$. Equation 6 shows that controlling the magnitude of the output fluctuation $\Delta \bf T[\mathbf{X}]$ also bounds the magnitude of the gradient updates $\Delta \boldsymbol\theta$, which in turn mitigates the problem of gradient vanishing. Here we'll consider three possible approaches.

**Pre-LN transformers:** One simple solution is to change the location of layer normalization inside the transformer layer so that it occurs inside the residual blocks and before the self-attention or MLP (figure 7). This is known as the *pre-LN transformer*. This simple change can help control the gradient magnitude and balance the residual dependencies.

Pre-LN transformer models can be trained without learning rate warm-up. However, they also lead to inferior empirical performance. It has been speculated that this is because the models are now restricted not to depend too much on the contents of their residual layers (Liu *et* al., 2020).
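The difference between the two orderings can be sketched as follows for a single embedding vector. The `toy_sublayer` function is a hypothetical stand-in for self-attention or the MLP, and the layer normalization omits the learned parameters for brevity.

```python
import math


def layernorm(x):
    """Layer normalization of one vector with gamma = 1 and beta = 0."""
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / d)
    return [(v - mu) / sigma for v in x]


def post_ln_block(x, sublayer):
    """Original ordering: add the residual, then normalize the sum."""
    return layernorm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre-LN ordering: normalize only the input to the sublayer."""
    return [a + b for a, b in zip(x, sublayer(layernorm(x)))]


def toy_sublayer(x):
    """Hypothetical stand-in for self-attention or the MLP."""
    return [0.5 * v for v in reversed(x)]


x = [4.0, -2.0, 1.0, 7.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
```

In the pre-LN block the skip path is never normalized, so the input passes through unchanged whenever the sublayer outputs zero; in the post-LN block both branches are rescaled together.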

**Admin:** To bridge this performance gap, *adaptive model initialization* or *Admin* aims to bound the output fluctuation $\Delta \bf T[\mathbf{X}]$ by controlling the residual dependencies while retaining the original architecture.

Admin adds a new $1\times D$ parameter vector $\boldsymbol\psi$ to each residual block. The self-attention block is then constructed as $\bf LayerNorm[\mathbf{X} \odot \boldsymbol\Psi + \bf MhSa[\mathbf{X}]]$ where $\odot$ is the element-wise product and $\boldsymbol\Psi$ is an $I\times D$ matrix where each row is a copy of $\boldsymbol\psi$. The residual connection around the parallel MLP layer is treated in the same way (figure 8a).

The new parameters at the $n^{th}$ layer are initialized to be the output standard deviation at that layer before this intervention. This can be estimated by setting all elements of $\boldsymbol\psi$ to one and forward propagating on a few training instances.

**ReZero:** In a similar vein, *ReZero* removes the layer normalization and introduces a single trainable parameter $\alpha$ per residual layer so that the self-attention block residual layer becomes $\mathbf{X} + \alpha\bf MhSa[\mathbf{X}]$, where $\alpha$ is initialized to zero (figure 8b). The result of this is that the entire network is initialized just to compute the identity function, and the contributions of the self-attention and MLP layers are gradually and adaptively introduced.
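A minimal sketch of a ReZero residual layer, again for a single vector and with a hypothetical `toy_sublayer` standing in for self-attention or the MLP:

```python
def rezero_block(x, sublayer, alpha=0.0):
    """ReZero residual layer: x + alpha * sublayer(x), with no normalization."""
    return [a + alpha * b for a, b in zip(x, sublayer(x))]


def toy_sublayer(x):
    """Hypothetical stand-in for self-attention or the MLP."""
    return [v * v for v in x]


x = [1.0, -2.0, 3.0]
at_init = rezero_block(x, toy_sublayer)             # alpha = 0: identity mapping
later = rezero_block(x, toy_sublayer, alpha=0.1)    # sublayer gradually switched on
```

At initialization the block returns its input exactly, however many such blocks are stacked.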

Empirically, both Admin and ReZero work well for training deeper transformer models with better generalization performance, which demonstrates the effectiveness of balancing the residual dependencies.

We noted before that the high variance of learning rates in the Adam optimizer at the early stages of training exacerbates the problems of amplified output perturbations and gradient vanishing. Liu *et* al., 2019 argue that this is due to the lack of samples in the early stages of learning. They base their argument on an experiment in which they do not change the model parameters or momentum term of Adam for the first 2000 learning steps, but only adapt the learning rate. After this, warm-up is no longer required.

Based on these observations, they propose *Rectified Adam* or *RAdam* which gradually changes the momentum term over time in a way that helps avoid high variance. One way to think of this is that we have effectively incorporated learning rate warm-up into the Adam algorithm, but in a principled way.
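To make the connection with warm-up concrete, the sketch below computes the rectification factor that RAdam applies to the adaptive step size. The expressions for $\rho_{\infty}$, $\rho_{t}$ and the factor itself follow the RAdam paper as we understand it and are not derived in this post; the threshold of 4 and the function name are assumptions of this sketch, and library implementations may differ in such details.

```python
import math


def radam_rectification(t, beta2=0.999):
    """Rectification factor at step t (1-based), or None while the variance of
    the adaptive learning rate is treated as intractable."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return None  # early steps: use an un-adapted, momentum-only update
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )


factors = {t: radam_rectification(t) for t in (1, 10, 100, 1000, 100000)}
```

The factor is undefined for the first few steps, then grows monotonically from a small value towards one, which is why its effect resembles a built-in warm-up schedule.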

In the previous sections, we have seen that great progress has been made towards understanding transformer training. Several solutions have been proposed that allow the training of significantly deeper transformer models with improved empirical performance.

However, they have only been applied to tasks with sufficient training data such as machine translation and language modelling. This is possibly due to the commonly-held belief that training deep transformers from scratch requires large datasets. For small datasets, it is typical just to add shallow and simple additional layers (*e.g.*, a classifier head) to pre-trained models and then fine-tune.

So, what prevents practitioners from training deep transformers on small datasets? It turns out that the final missing piece of the puzzle is the batch size. For small datasets, it's necessary to leverage large pre-trained models and then fine-tune. However, the size of these models limits the batch size and when the batch size is small, the variance of the updates is even larger, which makes training even harder. Even if we could use a larger batch size, it usually results in poorer generalization, especially on small datasets.

In short, small datasets require pre-trained models and small batch sizes to perform well, but these two requirements make training additional transformer layers challenging. To resolve the high variance of training updates in small batch sizes, the three ideas from the previous section can all be applied. However, these approaches all assume that the inputs to the transformers are randomly initialized embeddings, but this is not true if we are adding yet-to-be-trained transformers on top of pre-trained models (figure 9).

DT-Fixup is a data-dependent initialization strategy developed by Borealis AI. It adapts the T-Fixup method for this type of mixed setting. DT-Fixup allows significantly deeper transformers to be trained with small datasets for challenging tasks such as Text-to-SQL semantic parsing and logical reading comprehension. This demonstrates that training deep transformers with small datasets is feasible with the correct optimization procedure.

In the first two parts of this blog, we introduced the transformer, and discussed extensions and relations to other models. In this final part, we have discussed the complex topic of how to train deep transformer models effectively. We hope that this discussion will help practitioners to train deeper transformers for different applications and help identify potential directions for further improving transformer training.

]]>In this final post, we consider *probabilistic context-free grammars* or *PCFGs*, which are a special case of WCFGs. They are featured more than WCFGs in the earlier statistical NLP literature and in most teaching materials. As the name suggests, they replace the rule weights with probabilities. We will treat these probabilities as model parameters and describe algorithms to learn them for both the supervised and the unsupervised cases. The latter is tackled by expectation-maximization and leads us to develop the *inside-outside algorithm* which computes the expected rule counts that are required for the EM updates.

PCFGs are the same as WCFGs except that the weights are constrained; the weights of all rules with the same non-terminal on the left-hand side must be non-negative and sum to one:

\begin{equation}

\sum_{\alpha} \mbox{g}[\text{A} \rightarrow \alpha] = 1. \tag{1}

\end{equation}

For example, we might have three rules with VP on the left-hand side: $\text{VP} \rightarrow \text{NP}$, $\text{VP} \rightarrow \text{NN}$ and $\text{VP} \rightarrow \text{PN}$. For a PCFG, this implies that:

\begin{equation}

\mbox{g}[\text{VP} \rightarrow \text{NP}]+\mbox{g}[\text{VP} \rightarrow \text{NN}]+\mbox{g}[\text{VP} \rightarrow \text{PN}]= 1. \tag{2}

\end{equation}

The rule weights are now probabilities and the weight $\mbox{G}[T]$ of an entire tree is the product of these probabilities. The tree weight $\mbox{G}[T]$ can hence be interpreted as the probability $Pr(T)$ of the tree:

\begin{eqnarray}\label{eq:tree_like}

Pr(T) &=& \mbox{G}[T]\nonumber \\

&=& \prod_{t\in T} \mbox{g}[T_{t}]\nonumber \\

&=& \prod_{A}\prod_{\alpha} \mbox{g}[\text{A} \rightarrow \alpha]^{\mbox{f}_{\text{A}\rightarrow\alpha}[T]} \tag{3}

\end{eqnarray}

where the function $f_{\text{A} \rightarrow \alpha}[T]$ counts the number of times $\text{A} \rightarrow \alpha$ appears in tree $T$.
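As a small illustration of equation 3, the sketch below computes the probability of a tree under a toy PCFG. The grammar, its probabilities, and the nested-tuple tree representation are all hypothetical and chosen only to keep the example short.

```python
from math import prod

# Hypothetical toy PCFG: rule probabilities g[A -> alpha], keyed by (lhs, rhs).
# The probabilities of all rules with the same left-hand side sum to one.
RULE_PROBS = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("sleeps",)): 0.3,
    ("V", ("eats",)): 1.0,
}


def rules_in_tree(tree):
    """Yield every rule used in a tree written as (label, child, child, ...),
    where each child is either a sub-tree (tuple) or a word (str)."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules_in_tree(child)


def tree_probability(tree, rule_probs=RULE_PROBS):
    """Pr(T): the product of the probabilities of all rules used in T."""
    return prod(rule_probs[rule] for rule in rules_in_tree(tree))


tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
p = tree_probability(tree)   # 1.0 * 0.6 * 0.7 * 1.0 * 0.4
```

A rule that appears several times in the tree contributes its probability once per appearance, which is what the exponent $f_{\text{A} \rightarrow \alpha}[T]$ expresses.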

PCFGs define a joint distribution $Pr(T,\mathbf{w})$ of trees $T$ and sentences $\mathbf{w}$. The probability of a sentence $Pr(\mathbf{w})$ can be computed through marginalization:

\begin{equation}\label{eq:parse_marginal}

Pr(\mathbf{w}) = \sum_{T\in \mathcal{T}[\mathbf{w}]} Pr(\mathbf{w}, T), \tag{4}

\end{equation}

where $\mathcal{T}[\mathbf{w}]$ is the set of all parse trees that are compatible with the observed sentence $\mathbf{w}$.

The conditional probability of the sentence $Pr(\mathbf{w}|T)$ given the tree is just $1$, because the tree *yields* $\mathbf{w}$ (i.e., the tree deterministically produces the words). It follows that the joint distribution is:

\begin{equation}\label{eq:parsing_joint}

Pr(T,\mathbf{w}) = Pr(\mathbf{w}|T) Pr(T) = Pr(T). \tag{5}

\end{equation}

However, the conditional probability $Pr(T|\mathbf{w})\neq 1$ in general. When a sentence is ambiguous, there are multiple trees that can produce the same words. For PCFGs, the weighted parsing algorithm returns the tree with the maximum conditional probability, and the inside algorithm returns the marginal probability of the sentence (equation 4).

PCFGs are a *generative* approach to syntactic analysis in that they represent joint distributions over sentences and parse trees. They can also be used to sample random sentences: we start by drawing a sample from all of the rules $\text{S} \rightarrow \alpha$ with the start token $\text{S}$ on the left-hand side according to the probabilities $\mbox{g}[\text{S}\rightarrow \alpha]$. For example, we might draw $\text{S}\rightarrow\text{NP VP}$. Then we draw samples from rules with $\text{NP}$ and $\text{VP}$ on the left-hand side respectively. The process continues until it draws terminal symbols (i.e., words) in every branch of the tree and no non-terminals remain at the leaves.
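The sampling procedure can be sketched as a short recursive function. The toy grammar below is hypothetical and deliberately non-recursive so that sampling is guaranteed to terminate; a grammar with recursive rules would terminate only with some probability.

```python
import random

# Hypothetical toy PCFG: for each non-terminal, a list of (rhs, probability).
GRAMMAR = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("she",), 0.6), (("fish",), 0.4)],
    "VP": [(("V", "NP"), 0.7), (("sleeps",), 0.3)],
    "V": [(("eats",), 1.0)],
}


def sample_words(symbol="S", rng=random):
    """Expand `symbol` by sampling rules until only words remain."""
    if symbol not in GRAMMAR:          # terminal symbol, i.e., a word
        return [symbol]
    expansions, probs = zip(*GRAMMAR[symbol])
    rhs = rng.choices(expansions, weights=probs, k=1)[0]
    words = []
    for child in rhs:
        words.extend(sample_words(child, rng))
    return words


rng = random.Random(0)                 # seeded for reproducibility
sentences = [" ".join(sample_words("S", rng)) for _ in range(5)]
```

Each call draws one rule for the current non-terminal according to its probabilities and then recurses on the symbols on the right-hand side, mirroring the description above.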

We now turn our attention to learning the rule probabilities for a PCFG. In this section we'll consider the supervised case where we have a *treebank* (i.e., a set of sentences annotated with trees). We'll show that we can estimate the probabilities using a simple counting procedure. In subsequent sections, we'll consider the unsupervised case, where we must estimate the weights based on the sentences alone.

To learn the parameters of a PCFG from a treebank, we optimize the total log likelihood $L$ of the $I$ observed trees:

\begin{equation}

L = \sum_{i=1}^{I}\sum_{A} \sum_{\alpha} f_{\text{A}\rightarrow\alpha}[T_{i}]\log[\mbox{g}[\text{A} \rightarrow \alpha]] \tag{6}

\end{equation}

with respect to the rule probabilities $\mbox{g}[\text{A} \rightarrow \alpha]$. The first sum is over the training examples, the second over the left-hand side of the rules and the third over the right-hand side. The function $f_{\text{A} \rightarrow \alpha}[T_{i}]$ counts the number of times rule $\text{A} \rightarrow \alpha$ appears in tree $T_{i}$.

To ensure that the result of this optimization process is a valid PCFG, we must also add a set of constraints that ensure that all rules with the same left-hand side sum to one:

\begin{equation}

\sum_{\alpha} \mbox{g}[\text{A} \rightarrow \alpha] = 1\hspace{1cm}\forall \mbox{A}\in\mathcal{V}.\tag{7}

\end{equation}

Taking just the terms with A on the left hand side, and adding a Lagrange multiplier that enforces this constraint we have:

\begin{equation}

L_{A}^{\prime} = \sum_{i=1}^{I}\sum_{\alpha} f_{\text{A}\rightarrow\alpha}[T_{i}]\log[\mbox{g}[\text{A} \rightarrow \alpha]]+\lambda\left(\sum_{\alpha}\mbox{g}[\text{A} \rightarrow \alpha] - 1\right). \tag{8}

\end{equation}

We then take derivatives with respect to each rule $g[\text{A} \rightarrow \alpha]$ and the Lagrange multiplier $\lambda$ and set the resulting expressions to zero. This yields a set of equations which can be re-arranged to provide the maximum likelihood estimator for a given rule $\mbox{g}[\text{A} \rightarrow \text{B} \text{C}]$:

\begin{equation}\label{eq:treebank_ml}

\mbox{g}[\text{A} \rightarrow \text{B} \text{C}] = \frac{\sum_i^I f_{\text{A} \rightarrow \text{B} \text{C}}[T_i]}{\sum_i^I \sum_{\alpha} f_{\text{A} \rightarrow \alpha}[T_i]}, \tag{9}

\end{equation}

where the numerator counts the number of times A is re-written to BC and the denominator counts the number of times it is re-written to anything. See Chi and Geman (1998) for further details.
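As a sketch of the intermediate steps (using the same symbols as above, and with the constraint summed over the right-hand sides $\alpha$), setting the derivatives of the Lagrangian to zero gives:

\begin{align}
\frac{\partial L_{A}^{\prime}}{\partial \mbox{g}[\text{A} \rightarrow \alpha]} &= \frac{\sum_{i=1}^{I} f_{\text{A}\rightarrow\alpha}[T_{i}]}{\mbox{g}[\text{A} \rightarrow \alpha]} + \lambda = 0 \quad\Rightarrow\quad \mbox{g}[\text{A} \rightarrow \alpha] = -\frac{1}{\lambda}\sum_{i=1}^{I} f_{\text{A}\rightarrow\alpha}[T_{i}] \nonumber \\
\frac{\partial L_{A}^{\prime}}{\partial \lambda} &= \sum_{\alpha}\mbox{g}[\text{A} \rightarrow \alpha] - 1 = 0 \quad\Rightarrow\quad \lambda = -\sum_{i=1}^{I}\sum_{\alpha} f_{\text{A}\rightarrow\alpha}[T_{i}] \nonumber
\end{align}

The value of $\lambda$ follows from substituting the first result into the constraint. Substituting it back into the first line gives the ratio of counts in the maximum likelihood estimator.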

We'll now provide some code snippets that use a treebank to estimate the parameters of a PCFG using the above method. In treebanks, the constituency trees are usually represented in a bracket notation. For example, the legendary first sentence from the Penn Treebank^{1} is "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29." with associated tree:

```
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
```

Throughout this section, we'll provide some Python code snippets that use the `NLTK` (natural language toolkit) package. We'll show how to estimate the rule weights from annotated data using NLTK, and then we'll take a look inside the code to see how it is implemented.

NLTK has the convenient `Tree` class to make exploration easier:

```python
from nltk.tree import Tree

t = Tree.fromstring("(S (NP I) (VP (V saw) (NP him)))")
```

To extract a grammar that is useful for parsing, we need to convert the tree to Chomsky Normal Form:

`t.chomsky_normal_form()`

We then extract grammar rules from the entire treebank:

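```python
from nltk.tree import Tree

productions = []

# Go over all tree-strings.
# `treebank` is assumed to be an iterable of bracketed tree strings, one per tree.
for line in treebank:
    tree = Tree.fromstring(line)
    # Remove unary chains (but keep the part-of-speech level intact)
    tree.collapse_unary(collapsePOS=False)
    # Binarize the tree so that the extracted rules are in Chomsky Normal Form
    tree.chomsky_normal_form(horzMarkov=2)
    productions += tree.productions()
```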

Finally, we learn the weights:

```python
from nltk import Nonterminal
from nltk import induce_pcfg

S = Nonterminal('S')
grammar = induce_pcfg(S, productions)
```

Now let's peek into `NLTK` version `3.6` to see how these estimates are computed:

```python
# Production count: the number of times a given production occurs
pcount = {}

# LHS-count: counts the number of times a given lhs occurs
lcount = {}

for prod in productions:
    lcount[prod.lhs()] = lcount.get(prod.lhs(), 0) + 1
    pcount[prod] = pcount.get(prod, 0) + 1

prods = [
    ProbabilisticProduction(p.lhs(),
                            p.rhs(),
                            prob=pcount[p] / lcount[p.lhs()])
    for p in pcount
]
```

As expected, this just directly implements equation 9. Given the parameters of the model, we can parse a sentence under the induced grammar using the CYK weighted parsing algorithm:

```python
from nltk.parse import ViterbiParser

sent = ['How', 'much', 'is', 'the', 'fish', '?']
parser = ViterbiParser(grammar)
parse = parser.parse(sent)
```

In the previous section, we showed that it is relatively easy to estimate the parameters (rule weights) of PCFGs when we are given a set of sentences with the associated trees. Now we turn our attention to the unsupervised case. Here, we are given the sentences, but not the trees. This is a chicken-and-egg problem. If we knew the rule-weights, then we could perform weighted parsing to find the best fitting trees. Likewise, if we knew the trees, we could estimate the rule weights using the supervised approach above.

In the next section, we'll introduce some notation and define the cost function for parameter estimation in the unsupervised case. Then we'll describe two algorithms to optimize this cost function. First we'll introduce Viterbi training that alternates between finding the best trees for fixed rule weights and updating the rule weights based on these trees. Then, we'll introduce an expectation maximization (EM) algorithm that follows a similar approach, but takes into account the ambiguity of the parsing procedure.

Our goal is to calculate the rule weights $\mbox{g}[\text{A}\rightarrow\text{BC}]$. To make the notation cleaner, we'll define the vector $\boldsymbol\theta$ to contain the log rule weights, so that the element $\theta_{A \rightarrow BC}$ contains $\log \left[\mbox{g}[A \rightarrow BC]\right]$. We aim to maximize the joint probability of the observed sentences and the associated trees with respect to these parameters:

\begin{equation}

\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}} \left[\sum_i^I \log\left[Pr(T_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right] \tag{10}

\end{equation}

where $i$ indexes $I$ training examples, each consisting of sentence $\mathbf{w}_{i}$ and associated parse tree $T_{i}$.

Unfortunately, in the unsupervised learning case, we don't know the associated parse trees. There are two possible solutions to this problem. In Viterbi training, we will simultaneously maximize the likelihood with respect to the parameters $\boldsymbol\theta$ and with respect to the choice of tree $T_{i}$ from the set of valid trees $\mathcal{T}[\mathbf{w}_{i}]$ for each training example $i$:

\begin{equation}\label{eq:parse_viterbi}

\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[ \sum_i^I

\log\left[\underset{T_{i}\in\mathcal{T}_{i}[\mathbf{w}_{i}]}{\text{max}}\left[Pr(T_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]\right]. \tag{11}

\end{equation}

In the expectation-maximization approach we will marginalize over all the possible trees:

\begin{equation}

\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}} \left[\sum_i^I \log\left[\sum_{ T_{i}\in\mathcal{T}_{i}[\mathbf{w}_{i}]}Pr(T_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]. \tag{12}

\end{equation}

To summarize, Viterbi training finds the parameters that maximize the joint likelihood of the sentences and their highest probability parses, while the expectation maximization approach searches for the parameters that maximize the marginal likelihood of the sentences.

Viterbi training optimizes the cost function in equation 11 by alternately maximizing over the parameters and fitted trees. More precisely, we first initialize the parameters $\boldsymbol\theta$ to random values. Then we alternate between two steps:

- We run the weighted parsing algorithm for each sentence, $\mathbf{w}_{i}$, to find the tree $T_{i}$ with the highest weight, treating the parameters $\boldsymbol\theta$ as fixed.
- We compute the maximum likelihood estimator for the parameters $\boldsymbol\theta$, treating the parse trees $T_{i}$ as fixed. Since $Pr(T_{i},\mathbf{w}_{i}) = Pr(T_{i})$ (equation 5), this is effectively the same as the supervised algorithm, and the solution is given by:

\begin{equation}

\theta_{\text{A} \rightarrow \text{B} \text{C}} = \log\left[ \frac{\sum_i^I f_{\text{A} \rightarrow \text{B} \text{C}}[T_i]}{\sum_i^I \sum_\alpha f_{\text{A} \rightarrow \alpha}[T_i]}\right], \tag{13}

\end{equation}

where the function $f_{\text{A} \rightarrow \text{B} \text{C}}[T_i]$ counts the number of times $\text{A} \rightarrow \text{B} \text{C}$ appears in tree $T_i$ and $\sum_\alpha f_{\text{A} \rightarrow \alpha}[T_i]$ counts the number of times that $\text{A}$ is re-written to anything in tree $T_{i}$.
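The alternation between the two steps can be sketched as follows. To keep the example short, it assumes that the candidate trees of each sentence have been enumerated in advance and summarized by their rule counts, so picking the highest-scoring candidate stands in for running the weighted parsing algorithm. The rule inventory, the counts, and the initial weights are hypothetical, and lexical rules are omitted for brevity.

```python
import math
from collections import Counter

# Hypothetical rule inventory, written as (left-hand side, right-hand side) pairs.
S_RULE, VP_V, VP_PP = ("S", "NP VP"), ("VP", "V NP"), ("VP", "VP PP")
NP_DT, NP_PP, PP_RULE = ("NP", "DT NN"), ("NP", "NP PP"), ("PP", "P NP")

# Candidate trees summarized by their rule counts f_r[T].
verb_attach = Counter({S_RULE: 1, VP_V: 1, VP_PP: 1, NP_DT: 3, PP_RULE: 1})
noun_attach = Counter({S_RULE: 1, VP_V: 1, NP_PP: 1, NP_DT: 3, PP_RULE: 1})
# Sentence 1 is ambiguous (two candidates); sentence 2 has a single parse.
candidates = [[verb_attach, noun_attach], [noun_attach]]


def score(counts, theta):
    """Log weight of a tree: the sum over rules of f_r[T] * theta_r."""
    return sum(c * theta[r] for r, c in counts.items())


def viterbi_training(candidates, theta, n_iters=5):
    best = []
    for _ in range(n_iters):
        # Step 1: best tree for each sentence, treating theta as fixed
        best = [max(trees, key=lambda f: score(f, theta)) for trees in candidates]
        # Step 2: maximum likelihood re-estimation from the chosen trees
        rule_counts = Counter()
        for f in best:
            rule_counts.update(f)
        lhs_counts = Counter()
        for (lhs, _), c in rule_counts.items():
            lhs_counts[lhs] += c
        theta = {
            r: math.log(rule_counts[r] / lhs_counts[r[0]]) if rule_counts[r] else -math.inf
            for r in theta
        }
    return theta, best


# Arbitrary initial log weights (fixed rather than random, so the run is reproducible)
theta0 = {S_RULE: 0.0, PP_RULE: 0.0,
          VP_V: math.log(0.4), VP_PP: math.log(0.6),
          NP_DT: math.log(0.5), NP_PP: math.log(0.5)}
theta_hat, best_trees = viterbi_training(candidates, theta0)
```

Rules that never appear in the chosen trees receive a log weight of minus infinity in this sketch, which illustrates how hard assignments can lock the procedure into its early choices.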

Now let's turn our attention to the expectation maximization (EM) approach which is somewhat more complicated. Recall that we want to maximize the cost function:

\begin{equation}

\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}} \left[\sum_i^I \log\left[\sum_{ T_{i}\in\mathcal{T}_{i}[\mathbf{w}_{i}]}Pr(T_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]. \tag{14}

\end{equation}

We'll present a brief overview of the EM method in an abstract form. Then we'll map this to our use case. We'll find expressions for the E-Step and the M-Step respectively. The expression for the M-Step will contain terms that are expected rule counts. We'll show that we can find these by taking the derivative of the log partition function. This will finally lead us to the inside-outside algorithm which computes these counts.

The EM algorithm is a general tool for maximizing cost functions of the form:

\begin{equation}

\boldsymbol\theta^{*} = \underset{\boldsymbol\theta}{\text{arg}\,\text{max}} \left[\sum_i^I \log\left[\sum_{h_{i}}Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right], \tag{15}

\end{equation}

where $h_{i}$ is a hidden (i.e., unobserved) variable associated with observations $\mathbf{w}_{i}$.

The EM algorithm alternates between the *expectation step* (*E-step*) and the *maximization step* (*M-step*). In the E-step, we consider the parameters $\boldsymbol\theta$ to be fixed and compute the posterior distribution $Pr(h_{i}|\mathbf{w}_{i},\boldsymbol\theta)$ over the hidden variables $h_{i}$ for each of the $I$ examples using Bayes' rule:

\begin{equation}

Pr(h_{i}|\mathbf{w}_{i},\boldsymbol\theta) = \frac{Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)}{\sum_{h_{i}}Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)}. \tag{16}

\end{equation}

In the M-step we update the parameters using the rule:

\begin{eqnarray}

\boldsymbol\theta &=&\underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \sum_{h_{i}}Pr(h_{i}|\mathbf{w}_{i},\boldsymbol\theta^{old})\log\left[Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]\nonumber \\

&=&\underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \mathbb{E}_{h_{i}}\left[\log\left[Pr(h_{i}, \mathbf{w}_{i}|\boldsymbol\theta)\right]\right]\right] \tag{17}

\end{eqnarray}

where the expectation is calculated with respect to the distribution that we computed in the E-Step and $\boldsymbol\theta^{old}$ denotes the previous parameters that were used in the E-Step.

It is not obvious why the EM update rules maximize the cost function in equation 15, but this discussion is beyond the scope of this tutorial. For more information consult chapter 7 of this book.

Now let's map the EM algorithm back to our use case. Here, the choice of the unknown underlying parse tree $T_{i}$ is the hidden variable, and we also have for our case $Pr(\mathbf{w}_{i},T_{i}) = Pr(T_{i})$ (equation 5). This gives us the E-Step:

\begin{eqnarray}

Pr(T_{i}|\mathbf{w}_{i},\boldsymbol\theta) = \frac{Pr(T_{i}|\boldsymbol\theta)}{\sum_{T_{i}\in\mathcal{T}[\mathbf{w}]}Pr(T_{i}|\boldsymbol\theta)}, \tag{18}

\end{eqnarray}

and M-Step:

\begin{eqnarray}\label{eq:tree_m_step}

\boldsymbol\theta^{t+1} &=&\underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \mathbb{E}_{T_{i}}\left[\log\left[Pr(T_{i}|\boldsymbol\theta)\right]\right]\right], \tag{19}

\end{eqnarray}

where this expectation is with respect to the posterior distribution $Pr(T_{i}|\mathbf{w}_{i},\boldsymbol\theta^{old})$ that was calculated in the E-Step. The maximization in the M-Step is subject to the usual PCFG constraints that the weights of all the rules mapping from a given non-terminal $\text{A}$ must sum to one:

\begin{equation}\label{eq:tree_m_step_constraint}

\sum_{\alpha}\exp\left[\theta_{A\rightarrow\alpha}\right] = 1. \tag{20}

\end{equation}

We'll now consider each of those steps in more detail.

Recall that in the E-Step we wish to compute the posterior distribution $Pr(T|\mathbf{w},\boldsymbol\theta)$ over the parse trees $T$ for each sentence $\mathbf{w}$ given the current parameters $\boldsymbol\theta$:

\begin{eqnarray}\label{eq:E-step_parsing}

Pr(T|\mathbf{w},\boldsymbol\theta) = \frac{Pr(T|\boldsymbol\theta)}{\sum_{T\in\mathcal{T}[\mathbf{w}]}Pr(T|\boldsymbol\theta)} \tag{21}

\end{eqnarray}

For a PCFG, the conditional probability in the numerator is just the product of the weights in the tree and the denominator is the partition function $Z$. Let's derive an expression for the numerator:

\begin{align}

\tag{definition of cond prob}

Pr(T|\boldsymbol\theta) &=\prod_{t \in T} \mbox{g}[T_t] \\

\tag{log-sum-exp}

&= \exp \left[\sum_{t \in T} \log \left[\mbox{g}[T_t]\right] \right] \\

\tag{parameter notation}

&= \exp \left[\sum_{t \in T} \theta_{T_t} \right] \\

\tag{rule counts}

&= \exp \left[\sum_{r \in R} f_r[T] \cdot \theta_{r} \right]\\

\tag{vectorized}

&= \exp \left[\mathbf{f}^{T} \boldsymbol\theta \right],

\end{align}

where $\mathbf{f}$ is a $|\mathcal{R}|\times 1$ vector of rule counts for the tree $T$, in which the $r^{th}$ entry is the number of times that rule $r$ occurs in the tree, and $\boldsymbol\theta$ is a vector of the same size where each entry is the log probability (weight) of that rule.

Armed with this new formulation, we can now re-write equation 21 as

\begin{eqnarray}\label{eq:E_Step_parsing_post}

Pr(T|\mathbf{w}_{i},\boldsymbol\theta) &=& \frac{\exp \left[\mathbf{f}^{T} \boldsymbol\theta \right]}{\sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T} \boldsymbol\theta \right]}\nonumber \\

&=& \frac{\exp \left[\mathbf{f}^{T} \boldsymbol\theta \right]}{Z}, \tag{22}

\end{eqnarray}

where $Z$ is the partition function that we can calculate using the inside algorithm.
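For a sentence with only a handful of parses, this posterior can be computed directly by enumeration. The sketch below assumes that each candidate tree is given as a vector of rule counts; the three rule probabilities and the two count vectors are made up for illustration. In practice, $Z$ would come from the inside algorithm rather than from enumeration.

```python
import math


def tree_posterior(count_vectors, theta):
    """Posterior over enumerated trees: exp(f^T theta) / Z for each count vector f."""
    scores = [sum(f_r * th_r for f_r, th_r in zip(f, theta)) for f in count_vectors]
    top = max(scores)
    # Subtracting the largest score before exponentiating improves numerical
    # stability and leaves the normalized result unchanged
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical example: three rules with probabilities 0.5, 0.3 and 0.2
theta = [math.log(0.5), math.log(0.3), math.log(0.2)]
# Two candidate trees: the first uses rule 0 twice and rule 1 once,
# the second uses rule 0 once and rule 2 twice
count_vectors = [[2, 1, 0], [1, 0, 2]]
posterior = tree_posterior(count_vectors, theta)
```

Here the two trees have weights $0.5^2 \times 0.3 = 0.075$ and $0.5 \times 0.2^2 = 0.02$, so the posterior is these two values divided by their sum.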

In the M-Step, we compute:

\begin{eqnarray}

\boldsymbol\theta^{t+1} &=&\underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \mathbb{E}_{T_{i}}\left[\log\left[Pr(T_{i}|\boldsymbol\theta)\right]\right]\right]\nonumber \\

&=& \underset{\boldsymbol\theta}{\text{arg}\,\text{max}}\left[\sum_i^I \mathbb{E}_{T_{i}}\left[\sum_{\text{A}}\sum_{\alpha} \mbox{f}_{\text{A}\rightarrow\alpha}[T_{i}]\theta_{\text{A} \rightarrow \alpha}\right]\right], \tag{23}

\end{eqnarray}

where, as usual, the function $\mbox{f}_{\text{A}\rightarrow\alpha}[T_{i}]$ counts how many times the rule $\text{A}\rightarrow\alpha$ is used in tree $T_{i}$.

This is now very similar to the supervised case; we maximize equation 23 subject to the constraint in equation 20 using Lagrange multipliers to yield the update rule:

\begin{eqnarray}

\theta_{\text{A} \rightarrow \text{B} \text{C}} = \log\left[\frac{\sum_i^I \mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}[T_i]]}{\sum_i^I \sum_\alpha \mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \alpha}[T_i]]} \right], \tag{24}

\end{eqnarray}

where the expectation is computed over the posterior distribution $Pr(T_{i}|\mathbf{w}_{i})$ that we compute in the E-Step. The final expression is the same as the supervised case (equation 9) except that we are now using the log weights $\theta_{\text{A} \rightarrow \text{B} \text{C}}$ and we are taking expectations over the counting functions $\mbox{f}_{\text{A} \rightarrow \alpha}[T_i]$.

In this section, we'll derive an expression for the expected counts $\mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}[T_i]]$ in the M-Step. First, we substitute in the expression for the posterior distribution (equation 22) from the E-Step:

\begin{eqnarray}

\mathbb{E}_{T}\left[f_{r}\right]

&=& \sum_{T\in \mathcal{T}[\mathbf{w}]}Pr(T|\mathbf{w},\boldsymbol\theta)\cdot f_{r}\nonumber \\

&=& \frac{1}{Z}\sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T} \cdot \boldsymbol\theta \right]\cdot f_{r} \tag{25}

\end{eqnarray}

Now we perform some algebraic manipulation of the right hand side:

\begin{eqnarray}

\mathbb{E}_{T}\left[f_{r}\right]

&=& \frac{1}{Z}\sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T} \cdot \boldsymbol\theta \right]\cdot f_{r} \nonumber \\

&=& \frac{1}{Z}\sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T}\boldsymbol\theta \right]\frac{\partial }{\partial \theta_{r}} \mathbf{f}^{T} \cdot \boldsymbol\theta \nonumber \\

&=&

\frac{1}{Z}\frac{\partial }{\partial \theta_{r}} \sum_{T\in \mathcal{T}[\mathbf{w}]}\exp \left[\mathbf{f}^{T} \boldsymbol\theta \right]\nonumber\\

&=& \frac{1}{Z}\frac{\partial Z}{\partial \theta_{r}} \nonumber\\

&=& \frac{\partial \log[Z]}{\partial \theta_{r}}. \tag{26}

\end{eqnarray}

We see that the expected count for the $r^{th}$ rule is just the derivative of the log partition function $\log[Z]$ with respect to the $r^{th}$ log weight $\theta_{r}$, where $Z$ is the quantity that we calculate with the inside algorithm.
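This identity is easy to check numerically on a small example. The sketch below enumerates a hypothetical set of two trees (given as rule-count vectors with made-up values), computes the expected counts directly from the posterior, and compares them with a central finite-difference estimate of the derivative of $\log[Z]$.

```python
import math


def log_partition(count_vectors, theta):
    """log Z, where Z sums exp(f^T theta) over the enumerated trees."""
    return math.log(sum(
        math.exp(sum(f * t for f, t in zip(fv, theta))) for fv in count_vectors
    ))


def expected_counts(count_vectors, theta):
    """Expected rule counts under the posterior exp(f^T theta) / Z."""
    weights = [math.exp(sum(f * t for f, t in zip(fv, theta))) for fv in count_vectors]
    Z = sum(weights)
    return [
        sum(w * fv[r] for w, fv in zip(weights, count_vectors)) / Z
        for r in range(len(theta))
    ]


def finite_difference(count_vectors, theta, r, eps=1e-6):
    """Central-difference estimate of d log Z / d theta_r."""
    up, down = list(theta), list(theta)
    up[r] += eps
    down[r] -= eps
    return (log_partition(count_vectors, up) - log_partition(count_vectors, down)) / (2 * eps)


# Hypothetical log weights and rule-count vectors for two candidate trees
theta = [math.log(0.5), math.log(0.3), math.log(0.2)]
count_vectors = [[2, 1, 0], [1, 0, 2]]

analytic = expected_counts(count_vectors, theta)
numeric = [finite_difference(count_vectors, theta, r) for r in range(len(theta))]
```

The two lists agree up to the error of the finite-difference approximation.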

To summarize the EM algorithm, we alternate between computing the expected counts for each rule $\text{A} \rightarrow \text{B} \text{C}$ and sentence $\mathbf{w}_{i}$:

\begin{eqnarray}

\mathbb{E}_{T_i}\left[\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}\right] &=& \frac{\partial \log[Z_{i}]}{\partial \theta_{\text{A} \rightarrow \text{B} \text{C}}}, \tag{27}

\end{eqnarray}

and updating the parameters:

\begin{eqnarray}

\theta_{\text{A} \rightarrow \text{B} \text{C}} = \log\left[\frac{\sum_i^I \mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}[T_i]]}{\sum_i^I \sum_\alpha \mathbb{E}_{T_i} [\mbox{f}_{\text{A} \rightarrow \alpha}[T_i]]} \right]. \tag{28}

\end{eqnarray}

To do this, we need to know how to compute the derivative of the log partition function. Since the log-partition function is calculated using the inside algorithm, a really simple way to do this is just to use *automatic differentiation*. We treat the inside algorithm as a differentiable function as we would a neural network and use backpropagation to estimate the derivatives using code similar to: `Z = inside(w, θ); log(Z).backward()`. This is known as the *inside-outside algorithm*.
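To make the alternation between expected counts and parameter updates concrete, here is a sketch of a single EM iteration. It assumes that the candidate trees of each sentence have been enumerated and summarized by their rule counts, so the expected counts are computed by brute force rather than by differentiating the inside algorithm. The rules, counts, and initial weights are hypothetical, and lexical rules are omitted for brevity.

```python
import math
from collections import defaultdict

# Hypothetical rule inventory, written as (left-hand side, right-hand side) pairs.
S_RULE, VP_V, VP_PP = ("S", "NP VP"), ("VP", "V NP"), ("VP", "VP PP")
NP_DT, NP_PP, PP_RULE = ("NP", "DT NN"), ("NP", "NP PP"), ("PP", "P NP")

verb_attach = {S_RULE: 1, VP_V: 1, VP_PP: 1, NP_DT: 3, PP_RULE: 1}
noun_attach = {S_RULE: 1, VP_V: 1, NP_PP: 1, NP_DT: 3, PP_RULE: 1}
# Sentence 1 is ambiguous (two candidates); sentence 2 has a single parse.
candidates = [[verb_attach, noun_attach], [noun_attach]]


def em_step(candidates, theta):
    """One soft-EM iteration over enumerated candidate trees."""
    expected = defaultdict(float)
    for trees in candidates:
        # E-step: posterior over the trees of this sentence
        weights = [math.exp(sum(c * theta[r] for r, c in f.items())) for f in trees]
        Z = sum(weights)
        for w, f in zip(weights, trees):
            for r, c in f.items():
                expected[r] += (w / Z) * c  # accumulate expected rule counts
    # M-step: normalize the expected counts per left-hand side, store log weights
    lhs_total = defaultdict(float)
    for (lhs, _), c in expected.items():
        lhs_total[lhs] += c
    return {
        r: math.log(expected[r] / lhs_total[r[0]]) if expected.get(r, 0.0) > 0 else -math.inf
        for r in theta
    }


# Arbitrary initial log weights
theta0 = {S_RULE: 0.0, PP_RULE: 0.0,
          VP_V: math.log(0.4), VP_PP: math.log(0.6),
          NP_DT: math.log(0.5), NP_PP: math.log(0.5)}
theta1 = em_step(candidates, theta0)
```

Unlike a hard assignment, both parses of the ambiguous sentence contribute to the new weights in proportion to their posterior probabilities.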

In the previous section, we argued that we could compute the expected counts by automatic differentiation of the inside algorithm. In this section, we'll take this one step further; we'll apply backpropagation by hand to the inside-algorithm, to yield the *outside* algorithm which directly computes these counts.

Let's think of the inside-algorithm as a differentiable program. It takes a sentence $\mathbf{w}$ and parameters $\log[g_{r}] = \theta_r$ as input, and returns the partition function $Z$ as output. The code is:

```
0  # Initialize data structure
1  chart[0...n, 0...n, 1...|V|] := 0
2
3  # Use unary rules to find possible parts of speech at pre-terminals
4  for k := 1 to n                          # word position
5      for each unary rule A -> w_k
6          chart[k-1, k, A] := g[A -> w_k]
7
8  # Sum over binary rules
9  for l := 2 to n                          # sub-string length
10     for i := 0 to n-l                    # start point
11         k := i + l                       # end point
12         for j := i + 1 to k-1            # mid point
13             for each binary rule A -> B C
14                 chart[i, k, A] += g[A -> B C] x
                                     chart[i, j, B] x
                                     chart[j, k, C]
15
16 return chart[0, n, S]
```

Notice that we have changed the indexing in the loops from our original presentation. This will make the notation less cumbersome. In this notation, the chart is indexed so that `chart[i, k, A]` contains the inside weight $\alpha[\text{A}_i^k]$, which represents the total weight of all the feasible sub-trees rooted at $\text{A}$ that span the sub-sequence from position $i$ to position $k$.
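A direct Python transcription of this pseudo-code might look as follows. It is a sketch under some assumptions of our own: the grammar is stored in two dictionaries (`unary` maps `(A, word)` to a weight and `binary` maps `(A, B, C)` to a weight), the chart is a dictionary keyed by `(i, k, A)`, and the toy grammar and sentence are invented for illustration.

```python
from collections import defaultdict


def inside(words, unary, binary, start="S"):
    """Inside algorithm: returns the partition function Z and the chart of inside weights."""
    n = len(words)
    chart = defaultdict(float)  # chart[i, k, A] holds alpha[A_i^k]; missing entries are 0

    # Use unary rules to find possible parts of speech at pre-terminals
    for k in range(1, n + 1):
        for (A, word), g in unary.items():
            if word == words[k - 1]:
                chart[k - 1, k, A] = g

    # Sum over binary rules
    for l in range(2, n + 1):              # sub-string length
        for i in range(0, n - l + 1):      # start point
            k = i + l                      # end point
            for j in range(i + 1, k):      # mid point
                for (A, B, C), g in binary.items():
                    chart[i, k, A] += g * chart[i, j, B] * chart[j, k, C]

    return chart[0, n, start], chart


# Hypothetical toy PCFG with a prepositional-phrase attachment ambiguity
unary = {("NP", "she"): 0.4, ("NP", "stars"): 0.2, ("NP", "telescopes"): 0.2,
         ("V", "saw"): 1.0, ("P", "with"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("NP", "NP", "PP"): 0.2,
          ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3, ("PP", "P", "NP"): 1.0}

Z, chart = inside("she saw stars with telescopes".split(), unary, binary)
```

The sentence has two parses with weights 0.00336 (the prepositional phrase attaches to the verb phrase) and 0.00224 (it attaches to the noun phrase), and `Z` is their sum.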

Now consider what happens when we call `log(Z).backward()`. In the original inside algorithm, we worked from the leaves of the parse tree up to the root where the partition function was computed. When we backpropagate, we reverse this order of operations and work from the root downwards. We'll now work backwards through the inside algorithm and discuss each part in turn.

**Main loop:** Based on this intuition we can already sketch the loop structure going backwards from largest constituent to smallest:

```
# Main parsing loop top-down
for l := n downto 2                  # sub-string length
    for i := 0 to n-l                # start point
        k := i + l                   # end point
        for j := i + 1 to k-1        # mid point
            do stuff
```

To simplify matters, let's assume that we want to take the derivative of the partition function $Z$ itself rather than $\log[Z]$ for now. The update in the inside algorithm was given by:

\begin{equation}

\alpha[\text{A}_i^k] += \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \alpha[\text{B}_i^j] \times \alpha[\text{C}_j^k] \tag{29}

\end{equation}

Now we use reverse mode automatic differentiation to work backwards through the computation graph implied by this update. Using the notation $\nabla x = \frac{\partial Z}{\partial x}$, we start from the base case at the root of the tree $\alpha[\text{S}_0^n]$, which holds the partition function so $\nabla \alpha[\text{S}_0^n] = \frac{\partial Z}{\partial Z} =1$. Then we recursively compute:

\begin{align}

\nabla \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] &+= \nabla \alpha[\text{A}_i^k] \left(\alpha[\text{B}_i^j] \times \alpha[\text{C}_j^k]\right) \label{eq:outside_rule_update}\tag{30}\\

\nabla \alpha[\text{B}_i^j] &+= \nabla \alpha[\text{A}_i^k]\left(\mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \alpha[\text{C}_j^k] \right)\label{eq:outside update}\tag{31}\\

\nabla \alpha[\text{C}_j^k] &+= \nabla \alpha[\text{A}_i^k]\left(\mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \alpha[\text{B}_i^j]\right)\label{eq:outside update2} \tag{32}

\end{align}

where in each case, the first term is the accumulated derivative that is being passed back from the later nodes of the computation graph, and the second term in brackets is the derivative of the current node with respect to each quantity involved.

**Pre-terminal loop:** Working further back through the algorithm, we must also compute the derivatives for the pre-terminal loop in lines 4-6 of the algorithm. There, the update is $\alpha[\text{A}_{k-1}^k] := \mbox{g}[\text{A} \rightarrow w_k]$, which gives:

\begin{equation}\label{eq:final_beta}

\nabla \mbox{g}[\text{A} \rightarrow w_k] += \nabla \alpha[\text{A}_{k-1}^k]. \tag{33}

\end{equation}

**Outside weights:** To ease further discussion, let us rename the partial derivatives of the partition function with respect to the inside weights as *outside weights* so that $\nabla \alpha [A_{i}^{k}] = \beta [A_{i}^{k}]$. Similarly, we'll rename the partial derivatives of the partition function with respect to the rules as $\nabla g[\text{A}\rightarrow \text{B} \text{C}] = \beta [\text{A}\rightarrow \text{B} \text{C}]$.

**Expected counts:** So far, we have been computing the derivatives of $Z$ with respect to the rule weights $\mbox{g}[\text{A} \rightarrow \text{B} \text{C}]$. However, we need to compute the expected counts, which are the derivatives of $\log[Z]$ with respect to the parameters $\theta_{\text{A} \rightarrow \text{B} \text{C}}$. We can make this adjustment using the chain rule:

\begin{eqnarray}

\mathbb{E}_{T}\left[\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}\right] &=& \frac{\partial \log[Z]}{\partial \theta_{\text{A} \rightarrow \text{B} \text{C}}}\nonumber \\

&=& \frac{\partial \log[Z]}{\partial Z} \cdot \frac{\partial Z}{\partial \mbox{g}[\text{A} \rightarrow \text{B} \text{C}]} \cdot

\frac{\partial \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] }{\partial \theta_{\text{A} \rightarrow \text{B} \text{C}}} \nonumber \\

&=& \frac{1}{Z} \cdot \beta[\text{A} \rightarrow \text{B} \text{C}] \cdot \mbox{g}[\text{A} \rightarrow \text{B} \text{C}].\label{eq:count_final} \tag{34}

\end{eqnarray}

where the second term is just the definition of $\beta[\text{A} \rightarrow \text{B} \text{C}]$ and the final term follows from the definition of the parameters $\theta_{\text{A} \rightarrow \text{B} \text{C}} = \log[\mbox{g}[\text{A} \rightarrow \text{B} \text{C}]]$.

Now we will put this all together and write out the inside-outside algorithm we obtained with this mechanistic procedure of differentiating the inside-algorithm. As opposed to a single chart, we will keep track of the intermediate values with three arrays:

- `in_weights`: Inside weights $\alpha[\text{A}_i^k]$ (i.e., the result of the inside algorithm).
- `out_weights`: Outside weights of anchored non-terminals $\beta[\text{A}_i^k] = \nabla \alpha[\text{A}_i^k]$.
- `out_rules`: Outside weights of rules $\beta[\text{A} \rightarrow \text{B} \text{C}] = \nabla \mbox{g}[\text{A} \rightarrow \text{B} \text{C}]$.

The inside-outside algorithm is now:

```
# Run inside first
in_weights, Z := INSIDE(w, G)
out_weights[0...n, 0...n, 1...|V|] := 0
out_rules[1...|R|] := 0

# Partial derivative of Z with respect to itself
out_weights[0, n, S] += 1

# Top-down sweep
for l := n downto 2                      # sub-string length
    for i := 0 to n-l                    # start point
        k := i + l                       # end point
        for j := i+1 to k-1              # mid point
            for each binary rule A -> B C
                out_rules[A -> B C] += out_weights[i, k, A] x
                                       in_weights[i, j, B] x
                                       in_weights[j, k, C]
                out_weights[i, j, B] += out_weights[i, k, A] x
                                        g[A -> B C] x
                                        in_weights[j, k, C]
                out_weights[j, k, C] += out_weights[i, k, A] x
                                        g[A -> B C] x
                                        in_weights[i, j, B]

# Width 1 constituents
for k := 1 to n
    for each unary rule A -> w_k
        out_rules[A -> w_k] += out_weights[k-1, k, A]

# Expected counts
for R in rules
    counts[R] := 1/Z x out_rules[R] x g[R]

return counts
```

It's easy to see from the nested `for` loop structure that the inside-outside algorithm has the same complexity as the inside algorithm alone.
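The pseudo-code translates almost line by line into Python. The sketch below is one possible transcription rather than a reference implementation: the dictionary-based grammar representation (`unary` keyed by `(A, word)`, `binary` keyed by `(A, B, C)`), the function names, and the toy grammar are assumptions made for this example.

```python
from collections import defaultdict


def inside(words, unary, binary, start="S"):
    """Inside pass: returns Z and the inside weights alpha[i, k, A]."""
    n = len(words)
    alpha = defaultdict(float)
    for k in range(1, n + 1):
        for (A, word), g in unary.items():
            if word == words[k - 1]:
                alpha[k - 1, k, A] = g
    for l in range(2, n + 1):
        for i in range(0, n - l + 1):
            k = i + l
            for j in range(i + 1, k):
                for (A, B, C), g in binary.items():
                    alpha[i, k, A] += g * alpha[i, j, B] * alpha[j, k, C]
    return alpha[0, n, start], alpha


def inside_outside(words, unary, binary, start="S"):
    """Returns Z and the expected count of every rule for one sentence."""
    n = len(words)
    Z, alpha = inside(words, unary, binary, start)
    if Z == 0.0:
        raise ValueError("sentence has no parse under this grammar")
    beta = defaultdict(float)        # out_weights: dZ / d alpha[i, k, A]
    beta_rule = defaultdict(float)   # out_rules:   dZ / d g[rule]
    beta[0, n, start] = 1.0

    # Top-down sweep, from the largest constituents to the smallest
    for l in range(n, 1, -1):
        for i in range(0, n - l + 1):
            k = i + l
            for j in range(i + 1, k):
                for (A, B, C), g in binary.items():
                    parent = beta[i, k, A]
                    left, right = alpha[i, j, B], alpha[j, k, C]
                    beta_rule[A, B, C] += parent * left * right
                    beta[i, j, B] += parent * g * right
                    beta[j, k, C] += parent * g * left

    # Width 1 constituents
    for k in range(1, n + 1):
        for (A, word), g in unary.items():
            if word == words[k - 1]:
                beta_rule[A, word] += beta[k - 1, k, A]

    # Expected counts: (1 / Z) x out_rules[R] x g[R]
    rules = list(unary.items()) + list(binary.items())
    counts = {rule: beta_rule[rule] * g / Z for rule, g in rules}
    return Z, counts


# Hypothetical toy PCFG with a prepositional-phrase attachment ambiguity
unary = {("NP", "she"): 0.4, ("NP", "stars"): 0.2, ("NP", "telescopes"): 0.2,
         ("V", "saw"): 1.0, ("P", "with"): 1.0}
binary = {("S", "NP", "VP"): 1.0, ("NP", "NP", "PP"): 0.2,
          ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3, ("PP", "P", "NP"): 1.0}

Z, counts = inside_outside("she saw stars with telescopes".split(), unary, binary)
```

For this sentence, the two parses have posterior probabilities 0.6 and 0.4, so the expected counts of the two attachment rules `VP -> VP PP` and `NP -> NP PP` are 0.6 and 0.4 respectively, while rules used in both parses have an expected count of one.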

Up to this point we have discussed quite mechanistically *how* to compute the outside-weights. Now let us finish by discussing exactly *what* they mean.

As we saw in Part II, the inside weights correspond to summing over the weights of all possible *inside-trees* (figure 1a):

\begin{equation}\label{eq:inside_update_2}

\alpha[A_i^k] = \sum_{B, C}\sum_{j=i+1}^{k-1} \mbox{g}[A \rightarrow B C] \times \alpha[B_i^j] \times \alpha[C_j^k]. \tag{35}

\end{equation}

The inside-weight $\alpha[A_i^k]$ corresponds to the sum of all the left and right child inside weights, considering all possible split points $j$ and all possible non-terminals B and C (figure 1).

From equations 31 and 32, we see that the equivalent recursive definition for the outside weights is:

\begin{eqnarray}

\beta[\text{B}_i^j] &=& \sum_{\text{A}, \text{C}} \sum_{k=j+1}^{n} \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \alpha[\text{C}_{j}^{k}] \times \beta[\text{A}_i^k]\nonumber \\

&&+\sum_{\text{A}, \text{C}} \sum_{k=0}^{i-1} \mbox{g}[\text{A} \rightarrow \text{C} \text{B}] \times \alpha[\text{C}_{k}^{i}] \times \beta[\text{A}_{k}^j] \tag{36}

\end{eqnarray}

The outside-trees of $\text{A}_i^j$ are the set of trees that are rooted in $S$ and yield $w_1, \ldots, w_{i}, \text{A}, w_{j+1}, \ldots, w_n$, where $n$ is the sentence length. These are the trees that originate from $S$ and occur around $\text{A}_i^j$ (figure 2). The recursive update computes the outside weight in terms of the parent outside weight and the sibling inside weight. It sums over all possible parents, siblings, the order of the siblings (there is one term for each in the equation) and the split point $k$.

We have seen that the inside weights sum over all inside-trees and the outside weights sum over all outside-trees. It follows that the total weight of all the parses that contain the anchored non-terminal $\text{A}_i^k$ is simply $\alpha[\text{A}_i^k] \cdot \beta[\text{A}_i^k]$. If we normalize by the partition function $Z$, we retrieve the marginal probability of the anchored non-terminal $\text{A}_i^k$, which is given by $(\alpha[\text{A}_i^k] \cdot \beta[\text{A}_i^k])/Z$.

By combining equations 30 and 34, we see how the inside- and outside-weights fit together to compute the expected counts:

\begin{equation}

\mathbb{E}_{T}\left[\mbox{f}_{\text{A} \rightarrow \text{B} \text{C}}\right]= \frac{1}{Z} \times \sum_{0 \leq i < j < k \leq n} \mbox{g}[\text{A} \rightarrow \text{B} \text{C}] \times \beta[\text{A}_i^k] \times \alpha[\text{B}_i^j] \times \alpha[\text{C}_j^k] \tag{37}

\end{equation}

This equation enumerates all possible spans, where A could be re-written to B and C (inside), but also considers every configuration that could happen around A.

In this blog, we introduced probabilistic context-free grammars and we discussed both supervised and unsupervised methods for learning their parameters. In the former case, we learn the rule weights from pairs of sentences and parse trees and there is a simple intuitive closed-form solution.

Then we considered the unsupervised case, where we only have access to the sentences. The first strategy we described is Viterbi training, which is fairly simple: we already know how to do weighted parsing to find the best trees *for known parameters*, and we also know how to maximize the likelihood of the parameters *if we are given the trees*. The Viterbi EM initializes the parameters randomly and then parses the entire corpus. Taking these parses as pseudo-labels, it re-estimates the parameters to maximize their likelihood. Then we iterate until convergence.

Viterbi training is sometimes called *Viterbi EM*, *hard EM* or *self-EM*. Variants include unsupervised parsing of different formalisms and methods to improve over vanilla supervised training. The same ideas have been adapted for image classification and segmentation.

We also discussed the soft-EM approach to unsupervised learning of rule weights in PCFGs. Rather than inferring the trees with the highest weight, we computed the expected number of times each rule would appear in the parses. Then these expected counts were used to find the maximum likelihood estimates for the parameters. To compute these expected counts, we used the inside-outside algorithm, which can be summarized by the code snippet `Z = inside(w, θ); log(Z).backward()`. Finally, we made a big leap and discussed how to mechanistically derive the outside-algorithm from the inside-pass to obtain the expected rule counts.

^{1} The Penn Treebank is not publicly available, but a free alternative is the QuestionBank, which contains English questions annotated with trees.

In this blog, we will introduce *weighted context-free grammars* or *WCFGs*. These assign a non-negative weight to each rule in the grammar. From here, we can assign a weight to any parse tree by multiplying the weights of its component rules together. We present two variations of the CYK algorithm that apply to WCFGs. (i) The *inside algorithm* computes the sum of the weights of all possible analyses (parse trees) for a sentence. (ii) The *weighted parsing* algorithm finds the parse tree with the highest weight.

In Part III of this tutorial, we introduce probabilistic context-free grammars. These are a special case of WCFGs where the weights of all rules with the same left-hand side sum to one. We then discuss how to learn these weights from a corpus of text. We will see that the inside algorithm is a critical part of this process.

Before we start our discussion, let's briefly review what we learned about context-free grammars and the CYK recognition algorithm in part I of this tutorial. Recall that we defined a context-free grammar as the tuple $\langle S, \mathcal{V}, \Sigma, \mathcal{R}\rangle$ with a start symbol $S$, non-terminals $\mathcal{V}$, terminals $\Sigma$ and finally the rules $\mathcal{R}$.

In our examples, the non-terminals are a set $\mathcal{V}=\{\mbox{VP, PP, NP, DT, NN, }\ldots\}$ containing sub-clauses (e.g., verb-phrase $\mbox{VP}$) and parts of speech (e.g., noun $\mbox{NN}$). The terminals contain the words. We will consider grammars in Chomsky Normal Form, where the rules either map one non-terminal to two other non-terminals (e.g., $\text{VP} \rightarrow \text{V} \; \text{NP}$) or a single terminal symbol (e.g., $\text{V} \rightarrow \text{eats}$).

The CYK recognition algorithm takes a sentence and a grammar in Chomsky Normal Form and determines if the sentence is valid under the grammar. With minor changes, it can also return the set of valid parse trees. It constructs a *chart* where each position in the chart corresponds to a sub-sequence of words (figure 2). At each position, there is a binary array with one entry per non-terminal, where this entry is set to true if a rule with this non-terminal on its left-hand side can be applied validly to the associated sub-sequence.

The CYK algorithm works by first finding valid unary rules that map pre-terminals representing parts of speech to terminals representing words (e.g., DT$\rightarrow$ the). Then it considers sub-sequences of increasing length and identifies applicable binary non-terminal rules (e.g., $\mbox{NP}\rightarrow \mbox{DT NN})$. The rule is applicable if there are two sub-trees lower down in the chart whose roots match its right hand side. If the algorithm can place the start symbol in the top-left of the chart, then the overall sentence is valid. The pseudo-code is given by:

    0   # Initialize data structure
    1   chart[1...n, 1...n, 1...|V|] := FALSE
    2
    3   # Use unary rules to find possible parts of speech at pre-terminals
    4   for p := 1 to n                            # start position
    5       for each unary rule A -> w_p
    6           chart[1, p, A] := TRUE
    7
    8   # Main parsing loop
    9   for l := 2 to n                            # sub-sequence length
    10      for p := 1 to n-l+1                    # start position
    11          for s := 1 to l-1                  # split width
    12              for each binary rule A -> B C
    13                  chart[l, p, A] = chart[l, p, A] OR
                            (chart[s, p, B] AND chart[l-s, p+s, C])
    14
    15  return chart[n, 1, S]

For a much more detailed discussion of this algorithm, consult Part I of this blog.

*Weighted context-free grammars* (WCFGs) are context-free grammars which have a non-negative weight associated with each rule. More precisely, we add the function $g: \mathcal{R} \mapsto \mathbb{R}_{\geq 0}$ that maps each rule to a non-negative number. The weight of a full derivation tree $T$ is then the product of the weights of each rule $T_t$:

\begin{equation}\label{eq:weighted_tree_from_rules}

\mbox{G}[T] = \prod_{t \in T} g[T_t]. \tag{1}

\end{equation}

Context-free grammars generate strings, whereas weighted context-free grammars generate strings with an associated weight.

We will interpret the weight $g[T_t]$ as the degree to which we favor a rule, and so, we "prefer" parse trees $T$ with higher overall weights $\mbox{G}[T]$. Ultimately, we will learn these weights in such a way that real observed sentences have high weights and ungrammatical sentences have lower weights. From this viewpoint, the weights can be viewed as *parameters* of the model.

Since the tree weights $G[T]$ are non-negative, they can be interpreted as un-normalized probabilities. To create a valid probability distribution over possible parse trees, we must normalize by the total weight $Z$ of all tree derivations:

\begin{eqnarray}

Z &=& \sum_{T \in \mathcal{T}[\mathbf{w}]} \mbox{G}[T] \nonumber \\

&=& \sum_{T \in \mathcal{T}[\mathbf{w}]} \prod_{t \in T} \mbox{g}[T_t], \tag{2}

\end{eqnarray}

where $\mathcal{T}[\mathbf{w}]$ represents the set of all possible parse trees from which the observed words $\mathbf{w}=[w_{1},w_{2},\ldots, w_{n}]$ can be derived. We'll refer to the normalizing constant $Z$ as the *partition function*. The conditional distribution of a possible derivation $T$ given the observed words $\mathbf{w}$ is then:

\begin{equation}

Pr(T|\mathbf{w}) = \frac{\mbox{G}[T]}{Z}. \tag{3}

\end{equation}
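
As a small illustration of equations 1 to 3, the sketch below multiplies rule weights to get a tree weight and then normalizes a set of tree weights. The toy grammar, its weights, and the helper names `tree_weight` and `posterior` are assumptions made up for this example.

```python
from math import prod

# Hypothetical rule weights g[.] for a toy grammar; the numbers are made up.
g = {
    "S -> NP VP": 1.0,
    "NP -> she": 0.5,
    "VP -> V NP": 0.8,
    "V -> eats": 0.9,
    "NP -> fish": 0.3,
}

def tree_weight(rules_in_tree, g):
    """Equation 1: G[T] is the product of the weights of the rules in T."""
    return prod(g[rule] for rule in rules_in_tree)

def posterior(tree_weights):
    """Equations 2 and 3: divide each G[T] by the partition function Z."""
    Z = sum(tree_weights)
    return [G / Z for G in tree_weights]

# The single parse of "she eats fish" uses every rule above exactly once.
G = tree_weight(list(g), g)

# Two competing trees with weights 3 and 1 get probabilities 0.75 and 0.25.
probs = posterior([3.0, 1.0])
```

Enumerating trees explicitly like this is only feasible for tiny examples, which is why a dynamic programming approach is needed for real sentences.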

We defined the partition function $Z$ as the sum of the weights of all the trees $\mathcal{T}[\mathbf{w}]$ from which the observed words $\mathbf{w}$ can be derived. However, in Part I of this tutorial we saw that the number of possible binary parse trees increases very rapidly with the sentence length.

The CYK recognition algorithm used dynamic programming to search this huge space of possible trees in polynomial time and determine whether there is at least one valid tree. To compute the partition function, we will use a similar trick to search through all possible trees and sum their weights simultaneously. This is known as the *inside algorithm*.

Before we present the inside algorithm, we need to introduce the *semiring*. This abstract algebraic structure will help us adapt the CYK algorithm to compute different quantities. A semiring is a set $\mathbb{A}$ on which we have defined two binary operators:

1. $\oplus$ is a commutative and associative operation with identity element 0, which behaves like addition $+$:

- $x \oplus y = y \oplus x$
- $(x \oplus y) \oplus z = x \oplus (y \oplus z)$
- $x \oplus 0 = 0 \oplus x = x$

2. $\otimes$ is an associative operation that distributes over $\oplus$ just like multiplication $\times$. It has the identity element 1 and absorbing element 0:

- $(x \otimes y) \otimes z = x \otimes (y \otimes z)$
- $x \otimes (y \oplus z) = (x \otimes y) \oplus (x \otimes z)$
- $x \otimes 1 = 1 \otimes x = x$
- $x \otimes 0 = 0 \otimes x = 0$

Similarly to grammars we will just denote semirings as tuples: $\langle\mathbb{A}, \oplus, \otimes, 0, 1\rangle$. You can think of the semiring as generalizing the notions of addition and multiplication.^{1}
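
To make the definition concrete, here is a minimal sketch of a semiring as a small Python record, together with the three instances used in this blog. The class name `Semiring`, the instance names, and the brute-force checker `satisfies_axioms` are assumptions of this example; the checker only tests the listed properties on a handful of sample values and is not a proof.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]    # the "addition" operator
    times: Callable[[Any, Any], Any]   # the "multiplication" operator
    zero: Any                          # identity for plus, absorbing for times
    one: Any                           # identity for times

BOOLEAN = Semiring(lambda x, y: x or y, lambda x, y: x and y, False, True)
SUM_PRODUCT = Semiring(lambda x, y: x + y, lambda x, y: x * y, 0.0, 1.0)
MAX_PRODUCT = Semiring(max, lambda x, y: x * y, 0.0, 1.0)

def satisfies_axioms(sr, samples):
    """Check the listed properties on every triple drawn from `samples`."""
    for x, y, z in product(samples, repeat=3):
        ok = (
            sr.plus(x, y) == sr.plus(y, x)
            and sr.plus(sr.plus(x, y), z) == sr.plus(x, sr.plus(y, z))
            and sr.times(sr.times(x, y), z) == sr.times(x, sr.times(y, z))
            and sr.times(x, sr.plus(y, z))
            == sr.plus(sr.times(x, y), sr.times(x, z))
            and sr.plus(x, sr.zero) == x
            and sr.times(x, sr.one) == x
            and sr.times(x, sr.zero) == sr.zero
        )
        if not ok:
            return False
    return True

# Sample values chosen so that floating-point arithmetic is exact.
weights = [0.0, 0.5, 1.0, 2.0, 4.0]
```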

Computing the partition function $Z$ for the conditional distribution $Pr(T|\mathbf{w})$ might appear difficult, because it sums over the large space of possible derivations for the sentence $\mathbf{w}$. However, we've already seen how the CYK recognition algorithm accepts or rejects a sentence in polynomial time, while sweeping through all possible derivations. The inside algorithm uses a variation of the same trick to compute the partition function.

When used for recognition, the $\texttt{chart}$ holds values of $\texttt{TRUE}$ and $\texttt{FALSE}$, and the computation is based on the two logical operators OR and AND. We can think of these as being part of the semiring $\langle\{\texttt{TRUE}, \texttt{FALSE}\}, \mbox{OR}, \mbox{AND}, \texttt{FALSE}, \texttt{TRUE}\rangle$.

The inside algorithm replaces this semiring with the sum-product semiring $\langle\mathbb{R}_{\geq 0} \cup \{+\infty\} , +, \times, 0, 1\rangle$ to get the following procedure:

    0   # Initialize data structure
    1   chart[1...n, 1...n, 1...|V|] := 0
    2
    3   # Use unary rules to find possible parts of speech at pre-terminals
    4   for p := 1 to n                            # start position
    5       for each unary rule A -> w_p
    6           chart[1, p, A] := g[A -> w_p]
    7
    8   # Main parsing loop
    9   for l := 2 to n                            # sub-sequence length
    10      for p := 1 to n-l+1                    # start position
    11          for s := 1 to l-1                  # split width
    12              for each binary rule A -> B C
    13                  chart[l, p, A] = chart[l, p, A] +
                            (g[A -> B C] x chart[s, p, B] x chart[l-s, p+s, C])
    14
    15  return chart[n, 1, S]

The differences from the recognition algorithm are in line 1 (the chart is initialized to 0), line 6 (the chart stores the unary rule weight) and line 13 (OR and AND are replaced by $+$ and $\times$, and the binary rule weight is included).

As in the CYK recognition algorithm, each position $(l,p)$ in the $\texttt{chart}$ represents the sub-sequence that starts at position $p$ and is of length $l$ (figure 2). In the inside algorithm, every position in the chart holds a length $|V|$ vector where the $v^{th}$ entry corresponds to the $v^{th}$ non-terminal. The value held in this vector is the sum of the weights of all sub-trees for which the $v^{th}$ non-terminal is the root.
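
The pseudo-code above can be sketched in Python roughly as follows. The dictionary-based grammar encoding, the toy grammar with its made-up weights, and the function name `inside` are assumptions of this example; the loops follow lines 4 to 15 of the pseudo-code and keep its 1-indexed positions.

```python
from collections import defaultdict

def inside(words, unary, binary, start="S"):
    """Return the partition function Z for `words`.

    unary:  dict mapping (A, word) -> weight of the rule A -> word
    binary: dict mapping (A, B, C) -> weight of the rule A -> B C
    """
    n = len(words)
    # chart[(l, p)][A] is the summed weight of all sub-trees rooted at A that
    # span the l words starting at position p. Missing entries read as 0.
    chart = defaultdict(lambda: defaultdict(float))
    for p in range(1, n + 1):                      # start position
        for (A, w), weight in unary.items():
            if w == words[p - 1]:
                chart[(1, p)][A] += weight
    for l in range(2, n + 1):                      # sub-sequence length
        for p in range(1, n - l + 2):              # start position
            for s in range(1, l):                  # split width
                for (A, B, C), weight in binary.items():
                    chart[(l, p)][A] += (
                        weight * chart[(s, p)][B] * chart[(l - s, p + s)][C]
                    )
    return chart[(n, 1)][start]

# Hypothetical toy grammar in Chomsky Normal Form (weights are made up).
UNARY = {("NP", "she"): 0.3, ("NP", "stars"): 0.3, ("NP", "telescopes"): 0.2,
         ("V", "saw"): 1.0, ("P", "with"): 1.0}
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.6,
          ("VP", "VP", "PP"): 0.4, ("NP", "NP", "PP"): 0.2,
          ("PP", "P", "NP"): 1.0}

Z = inside("she saw stars with telescopes".split(), UNARY, BINARY)
```

This sentence has two parses under the toy grammar (the prepositional phrase attaches either to the verb phrase or to the noun phrase), with weights 0.00432 and 0.00216, so `Z` is their sum.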

The intuition for the update rule in line 13 is simple. The additional weight for adding rule $A\rightarrow BC$ into the chart is the weight $g[A\rightarrow BC]$ for this rule times the sum of weights of all possible left sub-trees rooted in B times the sum of weights of all possible right sub-trees rooted in C. As before, there may be multiple possible rules that place non-terminal $A$ in a position corresponding to different splits of the sub-sequence and here we perform this computation for each rule and sum the results together.

In figures 3 and 4 we show a worked example of the inside algorithm for the same sentence as we used for the CYK recognition algorithm. Figure 3a corresponds to lines 4-6 of the algorithm where we are initializing the first row of the chart based on the unary rule weights. Figure 3b corresponds to the main loop in lines 9-13 for sub-sequence length $l=2$. Here we assign binary non-terminal rules and compute their weights as (weight of rule $\times$ weight of left branch $\times$ weight of right branch).

Figure 4a corresponds to the main loop in lines 9-13 for sub-sequence length $l=5$. At position (5,2), there are two possible rules that apply, both of which result in the same non-terminal. We calculate the weights for each rule as before, and add the results so that the final weight at this position sums over all sub-trees. Figure 4b shows the final result of the algorithm. The weight associated with the start symbol $S$ at position (6,1) is the partition function.

Our discussion so far does not make it clear why the method for computing the partition function is known as the inside algorithm. This is because the $\texttt{chart}$ holds the *inside-weights* for each *anchored non-terminal*. By "anchored" we mean that a non-terminal $A_i^k$ (pronounced "Aye from eye to Kay") is tied to a *span* of the sentence (i.e., a sub-string). Here $i$ and $k$ index the gaps between words, so that $A_i^k$ yields the string $A_i^k \Rightarrow w_{i+1}, \ldots, w_k$.

An anchored rule then has the form $A_i^k \rightarrow B_i^j C_j^k$. With this notation in hand, we can give a recursive definition of the inside-weight of an anchored non-terminal:

\begin{equation}\label{eq:inside_update}

\alpha[A_i^k] = \sum_{B, C}\sum_{j=i+1}^{k-1} \mbox{g}[A \rightarrow B C] \times \alpha[B_i^j] \times \alpha[C_j^k]. \tag{4}

\end{equation}

The inside-weight $\alpha[A_i^k]$ is the total weight of all sub-trees rooted at $A$ that yield the span. For each split point $j$ and each pair of non-terminals $B$ and $C$, we multiply the rule weight by the inside-weights of the left and right sub-trees, and then we sum the results (figure 5).

In the previous section, we saw that we could transform the CYK recognition algorithm into the inside algorithm, by just changing the underlying semiring. With this small adjustment, we showed that we can compute the partition function (sum of weights of all tree derivations) in polynomial time. In this section, we apply a similar trick to *weighted parsing*.
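
The recursive definition can also be written directly as a memoized function. This is a sketch under an assumed fencepost convention in which $A_i^k$ covers the words $w_{i+1},\ldots,w_k$ (so `words[i:k]` for a 0-indexed Python list) and split points lie strictly inside the span; the function name `inside_weight` and the one-symbol toy grammar are made up for illustration.

```python
from functools import lru_cache

def inside_weight(words, unary, binary, A, i, k):
    """Inside-weight alpha[A_i^k] of the span between fenceposts i and k."""

    @lru_cache(maxsize=None)
    def alpha(A, i, k):
        if k - i == 1:
            # A single word: the weight of the unary rule A -> word (0 if absent).
            return unary.get((A, words[i]), 0.0)
        total = 0.0
        for (parent, B, C), weight in binary.items():
            if parent != A:
                continue
            for j in range(i + 1, k):   # split points strictly inside the span
                total += weight * alpha(B, i, j) * alpha(C, j, k)
        return total

    return alpha(A, i, k)

# Hypothetical grammar with one non-terminal: X -> X X (0.5) and X -> a (1.0).
unary = {("X", "a"): 1.0}
binary = {("X", "X", "X"): 0.5}

# Four words have five binary trees, each of weight 0.5 ** 3.
total_weight = inside_weight(["a"] * 4, unary, binary, "X", 0, 4)
```

The memoization plays the role of the chart: each anchored non-terminal is computed once and reused.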

Recall that the partition function $Z$ was defined as the sum of the weights of all possible derivations:

\begin{eqnarray}

Z &=& \sum_{T \in \mathcal{T}[\mathbf{w}]} \mbox{G}[T] \nonumber \\

&=& \sum_{T \in \mathcal{T}[\mathbf{w}]} \prod_{t \in T} \mbox{g}[T_t]. \tag{5}

\end{eqnarray}

In contrast, weighted parsing aims to find the derivation $T^{*}$ with the highest weight among all possible derivations:

\begin{eqnarray}

T^{*} &=& \underset{T \in \mathcal{T}[\mathbf{w}]}{\text{arg} \, \text{max}} \; \left[\mbox{G}[T]\right] \nonumber \\

&=& \underset{T \in \mathcal{T}[\mathbf{w}]}{\text{arg} \, \text{max}} \left[\prod_{t \in T} \mbox{g}[T_t]\right], \tag{6}

\end{eqnarray}

where $\mbox{G}[T]$ is the weight of a derivation tree which is computed by taking the product of the weights $\mbox{g}[T_t]$ of the rules.

Once again we will modify the semiring in the CYK algorithm to perform the task. Let us replace the sum-product semiring $\langle\mathbb{R}_{\geq 0} \cup \{+\infty\} , +, \times, 0, 1\rangle$ with the max-product semiring $\langle\mathbb{R}_{\geq 0} \cup \{+\infty\} , \max[\bullet], \times, 0, 1\rangle$ to find the score of the "best" derivation. This gives us the following algorithm:

    0   # Initialize data structure
    1   chart[1...n, 1...n, 1...|V|] := 0
    2
    3   # Use unary rules to find possible parts of speech at pre-terminals
    4   for p := 1 to n                            # start position
    5       for each unary rule A -> w_p
    6           chart[1, p, A] := g[A -> w_p]
    7
    8   # Main parsing loop
    9   for l := 2 to n                            # sub-sequence length
    10      for p := 1 to n-l+1                    # start position
    11          for s := 1 to l-1                  # split width
    12              for each binary rule A -> B C
    13                  chart[l, p, A] = max[chart[l, p, A],
                            g[A -> B C] x chart[s, p, B] x chart[l-s, p+s, C]]
    14
    15  return chart[n, 1, S]

Lines 1, 6 and 13 differ from the CYK recognition algorithm, just as they did for the inside algorithm. The single difference from the inside algorithm is in line 13, where the sum is replaced by a maximum.

Once more, each position $(l,p)$ in the $\texttt{chart}$ represents the sub-sequence that starts at position $p$ and is of length $l$. In the inside algorithm, each position contained a vector with one entry for each of the $|V|$ non-terminals. Each element of this vector contained the sum of the weights of all of the sub-trees which feed into this anchored non-terminal. In this variation, each element contains the maximum weight among all the sub-trees that feed into this anchored non-terminal. Position (n,1) represents the whole string, and so the value $\texttt{chart[n, 1, S]}$ is the maximum weight among all valid parse trees. If this is zero, then there is no valid derivation.

The update rule at line 13 for the weight at $\texttt{chart[l, p, A]}$ now has the following interpretation. For each rule $\texttt{A -> B C}$ and for each possible split $\texttt{s}$ of the data, we multiply the rule weight $\texttt{g[A -> B C]}$ by the two weights $\texttt{chart[s, p, B]}$ and $\texttt{chart[l-s, p+s, C]}$ associated with the two child sub-sequences. If the result is larger than the current highest value, then we update it. If we are interested in the parse tree itself, then we can store back-pointers indicating which rule and split yielded the maximum value at each position, and traverse backwards to retrieve the best tree.

In figure 6 we illustrate a worked example of weighted parsing. The algorithm starts by assigning weights to pre-terminals exactly as in figure 3a. The computation of the weights for sub-sequences of length $l=2$ is also exactly as in figure 3b, and the algorithm also proceeds identically for $l=3$ and $l=4$.
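
The back-pointer idea can be sketched in Python as follows. The grammar encoding, the toy grammar with made-up weights, the nested-tuple tree format, and the function name `best_parse` are all assumptions of this example rather than part of the algorithm's definition.

```python
def best_parse(words, unary, binary, start="S"):
    """Return (weight, tree) for the highest-weight parse, or (0.0, None)."""
    n = len(words)
    score = {}   # (l, p, A) -> weight of the best sub-tree found so far
    back = {}    # (l, p, A) -> (s, B, C): split and children of that sub-tree
    for p in range(1, n + 1):
        for (A, w), weight in unary.items():
            if w == words[p - 1]:
                score[(1, p, A)] = weight
    for l in range(2, n + 1):
        for p in range(1, n - l + 2):
            for s in range(1, l):
                for (A, B, C), weight in binary.items():
                    cand = (weight * score.get((s, p, B), 0.0)
                            * score.get((l - s, p + s, C), 0.0))
                    if cand > score.get((l, p, A), 0.0):
                        score[(l, p, A)] = cand
                        back[(l, p, A)] = (s, B, C)

    def build(l, p, A):
        # Follow the back-pointers down from (l, p, A) to the words.
        if l == 1:
            return (A, words[p - 1])
        s, B, C = back[(l, p, A)]
        return (A, build(s, p, B), build(l - s, p + s, C))

    best = score.get((n, 1, start), 0.0)
    return best, (build(n, 1, start) if best > 0 else None)

# Hypothetical toy grammar in Chomsky Normal Form (weights are made up).
UNARY = {("NP", "she"): 0.3, ("NP", "stars"): 0.3, ("NP", "telescopes"): 0.2,
         ("V", "saw"): 1.0, ("P", "with"): 1.0}
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 0.6,
          ("VP", "VP", "PP"): 0.4, ("NP", "NP", "PP"): 0.2,
          ("PP", "P", "NP"): 1.0}

weight, tree = best_parse("she saw stars with telescopes".split(), UNARY, BINARY)
```

With these weights, the parse that attaches the prepositional phrase to the verb phrase wins, because the rule `VP -> VP PP` contributes more than `NP -> NP PP`.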

The sole difference occurs for the sub-sequence of length $l=5$ at position $p=2$ (figure 6). There are two possible rules that both assign the non-terminal VP to the chart at this position. In the inside algorithm, we calculated the weights of these rules and summed them. In weighted parsing, we store the largest of these weights, and this operation corresponds to the $\mbox{max}[\bullet,\bullet]$ function on line 13 of the algorithm.

At the end of the procedure, the weight associated with the start symbol at position (6,1) corresponds to the tree with the maximum weight and so is considered the "best". By keeping track of which sub-tree yielded the maximum weight at each split, we can retrieve this tree which corresponds to our best guess at parsing the sentence.

We've seen that we can add weights to CFGs and replace the $\mbox{OR}, \mbox{AND}$ semiring with $+, \times$ to find the total weight of all possible derivations (i.e., compute the partition function with the inside algorithm). Furthermore, we can use $\max, \times$ instead to find the parse tree with the highest weight.

The semirings allow us to unify the CYK recognition, inside, and weighted parsing algorithms by recursively defining the chart entries as:

\begin{equation}

\texttt{chart}[A_i^k] = \bigoplus_{B, C, j} \mbox{g}[A \rightarrow B C] \otimes \texttt{chart}[B_i^j] \otimes \texttt{chart}[C_j^k], \tag{7}

\end{equation}

where for recognition $\mbox{g}[A \rightarrow B C]$ just returns $\texttt{TRUE}$ for all existing rules.
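
One way to see the unification is a single CYK routine that takes the semiring operations as arguments. The sketch below is an illustration under assumed names (`semiring_cyk`, and a one-symbol toy grammar with made-up weights); passing OR/AND, $+/\times$ or $\max/\times$ gives recognition, the inside algorithm, and the weighted parsing score respectively.

```python
import operator

def semiring_cyk(words, unary, binary, plus, times, zero, start="S"):
    """CYK over an arbitrary semiring; rule weights must be semiring elements."""
    n = len(words)
    chart = {}

    def get(l, p, A):
        return chart.get((l, p, A), zero)

    for p in range(1, n + 1):
        for (A, w), weight in unary.items():
            if w == words[p - 1]:
                chart[(1, p, A)] = plus(get(1, p, A), weight)
    for l in range(2, n + 1):
        for p in range(1, n - l + 2):
            for s in range(1, l):
                for (A, B, C), weight in binary.items():
                    term = times(times(weight, get(s, p, B)),
                                 get(l - s, p + s, C))
                    chart[(l, p, A)] = plus(get(l, p, A), term)
    return get(n, 1, start)

# Hypothetical grammar with one non-terminal: X -> X X (0.5) and X -> a (1.0).
unary = {("X", "a"): 1.0}
binary = {("X", "X", "X"): 0.5}
words = ["a"] * 4

# For recognition, every existing rule gets the weight TRUE.
recognized = semiring_cyk(words, {r: True for r in unary},
                          {r: True for r in binary},
                          operator.or_, operator.and_, False, start="X")
Z = semiring_cyk(words, unary, binary, operator.add, operator.mul, 0.0, start="X")
best = semiring_cyk(words, unary, binary, max, operator.mul, 0.0, start="X")
```

Here the four-word sentence has five trees of weight $0.5^3$ each, so the sum-product run returns their total and the max-product run returns the weight of a single tree.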

Readers familiar with graphical models will no doubt have noticed the similarity between these methods and sum-product and max-product belief propagation. Indeed, we could alternatively have presented this entire argument in terms of graphical models, but the semiring formulation is more concise.

In the final part of this blog, we will consider probabilistic context-free grammars, which are a special case of weighted context-free grammars. We'll develop algorithms to learn the weights from (i) a corpus of sentences with known parse trees and (ii) just the sentences. The latter case will lead to a discussion of the famous *inside-outside algorithm*.

^{1. }If you are wondering why it is "semi", it's because the magnificent *rings* also have an additive inverse for each element: $x \oplus (-x) = 0$.

In the first section, we'll discuss position embeddings. The transformer operates on unordered sets of embeddings, but often we are processing ordered sequences (e.g., words in NLP). We will describe the ways that the architecture has been adapted to take into account the position of each element in the sequence. In the second section, we'll discuss efficiency. The attention computation grows quadratically with the sequence length and in practice this limits the maximum length we can use. We'll describe work that allows the transformer to work efficiently with longer sequences. We will conclude by describing how the self-attention mechanism relates to other models, including RNNs, graph neural networks, capsule networks, Hopfield networks, CNNs, gating networks, and hypernetworks.

In part I, we discussed how the core component of the transformer is dot-product self attention $\bf Sa[\mathbf{X}]$. In this section, we'll provide a brief review of this mechanism. Self-attention takes a set of vectors $\{\mathbf{x}_{i}\}_{i=1}^{I}$ (which form the $I$ rows of $\mathbf{X}$) and modifies them based on the degree to which they attend to each other:

\begin{equation}

{\bf Sa}[\mathbf{X}] =\bf Softmax\left[\frac{(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{v}, \tag{1}

\end{equation}

where the function $\bf Softmax[\bullet]$ performs a separate softmax operation on each row of the input. The terms $\boldsymbol\Phi_{q}, \boldsymbol\Phi_{k}$ and $\boldsymbol\Phi_{v}$ are known as the query matrices, key matrices and value matrices respectively, and when applied to the data they form the queries $\mathbf{X}\boldsymbol\Phi_{q}$, keys $\mathbf{X}\boldsymbol\Phi_{k}$, and values $\mathbf{X}\boldsymbol\Phi_{v}$.

In simple terms, for each input $\mathbf{x}_{i}$ the self attention mechanism returns a weighted sum of the values for every input $\mathbf{x}_{j}$, where the weight depends on the dot product similarity between the query for $\mathbf{x}_{i}$ and the key for $\mathbf{x}_{j}$. These similarities are normalized by the softmax function so that they are positive and sum to one and after normalization are referred to as attention. The term $\bf Softmax\left[(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}/\sqrt{d_{q}}\right]$ is of size $I\times I$ and is known as the attention matrix.
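
A minimal NumPy sketch of equation 1 is given below. The function names `softmax_rows` and `self_attention`, the random test data, and the chosen dimensions are assumptions for illustration; the scaling uses the number of columns of the queries as $d_{q}$.

```python
import numpy as np

def softmax_rows(Z):
    """Apply a separate softmax to each row of Z."""
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def self_attention(X, Phi_q, Phi_k, Phi_v):
    """Return Sa[X] and the I x I attention matrix."""
    Q, K, V = X @ Phi_q, X @ Phi_k, X @ Phi_v    # queries, keys, values
    attention = softmax_rows(Q @ K.T / np.sqrt(Q.shape[1]))
    return attention @ V, attention

rng = np.random.default_rng(0)
I, D, d_q, d_v = 4, 6, 5, 3
X = rng.normal(size=(I, D))                      # one input per row
Phi_q = rng.normal(size=(D, d_q))
Phi_k = rng.normal(size=(D, d_q))
Phi_v = rng.normal(size=(D, d_v))

output, attention = self_attention(X, Phi_q, Phi_k, Phi_v)
```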

The self-attention mechanism is *equivariant* to permutations of the input. In other words, if we apply a permutation matrix $\mathbf{P}$ to the rows of the matrix $\mathbf{X}$, the output will also be permuted, but will otherwise stay the same:

\begin{eqnarray}

{\bf Sa}[\mathbf{P}\mathbf{X}] &=&\bf Softmax\left[\frac{(\mathbf{P}\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{P}\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{P}\mathbf{X}\boldsymbol\Phi_{v}\nonumber\\

&=&\mathbf{P}\cdot \bf Softmax\left[\frac{(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{P}^{T}\mathbf{P}\mathbf{X}\boldsymbol\Phi_{v}\nonumber \\

&=&\mathbf{P}\cdot\bf Softmax\left[\frac{(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{v}\nonumber \\

&=&\mathbf{P}\cdot {\bf Sa}[\mathbf{X}] . \tag{2}

\end{eqnarray}
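
The equivariance in equation 2 is easy to check numerically. The sketch below uses random matrices and a random permutation; all names and sizes are assumptions of this example.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def self_attention(X, Phi_q, Phi_k, Phi_v):
    Q, K, V = X @ Phi_q, X @ Phi_k, X @ Phi_v
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[1])) @ V

rng = np.random.default_rng(0)
I, D = 5, 4
X = rng.normal(size=(I, D))
Phi_q, Phi_k, Phi_v = (rng.normal(size=(D, D)) for _ in range(3))

# Permutation matrix: row r of P @ X is row perm[r] of X.
perm = rng.permutation(I)
P = np.eye(I)[perm]

lhs = self_attention(P @ X, Phi_q, Phi_k, Phi_v)   # Sa[PX]
rhs = P @ self_attention(X, Phi_q, Phi_k, Phi_v)   # P Sa[X]
```

The two results agree up to floating-point error, which is exactly why the mechanism cannot distinguish different orderings of the same inputs.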

This is not desirable when the vectors $\mathbf{x}_{i}$ represent words in a sentence, as the order of the inputs is important; the sentences "The man ate the fish" and "The fish ate the man" have different meanings, and we hope that any neural processing will take this into account.

Before discussing how to encode positional information, it is worth thinking about what properties we would like this encoding to have. First, we need to know the *relative position* of two words rather than their absolute position. Transformers are trained with spans of text that may contain multiple sentences, and the start of the span may be mid-way through the sentence. Consequently, the absolute position does not contain much useful information.

Second, word embeddings that are far from one another in the sequence might be expected to interact with one another less than those that are closer. For example, when we disambiguate a pronoun (e.g., understanding who "he" is in a sentence like "He ate the sandwich"), it's likely that the answer is close at hand, not several thousand words away. Finally, we might expect that we need the relative position with less and less accuracy as the distance between tokens increases. For small distances, the relative word position directly affects the meaning of the sentence, but for larger distances the words are probably in different sentences and the exact distance between them matters much less.

In the original transformer paper, position was encoded by adding a pre-determined matrix $\boldsymbol\Pi$ to the input embedding matrix $\mathbf{X}$ where the position embeddings are pre-defined as:

\begin{eqnarray}

\Pi_{i, 2f} &=& \sin[\omega_f i] \nonumber\\

\Pi_{i, 2f+1} &=& \cos[\omega_f i] \tag{3}

\end{eqnarray}

where $i$ indexes the position in the sequence and $f$ indexes pairs of adjacent embedding dimensions. The angular frequencies $\omega_f$ of adjacent dimensions $d = 2f$ and $d+1 = 2f+1$ are the same and take the value $\omega_f = 10000^{-2f/D}$ (figure 1).
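
As a sketch, the matrix $\boldsymbol\Pi$ of equation 3 can be built as follows. The function name `sinusoidal_embeddings` is made up, and starting the position index at $i=0$ is an assumption of this example.

```python
import numpy as np

def sinusoidal_embeddings(I, D, base=10000.0):
    """Return the I x D matrix Pi with Pi[i, 2f] = sin and Pi[i, 2f+1] = cos."""
    assert D % 2 == 0, "sketch assumes an even embedding dimension"
    i = np.arange(I)[:, None]                    # positions, shape (I, 1)
    f = np.arange(D // 2)[None, :]               # frequency index, shape (1, D/2)
    omega = base ** (-2.0 * f / D)               # angular frequencies
    Pi = np.empty((I, D))
    Pi[:, 0::2] = np.sin(omega * i)              # even dimensions
    Pi[:, 1::2] = np.cos(omega * i)              # odd dimensions
    return Pi

Pi = sinusoidal_embeddings(I=6, D=8)
```

The first pair of dimensions has frequency $\omega_0 = 1$ and oscillates fastest; later pairs oscillate progressively more slowly.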

One way to think about adding the matrix $\boldsymbol\Pi$ is that we are adding a different vector to the embedding $\mathbf{x}_{i}$ where this vector encodes the absolute position $i$. So if the same word occurs at different positions in the sequence, it would have two different embeddings. For this reason, this sinusoidal encoding is considered an *absolute position embedding*.

This scheme is worth examining closely. In the self-attention mechanism we apply linear transformations $\boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$ to $\mathbf{X}+\boldsymbol\Pi$ and then compute dot products between every pair of rows in the resulting matrices. We'll now consider several interesting properties that emerge when we apply linear transformations to this sinusoidal embedding and take dot products.

**Separating position and word embeddings:** At first sight, *adding* the position embeddings to the data seems a bad idea; we probably need both the word embedding and the position embedding without having them hopelessly entangled. However, this is not necessarily a problem. Since the embedding dimension $D$ is usually greater than the maximum sequence length $I$ (e.g., BERT used $D=1024$, $I=512$), it is possible for the system to learn word embeddings that lie outside the subspace of the position embeddings. If this were the case, the system could recover the word embeddings by learning linear transformations $\boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$ whose null-space contains the position embeddings. Similarly, the system could recover the position embeddings.

**Down-weighting distant elements:** The dot product between the position encodings $\boldsymbol\pi_{i}$ and $\boldsymbol\pi_{j}$ at different positions $i$ and $j$ (i.e., rows of $\boldsymbol\Pi$) gets smaller as the relative position $|i-j|$ increases (figure 2). So if the system were to retrieve the position embeddings using a linear transform as described above, it could create an attention matrix that increasingly down-weights attention between elements as they become more distant when it computes the dot products.
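
This behavior can be inspected numerically. Since $\sin[a]\sin[b]+\cos[a]\cos[b]=\cos[a-b]$, each dot product equals $\sum_{f}\cos[\omega_{f}(i-j)]$ and so depends only on the offset. The sketch below (with made-up sizes and a hypothetical helper `sinusoidal_embeddings`, indexing positions from zero) checks this and looks at the first few offsets; it does not claim that the decrease is monotonic at all distances.

```python
import numpy as np

def sinusoidal_embeddings(I, D, base=10000.0):
    i = np.arange(I)[:, None]
    omega = base ** (-2.0 * np.arange(D // 2)[None, :] / D)
    Pi = np.empty((I, D))
    Pi[:, 0::2] = np.sin(omega * i)
    Pi[:, 1::2] = np.cos(omega * i)
    return Pi

Pi = sinusoidal_embeddings(I=64, D=32)
gram = Pi @ Pi.T          # gram[i, j] is the dot product of pi_i and pi_j

# Row 0 gives the dot product as a function of the offset |i - j|.
by_offset = gram[0]
```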

**Relative vs. absolute positions:** We have added a unique embedding $\boldsymbol\pi_{i}$ at each absolute position $i$. However, it's possible to transform the embedding $\boldsymbol\pi_{i}$ at position $i$ to that at relative position $i+j$ using a linear operation. To see this, consider the embeddings $\left(\sin[\omega_{f}i]\;\;\cos[\omega_{f} i]\right)^{T}$ at word position $i$ and two adjacent dimensions $d$ and $d+1$ of the embedding. Applying the following linear transform we get:

\begin{eqnarray}

\begin{pmatrix}\cos[\omega_{f} j]&\sin[\omega_{f} j]\\-\sin[\omega_{f} j]&\cos[\omega_{f} j]\end{pmatrix}

\begin{pmatrix}\sin[\omega_{f} i]\\\cos[\omega_{f} i]\end{pmatrix} &=&\begin{pmatrix}

\cos[\omega_{f} j]\sin[\omega_{f} i]+ \sin[\omega_{f} j]\cos[\omega_{f} i]\\

-\sin[\omega_{f} j]\sin[\omega_{f} i]+\cos[\omega_{f} j]\cos[\omega_{f} i]\end{pmatrix}\nonumber \\ &=&

\begin{pmatrix}\sin[\omega_{f} (i+j)]\\\cos[\omega_{f} (i+j)]\end{pmatrix} \tag{4}

\end{eqnarray}

where we have used the trigonometric addition identities. So by applying the appropriate linear transformation, the system can transform the position encoding at position $i$ to that at position $i+j$. If it did this for just the queries, then the dot products between position vectors would take a maximum value at a relative offset of $j$ rather than 0.
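
Equation 4 applies to one pair of dimensions; stacking one such $2\times 2$ block per frequency gives a single matrix that shifts the whole embedding. The sketch below checks this numerically. The helper names `position_vector` and `shift_matrix` and the sizes are assumptions of this example.

```python
import numpy as np

def position_vector(i, D, base=10000.0):
    """Sinusoidal embedding pi_i with interleaved sin and cos entries."""
    omega = base ** (-2.0 * np.arange(D // 2) / D)
    pi = np.empty(D)
    pi[0::2] = np.sin(omega * i)
    pi[1::2] = np.cos(omega * i)
    return pi

def shift_matrix(j, D, base=10000.0):
    """Block-diagonal matrix M with M @ pi_i = pi_{i+j} for every i."""
    M = np.zeros((D, D))
    for f in range(D // 2):
        w = base ** (-2.0 * f / D)
        c, s = np.cos(w * j), np.sin(w * j)
        M[2 * f:2 * f + 2, 2 * f:2 * f + 2] = [[c, s], [-s, c]]
    return M

D = 8
shifted = shift_matrix(5, D) @ position_vector(3, D)
target = position_vector(8, D)
```

Note that the matrix depends only on the offset $j$, not on the starting position $i$.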

Note that all of the above is supposition; the trained network does not necessarily do any of these things. The point is that these capabilities are available to it if it chooses to use them.

We've seen that it's possible to use sinusoidal embeddings for which the linear projections and dot-products have useful properties. An obvious next step is to learn the position embedding matrix $\boldsymbol\Pi$ during training. This approach was also tried in the original transformer paper and adopted by subsequent models like BERT and GPT-2.

The advantage of learning the position embeddings is that we can potentially capture more complex properties. The disadvantage is that it adds a lot of extra parameters to the model, and once learned, the model cannot be extended to longer sequence lengths.

It's interesting, however, to test whether the learned position embeddings capture the desirable properties of the sinusoidal embeddings. Wang and Chen (2020) compared the cosine similarities (closely related to dot products) between embeddings at different relative distances (figure 3). For GPT-2 the similarity of the embeddings decreases as a function of distance for small distances with a periodic component at larger distances. For BERT, the results are more noisy and complicated.

They also examined if it is possible to predict the absolute positions by applying linear regression to the learned embedding. For the BERT embeddings, the error in these predictions is large, for the GPT-2 embeddings very small, and for the sinusoidal embeddings zero. The same experiment can be done by regressing pairs of position embeddings to predict relative position. Here, the error is again greatest for the BERT embeddings, but this time, the GPT-2 embeddings outperform the pre-defined sinusoidal embeddings.

Adding position embeddings modifies the self-attention calculation to:

\begin{equation}

\bf Sa [\mathbf{X}] = \bf Softmax\left[\frac{((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{q})((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right](\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{v}. \tag{5}

\end{equation}

The position matrix modifies both the attention matrix (the softmax term) and the computation of the values. There have been a number of studies in which the latter modification is dropped so that just the attention matrix is changed:

\begin{equation}

\bf Sa [\mathbf{X}] = \bf Softmax\left[\frac{((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{q})((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{v}. \tag{6}

\end{equation}

In these circumstances, the position information is usually added at every layer as it is only represented very implicitly in the output of the computation.

Let's consider the un-normalized and pre-softmax attention matrix:

\begin{equation}

\tilde{\mathbf{A}} = ((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{q})((\mathbf{X}+\boldsymbol\Pi)\boldsymbol\Phi_{k})^{T}, \tag{7}

\end{equation}

which has elements:

\begin{eqnarray}

\tilde{a}_{i,j} &=& ((\mathbf{x}_{i}+\boldsymbol\pi_{i})\boldsymbol\Phi_{q})((\mathbf{x}_{j}+\boldsymbol\pi_{j})\boldsymbol\Phi_{k})^{T}\nonumber \\

&=& \underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_\text{content-content}+\underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\boldsymbol\pi_{j}^{T}}_{\text{content-position}}+\underbrace{\boldsymbol\pi_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_{\text{position-content}}+\underbrace{\boldsymbol\pi_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\boldsymbol\pi_{j}^{T}}_{\text{position-position}},\label{eq:attention_breakdown} \tag{8}

\end{eqnarray}

where we can see that each element has four contributions in which the position embedding $\boldsymbol\pi$ and the content vector $\mathbf{x}$ interact differently. This expression has been modified in various ways.
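
The four-way breakdown in equation 8 can be verified with a few lines of NumPy. The matrices here are random placeholders and the function name `attention_terms` is made up for this sketch.

```python
import numpy as np

def attention_terms(X, Pi, Phi_q, Phi_k):
    """Return the four I x I contributions to the pre-softmax attention."""
    M = Phi_q @ Phi_k.T
    return {
        "content-content": X @ M @ X.T,
        "content-position": X @ M @ Pi.T,
        "position-content": Pi @ M @ X.T,
        "position-position": Pi @ M @ Pi.T,
    }

rng = np.random.default_rng(0)
I, D, d_q = 5, 8, 4
X = rng.normal(size=(I, D))
Pi = rng.normal(size=(I, D))     # any position matrix works for this check
Phi_q = rng.normal(size=(D, d_q))
Phi_k = rng.normal(size=(D, d_q))

# Equation 7: the un-normalized, pre-softmax attention matrix.
A_tilde = ((X + Pi) @ Phi_q) @ ((X + Pi) @ Phi_k).T
terms = attention_terms(X, Pi, Phi_q, Phi_k)
```

Keeping the terms separate like this makes it straightforward to drop or re-parameterize individual contributions, which is what the schemes below do.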

**Untied embeddings:** One simple modification is to decouple or *untie* the content and position components rather than add them together before projection. A simple way to do this is to remove the terms where they interact and to use a separate linear transform for each to give:

\begin{equation}

\tilde{a}_{i,j} = \underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_\text{content-content}+\underbrace{\boldsymbol\pi_{i}\boldsymbol\Psi_q\boldsymbol\Psi_{k}^{T}\boldsymbol\pi_{j}^{T}}_{\text{position-position}}. \tag{9}

\end{equation}

**Relative embeddings:** Another modification is to directly inject information about the relative position. For example, Shaw et al. (2018) add a term $\boldsymbol\pi_{i-j}$ which depends on the position difference:

\begin{equation}\label{eq:rel_pos_shaw}

\tilde{a}_{i,j} = \underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_\text{content-content}+\underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\pi_{i-j}^{T}}_{\text{content-position}}. \tag{10}

\end{equation}

Here, a different position vector $\boldsymbol\pi_{i-j}$ is learned for each signed position offset $i-j$. This offset is usually clipped so that, beyond a certain distance, all terms are the same. Note that this position vector is defined directly in the space of the keys rather than projected into it^{1}.

Raffel et al. (2019) simplified this further by simply adding a learnable scalar $\pi_{i-j}$ to the attention matrix:

\begin{equation}

\tilde{a}_{i,j} = \underbrace{\left(\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}\right)}_\text{content-content} + \pi_{i-j}. \tag{11}

\end{equation}

Here, $\pi_{i-j}$ is a different scalar for each signed offset $i-j$. Relative position information has also been incorporated in various other ways, such as simply multiplying the attentions by a modifying factor $\pi_{|i-j|}$:

\begin{equation}

\tilde{a}_{i,j} = \underbrace{\left(\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}\right)}_\text{content-content}\cdot \pi_{|i-j|}. \tag{12}

\end{equation}

Here, $\pi_{|i-j|}$ is a different scalar for each absolute offset $|i-j|$.

Finally, we note that pre-defined sinusoidal embeddings have also been used in a system based on equation 10 (where $\boldsymbol\pi_{i-j}$ now contains sinusoidal terms in relative position $i-j$) and also in more complex ways.

**Combining ideas:** Many schemes have proposed position embeddings that combine the ideas of (i) only retaining certain terms from equation 8, (ii) using different projection matrices for the content and position embeddings, and (iii) using relative embeddings. For example, in DeBERTa they use:
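
The additive scalar scheme of equation 11 amounts to looking up one value per signed offset. The sketch below is an illustration only: the function name `relative_bias`, the table layout, and the simple clipping of large offsets (borrowed from the description of equation 10) are assumptions, and published models may group offsets differently.

```python
import numpy as np

def relative_bias(I, table, max_offset):
    """Return the I x I matrix B with B[i, j] = table[clip(i - j) + max_offset].

    table holds 2 * max_offset + 1 scalars, one per clipped signed offset.
    """
    offsets = np.arange(I)[:, None] - np.arange(I)[None, :]    # i - j
    clipped = np.clip(offsets, -max_offset, max_offset)
    return table[clipped + max_offset]

# Placeholder values standing in for learned scalars.
table = np.arange(5, dtype=float)          # offsets -2, -1, 0, 1, 2
B = relative_bias(I=4, table=table, max_offset=2)

# The bias is added to the content-content term before the softmax:
# A_tilde = X @ Phi_q @ Phi_k.T @ X.T + B
```

Because `B` depends only on $i-j$, it is constant along each diagonal, so the same offset receives the same bias wherever it occurs in the sequence.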

\begin{equation}

\tilde{a}_{i,j} =

\underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_\text{content-content}+

\underbrace{\mathbf{x}_{i}\boldsymbol\Phi_q\boldsymbol\Psi_{k}^{T}\boldsymbol\pi_{i-j}^{T}}_{\text{content-position}}+

\underbrace{\boldsymbol\pi_{j-i}\boldsymbol\Psi_q\boldsymbol\Phi_{k}^{T}\mathbf{x}_{j}^{T}}_{\text{position-content}}. \tag{13}

\end{equation}

Here, they drop the position-position term and have a different relative embedding $\boldsymbol\pi_{i-j}$ for each signed offset $i-j$ between the positions.

In this section we have provided a brief overview of how position information is added into transformers. At the time of writing, it is not clear which of these position embedding schemes is empirically superior. For downstream tasks on BERT, relative position embeddings generally perform better than absolute position embeddings, but there does not seem to be much difference between sinusoidal embeddings and learned embeddings. To learn more about position embeddings, consult this survey paper.

In the second part of this blog, we consider modifications to the self-attention mechanism that make it more efficient as the sequence length increases. The self-attention mechanism takes $I$ inputs $\mathbf{x}_{i}$ and returns $I$ modified outputs. In this process, every input $\mathbf{x}_{i}$ interacts with every other input; each output is a weighted sum of the values corresponding to every input, where the weights depend on how much the input attends to every other input. As such, the transformer naturally has quadratic complexity in the size $I$ of the input sequence.

However, there are some situations in which we might expect this input set to be extremely large. In NLP, we may wish to summarize long documents or answer questions about a body of documents. In other modalities like vision or audio processing, the data can also be of extremely high dimension. In these circumstances, the quadratic complexity of the attention mechanism can become the limiting factor and a sub-field has emerged that tries to address this bottleneck.

In this section, we review three lines of work. First, we discuss methods that aim to reduce the size of the attention matrix. Second, we review approaches that introduce sparsity into the attention matrix. Finally, we present methods that treat the self-attention computation as a kernel function and try to approximate this to create algorithms with linear complexity in the sequence length.

One simple idea to make self-attention more efficient is to reduce the size of the attention matrix. In *memory compressed attention*, a strided convolution is applied to the keys and values so the self-attention operation becomes:

\begin{equation}

\bf Sa[\mathbf{X}] = \bf Softmax\left[\mathbf{X}\boldsymbol\Phi_{q}(\boldsymbol\theta_{k}\circledast\mathbf{X}\boldsymbol\Phi_{k})^{T} \right](\boldsymbol\theta_{v}\circledast\mathbf{X}\boldsymbol\Phi_{v}), \tag{14}

\end{equation}

where $\boldsymbol\theta_{k}$ and $\boldsymbol\theta_{v}$ are the convolution kernels. If the stride $s$ is the same as the kernel size, then the effect is to take a learned weighted average of nearby key/value vectors and the resulting attention matrix reduces to size $I\times I/s$ (figure 5).
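The following NumPy sketch illustrates equation 14 for the case where the stride equals the kernel size. As a simplifying assumption, each kernel is a vector of $s$ weights shared across all channels, so the convolution reduces to a weighted sum over non-overlapping windows; the function names are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def strided_pool(Z, theta):
    """Weighted sum over non-overlapping windows of len(theta) rows."""
    s = len(theta)
    I, D = Z.shape
    assert I % s == 0, "sketch assumes the stride divides the sequence length"
    return np.einsum("s,gsd->gd", theta, Z.reshape(I // s, s, D))

def memory_compressed_attention(X, Phi_q, Phi_k, Phi_v, theta_k, theta_v):
    Q = X @ Phi_q                            # (I, Dk)
    K = strided_pool(X @ Phi_k, theta_k)     # (I/s, Dk) compressed keys
    V = strided_pool(X @ Phi_v, theta_v)     # (I/s, Dv) compressed values
    A = softmax(Q @ K.T)                     # (I, I/s) attention matrix
    return A @ V, A
```

The attention matrix returned here has $I$ rows but only $I/s$ columns, which is where the saving comes from.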

The Linformer applies a very similar trick that is motivated by the observation that the self-attention mechanism is often low-rank in practice. Consequently, we can reduce the complexity of the calculation by projecting the keys and values into a learned subspace:

\begin{equation}

\bf Sa[\mathbf{X}] = \bf Softmax\left[\mathbf{X}\boldsymbol\Phi_{q}(\boldsymbol\Psi_{k}\mathbf{X}\boldsymbol\Phi_{k})^{T} \right](\boldsymbol\Psi_{v}\mathbf{X}\boldsymbol\Phi_{v}), \tag{15}

\end{equation}

where $\boldsymbol\Psi_{k}$ and $\boldsymbol\Psi_{v}$ are the $I/s\times I$ projection matrices for the keys and values respectively.
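Equation 15 can be sketched in a few lines of NumPy. This is an illustrative sketch with hypothetical names, not the Linformer code; the projections `Psi_k` and `Psi_v` are simply passed in as arrays of shape `(I/s, I)`.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(X, Phi_q, Phi_k, Phi_v, Psi_k, Psi_v):
    Q = X @ Phi_q                  # (I, Dk)
    K = Psi_k @ (X @ Phi_k)        # (I/s, Dk) keys projected along the sequence
    V = Psi_v @ (X @ Phi_v)        # (I/s, Dv) values projected along the sequence
    A = softmax(Q @ K.T)           # (I, I/s) attention matrix
    return A @ V, A
```

One way to see the connection to memory compressed attention: if the projection is block-structured, for example `np.kron(np.eye(I // s), theta[None, :])` for a weight vector `theta` of length $s$, then each row of the projection averages one window of $s$ neighbouring keys, which matches a strided convolution whose kernel is shared across channels.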

Another approach to making attention more computationally efficient is to constrain the attention computation so that every input does not attend to every other input. In *local attention* the inputs are divided into disjoint groups of neighbours and each block is passed through a separate self-attention mechanism before recombining (figure 6). In this way, inputs within the same block only attend to one another. Of course, this has the disadvantage that elements that are far from each other in the sequence never interact with one another, but alternating transformer layers that use local and full attention solves this problem.

Local attention can be visualized by plotting a matrix showing interaction of the queries and keys (figure 6). Note that for the decoder version, we also employ masked self-attention so each query can only attend to keys that have the same index or less and there are no interactions in the upper triangular portion.

Visualizing attention in this way leads naturally to the idea of using a convolutional structure (figure 7), in which each input only interacts with the nearest few inputs (or nearest preceding inputs for decoders). When used alone, this will mean that it may take many layers for information to propagate along the sequence. Again, this drawback can be remedied by alternating layers with the convolutional attention patterns and layers with full attention. Indeed, this is what is done in GPT-3. A different approach that maintains the overall sparsity is to use dilated convolutions with different dilation rates in different layers (figure 7b-c), or to introduce layers where a few of the queries interact with every key (figure 7d). Collectively, these methods are referred to as *sparse transformers*.
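The local and convolutional patterns described above can be written as boolean masks over the query-key matrix. The sketch below is illustrative only; the function names and the `width`, `dilation` and `causal` parameters are assumptions for this example. It also only demonstrates the pattern: an efficient implementation would avoid computing the masked entries at all.

```python
import numpy as np

def local_attention_mask(I, block, causal=False):
    """True where query i and key j fall in the same block of `block` inputs."""
    blocks = np.arange(I) // block
    mask = blocks[:, None] == blocks[None, :]
    if causal:                                  # decoder: keys with index <= i
        mask &= np.tril(np.ones((I, I), dtype=bool))
    return mask

def convolutional_mask(I, width, dilation=1, causal=False):
    """True where key j is within `width` dilated steps of query i."""
    offset = np.arange(I)[:, None] - np.arange(I)[None, :]    # i - j
    mask = (np.abs(offset) <= width * dilation) & (offset % dilation == 0)
    if causal:
        mask &= offset >= 0
    return mask

def masked_attention(logits, mask):
    """Softmax over keys, with disallowed query-key pairs set to zero weight."""
    logits = np.where(mask, logits, -np.inf)
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)
```

Every row of both masks includes the diagonal, so each query always has at least one key to attend to and the softmax is well defined.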

The *Longformer* also used a convolutional structure which is sometimes dilated, but simultaneously allowed some keys and queries to interact with all of the others (figure 9a). This was referred to as *global attention* and the positions correspond to special tokens such as the $<$cls$>$ token in BERT or special tokens in question answering tasks that delimit the question and answer. Note that global attention can only be used in encoder models since elements attend to every other element and hence see ahead in the sequence.

A natural extension of this method is to define some new content embeddings which attend to all of the keys and queries, but do not themselves correspond to any individual tokens in the input (figure 9). This is known as the *extended transformer construction (ETC)*. These additional global content embeddings act as a kind of memory, which can both receive and broadcast information from all of the elements and are combined with a sparse convolutional pattern which ensures strong interactions between nearby inputs. The *BigBird* model took this idea one step further by also adding sparse random connections between elements to help ensure the rapid mixing of information from different parts of the sequence.

One notable complication of using global content embeddings occurs if they are combined with relative attention; there is no relative offset between the global and regular elements, and so special relative position embeddings are learned for mapping to, from, and between the global content embeddings.

In this section we have reviewed approaches that make self-attention more efficient, by limiting the interaction between different inputs. Note that all of these methods use pre-defined sparsity patterns. There is also another line of research that attempts to learn the sparsity pattern. This includes the routing transformer, reformer and Sinkhorn transformer.

A third approach to making self-attention more efficient is to approximate the attention computation using kernel methods. The premise is that the dot-product attention for the $i^{th}$ query can be thought of as a special case of the following computation:

\begin{equation}

\mathbf{x}_{i}^{\prime} = \frac{\sum_{j=1}^{I}\mbox{sim}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}]\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\sum_{j=1}^{I}\mbox{sim}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}]} \tag{16}

\end{equation}

where $\mbox{sim}[\bullet,\bullet]$ returns a measure of similarity between the two arguments. For dot-product self-attention, this is defined as $\mbox{sim}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}] = \exp[\mathbf{x}_{i}\boldsymbol\Phi_{q}(\mathbf{x}_{j}\boldsymbol\Phi_{k})^{T}]$.

We now treat this similarity as a kernel function, and as such it can be expressed as the dot product of non-linear transformations $\bf z[\bullet]$ of the inputs

\begin{equation}

\mbox{sim}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}] = \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}, \tag{17}

\end{equation}

which means that the output becomes:

\begin{eqnarray}

\mathbf{x}_{i}^{\prime} &=& \frac{\sum_{j=1}^{I}\bf z [\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z [\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\sum_{j=1}^{I}\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}}\nonumber \\

&=&\frac{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\sum_{j=1}^{I}\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\sum_{j=1}^{I}\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}}, \tag{18}

\end{eqnarray}

where we have used the associativity property of matrix multiplication between the first and second lines.

If we could find $\bf z[\bullet]$ such that $\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T} = \exp[\mathbf{x}_{i}\boldsymbol\Phi_{q}(\mathbf{x}_{j}\boldsymbol\Phi_{k})^{T}]$, then this would be much more efficient: we compute the terms in the sums once and then compute each output $\mathbf{x}_{i}^{\prime}$ separately with a matrix multiplication. It turns out that such a non-linear transform $\bf z[\bullet]$ does indeed exist, but unfortunately, it maps the argument to an infinite dimensional space. From a computational viewpoint, this is not very helpful!

We'll describe two approaches that sidestep this problem. First, the *linear transformer* implicitly uses a different measure of similarity $\bf sim[\mathbf{a},\mathbf{b}] = \bf z[\mathbf{a}]\bf z[\mathbf{b}]^{T}$ by defining a function $\bf z[\bullet]$ which is more tractable. In particular, they use $\bf z[\mathbf{a}] = \bf elu[\mathbf{a}]+1$ where $\bf elu[\bullet]$ is the exponential linear unit which is a pointwise non-linearity. Second, the *performer* attempts to approximate the standard dot-product similarity using a finite dimensional mapping $\bf z[\bullet]$. The latter approach is empirically more successful, but this may be because the tricks for training transformers (see part III of this blog) do not transfer effectively to using a different similarity measure.
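As a minimal sketch of the first approach, the NumPy code below uses the feature map $\bf z[\mathbf{a}] = \bf elu[\mathbf{a}]+1$ and computes the output in two ways: via the explicit $I\times I$ similarity matrix (equation 16) and via the factorized form (equation 18). The function names are hypothetical, and this is an illustration of the algebra rather than the linear transformer implementation.

```python
import numpy as np

def feature_map(a):
    """z[a] = elu[a] + 1, applied pointwise; the result is always positive."""
    return np.where(a > 0, a + 1.0, np.exp(np.minimum(a, 0.0)))

def quadratic_attention(X, Phi_q, Phi_k, Phi_v):
    """Equation 16 with sim[a, b] = z[a] z[b]^T: builds the full I x I matrix."""
    Zq, Zk, V = feature_map(X @ Phi_q), feature_map(X @ Phi_k), X @ Phi_v
    sim = Zq @ Zk.T                                   # (I, I)
    return (sim @ V) / sim.sum(axis=1, keepdims=True)

def linear_attention(X, Phi_q, Phi_k, Phi_v):
    """Equation 18: the sums over j are computed once and shared by all queries."""
    Zq, Zk, V = feature_map(X @ Phi_q), feature_map(X @ Phi_k), X @ Phi_v
    A = Zk.T @ V              # (Dk, Dv): sum_j z[x_j Phi_k]^T x_j Phi_v
    b = Zk.sum(axis=0)        # (Dk,):    sum_j z[x_j Phi_k]^T
    return (Zq @ A) / (Zq @ b)[:, None]
```

The two functions return the same result, but the second never forms an $I\times I$ matrix, so its cost grows linearly with the sequence length.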

These approaches can also be adapted to decoders. Here, when we calculate the output corresponding to input $\mathbf{x}_{i}$ we only use the partial sums up to index $i$:

\begin{eqnarray}

\mathbf{x}_{i}^{\prime} &=& \frac{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\sum_{j=1}^{i}\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\sum_{j=1}^{i}\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}} \nonumber \\

&=& \frac{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\mathbf{A}_{i}}{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\mathbf{b}_i}, \tag{19}

\end{eqnarray}

where $\mathbf{A}_{i}$ and $\mathbf{b}_{i}$ represent the partial sums in the numerator and denominator respectively. If we initialize $\mathbf{A}_{0}$ and $\mathbf{b}_{0}$ to zero, then we can compute all the terms efficiently by iterating:

\begin{eqnarray}\label{eq:transformer_rnn}

\mathbf{A}_{i}&\leftarrow&\mathbf{A}_{i-1}+ \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{i}\boldsymbol\Phi_{v}\nonumber \\

\mathbf{b}_{i}&\leftarrow&\mathbf{b}_{i-1}+ \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{k}]^{T}\nonumber \\

\mathbf{x}_{i}^{\prime}&\leftarrow& \frac{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\mathbf{A}_{i}}{\bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\mathbf{b}_i}. \tag{20}

\end{eqnarray}

In conclusion, if we consider the interaction between the queries and keys to be a kernel function, we can replace this by the dot product of non-linear functions of the key and query. This leads naturally to a very efficient implementation for both encoder and decoder architectures.
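The decoder recursion can be sketched as a simple loop. The code below is an illustrative NumPy version with hypothetical names; it takes the already-transformed queries `Zq` and keys `Zk` (assumed positive, as with an elu-plus-one feature map) and the values `V`, and maintains the running sums $\mathbf{A}_{i}$ and $\mathbf{b}_{i}$.

```python
import numpy as np

def causal_linear_attention(Zq, Zk, V):
    """Iterate the running sums A_i and b_i, one sequence position at a time.

    Zq, Zk: (I, Dk) feature-mapped queries and keys (assumed positive).
    V:      (I, Dv) values.
    """
    I, Dk = Zq.shape
    A = np.zeros((Dk, V.shape[1]))     # running sum of z[k_j]^T v_j for j <= i
    b = np.zeros(Dk)                   # running sum of z[k_j]^T     for j <= i
    out = np.zeros(V.shape)
    for i in range(I):
        A += np.outer(Zk[i], V[i])
        b += Zk[i]
        out[i] = (Zq[i] @ A) / (Zq[i] @ b)
    return out

def masked_quadratic_attention(Zq, Zk, V):
    """Reference: the same computation with an explicit lower-triangular mask."""
    sim = np.tril(Zq @ Zk.T)
    return (sim @ V) / sim.sum(axis=1, keepdims=True)
```

The loop stores only $\mathbf{A}_{i}$ and $\mathbf{b}_{i}$, whose sizes do not depend on the sequence length, yet it reproduces the masked computation exactly.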

In this section, we have reviewed three families of modifications that allow the self-attention mechanism to be extended to longer sequences without a quadratic increase in computation. To learn more about this area, consult this review paper.

In the previous sections, we have addressed the questions of how to encode position, and how to extend the transformer to longer sequence lengths. In this section, we shift gears and consider the relationship between the self-attention mechanism and other models. We'll also consider alternatives to the self-attention mechanism.

The first connection that we will draw is between the self-attention decoder and recurrent neural networks (RNNs). In the final part of the previous section, we re-interpreted the dot-product self-attention mechanism as a kernel function $\mbox{k}[\bullet, \bullet]$:

\begin{equation}

\mathbf{x}_{i}^{\prime} = \frac{\sum_{j=1}^{i}\mbox{k}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}]\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\sum_{j=1}^{i}\mbox{k}[\mathbf{x}_{i}\boldsymbol\Phi_{q}, \mathbf{x}_{j}\boldsymbol\Phi_{k}]} = \frac{\sum_{j=1}^{i} \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}\mathbf{x}_{j}\boldsymbol\Phi_{v}}{\sum_{j=1}^{i} \bf z[\mathbf{x}_{i}\boldsymbol\Phi_{q}]\bf z[\mathbf{x}_{j}\boldsymbol\Phi_{k}]^{T}}. \tag{21}

\end{equation}

This means that the kernel function can be replaced by the dot product of non-linear functions $\bf z[\bullet]$ of the queries and keys and this led to the iterative computation in equation 20.

Viewed in this light, the decoder has an obvious mapping to an RNN. Each state is processed sequentially and the quantities $\mathbf{A}_{i}$ and $\mathbf{b}_{i}$ from equation 20 form the hidden state (figure 10). However, it turns out that to exactly replicate dot-product self-attention requires the function $\bf z[\bullet]$ to map its arguments to an infinite dimensional space. Hence, it is perhaps unsurprising that the transformer architecture out-performs the RNN in practice.

A *hypernetwork* is a network that is used to predict the parameters of a second network that then performs the main task at hand. In part I of this tutorial, we already saw that the attention matrix can be interpreted as forming the weights of a network that maps the values to the outputs (figure 11). These weights are (i) non-negative, (ii) sparse (there is no interaction between the different dimensions of the values) and (iii) shared (the same weight is used for every dimension of the interaction between the $i^{th}$ value and the $j^{th}$ output). As such they form a hypernetwork with a particular structure.

Viewed from this perspective, we might consider other mechanisms than dot-product self attention to create these weights (figure 12). The *synthesizer* uses a multi-layer perceptron $\bf MLP[\bullet]$ to create each row of the $I\times I$ matrix from input $\mathbf{x}_{i}$. This row is then passed through the softmax function to create the attention weights:

\begin{eqnarray}

\mbox{Synthesizer}\left[\mathbf{X} \right] &=&\bf Softmax\left[\bf MLP[\mathbf{X}]\right] \mathbf{X}\boldsymbol\Phi_{v}\nonumber \\

&=&\bf Softmax\left[\bf Relu[\mathbf{X}\boldsymbol\Phi_{1}]\boldsymbol\Phi_{2}\right] \mathbf{X}\boldsymbol\Phi_{v}\nonumber

\end{eqnarray}

This is interesting since the rows of the attention matrix are no longer computed based on similarities between pairs of tokens, but just from each individual token alone. Surprisingly, it seems to work comparably well to the original dot-product self-attention mechanism.
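A minimal NumPy sketch of this computation is shown below, assuming a two-layer MLP with weight shapes `(D, H)` and `(H, I)` and no biases; the function name and these shapes are assumptions for illustration, not details taken from the synthesizer paper.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, Phi_1, Phi_2, Phi_v):
    """Row i of the attention matrix is predicted from x_i alone.

    X: (I, D); Phi_1: (D, H); Phi_2: (H, I); Phi_v: (D, Dv).
    """
    logits = np.maximum(X @ Phi_1, 0.0) @ Phi_2   # (I, I), Relu then linear
    A = softmax(logits)                           # each row sums to one
    return A @ (X @ Phi_v), A
```

Because `Phi_2` has $I$ columns, this formulation is tied to a fixed maximum sequence length, unlike dot-product self-attention.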

A similar idea can be used to generate an attention matrix with convolutional structure. This belongs to the family of *dynamic convolutions* in which the convolution weights are themselves determined by the data. Part of the network block in the paper *Pay less attention* uses this approach. One advantage of this scheme is that there is no need for a position encoding; the convolution weights are determined by all of the inputs, and if we permute them, the result will be different.

Finally, it should be noted that linear transformers are also closely related to *fast weight memory systems* which are intellectual forerunners of hypernetworks.

A different way to think about self-attention is as a *routing network*. The attention matrix distributes (routes) each of the $I$ computed value vectors to the $I$ outputs. From this viewpoint, there is a connection between self-attention and capsule networks. Roughly speaking, a capsule network is intended to capture hierarchical relations in images, so lower network levels might detect facial parts (noses, mouths), which are then combined (routed) in higher level capsules that represent a face. One major difference is that capsule networks use *routing by agreement*. In self-attention, the elements $\mathbf{x}_{i}$ compete with each other for how much they contribute to output $j$ (via the softmax operation). In capsule networks, the higher levels of the network compete with each other for inputs from the lower levels.

Once we consider self-attention as a routing network, we can ask the question of whether it is necessary to make this routing dynamic (i.e., dependent on the data). Another variant of the synthesizer removed the dependence of the attention matrix on the inputs entirely and either used pre-determined random values or learned values (figure 13a). This performed surprisingly well across a variety of tasks.

Graph convolutional networks consider each input vector $\mathbf{x}_{i}$ to be associated with a node on a known graph, and process these nodes through a series of layers in which each node interacts with its neighbours. As such they have a close relationship to self-attention; they can be viewed as routing networks, but here the routing is determined by the adjacency matrix of the graph (figure 13b) and not the data.

Graph attention networks (figure 13c) combine both mechanisms; the routing depends both on the data (although using additive attention, not dot-product attention) and the graph structure (which is used to mask the attention matrix in a similar way to masked self-attention in decoders).

Returning to the original self-attention mechanism, it is now clear that it can be viewed as a graph neural network on the complete graph, where the query tokens are the destination nodes and the key and value tokens are the source nodes.

Linear convolutions of the neighboring inputs in the sequence can be considered a special case of multi-head dot-product self attention with relative position embeddings. For example, consider using additive position embeddings so that the overall self-attention mechanism is given by:

\begin{equation}

{\bf Sa}[\mathbf{X}] =\bf Softmax\left[(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}+\boldsymbol\Pi\right]\mathbf{X}\boldsymbol\Phi_{v}, \tag{22}

\end{equation}

where the matrix $\boldsymbol\Pi$ has a different learned value $\pi_{i-j}$ for each signed offset $i-j$. Now consider setting $\boldsymbol\Phi_{q}=\boldsymbol\Phi_k = \mathbf{0}$ and $\boldsymbol\Phi_{v}=\mathbf{I}$ to yield:

\begin{equation}

{\bf Sa}[\mathbf{X}] =\bf Softmax\left[\boldsymbol\Pi\right]\mathbf{X}\nonumber.

\end{equation}

If we now choose the relative position contributions $\pi_{i-j}$ to be very large for one offset $i-j$ and small for all of the others, the overall effect will be to create an attention matrix with zeros everywhere except within a single diagonal offset by $i-j$ from the center, where the values will be one. When applied to the data $\mathbf{X}$, this has the effect of shifting the rows of the value matrix by the chosen offset. In a multi-head attention context, each head could learn a different offset. When the outputs of these heads are recombined using:

\begin{equation}

{\bf MhSa}[\mathbf{X}] = \left[{\bf Sa}_{1}[\mathbf{X}]\;{\bf Sa}_{2}[\mathbf{X}]\;\ldots\;{\bf Sa}_{H}[\mathbf{X}] \right]\boldsymbol\Phi_{c}, \tag{23}

\end{equation}

it is possible to choose $\boldsymbol\Phi_{c}$ so that all of the outputs from the $h^{th}$ self attention mechanism have the same weight and so we have effectively performed a convolution on the rows of $\mathbf{X}$.

To summarize, it is possible for a multi-head self attention with relative position embeddings to simulate convolution. This is particularly interesting when the transformer is applied to vision problems where convolutional networks are the standard. Indeed, there is some evidence that this is exactly what transformers are doing in vision tasks.
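The argument can be checked numerically. The sketch below builds one head per offset with $\boldsymbol\Phi_{q}=\boldsymbol\Phi_{k}=\mathbf{0}$ and $\boldsymbol\Phi_{v}=\mathbf{I}$, and chooses $\boldsymbol\Phi_{c}$ so that head $h$ contributes with a scalar weight $w_h$. The function names and the value `large=50.0` are assumptions for this illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def shift_attention(I, offset, large=50.0):
    """Softmax[Pi] where pi_{i-j} is large only when i - j == offset."""
    idx = np.arange(I)
    Pi = np.where(idx[:, None] - idx[None, :] == offset, large, 0.0)
    return softmax(Pi)

def conv_via_attention(X, offsets, weights):
    """Multi-head attention (equation 23) configured to act as a 1D convolution."""
    I, D = X.shape
    heads = [shift_attention(I, d) @ X for d in offsets]     # each head is (I, D)
    Phi_c = np.vstack([w * np.eye(D) for w in weights])      # (H*D, D)
    return np.hstack(heads) @ Phi_c
```

Away from the ends of the sequence, row $i$ of the output is $\sum_h w_h \mathbf{x}_{i-d_h}$ for head offsets $d_h$. Near the ends, a row whose target position falls outside the sequence has no large entry, so the softmax becomes uniform there; this boundary behaviour differs from zero padding.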

A notable characteristic of the self attention mechanism and related models is that the processing divides into two paths, one of which is later used to modify the other. In attention, this modification takes the form of pre-multiplication by the attention matrix. However, there is another family of models which use one path to just modulate the magnitude of the other.

The gated linear unit (figure 14a) is an example of such a gating mechanism. The input $\mathbf{X}$ has a linear transformation $\boldsymbol\Phi_{u1}$ applied to it and the result is passed through a pointwise sigmoid function $\bf Sig[\bullet]$. This maps the results to between zero and one so that they can be used to modulate the magnitude of the data $\mathbf{X}\boldsymbol\Phi_{u2}$ flowing down the other path, which has been subject to a different linear transformation. The whole function is hence:

\begin{equation}

\bf GLU[\mathbf{X}] = \bf Sig[\mathbf{X}\boldsymbol\Phi_{u1}]\odot \mathbf{X}\boldsymbol\Phi_{u2}. \tag{24}

\end{equation}

Although the architecture is superficially similar, this is not really equivalent to a transformer, as each input $\mathbf{x}_{i}$ (row of $\mathbf{X}$) is treated independently. The gated MLP addresses this by modifying the architecture to incorporate a learned linear transformation $\boldsymbol\Psi$ that combines together the different inputs:

\begin{equation}

\bf GMLP[\mathbf{X}] = (\bf Sig[\mathbf{X}\boldsymbol\Phi_{u1}]\odot \boldsymbol\Psi\mathbf{X}\boldsymbol\Phi_{u2})\boldsymbol\Phi_{v}. \tag{25}

\end{equation}

as well as a final linear transform $\boldsymbol\Phi_{v}$ that remaps to the original dimensionality. This model again has the advantage that it does not need a position encoding; the inputs are mixed using $\boldsymbol\Psi$ and if we permute their order, the output will not just be a permutation of the input.
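Equations 24 and 25 translate directly into NumPy. The sketch below is illustrative (function names are hypothetical, and biases and normalization are omitted); it also makes the difference in how the two models treat the input order easy to check.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def glu(X, Phi_u1, Phi_u2):
    """Equation 24: each row of X is gated independently of the others."""
    return sigmoid(X @ Phi_u1) * (X @ Phi_u2)

def gmlp(X, Phi_u1, Phi_u2, Psi, Phi_v):
    """Equation 25: Psi (I x I) mixes information across sequence positions."""
    return (sigmoid(X @ Phi_u1) * (Psi @ X @ Phi_u2)) @ Phi_v
```

Permuting the rows of `X` simply permutes the rows of the `glu` output, whereas for a generic `Psi` the `gmlp` output changes in a way that is not just a permutation, which is why no separate position encoding is needed.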

Finally, we'll consider the relationship between Hopfield networks and the attention mechanism. A Hopfield network can retrieve a stored memory based on a query via an iterative procedure in which the query is updated after interaction with the system. They were originally defined for binary vectors, but the modern Hopfield network extends the idea to continuous values.

Ramsauer et al. (2020) show that for a carefully defined Hopfield energy function, the update rule is equivalent to the self-attention mechanism. The most natural way to think of this is in terms of encoder-decoder attention. The decoder queries memories from the encoder network. If viewed as a Hopfield network, the query-key attention computes a single iteration of the memory retrieval. To complete the process, the output of the attention network should be fed back in as a new query until a stable state is reached (figure 15).
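The retrieval loop can be sketched as follows. This is a simplified illustration in which the stored memories act as both keys and values and there are no learned projections; the update takes the softmax-weighted form of the modern Hopfield rule, and the inverse temperature `beta`, the stopping tolerance, and the function names are assumptions for this example.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_retrieve(query, memories, beta=1.0, max_iters=10, tol=1e-6):
    """Repeatedly apply an attention-style update until the query stops changing.

    query:    (D,) initial state.
    memories: (M, D) stored patterns, one per row.
    """
    xi = query
    for _ in range(max_iters):
        # One attention step: weights over memories, then a weighted sum of them.
        new_xi = softmax(beta * (xi @ memories.T)) @ memories
        if np.linalg.norm(new_xi - xi) < tol:
            return new_xi
        xi = new_xi
    return xi
```

For example, with well-separated stored patterns, a query that is a slightly perturbed copy of one pattern is pulled back onto that pattern within a couple of iterations.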

In this blog, we have discussed extensions to the basic self-attention mechanism. First, we discussed how to incorporate positional information, and then how to extend the self-attention mechanism to longer sequences. Finally, we have discussed the relationship between self-attention and a number of other models, including RNNs, CNNs, graph convolutional networks and Hopfield networks. We note that some caution is required here. Recent work has suggested that many of the variations of the original model do not necessarily yield consistent performance benefits.

In part III of this blog, we discuss how to train transformers in practice. To make training stable, a number of tricks are required including unusual learning rate schedules, various forms of normalization, and careful initialization.

1 In fact, they also modified the value terms in a similar way, although their ablation study suggested that this did not contribute much.

In the technical demo, users get to see this interactive system at work.

The value proposition of a project like this is about democratizing data-driven insights by enabling non-technical users to interact with structured data, using natural language.

"Today, a lot of potentially useful knowledge and insights is trapped in databases, and only technical users can access that information, typically by using SQL. Turing by Borealis AI’s database interface unlocks these insights for non-technical users, who can query the multitude of databases using natural language and get the results and insights they need."

- Yanshuai Cao, Senior Research Team Lead at Borealis AI

Turing by Borealis AI comes closer than most of the technology available today, achieving and holding state-of-the-art performance levels while reducing accuracy issues. Such cross-domain text-to-SQL semantic parsers generally have serious accuracy and usability problems, making practical applications a challenge. Unlike in online search, where approximate answers can be good enough, when users query relational databases to glean specific insights, a high degree of accuracy is needed to provide value. With Turing by Borealis AI’s technology, a user can look at multiple hypotheses and, with the help of the explanation Turing by Borealis AI provides, figure out which of the SQL queries comes closest to the search intent.

- SQL responses are explained in plain English to help with evaluating and understanding the results, which helps non-technical users select the appropriate SQL query from the highest-ranked options.
- Trained on 100+ databases, it can generalize to new, never-seen-before databases to answer natural language questions.
- Equipped with a state-of-the-art cross-domain semantic parser (we will be releasing the core of the semantic parser in August)
- Its text-to-SQL framework is evaluated on the Spider benchmark, placing among the top performing frameworks. Turing has achieved the record of the best performance in 2020 and held it for much of 2020/21.

Here’s a sample use case: Let's say a non-technical user is in the business of delivering supplies to gas stations. The user wants to query available databases and find out which stations to contact next, in order to grow the business. How would the user get these business insights, without relying on SQL to do the search across available databases? With Turing by Borealis AI, users can start the search by picking the ‘gas station domain’ and ask: "What are the locations with gas stations owned by companies making over 100 billion in sales?" Under the hood, there is a deep learning model that treats the text-to-SQL problem as graph-to-tree mapping and produces a SQL query, executing it against the database to return the results.

Turing by Borealis AI generates SQL and uses a synchronous context-free grammar system to provide a high-precision explanation, so that users can make sure the results are trustworthy and match the intent.

Learn more about cross-database text-to-SQL in this blog, with further details on Turing by Borealis AI in this paper and here.

The team is presenting Turing by Borealis AI and related works: two main conference papers, one demo paper and one workshop paper at the joint conference of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021) on August 1-6, 2021. The team is also aiming to release the core of its semantic parsing at that time.

Each blog in this series of three focuses on a different aspect of the transformer. In Part I, we introduce self-attention, which is the core mechanism that underpins the transformer architecture. We then describe transformers themselves and how they can be used as encoders, decoders, or encoder-decoders using well-known examples such as BERT and GPT3. This discussion will be suitable for someone who knows machine learning, but who is not familiar with the transformer.

Part II considers how to adapt the transformer to cope with longer sequences, different methods for encoding the positions of elements in the sequence, and other modifications to the basic architecture. We also discuss the relationship between the transformer and other models. This will be suitable for a reader who knows the basics about transformers and wants to learn more.

Transformer models are difficult to train from scratch in practice. Part III details the tricks that are required to ensure that training does not fail. We conclude with a discussion of our recent work on how to modify the training procedure to fine-tune deep transformers when only sparse training data is available. This discussion will be suitable for practitioners who want to learn more about how to work effectively with transformers.

To motivate the transformer, consider the following passage:

The restaurant refused to serve me a ham sandwich, because it only cooks vegetarian food. In the end, they just gave me two slices of bread. Their ambience was just as good as the food and service.

We would like to build a network that can process this passage into a representation that is suitable for downstream tasks. For example, we might want to classify the review as positive or negative, or answer questions such as "Does the restaurant serve steak?". Two problems immediately present themselves:

First, the input representation will be large. Typically, we might describe each of the 37 words with an embedding vector of length 1024 and so the network input will be of length $37 \times 1024 = 37888$ even for this small passage. A more realistically sized input might have hundreds or even thousands of words. It's not clear that a standard fully-connected network would be practical here; it would need a very large number of parameters, and it's not obvious how to adapt such a network to inputs containing different numbers of words. This suggests the need for some kind of parameter sharing that is analogous to the use of convolutions in image processing.

Second, language is fundamentally ambiguous; it is not clear from the syntax alone that the pronoun it refers to the restaurant and not the ham sandwich. To fully understand the text, the word it should somehow be connected to the word restaurant. In the parlance of transformers, the former word should pay *attention* to the latter. This implies that there must be connections between the words, and that the strength of these connections will depend on the words themselves. Moreover, these connections need to extend across large spans of the text; the word their in the last sentence also refers to the restaurant.

In conclusion, we have argued that a model that can process real world text (i) will use parameter sharing so that it can cope with long input passages of differing lengths, and (ii) will contain connections between word representations that depend on the words themselves. The transformer acquires both of these properties by using *dot-product self-attention*.

A standard neural network layer $\bf nn[\bullet]$ takes a $D\times 1$ input $\mathbf{x}$ and applies a linear transformation followed by a static non-linearity like a rectified linear unit (ReLU)

\begin{equation}

\bf nn[\mathbf{x}] = \bf ReLU[\boldsymbol\Phi\tilde{\mathbf{x}}], \tag{1}

\end{equation}

to return a modified output vector. Here, the notation $\tilde{\mathbf{x}}$ indicates that we have appended the constant value 1 to the end of $\mathbf{x}$ so that the parameter matrix $\boldsymbol\Phi$ can also represent the offsets in the linear transformation. For simplicity, we'll assume that we use this trick every time we apply a linear transformation and just write $\boldsymbol\Phi\mathbf{x}$ from now on.

In contrast, a self-attention block $\bf sa[\bullet]$ takes $I$ inputs $\mathbf{x}_{i}$, each of dimension $D\times 1$ and returns $I$ output vectors. In the context of NLP, each of the inputs $\mathbf{x}_{i}$ will represent a word or part of a word. For input $\mathbf{x}_{i}$, the self-attention block returns the weighted sum:

\begin{equation}

\mbox{sa}[\mathbf{x}_{i}] = \sum_{j=1}^{I}a[\mathbf{x}_{i}, \mathbf{x}_{j}]\boldsymbol\Phi_v \mathbf{x}_{j}. \tag{2}

\end{equation}

The sum is over all of the inputs $\{\mathbf{x}_{i}\}_{i=1}^{I}$ after applying the same linear transformation $\boldsymbol\Phi_{v}$ to each. We will refer to the parameters $\boldsymbol\Phi_{v}$ as *value weights* and the product $\boldsymbol\Phi_v \mathbf{x}_{i}$ as computing the *values* for the $i^{th}$ input. These values are weighted by the terms $a[\mathbf{x}_{i}, \mathbf{x}_{j}]$ which are scalars that represent the *attention* of input $\mathbf{x}_{i}$ to input $\mathbf{x}_{j}$.

In the following sections, we will look at this in more detail by breaking this computation down into two parts. First we'll consider the computation of the values and their subsequent weighting as described in equation 2. Then we'll describe how to compute the attention weights $a[\mathbf{x}_{i}, \mathbf{x}_{j}]$.

The same value weights $\boldsymbol\Phi_{v}$ are applied to each input $\mathbf{x}_{i}$ and because of this parameter sharing, far fewer parameters are required than if we had used a fully-connected network (figure 1). Moreover, this part of the computation is easy to extend to different sequence lengths.

The attention weights $a[\mathbf{x}_{i}, \mathbf{x}_{j}]$ combine the values from different inputs. They are also sparse in a sense, since there is only one weight for each ordered pair of inputs $(\mathbf{x}_{i},\mathbf{x}_{j})$, regardless of the size of these inputs. It follows that the number of attention weights increases with the square of the sequence length $I$, but is independent of the length $D$ of each input $\mathbf{x}_{i}$.

In the previous section, we saw that the outputs are the result of two chained linear transformations; the values $\boldsymbol\Phi_{v}\mathbf{x}_{i}$ are computed independently for each input $\mathbf{x}_{i}$ and these vectors are combined linearly by the attention weights $a[\mathbf{x}_{i},\mathbf{x}_{j}]$. However, the overall self-attention computation is non-linear because the attention weights are themselves non-linear functions of the input.

More specifically, the attention weight $a[\mathbf{x}_{i},\mathbf{x}_{j}]$ depends on the dot-product $(\boldsymbol\Phi_{q}\mathbf{x}_{i})^{T}\boldsymbol\Phi_{k}\mathbf{x}_{j}$ between $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ *after* each has been transformed by a different linear transformation, $\boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$ respectively. To complete the computation of the attention weight, these dot-product similarities are passed through a softmax function:

\begin{eqnarray}\label{eq:sattention2}

a[\mathbf{x}_{i},\mathbf{x}_{j}] &=& \mbox{softmax}_{j}\left[(\boldsymbol\Phi_{q}\mathbf{x}_{i})^{T}\boldsymbol\Phi_{k}\mathbf{x}_{j} \right]\nonumber\\

&=& \frac{\exp\left[(\boldsymbol\Phi_{q}\mathbf{x}_{i})^{T}\boldsymbol\Phi_{k}\mathbf{x}_{j} \right]}{\sum_{j'=1}^{I}\exp\left[(\boldsymbol\Phi_{q}\mathbf{x}_{i})^{T}\boldsymbol\Phi_{k}\mathbf{x}_{j'} \right]} \tag{3}

\end{eqnarray}

and so for each $\mathbf{x}_{i}$ they are positive and sum to one (figure 2). For obvious reasons, this is known as *dot-product self-attention*.

The vectors $\boldsymbol\Phi_{q}\mathbf{x}_{i}$ and $\boldsymbol\Phi_{k}\mathbf{x}_{i}$ are known as the *queries* and *keys* respectively. These names were inherited from the field of information retrieval and have the following interpretation: the output for input $\mathbf{x}_{i}$ receives a weighted sum of values $\boldsymbol\Phi_v \mathbf{x}_{j}$, where the weights $a[\mathbf{x}_{i}, \mathbf{x}_{j}]$ depend on the similarity between the query vector $\boldsymbol\Phi_q \mathbf{x}_{i}$ and the key vector $\boldsymbol\Phi_k \mathbf{x}_{j}$.

To summarize, we see that for input $\mathbf{x}_{i}$, the output is a weighted sum of the same linear transformation $\boldsymbol\Phi_{v}$ of all of the inputs, where these weights are positive and sum to one. The weights depend on a measure of similarity between input $\mathbf{x}_{i}$ and the other inputs. The computation as a whole is non-linear due to the dot-product and softmax operation used to compute these weights. Consequently, there is no need for a pointwise non-linearity like a ReLU.

Note that this mechanism fulfils the requirements that we laid out earlier. First, there is a single shared set of parameters $\boldsymbol\Phi_{v},\boldsymbol\Phi_{q},\boldsymbol\Phi_{k}$. This is independent of the number of inputs $I$ and so the network can be applied to different sequence lengths. Second, the connections between the inputs (words) depend on the input representations themselves via the computed attention values.
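To make equations 2 and 3 concrete, here is a minimal NumPy sketch that computes the output for a single input $\mathbf{x}_{i}$. The function name, the toy sizes, and the random parameters are illustrative assumptions rather than part of any particular library.

```python
import numpy as np

def self_attention_single(i, xs, phi_q, phi_k, phi_v):
    """Output for input i, following equations 2 and 3 (column-vector convention)."""
    query = phi_q @ xs[i]
    # One dot-product similarity per input j, between query i and key j.
    scores = np.array([query @ (phi_k @ x_j) for x_j in xs])
    scores = scores - scores.max()   # leaves the softmax unchanged, avoids overflow
    weights = np.exp(scores) / np.exp(scores).sum()
    values = np.stack([phi_v @ x_j for x_j in xs])   # one value vector per input
    return weights @ values, weights

rng = np.random.default_rng(0)
I, D = 4, 3   # toy sizes, chosen only for illustration
xs = rng.normal(size=(I, D))
phi_q, phi_k, phi_v = (rng.normal(size=(D, D)) for _ in range(3))
out, weights = self_attention_single(0, xs, phi_q, phi_k, phi_v)
```

The same three parameter matrices are reused for every input, so nothing in the function depends on the sequence length $I$.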

The above computation can be written in a more compact form if we assume that the $I$ inputs $\mathbf{x}_{i}$ form the rows of the $I\times D$ matrix $\mathbf{X}$:

\begin{equation}

\mbox{Sa}[\mathbf{X}] = \mbox{Softmax}[\mathbf{X}\boldsymbol\Phi_{q}(\mathbf{X}\boldsymbol\Phi_{k})^{T}]\mathbf{X}\boldsymbol\Phi_{v}. \tag{4}

\end{equation}

where the function $\mbox{Softmax}[\bullet]$ takes a matrix and performs the softmax operation independently on each of its rows (figure 3). Note that here the matrices $\boldsymbol\Phi_{v}, \boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$ are the transposes of those in the original formulation.
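The matrix form can be sketched in a few lines of NumPy. The code below is an illustration under assumed toy dimensions; it also checks numerically that row $i$ of the matrix result agrees with the per-input form when the parameter matrices are transposed, as the note above states.

```python
import numpy as np

def softmax_rows(z):
    # Softmax applied independently to each row.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def self_attention(X, phi_q, phi_k, phi_v):
    """Equation 4: X is I x D and the parameters act on the right (row-vector convention)."""
    attention = softmax_rows((X @ phi_q) @ (X @ phi_k).T)   # I x I attention weights
    return attention @ (X @ phi_v), attention

rng = np.random.default_rng(1)
I, D = 5, 3
X = rng.normal(size=(I, D))
phi_q, phi_k, phi_v = (rng.normal(size=(D, D)) for _ in range(3))
out, attention = self_attention(X, phi_q, phi_k, phi_v)

# Per-input form for one row, using the transposed parameter matrices.
i = 2
scores_i = np.array([(phi_q.T @ X[i]) @ (phi_k.T @ X[j]) for j in range(I)])
a_i = np.exp(scores_i - scores_i.max())
a_i = a_i / a_i.sum()
out_i = sum(a_i[j] * (phi_v.T @ X[j]) for j in range(I))
```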

In the previous section, we described the dot-product self-attention mechanism. Here, we introduce three extensions that are all almost always used in practice.

Observant readers will have noticed that the above mechanism loses some important information; the computation will be the same, regardless of the order of the inputs $\mathbf{x}_{i}$. However, if the inputs correspond to the words in a sentence, it's clear that the order matters. To incorporate information about position, we add a matrix $\boldsymbol\Pi$, which is the same size as the input matrix and encodes this information.

The position matrix $\boldsymbol\Pi$ may either be chosen manually or learned. It may be added to the initial word embeddings only or it may be added at every layer of the network. Sometimes it is only added to $\mathbf{X}$ in the computation of the queries and keys. The contents of this matrix and other variations will be discussed in detail in part II of this blog; however, the main idea is that there is a unique vector added to each input $\mathbf{x}_{i}$ that lets the system know its position in the sequence.
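The loss of order information is easy to verify numerically. In the sketch below, shuffling the rows of the input simply shuffles the rows of the output, whereas adding a position matrix breaks this symmetry. The random matrix `Pi` is a hypothetical stand-in for a real positional encoding, and the permutation is arbitrary.

```python
import numpy as np

def self_attention(X, phi_q, phi_k, phi_v):
    scores = (X @ phi_q) @ (X @ phi_k).T
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)) @ (X @ phi_v)

rng = np.random.default_rng(2)
I, D = 5, 4
X = rng.normal(size=(I, D))
params = [rng.normal(size=(D, D)) for _ in range(3)]
perm = np.array([2, 0, 4, 1, 3])   # an arbitrary re-ordering of the inputs

# Without position information: permuting the inputs permutes the outputs.
out = self_attention(X, *params)
out_shuffled = self_attention(X[perm], *params)
order_blind = np.allclose(out[perm], out_shuffled)

# With a position matrix added, the same word in a new position gives a new output.
Pi = rng.normal(size=(I, D))       # hypothetical positional encoding
out_pos = self_attention(X + Pi, *params)
out_pos_shuffled = self_attention(X[perm] + Pi, *params)
order_aware = not np.allclose(out_pos[perm], out_pos_shuffled)
```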

The dot products in the attention computation may have very large magnitudes. This can move the arguments to the softmax function into a region where the largest value dominates to a large degree and consequently, the associated gradients are very small and the model becomes hard to train. To resolve this issue, it is typical to scale the computed attention values by the square root of dimension $d_{q}$ of the queries and keys (i.e., the number of columns in $\boldsymbol\Phi_{q}$ and $\boldsymbol\Phi_{k}$ which must be the same). This gives:

\begin{equation}

\mbox{Sa}[\mathbf{X}] =\mbox{Softmax}\left[\frac{(\mathbf{X}\boldsymbol\Phi_{q})(\mathbf{X}\boldsymbol\Phi_{k})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{v}. \tag{5}

\end{equation}

This is known as *scaled dot product self-attention*.
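The effect of the scaling can be seen in a small experiment. The sketch below assumes queries and keys with independent unit-variance entries, in which case the dot products have a standard deviation of roughly $\sqrt{d_{q}}$; the sizes are chosen only for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
d_q, I = 512, 8
query = rng.normal(size=d_q)
keys = rng.normal(size=(I, d_q))

scores = keys @ query                       # magnitudes grow with d_q
unscaled = softmax(scores)                  # tends to put nearly all mass on one input
scaled = softmax(scores / np.sqrt(d_q))     # spreads the mass more evenly
```

Dividing by $\sqrt{d_{q}}$ always lowers the largest attention weight (unless all scores are equal), which keeps the softmax away from the saturated region where its gradients vanish.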

Practitioners usually apply multiple self-attention mechanisms in parallel, and this is known as *multi-head self attention*. The $h^{th}$ self-attention mechanism or *head* can be written as:

\begin{equation}

\mbox{Sa}_{h}[\mathbf{X}] =\mbox{Softmax}\left[\frac{(\mathbf{X}\boldsymbol\Phi_{qh})(\mathbf{X}\boldsymbol\Phi_{kh})^{T}}{\sqrt{d_{q}}}\right]\mathbf{X}\boldsymbol\Phi_{vh}. \tag{6}

\end{equation}

where we have different parameters $\boldsymbol\Phi_{qh}$, $\boldsymbol\Phi_{kh}$ and $\boldsymbol\Phi_{vh}$ for each head. The outputs of these self-attention mechanisms are concatenated and another linear transform $\boldsymbol\Phi_{c}$ is applied to combine them (figure 4):

\begin{equation}

\mbox{MhSa}[\mathbf{X}] = \left[\mbox{Sa}_{1}[\mathbf{X}]\;\mbox{Sa}_{2}[\mathbf{X}]\;\ldots\;\mbox{Sa}_{H}[\mathbf{X}] \right]\boldsymbol\Phi_{c}. \tag{7}

\end{equation}

This appears to be necessary to make the transformer work well in practice. It has been speculated that multiple heads make the self-attention network more robust to bad initializations. The fact that trained models only seem to depend on a subset of the heads lends credence to this speculation.
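A minimal sketch of equations 6 and 7 follows. The head count, the dimensions, and the choice to split the embedding size evenly across heads are assumptions made for the example.

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def multi_head_self_attention(X, heads, phi_c):
    """`heads` is a list of (phi_q, phi_k, phi_v) tuples, one per head."""
    outputs = []
    for phi_q, phi_k, phi_v in heads:
        d_q = phi_q.shape[1]
        attention = softmax_rows((X @ phi_q) @ (X @ phi_k).T / np.sqrt(d_q))
        outputs.append(attention @ (X @ phi_v))            # equation 6
    return np.concatenate(outputs, axis=1) @ phi_c         # equation 7

rng = np.random.default_rng(4)
I, D, H = 6, 8, 2
d_head = D // H
heads = [tuple(rng.normal(size=(D, d_head)) for _ in range(3)) for _ in range(H)]
phi_c = rng.normal(size=(H * d_head, D))
X = rng.normal(size=(I, D))
out = multi_head_self_attention(X, heads, phi_c)
```

Each head has its own attention pattern, and $\boldsymbol\Phi_{c}$ maps the concatenated head outputs back to the original embedding size so that layers can be stacked.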

Self-attention is just one part of a larger *transformer layer*. This layer consists of a multi-head self-attention unit (which allows the word representations to interact with each other) followed by a fully connected network $\mbox{mlp}[\mathbf{x}_{i}]$ (that operates separately on each word representation). Both of these units are residual networks (i.e., their output is added back to the original input). In addition, it is typical to add a LayerNorm operation after both the self-attention and fully connected networks. The complete layer can be described by the following series of operations:

\begin{eqnarray}

\mathbf{x} &\leftarrow& \mathbf{x} + \mbox{MhSa}[\mathbf{x}] \nonumber \\

\mathbf{x} &\leftarrow& \mbox{Layernorm}[\mathbf{x}] \hspace{3cm}\nonumber\\

\mathbf{x}_{i} &\leftarrow& \mathbf{x}_{i}+\mbox{mlp}[\mathbf{x}_{i}] \hspace{3.6cm}\forall\; i\in\{1\ldots I\}\nonumber\\

\mathbf{x} &\leftarrow& \mbox{Layernorm}[\mathbf{x}], \tag{8}

\end{eqnarray}

where the column vectors $\mathbf{x}_{i}$ are transposed and form the rows of the full data matrix $\mathbf{x}$ in the first stage. In a real system, the data would pass through a series of these layers.
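The four steps of equation 8 can be sketched as follows. This is a simplified illustration: the LayerNorm here omits the learned scale and offset, the fully connected network has a single hidden layer without offsets, and all sizes and random parameters are assumptions.

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def layer_norm(X, eps=1e-5):
    # Normalizes each row (token) to zero mean and roughly unit variance.
    mean = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def mhsa(X, heads, phi_c):
    outs = []
    for phi_q, phi_k, phi_v in heads:
        scores = (X @ phi_q) @ (X @ phi_k).T / np.sqrt(phi_q.shape[1])
        outs.append(softmax_rows(scores) @ (X @ phi_v))
    return np.concatenate(outs, axis=1) @ phi_c

def mlp(X, W1, W2):
    # Acts on every row independently, so tokens do not interact here.
    return np.maximum(X @ W1, 0.0) @ W2

def transformer_layer(X, heads, phi_c, W1, W2):
    X = layer_norm(X + mhsa(X, heads, phi_c))   # residual self-attention, then LayerNorm
    X = layer_norm(X + mlp(X, W1, W2))          # residual MLP, then LayerNorm
    return X

rng = np.random.default_rng(5)
I, D, H, D_hidden = 6, 8, 2, 32
heads = [tuple(rng.normal(size=(D, D // H)) for _ in range(3)) for _ in range(H)]
phi_c = rng.normal(size=(D, D))
W1, W2 = rng.normal(size=(D, D_hidden)), rng.normal(size=(D_hidden, D))
X = rng.normal(size=(I, D))
out = transformer_layer(X, heads, phi_c, W1, W2)
```

Because the output has the same shape as the input, the function can be applied repeatedly to build a deeper network.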

Now that we have a good understanding of self-attention and the transformer layer, let's walk through a typical modern NLP processing pipeline.

A text processing pipeline begins with a *tokenizer*. This splits the text into a *vocabulary* of smaller constituent units (tokens) that can be processed by the subsequent network. In the discussion above, we have implied that these are words, but there are several difficulties with this.

- It's inevitable that some words (e.g., names) will not be in the vocabulary.
- It's not clear how to handle punctuation, but this is important. If a sentence ends in a question mark, then we need to encode this information.
- The vocabulary would need different tokens for versions of the same word with different suffixes (e.g., walk, walks, walked, walking) and there is no way to make clear that these variations are related.

One approach would be just to use letters and punctuation marks as the vocabulary, but this would mean splitting text into a large number of very small parts and requiring the subsequent network to re-learn the relations between them.

In practice, a compromise between using letters and full words is used, and the final vocabulary will include both common words and short parts of words from which larger and less frequent words can be composed. The vocabulary is computed using a method such as *byte pair encoding* that uses ideas from text compression methods; essentially it greedily merges commonly-occurring sub-strings based on their frequency. This type of approach is known as a *sub-word tokenizer*.
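The greedy merging idea can be sketched in a few lines. This is a simplified illustration of byte pair encoding that works on characters and whole words; practical tokenizers add details such as end-of-word markers or byte-level symbols, and the toy word counts below are invented.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Greedily merge the most frequent adjacent symbol pair, num_merges times."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, k = [], 0
            while k < len(symbols):
                if k + 1 < len(symbols) and (symbols[k], symbols[k + 1]) == best:
                    merged.append(symbols[k] + symbols[k + 1])
                    k += 2
                else:
                    merged.append(symbols[k])
                    k += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

word_counts = {"walk": 5, "walks": 3, "walked": 2, "walking": 2, "talk": 4}
merges, vocab = learn_bpe(word_counts, num_merges=3)
```

After three merges on this toy corpus, `walk` has become a single token and the rarer forms are represented as `walk` plus a suffix, which is the kind of sharing between related words that a word-level vocabulary cannot express.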

Each different token within the vocabulary is mapped to a *word embedding*. Importantly, the same token always maps to the same embedding. These embeddings are learned along with the rest of the unknown parameters in the network. A typical embedding size is 1024 and a typical total vocabulary size is 30,000, and so even before the main network, there are a lot of parameters to learn.

These embeddings are then collected to form the rows of the input matrix $\mathbf{X}$ and the positional encoding $\boldsymbol\Pi$ may be added at this stage.

Finally, the input embedding matrix $\mathbf{X}$ is passed to a series of transformer layers, which we'll refer to as a *transformer network* from now on. There are three main types of transformer network. First, a transformer network can be used as an *encoder*. Here, the goal is to transform the text into a representation that can support a variety of language tasks, such as sentiment analysis or question answering. An example of an encoder model is the BERT model.

Second, a transformer network can be used as a *decoder*. Here, the goal of the network is to generate a new token that continues the input text. An example of a decoder model is GPT3.

Finally, transformer networks can be used to build *encoder-decoder models*. These are used for sequence-to-sequence tasks, which take one text string and convert it to another text string. For example, in machine translation, an input sentence in English might be processed by the encoder. The decoder then generates the translated sentence in French. An example of an encoder-decoder model is the one in the paper where transformers were first introduced.

We'll now consider each of these three variations in turn.

BERT is an encoder model that uses a vocabulary of 30,000 tokens. The tokens are converted to 1024-dimensional word embeddings and passed through 24 transformer layers. In each of these is a self-attention layer with 16 heads, and for each head the queries, keys, and values are of dimension 64 (i.e., the matrices $\boldsymbol\Phi_{vh},\boldsymbol\Phi_{qh},\boldsymbol\Phi_{kh}$ are of size $1024\times 64$). The dimension of the hidden layer in the neural network layer of the transformer is 4096. The total number of parameters is $\sim 340$ million. This sounds like a lot, but is tiny by modern standards.
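As a rough sanity check on the parameter count, we can add up the weight matrices implied by the figures above. This back-of-the-envelope sketch ignores offsets, LayerNorm parameters, positional embeddings, and any task-specific output layers, so it is expected to land slightly below the quoted total.

```python
vocab_size = 30_000
d_model = 1024
n_layers = 24
n_heads = 16
d_head = 64
d_hidden = 4096

embedding = vocab_size * d_model
qkv_per_layer = 3 * n_heads * d_model * d_head       # query, key, and value weights
combine_per_layer = (n_heads * d_head) * d_model     # matrix that recombines the heads
mlp_per_layer = 2 * d_model * d_hidden               # two linear maps in the MLP
per_layer = qkv_per_layer + combine_per_layer + mlp_per_layer
total = embedding + n_layers * per_layer
print(f"{total / 1e6:.0f} million weights")
```

This gives roughly 333 million weights, which is in line with the figure of about 340 million once the omitted terms are included.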

Encoder models are trained in two stages. During *pre-training*, the parameters of the transformer architecture are learned using *self-supervision* from a large corpus of text. The goal here is for the model to learn general information about the statistics of language. In the *fine-tuning stage*, the resulting network is adapted to solve a particular task, using a smaller body of supervised training data. We'll now discuss each of these stages in turn for the BERT model.

In the pre-training stage, the network is trained using self-supervision. This allows the use of enormous amounts of data, without the need for manual labels. For BERT, the self-supervision task consists of predicting missing words from sentences from a large internet corpus (figure 7)^{1}. During training, the maximum input length was 512 tokens and the batch size was 256. The system was trained for 1,000,000 steps, which is roughly 50 epochs of the 3.3 billion word corpus.

Trying to predict missing words forces the transformer network to understand something of the syntax of the language. For example, it might learn that the adjective red is often found before nouns like house or car but never before a verb like shout. It also allows the model to learn some superficial *common sense* about the world. For example, after training, the model will assign a higher probability to the missing word train in the sentence The <mask> pulled into the station, than it would to the word peanut. However, there are persuasive arguments that the degree of "understanding" that this type of model can ever have is limited.

In the fine-tuning stage, the parameters of the model are adjusted to specialize it to a particular task. This usually involves adding an extra layer on top of the transformer network, to convert the collection of vectors $\mathbf{x}_{1},\ldots \mathbf{x}_{I}$ associated with the input tokens to the desired format of the output. Examples include:

**Text classification:** In BERT, there is a special token known as the $<$cls$>$ token (short for classification token) that is placed at the start of each string during pre-training. For text classification tasks like sentiment analysis, the vector associated with this token is mapped to a single number and passed through a logistic sigmoid. This creates a number between 0 and 1 that can be interpreted as the probability that the sentiment is positive, and the system is fine-tuned to maximize the probability of the correct label (figure 8a).

**Word classification:** In named entity recognition, the goal is to classify each individual word as an entity type (e.g., person, place, organization, or no-entity). To this end, the vector $\mathbf{x}_{i}$ associated with each token in the input sequence is mapped to a $K\times 1$ vector of class probabilities, where $K$ is the number of entity types (figure 8a), and the system is fine-tuned to maximize the probabilities of the correct types (figure 8b).

**Text span prediction:** In the SQuAD 1.1 question answering task, both the question and a passage from Wikipedia containing the answer are input into the system. BERT is then used to predict the text span in the passage that contains the answer. Each token associated with the Wikipedia passage maps to two numbers, that indicate how likely it is that the text span begins and ends at this location. The resulting two sets of numbers are put through two softmax functions and the probability of any text span being the answer can then be derived by combining the probability of starting and ending at the appropriate places.
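The span prediction head can be sketched as follows. The code assumes that the probability of a span is the product of its start and end probabilities and that spans are capped at a maximum length; both choices, as well as the toy scores, are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def best_span(start_scores, end_scores, max_len=10):
    """Return the (start, end) token indices with the highest combined probability."""
    p_start = softmax(start_scores)   # one softmax over the start scores
    p_end = softmax(end_scores)       # a second softmax over the end scores
    best, best_prob = None, -1.0
    for s in range(len(p_start)):
        for e in range(s, min(s + max_len, len(p_end))):   # the end cannot precede the start
            prob = p_start[s] * p_end[e]
            if prob > best_prob:
                best, best_prob = (s, e), prob
    return best, best_prob

# Toy per-token scores for a five-token passage.
start_scores = np.array([0.1, 2.5, 0.3, 0.0, -1.0])
end_scores = np.array([0.0, 0.2, 0.1, 3.0, -0.5])
span, prob = best_span(start_scores, end_scores)
```

Here the most likely answer starts at token 1 and ends at token 3.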

In this section, we present a high-level description of GPT3 which is an example of a transformer decoder model. The basic architecture is extremely similar to the encoder model in that it consists of a series of transformer layers that operate on learned word embeddings. However, the goal is different. The encoder aimed to build a representation of the text that could be fine-tuned to solve a more specific NLP task. However, the decoder has one purpose which is to generate the next token in a provided sequence. By iterating this procedure, the model can produce a body of coherent text.

More specifically, GPT3 constructs a language model. For any sentence it aims to model the joint probability $Pr(t_1,t_2,\ldots t_{N})$ of the $N$ observed tokens and it does this by factorizing this joint probability into an auto-regressive sequence:

\begin{equation}

Pr(t_{1},t_{2},\ldots t_{N}) = \prod_{n=1}^{N}Pr(t_{n}|t_{1}\ldots t_{n-1}). \tag{9}

\end{equation}

This is easiest to understand with a concrete example. Consider the sentence It takes great personal courage to let yourself appear weak. For simplicity, let's assume that the tokens are the full words. The probability of the full sentence is:

$Pr$(It takes great personal courage to let yourself appear weak) $=$

$Pr$(It) $\cdot$ $Pr$(takes$|$It) $\cdot$ $Pr$(great$|$It takes) $\cdot$ $Pr$(personal$|$It takes great) $\cdot$

$Pr$(courage$|$It takes great personal) $\cdot$ $Pr$(to$|$It takes great personal courage) $\cdot$

$Pr$(let$|$It takes great personal courage to) $\cdot$

$Pr$(yourself$|$It takes great personal courage to let) $\cdot$

$Pr$(appear$|$It takes great personal courage to let yourself) $\cdot$

$Pr$(weak$|$It takes great personal courage to let yourself appear). (10)

This demonstrates the connection between the probabilistic formulation of the cost function and the next token prediction task.
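In log form, the factorization of equation 9 becomes a sum of next-token log probabilities. The sketch below uses a hypothetical stand-in model that assigns the same probability to every token in a 1000-token vocabulary; a real decoder would replace `uniform_model` with the network's prediction.

```python
import math

def sequence_log_prob(tokens, next_token_prob):
    """Sum of log Pr(t_n | t_1 ... t_{n-1}) over the sequence."""
    total = 0.0
    for n, token in enumerate(tokens):
        context = tokens[:n]               # only the tokens to the left
        total += math.log(next_token_prob(token, context))
    return total

def uniform_model(token, context):
    # Hypothetical model: ignores the context entirely.
    return 1.0 / 1000

sentence = "It takes great personal courage to let yourself appear weak".split()
log_prob = sequence_log_prob(sentence, uniform_model)
```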

When we train a decoder model, we aim to maximize the log-probability of the input text under the auto-regressive language model. Ideally, we would like to pass in the whole sentence and compute all of the log probabilities and their gradients simultaneously. However, this poses a problem; if we pass in the full sentence, then the term computing $\log$ $[$ $Pr$(great$|$It takes) $]$ will have access to both the answer great and also the right context personal courage to let yourself appear weak.

To see how to avoid this problem, recall that in a transformer network, the tokens only interact in the self-attention layers. This implies that the problem can be resolved by ensuring that the attention to the answer and the right context are zero. This can be achieved by setting the appropriate dot products to negative infinity before they are passed through the $\mbox{softmax}[\bullet]$ function. This idea is known as *masked self-attention*.
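A minimal single-head sketch of masked self-attention is shown below, with toy sizes and random parameters as assumptions. The check at the end changes the final token and confirms that the outputs at earlier positions are unaffected.

```python
import numpy as np

def masked_self_attention(X, phi_q, phi_k, phi_v):
    I = X.shape[0]
    scores = (X @ phi_q) @ (X @ phi_k).T / np.sqrt(phi_q.shape[1])
    future = np.triu(np.ones((I, I), dtype=bool), k=1)   # True where column j > row i
    scores = np.where(future, -np.inf, scores)           # block attention to later tokens
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)                             # exp(-inf) is exactly zero
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ (X @ phi_v), weights

rng = np.random.default_rng(6)
I, D = 5, 4
X = rng.normal(size=(I, D))
phi_q, phi_k, phi_v = (rng.normal(size=(D, D)) for _ in range(3))
out, weights = masked_self_attention(X, phi_q, phi_k, phi_v)

# Changing the last token must not change any earlier output.
X_changed = X.copy()
X_changed[-1] += 1.0
out_changed, _ = masked_self_attention(X_changed, phi_q, phi_k, phi_v)
```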

The overall decoder transformer network operates as follows. The input text is tokenized and the tokens are converted to embeddings. The embeddings are passed into the transformer network, but now the transformer layers use masked self-attention so that they can only attend to the current and previous tokens. You can think of each of the output embeddings as representing a partial sentence, and for each the goal is to predict the next token in the sequence. Consequently, after the transformer layers, a linear layer maps each word embedding to the size of the vocabulary, followed by a $\mbox{softmax}[\bullet]$ function that converts these values to probabilities. We aim to maximize the sum of the log probabilities of the next token in the ground truth sequence at every position (figure 9).

To generate from the model, we start with an input sequence of text (which might be just the special $<$start$>$ token) and feed this into the network which then outputs the probability of the next token. We can then either pick the most likely token or sample from this probability distribution. The new extended sequence can be fed back into the decoder network which outputs the probability distribution over the next token and in this way, we can generate large bodies of text. The computation can be made quite efficient as prior embeddings do not interact with subsequent ones due to the masked self-attention and so a lot of the earlier computation can be recycled as we generate subsequent tokens.
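The generation loop can be sketched as follows. A hypothetical bigram table stands in for the decoder network here, so the "model" only looks at the most recent token; the vocabulary and probabilities are invented for the example.

```python
import numpy as np

vocab = ["<start>", "the", "train", "pulled", "in", "<end>"]
# Row = current token, column = probability of each possible next token.
transition = np.array([
    [0.0, 0.9, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.1, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0, 0.8, 0.2],
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

def next_token_probs(tokens):
    # Stand-in for running the decoder on the sequence so far.
    return transition[tokens[-1]]

def generate(prompt, max_new_tokens, rng=None):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        if rng is None:
            next_id = int(np.argmax(probs))                    # pick the most likely token
        else:
            next_id = int(rng.choice(len(probs), p=probs))     # sample from the distribution
        tokens.append(next_id)                                 # feed the extended sequence back in
        if vocab[next_id] == "<end>":
            break
    return [vocab[t] for t in tokens]

greedy = generate([0], max_new_tokens=10)
sampled = generate([0], max_new_tokens=10, rng=np.random.default_rng(0))
```

Greedy decoding always returns the same continuation for a given prompt, whereas sampling can produce a different one on each run.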

In practice, there are many strategies such as beam-search and top-K sampling that can be added to help make the output text more coherent. These are discussed in detail in our previous blog on natural language generation. Here's an example of completing text from the GPT2 model:

Borealis AI is a great place to work because there are a lot of people there that are passionate about this kind of technology, like me. There are some challenges for developers but it also gives us a great opportunity to work on different problems.

where the text provided to the model is in green and the generated text is in blue.

GPT3 applies these ideas on a massive scale. The sequences are 2048 tokens long and since multiple spans of 2048 tokens are processed at once, the total batch size is 3.2 million tokens. There are 96 transformer layers (some of which implement a sparse version of attention), each of which processes a word embedding of size 12288. There are 96 heads in the self-attention layers and the value, query and key dimension is 128. It is trained with 300 billion tokens and learns a total of 175 billion parameters. It is interesting to note that since text is extremely compressible, the model actually contains more than enough capacity to memorize the entire data set.
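These figures are roughly consistent with one another, as a quick calculation shows. The sketch below counts only the weight matrices in the transformer layers and assumes that the hidden layer of the fully connected network is four times the embedding size (the same ratio as in BERT); that ratio is an assumption and is not stated above.

```python
n_layers = 96
d_model = 12288
n_heads = 96
d_head = 128

heads_fill_embedding = (n_heads * d_head == d_model)   # 96 heads of size 128 span the embedding

attention_per_layer = 4 * d_model * d_model     # query, key, value, and combining weights
mlp_per_layer = 2 * d_model * (4 * d_model)     # assumed hidden size of 4 * d_model
total = n_layers * (attention_per_layer + mlp_per_layer)
print(f"{total / 1e9:.0f} billion weights")
```

The result is about 174 billion weights, close to the quoted total of 175 billion; the remainder would come from terms omitted here, such as the token embeddings and offsets.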

One surprising property of learning a model on this scale is that it can perform many tasks without the need for fine-tuning. If we provide several examples of correct question answer pairs, and then another question, it can often answer the final question correctly just by completing the sequence. One example of this is correcting English grammar:

**Poor English input:** I eated the purple berries.

**Good English output:** I ate the purple berries.

**Poor English input:** Thank you for picking me as your designer. I’d appreciate it.

**Good English output:** Thank you for choosing me as your designer. I appreciate it.

**Poor English input:** The mentioned changes have done. or I did the alteration that you requested. or I changed things you wanted and did the modifications.

**Good English output:** The requested changes have been made. or I made the alteration that you requested. or I changed things you wanted and made the modifications.

**Poor English input:** I’d be more than happy to work with you in another project.

**Good English output:** I'd be more than happy to work with you on another project.

The text containing the paired examples in green was provided as context for GPT3 and the system then generated the correct answer in blue. This phenomenon extends to many situations including generating code snippets based on natural language descriptions, simple arithmetic, translating between languages, answering questions about text passages and many more. Consequently, it is argued that enormous language models are *few-shot learners*; they can learn to do novel tasks based on just a few examples. However, in practice the performance is erratic and it is not clear the extent to which it is extrapolating from learned examples rather than merely interpolating, or even copying verbatim.

The original transformer paper focused on translation between languages, which is an example of a *sequence-to-sequence* task. Their original architecture was an *encoder-decoder* model that (as the name suggests) combines both encoder and decoder models.

Consider the example of translating from English to French. The encoder receives the sentence in English and processes it through a series of transformer layers to create an output representation for each token. The decoder receives the sentence in French and processes it through a series of transformer layers that use masked self-attention. However, these transformer layers also attend to the output of the encoder. Consequently, each French output word is conditioned not only on the previous output words, but also on the entire English sentence that it is translating (figure 10).

In practice this is achieved by modifying the transformer layer. The original transformer layer in the decoder (figure 5) consisted of a masked self-attention layer followed by a multi-layer perceptron applied individually to each embedding. In between these we now introduce a second attention layer, in which the embeddings attend to the output embeddings from the encoder. This uses a version of self-attention where the queries $\mathbf{X}_{d}\boldsymbol\Phi_{q}$ are computed from the decoder embeddings $\mathbf{X}_{d}$, and the keys $\mathbf{X}_{e}\boldsymbol\Phi_{k}$ and values $\mathbf{X}_{e}\boldsymbol\Phi_{v}$ are generated from the encoder embeddings $\mathbf{X}_{e}$:

\begin{equation}

\mbox{Sa}[\mathbf{X}_{d},\mathbf{X}_{e}] = \mbox{Softmax}[\mathbf{X}_{d}\boldsymbol\Phi_{q}(\mathbf{X}_{e}\boldsymbol\Phi_{k})^{T}]\mathbf{X}_{e}\boldsymbol\Phi_{v}. \tag{11}

\end{equation}

This is known as *encoder-decoder attention* (figure 11).
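Equation 11 differs from ordinary self-attention only in where the queries, keys, and values come from, as this short sketch shows. The sequence lengths and random parameters are toy assumptions.

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def encoder_decoder_attention(X_d, X_e, phi_q, phi_k, phi_v):
    """Queries come from the decoder; keys and values come from the encoder."""
    attention = softmax_rows((X_d @ phi_q) @ (X_e @ phi_k).T)   # decoder length x encoder length
    return attention @ (X_e @ phi_v), attention

rng = np.random.default_rng(7)
D = 4
X_d = rng.normal(size=(3, D))   # three decoder (French) token embeddings
X_e = rng.normal(size=(5, D))   # five encoder (English) token embeddings
phi_q, phi_k, phi_v = (rng.normal(size=(D, D)) for _ in range(3))
out, attention = encoder_decoder_attention(X_d, X_e, phi_q, phi_k, phi_v)
```

There is one output row per decoder token, and each row of the attention matrix is a distribution over the encoder tokens. No mask is needed here because the whole source sentence is available to every output position.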

In this blog, we introduced the idea of self-attention and then described how this fits into the transformer architecture. We then presented the encoder, decoder, and encoder-decoder versions of this architecture. We've seen that the transformer operates on sets of high-dimensional embeddings. It has a low computational complexity per layer and much of the computation can be performed in parallel, using the matrix form. Since every input embedding interacts with every other, it can describe long-range dependencies in text. It is these characteristics that have allowed transformers to be applied in massive systems like GPT3.

In the second part of the blog we will discuss extensions of the basic transformer model. In particular, we will expand on methods to encode the position of tokens and methods to extend transformers to process very long sequences. We'll also discuss how the transformer architecture relates to other models. Finally, in the third part of this series, we will discuss the details of how to train transformer models successfully.

^{1} BERT also used a secondary task which involved predicting whether two sentences were originally adjacent in the text or not, but this only marginally improved performance.

People communicate in natural language, which is flexible but often vague, whereas computer languages have no room for ambiguity. For a computer to respond to users' questions or commands in natural language, it needs to extract meaning, resolve ambiguity, and translate to executable programs. This is the task of *semantic parsing* (SP), whose applications include voice assistants, code generation, natural language interfaces to databases (NLDB), and many more. Our Turing by Borealis AI system is an NLDB, a software system enabling users to interact with databases in natural language, as illustrated in Figure 1.

The semantic parsing model powering an NLDB needs to be trained with questions and their corresponding SQL queries. If the model only generalizes to new questions on the training domain, the NLDB cannot be quickly adapted to new databases, so it would not be very useful in practice. Hence, the model somehow needs to generalize to new databases with unseen schema and unseen questions. This is *cross-domain* or *cross-database* text-to-SQL semantic parsing.

The goal of this blog post is to glimpse into how models (like our Turing by Borealis AI system) for this task work without popping the hood. It is suitable for any reader with basic knowledge of machine learning and natural language processing.

We will first give a brief review of SQL that readers can skip if already familiar, then introduce two running examples of text-to-SQL prediction. The examples will illustrate some challenges involved in cross-domain semantic parsing and show why simple methods would not succeed. Afterwards, we will describe a high-level framework that treats cross-database text-to-SQL as a graph-to-tree mapping. We will use the two running examples to show how the framework tackles the challenges that we identified. Finally, we will provide some pointers for interested readers to learn more, including our recent ACL papers (Xu et al., 2021a,b; Norouzi et al., 2021) that respectively set the new state-of-the-art accuracy on the Spider text-to-SQL benchmark and some code generation problems.

Before showing the examples, let us review some SQL basics. SQL stands for *Structured Query Language* and is used for storing, manipulating and retrieving data in relational databases. We will just focus on the retrieval here.

Relational databases store information records in tables. The *schema* of a database describes the structure of the domain: what are the tables, what columns does each table contain, the data type of each column, as well as special roles that some columns play. The first type of special role is a *primary key*. This is a column or a combination of columns that has to be unique for each data record. The second type of special role is a *foreign key*, which is a column or combination of columns whose values match the primary key records of another table. Foreign key relations link tables together.

`SELECT` Query

A basic SQL query looks like the following: `SELECT * FROM my_table`, where `*` is a reserved token meaning "all columns". This query will return all rows of the table `my_table`. The star can be replaced by one or more column names, in which case the query would only return the mentioned attributes in each row. Slightly more advanced queries involve a filtering condition, expressed using a `WHERE` clause: `SELECT * FROM my_table WHERE condition`. This query will only return records for which the `condition` holds true. The SQL syntax for the actual condition is generally self-explanatory.
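These queries can be tried out with Python's built-in `sqlite3` module. The table contents and column names below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # temporary in-memory database
conn.execute("CREATE TABLE my_table (name TEXT, sector TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("Ana", "tech", 90000), ("Ben", "retail", 40000), ("Cy", "tech", 60000)],
)

all_rows = conn.execute("SELECT * FROM my_table").fetchall()      # every column of every row
names = conn.execute("SELECT name FROM my_table").fetchall()      # a single column
high_paid = conn.execute(
    "SELECT * FROM my_table WHERE salary > 50000"                 # only rows passing the filter
).fetchall()
```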

`GROUP BY` and `HAVING`

Sometimes columns could correspond to categorical attributes like "sector". Here, an interesting class of questions involves aggregating some properties associated with each categorical value of the column. For this purpose, we would need the `GROUP BY` clause: `SELECT MAX(salary), sector FROM my_table GROUP BY sector`, which would find the highest salary in each sector. If we want to filter the categories, for instance to filter out sectors based on their associated statistics, we can use the `HAVING` clause. `HAVING` is similar to `WHERE` but operates on grouped categories instead. For example, `SELECT MAX(salary), sector FROM my_table GROUP BY sector HAVING AVG(salary) < 50000`.
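The following sketch runs a grouped query with and without a `HAVING` filter using Python's `sqlite3` module. The rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (sector TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [("tech", 90000), ("tech", 60000), ("retail", 40000),
     ("retail", 30000), ("finance", 120000)],
)

# One output row per sector, holding the highest salary in that sector.
per_sector = conn.execute(
    "SELECT MAX(salary), sector FROM my_table GROUP BY sector"
).fetchall()

# Keep only the sectors whose average salary is below 50000.
low_average = conn.execute(
    "SELECT MAX(salary), sector FROM my_table "
    "GROUP BY sector HAVING AVG(salary) < 50000"
).fetchall()
```

Only the retail sector survives the `HAVING` filter, since its average salary is 35000.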

`JOIN`

Last but not least, the concept of `JOIN` needs some explanation. As SQL databases store records in tables, sometimes we need to "merge" corresponding rows of two or more tables. We might need the merged records as the final result or as an intermediate step to compute something else. This requires joining two or more tables with the syntax: `SELECT * FROM table1 JOIN table2 ON table1.col_fkey = table2.col_pkey`. The `ON` part introduces a condition that is usually an equality relation between the foreign key and primary key columns, as in this example, but it can also be on other columns. This query returns the combinations of rows in `table1` and rows in `table2` for which the value in the column `col_fkey` of `table1` equals the value of `col_pkey` of `table2`.
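As a small illustration of the join semantics, the sketch below builds two hypothetical tables in `sqlite3` (the `id` and `label` columns and all rows are assumptions) and joins them on the foreign key. Note that a row whose foreign key is `NULL` has no partner and therefore does not appear in the result of this plain (inner) join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table2 (col_pkey INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE table1 (
    id INTEGER PRIMARY KEY,
    col_fkey INTEGER REFERENCES table2(col_pkey)
);
INSERT INTO table2 VALUES (1, 'red'), (2, 'blue');
INSERT INTO table1 VALUES (10, 1), (11, 1), (12, 2), (13, NULL);
""")

# Each table1 row is paired with the table2 row its foreign key points to.
joined = conn.execute(
    "SELECT table1.id, table2.label FROM table1 "
    "JOIN table2 ON table1.col_fkey = table2.col_pkey "
    "ORDER BY table1.id"
).fetchall()

print(joined)
```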

To predict the correct SQL from a natural language question, the model needs to correctly interpret each input word in the context of both the sentence and the schema. Furthermore, it needs to generate a syntactically correct SQL query as the output; otherwise, the database cannot execute it. To illustrate the challenges more concretely, let's consider two examples for the "Employee_hire_evaluation" database of the Spider benchmark. This database is a development set domain that models would not have seen during training.

The database has the following tables: `employee`, `shop`, `hiring`, `evaluation`. Each table has a number of columns:

- `employee`: `employee_id`, `name`, `age`, `city`, with `employee_id` being the primary key.
- `shop`: `shop_id`, `name`, `location`, `district`, `number_products`, `manager_name`, with `shop_id` being the primary key.
- `hiring`: `shop_id`, `employee_id`, `start_from`, `is_full_time`, with `employee_id` being the primary key and also a foreign key to the `employee` table's `employee_id`, and `shop_id` being a foreign key to the `shop` table's `shop_id`.
- `evaluation`: `employee_id`, `year_awarded`, `bonus`, with `employee_id` and `year_awarded` together as the primary key, and `employee_id` as a foreign key referencing the `employee` table's `employee_id`.

**Question:** Which cities have more than one employee under 30?

**Correct SQL:**

`SELECT employee.city FROM employee WHERE employee.age < 30 GROUP BY employee.city HAVING COUNT (*) > 1`

**Analysis**: Besides the general logic of the SQL query, a model needs to infer two conditions from the question, `employee.age < 30` and `COUNT (*) > 1`. The entities involved in the conditions (tables, columns or the star) are not explicitly mentioned in the text and have to be inferred. The model needs to deduce that "employee under 30" requires the column `age` by leveraging two pieces of information. First, it can have some prior common sense knowledge that the expression "employee under [NUMBER]" refers to employee age rather than some other attribute. Second, it could exclude other columns because the value "$30$" is too different from other columns' values based on type or range. For the second condition, the model needs to infer from the entire phrase "Which cities have more than one employee ..." that the condition is on the number of employees in each city, hence requiring `GROUP BY` [$\ldots$] `HAVING` [$\ldots$]. Finally, it needs to piece the two conditions together as well as the rest of the query using the correct syntax.
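To see how the two inferred conditions interact, the following sketch runs the query for this first example on a toy `employee` table. The rows are invented for illustration and are not the actual Spider data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, age INTEGER, city TEXT)"
)
# Hypothetical rows: Bristol has two employees under 30, Bath has only one.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        (1, "A", 23, "Bristol"),
        (2, "B", 29, "Bristol"),
        (3, "C", 43, "Bristol"),
        (4, "D", 28, "Bath"),
        (5, "E", 36, "Bath"),
    ],
)

# WHERE filters individual rows first; HAVING then filters the per-city groups.
cities = conn.execute(
    "SELECT employee.city FROM employee WHERE employee.age < 30 "
    "GROUP BY employee.city HAVING COUNT(*) > 1"
).fetchall()

print(cities)
```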

**Question:** What's the average age in each shop?

**Correct SQL:**

`SELECT AVG (employee.age) , shop.shop_id FROM employee JOIN hiring JOIN shop ON employee.employee_id = hiring.employee_id AND hiring.shop_id = shop.shop_id GROUP BY shop.shop_id`

**Analysis**: To correctly predict this SQL, not only does the SP model need to infer correctly from "in each shop" that the output contains `GROUP BY shop.shop_id`, it also needs to infer the involvement of the tables `employee` and `hiring`, which, unlike `shop`, are not explicitly mentioned. The table `employee` can be inferred based on the need for its `age` column. On the other hand, the `hiring` table can only be inferred from the need to link `employee.age` and `shop.shop_id`.
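The role of the `hiring` table as a bridge can be checked on a toy version of the schema. The sketch below keeps only the columns needed for this second example, and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY, name TEXT, age INTEGER, city TEXT
);
CREATE TABLE shop (shop_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE hiring (
    shop_id INTEGER REFERENCES shop(shop_id),
    employee_id INTEGER PRIMARY KEY REFERENCES employee(employee_id)
);
INSERT INTO employee VALUES (1, 'A', 20, 'X'), (2, 'B', 30, 'X'), (3, 'C', 50, 'Y');
INSERT INTO shop VALUES (1, 'North'), (2, 'South');
INSERT INTO hiring VALUES (1, 1), (1, 2), (2, 3);
""")

# employee and shop share no column, so the query must pass through hiring.
avg_age_per_shop = conn.execute(
    "SELECT AVG(employee.age), shop.shop_id FROM employee "
    "JOIN hiring JOIN shop ON employee.employee_id = hiring.employee_id "
    "AND hiring.shop_id = shop.shop_id "
    "GROUP BY shop.shop_id ORDER BY shop.shop_id"
).fetchall()

print(avg_age_per_shop)
```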

You might wonder whether some generic or simple approach can already solve this cross-database text-to-SQL problem. For example, let's consider the sequence-to-sequence model often used in machine translation. Text-to-SQL semantic parsing bears some similarity to machine translation if we view SQL as a foreign language to translate into. However, some crucial differences exist. First, typical training datasets for machine translation are larger than those for SQL semantic parsing by two orders of magnitude or even more. Second, in machine translation, partially correct results can still provide partial utility, but for an NLDB, any small mistake in the predicted SQL query could invalidate the result. Third, as we have seen from the examples, the database schema is crucial for correct translation to SQL, which sequence-to-sequence machine translation models do not consider. For these reasons, typical neural sequence-to-sequence models do not work well.

Another baseline is *shallow semantic parsing* in which we simplify the problem and assume that there are a fixed number of user intentions. An intent classifier could then select the template that best corresponds to the user question from a pre-defined list. Then a model extracts the relevant information from the user question to fill in the template slots. For instance, we can turn the first example into a template whose SQL would have some slots to be filled:

`SELECT employee.city FROM employee WHERE employee.age [COMP_A] [A] GROUP BY employee.city HAVING COUNT (*) [COMP_C] [C]`

Given enough training examples of questions tagged with their corresponding template IDs and slot values, a model could potentially answer questions like "show me the cities with less than 5 employees over twenty five" by identifying this template out of many, then predicting that `COMP_A` $:=$ `>`, `A` $:=$ `25`, `COMP_C` $:=$ `<`, `C` $:=$ `5`. This approach is commonly used in voice-assistant and task-oriented dialogue systems. The main drawback is that the templates need to be pre-defined, so the system cannot generalize to new queries on the fly. Hence this approach is also unsuitable for cross-database NLDB in general.
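The slot-filling step itself is mechanical once the intent classifier and slot extractor have done their work. Here is a minimal sketch in which the slot values are hard-coded rather than predicted: for the question about cities with less than 5 employees over twenty five, "over twenty five" constrains the age and "less than 5" constrains the count. The function name `fill_template` is a hypothetical helper.

```python
# Template for one fixed intent; the slots are the only free parts.
TEMPLATE = (
    "SELECT employee.city FROM employee "
    "WHERE employee.age {COMP_A} {A} "
    "GROUP BY employee.city "
    "HAVING COUNT(*) {COMP_C} {C}"
)


def fill_template(template: str, slots: dict) -> str:
    """Substitute predicted slot values into a pre-defined SQL template."""
    return template.format(**slots)


# In a real system these values would come from a trained slot-filling model.
slots = {"COMP_A": ">", "A": 25, "COMP_C": "<", "C": 5}
sql = fill_template(TEMPLATE, slots)
print(sql)
```

The sketch also makes the limitation visible: a question that needs a `JOIN` or a different aggregate has no slot to express it.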

As shown by the two running examples, successful cross-database SQL semantic parsing really requires the model to reason using at least three sets of knowledge:

- Explicit knowledge about the domain expressed in the schema;
- Implicit background or common sense knowledge;
- Knowledge of SQL.

We now describe a general framework for cross-database text-to-SQL that leverages all of this knowledge. The backbone of the overall system is a neural network with encoder-decoder architecture, which is adapted in various ways to leverage explicit symbolic knowledge.

Motivated by the examples, we see that the model needs to jointly encode the question and schema, considering how words relate to each other within and across the question and the schema. So the input for cross-database semantic parsing has an inherent graph structure; the nodes are the tokens in the questions and schema and are linked by different edges. On the output side, to produce grammatically correct SQL queries and leverage programming-language-specific inductive priors, we treat the prediction problem as generation of the abstract syntax tree (AST) of the program. Hence, we can characterize this task as a *graph-to-tree* mapping.

Figure 2 illustrates the overall framework for Example One: an encoder consumes the input graph, and a decoder produces the output AST. Joint modelling of question and schema as a graph was popularized by the relation-aware transformer (RAT) work (Wang et al., 2019) while TranX (Yin and Neubig, 2018) provides a unified framework for modelling output programs as ASTs. Our Turing by Borealis AI system also follows this overall approach, with many additional innovations that we will not cover here.

As mentioned above, we view each token in the question and schema as a node in a graph. The most basic edge type among the nodes is a generic link between any pair of tokens, reflecting the assumption that, a priori, any token could provide relevant context to any other token, so a link cannot be ruled out. This essentially yields a fully connected graph. For visual simplicity, we omit these edges from Figure 2.

However, other types of relations carry special meanings and are sparse. These include (i) foreign key relations that link a column in one table to the primary key of another table, (ii) exact string match and partial string match between words in the questions and words in column or table names and (iii) implicit links between a table and its columns. Some of these edges are illustrated in different colours on the input side in Figure 2. Because there can be more than one type of edge between two tokens to be modelled, this input is technically a multi-graph.
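As a rough illustration of edge type (ii), the sketch below derives exact and partial string-match edges between question tokens and schema names. It is a deliberately naive heuristic with an assumed function name `match_edges`; practical schema linking typically also uses lemmatization and multi-word matching, which is why "cities" is not linked to `city` here.

```python
def match_edges(question_tokens, schema_names):
    """Return (question_index, schema_name, edge_type) triples."""
    edges = []
    for i, token in enumerate(question_tokens):
        token = token.lower()
        for name in schema_names:
            words = name.lower().split("_")
            if token == name.lower():
                edges.append((i, name, "exact"))
            elif token in words:
                # The token matches one word of a multi-word name.
                edges.append((i, name, "partial"))
    return edges


question = "which cities have more than one employee under 30".split()
schema = ["employee", "employee_id", "name", "age", "city", "shop", "shop_id"]
edges = match_edges(question, schema)
print(edges)
```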

How do these edges help predict the correct SQL? Let's return to the examples.

In Example One (Figure 2), the word "employee" in the question exactly matches the table name `employee`, so a special edge for an exact match is created in this input graph during preprocessing. For a graph neural network or relation-aware transformer that can encode a graph by propagating information along edges, this link creates a potential pathway for information from the representation of the columns (`employee_id`, `name`, `age`, `city`) of table `employee` to contextualize the representation of the question token "employee", and vice versa. This makes it more likely for `employee_id`, `name`, `age` or `city` to be selected, compared to columns in the other tables, when predicting a column corresponding to the condition "employee under 30".

The second example is more interesting. The question mentions the table name `shop` explicitly, while the table `employee` can be easily inferred based on the column mention `age`. However, for `hiring` there is no textual evidence from the question, direct or indirect, that the SQL query should involve it. The only way to infer this is through the foreign key links and the fact that `shop` and `employee` are otherwise disconnected and cannot be joined. This potential reasoning process is illustrated in Figure 3.
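The inference about `hiring` can be mimicked symbolically as a shortest-path search over the foreign key graph. The neural model does not run such a search explicitly; the sketch below, with the assumed helper `join_path`, only illustrates the kind of information that the foreign key edges make available.

```python
from collections import deque

# Foreign key links of the example schema, treated as undirected edges.
FOREIGN_KEYS = [
    ("hiring", "employee"),
    ("hiring", "shop"),
    ("evaluation", "employee"),
]


def join_path(start, goal, foreign_keys):
    """Breadth-first search for the shortest chain of tables linking start to goal."""
    neighbours = {}
    for a, b in foreign_keys:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for table in neighbours.get(path[-1], []):
            if table not in visited:
                visited.add(table)
                queue.append(path + [table])
    return None  # the two tables cannot be joined


path = join_path("employee", "shop", FOREIGN_KEYS)
print(path)
```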

Now that we understand how the (multi-)graph structure would help the semantic parsing model, let's formalize what the encoder does at a high level. Let $\mathcal{S}=\{s_1,\dots, s_{\lvert \mathcal{S} \rvert}\}$ denote the schema elements, consisting of tables and their columns, and use $Q=q_1\dots q_{\lvert Q \rvert}$ to denote the sequence of words in the question. Let $\mathcal{G}=\langle\mathcal{V}, \mathcal{E}\rangle$ denote the multi-graph with edge sets $\mathcal{E}$. The encoder, $f_{\text{enc}}$, maps $\mathcal{G}$ to a joint representation $ \mathcal{H} =$ $\{\phi^q_1, \ldots,\phi^q_{\lvert Q \rvert} \} \cup \{\phi^s_1, \ldots,\phi^s_{\lvert \mathcal{S} \rvert} \}$. The fully connected portion of the multi-graph can be well modelled by a transformer (see [link] for our blog series on Transformers). Indeed, one can flatten the schema into a linear string, with the tokens belonging to different column or table names separated by a special token like "[SEP]" and concatenate this string with the question string before feeding into a pre-trained model such as BERT. The use of pre-trained BERT (or other variants) here is how implicit common sense knowledge is embodied in the semantic parser. To model information propagation along the special sparse edges of the multi-graph, we can then feed the BERT output embeddings into a relation-aware transformer (Wang et al., 2019). There are a few subtle details omitted here, which we will give some pointers for at the end of this article.
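The flattening step can be sketched as plain string manipulation. The function `serialize` below is a hypothetical helper, and the choice to replace underscores by spaces is an assumption made so that a subword tokenizer sees ordinary words; the actual tokenization and the call into a pre-trained model are omitted.

```python
def serialize(question, schema):
    """Concatenate the question and schema names, separated by "[SEP]".

    `schema` maps each table name to the list of its column names.
    """
    pieces = [question]
    for table, columns in schema.items():
        pieces.append(table.replace("_", " "))
        pieces.extend(column.replace("_", " ") for column in columns)
    return " [SEP] ".join(pieces)


toy_schema = {"employee": ["employee_id", "age"], "shop": ["shop_id"]}
text = serialize("What's the average age in each shop?", toy_schema)
print(text)
```

The resulting string only captures the fully connected part of the multi-graph; which name belongs to which table, and which columns are keys, still has to be supplied through the special edges.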

If we model SQL queries as linear sequences of text tokens on the output side, it is not easy to leverage the SQL grammar knowledge. During inference, one could use a grammar validator program to check if a generated sequence is legal; however, the neural network is still not using this information during training for better generalization. Furthermore, the grammar not only captures what is illegal but also how SQL expressions can be composed. Leveraging this prior knowledge will significantly improve the learning efficiency from a small number of examples. Therefore, we want to cast the problem as generating the abstract syntax tree of SQL queries.

A common approach to predict an abstract syntax tree (AST) is to use a grammar-based transition system like TranX (Yin and Neubig, 2018), which decomposes the generation process of an abstract syntax tree (AST) into a sequence of actions. The neural model learns to predict the action sequence, and the transition system then constructs the AST using the predicted action sequence. Finally, another deterministic routine maps the AST into a linear string format of SQL, *aka* the surface code (Figure 4).

Figure 5 shows a snippet of the SQL grammar for TranX used by our Turing by Borealis AI system. It is specified in an abstract syntax description language (ASDL). It is similar to a context-free grammar, but more powerful, with each production rule's right-hand side being a function call signature with strongly-typed arguments. The type names are non-terminal symbols in the grammar, for which there are further production rules. This grammar is specific to the programming language of interest, or a subset of features in a programming language, and needs to be developed by a human expert.

`stmt = Intersect(query_expr lbody, query_expr rbody)` |

The transition system converts between an AST and its AST-constructing action sequence, leveraging a grammar like the one in Figure 5. The transition system starts at the root of the AST and derives the action sequence by a top-down, left-to-right depth-first traversal of the tree. At each step, it generates one of the possible parametrized action types.

For cross-domain text-to-SQL parsing, the action types can include: (1) **ApplyRule**[$r$] which applies a production rule $r$ of the grammar to the latest generated node in the AST; (2) **Reduce** which marks the complete generation of a subtree corresponding to a function call (in the ASDL grammar); (3-4) **SelectTable**[$t$] and **SelectColumn**[$c$] which, respectively, choose a table $t$ and a column $c$ from the database schema $\mathcal{S}$; (5) **CopyToken**[$k$] which copies a token $q_k$ from the user question $Q$; (6) **GenToken**[$l$] which generates a token $w_l$ from a vocabulary. In practice, with careful design, it is possible to simplify and avoid SelectTable and GenToken, which is part of the technical novelties in our Turing by Borealis AI system.
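To give a feel for how a tree and an action sequence correspond, here is a toy sketch using only three of the action types above plus Reduce. It is not the TranX implementation: the tuple-based tree encoding, the rule names `And`, `Lt` and `Gt`, and the convention of emitting Reduce after every constructor are simplifying assumptions.

```python
def ast_to_actions(node):
    """Top-down, left-to-right depth-first traversal emitting one action per step."""
    kind, payload = node[0], node[1]
    if kind == "column":
        return [("SelectColumn", payload)]
    if kind == "value":
        return [("CopyToken", payload)]
    # kind == "rule": payload is the rule name, node[2] the ordered children.
    actions = [("ApplyRule", payload)]
    for child in node[2]:
        actions.extend(ast_to_actions(child))
    actions.append(("Reduce",))
    return actions


def actions_to_ast(actions):
    """Rebuild the tree by replaying the actions with a stack of open nodes."""
    frames, names = [[]], []
    for action in actions:
        if action[0] == "ApplyRule":
            names.append(action[1])
            frames.append([])
        elif action[0] == "Reduce":
            children = frames.pop()
            frames[-1].append(("rule", names.pop(), children))
        elif action[0] == "SelectColumn":
            frames[-1].append(("column", action[1]))
        else:
            frames[-1].append(("value", action[1]))
    return frames[0][0]


condition = ("rule", "Lt", [("column", "employee.age"), ("value", "30")])
actions = ast_to_actions(condition)
print(actions)
```

Because the mapping is deterministic in both directions, the neural model only has to learn to predict the action sequence.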

Before training, the TranX system first converts the surface SQL code to the AST representation using a deterministic domain-specific routine. Then, leveraging the grammar, it converts the AST into the action sequence (Figure 4). The actual training is then standard maximum likelihood with teacher-forcing, which you can read about in this tutorial. At each step, the model predicts the correct action conditioned on the ground-truth partial action sequence up to that point, as well as the encoder representation $\mathcal{H}$ of the question and schema. Most of the action types are parameterized by some argument, for example, production rule $r$ for ApplyRule, column $c$ for SelectColumn. The model first predicts the action type, then conditioned on the ground-truth action type (regardless of the predicted one), predicts the argument.

The inference process builds upon beam search, which you can learn more about in this tutorial. The difference here is that the beam search is guided by the grammar and the transition system. This grammar-guided beam-search decoding sounds complex and indeed has many tedious implementation details, but it is conceptually simple: at each step of decoding, for each partial sequence in the beam, the transition system tracks all action types and arguments that are legal according to the grammar; the neural net can only select from those options. Once beam search produces multiple action sequences, the transition system converts them to ASTs, then converts them to surface SQL code strings using the domain-specific post-processing routine as illustrated in Figure 4.

Besides the neural attention over the encoder representation, some other weak reasoning using the grammar happens here during beam-search inference. By tracking multiple partial trees (implicitly, via partial action sequences), a hypothesis scored high at the beginning could drop sharply because its high-probability continuation could violate the grammar. As a result, another partial tree that is less likely at first could become more plausible and eventually be the top prediction.
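This re-ranking effect can be reproduced with a tiny hand-made example. In the sketch below, the tables `LEGAL` and `PROBS` stand in for the transition system and the neural network respectively; all actions and numbers are invented, and the scores are assumed not to be renormalized after illegal actions are masked out.

```python
import math

# Actions the (toy) grammar allows after each partial sequence.
LEGAL = {(): ["A", "B"], ("A",): ["x"], ("B",): ["x", "y"]}

# Model probabilities for the next action; after "A" most mass is on illegal "y".
PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.1, "y": 0.9},
    ("B",): {"x": 0.2, "y": 0.8},
}


def beam_search(beam_size=2, steps=2):
    """Beam search in which only grammar-legal actions can extend a hypothesis."""
    beam = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beam:
            for action in LEGAL[seq]:
                score = logp + math.log(PROBS[seq][action])
                candidates.append((seq + (action,), score))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam


best_seq, best_logp = beam_search()[0]
print(best_seq, math.exp(best_logp))
```

After the first step the hypothesis starting with `A` leads, but its preferred continuation is illegal, so the sequence `B`, `y` ends up on top.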

**Inferring and encoding special edges in the multi-graph:** we saw some examples of special edges between a question token and schema word, but there could be other types of links. For example, suppose a question word happens to match a database value in some column. In that case, this is evidence that this question word has an implicit relationship to the corresponding column. More generally, these edges are inferred using heuristic pre-processing rules, in a process known as *schema linking*. The relation-aware transformer layers can learn to deal with some degree of noise in the links. For more details, please see the original RAT paper (Wang et al., 2019).

We also discussed using a pre-trained transformer to encode the implicit fully-connected part of the multi-graph, in conjunction with RAT-based modelling of the sparse special edges. But the pre-trained transformer builds contextualized representation for subword tokens, whereas table and column names are usually phrases. The Appendix section of Xu et al. (2021a) contains more information about how these models can be pieced together.

**Modelling tables implicitly through columns:** as mentioned previously, it is possible to drop the SelectTable action altogether. The idea is to globally uniquely identify the columns rather than using the column names only. We can add the table representation to all of its column representations on the input encoding side before feeding into the RAT layers. On the output side, we can give each column a globally unique ID for SelectColumn. Then the table can be inferred deterministically from the predicted columns during post-processing. This design choice simplifies the relation learning for encoding and makes the output action sequences shorter. On some rare occasions, this becomes an over-simplification causing failures for some complex queries, for instance, when there are multiple self-joins. Please see Xu et al. (2021b) for more details.

**TranX transition system and leveraging tree structures in the neural decoder:** so far, we have only shown how TranX works at a high level, but readers interested in using the framework for semantic parsing should consult Yin and Neubig (2018) for more details. In particular, the TranX transition system exposes the topology of the AST to the linear action sequence decoding process via the *parent frontier field*. The parent does not always correspond to the immediately preceding step in the action sequence. Yet, it is important to directly condition on its representation during decoding, which is known as *parent feeding*.

**Handling values in the question:** in Example One, the value $30$ from the question is exactly the token needed in the condition part of the SQL statement, so it can be just copied over. However, in general, this might not always be the case. Most models use a combination of generation and copy attention. But as mentioned earlier, Turing (Xu et al., 2021b) simplifies away the generation and only performs the copy action. The idea is that during training, the model learns to identify the question text span providing evidence for the value, which significantly simplifies the learning problem and reduces overfitting. A heuristic search-based post-processor is responsible for producing the actual value to be used in the SQL at inference time.

**Training and generalization when the model is deep and the dataset small**: using relation-aware transformer layers on top of pre-trained transformers like BERT or RoBERTa can quickly make the overall model very deep and hard to train. The usual rules-of-thumb for optimizing transformers are to use a large batch size, make the model shallower, or both. However, our recent work finds a way to train ultra-deep transformers ($48$ layers) using a small batch size and this improves the model generalization, especially for hard cases. This technique allowed us to place No. $1$ on the Spider Leaderboard (Exact Set Match without Values) ^{1}.

**Beyond teacher-forcing maximum likelihood**: other sequence learning methods could also be used in theory, such as scheduled sampling or beam-search optimization (BSO). See our work on training a globally normalized semantic parsing model using a method similar to BSO (Huang et al., 2021), which works on some simple datasets, but not yet on complex ones like Spider.

**Other approaches for semantic parsing**: there are other promising approaches that do not follow the framework presented in this blog. For cross-database semantic parsing, Rubin and Berant (2021) abandon autoregressive decoding and instead perform semi-autoregressive bottom-up semantic parsing. The advantage is that at each step of decoding, the model both conditions on and predicts semantically meaningful sub-programs, instead of semantically vacuous partial trees. The method performs competitively on Spider, which is impressive; moreover, it potentially has better compositional or out-of-distribution generalization. On the other end of the spectrum, if our goal is not cross-domain text-to-SQL but generic code generation, then our recent ACL work (Norouzi et al., 2021) shows that leveraging a large monolingual corpus of programming language source code enables a simple transformer-based seq-to-seq baseline to perform competitively. Note that this does not contradict our earlier discussion of simple seq-to-seq baselines being unable to perform well in cross-database semantic parsing.

**Explaining the queries**: an essential feature of Turing by Borealis AI is the ability to explain the predicted queries to non-technical users. This allows people to use their own judgment to pick out which of the top hypotheses is more likely to be correct. Please check out our paper (Xu et al., 2021b) for more information about the explanation system.

1 As of June-02-2021, the time of publication of this blog. Our entry is "DT-Fixup SQL-SP + RoBERTa (DB content used) Borealis AI".

- Chenyang Huang, Wei Yang, Yanshuai Cao, Osmar Zaïane, and Lili Mou. 2021. A globally normalized neural model for semantic parsing. In *ACL-IJCNLP-2021 5th Workshop on Structured Prediction for NLP*, Online. Association for Computational Linguistics.
- Sajad Norouzi, Keyi Tang, and Yanshuai Cao. 2021. Code generation from natural language with less prior knowledge and more monolingual data. In *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics*, Online. Association for Computational Linguistics.
- Ohad Rubin and Jonathan Berant. 2021. SmBoP: Semi-autoregressive bottom-up semantic parsing. In *Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 311–324, Online. Association for Computational Linguistics.
- Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2019. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. *arXiv preprint arXiv:1911.04942*.
- Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie Chi Kit Cheung, Simon J. D. Prince, and Yanshuai Cao. 2021a. Optimizing deeper transformers on small datasets. In *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics*, Online. Association for Computational Linguistics.
- Peng Xu, Wenjie Zi, Hamidreza Shahidi, Ákos Kádár, Keyi Tang, Wei Yang, Jawad Ateeq, Harsh Barot, Meidan Alon, and Yanshuai Cao. 2021b. Turing: an accurate and interpretable multi-hypothesis cross-domain natural language database interface. In *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics: System Demonstrations*, Online. Association for Computational Linguistics.
- Pengcheng Yin and Graham Neubig. 2018. TranX: A transition-based neural abstract syntax parser for semantic parsing and code generation. *arXiv preprint arXiv:1810.02720*.

The current dominant paradigm in natural language processing is to build enormous language models based on the transformer architecture. Models such as GPT3 contain billions of parameters, which collectively describe joint statistics of spans of text and have been extremely successful over a wide range of tasks.

However, these models do not explicitly take advantage of the structure of language; native speakers understand that a sentence is syntactically valid, even if it is meaningless. Consider how *Colorless green ideas sleep furiously* feels like valid English, whereas *Furiously sleep ideas green colorless* does not ^{1}. This structure is formally described by a *grammar*, which is a set of rules that can generate an infinite number of sentences, all of which sound right, even if they mean nothing.

In this blog, we review earlier work that models grammatical structure. We introduce the CYK algorithm which finds the underlying syntactic structure of sentences and forms the basis of many algorithms for linguistic analysis. The algorithms are elegant and interesting for their own sake. However, we also believe that this topic remains important in the age of large transformers. We hypothesize that the future of NLP will consist of merging flexible transformers with linguistically informed algorithms to achieve systematic and compositional generalization in language processing.

Our discussion will focus on *context-free grammars* or *CFGs*. These provide a mathematically precise framework in which sentences are constructed by recursively combining smaller phrases usually referred to as *constituents*.^{2} Sentences under a CFG are analyzed through a tree-structured derivation in which the sentence is recursively generated phrase by phrase (figure 1).

The problem of recovering the underlying structure of a sentence is known as *parsing*. Unfortunately, natural language is ambiguous and so there may not be a single possible meaning; consider the sentence *I saw him with the binoculars*. Here, it is unclear whether the subject or the object of the sentence holds the binoculars (figure 2). To cope with this ambiguity, we will need weighted and probabilistic extensions to the context-free grammar (referred to as WCFGs and PCFGs respectively). These allow us to compute a number that indicates how "good" each possible interpretation of a sentence is.

In Part I of this series of three blogs, we introduce the notion of a context-free grammar and consider how to parse sentences using this grammar. We then describe the *CYK recognition algorithm* which identifies whether the sentence can be parsed under a given grammar. In Part II, we introduce the aforementioned weighted context-free grammars and show how the CYK algorithm can be adapted to compute different quantities including the most likely sentence structure. In Part III we introduce probabilistic context-free grammars, and we present the *inside-outside algorithm* which efficiently computes the expected counts of the rules in the grammar for all possible analyses of a sentence. These expected counts are used in the E-Step of an expectation-maximization procedure for learning the rule weights.

Before tackling these problems, we'll first discuss the properties of a parse tree (figure 3). The *root* of the tree is labelled as "sentence" or "start". The leaves or *terminals* of the tree contain the words of the sentence. The parents of these leaves are called *pre-terminals* and contain the part-of-speech (POS) categories of the words (e.g., verb, noun, adjective, preposition). Words are considered to be from the same category if a sentence is still syntactically valid when they are substituted. For example: The {sad, happy, excited, bored} person in the coffee shop. This is known as the *substitution test*. Above the pre-terminals, the word categories are collected together into *phrases*.

There are three more important things to notice. First, the verb phrase highlighted in magenta has three children. However, there is no theoretical limit to this number. We could easily add the prepositional phrases *in the garden* and *under a tree* and so on. The complexity of the sentence is limited in practice by human memory and not by the grammar itself.

Second, the grammatical structure allows for recursion. In this example, a verb phrase is embedded within a second verb phrase, which itself is embedded in a third verb phrase. Finally, we note that the parse tree disambiguates the meaning of the sentence. From a grammatical point of view, it could be that it was the bone that was enjoying every moment. However, it is clear that this is not the case, since the verb phrase corresponding to enjoying is attached to the verb phrase corresponding to eating and not the bone (see also figure 2).

In this section, we present a more formal treatment of context-free grammars. In the following section, we'll elucidate the main ideas with an example.

A *language* is a set of *strings*. Each string is a sequence of *terminal symbols*. In figure 3 these correspond to individual words, but more generally they may be abstract *tokens*. The set of terminals $\Sigma=\{\mbox{a,b,c},\ldots\}$ is called an *alphabet* or *lexicon*. There is also a set $\mathcal{V}=\{\mbox{A,B,C},\ldots\}$ of *non-terminals*, one of which is the special *start symbol* $S$.

Finally, there is a set $\mathcal{R}$ of *production* or *re-write* rules. These relate the non-terminal symbols to each other and to the terminals. Formally, the grammar rules form a finite relation $\mathcal{R}\subset \mathcal{V} \times (\Sigma \cup \mathcal{V})^*$ where $*$ denotes the Kleene star. Informally, this means that each grammar rule is an ordered pair where the first element is a non-terminal from $\mathcal{V}$ and the second is any possible string containing terminals from $\Sigma$ and non-terminals from $\mathcal{V}$. For example, B$\rightarrow$ab, C$\rightarrow$Baa and A$\rightarrow$AbCa are all production rules.

A *context free grammar* is the tuple $G=\{\mathcal{V}, \Sigma, \mathcal{R}, S\}$ consisting of the non-terminals $\mathcal{V}$, terminals $\Sigma$, production rules $\mathcal{R}$, and start symbol $S$. The associated context-free language consists of all possible strings of terminals that are derivable from the grammar.
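As an illustration of these definitions, the sketch below encodes a small grammar as Python data and enumerates the strings it derives. The grammar (S$\rightarrow$aSb, S$\rightarrow$ab) is a hypothetical example chosen because it is recursive; the length bound used for pruning is only valid because the grammar has no rule producing the empty string.

```python
from collections import deque

# A toy grammar G = (V, Sigma, R, S); all symbols here are illustrative.
NONTERMINALS = {"S"}
TERMINALS = {"a", "b"}
RULES = [("S", ("a", "S", "b")), ("S", ("a", "b"))]
START = "S"


def derive(max_len=6):
    """Enumerate all terminal strings of at most max_len symbols derivable from START."""
    results = set()
    queue = deque([(START,)])
    seen = {(START,)}
    while queue:
        form = queue.popleft()
        # Position of the leftmost non-terminal, if any.
        idx = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if idx is None:
            results.add("".join(form))
            continue
        for lhs, rhs in RULES:
            if lhs == form[idx]:
                new_form = form[:idx] + rhs + form[idx + 1:]
                if len(new_form) <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
    return results


print(sorted(derive()))
```

Raising `max_len` yields ever longer strings, which shows how a finite rule set can describe an infinite language.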

Informally, the term *context-free* means that each production rule starts with a single non-terminal symbol. Context-free grammars are part of the *Chomsky hierarchy* of languages which contains (in order of increasing expressiveness) regular, context-free, context-sensitive, and recursively enumerable grammars. Each differs in the family of production rules that are permitted and the complexity of the associated parsing algorithms (table 1). As we shall see, context-free languages can be parsed in $O(n^{3})$ time where $n$ is the number of observed terminals. Parsing more expressive grammars in the Chomsky hierarchy has exponential complexity. In fact, context-free grammars are not considered to be expressive enough to model real languages. Many other types of grammar have been invented that are both more expressive and parseable in polynomial time, but these are beyond the scope of this post.

| Language | Recognizer | Parsing Complexity |
|---|---|---|
| Recursively enumerable | Turing machine | undecidable |
| Context-sensitive | Linear-bounded automata | PSPACE |
| Context-free | Pushdown automata | $O(n^3)$ |
| Regular | Finite-state automata | $O(n)$ |

Table 1. The Chomsky hierarchy of languages. As the grammar-type becomes simpler, the required computation model (recognizer) becomes less general and the parsing complexity decreases.

Consider the context free grammar that generated the example in figure 4. Here, the set of non-terminals $\mathcal{V}=\{\mbox{VP, PP, NP, DT, NN, VBZ, IN,}\ldots\}$ contains the start symbol, phrases, and pre-terminals. The set of terminals $\Sigma=\{$The, dog, is, in, the, garden, $\ldots \}$ contains the words. The production rules in the grammar associated with this example include:

Of course, a full model of English grammar contains many more non-terminals, terminals, and rules than we observed in this single example. The main point is that the tree structure in figure 4 can be created by the repeated application of a finite set of rules.

Later on, we will describe the CYK recognition algorithm. This takes a sentence and a context-free grammar and determines whether there is a valid parse tree that can explain the sentence in terms of the production rules of the CFG. However, the CYK algorithm assumes that the context free grammar is in *Chomsky Normal Form (CNF)*. A grammar is in CNF if it only contains the following types of rules:

\begin{align}
\text{A} &\rightarrow \text{B} \; \text{C} \tag{binary non-terminal} \\
\text{A} &\rightarrow \text{a} \tag{unary terminal} \\
\text{S} &\rightarrow \epsilon \tag{delete sentence}
\end{align}

where A, B, and C are non-terminals, a is a token, S is the start symbol and $\epsilon$ represents the empty string.

The *binary non-terminal rule* means that a non-terminal can create exactly two other non-terminals. An example is the rule $S \rightarrow \text{NP} \; \text{VP}$ in figure 4. The *unary terminal rule* means that a non-terminal can create a single terminal. The rule $\text{NN} \rightarrow$ $\text{dog}$ in figure 4 is an example. The *delete sentence* rule allows the grammar to create empty strings, but in practice we avoid $\epsilon$-productions.
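The three rule types can be checked mechanically. The following is a minimal sketch; the function name `is_cnf_rule` and the representation of a rule as a `(lhs, rhs)` pair with `rhs` a tuple are assumptions for illustration only.

```python
def is_cnf_rule(lhs, rhs, nonterminals, terminals, start="S"):
    """Return True if lhs -> rhs is one of the three rule types allowed in CNF."""
    if lhs not in nonterminals:
        return False
    if len(rhs) == 2:   # binary non-terminal rule: A -> B C
        return all(sym in nonterminals for sym in rhs)
    if len(rhs) == 1:   # unary terminal rule: A -> a
        return rhs[0] in terminals
    if len(rhs) == 0:   # delete sentence rule: only S -> epsilon is allowed
        return lhs == start
    return False        # more than two symbols on the right-hand side


V = {"S", "NP", "VP", "NN", "VBG"}
SIGMA = {"dog"}
print(is_cnf_rule("S", ("NP", "VP"), V, SIGMA))          # binary non-terminal
print(is_cnf_rule("NN", ("dog",), V, SIGMA))             # unary terminal
print(is_cnf_rule("VP", ("VBG", "NP", "VP"), V, SIGMA))  # three children: not CNF
print(is_cnf_rule("NP", ("NN",), V, SIGMA))              # unary non-terminal: not CNF
```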

Notice that the parse tree in figure 3 is *not* in Chomsky Normal Form because it contains the rule $\text{VP} \rightarrow \text{VBG} \; \text{NP} \; \text{VP}$. For the case of natural language processing, there are two main tasks to convert a grammar to CNF:

- We deal with rules that produce more than two non-terminals by creating new intermediate non-terminals (figure 5a).
- We remove unary rules like A $\rightarrow$ B by creating a new node A_B (figure 5b).

Both of these operations introduce new non-terminals into the grammar. Indeed, in the former case, we may introduce different numbers of new non-terminals depending on which children we choose to combine. It can be shown that in the worst-case scenario, converting CFGs into an equivalent grammar in Chomsky Normal Form results in a quadratic increase in the number of rules. Note also that although the CNF transformation is the most popular, it is not the only, or even the most efficient option.
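The first of these two operations can be sketched in a few lines. This version always groups the rightmost children, which is only one of several possible choices, and the naming scheme for the intermediate non-terminals (joining the child labels with `+`) is an assumption of this sketch and not a convention taken from the figures.

```python
def binarize(lhs, rhs):
    """Split lhs -> X1 X2 ... Xk into binary rules by grouping the rightmost children.

    New intermediate non-terminals are named by joining the labels they cover
    with '+', which is a hypothetical naming scheme for this sketch.
    """
    rules = []
    current, remaining = lhs, list(rhs)
    while len(remaining) > 2:
        first, rest = remaining[0], remaining[1:]
        new_symbol = "+".join(rest)            # intermediate non-terminal
        rules.append((current, (first, new_symbol)))
        current, remaining = new_symbol, rest
    rules.append((current, tuple(remaining)))
    return rules


for rule in binarize("VP", ("VBG", "NP", "VP")):
    print(rule)
```

A rule with $k$ children yields $k-1$ binary rules under this scheme. Collapsing unary chains such as A $\rightarrow$ B into a single node A_B is a separate step that is not shown here.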

Given a grammar in Chomsky Normal Form, we can turn our attention to *parsing* a sentence. The parsing algorithm will return a valid parse tree like the one in figure 6 if the sentence has a valid analysis, or indicate that there is no such valid parse tree.

It follows that one way to characterize a parsing algorithm is that it searches over the set of all possible parse trees. A naive approach might be to exhaustively search through these trees until we find one that obeys all of the rules in the grammar and yields the sentence. In the next section, we'll consider the size of this search space, find that it is very large, and draw the conclusion that this brute-force approach is intractable.

The parse tree of a sentence of length $n$ consists of a binary tree with $n-1$ internal nodes, plus another $n$ nodes connecting the pre-terminals to the terminals. The number of binary trees with $n$ internal nodes can be calculated via the recursion:

\begin{equation}

C_{n} = \sum_{i=0}^{n-1}C_{i}C_{n-1-i}, \qquad C_{0}=1. \tag{1}

\end{equation}

The intuition for this recursion is illustrated in figure 7. This series of integers is known as the Catalan numbers and can be written out explicitly as:

\begin{equation}

C_n = \frac{(2n)!}{(n+1)!n!}. \tag{2}

\end{equation}

Needless to say the series grows extremely fast:

\begin{equation}

1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, \ldots \tag{3}

\end{equation}
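As a quick sanity check, the series can be reproduced numerically. The sketch below uses the standard form of the recursion, $C_n=\sum_{i=0}^{n-1}C_iC_{n-1-i}$ with $C_0=1$, and compares it with the closed form $(2n)!/((n+1)!\,n!)$; the function names are illustrative.

```python
from math import comb


def catalan_recursive(n_max):
    """Return [C_0, ..., C_n_max] using C_n = sum_i C_i * C_(n-1-i), with C_0 = 1."""
    c = [1]
    for n in range(1, n_max + 1):
        c.append(sum(c[i] * c[n - 1 - i] for i in range(n)))
    return c


def catalan_closed(n):
    """Closed form (2n)! / ((n+1)! n!), written as a binomial coefficient."""
    return comb(2 * n, n) // (n + 1)


print(catalan_recursive(11))
print([catalan_closed(n) for n in range(12)])
```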

Consider the example sentence I saw him with the binoculars. Here there are only $C_5=42$ possible trees, but these must be combined with the non-terminals in the grammar (figure 8). In this example, for each of the 42 trees, each of the six leaves must contain one of four possible parts of speech (DT, NN, P, VBD) and each of the five non-leaves must contain one of four possible clause types (S, NP, VP, PP), and so there are $42 \cdot 4^{6} \cdot 4^{5} = 176,160,768$ possible parse trees.

Even this minimal example had a very large number of possible explanations. Now consider that (i) the average sentence length written by Charles Dickens was 20 words, with an associated $C_{20}=6,564,120,420$ possible binary trees and (ii) that there are many more parts of speech and clause types in a realistic model of the English language. It's clear that there are an enormous number of possible parses and it is not practical to employ exhaustive search to find the valid ones.
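The count for the six-word example can be reproduced with a short calculation. The function name and its parameters below are assumptions for this sketch; it follows the counting argument above, in which a sentence of $n$ words has $n-1$ internal nodes.

```python
from math import comb


def num_labelled_trees(n_words, n_pos, n_phrase):
    """Number of candidate parse trees under the counting argument in the text.

    n_words:  sentence length (number of leaves)
    n_pos:    number of part-of-speech labels available at each leaf
    n_phrase: number of clause-type labels available at each internal node
    """
    n_internal = n_words - 1
    n_shapes = comb(2 * n_internal, n_internal) // (n_internal + 1)  # Catalan number
    return n_shapes * n_pos ** n_words * n_phrase ** n_internal


print(num_labelled_trees(6, 4, 4))  # 176160768
```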

The *CYK algorithm* (named after inventors John Cocke, Daniel Younger, and Tadao Kasami) was the first polynomial time parsing algorithm that could be applied to ambiguous CFGs (i.e., CFGs that allow multiple derivations for the same string). In its simplest form, the CYK algorithm solves the *recognition problem*; it determines whether a string $\mathbf{w}$ can be derived from a grammar $G$. In other words, the algorithm takes a sentence and a context-free grammar and returns TRUE if there is a valid parse tree or FALSE otherwise.

This algorithm sidesteps the need to try every possible tree by exploiting the fact that a complete sentence is made by combining sub-clauses, or equivalently, a parse tree is made by combining sub-trees. A tree is only valid if its sub-trees are also valid. The algorithm works from the bottom of the tree upwards, storing possible valid sub-trees as it goes and building larger sub-trees from these components without the need to re-calculate them. As such, CYK is a *dynamic programming* algorithm.

The CYK algorithm is just a few lines of pseudo-code:

```
0  # Initialize data structure
1  chart[1...n, 1...n, 1...V] := FALSE
2
3  # Use unary rules to find possible parts of speech at pre-terminals
4  for p := 1 to n    # start position
5      for each unary rule A -> w_p
6          chart[1, p, A] := TRUE
7
8  # Main parsing loop
9  for l := 2 to n    # sub-string length
10     for p := 1 to n-l+1    # start position
11         for s := 1 to l-1    # split width
12             for each binary rule A -> B C
13                 chart[l, p, A] := chart[l, p, A] OR
                       (chart[s, p, B] AND chart[l-s, p+s, C])
14
15 return chart[n, 1, S]
```

The algorithm is simple, but is hard to understand from the code alone. In the next section, we will present a worked example which makes this much easier to comprehend. Before we do that though, let's make some high level observations. The algorithm consists of four sections:

- **Chart:** On line 1, we initialize a data structure, which is usually known as a *chart* in the context of parsing. This can be thought of as an $n\times n$ table where $n$ is the sentence length. At each position, we have a length $V$ binary vector where $V=|\mathcal{V}|$ is the number of non-terminals (i.e., the total number of clause types and parts of speech).
- **Parts of speech:** In lines 4-6, we run through each word in the sentence and identify whether each part of speech (noun, verb, etc.) is compatible.
- **Main loop:** In lines 8-13, we run through three loops and assign non-terminals to the chart. This groups the words into possible valid sub-phrases.
- **Return value:** In line 15 we return TRUE if the start symbol $S$ is TRUE at position $(n,1)$.

The complexity of the algorithm is easy to discern. Lines 9-13 contain three for loops depending on the sentence length $n$ (lines 9-11) and one more depending on the number of grammar rules $|R|$ (line 12). This gives us a complexity of $\mathcal{O}(n^3 \cdot |R|)$.
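The pseudocode translates almost line for line into Python. The sketch below keeps the same 1-based indexing by length and start position, but stores a set of non-terminals in each cell instead of a length-$V$ binary vector. The grammar is an assumed reconstruction of the minimal grammar for I saw him with the binoculars, pieced together from the rules mentioned in the text; it is not copied from figure 8 and may differ from it.

```python
def cyk_recognize(words, unary_rules, binary_rules, start="S"):
    """Return True if `words` can be derived from `start` under a CNF grammar."""
    n = len(words)
    # chart[l][p]: non-terminals deriving the length-l sub-string starting at word p.
    chart = [[set() for _ in range(n + 2)] for _ in range(n + 1)]

    # Unary rules A -> w_p give the possible parts of speech of each word.
    for p in range(1, n + 1):
        for lhs, word in unary_rules:
            if word == words[p - 1]:
                chart[1][p].add(lhs)

    # Main loop over sub-string length l, start position p and split width s.
    for l in range(2, n + 1):
        for p in range(1, n - l + 2):
            for s in range(1, l):
                for lhs, (b, c) in binary_rules:
                    if b in chart[s][p] and c in chart[l - s][p + s]:
                        chart[l][p].add(lhs)

    return n > 0 and start in chart[n][1]


# Assumed grammar (a reconstruction for illustration, not taken from figure 8).
UNARY = [("NP", "I"), ("VBD", "saw"), ("NN", "saw"), ("NP", "him"),
         ("P", "with"), ("DT", "the"), ("NN", "binoculars")]
BINARY = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("NP", ("NP", "PP")),
          ("VP", ("VBD", "NP")), ("VP", ("VP", "PP")), ("PP", ("P", "NP"))]

print(cyk_recognize("I saw him with the binoculars".split(), UNARY, BINARY))
print(cyk_recognize("I saw the".split(), UNARY, BINARY))
```

The four nested loops are visible directly in the code, which matches the $\mathcal{O}(n^3 \cdot |R|)$ estimate above.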

To make the CYK algorithm easier to understand, we'll use the worked example of parsing the sentence I saw him with the binoculars. We already saw in figure 2 that this sentence has two possible meanings. We'll assume the minimal grammar from figure 8 that is sufficient to parse the sentence. In the next four subsections we'll consider the four parts of the algorithm in turn.

Figure 9 shows the chart for our example sentence, which is itself shown in an extra row under the chart. Each element in the chart corresponds to a sub-string of the sentence. The first index of the chart $l$ represents the length of that sub-string and the second index $p$ is the starting position. So, the element of the chart at position (4,2) represents the sub-string that is length four and starts at word two which is saw him with the. We do not use the upper triangular portion of the chart.
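The indexing convention is easy to check in code. This small sketch maps a chart position $(l,p)$ to the words it covers and lists the cells that are actually used; the helper names are illustrative.

```python
def substring(words, l, p):
    """Words covered by chart cell (l, p): length l, starting at word p (1-indexed)."""
    return words[p - 1 : p - 1 + l]


def valid_cells(n):
    """All (l, p) cells that get filled; the rest of the n x n chart is unused."""
    return [(l, p) for l in range(1, n + 1) for p in range(1, n - l + 2)]


words = "I saw him with the binoculars".split()
print(" ".join(substring(words, 4, 2)))  # saw him with the
print(len(valid_cells(len(words))))      # 21 cells, i.e. n(n+1)/2 for n = 6
```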

The CYK algorithm runs through each of the elements of the chart, starting with strings of length 1 and working through each position and then moving to strings of length 2, and so on, until we finally consider the whole sentence. This explains the loops in lines 9 and 10. The third loop considers possible binary splits of the strings and is indexed by $s$. For position (4,2), the string can be split into saw $|$ him with the ($s=1$, blue boxes), saw him $|$ with the ($s=2$, green boxes), or saw him with $|$ the ($s=3$, red boxes).
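The splits for a given cell, together with the two chart cells that each split consults, can be enumerated as follows. The function name is an assumption for this sketch; the cell arithmetic follows line 13 of the pseudocode.

```python
def binary_splits(words, l, p):
    """List every split of cell (l, p) as (left cell, right cell, left words, right words)."""
    out = []
    for s in range(1, l):                      # split width
        left = words[p - 1 : p - 1 + s]
        right = words[p - 1 + s : p - 1 + l]
        out.append(((s, p), (l - s, p + s), left, right))
    return out


words = "I saw him with the binoculars".split()
for left_cell, right_cell, left, right in binary_splits(words, 4, 2):
    print(left_cell, right_cell, " ".join(left), "|", " ".join(right))
```

For position (4,2) this prints the three splits described above, along with the pairs of shorter cells, (1,2) and (3,3), (2,2) and (2,4), and (3,2) and (1,5), whose contents are combined.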

Now that we understand the meaning of the chart and how it is indexed, let's run through the algorithm step by step. First we deal with strings of length $l=1$ (i.e., the individual words). We run through each unary rule $A \rightarrow w_p$ in the grammar and set these elements to TRUE in the chart (figure 10). There is only one ambiguity here, which is the word saw which could be a past tense verb or a noun. This process corresponds to lines 5-6 of the algorithm.
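This first step amounts to a dictionary-style lookup of each word. In the sketch below the lexicon is an assumption reconstructed from the description (for instance, I and him are taken to be derived directly as NP); it is not copied from figure 8.

```python
def pos_candidates(words, unary_rules):
    """For each word, list the non-terminals A with a unary rule A -> word."""
    return [sorted(lhs for lhs, w in unary_rules if w == word) for word in words]


# Assumed unary rules (lexicon) for the example sentence.
UNARY = [("NP", "I"), ("VBD", "saw"), ("NN", "saw"), ("NP", "him"),
         ("P", "with"), ("DT", "the"), ("NN", "binoculars")]

words = "I saw him with the binoculars".split()
for word, tags in zip(words, pos_candidates(words, UNARY)):
    print(word, tags)
```

Only saw receives two candidate labels, which is the single ambiguity at this stage.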

In the main loop, we consider sub-strings of increasing length starting with pairs of words and working up to the full length of the sentence. For each sub-string, we determine if there is a rule of the form $\text{A}\rightarrow \text{B}\;\text{C}$ that can derive it.

We start with strings of length $l=2$. These can obviously only be split in one way. For each position, we note in the chart all the non-terminals A that can be expanded to generate the parts of speech B and C in the boxes corresponding to the individual words (figure 11).

In the next outer loop, we consider sub-strings of length $l=3$ (figure 12). For each position, we search for a rule that can derive the three words. However, now we must also consider two possible ways to split the length 3 sub-string. For example, for position $(3,2)$ we attempt to derive the sub-string saw him with. This can be split as saw him $|$ with corresponding to positions (2,2)$|$(1,4) which contain VP and P respectively. However, there is no rule of the form $\text{A}\rightarrow\text{VP}\;\text{P}$. Likewise, there is no rule that can derive the split saw $|$ him with since there was no rule that could derive him with. Consequently, we leave position $(3,2)$ empty. However, at position $(3,4)$, the rule $\text{PP}\rightarrow \text{P}\;\text{NP}$ can be applied as discussed in the legend of figure 12.

We continue this process, working upwards through the chart for longer and longer sub-strings (figure 13). For each sub-string length, we consider each position and each possible split and add non-terminals to the chart where we find an applicable rule. We note that position $(5,2)$ in figure 13b corresponding to the sub-string saw him with the binoculars is particularly interesting. Here there are two possible rules $\text{VP}\rightarrow\text{VP}\;\text{PP}$ and $\text{VP}\rightarrow\text{VBD}\;\text{NP}$ that both come to the conclusion that the sub-string can be derived by the non-terminal VP. This corresponds to the original ambiguity in the sentence.

When we reach the top-most row of the chart ($l=6$), we are considering the whole sentence. At this point, we discover if the start symbol $S$ can be used to derive the entire string. If there is such a rule, the sentence is valid under the grammar and if there isn't then it is not. This corresponds to the final line of the CYK algorithm pseudocode. For this example, we use the rule $S\rightarrow \text{NP}\;\text{VP}$ to explain the entire string with the noun phrase I and the verb phrase saw him with the binoculars and conclude that the sentence is valid under this context free grammar.

The basic CYK algorithm just returns a binary variable indicating whether the sentence can be parsed or not under a grammar $G$. Often we are interested in retrieving the parse tree(s). Figure 14 superimposes the paths that led to the start symbol in the top left from figures 11-13. These paths form a *shared parse forest*; two trees share the black paths, but the red paths are only in the first tree and the blue paths are only in the second tree. These two trees correspond to the two possible meanings of the sentence (figure 15).

These two figures show that it is trivial to reconstruct the parse tree once we have run the CYK algorithm as long as we cache the inputs to each position in the chart. We simply start from the start symbol at position (6,1) and work back down through the tree. At any point where there are two inputs into a cell, there is an ambiguity and we must enumerate all combinations of these ambiguities to find all the valid parses. This technique is similar to other dynamic programming problems (e.g., the canonical implementation of the longest common subsequence algorithm computes only the size of the subsequence, but backpointers allow for retrieving the subsequence itself).
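One way to cache the inputs to each cell is to store a list of backpointers per (length, position, non-terminal) entry and then expand them recursively. The sketch below does this with a dictionary; the grammar is again an assumed reconstruction for the example sentence and not the grammar of figure 8, and trees are returned as nested tuples purely for illustration.

```python
def cyk_parse(words, unary_rules, binary_rules, start="S"):
    """Return all parse trees of `words` as nested tuples (label, child, ...)."""
    n = len(words)
    # back[(l, p, A)]: every way in which A was derived at chart cell (l, p).
    back = {}
    for p in range(1, n + 1):
        for lhs, word in unary_rules:
            if word == words[p - 1]:
                back.setdefault((1, p, lhs), []).append(word)
    for l in range(2, n + 1):
        for p in range(1, n - l + 2):
            for s in range(1, l):
                for lhs, (b, c) in binary_rules:
                    left, right = (s, p, b), (l - s, p + s, c)
                    if left in back and right in back:
                        back.setdefault((l, p, lhs), []).append((left, right))

    def trees(key):
        """Expand the backpointers below `key`, enumerating all combinations."""
        label = key[2]
        for entry in back[key]:
            if isinstance(entry, str):          # pre-terminal -> word
                yield (label, entry)
            else:
                left_key, right_key = entry
                for left_tree in trees(left_key):
                    for right_tree in trees(right_key):
                        yield (label, left_tree, right_tree)

    root = (n, 1, start)
    return list(trees(root)) if root in back else []


# Assumed grammar (a reconstruction for illustration, not taken from figure 8).
UNARY = [("NP", "I"), ("VBD", "saw"), ("NN", "saw"), ("NP", "him"),
         ("P", "with"), ("DT", "the"), ("NN", "binoculars")]
BINARY = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("NP", ("NP", "PP")),
          ("VP", ("VBD", "NP")), ("VP", ("VP", "PP")), ("PP", ("P", "NP"))]

parses = cyk_parse("I saw him with the binoculars".split(), UNARY, BINARY)
print(len(parses))  # 2
for tree in parses:
    print(tree)
```

Under this assumed grammar the two trees differ only in whether the prepositional phrase attaches to the verb phrase or to the noun phrase him.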

The previous example was relatively unambiguous. For a bit of fun, we'll also show the results on the famously difficult-to-understand sentence Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo. Surprisingly, this is a valid English sentence. To comprehend it, you need to know that (i) buffalo is a plural noun describing animals that are also known as bison, (ii) Buffalo is a city, and (iii) buffalo is a verb that means "to intimidate". The meaning of the sentence is thus:

*Bison from the city Buffalo that are intimidated by other bison from the city Buffalo, themselves intimidate yet other bison from the city Buffalo.*

To make things even harder, we'll assume that the text is written in all lower case, and so each instance of buffalo could correspond to any of the three meanings. Could you come up with a grammar that assigns an intuitive analysis to this sentence? In Figure 16 we provide a minimal, but sufficient grammar that allows the CYK algorithm to find a single and reasonable parse tree for this strange sentence.

In this part of the blog, we have described the CYK algorithm for the recognition problem; the algorithm determines whether a string can be generated by a given grammar. It is a classic example of a dynamic programming algorithm that explores an exponential search space in polynomial time by storing intermediate results. Another way of thinking about the CYK algorithm from a less *procedural* and more *declarative* perspective is that it is performing logical deduction. The axioms are the grammar rules and we are presented with facts which are the words. For a given sub-string length, we deduce new facts by applying the rules of the grammar $G$ to facts (or axioms) that we had previously deduced about shorter sub-strings. We keep applying the rules to reach new facts about which sub-strings are derivable by $G$ with the goal of proving that $S$ derives the sentence.

Note that we have used an unconventional indexing for the chart in our description. For a more typical presentation, consult these slides.

In part II, we will consider assigning probabilities to the production rules, so when the parse is ambiguous, we can assign probabilities to the different meanings. We will also consider the inside-outside algorithm which helps learn these probabilities.

^{1} This famous example was used in *Syntactic Structures* by Noam Chomsky in 1957 to motivate the independence of syntax and semantics.

^{2} The idea that sentences are recursively built up from smaller coherent parts dates back at least to a Sanskrit sutra of around 4000 verses known as Aṣṭādhyāyī written by Pāṇini probably around the 6th-4th century BC.

For many of the government and industry partners in attendance at the AI4Good Lab’s Industry Night organised by CIFAR this year, the plan was to provide support, mentorship and guidance to the participants of the AI4Good Lab – a Canadian AI training initiative for women-identified STEM students. Industry experts would spend time with participants, exploring their fields of study, their goals and ambitions, and future career opportunities in the field.

While COVID-19 may have prevented attendees from being together in person, the organizers ensured everyone felt relaxed and comfortable in the virtual booths. The Borealis AI/RBC Amplify booth, for example, featured a ‘virtually comfortable’ L-shape couch, two square stools and a rectangle coffee table. Non-virtual drinks were, sadly, ‘BYOB.’

The two hostesses of the Borealis AI/RBC Amplify booth, Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI and Rachael Rishworth from RBC Amplify, talked with the AI4Good Lab participants about the wide range of learning and career opportunities available as students continue their journey of lifelong learning – from the Borealis AI Fellowships (which support AI researchers at Canadian Universities) and Borealis AI Internships through to the RBC Amplify program, which provides interns with hands-on prototype development opportunities at the bank.

The AI4Good Lab participants certainly seemed enthusiastic to learn and share. The booth was full for the entire 2 hours – a testament to the quality of the discussions (and, perhaps, the comfort of the virtual furniture?).

Alongside other big names in AI such as CIFAR, AMII, Vector Institute, Google Canada, DeepMind, Accenture and Manulife, the Borealis AI/RBC Amplify team offered participants a view into the wide array of initiatives and opportunities available in the AI space. They also spent time answering the participants’ questions about careers in the field of AI.

Yet it wasn’t just the students of the AI4Good Lab that were learning that night; so, too, were the industry partners and booth hosts and hostesses. Virtual networking lounges were placed in between booths, creating unique spaces that encouraged fruitful discussions among all participants – students, partners and organizers. Hosts and hostesses also visited each other’s booths to talk with ecosystem partners; in fact, rumor has it that Eirene was spotted on one of the ‘virtually hipster’ stools at the Vector Institute booth, taking a few minutes to chat with good friends and their guests at the end of the event.

More importantly, perhaps, the event highlighted the future impact the participants can make in the field and in the world around them. Since the start of the AI4Good Lab program in early May, the female-identified students participating in the Lab’s two cohorts in Montreal and Edmonton have been building their AI skills and capabilities, in order to conceptualize, design and develop a prototype of an AI application for social good. It is their ideas, research and development that will shape the debate around the value and ethics of AI in the future.

Ultimately, the AI4Good Industry Night demonstrated that learning is a life-long and collaborative journey. Industry participants shared their experience and insights; the students and the organizers of the AI4Good Lab shared theirs. Everybody left the event with a renewed sense of optimism, new ideas and new network connections.

On behalf of the attendees of the AI4Good Lab Industry Night, we would like to thank Maya Marcus-Sells and Yosra Kazemi for organizing a fantastic event in the face of the continued disruption of COVID-19.

Below are just a few photos of the event. We are confident the ideas generated there will emerge into view over the coming months and years.

]]>Canada’s AI research ecosystem has a long history of producing cutting-edge work, leading to the highest concentration of deep learning researchers and students in the world (Invest in Canada, 2017).

“At Borealis AI, we believe Canada’s continued leadership as a global destination for the study of AI requires ongoing support and investment from the business community. As one of the leading voices on AI in Canada, we are committed to helping grow the ecosystem – supporting those researchers, universities, startups and companies that are driving the next wave of exploration and innovation,” noted Dr. Kathryn Hume, Interim Head of Borealis AI.

This year’s Fellowships were awarded to students at nine Canadian universities, from Dalhousie University on the Atlantic to UBC on the Pacific. The ten Fellows – five women and five men – reflect diverse backgrounds and research areas, focusing their skills on problems that range from measuring the level of privacy in anonymous databases through to uncovering new ways to screen for prostate cancer.

“We admire the great Machine Learning research being conducted within Canada’s academic programs and research institutes like AMII, MILA and Vector Institute. And we are keen to support the young research talent flowing out of our universities. By investing in cutting-edge deep learning researchers, their universities and advisors, our goal is to build and strengthen the broader Machine Learning research ecosystem in Canada,” added Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI.

Faculty: Dr. Martha White

Borealis AI 2021 Fellow: Vincent Liu

Research topic: Developing batch reinforcement learning algorithms with theoretical guarantees

Faculty: Dr. Purang Abolmaesumi

Borealis AI 2021 Fellow: Golara Javadi

Research topic: Applying Machine Learning to create novel techniques for prostate cancer detection

Faculty: Dr. Arash Mohammadi

Borealis AI 2021 Fellow: Parnian Afshar

Research topic: Deep learning-based radiomics for disease diagnosis

Faculty: Dr. Sageev Oore

Borealis AI 2021 Fellow: Chandramouli Shama Sastry

Research topic: Applying generative models to the identification of distribution shifts and the learning of robust representations

Faculty: Dr. Joelle Pineau

Borealis AI 2021 Fellow: Lucas Page-Caccia

Research topic: The development of neural representations that adapt to new data

Faculty: Dr. Doina Precup

Borealis AI 2021 Fellow: Veronica Chelu

Research topic: Temporal credit assignment problems in reinforcement learning

Faculty: Dr. Hans U. Boden

Borealis AI 2021 Fellow: Lindsay White

Research topic: Applying algebraic topology to measure privacy in anonymous databases

Faculty: Dr. Xiaodan Zhu

Borealis AI 2021 Fellow: Xiaoyu Yang

Research topic: Natural language reasoning and incorporating external knowledge into neural networks

Faculty: Dr. Yasutaka Furukawa

Borealis AI 2021 Fellow: Nelson Nauata

Research topic: Structured reasoning, structured generative models, geometry generation, and geometry reconstruction

Faculty: Dr. Kimon Fountoulakis

Borealis AI 2021 Fellow: Shenghao Yang

Research topic: Combining discrete and continuous optimization methods for graph-based Machine Learning

“These Borealis AI Fellowships are a strong endorsement of the hard work being done at Canada’s Universities and Machine Learning Research Institutes. More importantly, they directly support Canadian research and research teams – like those at Dalhousie University – as they strive to advance the field of Machine Learning,” added Dr. Sageev Oore, Faculty at Dalhousie University and Vector Institute and Advisor to the Dalhousie University Borealis AI 2021 Fellow.

The new cycle of Fellowship applications for the next academic year will open this fall. Please refer to our site for details and information about applying to our Graduate Fellowship program.

These fellowships are part of Borealis AI’s commitment to support Canadian academic excellence in AI and Machine Learning. They provide financial assistance for exceptional domestic and international graduate students to carry out fundamental research, as they pursue their Masters and PhDs in various fields of AI. The program is one of a number of Borealis AI initiatives designed to strengthen the partnership between academia and industry and advance the momentum of Canada’s leadership in the AI space.

To learn more visit: https://www.borealisai.com/en/about/fellowships/

]]>