Publications | Nikita Kitaev

2022

ACL

Learned Incremental Representations for Parsing

Kitaev, Nikita, Lu, Thomas, and Klein, Dan

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) May 2022

We present an incremental syntactic representation that consists of assigning a single discrete label to each word in a sentence, where the label is predicted using strictly incremental processing of a prefix of the sentence, and the sequence of labels for a sentence fully determines a parse tree. Our goal is to induce a syntactic representation that commits to syntactic choices only as they are incrementally revealed by the input, in contrast with standard representations that must make output choices such as attachments speculatively and later throw out conflicting analyses. Our learned representations achieve 93.72 F1 on the Penn Treebank with as few as 5 bits per word, and at 8 bits per word they achieve 94.97 F1, which is comparable with other state of the art parsing models when using the same pre-trained embeddings. We also provide an analysis of the representations learned by our system, investigating properties such as the interpretable syntactic features captured by the system and mechanisms for deferred resolution of syntactic ambiguities.

2021

Interactive Assignments for Teaching Structured Neural NLP

Gaddy, David, Fried, Daniel, Kitaev, Nikita, Stern, Mitchell, Corona, Rodolfo, DeNero, John, and Klein, Dan

In Proceedings of the Fifth Workshop on Teaching NLP Jun 2021

We present a set of assignments for a graduate-level NLP course. Assignments are designed to be interactive, easily gradable, and to give students hands-on experience with several key types of structure (sequences, tags, parse trees, and logical forms), modern neural architectures (LSTMs and Transformers), inference algorithms (dynamic programs and approximate search) and training methods (full and weak supervision). We designed assignments to build incrementally both within each assignment and across assignments, with the goal of enabling students to undertake graduate-level research in NLP by the end of the course.

2020

ACL

Tetra-Tagging: Word-Synchronous Parsing with Linear-Time Inference

Kitaev, Nikita, and Klein, Dan

In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics Jul 2020

We present a constituency parsing algorithm that, like a supertagger, works by assigning labels to each word in a sentence. In order to maximally leverage current neural architectures, the model scores each word’s tags in parallel, with minimal task-specific structure. After scoring, a left-to-right reconciliation phase extracts a tree in (empirically) linear time. Our parser achieves 95.4 F1 on the WSJ test set while also achieving substantial speedups compared to current state-of-the-art parsers with comparable accuracies.

ICLR

Reformer: The Efficient Transformer

Kitaev, Nikita, Kaiser, Lukasz, and Levskaya, Anselm

In International Conference on Learning Representations Jul 2020

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L^2) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
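
The reversible residual idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual two-stream formulation (y1 = x1 + F(x2), y2 = x2 + G(y1)); the function names and the toy sub-layers `toy_f` and `toy_g` are hypothetical stand-ins, not the Reformer implementation. The point is that inputs can be recomputed from outputs, so per-layer activations need not be stored.

```python
def rev_forward(x1, x2, f, g):
    """Reversible residual block: map inputs (x1, x2) to outputs (y1, y2)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2, f, g):
    """Recover (x1, x2) from (y1, y2) by undoing the two additions in reverse order."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


# Hypothetical stand-ins for the attention and feed-forward sub-layers.
def toy_f(v):
    return [2 * t for t in v]


def toy_g(v):
    return [t + 1 for t in v]


y1, y2 = rev_forward([1, 2], [3, 4], toy_f, toy_g)
x1, x2 = rev_inverse(y1, y2, toy_f, toy_g)
print(x1, x2)  # the original inputs are reconstructed from the outputs
```

Because the inverse only calls `f` and `g` again, a backward pass can rebuild each layer's inputs on the fly, trading extra computation for memory.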

ICLR

Multilingual Alignment of Contextual Word Representations

Cao, Steven, Kitaev, Nikita, and Klein, Dan

In International Conference on Learning Representations Jul 2020

We propose procedures for evaluating and strengthening contextual embedding alignment and show that they are useful in analyzing and improving multilingual BERT. In particular, after our proposed alignment procedure, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model, remarkably matching pseudo-fully-supervised translate-train models for Bulgarian and Greek. Further, to measure the degree of alignment, we introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer. Using this word retrieval task, we also analyze BERT and find that it exhibits systematic deficiencies, e.g. worse alignment for open-class parts-of-speech and word pairs written in different scripts, that are corrected by the alignment procedure. These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.

EMNLP

Unsupervised Parsing via Constituency Tests

Cao, Steven, Kitaev, Nikita, and Klein, Dan

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) Nov 2020

We propose a method for unsupervised parsing based on the linguistic notion of a constituency test. One type of constituency test involves modifying the sentence via some transformation (e.g. replacing the span with a pronoun) and then judging the result (e.g. checking if it is grammatical). Motivated by this idea, we design an unsupervised parser by specifying a set of transformations and using an unsupervised neural acceptability model to make grammaticality decisions. To produce a tree given a sentence, we score each span by aggregating its constituency test judgments, and we choose the binary tree with the highest total score. While this approach already achieves performance in the range of current methods, we further improve accuracy by fine-tuning the grammaticality model through a refinement procedure, where we alternate between improving the estimated trees and improving the grammaticality model. The refined model achieves 62.8 F1 on the Penn Treebank test set, an absolute improvement of 7.6 points over the previously best published result.
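
Choosing the binary tree with the highest total span score, as the abstract describes, can be done with a CKY-style dynamic program. The sketch below is one standard way to do this, not necessarily the authors' code; `best_binary_tree`, `span_score`, and the toy scores are hypothetical, and real scores would come from aggregated constituency test judgments.

```python
from functools import lru_cache


def best_binary_tree(n, span_score):
    """Return (total_score, tree) for the best binary tree over words 0..n-1.

    span_score(i, j) scores the span covering words i..j-1.
    A leaf is the tuple (i, i + 1); an internal node is ((i, j), left, right).
    """
    @lru_cache(maxsize=None)
    def best(i, j):
        score = span_score(i, j)
        if j - i == 1:
            return score, (i, j)
        best_split = None
        for k in range(i + 1, j):  # try every split point
            left_score, left_tree = best(i, k)
            right_score, right_tree = best(k, j)
            total = left_score + right_score
            if best_split is None or total > best_split[0]:
                best_split = (total, left_tree, right_tree)
        total, left_tree, right_tree = best_split
        return score + total, ((i, j), left_tree, right_tree)

    return best(0, n)


# Toy scores (assumed): only the span over the first two words is rewarded.
toy_scores = {(0, 2): 1}
score, tree = best_binary_tree(3, lambda i, j: toy_scores.get((i, j), 0))
print(score, tree)
```

With memoization, each of the O(n^2) spans is solved once and considers O(n) split points, giving the familiar O(n^3) running time.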

SMYRF-Efficient Attention using Asymmetric Clustering

Daras, Giannis, Kitaev, Nikita, Odena, Augustus, and Dimakis, Alexandros G

Advances in Neural Information Processing Systems Nov 2020

We propose a novel type of balanced clustering algorithm to approximate attention. Attention complexity is reduced from O(N^2) to O(N log N), where N is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive Hashing (LSH) in a novel way by defining new Asymmetric transformations and an adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF is that it can be used as a drop-in replacement for dense attention layers without any retraining. On the contrary, prior fast attention methods impose constraints (e.g. queries and keys share the same vector representations) and require re-training from scratch. We apply our method to pre-trained state-of-the-art Natural Language Processing and Computer Vision models and we report significant memory and speed benefits. Notably, SMYRF-BERT outperforms (slightly) BERT on GLUE, while using 50% less memory. We also show that SMYRF can be used interchangeably with dense attention before and after training. Finally, we use SMYRF to train GANs with attention in high resolutions. Using a single TPU, we were able to scale attention to 128x128=16k and 256x256=65k tokens on BigGAN on CelebA-HQ.
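
To give a rough sense of scale for the token counts quoted in the abstract, the sketch below compares N^2 with N log2 N for N = 128x128 and N = 256x256. This is plain arithmetic on the asymptotic expressions, with constant factors ignored and base-2 logarithms assumed; the function names are hypothetical and the numbers are not measured costs of SMYRF or any other implementation.

```python
import math


def dense_pairs(n):
    """Number of query-key pairs scored by dense attention over n tokens."""
    return n * n


def n_log_n(n):
    """Growth term n * log2(n), ignoring constant factors."""
    return n * math.log2(n)


for side in (128, 256):
    n = side * side  # tokens in a side x side feature map
    print(n, dense_pairs(n), int(n_log_n(n)))
```

For 16,384 tokens the quadratic term is already in the hundreds of millions, which is why dense attention at these resolutions is hard to fit in memory.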

2019

KERMIT: Generative insertion-based modeling for sequences

Chan, William, Kitaev, Nikita, Guu, Kelvin, Stern, Mitchell, and Uszkoreit, Jakob

arXiv preprint arXiv:1906.01604 Nov 2019

We present KERMIT, a simple insertion-based approach to generative modeling for sequences and sequence pairs. KERMIT models the joint distribution and its decompositions (i.e., marginals and conditionals) using a single neural network and, unlike much prior work, does not rely on a prespecified factorization of the data distribution. During training, one can feed KERMIT paired data (x,y) to learn the joint distribution p(x,y), and optionally mix in unpaired data x or y to refine the marginals p(x) or p(y). During inference, we have access to the conditionals p(x|y) and p(y|x) in both directions. We can also sample from the joint distribution or the marginals. The model supports both serial fully autoregressive decoding and parallel partially autoregressive decoding, with the latter exhibiting an empirically logarithmic runtime. We demonstrate through experiments in machine translation, representation learning, and zero-shot cloze question answering that our unified approach is capable of matching or exceeding the performance of dedicated state-of-the-art systems across a wide range of tasks without the need for problem-specific architectural adaptation.

ACL

Cross-Domain Generalization of Neural Constituency Parsers

Fried, Daniel, Kitaev, Nikita, and Klein, Dan

In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics Jul 2019

Neural parsers obtain state-of-the-art results on benchmark treebanks for constituency parsing—but to what degree do they generalize to other domains? We present three results about the generalization of neural parsers in a zero-shot setting: training on trees from one corpus and evaluating on out-of-domain corpora. First, neural and non-neural parsers generalize comparably to new domains. Second, incorporating pre-trained encoder representations into neural parsers substantially improves their performance across all domains, but does not give a larger relative improvement for out-of-domain treebanks. Finally, despite the rich input representations they learn, neural parsers still benefit from structured output prediction of output trees, yielding higher exact match accuracy and stronger generalization both to larger text spans and to out-of-domain corpora. We analyze generalization on English and Chinese corpora, and in the process obtain state-of-the-art parsing results for the Brown, Genia, and English Web treebanks.

ACL

Multilingual Constituency Parsing with Self-Attention and Pre-Training

Kitaev, Nikita, Cao, Steven, and Klein, Dan

In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics Jul 2019

We show that constituency parsing benefits from unsupervised pre-training across a variety of languages and a range of pre-training conditions. We first compare the benefits of no pre-training, fastText, ELMo, and BERT for English and find that BERT outperforms ELMo, in large part due to increased model capacity, whereas ELMo in turn outperforms the non-contextual fastText embeddings. We also find that pre-training is beneficial across all 11 languages tested; however, large model sizes (more than 100 million parameters) make it computationally expensive to train separate models for each language. To address this shortcoming, we show that joint multilingual pre-training and fine-tuning allows sharing all but a small number of parameters between ten languages in the final model. The 10x reduction in model size compared to fine-tuning one model per language causes only a 3.2% relative error increase in aggregate. We further explore the idea of joint fine-tuning and show that it gives low-resource languages a way to benefit from the larger datasets of other languages. Finally, we demonstrate new state-of-the-art results for 11 languages, including English (95.8 F1) and Chinese (91.8 F1).

ACL

CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication

Kim, Jin-Hwa, Kitaev, Nikita, Chen, Xinlei, Rohrbach, Marcus, Zhang, Byoung-Tak, Tian, Yuandong, Batra, Dhruv, and Parikh, Devi

In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics Jul 2019

In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human players. We define protocols and metrics to evaluate learned agents in this testbed, highlighting the need for a novel “crosstalk” evaluation condition which pairs agents trained independently on disjoint subsets of the training data. We present models for our task and benchmark them using both fully automated evaluation and by having them play the game live with humans.

2018

ACL

Constituency Parsing with a Self-Attentive Encoder

Kitaev, Nikita, and Klein, Dan

In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Jul 2018

We demonstrate that replacing an LSTM encoder with a self-attentive architecture can lead to improvements to a state-of-the-art discriminative constituency parser. The use of attention makes explicit the manner in which information is propagated between different locations in the sentence, which we use to both analyze our model and propose potential improvements. For example, we find that separating positional and content information in the encoder can lead to improved parsing accuracy. Additionally, we evaluate different approaches for lexical representation. Our parser achieves new state-of-the-art results for single models trained on the Penn Treebank: 93.55 F1 without the use of any external data, and 95.13 F1 when using pre-trained word representations. Our parser also outperforms the previous best-published accuracy figures on 8 of the 9 languages in the SPMRL dataset.

2017

EMNLP

Where is Misty? Interpreting Spatial Descriptors by Modeling Regions in Space

Kitaev, Nikita, and Klein, Dan

In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing Sep 2017

We present a model for locating regions in space based on natural language descriptions. Starting with a 3D scene and a sentence, our model is able to associate words in the sentence with regions in the scene, interpret relations such as ‘on top of’ or ‘next to,’ and finally locate the region described in the sentence. All components form a single neural network that is trained end-to-end without prior knowledge of object segmentation. To evaluate our model, we construct and release a new dataset consisting of Minecraft scenes with crowdsourced natural language descriptions. We achieve a 32% relative error reduction compared to a strong neural baseline.

2015

Physics-based trajectory optimization for grasping in cluttered environments

Kitaev, Nikita, Mordatch, Igor, Patil, Sachin, and Abbeel, Pieter

In 2015 IEEE International Conference on Robotics and Automation (ICRA) Sep 2015

Grasping an object in a cluttered, unorganized environment is challenging because of unavoidable contacts and interactions between the robot and multiple immovable (static) and movable (dynamic) obstacles in the environment. Planning an approach trajectory for grasping in such situations can benefit from physics-based simulations that describe the dynamics of the interaction between the robot manipulator and the environment. In this work, we present a physics-based trajectory optimization approach for planning grasp approach trajectories. We present novel cost objectives and identify failure modes relevant to grasping in cluttered environments. Our approach uses rollouts of physics-based simulations to compute the gradient of the objective and of the dynamics. Our approach naturally generates behaviors such as choosing to push objects that are less likely to topple over, recognizing and avoiding situations which might cause a cascade of objects to fall over, and adjusting the manipulator trajectory to push objects aside in a direction orthogonal to the grasping direction. We present results in simulation for grasping in a variety of cluttered environments with varying levels of density of obstacles in the environment. Our experiments in simulation indicate that our approach outperforms a baseline approach that considers multiple straight-line trajectories modified to account for static obstacles by an aggregate success rate of 14% with varying degrees of object clutter.

2013

BOSS: Building Operating System Services

Dawson-Haggerty, Stephen, Krioukov, Andrew, Taneja, Jay, Karandikar, Sagar, Fierro, Gabe, Kitaev, Nikita, and Culler, David

Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13) Sep 2013

Commercial buildings are attractive targets for introducing innovative cyber-physical control systems, because they are already highly instrumented distributed systems which consume large quantities of energy. However, they are not currently programmable in a meaningful sense because each building is constructed with vertically integrated, closed subsystems and without uniform abstractions to write applications against. We develop a set of operating system services called BOSS, which supports multiple portable, fault-tolerant applications on top of the distributed physical resources present in large commercial buildings. We evaluate our system based on lessons learned from deployments of many novel applications in our test building, a four-year-old, 140,000 sf building with modern digital controls, as well as partial deployments at other sites.

2012

Building application stack (BAS)

Krioukov, Andrew, Fierro, Gabe, Kitaev, Nikita, and Culler, David

In Proceedings of the Fourth ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Buildings Sep 2012

Many commercial buildings have digital controls and extensive sensor networks that can be used to develop novel applications for saving energy, detecting faults, improving comfort, etc. However, buildings are custom designed, leading to differences in functionality, connectivity, controls and operation. As a result today’s building applications are hard to write and non-portable. What is required is a form of mass customization that allows applications to automatically adapt to differences in buildings.