6.884: project ideas
- Most of the experiments we're looking at in this class, especially on very small datasets, involve details of English syntax or morphology. Do these experiments change if performed on typologically very different languages? (E.g.: Do recent results on past tense formation in neural models carry over to languages whose tense systems are more complex than English's? Do tree-shaped models do anything in non-configurational languages?)
- Take one of the diagnostic datasets we looked at in class (SCAN, CLUTRR, etc.). Identify a new kind of generalization we might expect a model to make and implement a corresponding dataset split. Does the relative performance of existing models change on this dataset? Can you come up with new model architectures (neural, symbolic, or both) that improve performance? Concrete ideas: operationalize classical accounts of the "poverty of the stimulus" by systematically excluding complex wh-questions from training data; split SCAN (or more realistic semantic parsing datasets) by syntactic depth rather than length (see the first sketch after this list).
- If you work with data from human subjects: train a model to perform the same task as your humans (as in the Hu et al. paper). Compare model predictions to human behavior / judgments / brain recordings. Which models perform best? Are models with explicit symbolic scaffolding (RNNGs, etc.) helpful? (A surprisal-vs-reading-time sketch appears after this list.)
- Characterize, empirically or theoretically, behavior on out-of-sample data in "unstructured" neural models like convnets or transformers. We saw that sequence-to-sequence models often do the "wrong" thing from the perspective of human users; what do they do instead? Something like this paper, but with training data. (A small error-characterization sketch appears after this list.)
- For tasks with unstructured inputs but where an explicit symbolic representation of the input is available at training time (e.g. kinship graphs in CLUTRR), what's the best way to use it when making predictions from unstructured data? Add prediction of the structured object as an auxiliary loss at training time? Treat the structured object as a latent variable and marginalize over it at test time? (An auxiliary-loss sketch appears after this list.)
- Analyze "symbolic" inductive bias in non-linguistic tasks with models trained on language data. That is: get a big pretrained language model, fine-tune it on reasoning problems encoded with arbitrary symbols, and measure performance. (See e.g. this blog post.) Does pretraining help? Why?