Thus, the expected outcomes are:
(1) An in-depth theoretical and practical understanding of neural network methods and of the theoretical basis of multimodal neural architectures.
(2) Development of sensitivity analysis tools for NLG in general, and especially for NLG from visual input, that go beyond the state of the art.
(3) Implementation of diverse multimodal neural architectures.
(4) A new dataset for vision-text tasks (e.g., image captioning, grounded textual entailment generation), with an emphasis on high diversity and on the sensitivity of textual samples to visual inputs, and with attention to design issues so as to avoid redundancy between modalities.
(5) Design and execution of experiments, both automatic and involving human participants, to evaluate the sensitivity and groundedness of multimodal models.