Publications

A list of my research publications and preprints.

Manifold Research Publications, 2025 (Technical Report)

Benchmarking the Generality of Vision-Language-Action Models

Guruprasad P, Chowdhury S, Sikka H, Sharma M, Lu H, Rivera S, Khurana A, Wang Y

Our findings reveal significant insights into the current state of multimodal AI, highlighting both promising capabilities and critical limitations that inform future research directions. We release our complete benchmark suite, evaluation framework, and detailed analysis to accelerate progress in this field.

AAAI-AIA, 2026 (Conference)

Confirmation bias: A challenge for scalable oversight

Recchia G, Mangat CS, Nyachhyon J, Sharma M, Canavan C, Epstein-Gross D, Abdulbari M

We conducted two studies examining the performance of simple oversight protocols where evaluators know that the model is correct most of the time, but not all of the time.

AACL-IJCNLP, 2025 (Conference)

Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks

Nyachhyon J, Sharma M, Thapa P, Bal BK

We introduce twelve new datasets, creating a new benchmark, the Nepali Language Understanding Evaluation (NLUE) benchmark, for evaluating models across a diverse set of Natural Language Understanding (NLU) tasks. The added tasks include single-sentence classification, similarity and paraphrase tasks, and Natural Language Inference (NLI) tasks. Evaluating existing models on the added tasks, we observe that they fall short in handling complex NLU tasks effectively.

(CHiPSAL) COLING, 2025 (Conference Workshop)

Development of Pre-Trained Transformer-based Models for the Nepali Language

Thapa P, Nyachhyon J, Sharma M, Bal BK

We collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three models (BERT, RoBERTa, and GPT-2) exclusively for the Nepali language.

arXiv, 2025 (Preprint)

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Chang TA, Arnett C, ..., Sharma M, ...

We present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements.

arXiv, 2025 (Preprint)

Local Herb Identification Using Transfer Learning: A CNN-Powered Mobile Application for Nepalese Flora

Thapa P, Sharma M, Nyachhyon J, Pandeya YR

We collected image datasets of local herbs and trained vision models to classify them.