Case Study

Academic Research Support with Custom NLP and Classification Models

Company	Industry
	Academic Research

Client

The InfoCoV project was led by the Faculty of Informatics and Digital Technologies at the University of Rijeka (UNIRi). The project focused on understanding how information spreads during crises, and its team developed a research framework for analyzing COVID-19-related communication on social media. This case study reflects a collaboration between academia and industry, in which Velebit AI supported the project with advanced NLP model development and implementation.

Problem

The project aimed to analyze how information and sentiment around COVID-19 spread on Twitter – specifically in Croatian, a language with limited NLP resources. The client team had collected a large dataset of tweets and associated metadata, but needed custom-trained models to extract meaningful insights from it.

Our role focused on building language models, developing classification systems, and integrating multiple data sources into a unified workflow. The end goal was to predict tweet sentiment and estimate how likely a tweet was to be retweeted.

Challenges

Massive communication datasets

The dataset included over 200,000 COVID-19-related tweets, over half a million user comments on COVID-19 articles in online portals, and over 180,000 full-text articles. The text varied widely in language use and had little structure. Extracting accurate insights from this volume, especially during a fast-evolving global crisis, required robust models and targeted pre-processing.

Low-resource language limitations

Croatian lacks large pretrained NLP resources, making transfer learning less effective out of the box. We needed to train and fine-tune models specifically on Croatian COVID-related texts to reach usable performance levels.

COVID-19-related new vocabulary

New pandemic-related terms, hashtags, and phrases emerged quickly, many of which weren’t captured by existing tokenizers. Accurately handling this vocabulary was key to effective sentiment and topic classification.
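
One way to spot such gaps is to count frequent terms in the corpus that the existing vocabulary does not contain. The sketch below is illustrative only: the function name `find_new_terms`, the toy vocabulary, and the sample tweets are assumptions made for this example, not the project's actual code or data.

```python
import re
from collections import Counter


def find_new_terms(texts, known_vocab, min_count=2):
    """Return (token, count) pairs for frequent tokens missing from known_vocab.

    Tokens are lowercased words or hashtags; \\w is Unicode-aware in Python 3,
    so Croatian letters such as č, š and ž are kept inside words.
    """
    token_pattern = re.compile(r"#?\w+")
    counts = Counter(
        token
        for text in texts
        for token in token_pattern.findall(text.lower())
    )
    return [
        (token, count)
        for token, count in counts.most_common()
        if count >= min_count and token not in known_vocab
    ]


# Hypothetical toy vocabulary and tweets, made up for illustration.
known_vocab = {"danas", "je", "novi", "u", "i", "za"}
tweets = [
    "Danas je novi #lockdown u Zagrebu",
    "Samoizolacija i #lockdown za sve",
    "Samoizolacija je danas obavezna",
]

new_terms = find_new_terms(tweets, known_vocab)
print(new_terms)
```

Terms surfaced this way are candidates for vocabulary extension. With Hugging Face tokenizers, one common option is `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether the project used this route is not stated here.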

Imbalanced sentiment labels

The human-labeled tweet sentiment dataset of 10,000 examples (Senti-Cro-CoV-Tweets) was heavily skewed toward neutral and negative examples. Training models on this imbalance risked poor performance on underrepresented classes, requiring careful balancing and loss adjustments.
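
A common loss adjustment for skewed labels is to weight each class inversely to its frequency. The sketch below assumes that approach; the label counts are invented for illustration and do not reflect the real Senti-Cro-CoV-Tweets distribution, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes get weights below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Made-up label distribution, skewed toward neutral and negative.
labels = ["neutral"] * 6 + ["negative"] * 3 + ["positive"] * 1

weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Weights like these can be passed to a weighted cross-entropy loss, for example through the `weight` argument of `torch.nn.CrossEntropyLoss`, so that mistakes on the rare class cost more during training.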

Integration of knowledge from various data sources

Combining tweet text with structured user data (e.g., follower count, engagement history) introduced architectural and normalization complexity. But this integration was critical for improving prediction accuracy, especially for retweet forecasting.
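
Part of the normalization difficulty comes from count features such as follower numbers, which span several orders of magnitude. A minimal sketch of one standard treatment, a log transform followed by standardization, is shown below. The helper `log_standardize` and the follower counts are assumptions for illustration, not the project's actual pipeline.

```python
import math
from statistics import mean, pstdev


def log_standardize(values):
    """Compress heavy tails with log1p, then scale to zero mean and unit variance."""
    logged = [math.log1p(v) for v in values]
    mu = mean(logged)
    sigma = pstdev(logged)
    if sigma == 0:
        # All values identical: return zeros instead of dividing by zero.
        return [0.0 for _ in logged]
    return [(x - mu) / sigma for x in logged]


# Made-up follower counts ranging from a small account to a very large one.
follower_counts = [12, 150, 980, 25_000, 1_200_000]

scaled = log_standardize(follower_counts)
print([round(x, 2) for x in scaled])
```

After this step the structured features sit on a scale comparable to learned text representations, which tends to make it easier to feed both into one model.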

Solution

We fine-tuned two language models, BERT and ELECTRA, in a self-supervised fashion on a large unlabeled corpus of Croatian COVID-related text. On top of them, we implemented supervised models for two main tasks: predicting tweet sentiment and estimating how likely a tweet was to be retweeted. These models combined the tweet text with structured features such as user metadata and were implemented as custom PyTorch architectures.

To boost performance, we engineered additional features from both tweet-level and user-level data and ran feature importance analysis to better understand what was driving the predictions. All models and code were delivered with clear documentation, making it easy for the researchers to adapt and reuse the work in future studies.
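
The case study does not say which feature importance method was used. As one model-agnostic possibility, the sketch below shows permutation importance: shuffle one feature column at a time and measure how much accuracy drops. The toy model, the feature layout (follower count, tweet length), and the data are all assumptions invented for this example.

```python
import random


def accuracy(predict, rows, targets):
    """Fraction of rows where the prediction matches the target."""
    return sum(predict(r) == t for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, n_features, n_repeats=20, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, targets)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, targets))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in model: predicts 'retweeted' from follower count only."""
    return int(row[0] > 1000)


# Made-up rows of (follower_count, tweet_length).
rows = [
    (50, 120), (300, 80), (700, 200), (900, 45),
    (1500, 140), (4000, 60), (20000, 230), (90000, 100),
]
targets = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, targets, n_features=2)
print(importances)
```

Because the toy model ignores tweet length, shuffling that column never changes a prediction and its importance is zero, while shuffling follower count lowers accuracy. On a real model the same comparison indicates which tweet-level and user-level features the predictions depend on.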

Tools and Technologies

Python

PyTorch

Hugging Face

LightGBM

Scikit-learn

Results

The final models combined structured and unstructured data in a way that significantly improved prediction accuracy, especially for sentiment and virality tasks. By tailoring the language models to Croatian COVID-19 content, we helped the team analyze complex patterns in a challenging, low-resource language setting. The results fed directly into their ongoing research, with published papers, public presentations, and further studies now building on this work.
