I have used a pretrained BERT base model to generate 512-dimensional contextual features. Feeding those vectors to a random forest classifier gives 83 percent accuracy, but in various papers I have seen BERT reach at least 90 percent. I also have some other features such as word2vec, lexicon, TF-IDF, and punctuation features, yet even when I merged all of the features I still got 83 percent accuracy. The research paper I am using as my base paper reports an accuracy of 92 percent, but they used an ensemble-based approach in which they classified through BERT and trained a random forest on the weights. I wanted to try something more innovative, so I did not follow that approach. My dataset is skewed towards positive reviews, so my guess is that the accuracy is lower because the model is also biased towards positive labels, but I would still like some expert advice.
Code implementation of BERT:
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/Bert_Features.ipynb
Random forest on all features independently:
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/RandomForestClassifier.ipynb
Random forest on all features jointly:
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/Merging_Feature.ipynb
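For reference, here is a minimal sketch of the kind of pipeline described above (frozen BERT embeddings fed into a random forest), assuming Hugging Face transformers and scikit-learn; the model name, [CLS] pooling, and hyperparameters are my own assumptions, not taken from the linked notebooks:

```python
# Sketch only (not the exact code from the linked notebooks): extract pooled BERT
# vectors as features, then train a random forest on them.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed(texts, batch_size=32):
    """Return one [CLS] vector per text, with BERT kept frozen."""
    vecs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                            max_length=512, return_tensors="pt")
            out = bert(**enc)
            vecs.append(out.last_hidden_state[:, 0, :].numpy())  # [CLS] token
    return np.vstack(vecs)

# Placeholder data; substitute your own reviews and labels.
texts = ["great product, loved it", "terrible service", "okay, nothing special", "would buy again"]
labels = [1, 0, 0, 1]

X = embed(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```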
Regarding the "no improvements despite adding more features" - some researchers believe that the BERT word embeddings already contain all the available information presented in text, so then it doesn't matter how fancy a classification head you add to it, doesn't matter if it is a linear model that uses the embeddings, or a complicated ML algorithm with a number of other features, they will not provide significant improvements in many tasks. They argue, that since BERT is a context-aware, bidirectional language model - that is trained extensively on MLM and NSP tasks, it already grasps most of the things that additional features for punctuation, word2vec and tfidf could convey. The lexicon could probably help a little in the sentiment task, if it is relevant, but the one or two extra variables, that you likely use to represent it, probably get drowned in all the other features.
Other than that, the accuracy of BERT-based models depends on the dataset used; sometimes the data is simply too diverse to obtain a near-perfect score, e.g. when there are observations that are very similar but carry different class labels. You can see in the BERT papers that accuracy varies widely by task: on some tasks it is indeed 90+%, while on others, such as masked language modeling, where the model has to pick a particular word from a vocabulary of over 30K tokens, even 20% accuracy can be impressive. So to obtain a reliable comparison with the BERT papers, you would need to pick a dataset they used and compare on that.
Regarding dataset balance: for deep learning models in general, the rule of thumb is that the training set should be more or less balanced with respect to the fraction of data covered by each class label. So with 2 labels the split should be roughly 50-50; with 5 labels, each should make up around 20% of the training set, and so on. The reason is that most neural networks are trained in batches, updating the model weights based on the feedback from each batch, so if one class is heavily over-represented, the batch updates will be dominated by that class, effectively degrading the quality of your training.
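One common way to compensate, assuming you stick with scikit-learn's random forest, is class weighting rather than discarding data; batch-trained neural networks can use analogous per-class weights in the loss. A minimal sketch:

```python
# Sketch: counter class imbalance with class weights instead of (or before) resampling.
# Assumes scikit-learn; y is your label array, X your feature matrix.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

y = np.array([1] * 800 + [0] * 200)         # imbalanced toy labels (80% positive)
X = np.random.randn(len(y), 10)             # stand-in features

weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))     # e.g. {0: 2.5, 1: 0.625}

clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)
```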
So if you want to improve your model's accuracy, balancing the dataset could be an easy fix. And if you have, say, 5 ordered classes of differing sizes, you may consider merging some of them (e.g. 1-2 star reviews as bad, 3 as neutral, 4-5 as good) and then rebalancing, if still necessary.
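A sketch of that merge-then-rebalance idea, assuming 1-5 star ratings stored in a pandas DataFrame (the column names are made up) and using scikit-learn's resample to downsample:

```python
# Sketch: merge 1-5 star ratings into 3 classes, then downsample to the smallest class.
# Column names ("text", "rating") are assumptions about your DataFrame.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "text": ["bad", "meh", "fine", "great", "awful", "good"] * 50,
    "rating": [1, 3, 4, 5, 2, 5] * 50,
})

# 1-2 -> bad, 3 -> neutral, 4-5 -> good
df["label"] = pd.cut(df["rating"], bins=[0, 2, 3, 5], labels=["bad", "neutral", "good"])

# Downsample every class to the size of the smallest one.
min_size = df["label"].value_counts().min()
balanced = pd.concat([
    resample(group, replace=False, n_samples=min_size, random_state=0)
    for _, group in df.groupby("label", observed=True)
])
print(balanced["label"].value_counts())
```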
(Unless it's a situation where, for example, one class holds 80% of the data and the other 4 classes share the remaining 20%. In that case you should probably consider more advanced options, such as splitting the algorithm into two stages: one predicting whether or not an instance belongs to the dominant class (a binary classifier), and the other distinguishing between the 4 underrepresented classes.)
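A rough sketch of that two-stage idea, assuming scikit-learn classifiers (using random forests for both stages is an arbitrary choice):

```python
# Sketch: two-stage classification for a heavily skewed label distribution.
# Stage 1: is the instance in the dominant class? Stage 2: which minority class is it?
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_two_stage(X, y, majority_label):
    stage1 = RandomForestClassifier(random_state=0).fit(X, y == majority_label)
    minority_mask = y != majority_label
    stage2 = RandomForestClassifier(random_state=0).fit(X[minority_mask], y[minority_mask])
    return stage1, stage2

def predict_two_stage(stage1, stage2, X, majority_label):
    is_majority = stage1.predict(X)
    preds = np.full(len(X), majority_label, dtype=object)
    if (~is_majority).any():
        preds[~is_majority] = stage2.predict(X[~is_majority])
    return preds

# Toy usage: class "A" holds 80% of the data, "B".."E" share the rest.
rng = np.random.default_rng(0)
y = np.array(["A"] * 800 + ["B", "C", "D", "E"] * 50)
X = rng.normal(size=(len(y), 10))
s1, s2 = fit_two_stage(X, y, "A")
print(predict_two_stage(s1, s2, X[:5], "A"))
```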