This repository compares the performance of foundation models across tasks of varying type and complexity, using visualisation and statistics.
Data:
The data was provided by DataAnnotation Tech.
Categories:
| Categories 1-7 | Categories 8-13 |
|---|---|
| 1. Adversarial Dishonesty | 8. Extraction |
| 2. Adversarial Harmfulness | 9. Mathematical Reasoning |
| 3. Brainstorming | 10. Open QA |
| 4. Classification | 11. Poetry |
| 5. Closed QA | 12. Rewriting |
| 6. Creative Writing | 13. Summarization |
| 7. Coding | |
Likert-type rating scale:
- Bard much better
- Bard better
- Bard slightly better
- About the same
- ChatGPT slightly better
- ChatGPT better
- ChatGPT much better
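Since the scale is ordinal, one natural preprocessing step is to encode it as an ordered pandas Categorical. A minimal sketch, assuming the ratings live in a column named `rating` (an illustrative name, not necessarily the one used in the repository):

```python
import pandas as pd

# Hypothetical data frame; in the repository the ratings come from the
# DataAnnotation Tech dataset.
df = pd.DataFrame({"rating": ["Bard better", "About the same", "ChatGPT much better"]})

# The seven Likert-type levels, ordered from "Bard much better"
# to "ChatGPT much better".
LIKERT_ORDER = [
    "Bard much better",
    "Bard better",
    "Bard slightly better",
    "About the same",
    "ChatGPT slightly better",
    "ChatGPT better",
    "ChatGPT much better",
]

# An ordered Categorical preserves the ranking, so rank-based tests
# (e.g. Kruskal-Wallis) can work on the integer codes 0-6.
df["rating"] = pd.Categorical(df["rating"], categories=LIKERT_ORDER, ordered=True)
print(df["rating"].cat.codes)  # 0 = Bard much better ... 6 = ChatGPT much better
```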
Tools used: pandas, plotly, statsmodels, scipy, and scikit-posthocs.
Note: The dataset is imbalanced, with a prime number of prompts (1,003). Bard was never rated "Bard much better" in the Poetry category, and never "Bard better" in the Creative Writing category for simple prompts.
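Those empty cells can be surfaced directly from a contingency table; a minimal sketch, reusing the illustrative column names from the snippet above:

```python
import pandas as pd

# Toy data standing in for the real prompt ratings.
df = pd.DataFrame({
    "category": ["Poetry", "Poetry", "Creative Writing", "Coding"],
    "rating": ["Bard better", "About the same", "ChatGPT better", "Bard much better"],
})

# Cross-tabulate category vs. rating; zero cells mark rating levels that
# never occur in a category (the imbalance noted above).
counts = pd.crosstab(df["category"], df["rating"])
stacked = counts.stack()
print(stacked[stacked == 0])
```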
Statistical test results:
- Chi-square with Monte Carlo iterations: p-value = 0.0001
- Kruskal-Wallis: p-value = 6.96e-7
- Multinomial logistic regression: p-value = 0.00015
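A minimal sketch of how these three tests could be run with the listed tools; the `category`/`rating_code` column names, the toy data, and the Dunn post-hoc step are illustrative assumptions, not necessarily the repository's exact code:

```python
import numpy as np
import pandas as pd
import scikit_posthocs as sp
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(42)

# Toy data: replace with the real 1,003-prompt dataset. `rating_code` is
# the 0-6 ordered code from the Likert encoding above.
df = pd.DataFrame({
    "category": rng.choice(["Poetry", "Coding", "Open QA"], size=300),
    "rating_code": rng.integers(0, 7, size=300),
})

# 1. Chi-square with Monte Carlo iterations: permute ratings against
#    categories to simulate the null distribution of the chi-square
#    statistic, avoiding the asymptotic approximation when cells are
#    sparse or empty.
table = pd.crosstab(df["category"], df["rating_code"])
observed = stats.chi2_contingency(table)[0]
n_iter = 10_000
sims = np.array([
    stats.chi2_contingency(
        pd.crosstab(df["category"], rng.permutation(df["rating_code"].to_numpy()))
    )[0]
    for _ in range(n_iter)
])
mc_p = (1 + (sims >= observed).sum()) / (1 + n_iter)

# 2. Kruskal-Wallis across categories on the ordinal codes.
groups = [g["rating_code"].to_numpy() for _, g in df.groupby("category")]
kw_stat, kw_p = stats.kruskal(*groups)

# Optional post-hoc: Dunn's test from scikit-posthocs, Bonferroni-adjusted.
dunn = sp.posthoc_dunn(df, val_col="rating_code", group_col="category",
                       p_adjust="bonferroni")

# 3. Multinomial logistic regression: category dummies predict the rating;
#    the likelihood-ratio p-value tests the overall category effect.
X = sm.add_constant(pd.get_dummies(df["category"], drop_first=True, dtype=float))
mnlogit = sm.MNLogit(df["rating_code"], X).fit(disp=False)

print(f"Monte Carlo chi-square p = {mc_p:.5f}")
print(f"Kruskal-Wallis p = {kw_p:.3g}")
print(f"Multinomial logit LR p = {mnlogit.llr_pvalue:.3g}")
```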