We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly.
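To make the two filters concrete, here is a minimal sketch of how they might look on a toy corpus. Everything in it is an assumption for illustration: the record fields `content` and `stars`, the helper names, the 3-token shingles, and the 0.7 threshold do not come from the source. It uses exact pairwise Jaccard similarity as a stand-in; pipelines at scale typically rely on an approximate method such as MinHash instead.

```python
import re


def shingles(text, k=3):
    """Return the set of k-token shingles of a file's contents (hypothetical helper)."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(files, threshold=0.7, k=3):
    """Keep a file only if it is below `threshold` similarity to every file kept so far.

    A lower threshold is a more aggressive filter. The default is an assumed value.
    """
    kept, kept_shingles = [], []
    for f in files:
        s = shingles(f["content"], k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(f)
            kept_shingles.append(s)
    return kept


def keep_min_stars(files, min_stars=5):
    """Keep files whose repository has at least `min_stars` stars (assumed field name)."""
    return [f for f in files if f["stars"] >= min_stars]


# Toy corpus; the contents and star counts are made up for the example.
files = [
    {"name": "a.py", "content": "def add(a, b): return a + b", "stars": 10},
    {"name": "b.py", "content": "def add(a, b): return a + b  # sum", "stars": 2},
    {"name": "c.py", "content": "print('hello world')", "stars": 5},
]

deduped = drop_near_duplicates(files)
starred = keep_min_stars(files)
```

In this sketch, lowering `threshold` removes more files, which corresponds to the "more aggressive" near-duplicate filtering described above, while `keep_min_stars` corresponds to the star-based selection that was found to hurt performance.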