Repository to extract gender-exclusive words based on their affixes.
As outlined in our paper "From Showgirls to Performers: Fine-tuning with Gender-inclusive Language for Bias Reduction in LLMs", there are three main rounds of extraction and verification.
In the first round, use word_extraction.ipynb to extract words with gender-marking affixes from the 200M-word OpenWebText corpus.
The files created are:
words/prefixes.json
words/suffixes.txt
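As a rough illustration of what affix-based extraction can look like, the sketch below counts tokens that begin or end with a gender-marking affix. It is not the code from word_extraction.ipynb: the affix lists, the function name `extract_affixed_words`, and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

# Illustrative affix lists only; the notebook's actual lists may differ.
PREFIXES = ("man", "woman")
SUFFIXES = ("man", "woman", "girl", "boy")


def extract_affixed_words(text, prefixes=PREFIXES, suffixes=SUFFIXES):
    """Count tokens that start or end with one of the given affixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    prefix_hits, suffix_hits = Counter(), Counter()
    for tok in tokens:
        # Require a non-empty remainder so the bare affix itself is skipped.
        if any(tok.startswith(p) and len(tok) > len(p) for p in prefixes):
            prefix_hits[tok] += 1
        if any(tok.endswith(s) and len(tok) > len(s) for s in suffixes):
            suffix_hits[tok] += 1
    return prefix_hits, suffix_hits


# Hypothetical sample text standing in for a slice of the corpus.
sample = "The showgirl and the chairman met a manpower consultant; the man left."
prefix_hits, suffix_hits = extract_affixed_words(sample)
print(dict(prefix_hits))
print(dict(suffix_hits))
```

Pure string matching of this kind also picks up words such as "human", which end in "man" without being gender-marked, so the output of a sketch like this would still need the later verification rounds.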
The second round of verification uses the BabelNet lexical resource to verify whether the words are commonly used. For this step, use word_verification.ipynb.
The files created are:
words/verified_prefixes.csv
words/verified_suffixes.csv
words/verified_affixes.csv (verified prefixes and suffixes combined)
Round three consisted of manual verification and the addition of gender-neutral replacements. The file that contains the words left after round three is:
words/replacements.csv
Finally, plural versions were added with the following notebook:
word_extension.ipynb
This created the file:
words/replacements+plural.csv
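A minimal sketch of how plural forms might be added to word/replacement pairs is shown below. It does not reproduce word_extension.ipynb: the naive `pluralize` rules, the helper `add_plurals`, and the example pairs other than the showgirl/performer pair from the paper title are assumptions for illustration.

```python
def pluralize(word):
    """Naive English pluralization.

    Handles -man/-men (which also covers -woman/-women) and a few regular
    spelling rules. Other irregular forms are not handled, and words such
    as "human" would be pluralized incorrectly.
    """
    if word.endswith("man"):
        return word[:-3] + "men"
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in "aeiou":
        return word[:-1] + "ies"
    return word + "s"


def add_plurals(pairs):
    """Return each (word, replacement) pair followed by its plural pair."""
    extended = []
    for word, replacement in pairs:
        extended.append((word, replacement))
        extended.append((pluralize(word), pluralize(replacement)))
    return extended


# Hypothetical example pairs; the real pairs live in words/replacements.csv.
pairs = [("showgirl", "performer"), ("chairman", "chair"), ("actress", "actor")]
extended = add_plurals(pairs)
for word, replacement in extended:
    print(word, "->", replacement)
```

Rule-based pluralization like this is error-prone for irregular nouns, which is one reason a final manual check of the generated forms is worthwhile.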
The final version of words with gender-marking affixes and gender-neutral replacements, after one last round of manual checks, is:
words/replacements+plural-final.csv