[SEACrowd] Public SEA Datasheets
The list of available datasheets can be accessed via https://tinyurl.com/SCDatasheets. Before filling out this form, please check whether the dataset is already in the list. Since this initiative mainly aims to represent SEA, please make sure that the dataset was collected either 1) from speakers in SEA or 2) in SEA regions. UPDATE: Submissions after 31 March 23:59 (UTC) will receive 0 points.
Public Datasheet
All about the dataset.
Dataset name
*
For example: NusaSenti, IndoNLU BaPOS, Indonesian Clickbait, IndoNLG TED En-Id, etc.
Dataset subset(s)
Dataset description
*
A brief (3-4 sentences long) description of the dataset. For example, IndoNLU TermA's description is: "The TermA span-extraction dataset is collected from the hotel aggregator platform, AiryRooms. The dataset consists of thousands of hotel reviews, which each contain a span label for aspect and sentiment words representing the opinion of the reviewer on the corresponding aspect. The labels use Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment."
Dataset URL
*
Direct link to the dataset repository.
HuggingFace URL
Link to the dataset's HuggingFace page (if available), e.g., https://huggingface.co/datasets/indonlu.
Dataset language(s)
*
Dataset collection region — Where are the annotators from? Or is the data collected in/from specific SEA region(s)?
*
Dataset task(s)
*
Dataset modality
*
Language
Vision
Speech
Other
Dataset domain(s)
*
Dataset license
*
Dataset annotation collection style
*
Crawling
Crowdsourced
Expert-generated
Machine-generated
Expert-translated
Machine-translated
Dataset annotation validation style
*
None
Automatic
Manual (partial)
Manual (full)
Automatic & Manual (partial)
Automatic & Manual (full)
Number of annotators per sample
Leave blank if unknown and cannot be inferred
Inter-annotator agreement
Leave blank if unknown and cannot be inferred
Does the dataset have a pre-defined split?
*
Yes
No
Train split data size
Validation split data size
Test split data size
Please read this
We understand that some datasets have a large number of subsets, and that entering the data size of each subset one by one through the form would be very tedious. If the dataset has >10 subsets, please just enter "Contact me for data size" for the data size questions, and we will help you enter them.
Train split data size (per data subset)
*
Validation split data size (per data subset)
*
Test split data size (per data subset)
*
No split data size (per data subset)
*
No split data size
*
Data size unit
*
Total data size
Total data size (all subsets)
Does the data have PII (personally identifiable information)?
*
Yes
No
Does the data have sensitive content/information?
*
Yes
No
Dataset Credentials
Dataset provider/affiliation
*
Dataset or dataset paper publication year
*
Dataset paper title
*
Publication venue
*
Dataset paper URL
Only leave blank if there's no publication
Dataset access
*
Free
Free upon request
Paid
Derived from...
Whether the dataset is derived from other sources. For example, the CC100-Indonesian dataset is derived from Common Crawl.
Contributor Information
We may follow up with you to clarify details of your datasheet.
Are you...?
*
Yes
No
Dataset paper author
Dataset owner
Was this dataset publicly available before?
*
Yes, it was public before.
It was a private dataset, but I made it public for SEACrowd.
This is a newly created dataset that was created during/for SEACrowd.
What was your reason for not making it public before?
*
We collect this answer for quantitative analysis.
Contributor name
*
First Name
Last Name
Contributor email
*
Confirmation Email
example@example.com
Save & Continue Later
Submit