[SEACrowd] Public SEA Datasheets

The list of available datasheets can be accessed at https://tinyurl.com/SCDatasheets. Before filling in this form, please check whether the dataset is already in the list. Since this initiative mainly aims to represent SEA, please make sure that the dataset was collected either 1) from speakers in SEA or 2) in SEA regions. UPDATE: Submissions after 31 March 23:59 (UTC) will receive 0 points.
Public Datasheet

All about the dataset.
Dataset name*
For example: NusaSenti, IndoNLU BaPOS, Indonesian Clickbait, IndoNLG TED En-Id, etc.
Dataset subset(s)
Dataset description*
A brief (3-4 sentences long) description of the dataset. For example, IndoNLU TermA's description is: "The TermA span-extraction dataset is collected from the hotel aggregator platform, AiryRooms. The dataset consists of thousands of hotel reviews, which each contain a span label for aspect and sentiment words representing the opinion of the reviewer on the corresponding aspect. The labels use Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment."
Dataset URL*
Direct link to the dataset repository.
HuggingFace URL
Link to the dataset's HuggingFace (if present), e.g., https://huggingface.co/datasets/indonlu.
Dataset language(s)*
Dataset collection region — Where are the annotators from? Or is the data collected in/from specific SEA region(s)?*
Dataset task(s)*
Dataset modality*

Language, Vision, Speech, Other
Dataset domain(s)*
Dataset license*
Dataset annotation collection style*

Crawling, Crowdsourced, Expert-generated, Machine-generated, Expert-translated, Machine-translated
Dataset annotation validation style*

None, Automatic, Manual (partial), Manual (full), Automatic & Manual (partial), Automatic & Manual (full)
Number of annotators per sample
Leave blank if unknown and cannot be inferred
Inter-annotator agreement
Leave blank if unknown and cannot be inferred
Does the dataset have a pre-defined split?*

Yes, No
Train split data size
Validation split data size
Test split data size
Please read this

We understand that some datasets have a huge number of subsets, and that entering the data sizes per subset one by one through the form would be very tedious. If the dataset has >10 subsets, please just enter "Contact me for data size" for the data size questions, and we will help you input it.
Train split data size (per data subset)*
Validation split data size (per data subset)*
Test split data size (per data subset)*
No split data size (per data subset)*
No split data size*
Data size unit*
Total data size
Total data size (all subsets)
Does the data have PII (personally identifiable information)?*

Yes, No
Does the data have sensitive content/information?*

Yes, No
Dataset Credentials
Dataset provider/affiliation*
Dataset or dataset paper publish year*
Dataset paper title*
Publication venue*
Dataset paper URL
Leave blank only if there is no publication.
Dataset access*

Free, Free upon request, Paid
Derived from...
Whether the dataset is derived from other sources. For example, the CC100-Indonesian dataset is derived from Common Crawl.
Contributor Information

For clarity, we might follow up with you about the details of your datasheet.

Are you...?*

	Yes	No
Dataset paper author
Dataset owner

Was this dataset publicly available before?*
What was your reason for not opening it before?*
We collect this answer for quantitative analysis.
Contributor name*

First Name, Last Name
Contributor email*

Confirmation email (e.g., example@example.com)