jaj > corpus | BibSonomy


Diachronic Electronic Corpus of Tyneside English
a corpus of dialect speech from the Tyneside area of North-East England. DECTE is an amalgamation of the existing Newcastle Electronic Corpus of Tyneside English (NECTE) created between 2001 and 2005 (http://research.ncl.ac.uk/necte), and NECTE2, a collection of interviews conducted in the Tyneside area since 2007. It thereby constitutes a rare example of a publicly available on-line corpus presenting dialect material spanning five decades. The present website is designed for research use. DECTE also, however, includes an interactive website, The Talk of the Toon, which integrates topics and narratives of regional cultural significance in the corpus with relevant still and moving images, and which is designed primarily for use in schools and museums and by the general public.
10 years ago by @jaj
tags: corpus
Manually Annotated Sub-Corpus (MASC) Open American National Corpus
The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
10 years ago by @jaj
tags: corpus, reference
OntoNotes Release 4.0 - Linguistic Data Consortium
Developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
10 years ago by @jaj
tags: corpus, reference
subset of govdocs1 corpus
a subset of the govdocs1 corpus for testing file-characterization tools
11 years ago by @jaj
tags: corpus, digital_preservation, file_formats, tools
Digital Corpora » Govdocs1
a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by performing searches for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two, for documents of specified file types that resided on web servers in the .gov domain using the Yahoo and Google search engines.
11 years ago by @jaj
tags: corpus, digital_preservation, govdocs, tools
openplanets/format-corpus · GitHub
An openly-licensed corpus of small example files, covering a wide range of formats and creation tools.
11 years ago by @jaj
tags: corpus, digital_preservation, file_formats, tools
LAUDATIO – Long-term Access and Usage of Deeply Annotated Information: XMLObjects
LAUDATIO aims to build an open access research data repository for historical linguistic data with respect to the above-mentioned requirements of historical corpus linguistics. For the access and (re-)use of historical linguistic data, the LAUDATIO repository uses a flexible and appropriate documentation schema with a subset of TEI customized by TEI ODD.
11 years ago by @jaj
tags: corpus, linguistics
Web Research Collections - Web Track
The University of Glasgow took over the distribution of the WT2g/WT10g/.GOV/.GOV2 Web Research Collections from CSIRO (Commonwealth Scientific and Industrial Research Organisation), which has been distributing the Web Research collections to organizations and individuals engaged in research and development of natural language processing, information retrieval or document understanding systems, strictly for research purposes only. These collections have been used in the TREC Web & Terabyte tracks. In addition, as part of the TREC Blog track, the University of Glasgow is currently distributing the Blogs06 & Blogs08 test collections. Getting access to the test collections (including .GOV, .GOV2, Blogs06, and Blogs08)
12 years ago by @jaj
tags: corpus, web_archives
An Automated Method of Topic-Coding Legislative Speech Over Time with Application to the 105th-108th U.S. Senate
Kevin M. Quinn et al., July 18, 2006. Based on the “United States Congressional Speech Corpus.”
12 years ago by @jaj
tags: congress, corpus
LDC - Linguistic Data Consortium
supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards. LDC's Catalog contains hundreds of corpora of language data including Santa Barbara Corpus of Spoken American
12 years ago by @jaj
tags: corpus
The Blog Authorship Corpus
consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
12 years ago by @jaj
tags: blogs, corpus, data
Open Language Archives Community
an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. Includes a search across text-archives.
12 years ago by @jaj
tags: corpus, linguistics, data
Linguist List - Open Language Archives Community
dedicated to collecting information about language resources and making it available from a single search.
12 years ago by @jaj
tags: corpus, linguistics, data
The Petabyte Age: Because More Isn't Just More — More Is Different
Wired Magazine issue 16.07. Data Deluge. Crop predictions. Quark. Data mining. Tracking news. Watching the skies, scanning skeletons. Airfares. Voting. Epidemics. Google events. Terrorism. Visualizing big data.
12 years ago by @jaj
tags: data, datavisualization, corpus, textmining
WaCKy: Web-as-Corpus kool ynitiative
We are a community of linguists and information technology specialists who got together to develop a set of tools (and interfaces to existing tools) that will allow linguists to crawl a section of the web, process the data, index and search them. We also
12 years ago by @jaj
tags: corpus
Web as Corpus
English-language corpora compiled from the Web in 2006 and 2007, and more
12 years ago by @jaj
tags: corpus, concordances
Phrases in English
PIE incorporates a database derived from the second or World Edition of the British National Corpus (BNC 2000). It aims to provide a simple yet powerful interface for studying words and phrases up to eight words long appropriate for both experienced researchers and novice users.
12 years ago by @jaj
tags: corpus, tools, linguistics
British National Corpus [bnc]
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
12 years ago by @jaj
tags: corpus, data, reference, linguistics
WebAsCorpus.org - find Web Concordances
Search the web for words and phrases. Get results with hits marked. Download all pages for further research.
12 years ago by @jaj
tags: corpus, searchengine, research, linguistics, textmining
UCI Knowledge Discovery in Databases (KDD) Archive
Online repository of large data sets for researchers in knowledge discovery and data mining. Includes Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text (corpora), Time Series, Web Data (web pages and log files).
12 years ago by @jaj
tags: data_archive, datasets, datamining, big_data, corpus