a corpus of dialect speech from the Tyneside area of North-East England. DECTE is an amalgamation of the existing Newcastle Electronic Corpus of Tyneside English (NECTE) created between 2001 and 2005 (http://research.ncl.ac.uk/necte), and NECTE2, a collection of interviews conducted in the Tyneside area since 2007. It thereby constitutes a rare example of a publicly available on-line corpus presenting dialect material spanning five decades. The present website is designed for research use. DECTE also, however, includes an interactive website, The Talk of the Toon, which integrates topics and narratives of regional cultural significance in the corpus with relevant still and moving images, and which is designed primarily for use in schools and museums and by the general public.
The Manually Annotated Sub-Corpus (MASC) consists of approximately 500,000 words of contemporary American English written and spoken data drawn from the Open American National Corpus (OANC).
Developed as part of the OntoNotes project, a collaborative effort between BBN Technologies, the University of Colorado, the University of Pennsylvania and the University of Southern California's Information Sciences Institute. The goal of the project is to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference).
a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by performing searches for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two, for documents of specified file types that resided on web servers in the .gov domain using the Yahoo and Google search engines.
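The query-sampling procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the collection's actual tooling: the function name `make_query`, the three-way choice between query kinds, and the tiny stand-in word list are all hypothetical, and the file-type and .gov restrictions are left out because the description does not say how they were expressed to the search engines.

```python
import random


def make_query(words, rng):
    """Return search terms: a random word, a random number, or both.

    Hypothetical helper; the equal weighting of the three kinds is an assumption.
    """
    kind = rng.choice(["word", "number", "both"])
    word = rng.choice(words)
    number = str(rng.randint(1, 1_000_000))  # inclusive range 1..1,000,000
    if kind == "word":
        return word
    if kind == "number":
        return number
    return f"{word} {number}"


# Stand-in for the Unix dictionary (assumption: any list of words works here).
sample_words = ["apple", "river", "quorum"]

rng = random.Random(0)  # seeded so repeated runs give the same queries
queries = [make_query(sample_words, rng) for _ in range(5)]

for q in queries:
    print(q)
```

Each generated string would then be submitted as a search restricted to the desired file types and domain; how that restriction is applied depends on the search engine and is not shown.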
LAUDATIO aims to build an open access research data repository for historical linguistic data with respect to the above-mentioned requirements of historical corpus linguistics. For the access and (re-)use of historical linguistic data, the LAUDATIO repository uses a flexible and appropriate documentation schema with a subset of TEI customized by TEI ODD.
The University of Glasgow took over the distribution of the WT2g/WT10g/.GOV/.GOV2 Web Research Collections from CSIRO (Commonwealth Scientific and Industrial Research Organisation), which has been distributing the Web Research collections to organizations and individuals engaged in research and development of natural language processing, information retrieval or document understanding systems, strictly for research purposes only. These collections have been used in the TREC Web & Terabyte tracks. In addition, as part of the TREC Blog track, the University of Glasgow is currently distributing the Blogs06 & Blogs08 test collections. Getting access to the test collections (including .GOV, .GOV2, Blogs06, and Blogs08)
supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards. LDC's Catalog contains hundreds of corpora of language data including Santa Barbara Corpus of Spoken American
consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
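As a quick sanity check, the per-person averages quoted above follow from the totals. The word count is taken as exactly 140 million here, which is an assumption: the description says "over 140 million", so the words-per-blogger figure is a lower bound.

```python
# Figures quoted in the corpus description.
bloggers = 19_320
posts = 681_288
words = 140_000_000  # assumption: "over 140 million" treated as a lower bound

posts_per_blogger = posts / bloggers
words_per_blogger = words / bloggers

print(f"posts per blogger: {posts_per_blogger:.1f}")
print(f"words per blogger (at least): {words_per_blogger:.0f}")
```

Both values are consistent with the rounded figures of about 35 posts and 7250 words per person.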
an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. Includes a search across text archives.
Wired Magazine issue 16.07. Data Deluge. Crop predictions. Quark. Data mining. tracking news. watching the skies, scanning skeletons. airfares. voting. epidemics. google events. terrorism. visualizing big data
We are a community of linguists and information technology specialists who got together to develop a set of tools (and interfaces to existing tools) that will allow linguists to crawl a section of the web, process the data, index and search them. We also
PIE incorporates a database derived from the second or World Edition of the British National Corpus (BNC 2000). It aims to provide a simple yet powerful interface for studying words and phrases up to eight words long appropriate for both experienced researchers and novice users.
The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.
Online repository of large data sets for researchers in knowledge discovery and data mining. Includes Discrete Sequence Data, Image Data, Multivariate Data, Relational Data, Spatio-Temporal Data, Text (corpora), Time Series, and Web Data (web pages and log files).