The NetarchiveSuite is a complete web archiving software package, under development since 2004. Its primary function is to plan, schedule, and run web harvests of parts of the Internet. It scales to a wide range of tasks, from small thematic harvests (e.g. of special events or particular domains) to harvesting and archiving the content of an entire national domain. The software has built-in bit preservation functionality. The system's architecture allows the software to be distributed among several machines, possibly at more than one geographic location. The NetarchiveSuite is built around the Heritrix web crawler, which it uses to harvest the web.
WebCiteBOT's purpose is to combat link rot by automatically submitting newly added URLs to WebCite for archiving. It is written in Perl and runs automatically, with only occasional supervision.
The University of Glasgow took over distribution of the WT2g/WT10g/.GOV/.GOV2 Web Research Collections from CSIRO (Commonwealth Scientific and Industrial Research Organisation), which had been distributing them to organizations and individuals engaged in research and development of natural language processing, information retrieval, or document understanding systems, strictly for research purposes. These collections have been used in the TREC Web and Terabyte tracks. In addition, as part of the TREC Blog track, the University of Glasgow is currently distributing the Blogs06 and Blogs08 test collections. Details on obtaining access to the test collections (including .GOV, .GOV2, Blogs06, and Blogs08) are available from the University.
Libraries, archives, museums, and other heritage or research institutions demonstrating significant experience or commitment in the field of web archiving are entitled to apply for membership in the Consortium. Other consortia cannot apply for membership as a group; institutions within a consortium must apply individually.
Transactional archiving consists of selectively capturing and storing transactions that take place between a web client (browser) and a web server. Most existing web archives recurrently send out bots to crawl the content of web servers, producing observations of a server's content at the time of crawling. Since the crawling frequency is generally not aligned with the change rate of a server's resources, this approach typically cannot capture every version of a resource. The resulting archive may provide an acceptable overview of a server's evolution over time, but it will not provide an accurate representation of the server's entire history. A SiteStory Web Archive, by contrast, captures every version of a resource as it is requested by a browser, so the resulting archive is effectively representative of the server's entire history.
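To make the transactional idea concrete, here is a minimal sketch, not SiteStory's actual implementation (which runs as a server module): a WSGI middleware that copies every response it serves into an archive, timestamped at the moment of the transaction. The class and the archive interface are illustrative assumptions.

```python
import datetime

class TransactionalArchiver:
    """Wrap a WSGI app so every response served to a client is also archived."""

    def __init__(self, app, archive):
        self.app = app          # the wrapped WSGI application
        self.archive = archive  # illustrative: anything with an append(record) method

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        # Serve the page as usual, but keep a copy of the body.
        body = b"".join(self.app(environ, recording_start_response))

        # Record the transaction: URI, capture time, status, and full body.
        self.archive.append({
            "uri": environ.get("PATH_INFO", "/"),
            "datetime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "status": captured.get("status"),
            "body": body,
        })
        return [body]
```

Because the capture happens on every request rather than on a crawl schedule, no served version can fall between two crawler visits.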
The IIPC is a global network of experts archiving the web. The WARC archival standard, the Heritrix crawler, and the WARC analytic tools are all products of IIPC working groups, projects, and initiatives, and they make up the standard toolkit for archival web capture around the world.
media.org is a collective of artists, architects, and netizens fueled by a passion for the potential of the Internet. Co-founded by Carl Malamud and webchick, the organization's goal is to push the Internet to greater heights through public works and activism. They also collaborate as the Internet Multicasting Service (IMS), the nonprofit group that helped pioneer some important early content on the World Wide Web.
Their projects fall into three categories. The first is Rescued Works, sites they have chosen to refurbish and republish on the Internet for posterity. The second is Living Works, sites they created that now have a life of their own and are maintained by someone else. And the third is what they call the World Wide Cobweb, sites that live on the net in their original state without any active maintenance.
Wouldn't it be nice if you could just connect to cnn.com, Wikipedia, or news.bbc.co.uk and indicate that you are interested in the pages of March 20, 2008, rather than the current ones? The Memento project proposes new ideas related to web archiving, focusing on the integration of archived resources into regular web navigation. Memento is a collaboration between the Prototyping Team of the Research Library of the Los Alamos National Laboratory (Lyudmila Balakireva, Robert Sanderson, Harihar Shankar, Herbert Van de Sompel) and the Computer Science Department of Old Dominion University (Scott Ainsworth, Michael Nelson).
Thousands of UK websites have been collected since 2004 and the Archive is growing fast. Here you can see how sites have changed over time, locate information no longer available on the live Web and observe the unfolding history of a spectrum of UK activities represented online. Sites that no longer exist elsewhere are found here, and those yet to be archived can be saved for the future by nominating them. The Archive contains sites that reflect the rich diversity of lives and interests throughout the UK. You can search by Title of Website, Full Text, or URL, or browse by Subject, Special Collection, or Alphabetical List.
Memento wants to make it as straightforward to access the Web of the past as it is to access the current Web. If you know the URI of a Web resource, the technical framework proposed by Memento allows you to see a version of that resource as it existed at some date in the past, by entering that URI in your browser like you always do and by specifying the desired date in a browser plug-in. Or you can actually browse the Web of the past by selecting a date and clicking away. Whatever you land upon will be versions of Web resources as they were around the selected date. Obviously, this will only work if previous versions are available somewhere on the Web. But if they are, and if they are on servers that support the Memento framework, you will get to them.
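Under the hood, Memento extends ordinary HTTP content negotiation into the datetime dimension: a client sends an Accept-Datetime header to a "TimeGate", which redirects to the archived version closest to the requested date. A minimal sketch, assuming an archive whose TimeGate is publicly reachable (the Internet Archive's Wayback Machine exposes one under the URL prefix shown):

```python
import requests

# Ask a Memento TimeGate for the version of a page closest to a given date.
target = "http://news.bbc.co.uk/"
timegate = "http://web.archive.org/web/" + target

resp = requests.get(
    timegate,
    headers={"Accept-Datetime": "Thu, 20 Mar 2008 00:00:00 GMT"},
)

# The TimeGate redirects to the nearest snapshot (the "Memento"); its
# Memento-Datetime header states when that version was actually captured.
print(resp.url)
print(resp.headers.get("Memento-Datetime"))
```

If no archive holds a version near the requested date, the client simply lands on whatever snapshot is closest, which is why coverage depends on servers and archives participating in the framework.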
Hoover up those sites. Getleft is a web site downloader that downloads complete web sites according to the settings provided by the user. It automatically changes all absolute links to relative ones, so you can browse the downloaded pages on your local computer without connecting to the Internet. Getleft supports several filters, allowing you to limit the download to certain files, as well as resuming interrupted downloads, following external links, generating a site map, and more. Getleft supports proxy connections and can be scheduled to update downloaded pages automatically. A sketch of the link-rewriting step follows.
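The absolute-to-relative rewriting is what makes offline browsing work. Below is a simplified sketch of the idea, not Getleft's own code; the regex-based matching and the site_root parameter are illustrative assumptions (a real tool parses the HTML properly).

```python
import posixpath
import re
from urllib.parse import urlparse

def rewrite_links(html, page_url, site_root):
    """Rewrite absolute links under site_root relative to page_url's directory."""
    page_dir = posixpath.dirname(urlparse(page_url).path) or "/"

    def to_relative(match):
        attr, href = match.group(1), match.group(2)
        if not href.startswith(site_root):
            return match.group(0)  # leave external links untouched
        target = urlparse(href).path or "/"
        return f'{attr}"{posixpath.relpath(target, start=page_dir)}"'

    # Naive pattern for illustration only; real HTML needs a real parser.
    return re.sub(r'(href=|src=)"([^"]+)"', to_relative, html)

page = '<a href="http://example.org/docs/page2.html">next</a>'
print(rewrite_links(page, "http://example.org/docs/page1.html",
                    "http://example.org"))
# -> <a href="page2.html">next</a>
```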
The WARC (Web ARChive) format specifies a method for combining multiple digital resources into an aggregate archival file together with related information. WARC is a revision of the Internet Archive's ARC file format [ARC_IA], which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. The WARC format generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.
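To make the record structure concrete, here is a hand-rolled sketch of a single WARC record: a version line and named header fields, a blank line, the content block, and a trailing pair of CRLFs separating it from the next record. The helper function is illustrative; production tools normally use an established WARC library rather than formatting records by hand.

```python
import uuid
from datetime import datetime, timezone

def warc_record(rec_type, uri, content, content_type="application/http"):
    """Build one WARC record: version line, header fields, blank line, content."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {rec_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(content)}",  # byte length of the content block
    ]
    # Header block, blank line, content block, then two CRLFs end the record.
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + content + b"\r\n\r\n"

record = warc_record("response", "http://example.org/",
                     b"HTTP/1.1 200 OK\r\n\r\nhello")
print(record.decode())
```

A WARC file is simply a sequence of such records concatenated (and often gzip-compressed per record), which is what lets a single archival file aggregate many harvested resources along with their metadata.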