Scripting News: Future-safe archives (Scripting News)

  • JeffSand · 2 years ago
    While on sabbatical this summer I started to work on this and didn't really come up with a great solution.

    What I ended up doing was creating a USB stick for my wife that includes my Flickr/Blog/Hosting/Server/Facebook passwords and billing information. Along with that I included a list of close technical friends who will hopefully help her keep it up and going in the near term.

    This buys me a bit of time in case I get hit by a bus tomorrow. Right now I'm working on a solution to pull my content out of Facebook and similar services, stream it back to my blog, cloud, etc., and hopefully back it up to S3. Definitely not something for the average consumer.

    *sigh*
  • dave · 2 years ago
    S3 is tantalizingly close to being part of the solution.

    I'd like to see them do two things to start.

    1. Provide us a way to define an index page: the page it displays when someone goes to http://somesite.com/. Amazingly, there is currently no way to do it. So I couldn't use S3 to store scripting.com, which is a totally static site.

    2. I'd like to be able to pay them say $10K and get a persistent SLA for a site like scripting.com. I don't think the bandwidth bills for such a site come close to the interest generated on $10K, so it could be perpetual.

    I would be concerned about Amazon's longevity though. They appear to be a solid company, but how long will they be around? Will they survive Bezos??
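
    The endowment idea in point 2 comes down to simple arithmetic: the site is "perpetual" if the yearly return on the deposit covers the yearly hosting bill. A rough sketch follows. The function names are made up for illustration, and the return rate, inflation rate and hosting cost are assumptions you would have to supply, not figures from this thread.

```python
def sustainable_annual_budget(principal, annual_return, inflation=0.0):
    """Yearly amount an endowment can pay out without shrinking in real terms.

    annual_return and inflation are fractions (0.04 means 4%); both are
    assumptions supplied by the caller, not known quantities.
    """
    return principal * (annual_return - inflation)


def is_perpetual(principal, annual_cost, annual_return, inflation=0.0):
    """True if the real return on the principal covers the yearly hosting cost."""
    return sustainable_annual_budget(principal, annual_return, inflation) >= annual_cost
```

    With a hypothetical 4% return and no inflation, $10K supports about $400 a year of hosting; whether that beats the bandwidth bill depends on the site.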
  • Jauder Ho · 2 years ago
    This works if this is just content. It gets much much harder if you start thinking about how to preserve applications that power sites. For example, how would one save Twitters in perpetuity? archive.org? Google cache? Twitter? Some sort of snapshot to PDF?

    Obviously this is not a problem just in the consumer space. In many industries, there are requirements to preserve information for 7 or more years. While it may be possible to preserve things at the byte level via tape backups, it is often impossible to retrieve because the software and hardware are obsolete and not available.
  • JeffSand · 2 years ago
    my plan is to grab things like twitter from the RSS feed from the site and again publish to my blogging engine via metaweblog api. from there i'll push to a more permanent store.
  • Bryn · 2 years ago
    I think RSS and metaweblogapi are not a sufficiently abstract way to think about this. When you're gone, the things you want to preserve are no longer dynamic; they are raw assets, and they stop changing. Search indexing technology is much closer to the right algorithm to statically preserve your digital memory. You need:
    1. A sufficient map of each of the entry points of your stuff.
    2. A crawler which can scan each entry point deeply, without missing any assets
    3. A proxy which caches the responses obtained by the crawler, and saves them in a resource structure which is accessible by the originating requests. The storage format should be standard, and portable.
    4. A perpetual domain registrar which maps storage instances of the proxy cache format to web domains via the mapping from item 1.
    5. A perpetual storage service.
    6. A financial trust to keep 4 and 5 running (archive.org?)
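
    Items 1 to 3 of the list above (entry points, a deep crawler, and a cache laid out so the original requests can find their responses) can be sketched in a few lines. This is only an illustration under assumptions: `crawl`, `cache_path` and `LinkExtractor` are hypothetical names, `fetch` is any function you supply that returns a page body as text (or None), query strings are ignored, and real crawling concerns such as robots.txt, binary assets and rate limiting are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href/src attribute values from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def cache_path(url):
    """Map a URL to a relative file path that mirrors the original request."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory-style URLs get an explicit file name
    return parts.netloc + path


def crawl(entry_points, fetch):
    """Breadth-first crawl from the entry points, staying on their hosts.

    Returns a dict of {cache path: body} for every resource reached.
    """
    allowed_hosts = {urlparse(u).netloc for u in entry_points}
    queue = deque(entry_points)
    seen = set(entry_points)
    cache = {}
    while queue:
        url = queue.popleft()
        body = fetch(url)
        if body is None:  # missing resource: nothing to store or follow
            continue
        cache[cache_path(url)] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, link))
            if urlparse(absolute).netloc in allowed_hosts and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return cache
```

    Writing each cache entry to disk under its path gives a plain directory tree, which is about as standard and portable a storage format as item 3 could ask for.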
  • Eric link · 2 years ago
    Yeah, I agree. wget + long term storage for static pages, that would cover most things. And put a PC or two in the bank vault loaded w/ software to view it (browsers of the day, media players, etc). When we developed government and banking software, we archived the OS, compilers, everything needed to recreate the software, and put all that in escrow for the clients.
  • JeffSand · 2 years ago
    the piece I need to get this to the next level is a reblogger tool. I need something that will take an RSS feed and republish it via the metaweblogapi.

    with this i can take my posted items from facebook, my items in my flickr feed, etc and publish to blog store. after that I'll look to pushing to S3 or maybe an MSN space too. ;-)

    anyone know of a tool or a service I can use to do this?
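
    For what it's worth, a reblogger of this kind is small enough to sketch with the Python standard library. This is a minimal illustration, not a finished tool: `rss_items_to_posts` and `republish` are hypothetical names, the endpoint URL, blog id and credentials are whatever your blog engine expects, and there is no de-duplication, so running it twice would post everything twice.

```python
import xml.etree.ElementTree as ET
import xmlrpc.client


def rss_items_to_posts(rss_xml):
    """Turn each <item> of an RSS 2.0 document into a MetaWeblog post struct."""
    root = ET.fromstring(rss_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return posts


def republish(rss_xml, endpoint, blog_id, username, password):
    """Post every feed item to a blog through metaWeblog.newPost."""
    server = xmlrpc.client.ServerProxy(endpoint)
    post_ids = []
    for post in rss_items_to_posts(rss_xml):
        # The final argument True asks the blog to publish rather than save a draft.
        post_ids.append(
            server.metaWeblog.newPost(blog_id, username, password, post, True)
        )
    return post_ids
```

    Fetching the feed itself (for example with urllib.request) and remembering which items were already posted are the two pieces you would still need to add.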
  • Ethan Osten · 2 years ago
    Isn't this the problem that the Wayback Machine was supposed to help solve?
  • dave · 2 years ago
    Yes, that's a common question, first I have to ask you if you think it does?

    What happens when someone goes to one of Marc's sites -- how would they know to go to archive.org?

    Have you tested your assumption? Have you looked to see how complete an archive they have of your sites? (I have, it's not complete.)

    What I want is better than what archive.org provides. It could be part of the solution, but if it were the complete solution, we'd just host our sites there, and that would be the end of the problem. But we don't -- for good reason. They don't have the technical answer, they don't host your domains (and is there a way to pay a fee for a domain that lasts in perpetuity?) and they don't provide any guarantee that part or all of your archive will be maintained for any period of time. Even if they did, you'd have to ask what's the likelihood they'll be around in 50 years, or 100 years, to fulfill the terms of the agreement? That's why the longevity of the institution doing the archive matters too. Harvard partnering with archive.org might be a good solution. Add an SLA in there, and we might be getting close to the answer.
  • eas · 2 years ago
    As Dave points out, Archive.org is incomplete. There are various reasons for this, including the fact that some people don't want their sites archived because they lose control of their data, but that gives me an idea.

    What if those who cared could communicate a policy to archive.org? Maybe they could say "archive this, but don't make it publicly available for 5 years from the date you archived it." Or, "archive this, but hide it from public view until 10 years after it disappears from the web."
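
    The two example policies are easy to state precisely. Here is a rough sketch of how an archive might evaluate them; the policy names `after_archive` and `after_disappearance`, the dict layout and the function names are all invented for illustration and do not correspond to anything archive.org actually supports.

```python
from datetime import date


def add_years(d, years):
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def is_public(policy, today, archived_on, disappeared_on=None):
    """Decide whether an archived copy may be shown to the public today.

    policy is a dict such as {"kind": "after_archive", "years": 5} or
    {"kind": "after_disappearance", "years": 10} (hypothetical format).
    """
    if policy["kind"] == "after_archive":
        return today >= add_years(archived_on, policy["years"])
    if policy["kind"] == "after_disappearance":
        if disappeared_on is None:  # the site is still live, so stay hidden
            return False
        return today >= add_years(disappeared_on, policy["years"])
    raise ValueError("unknown policy kind: %r" % policy["kind"])
```

    The second policy needs the archive to notice when a site disappears, which is the harder part and is not covered here.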
  • dave · 2 years ago
    Do you think it does?
  • Rex Hammock · 2 years ago
    I'm sure Google, Yahoo!, Amazon, et al., would love to host a backup of all you've written, Dave. Aren't they in a mad dash to digitize all the books ever published? Just think of all the keywords you've used in the past ten years that any one of them would love to monetize.

    In all seriousness, I appreciate you talking about this topic. It's made me think about some things I'd otherwise have not considered.
  • dave · 2 years ago
    I promise you I have never gotten a response to one of these posts from any of those companies (this is the first FSA piece I've written). I don't think they're thinking that way, their interest is in monetizing, not preserving, despite what they say. I wouldn't support them monetizing my work, so they're probably not interested. I think we have to look to academia for a solution, and each of us should think in terms of creating a mini-endowment to support preserving our work (with other benefits to the organization receiving our largesse).
  • dave · 2 years ago
    Ooops, meant to say this is *not* the first FSA piece I've written.
  • byoogle · 2 years ago
    I was just asking on Twitter (bonus, "perfect" twit!):

    Why not host your content on big companies' free services? They're likely around for a while and incented to keep your stuff up.

    I don’t get the $ argument. Why do publishers still print Shakespeare books? Because people still buy them.
  • Scobleizer · 2 years ago
    Yeah, Archive.org isn't even close to complete. I'd guess they only have 10% of my first two years' worth of blogging backed up and none of the important ones, like what I wrote on 9/11.

    Speaking of which I gotta write to Lawrence and see if he has a copy of all my earlier blogs on some hard drive somewhere.
  • Garrett Combs · 2 years ago
    The more I toss this concept around in my head, the more I keep coming back to content being in hard-copy form. jauderho's point emphasizes just that. There's no way to guarantee that the information that you store digitally will still be accessible, say, even tomorrow. Technology is constantly evolving, and 100 years from now, who's to say I'll still be able to open a RAW image?

    As a photographer, I am thinking about preserving images that I take. Hard-copy works well for this. But as we all have encountered with attempting to digitize every bit of hard-copy information these days and become "paperless" in our lives, hard-copy information degrades with time. It's a catch-22 either way you go, for my situation.

    On the topic of archiving content, such as scripting.com, I think you could digitally archive textual content relatively easily. But as Dave has already pointed out, who do you trust with that information? How long will a company/service/whatever be around? There's no guarantee.

    You've definitely provoked thought in my mind; we all (those of us that wish to maintain some type of legacy after we've passed) have a problem that needs solving, and relatively quickly. There's no telling when our day will inevitably come.
  • Hanan Cohen · 2 years ago
    This post takes me back to something I wrote back in 2004 about Weblogs.com, MailToTheFuture.com and death.

    It begins like this:

    "(To make things clear from the beginning, this piece is not about Dave Winer)

    It's time to add "death" to our thinking about the Internet."

    http://info.org.il/english/mail_to_the_future_d...
  • Jodie Miners · 2 years ago
    I've been thinking about this for the past few weeks after I was disgusted to hear the story of Australian "journalists" using Facebook images of a soldier who was killed in Afghanistan for their story.
    So automated dead man's switch type scenarios don't work because they take too long. Some form of escrow service that you can keep updated and give access to a few select friends or family members would be good, along with some basic instructions, such as "take down flickr immediately" or "keep my blog going as a memorial" or something to that effect.
    But even that doesn't solve the problem of very long-term storage of things of historical significance. The National Library of Australia has some websites they are archiving as significant resources (see http://pandora.nla.gov.au/about.html), but even that would be minuscule.
    So I suppose at the moment it's up to each one of us individually to have a personal plan kept in place with a few trusted people that know about it. I like Jeff Sand's idea of the Memory Stick but I would like to see something available on the Web.
  • John Norris · 2 years ago
    How about your blog in a nice, long lasting, book form? You'd lose a lot of the interactivity of a site, but maybe keep at least some of the content.

    Looks like there are some folks doing something like that already (blurb.com, I don't know anything about them.)

    I have 5 years worth of posts and scraped it into a nice big file, that's a bit of a hairball...now what?

    Excellent topic
  • https://openid.org/steven · 2 years ago
    Dave, you are spot on. I made a post about this a while back and re-entered it due to your post.

    "The dead web - Google + Archive.org? "

    http://tinyurl.com/2jdnmu
  • John Evans (Syntagma) · 2 years ago
    You've hit on the major problem with web-based work -- it needs constant attention and a team to watch over it in perpetuity.

    Books require no attention, except a decent fire-safe environment. Your work would be much more likely to survive if you have it printed in book form and donate copies to a wide spread of major libraries: Congress, British Library etc.

    I doubt Amazon, Google etc. will be around in 100 years, but there's a very strong chance the libraries will.

    What keeps the great writers alive and in print, of course, is public demand. Without that, anyone's work is doomed. The best we can do is to give it a chance by putting it in a place of assured safety -- books in libraries.
  • bowenmark · 2 years ago
    Thought the same things myself many a time. Storage mediums that rely on advanced technology (well, more advanced than acid-free paper or stone tablets) tend to be flaky to start with. Remember microfiche?
  • michael · 2 years ago
    One piece of the pie is to nominate a (very) few formats as immortal. Those formats would be "kept alive" by folks going forward so future generations will be able to access the information we create now. Certainly no one will be able to read my old ClarisWorks docs in fifty years (most can't read them today). But there is hope for reading that same document in pdf (maybe).

    These formats should not be moving targets. They need to be stable and easy to deal with.

    We will need to update the list - certainly the current pdf format can't hold future holographic data (or even modern video). But this process shouldn't happen too quickly. Maybe a 10 or 20 year time frame for adding a new format would be appropriate.

    And obviously we need open formats with multiple open source readers available.
  • jeremyw · 2 years ago
    It would make sense to involve trained archivists in discussing this problem. Digital record preservation is a very important topic in the field at present. They may not develop the software-level solutions, but they have deep knowledge of and experience in not only the maintenance and organization of collections of documents in various forms and media, but also in dealing with legal and ethical issues arising.

    These latter issues are just as necessary to deal with as the technology side of things. For instance, even if a web archival system is entirely voluntary, should there be restrictions on the use of content? Should a formula of standard practices be developed for the preservation of blogs and other websites for people who do not specify how their legacy should be maintained once they pass? Should family members have a say in what is or is not preserved?

    On the technology and organization side of things, as people have begun to suggest, the issue of how web material should be preserved must be addressed. Should websites be maintained in a static or dynamic form as-is, i.e., maintaining the same URIs and preserving everything intact, without necessarily indicating that the site one is viewing is 'historical'? Is a particular degree of centralization or distribution of repositories desirable?

    It is also worth considering drawbacks to what might become a de facto automatic archival system that follows an author's death. Should we take any measures to avoid extreme data clutter that could develop over the next century, or less, as hundreds of millions of new creators take to the web? What do people involved in the search engine business have to say? (Do they think that many fiscal quarters in advance?) Should we even care, or do we just let them adapt?
  • Steve · 2 years ago
    I just suggested the exact same thing to Oliver Starr....going out to collect and collate all of Marc's digital breadcrumbs to preserve them for his children and the grandchildren that will, sadly, never know him in person.

    --Steve
  • hardaway · 2 years ago
    You have hit a nerve here. Can't there be a subscription service that we can order if we want to preserve our work? Or is there something we can will to our children? My sites are hosted by a friend, and I think my daughters would pay to preserve them the way a gravesite is now preserved in a cemetery, because this is where my "essence" is.

    Interestingly enough, the social networks may end up being a solution for this.
  • AndrewBurton · 2 years ago
    The really sad thing about post-life digital presence is there's already a perfectly good infrastructure in place to handle it: the Internet. Barring the complete annihilation of mankind, the 'Net likely isn't going away. PCs will always be connected, as will servers, Xboxes, etc. What someone needs to do is capitalize on the botnet business model: make software that's one part web server and one part BitTorrent, with accessible lives broken up and stored across countless servers.
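
    The "broken up and stored across countless servers" part can be sketched with content-addressed chunks, in the general spirit of BitTorrent but not its actual protocol. Everything here is a toy under assumptions: `store`, `retrieve` and `split_into_chunks` are hypothetical names, each "node" is just an in-memory dict standing in for a peer, and `copies` should not exceed the number of nodes.

```python
import hashlib


def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store(data, nodes, chunk_size=4, copies=2):
    """Spread chunks over the nodes and return the manifest of chunk hashes."""
    manifest = []
    for chunk in split_into_chunks(data, chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()
        manifest.append(digest)
        start = int(digest, 16) % len(nodes)  # the hash picks the first node
        for k in range(copies):
            nodes[(start + k) % len(nodes)][digest] = chunk
    return manifest


def retrieve(manifest, nodes):
    """Rebuild the data from whichever nodes still hold valid chunks."""
    parts = []
    for digest in manifest:
        for node in nodes:
            chunk = node.get(digest)
            # Re-hash the chunk so a corrupted copy is never accepted.
            if chunk is not None and hashlib.sha256(chunk).hexdigest() == digest:
                parts.append(chunk)
                break
        else:
            raise KeyError("chunk %s is not available on any node" % digest)
    return b"".join(parts)
```

    Because every chunk is stored on two different nodes, losing any single node still leaves the data recoverable; the manifest itself then becomes the small thing that has to be kept safe.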
  • Joe · 2 years ago
    PutPlace.com, my company, is working on an open source interface to S3 that gives a RESTful API to S3 with support for hierarchical storage, user access control, quotas and caching. We also have some other tricks up our sleeves to insulate you from changes in the S3 T&Cs.
  • Jeff Ubois · 2 years ago
    Dave,

    Really good to see you hammering on this problem.

    Writing may be kinda solitary, but preservation has to be collaborative, and there are a lot of disparate efforts underway now to find solutions. There are a lot of subtle aspects to this, ranging from file formats, emulating applications, broken links to things outside the personal collection, and domain registration, to sustaining institutions over the long term.

    Alfred de Grazia (who did computer based social network analysis in the early 1950s using punch cards) came up with an economic model he described at http://www.grazian-archive.com/projects/archvpt....

    Cathy Marshall at Microsoft has done a lot of excellent work looking at user behavior, i.e. the human dimension, and how things really get lost. See Evaluating Personal Archiving Strategies for Internet-based Information at arxiv.org/pdf/0704.3647.pdf