CMU Archives Repository Roadmap v.1
2017-08-09
Main sub-projects:
1. Metadata Clean-up
-pull metadata out of ArchivalWare. Fix, normalize, augment & convert.
2. Scanning Workflow
-automate OCR & ingest
3. Archive Repository Software
-build repository, workflows, UI & preservation
Can be done in parallel:
Metadata Clean-up   --------------------------------
                                                    \ (ingest into repository)
Scanning Workflow   ----------------
                         (OCR)         (ingest into repository)
Archive Repository  -------------------------------------------------------------…------->
Tasks in the sub-projects:
Metadata Clean-up
-dump from ArchivalWare
-check for completeness
  -(restore data lost on ArchivalWare import?)
-evaluate
-plan
  [vocabularies, taxonomies, schemas to use for each type of collection]
  [vocabularies, taxonomies, schemas to use in common - DublinCore/MODS +?]
  how do we get from metadata we have to metadata we want?
-automate conversions we can
  -OpenRefine / conciliator + CMU modifications?
-automate verification/analysis (e.g. show all values of each field & # occurrences)
-global find-replaces
-convert to standard md models (DC/MODS)
-(eventually map to linked data?)
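A minimal sketch of the "show all values of each field & # occurrences" and "global find-replaces" tasks, assuming the ArchivalWare dump can be exported as CSV. The column names, sample rows, and the replacement table are hypothetical, not taken from the actual data.

```python
import csv
import io
from collections import Counter, defaultdict


def field_value_counts(rows):
    """Return {field: Counter({value: occurrences})} for an iterable of dict rows."""
    counts = defaultdict(Counter)
    for row in rows:
        for field, value in row.items():
            counts[field][(value or "").strip()] += 1
    return counts


def global_replace(rows, field, replacements):
    """Yield copies of rows with exact-match values in `field` replaced."""
    for row in rows:
        fixed = dict(row)
        value = (fixed.get(field) or "").strip()
        fixed[field] = replacements.get(value, value)
        yield fixed


# Hypothetical sample standing in for a CSV export of the metadata dump.
SAMPLE = """id,creator,type
1,"Smith, John",photo
2,"Smith, J.",Photo
3,"Smith, John",book
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
before = field_value_counts(rows)

# Assumed clean-up rule: normalize the capitalization of one type value.
cleaned = list(global_replace(rows, "type", {"Photo": "photo"}))
after = field_value_counts(cleaned)

# Report every distinct value per field, most frequent first.
for field, counter in after.items():
    print(field)
    for value, n in counter.most_common():
        print(f"  {n:>4}  {value!r}")
```

Running the report before and after each replacement pass makes it easy to spot variants (such as "Smith, J." above) that still need a rule or an OpenRefine reconciliation step.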
Scanning Workflow
-bagit
  -bagit with collection affiliation?
  -use scanned path?
-Abbyy OCR automation
  -command line version
  -config file version
  -server (pull tiffs from repository, push OCR/text in)
-PDF web optimization(?)
-conversion scripts to do:
  -derivative generation (e.g. text, thumbnails, JP2, epub etc.)
  -metadata lookup (use id to get metadata)
  -FITS metadata generation (file info tool set technical md)
  -ingest into repositories (CMU’s ArchiveRepository, figshare, box, etc.)
  -backup (rsync)
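A rough sketch of the bagit step, using only the Python standard library to show what a bag contains: a data/ payload directory, a bagit.txt declaration, and a checksum manifest. The function name, the "Collection-Affiliation" label in bag-info.txt, and the demo file are assumptions for illustration; a real workflow would more likely use an existing BagIt tool than hand-rolled code.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large TIFFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_minimal_bag(source_dir, bag_dir, collection=None):
    """Copy source_dir into bag_dir/data and write the BagIt tag files."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # fails if bag_dir/data already exists

    # One manifest line per payload file: "<checksum>  <path relative to the bag>".
    manifest_lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(bag_dir).as_posix()
        manifest_lines.append(f"{sha256_of(path)}  {relative}")

    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    if collection:
        # Hypothetical custom label for recording collection affiliation.
        (bag_dir / "bag-info.txt").write_text(
            f"Collection-Affiliation: {collection}\n", encoding="utf-8"
        )
    return manifest_lines


# Demo on a throwaway directory; the file is placeholder bytes, not a real scan.
with tempfile.TemporaryDirectory() as tmp:
    scans = Path(tmp) / "scans"
    scans.mkdir()
    (scans / "page_0001.tif").write_bytes(b"not a real TIFF, just demo bytes")
    manifest = make_minimal_bag(scans, Path(tmp) / "bag", collection="demo-collection")
    bag_files = sorted(p.name for p in (Path(tmp) / "bag").iterdir())
    print("\n".join(manifest))
```

Because the manifest paths keep the directory layout under data/, the "use scanned path?" question could be answered by preserving the scanner's folder structure inside the payload.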
Archive Repository Software
-evaluate current state (hydra/CLAW)
-choose platform (probably CLAW. both supposedly interoperable - PCDM)
-meet w/ Pitt about their CLAW experience [8/21/2017, ongoing]
-discuss w/ peer institutions (Penn St, U Maryland, U Oregon)
-discuss w/ CLAW team
-CLAW tech calls
-map each type of our collections to existing/needed templates (photos/books/newspapers/finding aids/etc.)
-put a collection which is supported into system for evaluation
-determine needed pieces which are not being worked on (based on our use cases)
-build missing pieces (easier said than done)
-user interface design
-admin user interface design
-preservation layer (determine what this means & provide it)
-APIs (OAI-PMH (for Primo etc.), LOCKSS (for MetaArchive? etc.), SPARQL)
-staging system w/ administrative functions
-production system
-usability studies
-accessibility testing
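To make the OAI-PMH item concrete, here is a small sketch of what a harvester such as Primo would do against the repository: build a ListRecords request and read Dublin Core titles out of the response. The endpoint URL, set name, and sample response are hypothetical; the verb, parameters, and XML namespaces come from the OAI-PMH 2.0 protocol. Resumption tokens and error responses are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Extract the identifier and dc:title values from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        titles = [el.text for el in record.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    return records


# Hypothetical response, trimmed to the elements the parser reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:demo-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Demo photograph</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

# Placeholder endpoint and set name, not a real service.
url = list_records_url("https://repository.example.org/oai", set_spec="photos")
records = parse_records(SAMPLE_RESPONSE)
print(url)
print(records)
```

A check like this could run against the staging system to confirm that each collection type exposes the Dublin Core fields that downstream discovery tools expect.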
References
Islandora CLAW
https://islandora.ca/CLAW
Intro to Islandora CLAW
https://islandora-claw.github.io/CLAW/user-documentation/intro-to-claw/
Islandora CLAW MVP
https://islandora-claw.github.io/CLAW/mvp/mvp_doc/
LoC Standards (EAD, MODS, METS, MARC, MARCXML, VRA, etc.)
http://www.loc.gov/standards/
VIAF: The Virtual International Authority File
https://viaf.org/
W3C Linked Data
https://www.w3.org/standards/semanticweb/data