CMU Archives Repository Roadmap v.1
2017-08-09
Main sub-projects:
Metadata clean-up
pull metadata out of ArchivalWare. Fix, normalize, augment & convert.
Scanning Workflow
automate OCR & ingest
Archive Repository Software
build repository, workflows, UI & preservation
Can be done in parallel:
Metadata Cleanup--------------------------------\
                                                 \ (ingest into repository)
Scanning Workflow---\-------------\
                     \(OCR)        \(ingest into repository)
Archive Repository-------------------------------------------------------------…------->
Tasks in the sub-projects:
Metadata Clean-up
- dump from ArchivalWare
- check for completeness
- restore data lost on ArchivalWare import?
- evaluate
- plan
  - vocabularies, taxonomies, schemas to use for each type of collection
  - vocabularies, taxonomies, schemas to use in common (Dublin Core/MODS + ?)
  - how do we get from the metadata we have to the metadata we want?
- automate the conversions we can
  - OpenRefine / conciliator + CMU modifications?
- automate verification/analysis (e.g. show all values of each field & number of occurrences)
- global find-replaces
- convert to standard metadata models (DC/MODS)
- eventually map to linked data?
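
The verification and global find-replace tasks above could be prototyped along these lines. This is only a sketch: it assumes the ArchivalWare dump can be exported as CSV, and the sample rows, field names, and replacement table are hypothetical stand-ins, not real collection data.

```python
import csv
import io
from collections import Counter, defaultdict


def field_value_counts(rows):
    """Count how often each distinct (whitespace-trimmed) value occurs per field."""
    counts = defaultdict(Counter)
    for row in rows:
        for field, value in row.items():
            if field is None:  # extra, unnamed columns in a malformed row
                continue
            counts[field][(value or "").strip()] += 1
    return counts


def global_replace(rows, field, replacements):
    """Return copies of rows with exact-match replacements applied to one field."""
    fixed = []
    for row in rows:
        new_row = dict(row)
        value = (new_row.get(field) or "").strip()
        new_row[field] = replacements.get(value, value)
        fixed.append(new_row)
    return fixed


# Hypothetical sample standing in for a dump from ArchivalWare.
SAMPLE_DUMP = """id,creator,type
1,"Carnegie, Andrew",photo
2,"Carnegie, Andrew ",photograph
3,"Simon, Herbert A.",Photo
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_DUMP)))
counts = field_value_counts(rows)
for field, counter in counts.items():
    # Show every value of each field with its number of occurrences.
    print(field, counter.most_common())

# Normalize the variant spellings that the report above makes visible.
cleaned = global_replace(rows, "type", {"photo": "photograph", "Photo": "photograph"})
```

Reviewing the per-field report first, then writing the replacement table from it, keeps the find-replace step auditable. OpenRefine offers the same kind of faceting interactively, so a script like this would mainly be useful for repeatable batch runs.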
Scanning Workflow
- BagIt initial
- BagIt with collection affiliation? (use scanned path?)
- ABBYY OCR automation
  - [x] command line version
  - [ ] config file version
  - [ ] server (pull TIFFs from repository, push OCR/text in)
- PDF web optimization(?)
- conversion scripts to do:
  - [ ] derivative generation (e.g. text, thumbnails, JP2, EPUB, etc.)
  - [ ] metadata lookup (use ID to get metadata)
  - [ ] FITS metadata generation (File Information Tool Set, technical metadata)
- ingest into repositories (CMU’s ArchiveRepository, figshare, Box, etc.)
- backup (rsync)
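
As an illustration of the BagIt steps above, here is a minimal sketch that packages a scan directory as a bag using only the Python standard library. It is an assumption-laden example, not the workflow's actual script: the directory layout, the `CMU-Collection` label in bag-info.txt, and the rule that the parent directory name identifies the collection are all hypothetical. In practice an existing BagIt library would likely be a better choice than hand-written tag files.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def collection_from_path(scan_path):
    """Assumption: the scan's parent directory name is the collection identifier."""
    return Path(scan_path).parent.name


def make_bag(source_dir, bag_dir, collection):
    """Copy source_dir into a new BagIt bag at bag_dir and return the bag path."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir itself

    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )

    # Payload manifest: one "<checksum>  <path>" line per payload file.
    # Whole files are read into memory here; large TIFFs would need chunked hashing.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")

    # "CMU-Collection" is a hypothetical label, not a reserved BagIt field.
    (bag_dir / "bag-info.txt").write_text(
        f"CMU-Collection: {collection}\n", encoding="utf-8"
    )
    return bag_dir


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical scan layout: scans/<collection>/<item>/<page files>
    scans = Path(tmp) / "scans" / "collection_a" / "item_0001"
    scans.mkdir(parents=True)
    (scans / "page_001.tif").write_bytes(b"not a real TIFF")

    bag = make_bag(scans, Path(tmp) / "bags" / "item_0001", collection_from_path(scans))
    manifest_text = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    bag_info_text = (bag / "bag-info.txt").read_text(encoding="utf-8")
    print(manifest_text, bag_info_text, sep="")
```

Recording the collection in bag-info.txt at scan time means the later ingest step would not need to re-derive it from the file path.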
Archive Repository Software
- evaluate current state (hydra/CLAW)
- choose platform (probably CLAW; both are supposedly interoperable via PCDM)
- meet w/ Pitt about their CLAW experience [8/21/2017, ongoing]
- discuss w/ peer institutions (Penn State, U Maryland, U Oregon)
- discuss w/ CLAW team
- CLAW tech calls
- map each type of our collections to existing/needed templates (photos/books/newspapers/finding aids/etc.)
  - put a supported collection into the system for evaluation
- determine needed pieces which are not being worked on (based on our use cases)
  - build missing pieces (easier said than done)
- user interface design
- admin user interface design
- preservation layer (determine what this means & provide it)
- APIs (OAI-PMH (for Primo etc.), LOCKSS (for MetaArchive? etc.), SPARQL)
- staging system w/ administrative functions
- production system
- usability studies
- accessibility testing
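
To make the OAI-PMH item in the API list above more concrete, the sketch below shows roughly what a harvester such as Primo would send and receive. The base URL, set name, and sample response are hypothetical. The verb, the `oai_dc` metadata prefix, and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications. No network call is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_titles(response_xml):
    """Return (identifier, first dc:title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        results.append((identifier, title))
    return results


# Hypothetical response, trimmed to the elements this sketch reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:item-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample photograph</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

# Hypothetical endpoint and set name.
url = list_records_url("https://repository.example.edu/oai", set_spec="photos")
print(url)
print(parse_titles(SAMPLE_RESPONSE))
```

A check like this against the staging system could confirm that converted DC metadata is exposed as expected before a discovery layer is pointed at production.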
References
Islandora CLAW
https://islandora.ca/CLAW
Intro to Islandora CLAW
https://islandora-claw.github.io/CLAW/user-documentation/intro-to-claw/
Islandora CLAW MVP
https://islandora-claw.github.io/CLAW/mvp/mvp_doc/
LoC Standards (EAD,MODS,METS,MARC,MARCXML,VRA, etc.)
http://www.loc.gov/standards/
VIAF: The Virtual International Authority File
https://viaf.org/
W3C Linked Data
https://www.w3.org/standards/semanticweb/data