First, he reviewed types of content: “Opaque” content, such as paper copies, is not processable. “Annoying” content insists on unwieldy proprietary formats. “Polluted” content is corrupted or mixes formats. And “tolerable” content comes as HTML, a Word document or something else that is more or less manageable.
There are various strategies for converting these types of content: You can do it manually, get a tool to do it, or outsource it. His best practice in a nutshell is to stay flexible and open in order to find the best possible mix of tools, specialist help and automation.
Spelled out as a process, it looks something like this:
- Decide on the legacy sources and the target schema to convert.
- Analyze your sources carefully (and possibly clean them up where necessary).
- Map sources to the target schema.
- Establish conversion rules (and the gaps to fill by manual editing).
- Perform the actual conversion.
- If possible and desired, add necessary and useful metadata, links and connections to topics.
- Check the converted contents for accuracy, consistency and completeness (according to initial scope).
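
To make the mapping, rule and checking steps more concrete, here is a minimal Python sketch. It is not taken from the talk: the element names in `MAPPING`, the `ConversionResult` class and the sample data are all hypothetical, and a real project would work on actual files and a real target schema. The idea it illustrates is that anything the rules cannot handle is collected as a gap for manual editing rather than silently dropped.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from legacy element names to target schema elements.
MAPPING = {
    "h1": "title",
    "p": "paragraph",
    "li": "step",
}


@dataclass
class ConversionResult:
    converted: list = field(default_factory=list)  # (target_element, text) pairs
    gaps: list = field(default_factory=list)       # source items left for manual editing


def clean(text: str) -> str:
    """Clean-up step: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def convert(source, mapping):
    """Apply the mapping to every source item; unmapped items become gaps."""
    result = ConversionResult()
    for tag, text in source:
        text = clean(text)
        target = mapping.get(tag)
        if target is None:
            result.gaps.append((tag, text))
        else:
            result.converted.append((target, text))
    return result


def is_complete(source, result) -> bool:
    """Completeness check: every source item was either converted or flagged."""
    return len(result.converted) + len(result.gaps) == len(source)


# Made-up legacy content as (element, text) pairs.
legacy = [
    ("h1", "Replacing  the filter"),
    ("p", "Switch off the device first."),
    ("li", "Open the cover."),
    ("blink", "Special offer!"),
]

result = convert(legacy, MAPPING)
print("Converted:", result.converted)
print("Needs manual editing:", result.gaps)
print("Complete:", is_complete(legacy, result))
```

In this sketch the `blink` item has no rule, so it ends up in `result.gaps`. The completeness check only counts items; checking accuracy and consistency would still need a review against the initial scope.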
He also pointed out a few caveats:
- Do not underestimate the complexity of the conversion process.
- Focus on the conversion purpose and business case, because neither structured content nor conversion can be an end in itself.