A data pipeline using machine learning for UK taxpayers
The National Archives is a veritable treasure trove of interesting and potentially valuable information and artefacts, from a mountain of over 50,000 court judgments to military records dating back to 1703.
How could The National Archives’ materials be captured as data, processed and organised, taking into account their varied nature, ruling and detail? How could this data be enriched to include hyperlinks, reconcile inconsistencies, add citations, and identify abbreviations to improve accessibility? And how could all of this be achieved without the taxpayer funding a small army to crunch through it manually?
The National Archives asked MDRx to develop a sophisticated machine learning-enabled data pipeline that would ingest a vast number of documents, enrich them without the need for human intervention, and deliver value both for the UK taxpayer and for the platform's most likely users, including researchers, academics and lawyers.
MDRx deployed an expert team to run an effective Discovery in which we mapped the existing data capture processes, agreed on universal, simplified language for public consumption, validated the most appropriate natural language processing (NLP) technology and data science approaches, and built an understanding of the Government’s broader goals, including those of the Ministry.
We designed an end-to-end process to make court judgments and all underlying data machine-readable. We first broke down the text, ingested in any of six formats (.csv, .xls, .xlsx, .pdf, .doc or .docx), into metadata for pattern identification. We then designed and deployed data models to predict outcomes for similar cases, before developing a pipeline of citations to enrich key materials further. Finally, we detected abbreviations and other legalese to enhance the clarity and accessibility of the materials.
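To give a flavour of the rules-based side of this kind of enrichment, here is a minimal Python sketch of citation and abbreviation detection. It is an illustration only and not the pipeline delivered for The National Archives: the regular expression covers just a handful of UK neutral citation forms (for example "[2021] EWCA Civ 123"), and the glossary, function names and sample text are all hypothetical.

```python
import re

# Assumed pattern: matches a small subset of UK neutral citations,
# e.g. "[2019] UKSC 12" or "[2020] EWHC 456 (Ch)". Not exhaustive.
NEUTRAL_CITATION = re.compile(
    r"\[(?P<year>\d{4})\]\s+"
    r"(?P<court>UKSC|UKPC|EWCA\s+(?:Civ|Crim)|EWHC)\s+"
    r"(?P<number>\d+)"
    r"(?:\s+\((?P<division>[A-Za-z]+)\))?"
)

# Hypothetical glossary; a real system would need a far larger,
# curated list of legal abbreviations.
ABBREVIATIONS = {
    "CPR": "Civil Procedure Rules",
    "LJ": "Lord Justice",
    "QC": "Queen's Counsel",
}


def find_citations(text):
    """Return neutral citations found in text, with whitespace normalised."""
    return [" ".join(m.group(0).split()) for m in NEUTRAL_CITATION.finditer(text)]


def find_abbreviations(text, glossary=ABBREVIATIONS):
    """Return a dict of glossary abbreviations that appear as whole words."""
    found = {}
    for abbr, expansion in glossary.items():
        if re.search(rf"\b{re.escape(abbr)}\b", text):
            found[abbr] = expansion
    return found


def enrich(text):
    """Collect citation and abbreviation metadata for one document."""
    return {
        "citations": find_citations(text),
        "abbreviations": find_abbreviations(text),
    }


if __name__ == "__main__":
    sample = (
        "As held in [2021] EWCA Civ 123 and [2019] UKSC 12, the CPR apply. "
        "See also [2020] EWHC 456 (Ch)."
    )
    print(enrich(sample))
```

In practice, fixed rules like these would sit alongside trained NLP models, since judgments cite cases and abbreviate terms in many more ways than a single pattern or glossary can capture.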
We used a combination of natural language processing, machine learning and rules-based automation to deliver iteratively, with a disciplined Agile Scrum approach that allowed us to launch a public-facing piece of national infrastructure in just three months. We developed a portal that housed and catalogued materials in an accessible and flexibly searchable way.
We supported The National Archives’ procurement of a third-party product, vLex Justis, giving it access to one of the largest information repositories in the world and setting the product up for success once we had completed our engagement. We invested heavily in stakeholder engagement, knowledge transfer and documentation to ensure the public could benefit from a manageable, maintainable and resilient system for generations to come.
- World first: The English legal system is the first common law system to democratise judgments by providing complete and free public access.
- World-leading: We benchmarked the product favourably against global comparators, reaffirming the UK’s pre-eminence as a fair, transparent and commercially minded jurisdiction.
- Taxpayer value: We delivered a world-class system that exceeded expectations in a lean and cost-effective manner, and produced results in record time.
- Innovation, de-risked: We deployed intellectual property and court procedural experts to allow The National Archives to run faster, secure in the knowledge that compliance underpins everything we do.