DOCUMENT INTELLIGENCE ANALYSIS

Document Metadata Extraction

Documents published online often contain embedded properties detailing their authors, internal file paths, and editing history. This showcase displays the scripts used to gather and parse public media documents, and presents the resulting findings.

Step 01 // Bulk Media Fetcher (Python)

To retrieve attachments in bulk from sequentially numbered URLs, we use a Python script. It requests each ID in turn and chooses a file extension by inspecting the HTTP Content-Type response header.

import os
import requests
import mimetypes
from urllib.parse import urljoin
import time

# Configured for local testing
BASE_URL = "https://www.cafcass.gov.uk/media/"
OUTPUT_DIR = "./downloaded_assets"
START_ID = 1
END_ID = 5000

os.makedirs(OUTPUT_DIR, exist_ok=True)

def download_file(resource_id):
    file_url = urljoin(BASE_URL, str(resource_id))
    
    try:
        # GET the resource; TLS certificate verification stays on (the requests default)
        response = requests.get(file_url, timeout=10)
        
        if response.status_code == 200:
            # 1. Read the Content-Type header (e.g., "application/pdf")
            content_type = response.headers.get('Content-Type', '').split(';')[0].strip()
            
            # 2. Look up the standard extension for this MIME type
            ext = mimetypes.guess_extension(content_type)
            
            # Fall back to a generic extension if the type is unknown
            if not ext:
                ext = '.dat'
                
            # 3. Construct the filename using the correct extension
            filename = f"asset_{resource_id}{ext}"
            filepath = os.path.join(OUTPUT_DIR, filename)
            
            with open(filepath, 'wb') as f:
                f.write(response.content)
            print(f"[+] Downloaded asset {resource_id} as {filename} (MIME: {content_type})")
        else:
            print(f"[-] Asset {resource_id} returned status: {response.status_code}")
            
    except requests.exceptions.RequestException as e:
        print(f"[!] Error fetching asset {resource_id}: {e}")

if __name__ == "__main__":
    print(f"Starting generic download from {BASE_URL}...")
    for rid in range(START_ID, END_ID + 1):
        download_file(rid)
        time.sleep(0.1)
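
The header-to-extension step in the script above can be isolated into a small helper, which makes it easier to check without any network access. This is a minimal sketch: the function name extension_for and its fallback parameter are hypothetical and do not appear in the original script.

```python
import mimetypes


def extension_for(content_type_header, fallback=".dat"):
    """Map a raw Content-Type header value to a file extension.

    `extension_for` and `fallback` are illustrative names, not part of
    the original script.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = (content_type_header or "").split(";")[0].strip().lower()
    # guess_extension returns None for unknown or empty types.
    return mimetypes.guess_extension(mime) or fallback
```

Calling extension_for("application/pdf; charset=binary") yields ".pdf", while a missing or unrecognised header falls back to ".dat", matching the behaviour of the inline logic in download_file.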

Step 02 // Metadata Property Parser (PowerShell)

Once files are downloaded locally, we use a Windows PowerShell script utilizing the COM Shell.Application object. This extracts extended shell properties such as System.Author and System.Document.LastAuthor.

$folderPath = "C:\Users\rtrav\govuk\downloaded_assets"
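# Shell.Application exposes the Windows property system over COM
$shell = New-Object -ComObject Shell.Application
$folder = $shell.Namespace($folderPath)

# -File skips subdirectories, which carry no document properties
Get-ChildItem -Path $folderPath -File | ForEach-Object {
    $file = $folder.ParseName($_.Name)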
    [PSCustomObject]@{
        FileName      = $_.Name
        Author        = ($file.ExtendedProperty("System.Author") -join "; ")
        LastSavedBy   = $file.ExtendedProperty("System.Document.LastAuthor")
    }
} | Export-Csv -Path "metadata_export.csv" -NoTypeInformation

Security Recommendation

Sanitizing Before Publishing

Organizations should run an automated document sanitization step before assets are uploaded to public CMS repositories. Many document management suites can automatically strip custom properties, author identities, revision history, and internal folder paths.
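
As an illustration of what such a pre-publication step might look like, the sketch below blanks the author and last-modified-by fields in the core properties part (docProps/core.xml) of an Office Open XML file such as .docx or .xlsx. It uses only the Python standard library. The function names scrub_core_xml and scrub_ooxml are hypothetical, and the sketch assumes the package is not encrypted. It does not touch docProps/app.xml, custom properties, comments, tracked changes, or metadata inside embedded images, so it is a starting point rather than a complete sanitizer.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces that commonly appear in docProps/core.xml. Registering them
# keeps the original prefixes when the XML is re-serialised.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}
CORE_PART = "docProps/core.xml"
# Identity-bearing fields to blank (an illustrative, not exhaustive, list).
FIELDS = ("dc:creator", "cp:lastModifiedBy")


def scrub_core_xml(xml_bytes):
    """Return core.xml bytes with the fields in FIELDS emptied."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)
    root = ET.fromstring(xml_bytes)
    for field in FIELDS:
        for elem in root.findall(field, NS):
            elem.text = ""
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def scrub_ooxml(data):
    """Copy an OOXML package (as bytes), scrubbing only the core part."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            payload = src.read(item.filename)
            if item.filename == CORE_PART:
                payload = scrub_core_xml(payload)
            # Passing the ZipInfo preserves the entry name and timestamp.
            dst.writestr(item, payload)
    return out.getvalue()
```

A publishing pipeline could call scrub_ooxml on each upload and store the returned bytes instead of the original file. Formats such as PDF and JPEG keep their metadata in different structures and would need their own handling.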

Parsed Findings // Extracted Author Names

By scanning metadata on Cafcass documents, the following names, department tags, and creator signatures were extracted. This demonstrates how easily public files leak team directories and individual contributors.

Lynch, Jennifer - Cafcass
Evans, Claire - Cafcass
Cheema, Sandeep - Cafcass
Weetch, Emma - Cafcass
john, Rebecca - Cafcass
Nelmes, Linda - Cafcass
Pitcher, David - Cafcass
Hyde, Andy - Cafcass
Halliday, Emily - Cafcass
john, Rebecca - Cafcass
Baldwin, Carol - Cafcass
Marrinan, Maria - Cafcass
Grammatica, Karen - Cafcass
Marsh, Jane - Cafcass
Rodger, Holly - Cafcass
Blakebrough, Nicola - Cafcass
Egbewole-Adereti, Grace - Cafcass
Peter Bates
Natasha Graves
Alex Jones
Stuart Robinson Sussex University
Jigna Patel
Jennifer Okoro-Thompson
fgood
Ria Carrogan
Gemma Gratton
rcafagafar
Terry Phillips
sadam
Sandeep Cheema
Maria Marrinan
Daniel Kelly (he/him)
Nicola Rodgers
rcafRjohn
Daniel Kelly
Saskia Pemberton
Natalie Wyatt
Dani Spadavecchia (she/her)
Penfold, Hannah
Sarah Rothera
John McGagh
Charlotte Cooklin
dlionetti
David Pitcher
rcafRjohn
rcafmmarrinan
Chris MacDonald
Sheena Webb
Gemma Banks
Sarah Parsons
Jacob Lund
gpointstudio
Emese
fizkes
Monkey Business Images
bbernard
Anna Kadulina
Pixel-Shot
Amanda Flower
Amanda Flower
Ria Carrogan
Thomas, Liz
Billy Marsh
Dawn Hodson
maria.calver
Maria Marrinan
Shaelyn Stout
Shaelyn Stout
Andrew Lamberti
Andrew Lamberti
Kitty Clark
Andrew Hyde
Dani Spadavecchia (she/her)
Jennifer Gibbon-Lynch
Lee Dales
fgood
John McGagh (he/him)
Vickie Clare (she/her)
Vickie Clare (she/her)
HappyKids
SOL STOCK LTD
spb2015
SDI PRODUCTIONS
Fly View Productions
drazen_zigic
bymuratdeniz
Julia Dark
Alex Muntoni
James Jackson-Ellis
Hannah Lamb
James Jackson-Ellis
Ngoc Khanh Ha
Julie Brown
Jacky Tiotto
Nicola Blakebrough
Gardner, Matthew
Orme, Liam [NOMS]
rcafssheikh