Datamule Cloud V3 Reconciliation
Sometimes things break. Reconciliation is datamule's lambda worker that detects missing filings and fixes them on a scheduled trigger.
Schedule
- rolling week inventory: Tue-Fri at 05:00 UTC
- complete inventory: Saturday at 05:00 UTC
- master submissions metadata updater: daily at 08:00 UTC
- reconciliation: Tue-Sat at 09:00 UTC
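The schedule above can be written down as a small lookup table, which makes the ordering (inventory, then metadata, then reconciliation) easy to check. This is only an illustrative sketch: the job names and the `jobs_due` helper are hypothetical, and the real triggers are presumably configured on the scheduler rather than in code.

```python
from datetime import datetime, timezone

# Python weekday numbers: Monday=0 ... Sunday=6.
TUE, WED, THU, FRI, SAT = 1, 2, 3, 4, 5
EVERY_DAY = set(range(7))

# (job name, weekdays it runs on, hour in UTC), copied from the schedule above.
# The job names are made up for this example.
SCHEDULE = [
    ("rolling_week_inventory", {TUE, WED, THU, FRI}, 5),
    ("complete_inventory", {SAT}, 5),
    ("metadata_updater", EVERY_DAY, 8),
    ("reconciliation", {TUE, WED, THU, FRI, SAT}, 9),
]


def jobs_due(now: datetime) -> list[str]:
    """Return the jobs whose trigger matches the given time (hour granularity, UTC)."""
    now = now.astimezone(timezone.utc)
    return [
        name
        for name, weekdays, hour in SCHEDULE
        if now.weekday() in weekdays and now.hour == hour
    ]
```

For example, `jobs_due(datetime(2024, 1, 6, 5, tzinfo=timezone.utc))` (a Saturday) selects only the complete inventory.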
Inventories
- Once per week, a full inventory runs over everything in the SEC SGML archive. This costs ~$0.10 in R2 Class B operations and takes 4 minutes on a 1 vCPU lambda configuration.
- Once per day (excluding Sunday), an inventory of the last week runs.
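As a rough illustration of what "the last week" could mean for the rolling inventory, the sketch below generates the YYYY-MM-DD prefixes to list. The function name `rolling_week_prefixes` is hypothetical, and the exact window (seven days, including the current day) is an assumption rather than something stated here.

```python
from datetime import date, timedelta


def rolling_week_prefixes(today: date, days: int = 7) -> list[str]:
    """Return YYYY-MM-DD prefixes for the `days` most recent days, oldest first.

    Assumption: the window includes `today` and spans exactly `days` days.
    """
    return [
        (today - timedelta(days=offset)).isoformat()
        for offset in range(days - 1, -1, -1)
    ]
```

Restricting the listing to these prefixes is what keeps the daily run much smaller than the weekly full inventory.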
How it works:
1. Lists all YYYY-MM-DD prefixes in the bucket.
2. Processes about 100 prefixes at a time. For each prefix, runs sequential list operations.
3. Uploads the inventory as a .txt.zst file, one object key per line:
1993-02-11/999999999704035713.sgml.zst
1993-02-12/999999999705015654.sgml.zst
1993-02-24/999999999705037760.sgml.zst
1993-02-26/999999999705050433.sgml.zst
1993-04-28/999999999705050434.sgml.zst
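The steps above can be sketched roughly as follows. This is a simplified illustration, not the worker's actual code: `ListPage`, `list_prefix`, `build_inventory`, and `make_fake_lister` are hypothetical names, the in-memory lister stands in for the bucket's paginated list API, and the zstd compression and upload step is left out.

```python
from typing import Callable, Iterator, Optional

# One page of results: (keys, continuation token, or None when the prefix is exhausted).
Page = tuple[list[str], Optional[str]]
# Hypothetical listing callable: (prefix, continuation token) -> Page.
ListPage = Callable[[str, Optional[str]], Page]


def list_prefix(list_page: ListPage, prefix: str) -> Iterator[str]:
    """Run sequential list operations for one prefix, following continuation tokens."""
    token: Optional[str] = None
    while True:
        keys, token = list_page(prefix, token)
        yield from keys
        if token is None:
            return


def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Split the prefixes into groups of at most `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_inventory(list_page: ListPage, prefixes: list[str], batch_size: int = 100) -> str:
    """Return the inventory text: one object key per line, ordered by prefix."""
    lines: list[str] = []
    for batch in batches(sorted(prefixes), batch_size):
        # The wording above suggests a batch is handled concurrently in the real
        # worker; this sketch walks the prefixes one by one to stay simple.
        for prefix in batch:
            lines.extend(list_prefix(list_page, prefix))
    return "".join(line + "\n" for line in lines)


def make_fake_lister(objects: list[str], page_size: int = 2) -> ListPage:
    """In-memory stand-in for a bucket, used only to exercise the sketch."""
    def list_page(prefix: str, token: Optional[str]) -> Page:
        matching = sorted(k for k in objects if k.startswith(prefix + "/"))
        start = int(token) if token else 0
        next_start = start + page_size
        next_token = str(next_start) if next_start < len(matching) else None
        return matching[start:next_start], next_token
    return list_page
```

Keeping the listing behind a single callable means the batching and pagination logic can be exercised without touching a real bucket.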
> Note: I was able to achieve ~4x throughput by switching from Python to Rust.
Metadata Updater
- Once per day (excluding Sunday), downloads the SEC master submissions metadata. This is the official list of what filings are available on sec.gov.
Reconciliation
- Compares the inventory against the master submissions metadata, then sends missing filings to datamule's lambda SEC filings archive updater.
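At its core, the comparison is a set difference. The sketch below is a minimal illustration under stated assumptions: the master metadata has already been reduced to a set of accession numbers (with or without dashes), inventory keys follow the `YYYY-MM-DD/<18 digits>.sgml.zst` shape seen in the sample lines, and the names `accessions_in_inventory` and `missing_filings` are hypothetical.

```python
import re

# Inventory keys look like "1993-02-11/999999999704035713.sgml.zst":
# a date prefix, then an accession number written as 18 digits without dashes.
KEY_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})/(\d{18})\.sgml\.zst$")


def accessions_in_inventory(inventory_text: str) -> set[str]:
    """Extract the accession numbers from the inventory, skipping lines that do not match."""
    found: set[str] = set()
    for line in inventory_text.splitlines():
        match = KEY_PATTERN.match(line.strip())
        if match:
            found.add(match.group(2))
    return found


def missing_filings(inventory_text: str, master_accessions: set[str]) -> list[str]:
    """Return accession numbers present in the master metadata but absent from the inventory."""
    # Normalize "9999999997-04-035713" to "999999999704035713" before comparing.
    normalized = {accession.replace("-", "") for accession in master_accessions}
    return sorted(normalized - accessions_in_inventory(inventory_text))
```

The resulting list is what would be handed to the archive updater; how that hand-off happens is not covered by this sketch.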