Datamule Cloud V3 SEC Filings Archive
The SEC Filings Archive has been reprocessed. Previously, SGML objects under 500 KB were not compressed; now all objects are compressed. The standard compression level is now zstd level 9 instead of level 3.
Stats (5/28/26):
- SGML Archive reduced from 3.69 TB to 3.36 TB
- TAR Archive reduced from 3.69 TB to 3.52 TB
Download speed for larger filings, such as 10-Ks, should be about 15% faster on average. Requests for parts of a filing, such as the 10-K document within a 10-K filing, should also be faster, because the information needed for HTTP range requests is now stored in a database, rather than requiring a metadata call to the filing's metadata.json at the head of an object.
Technical Setup
The Monitor detects new filings and submits them to SQS. The Archive Updater processes each new filing, uploads it to R2, then updates the lookup database that lets users access the filings.
Archive Updater
Our bottlenecks are the SEC and Cloudflare R2's PUT latency. The SEC is sometimes very slow, which is unavoidable. R2 can also be slow, sometimes reaching 5s for a 20 MB PUT, which is avoidable. The CEO of Tigris Data reached out. They have free egress as well. I think Datamule may move there.
- Two Lambda configurations, 1 vCPU and 4 vCPU. Files above 10 MB are sent to the 4 vCPU configuration.
- Downloads the raw SGML file from the SEC, requesting gzip compression. By default, the SEC serves uncompressed data and has low bandwidth, so a 100 MB filing takes ~10 seconds to download uncompressed versus ~800ms compressed.
- Streams the SGML download in, compresses it with zstandard level 9, uploads it to R2 asynchronously, processes the TAR, uploads that to R2 asynchronously, and then updates the database.
- Download is the bottleneck. Streaming compression lets processing finish at about the same time as the download.
- Written in Rust to minimize cold starts. Runs on ARM, which is 33% cheaper.
- Warm caching is used to minimize SEC and R2 overhead.
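The streaming step in the list above can be sketched as follows. This is a minimal illustration, not the Archive Updater itself, which is written in Rust and uses zstandard. To stay within the Python standard library, `zlib` stands in for the zstd level 9 compressor. The function name `recompress_stream` and the chunked input are assumptions made for the example.

```python
import gzip
import zlib
from typing import Iterable, Iterator


def recompress_stream(gzip_chunks: Iterable[bytes], level: int = 9) -> Iterator[bytes]:
    """Decode a gzip response chunk by chunk and re-compress it on the fly.

    zlib is only a stand-in here for a zstd level 9 compressor.
    """
    inflater = zlib.decompressobj(31)   # 31 = expect gzip framing
    deflater = zlib.compressobj(level)  # stand-in for zstd level 9
    for chunk in gzip_chunks:
        raw = inflater.decompress(chunk)
        if raw:
            out = deflater.compress(raw)
            if out:
                yield out
    # Drain whatever the decompressor still holds, then finish the output stream.
    tail = inflater.flush()
    if tail:
        out = deflater.compress(tail)
        if out:
            yield out
    yield deflater.flush()


# Simulate a gzip-encoded download arriving in 1 KB network chunks.
sample = b"<SEC-DOCUMENT>example\n" * 5000
wire = gzip.compress(sample)
chunks = [wire[i:i + 1024] for i in range(0, len(wire), 1024)]
stored = b"".join(recompress_stream(chunks))
```

Because each chunk is compressed as soon as it arrives, the work overlaps with the download instead of starting after it, which is the point of the streaming design.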
SEC Filings Lookup
We currently use a t4g.medium. It's about $45/month. I think we may change. This has not been optimized.
sec_submission_details_table
- accessionNumber BIGINT UNSIGNED NOT NULL PRIMARY KEY
- submissionType VARCHAR(32) NOT NULL
- filingDate DATE NOT NULL
- reportDate DATE NULL
- detectedTime DATETIME NULL
- containsXBRL BOOLEAN NOT NULL DEFAULT FALSE
Indexes:
- PRIMARY KEY (accessionNumber)
- idx_submission_type_filing_accession (submissionType, filingDate, accessionNumber)
- idx_filing_accession (filingDate, accessionNumber)
- idx_report_accession (reportDate, accessionNumber)
- idx_detected_accession (detectedTime, accessionNumber)
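The schema stores accessionNumber as BIGINT UNSIGNED, while SEC accession numbers are usually written with dashes (10-digit filer ID, 2-digit year, 6-digit sequence). A minimal sketch of the conversion, assuming the stored integer is simply the 18 digits with dashes removed. The source does not confirm this encoding, and the helper names and the sample accession number are hypothetical.

```python
def accession_to_int(accession: str) -> int:
    """Convert a dashed accession number to an integer key (assumed encoding)."""
    digits = accession.replace("-", "")
    if len(digits) != 18 or not digits.isdigit():
        raise ValueError(f"not an 18-digit accession number: {accession!r}")
    return int(digits)


def int_to_accession(value: int) -> str:
    """Restore the dashed form, re-adding leading zeros lost in the integer."""
    digits = f"{value:018d}"
    return f"{digits[:10]}-{digits[10:12]}-{digits[12:]}"


# Made-up accession number, used only to show the round trip.
key = accession_to_int("0000320193-24-000123")
dashed = int_to_accession(key)
```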
sec_accession_cik_table
- accessionNumber BIGINT UNSIGNED NOT NULL
- cik BIGINT UNSIGNED NOT NULL
Indexes:
- PRIMARY KEY (accessionNumber, cik)
- idx_cik_accession (cik, accessionNumber)
sec_documents_table
- accessionNumber BIGINT UNSIGNED NOT NULL
- sequence SMALLINT UNSIGNED NOT NULL
- documentType VARCHAR(128) NOT NULL
- filename VARCHAR(500) NOT NULL
- description VARCHAR(1000) NOT NULL
- tarStartByte BIGINT UNSIGNED NOT NULL
- tarEndByte BIGINT UNSIGNED NOT NULL
Indexes:
- PRIMARY KEY (accessionNumber, sequence)
- idx_documents_accession_type_sequence (accessionNumber, documentType, sequence)
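As a rough illustration of how these tables can serve a partial download, the sketch below looks up the byte range of each document of a given type for a CIK and turns it into an HTTP Range header. It uses an in-memory SQLite database as a stand-in for the real lookup database, with a reduced set of columns and made-up rows. It also assumes tarEndByte is inclusive, which the source does not state. The function names are hypothetical.

```python
import sqlite3


def range_header(start: int, end: int) -> str:
    # Assumes tarEndByte is inclusive, matching HTTP Range semantics.
    return f"bytes={start}-{end}"


def find_document_ranges(conn: sqlite3.Connection, cik: int, document_type: str):
    """Return (accessionNumber, filename, Range header) for matching documents."""
    rows = conn.execute(
        """
        SELECT d.accessionNumber, d.filename, d.tarStartByte, d.tarEndByte
        FROM sec_accession_cik_table AS c
        JOIN sec_documents_table AS d ON d.accessionNumber = c.accessionNumber
        WHERE c.cik = ? AND d.documentType = ?
        ORDER BY d.accessionNumber, d.sequence
        """,
        (cik, document_type),
    ).fetchall()
    return [(acc, name, range_header(start, end)) for acc, name, start, end in rows]


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny SQLite copy of two of the tables with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sec_accession_cik_table (
            accessionNumber INTEGER NOT NULL,
            cik INTEGER NOT NULL,
            PRIMARY KEY (accessionNumber, cik)
        );
        CREATE TABLE sec_documents_table (
            accessionNumber INTEGER NOT NULL,
            sequence INTEGER NOT NULL,
            documentType TEXT NOT NULL,
            filename TEXT NOT NULL,
            tarStartByte INTEGER NOT NULL,
            tarEndByte INTEGER NOT NULL,
            PRIMARY KEY (accessionNumber, sequence)
        );
        """
    )
    conn.execute("INSERT INTO sec_accession_cik_table VALUES (32019324000123, 320193)")
    conn.executemany(
        "INSERT INTO sec_documents_table VALUES (?, ?, ?, ?, ?, ?)",
        [
            (32019324000123, 1, "10-K", "aapl-10k.htm", 1024, 5119),
            (32019324000123, 2, "EX-31.1", "ex311.htm", 5120, 6143),
        ],
    )
    return conn


ranges = find_document_ranges(make_demo_db(), cik=320193, document_type="10-K")
```

Each returned header can be sent with a GET for the filing's TAR object, so only the requested document's bytes are transferred.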
Datamule's Cost
Archive Updater
Extrapolated from 2025 filing volume, for SGML only.
| Size bucket | n | Config | Lambda memory | Benchmark used | Avg total (warm) | Total seconds | Cost |
|---|---|---|---|---|---|---|---|
| < 0.1 MB | 730,670 | 1 vCPU | 1769 MB | 11KB warm (lvl9/1t) | 150ms | 109,600s | $2.67 |
| 0.1–1 MB | 312,477 | 1 vCPU | 1769 MB | interpolated | 215ms | 67,183s | $1.61 |
| 1–10 MB | 149,141 | 1 vCPU | 1769 MB | 11MB warm (lvl9/1t) | 370ms | 55,182s | $1.30 |
| 10–50 MB | 31,100 | 4 vCPU | 7076 MB | 107MB warm (lvl9/4t) | 850ms | 26,435s | $2.44 |
| 50–100 MB | 4,627 | 4 vCPU | 7076 MB | 107MB warm (lvl9/4t) | 850ms | 3,933s | $0.36 |
| 100–250 MB | 5,435 | 4 vCPU | 7076 MB | 107MB warm (lvl9/4t) | 850ms | 4,620s | $0.43 |
| 250–500 MB | 824 | 4 vCPU | 7076 MB | 528MB warm (lvl9/4t) | 2,280ms | 1,879s | $0.17 |
| 500+ MB | 0 | 4 vCPU | 7076 MB | 528MB warm (lvl9/4t) | 2,280ms | 0s | $0.00 |
| Total | 1,234,274 | | | | | 268,832s | $8.99 |
TAR adds a little overhead. Expect ~$12/year.
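The table's arithmetic can be reproduced with a short calculation. The pricing constants below are assumptions, not figures from this post: $0.0000133334 per GB-second for ARM Lambda and $0.20 per million requests. Under those assumptions the per-bucket costs match the table to within rounding. Counts, memory sizes, and average durations are copied from the table.

```python
# Assumed Lambda pricing (not stated in the post): ARM rate and request fee.
ARM_PRICE_PER_GB_SECOND = 0.0000133334
PRICE_PER_REQUEST = 0.20 / 1_000_000

# (size bucket, filings, Lambda memory in MB, average warm duration in seconds)
BUCKETS = [
    ("< 0.1 MB", 730_670, 1769, 0.150),
    ("0.1-1 MB", 312_477, 1769, 0.215),
    ("1-10 MB", 149_141, 1769, 0.370),
    ("10-50 MB", 31_100, 7076, 0.850),
    ("50-100 MB", 4_627, 7076, 0.850),
    ("100-250 MB", 5_435, 7076, 0.850),
    ("250-500 MB", 824, 7076, 2.280),
    ("500+ MB", 0, 7076, 2.280),
]


def bucket_cost(n: int, memory_mb: int, avg_seconds: float) -> float:
    """Compute cost for one bucket: GB-seconds of compute plus the per-request fee."""
    total_seconds = n * avg_seconds
    gb_seconds = total_seconds * memory_mb / 1024
    return gb_seconds * ARM_PRICE_PER_GB_SECOND + n * PRICE_PER_REQUEST


def yearly_cost(buckets) -> float:
    """Sum the cost over all size buckets."""
    return sum(bucket_cost(n, mem, avg) for _, n, mem, avg in buckets)


total_filings = sum(n for _, n, _, _ in BUCKETS)
total_cost = yearly_cost(BUCKETS)
```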
Usage
TAR Archive (default)
Downloads filings. If specific files within a filing are specified, only downloads those.
```python
from datamulehub import sec_filings_archive

# download tar document ranges to outfolder/filingDate/accession/filename
sec_filings_archive.download_tar(cik=320193, document_type=["10-K", "10-Q"], output_dir="outfolder")
```
SGML Archive
Downloads filings as raw SGML.
```python
from datamulehub import sec_filings_archive

# download decompressed sgml filings to outfolder/filingDate/accession.sgml
sec_filings_archive.download_sgml(cik=320193, submission_type="10-K", output_dir="outfolder")
```
Quirks
The archive includes filings that are no longer available from the SEC. See "The SEC publishes a lot of filings then deletes them. Here they are."