Update pnpm onlyBuiltDependencies config
#521
Development PRs
Closes: #521 Closes: #705 Closes: #714
- Note the addition of the `yaml` package (recommended by e18e: https://e18e.dev/guide/replacement-guides/js-yaml.html).
- Working with `pnpm-workspace.yaml` from [email protected] (7 months old). If the installed pnpm is older than that, fall back to the `package.json` config, as before this PR.
- Targeting the "new tests branch" to see a smaller diff. #717 would need to be merged first.
tests
Proposal to change how tests are organized (to speed up tests again ^^)
- `variant` + `kind` = `flavor`
- `AddonTestCase`
- `setupTest` prepares the test monorepo with all flavors, so only one install is needed
- Remove the `ps-tree` dep

On my machine (Windows and WSL), the add-ons part runs 30% faster!
Summary
Replaces the spoofable browser-supplied Content-Type allowlist in apps/processor with Google Magika–verified detection on every upload, locks down the serve path so it cannot be coerced into emitting a script-active type, and migrates the processor base image off Alpine so @tensorflow/tfjs-node can actually load in production. This is the processor half of a two-worktree change; a follow-up magika-web-uploads worktree will land the matching changes in apps/web once this merges. No apps/web/** files are touched here.
Why
The current multer.fileFilter in apps/processor/src/api/upload.ts is a prefix allowlist (image/, application/pdf, text/, application/vnd) keyed on whatever the browser claimed. That's trivially bypassable: a .py file arrives as text/plain and gets treated as a generic blob; a renamed Mach-O/ELF/PE binary arrives as image/png and is happily accepted; an HTML payload masquerading as text/plain rides through to the serve path and is later echoed back with whatever the DB row says.
PageSpace is pivoting toward cloud-IDE use cases where source-code uploads are first-class. We need an authoritative content-type signal that's independent of the browser. Google's Magika gives us ~5 ms CPU classification across 200+ labels, including code languages, with a vendored ~3 MB tfjs model.
What landed (commit-by-commit)
feat(processor): add magika-based content-type detector — Tasks A + B
- Pins `magika@^1.0.0` in `apps/processor/package.json`.
- Vendors the `standard_v3_3` model (3.1 MB) under `apps/processor/assets/magika/standard_v3_3/` so detection runs without network access in dev, CI, and prod.
- New `apps/processor/src/services/content-detector.ts`:
  - Singleton `MagikaNode` instance lazily initialised via a memoised `Promise<Magika | null>` so concurrent first calls share one load.
  - `detectContentType(filePath)` → `DetectedContentType { label, mimeType, group, score, source: 'magika' | 'fallback' }`.
  - 250 ms `Promise.race` timeout ceiling.
  - All error paths return a frozen `FALLBACK_DETECTION` (`application/octet-stream`) and log via `processorLogger` using `tempPath` only, never `originalName` (preserves the PII-scrubbing posture from `dbc52496`).
- New `apps/processor/src/services/__tests__/content-detector.test.ts` (4 tests, all green): real-fixture classification (png/python/pdf), missing-file → fallback, sequential and concurrent-cold-start singleton reuse via `vi.spyOn(MagikaNode, 'create')`.
- Whitelists `@tensorflow/tfjs-node` in the root `package.json#pnpm.onlyBuiltDependencies` so the native tfjs binding builds on fresh `pnpm install` (pnpm 10 blocks build scripts by default). This is one of the two files touched outside `apps/processor/**` (the other is the lockfile). Without it, fresh installs leave Magika unable to load and the feature silently degrades.
- Adds a `magika/node` paths mapping in `apps/processor/tsconfig.json` so the legacy `moduleResolution: "node"` setting can find Magika's subpath types.
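The detector's three moving parts (memoised load, `Promise.race` ceiling, frozen fallback) can be sketched roughly as follows. This is a minimal illustration, not the code in `content-detector.ts`: `Classifier`, `loadClassifier`, `loadCount`, and the `"unknown"` fallback label are hypothetical stand-ins, and the real Magika API is deliberately not reproduced.

```typescript
// Hypothetical stand-in for the real classifier (assumption, not the Magika API).
interface Classifier {
  classify(filePath: string): Promise<{ label: string; mimeType: string }>;
}

interface DetectedContentType {
  label: string;
  mimeType: string;
  source: "magika" | "fallback";
}

// The "unknown" label is an assumption; the PR only states the MIME type.
const fallback: DetectedContentType = {
  label: "unknown",
  mimeType: "application/octet-stream",
  source: "fallback",
};
const FALLBACK_DETECTION = Object.freeze(fallback);

let loadCount = 0; // instrumentation for this sketch only
let classifierPromise: Promise<Classifier | null> | null = null;

// Hypothetical loader; a real one would load the vendored model from disk.
async function loadClassifier(): Promise<Classifier> {
  loadCount += 1;
  return {
    async classify(filePath) {
      return filePath.endsWith(".py")
        ? { label: "python", mimeType: "text/x-python" }
        : { label: "png", mimeType: "image/png" };
    },
  };
}

// Memoise the promise itself so concurrent first calls share one load.
function getClassifier(): Promise<Classifier | null> {
  if (classifierPromise === null) {
    classifierPromise = loadClassifier().catch(() => null);
  }
  return classifierPromise;
}

// Resolve with `onTimeout` if `work` has not settled within `ms`.
async function withTimeout<T>(work: Promise<T>, ms: number, onTimeout: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(onTimeout), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid leaving a pending timer behind
  }
}

async function detectContentType(filePath: string): Promise<DetectedContentType> {
  const run = async (): Promise<DetectedContentType> => {
    const classifier = await getClassifier();
    if (classifier === null) return FALLBACK_DETECTION;
    const { label, mimeType } = await classifier.classify(filePath);
    return { label, mimeType, source: "magika" };
  };
  // Every error path collapses to the frozen fallback.
  return withTimeout(run().catch(() => FALLBACK_DETECTION), 250, FALLBACK_DETECTION);
}
```

Caching the promise rather than the resolved instance is what lets a burst of cold-start uploads trigger a single load.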
feat(processor): enforce magika-verified types on upload — Task C
- Drops the `allowedTypes` prefix check from `multer.fileFilter`; keeps the empty-filename guard.
- Both `/single` and `/multiple` handlers now call `detectContentType(tempPath)` after the temp file is written and validated against `TEMP_UPLOADS_DIR`. If the verified `label` is in `DENIED_LABELS = { pebin, elf, macho, dex, html, svg, xhtml }`:
  - The temp file is `fs.unlink`ed immediately (same warn-and-continue cleanup pattern used elsewhere).
  - `/single` returns `HTTP 415 { error: 'Unsupported file type', detectedLabel }`.
  - `/multiple` records the rejection in the per-file `results[]` and removes the path from `tempFilePaths` so the batch cleanup loop doesn't double-unlink.
- Verified `mimeType` and the new `detectedLabel` field are threaded through `queueProcessingJobs` into `IngestFileJobData` (extended in `apps/processor/src/types/index.ts`). `queue-manager.ts` dispatch is unchanged: it still keys on `mimeType.startsWith('image/')` / `needsTextExtraction(mimeType)`, but now receives a trustworthy value.
- `upload.test.ts`: a hoisted `mockDetectContentType` defaults to `image/jpeg` so all 28 pre-existing tests pass unchanged. 4 new tests cover 415 rejection + temp cleanup for `macho`, verified `text/x-python` flowing into the ingest job, fallback `application/octet-stream` accepted, and a mixed-batch `/multiple` where one file is rejected and a sibling is accepted.
feat(processor): harden serve-path content-type handling — Task D
- Targets only the `/original` route in `apps/processor/src/api/serve.ts`. The `/preset` branch is intentionally left alone: those are processor-generated variants whose Content-Type derives from a known suffix.
- `Content-Type` is sourced strictly from `fileRecord.mimeType` (with `pageRecord.mimeType` as a per-page fallback). Whitespace-only stored values are normalised away. There is no caller-controlled fallback path.
- A new `forceAttachment = isDangerousMimeType(contentType) || !hasStoredMime` flag forces `Content-Disposition: attachment` plus the strict `default-src 'none'; style-src 'unsafe-inline'; img-src data:; sandbox;` CSP whenever the stored mime is missing OR matches `DANGEROUS_MIME_TYPES` (reused from `utils/security.ts`: `text/html`, `application/xhtml+xml`, `image/svg+xml`, `application/xml`, `text/xml`).
- `X-Content-Type-Options: nosniff` is set on every response (was already set; preserved).
- 5 new tests in `serve.test.ts`: safe-mime inline, script-active force-download, missing-mime → `octet-stream` + attachment, whitespace-only mime treated as missing, query-param override ignored.
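A rough sketch of how the serve-path headers might be derived from the stored mime alone. `buildServeHeaders` and `normaliseMime` are hypothetical names, and the body of `isDangerousMimeType` here is an assumption (the real helper lives in `utils/security.ts` and may differ, for example in how it treats parameters such as `; charset=utf-8`).

```typescript
const DANGEROUS_MIME_TYPES: ReadonlySet<string> = new Set([
  "text/html",
  "application/xhtml+xml",
  "image/svg+xml",
  "application/xml",
  "text/xml",
]);

const STRICT_CSP = "default-src 'none'; style-src 'unsafe-inline'; img-src data:; sandbox;";

// Assumed behaviour: compare the base type, ignoring parameters and case.
function isDangerousMimeType(mime: string): boolean {
  const base = (mime.split(";")[0] ?? "").trim().toLowerCase();
  return DANGEROUS_MIME_TYPES.has(base);
}

// Treat null, undefined, and whitespace-only values as "no stored mime".
function normaliseMime(value: string | null | undefined): string | undefined {
  const trimmed = (value ?? "").trim();
  return trimmed.length > 0 ? trimmed : undefined;
}

// Only database values are consulted; nothing from the request is accepted.
function buildServeHeaders(
  fileMime: string | null | undefined,
  pageMime?: string | null,
): Record<string, string> {
  const stored = normaliseMime(fileMime) ?? normaliseMime(pageMime);
  const hasStoredMime = stored !== undefined;
  const contentType = stored ?? "application/octet-stream";
  const forceAttachment = isDangerousMimeType(contentType) || !hasStoredMime;

  const headers: Record<string, string> = {
    "Content-Type": contentType,
    "X-Content-Type-Options": "nosniff",
    "Content-Disposition": forceAttachment ? "attachment" : "inline",
  };
  if (forceAttachment) {
    headers["Content-Security-Policy"] = STRICT_CSP;
  }
  return headers;
}
```

Because the function takes no request input at all, a query-parameter override has no way to reach the response headers.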
build(processor): switch base image to node:22 bookworm-slim
- `@tensorflow/tfjs-node` has no Alpine prebuilds and Alpine isn't an officially supported tfjs platform: the binding requires glibc's `ld-linux-x86-64.so.2` (tfjs#1425, tfjs#6556). As long as processor uploads run through Magika, glibc is a hard requirement, so the prod image had to migrate off Alpine.
- Migrated to `node:22.17.0-bookworm-slim` in both the build and production stages.
- Sharp 0.33+ ships self-contained glibc prebuilds that bundle libvips, so the entire `cairo-dev jpeg-dev pango-dev giflib-dev pixman-dev pangomm-dev libjpeg-turbo-dev freetype-dev` chain (plus the matching runtime libs in the prod stage) was deleted. Tesseract.js is pure JS + WASM with zero system deps.
- Build stage now installs only `ca-certificates curl python3 make g++ pkg-config` (the last three are kept so node-pre-gyp can compile from source if a future native dep ships without a glibc prebuild). Production stage installs only `ca-certificates`.
- Verified end-to-end with `docker build --platform linux/amd64` (matches the `ubuntu-latest` CI runner used by `.github/workflows/docker-images.yml`) followed by an in-container smoke test:
  - `require('sharp')` → v0.33.5 loads
  - `require('tesseract.js').recognize` is a function
  - `MagikaNode.create()` loads the vendored model from `/app/apps/processor/assets/magika/standard_v3_3/`
  - libtensorflow announces `AVX2 FMA` at startup (proves the native `tfjs_binding.node` actually opened the underlying `.so`)
  - Magika correctly classifies a Python source as `python` and an HTML payload as `html` (which is in `DENIED_LABELS`, so the upload route would now reject it with HTTP 415)
- Final amd64 image: ~587 MB, smaller than the previous Alpine prod image once you account for the deleted dev-libs chain.
Files touched
| File | Notes |
|---|---|
| `apps/processor/Dockerfile` | Alpine → bookworm-slim, deleted sharp dev-libs |
| `apps/processor/package.json` | + `magika@^1.0.0` |
| `apps/processor/src/services/content-detector.ts` | new |
| `apps/processor/src/services/__tests__/content-detector.test.ts` | new |
| `apps/processor/assets/magika/standard_v3_3/{model.json,config.min.json,group1-shard1of1.bin}` | new (3.1 MB vendored model) |
| `apps/processor/src/api/upload.ts` | denylist + detector wiring + extended `queueProcessingJobs` |
| `apps/processor/src/types/index.ts` | + `detectedLabel?: string` on `IngestFileJobData` |
| `apps/processor/src/api/serve.ts` | strict mime sourcing + `forceAttachment` |
| `apps/processor/src/api/__tests__/upload.test.ts` | mock detector + 4 new tests |
| `apps/processor/src/api/__tests__/serve.test.ts` | 5 new lockdown tests |
| `apps/processor/tsconfig.json` | `magika/node` paths mapping |
| `package.json` (root) | `pnpm.onlyBuiltDependencies: ["@tensorflow/tfjs-node"]` |
| `pnpm-lock.yaml` | updated by `pnpm install` |
Acceptance checklist
- `pnpm --filter processor test`: 801 / 801 passing (all 36 test files)
- `pnpm --filter processor build`: clean
- `pnpm typecheck` (turbo, repo-wide): clean across all 12 workspaces
- `pnpm lint` (turbo, repo-wide): clean
- `docker build --platform linux/amd64 -f apps/processor/Dockerfile .`: succeeds
- In-container smoke test: sharp + tesseract.js + tfjs-node native binding + Magika classification all working
- No `any` types introduced
- No `originalName` ever passed into `processorLogger` from the new detector code
- No edits to `apps/web/**`, `packages/**`, `apps/realtime/**`, etc.
- Covered by tests: detector singleton reuse, fallback path, denylist rejection, /single + /multiple temp cleanup on rejection, serve-path Content-Type sourcing, serve-path script-active force-download
Known follow-up (out of scope here)
- The `magika-web-uploads` worktree will wire the verified `mimeType` and `detectedLabel` from `IngestFileJobData` into `pages.mimeType` / `files.mimeType` on the `apps/web` upload path, and use Magika's code-language labels to route `.py`, `.rs`, `.ts`, etc. to `PageType.CODE`.
🤖 Generated with Claude Code
Issue
onlyBuiltDependencies in pnpm-workspace.yaml is supported from [email protected]:
> The `pnpm.*` settings from `package.json` can now be specified in the `pnpm-workspace.yaml` file instead.
onlyBuiltDependencies:
- esbuild
- fuse-native
pnpm approve-builds adds onlyBuiltDependencies to pnpm-workspace.yaml as of 10.7.1.
The current implementation only writes to the `pnpm.onlyBuiltDependencies` field in `package.json`.
Backwards compatibility is supported and was probably fixed (?) in 10.6.1:
> When executing the `approve-builds` command, if package.json contains `onlyBuiltDependencies` or `ignoredBuiltDependencies`, the selected dependency package will continue to be written into `package.json`.
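One way to read the rules above is as a small decision function: write to `pnpm-workspace.yaml` when the installed pnpm is new enough, and otherwise (or when `package.json` already carries the build settings) keep writing to `package.json`. This is only a sketch: `chooseConfigTarget`, `isAtLeast`, and the option names are hypothetical, and the minimum version is passed in by the caller rather than assumed here.

```typescript
type ConfigTarget = "pnpm-workspace.yaml" | "package.json";

// Parse "10.7.1" into [10, 7, 1]; non-numeric parts count as 0.
function parseVersion(version: string): number[] {
  return version.split(".").map((part) => Number.parseInt(part, 10) || 0);
}

// True when `version` >= `minimum`, comparing major, minor, and patch in order.
function isAtLeast(version: string, minimum: string): boolean {
  const a = parseVersion(version);
  const b = parseVersion(minimum);
  for (let i = 0; i < 3; i += 1) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    if (x !== y) return x > y;
  }
  return true;
}

interface TargetOptions {
  pnpmVersion: string;          // version of the pnpm in use
  minWorkspaceVersion: string;  // first version trusted to read the workspace file
  packageJsonHasBuildSettings: boolean; // onlyBuiltDependencies or ignoredBuiltDependencies present
}

function chooseConfigTarget(options: TargetOptions): ConfigTarget {
  // Mirror the quoted pnpm behaviour: existing package.json settings keep winning.
  if (options.packageJsonHasBuildSettings) return "package.json";
  // Older pnpm cannot read the workspace file, so fall back.
  if (!isAtLeast(options.pnpmVersion, options.minWorkspaceVersion)) return "package.json";
  return "pnpm-workspace.yaml";
}
```

For example, with `minWorkspaceVersion: "10.7.1"` (the `approve-builds` version mentioned above), a project on pnpm 9 would still get the `package.json` field.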
Info
I wonder what should happen if there is no pnpm-workspace.yaml because you are creating a new project. Should it just be created to only add those three lines?
Basically, I think this is a good idea because it puts pnpm-related settings in a pnpm file.
The pnpm-workspace.yaml seems to be created.
corepack use pnpm@latest
# Installing [email protected] in the project...
pnpm i @sveltejs/kit -D
# Ignored build scripts: esbuild.
# Run "pnpm approve-builds" to pick which dependencies should be allowed to run scripts.
pnpm approve-builds
# √ The next packages will now be built: esbuild.
# Do you approve? (y/N) · true -- workspace file is generated at this point.
# pnpm-workspace.yaml
onlyBuiltDependencies:
- esbuild