iText RUPS Tips & Tricks: Inspecting Cross-References, Streams, and Fonts
iText RUPS (Reading and Updating PDF Syntax) is a lightweight GUI tool from the iText project that exposes the internal structure of PDF files. It’s indispensable when you need to debug PDFs, inspect corruption, or understand how content and resources are represented. This article gives concise, actionable tips for inspecting three critical PDF areas in RUPS: cross-reference tables (and xref streams), object streams and compressed objects, and embedded fonts.
Getting started quickly
- Open RUPS and load a PDF (File → Open).
- Use the left-hand tree to expand the document structure: trailer, cross-reference, objects, catalog, pages, etc.
- Use the right-hand panes to view raw PDF syntax or a rendered interpretation where available.
Inspecting cross-references (xref tables and xref streams)
Cross-references map object numbers to byte offsets (or entries in xref streams). Problems here cause broken object resolution or “invalid cross-reference table” errors.
Tips
- Locate xref sections: In the left tree, expand “Cross Reference” to see traditional xref sections or xref streams. RUPS shows both the old-style table and newer compressed xref streams.
- Compare offsets: Click an xref entry to highlight the corresponding object in the object list. If the object content shown doesn’t match expected syntax, the offset may be wrong.
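- Check trailer and /Prev: For incrementally updated or linearized PDFs, follow the trailer chain via the /Prev entries. RUPS displays trailer dictionaries; open them to confirm that each /Prev points at an earlier xref section, that /Root resolves to the catalog, and that /Size is one greater than the highest object number in the file.
- Xref stream decoding: RUPS decodes xref streams automatically. Inspect the decoded stream to verify the object number ranges (/Index) and the three fields of each entry, whose byte widths come from /W: the entry type, then the byte offset (or the number of the containing object stream), then the generation number (or the index within that object stream).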
- Rebuild hints: If offsets are corrupt, export the document from a PDF library (e.g., iText) to force rebuilding of xref tables. RUPS is useful to verify the rebuilt PDF.
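The offset comparison described above can also be done outside RUPS. The following is a minimal Python sketch, assuming a classic (uncompressed) xref table and using only the standard library. The function names and the tiny sample file are hypothetical and exist only for illustration; this is not how RUPS or iText implement the check.

```python
import re

def parse_xref_section(data: bytes, xref_offset: int) -> dict:
    """Parse one classic xref section; return {obj_num: (offset, gen, in_use)}."""
    if data[xref_offset:xref_offset + 4] != b"xref":
        raise ValueError("no 'xref' keyword at the given offset")
    end = data.index(b"trailer", xref_offset)
    tokens = data[xref_offset + 4:end].split()
    entries, i = {}, 0
    while i < len(tokens):
        first, count = int(tokens[i]), int(tokens[i + 1])  # subsection header
        i += 2
        for n in range(count):
            offset, gen, kind = int(tokens[i]), int(tokens[i + 1]), tokens[i + 2]
            entries[first + n] = (offset, gen, kind == b"n")
            i += 3
    return entries

def bad_offsets(data: bytes, entries: dict) -> list:
    """Return in-use object numbers whose offset does not land on 'N G obj'."""
    bad = []
    for num, (offset, gen, in_use) in sorted(entries.items()):
        header = re.compile(rb"%d\s+%d\s+obj" % (num, gen))
        if in_use and not header.match(data, offset):
            bad.append(num)
    return bad

def build_sample() -> tuple:
    """Build a tiny two-object file so the sketch runs without a real PDF."""
    data = b"%PDF-1.4\n"
    offsets = []
    for num, body in ((1, b"<< /Type /Catalog /Pages 2 0 R >>"),
                      (2, b"<< /Type /Pages /Kids [] /Count 0 >>")):
        offsets.append(len(data))
        data += b"%d 0 obj\n%b\nendobj\n" % (num, body)
    xref_offset = len(data)
    data += b"xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        data += b"%010d 00000 n \n" % off
    data += b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % xref_offset
    return data, xref_offset

data, xref_offset = build_sample()
entries = parse_xref_section(data, xref_offset)
print(bad_offsets(data, entries))  # object numbers with suspicious offsets
```

For a real file, read the bytes with open(path, "rb").read() and take xref_offset from the number that follows the last startxref keyword.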
Common issues and how to spot them
- Missing object at listed offset → the bytes at that offset do not begin with the expected “N G obj” header.
- /Size mismatch → /Size should be one greater than the highest object number; compare it against the objects actually listed.
- Multiple trailers without a consistent /Prev chain → broken incremental update history.
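When the cross-reference data is an xref stream, the decoded bytes are fixed-width rows rather than text. Below is a rough Python sketch of how those rows can be split, assuming you already have the decoded stream bytes and the /W, /Index, and /Size values from the stream dictionary. The function name is hypothetical.

```python
def decode_xref_stream(decoded: bytes, w: list, index: list = None, size: int = None) -> list:
    """Split decoded xref-stream bytes into (obj_num, type, field2, field3) rows.

    w     : the three field widths from /W
    index : flat [first, count, first, count, ...] list from /Index;
            when absent it defaults to [0, /Size]
    """
    if index is None:
        index = [0, size]
    row_len = sum(w)
    rows, pos = [], 0
    for first, count in zip(index[0::2], index[1::2]):
        for obj_num in range(first, first + count):
            row = decoded[pos:pos + row_len]
            if len(row) < row_len:
                raise ValueError("stream is shorter than /W and /Index imply")
            fields, start = [], 0
            for width in w:
                # Fields are big-endian integers; a zero-width field reads as 0.
                fields.append(int.from_bytes(row[start:start + width], "big"))
                start += width
            if w[0] == 0:
                fields[0] = 1  # a missing type field defaults to type 1 (in use)
            rows.append((obj_num, *fields))
            pos += row_len
    return rows

# Three rows with /W [1 2 1]: a free entry, an object at byte offset 17,
# and an object stored at index 0 of object stream 5.
sample = bytes([0, 0, 0, 255,  1, 0, 17, 0,  2, 0, 5, 0])
for row in decode_xref_stream(sample, w=[1, 2, 1], size=3):
    print(row)
```

Type 0 rows describe free objects, type 1 rows give a byte offset and generation number, and type 2 rows give the containing object stream number and the index within it.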
Inspecting streams and object streams
PDF content (page content, images, metadata) is stored in streams; object streams (PDF 1.5+) compress many small objects into a single stream.
Tips
- View raw vs. decoded: RUPS provides both raw stream bytes and the decoded content (after filters are applied). Use the decoded view to read textual content or see embedded chunks.
- Identify filters: Look at the stream dictionary (/Filter, /DecodeParms). Common filters: FlateDecode, LZWDecode, DCTDecode, JPXDecode. RUPS shows these so you can choose correct processing.
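- Check stream lengths: Compare the /Length value (which may be an indirect reference to an integer object) with the actual number of bytes between the stream and endstream keywords. Mismatches can indicate corruption or missing bytes.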
- Inspect object streams: Expand “Object Streams” to see which objects are packed. RUPS lists the object numbers inside each object stream and shows the decoded inner objects.
- Search within decoded streams: Use RUPS search to locate strings (e.g., font names, image hints) inside decoded streams.
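- Extracting images/text: Right-click stream contents to save the bytes externally for further analysis (e.g., open a JPEG or run OCR). For DCTDecode and JPXDecode images the stream bytes are already a JPEG or JPEG 2000 file; Flate-decoded image data is raw samples that need /Width, /Height, /ColorSpace, and /BitsPerComponent to interpret.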
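The /Length and filter checks above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library, assuming a single FlateDecode filter and a direct (not indirect) /Length value. The helper names and the sample object are hypothetical.

```python
import re
import zlib

def check_stream(obj_bytes: bytes) -> tuple:
    """Return (declared_length, raw_bytes, length_ok) for one stream object.

    Assumes /Length is a direct integer, not an indirect reference.
    """
    declared = int(re.search(rb"/Length\s+(\d+)", obj_bytes).group(1))
    start = re.search(rb"stream\r?\n", obj_bytes).end()
    raw = obj_bytes[start:start + declared]
    # After exactly /Length bytes, only an optional EOL may precede 'endstream'.
    tail = obj_bytes[start + declared:]
    length_ok = re.match(rb"\r?\n?endstream", tail) is not None
    return declared, raw, length_ok

def make_stream_object(content: bytes, length_error: int = 0) -> bytes:
    """Build a Flate-compressed stream object; length_error skews /Length."""
    packed = zlib.compress(content)
    header = b"5 0 obj\n<< /Length %d /Filter /FlateDecode >>\nstream\n" % (
        len(packed) + length_error)
    return header + packed + b"\nendstream\nendobj\n"

content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
declared, raw, ok = check_stream(make_stream_object(content))
if ok:
    # The decoded view in RUPS corresponds to this decompressed data.
    print(zlib.decompress(raw).decode("latin-1"))
else:
    print("declared /Length does not line up with endstream")
```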
Common problems
- Unrecognized filter → RUPS may not decode; inspect /Filter value to choose external tool or library.
- Truncated streams → decoding fails; check file size and offsets.
- Incorrect /Length → causes parsing issues; correcting it often fixes rendering.
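For object streams, the decoded data starts with /N pairs of integers (object number, then byte offset relative to /First), followed by the objects themselves. The sketch below splits such a decoded stream in Python. It assumes the offsets are in ascending order and that the stream has already been decompressed; the function names and sample data are made up for illustration.

```python
def split_object_stream(decoded: bytes, n: int, first: int) -> dict:
    """Map object number -> serialized object from a decoded object stream.

    n and first are the stream dictionary's /N and /First values.
    """
    header = decoded[:first].split()
    if len(header) < 2 * n:
        raise ValueError("header is shorter than /N implies")
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    objects = {}
    for i, (num, offset) in enumerate(pairs):
        start = first + offset
        # Each object runs up to the next object's offset (assumed ascending).
        end = first + pairs[i + 1][1] if i + 1 < n else len(decoded)
        objects[num] = decoded[start:end].strip()
    return objects

def build_object_stream(objs: dict) -> tuple:
    """Serialize objs into (decoded_bytes, n, first) for demonstration only."""
    body, pairs = b"", []
    for num, content in objs.items():
        pairs.append(b"%d %d" % (num, len(body)))
        body += content + b"\n"
    header = b" ".join(pairs) + b"\n"
    return header + body, len(objs), len(header)

decoded, n, first = build_object_stream({
    11: b"<< /Type /Font /Subtype /Type1 >>",
    12: b"[1 2 3]",
})
for num, obj in split_object_stream(decoded, n, first).items():
    print(num, obj)
```

Objects inside an object stream carry no “N G obj” wrapper, which is why their numbers come from the header pairs.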
Inspecting embedded fonts
Fonts affect text extraction, rendering, and PDF size. RUPS helps identify font types, encodings, and embedded subsets.
Tips
- Find font dictionaries: In the page resource (Page→Resources→Font) or the global resource dictionaries, expand each font entry to view its dictionary.
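- Check /Subtype: Identify Type0 (a composite font whose /DescendantFonts entry holds a CIDFontType0 or CIDFontType2 CIDFont), Type1, TrueType, or Type3. OpenType is not a font /Subtype; it appears as /Subtype /OpenType on a /FontFile3 stream.
- Embedded vs. referenced: Look for /FontFile (Type 1), /FontFile2 (TrueType), or /FontFile3 (CFF or OpenType) in the font’s /FontDescriptor; for Type0 fonts, check the descendant CIDFont’s descriptor. If none is present, the font is not embedded and viewers may substitute another font. Type3 fonts are the exception: their glyphs are defined inline via /CharProcs.
- Subset fonts: Subset fonts have names like ABCDEF+FontName (six uppercase letters, a plus sign, then the font name). RUPS shows the font’s BaseFont name. A subset is still an embedded font program, so verify that a /FontFile, /FontFile2, or /FontFile3 stream is present.
- CMaps and encodings: For Type0 fonts, the /Encoding entry names or embeds the CMap that maps character codes to CIDs; for simple fonts, /Encoding (optionally with /Differences) maps codes to glyph names. If these are missing or incorrect, text extraction is likely to fail.
- Inspect ToUnicode maps: If present, the /ToUnicode stream maps character codes to Unicode code points, which is crucial for accurate extraction and searchability. Inspect the decoded /ToUnicode stream to verify the mappings.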
- Glyph check: For problematic glyphs, inspect the font program stream (decoded) or use external font tools after exporting the font data via RUPS.
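The naming and embedding rules in these tips are simple enough to capture in code. Here is a small Python sketch that classifies a font from its BaseFont name and the keys of its font descriptor. It assumes you have copied those values out of RUPS by hand into a plain dict; describe_font and the dict layout are hypothetical.

```python
import re

# Six uppercase letters plus '+' mark a subset font.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+(.+)$")

# Descriptor keys that indicate an embedded font program.
FONT_FILE_KEYS = {
    "FontFile": "Type 1 font program",
    "FontFile2": "TrueType font program",
    "FontFile3": "see the stream's /Subtype (Type1C, CIDFontType0C, or OpenType)",
}

def describe_font(base_font: str, descriptor: dict) -> dict:
    """Summarize subset and embedding status for one font."""
    match = SUBSET_TAG.match(base_font)
    embedded = [key for key in FONT_FILE_KEYS if key in descriptor]
    return {
        "name": match.group(1) if match else base_font,
        "subset": match is not None,
        "embedded_as": embedded[0] if embedded else None,
    }

print(describe_font("ABCDEF+Helvetica", {"FontFile2": "12 0 R"}))
print(describe_font("Times-Roman", {"Flags": 34}))
```

A result with "embedded_as" set to None points at a font that viewers will have to substitute, unless it is a Type3 font.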
Common font issues
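- Missing /ToUnicode → copy/paste and text search may yield gibberish, especially with subset or custom-encoded fonts.
- Subset fonts → only the glyphs used in the document are embedded; confirm that the glyphs for the affected text are present and that a /ToUnicode map covers them.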
- Corrupt font stream → broken rendering or fallback fonts.
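To check a /ToUnicode map outside RUPS, you can save the decoded stream and parse its bfchar and bfrange sections. The following Python sketch is deliberately simplified. It assumes one bfrange entry per line, skips the array form of bfrange, and assumes range destinations are single UTF-16 code units. The function name and sample CMap are illustrative only.

```python
import re

HEX = rb"<([0-9A-Fa-f]+)>"
RANGE_LINE = re.compile(rb"\s*" + HEX + rb"\s*" + HEX + rb"\s*" + HEX + rb"\s*")

def parse_tounicode(cmap: bytes) -> dict:
    """Return {character_code: unicode_string} from a decoded ToUnicode CMap."""
    mapping = {}
    for block in re.findall(rb"beginbfchar(.*?)endbfchar", cmap, re.S):
        for src, dst in re.findall(HEX + rb"\s*" + HEX, block):
            # Destinations are UTF-16BE and may hold more than one character.
            text = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
            mapping[int(src, 16)] = text
    for block in re.findall(rb"beginbfrange(.*?)endbfrange", cmap, re.S):
        for line in block.splitlines():
            m = RANGE_LINE.fullmatch(line)
            if m is None:
                continue  # blank line or the array form, which this sketch skips
            lo, hi, base = (int(group, 16) for group in m.groups())
            for code in range(lo, hi + 1):
                mapping[code] = chr(base + code - lo)
    return mapping

SAMPLE = b"""\
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
1 beginbfrange
<0025> <0027> <0042>
endbfrange
endcmap
"""

for code, text in sorted(parse_tounicode(SAMPLE).items()):
    print(f"{code:04X} -> {text!r}")
```

Codes that appear in a page’s content stream but not in this mapping are the ones most likely to come out wrong in copy/paste.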
Quick workflows (actionable)
- Diagnosing “invalid cross-reference table”:
  - Open PDF in RUPS → expand Cross Reference → inspect trailer and /Prev chain → click suspicious xref entries → jump to object offsets to validate bytes. If corrupt, rebuild with iText’s PdfReader/PdfWriter and rewrite the file.
- Recovering text from a PDF with missing /ToUnicode:
  - Inspect fonts in Resources → confirm absence of /ToUnicode → if fonts are embedded subset, extract /FontFile streams via RUPS and use a font tool to map glyphs, or use OCR on page images.
- Extracting an embedded image:
  - Find XObject of subtype /Image → open stream → save decoded bytes.
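For the image workflow, decoded bytes from a Flate-compressed image are raw samples that most viewers cannot open directly. One way to make them viewable is to wrap them in a Netpbm header, as in this Python sketch. It assumes 8 bits per component, a DeviceGray or DeviceRGB color space, and samples that no longer have a PNG predictor applied. The function name is hypothetical.

```python
def samples_to_netpbm(samples: bytes, width: int, height: int, components: int) -> bytes:
    """Wrap raw 8-bit samples in a PGM (1 component) or PPM (3 components) file."""
    if components not in (1, 3):
        raise ValueError("only DeviceGray (1) and DeviceRGB (3) are handled here")
    expected = width * height * components
    if len(samples) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(samples)}")
    magic = b"P5" if components == 1 else b"P6"
    # Binary Netpbm header: magic, dimensions, maximum sample value.
    return magic + b"\n%d %d\n255\n" % (width, height) + samples

# A 2x1 RGB image: one red pixel and one blue pixel.
pixels = bytes([255, 0, 0, 0, 0, 255])
ppm = samples_to_netpbm(pixels, width=2, height=1, components=3)
print(ppm[:11])
```

Take width, height, and the component count from the image dictionary’s /Width, /Height, and /ColorSpace entries, then write the result to a .ppm or .pgm file.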
Troubleshooting checklist
- Are trailers and /Size consistent across incremental updates?
- Do xref offsets point to valid object syntax?
- Are stream /Length values correct and filters supported?
- Are fonts embedded and do they include /ToUnicode mappings?
- Can decoded streams be searched or exported for external analysis?
Tools and commands that pair well with RUPS
- iText (Java/.NET) — rebuild PDFs and programmatically inspect/fix structures.
- qpdf — linearize, rebuild xref tables, and inspect object offsets.
- pdfcpu — validate and inspect PDF structure.
- fontTools — analyze and inspect exported font programs.
- Image viewers and hex editors — verify extracted streams.
Final tips
- Treat RUPS primarily as a diagnostic tool: use it to inspect and export, then fix issues with a library such as iText or a tool such as qpdf.
- Always compare raw and decoded stream views to separate compression/filtering issues from content corruption.
- When in doubt, rebuild the file with a robust PDF library and re-inspect the output in RUPS.