OL Learn

Split PDF by File Size

Hi all, is there any way I can split my PDF file into smaller files?
Let’s say I have a PDF with a file size of 60 MB and would like to split it into smaller files of 5 MB each (to the nearest page).
Your suggestions are much appreciated.

I don’t know of any good way to split specifically by the final output size. The PDF splitter is designed to split by number of pages or at logical break points (start of a new document, for instance), not by how large the file is.

Also, depending on how the source PDF was made, the total size of the split pages may actually be larger than the original file size, which makes this sort of calculation very complicated.

You would first have to split the whole PDF into single pages.
Then, using a script and the Alambic API, you would go through all the pages and concatenate them one by one until you reach your target file size.

You could do it in Workflow using only plugins, but I would recommend that only for a small number of PDF pages. If you have 100 pages or more, I would use a script with the Alambic API.

Here is an example of how to use the Alambic API to concatenate PDFs together:

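Option Explicit

Dim fso, objFile
Dim line
Dim myNewPDF

' The job file is a text file listing one PDF path per line
Set fso = CreateObject("Scripting.FileSystemObject")
Set objFile = fso.OpenTextFile(Watch.GetJobFileName, 1)   ' 1 = ForReading

' Create the merged PDF using the folder and name held in the Workflow variables
Set myNewPDF = Watch.GetPDFEditObject
myNewPDF.Create Watch.GetVariable("MergePDFFolder") & "\" & Watch.GetVariable("MergePDFName")

Do Until objFile.AtEndOfStream
  line = Trim(objFile.ReadLine)
  ' Skip blank lines, then append the listed PDF at the end of the merged one
  If Len(line) > 0 Then
    myNewPDF.Pages().InsertFrom line, 0, -1, myNewPDF.Pages().Count
  End If
Loop

myNewPDF.Save True
myNewPDF.Close
objFile.Close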

In this script, the data file I received was a text file containing the name (and path) of every PDF that needed to be concatenated. In your case, you could do the following:

  1. Split your main PDF per page into a folder using the PDF Splitter.
  2. Use the Folder Listing plugin to get the list of PDFs resulting from the previous split (that will generate an XML file containing the path, the name and the size of each file).
  3. Use the XML Splitter to split the XML file into one item per listed file.
  4. Use the Mathematical Operations plugin to add up the size of each file.
  5. Using a condition, if you haven’t reached your 5 MB limit, use the Create File plugin and put in it the name of the file to concatenate.
  6. Use a Send to Folder plugin to concatenate the file names that will be merged together. Make sure you check the option Concatenate file, leave Separator string empty and set a fixed name for the file.
  7. Use that file with the script above.
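If you prefer to do the grouping from steps 4 to 6 in a script instead of plugins, here is a minimal sketch of the logic. `GroupBySize` is a hypothetical helper (it is not part of Workflow or the Alambic API), and the paths and sizes in the usage lines are made-up values standing in for what the Folder Listing XML would give you. Keep in mind that the sum of the single-page sizes is only an estimate of the merged size.

```vbscript
Option Explicit

' Hypothetical helper: groups files greedily so that the sum of their
' individual sizes stays within maxBytes. Returns a Dictionary whose
' items are the paths of one group each, separated by vbCrLf.
Function GroupBySize(paths, sizes, maxBytes)
    Dim groups, current, currentBytes, i
    Set groups = CreateObject("Scripting.Dictionary")
    current = ""
    currentBytes = 0
    For i = 0 To UBound(paths)
        ' Close the current group when the next file would exceed the limit
        If current <> "" And currentBytes + sizes(i) > maxBytes Then
            groups.Add groups.Count, current
            current = ""
            currentBytes = 0
        End If
        If current <> "" Then current = current & vbCrLf
        current = current & paths(i)
        currentBytes = currentBytes + sizes(i)
    Next
    ' Keep the last, partially filled group
    If current <> "" Then groups.Add groups.Count, current
    Set GroupBySize = groups
End Function

' Usage with made-up paths and sizes (in bytes)
Dim paths, sizes, groups
paths = Array("C:\split\p1.pdf", "C:\split\p2.pdf", "C:\split\p3.pdf")
sizes = Array(3000000, 1500000, 2500000)
Set groups = GroupBySize(paths, sizes, 5000000)
' groups(0) lists p1 and p2 (4,500,000 bytes in total), groups(1) lists p3
```

Each item of the returned Dictionary is already in the one-path-per-line format the concatenation script reads, so it could be written to a text file and passed to that script as its data file.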

That should provide a good basis to start building your whole process. This is not a complete process. You will need to adapt it, but with a little elbow grease, you should manage :wink:

Further to the previous posts:

It’s almost impossible to split efficiently by size. A PDF contains a series of shared resources (fonts, graphics, dictionaries) that are used by some or all pages. When you split a PDF, those shared resources have to be added to each resulting PDF, which means, as @AlbertsN already pointed out, that the total size of all separate PDFs will be much larger than that of the original one. You often see single-page PDFs being almost as large as 500-page PDFs because all those shared resources are added as soon as you have a single PDF page.

@hamelj’s scripted solution would be the proper way to go. But note that this is still going to be a slow process because you have to save and close the split PDF in order to determine its current size, and then re-open it to add the next page. Note also that with the AlambicEdit API, you can specify an “Optimize” option when saving a PDF, which ensures that shared resources in a PDF are discarded if none of the pages use them, which would reduce the size of each split PDF… but optimization takes additional time, so the process would be even slower.

But in the end, it would achieve what you are looking for.

Actually, @Phil, let me know if I am wrong, but my solution doesn’t involve opening and closing the PDF during the merging process to find out the size. I already know the size from reading the XML file produced by the Folder Listing.

I know that, once merged and optimised, the result will be smaller, but I think this is a minor inconvenience that is outweighed by the overall speed gained by using a script with the Alambic API… no?

You know the size of the original file, but you don’t know the size of each sub-PDF as you add each page, unless you close it in between each page to check if you’ve reached the maximum limit for each sub-PDF.

Your technique of summing up the size of one-page PDFs in order to determine the maximum number of pages you can include in the final results won’t work because of the shared resource issue I mentioned earlier. Well, to be more precise, it will work, but it will be extremely inefficient and will create many more PDFs than necessary.
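To put rough numbers on that, here is a small sketch you can run with cscript. All the figures are hypothetical and the model is deliberately simplified: it assumes every page uses the same shared resources and that the merged, optimized PDF stores them only once.

```vbscript
Option Explicit

' All figures are hypothetical, chosen only to illustrate the effect
Const SHARED_BYTES   = 2000000   ' fonts, images, etc. used by every page
Const PER_PAGE_BYTES = 50000     ' content unique to each page
Const LIMIT_BYTES    = 5000000   ' target size per split PDF

Dim singlePageSize, pagesByEstimate, pagesActual

' A one-page PDF carries the shared resources plus its own content
singlePageSize = SHARED_BYTES + PER_PAGE_BYTES

' Summing one-page sizes counts the shared resources once per page
pagesByEstimate = LIMIT_BYTES \ singlePageSize

' Assumption: the merged PDF stores the shared resources only once
pagesActual = (LIMIT_BYTES - SHARED_BYTES) \ PER_PAGE_BYTES

WScript.Echo "Pages per PDF when summing one-page sizes: " & pagesByEstimate   ' 2
WScript.Echo "Pages per PDF with shared resources stored once: " & pagesActual ' 60
```

With these made-up figures, the summing approach would close each PDF after 2 pages even though about 60 pages would fit under the limit.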

I made a mistake in my original post when I said your scripted solution would work: I only looked at it quickly and thought it was a page-appending script that just needed a few changes to check the size of the resulting PDF between each page being appended. But it is in fact a basic PDF-concatenation script that adds PDFs to one another. As I said, it can work, but not very efficiently.
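For reference, this is roughly what I mean by a page-appending script that checks the size between pages. It is an untested sketch, not a finished solution. It only uses the calls already seen in this thread (`Open`, `Create`, `Pages.Count`, `Pages.InsertFrom`, `Save`, `Close`), and it assumes that the arguments of `InsertFrom` are the source file, the first source page, the number of pages and the insert position. The `OutDir` variable and the `part` file names are hypothetical.

```vbscript
Option Explicit

Const MAX_CHUNK_BYTES = 5242880   ' 5 MiB target per split PDF

Dim oFSO, oPDF, sSource, sOutDir, sChunkFile
Dim pageCount, pageIndex, chunkNumber

' Hypothetical inputs: the job file is the large PDF, and "OutDir" is a
' Workflow variable holding the output folder with a trailing backslash
sSource    = Watch.GetJobFileName
sOutDir    = Watch.GetVariable("OutDir")
sChunkFile = sOutDir & "chunk.tmp"

Set oFSO = CreateObject("Scripting.FileSystemObject")
Set oPDF = Watch.GetPDFEditObject

' Remove any work file left over from a previous run
If oFSO.FileExists(sChunkFile) Then oFSO.DeleteFile sChunkFile, True

' Read the page count of the source once
oPDF.Open sSource, False
pageCount = oPDF.Pages.Count
oPDF.Close

chunkNumber = 0
For pageIndex = 0 To pageCount - 1
    If oFSO.FileExists(sChunkFile) Then
        oPDF.Open sChunkFile, False
    Else
        oPDF.Create sChunkFile
    End If
    ' Append a single page of the source (argument meaning is assumed)
    oPDF.Pages.InsertFrom sSource, pageIndex, 1, oPDF.Pages.Count
    oPDF.Save True
    oPDF.Close

    ' The size is only known once the file has been saved and closed
    If oFSO.GetFile(sChunkFile).Size >= MAX_CHUNK_BYTES Then
        oFSO.MoveFile sChunkFile, sOutDir & "part" & chunkNumber & ".pdf"
        chunkNumber = chunkNumber + 1
    End If
Next

' Keep the last, partially filled PDF
If oFSO.FileExists(sChunkFile) Then
    oFSO.MoveFile sChunkFile, sOutDir & "part" & chunkNumber & ".pdf"
End If
```

Note that each PDF is closed only after it has reached the limit, so it ends up one page over rather than under. Staying strictly under the limit would mean discarding the last save and starting the next PDF with that page, which makes the process slower still.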

I did something similar to @hamelj’s approach before, but it added existing print files to a batch file until a predefined limit was reached. It checked the file size on the fly in the script.

' Script to chunk PDF output
' Date: 2017-01-25
'
Option Explicit

' Size at which a chunk is closed: 20 MiB (20 * 1024 * 1024 bytes)
Const MAX_CHUNK_BYTES = 20971520

Dim ChunkNumber, ChunkSize
Dim sCurWorkingDir, sGlobalUserPath, sJobID, sPackID, sUsername, sChunkFile, sChunkDir
Dim oFSO, oPDF, Folder, Files, File

' Solution-specific variables; only CurWorkingDir is used below.
' CurWorkingDir is expected to end with a backslash.
sGlobalUserPath = Watch.GetVariable("global.DIR_Users")
sCurWorkingDir  = Watch.GetVariable("CurWorkingDir")
sJobID          = Watch.GetVariable("JOB_ID")
sPackID         = Watch.GetVariable("PACK_ID")
sUsername       = Watch.GetVariable("Username")
sChunkFile      = sCurWorkingDir & "chunk.tmp"
sChunkDir       = sCurWorkingDir & "Chunk\"

Set oFSO   = CreateObject("Scripting.FileSystemObject")
Set oPDF   = Watch.GetPDFEditObject
Set Folder = oFSO.GetFolder(sCurWorkingDir)
Set Files  = Folder.Files

ChunkNumber = 0
ChunkSize   = 0

If Not oFSO.FolderExists(sChunkDir) Then oFSO.CreateFolder sChunkDir

For Each File In Files
    If LCase(Right(File.Name, 4)) = ".pdf" Then
        ' Re-open the chunk in progress, or start a new one
        If oFSO.FileExists(sChunkFile) Then
            oPDF.Open sChunkFile, False
        Else
            oPDF.Create sChunkFile
        End If
        ' Append the whole file to the end of the chunk
        oPDF.Pages.InsertFrom File.Path, 0, -1, oPDF.Pages.Count
        oPDF.Save True
        oPDF.Close
        ' The chunk size is only known once it has been saved and closed
        ChunkSize = CDbl(oFSO.GetFile(sChunkFile).Size)
        If ChunkSize >= MAX_CHUNK_BYTES Then
            oFSO.MoveFile sChunkFile, sChunkDir & ChunkNumber
            ChunkNumber = ChunkNumber + 1
            ChunkSize = 0
        End If
        oFSO.DeleteFile File.Path, True
    End If
Next

' Move the last, partially filled chunk
If oFSO.FileExists(sChunkFile) Then oFSO.MoveFile sChunkFile, sChunkDir & ChunkNumber

Set oPDF = Nothing
Set oFSO = Nothing

You’ll want to ignore some of the global variables, which were solution-specific, but this does away with the need for several steps. It does the same thing as above, but runs after your would-be split.

Your output will be: Chunk0, Chunk1, Chunk2, etc. In this case, each ‘chunk’ (as I call them) is closed once it reaches 20 MiB (20971520 bytes), so a chunk can end up slightly larger than that.

NB: @Phil is right (not that this should be any surprise). Efficiency is always a problem when doing things like this.

Regards,

James.

oooooooo… nicely done, Jim. Exactly what I was attempting to describe.
Thanks for sharing that code!


Thanks all, I will try out the suggestions above and report back with the result.