Extract PDF Pages Based on Content

Posted on April 25, 2014 by Karl Heinz Kremer

How would we identify pages in a PDF document that contain a certain word and extract those pages into a new document? This can be done with a few lines of JavaScript, and there are different ways to run them: We can create a folder level JavaScript and install it in one of Acrobat’s JavaScript folders (see here for more information about how to identify the folder where to install such a script), or we can create an Action that executes the JavaScript. In the past I’ve written about how to create folder level scripts (e.g. here), so let’s create an Action today.

Here is the script that we will be using:

// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.

var pageArray = [];

var stringToSearchFor = "Total";

for (var p = 0; p < this.numPages; p++) {
	// iterate over all words
	for (var n = 0; n < this.getPageNumWords(p); n++) {
		if (this.getPageNthWord(p, n) == stringToSearchFor) {
			pageArray.push(p);
			break;
		}
	}
}

if (pageArray.length > 0) {
	// extract all pages that contain the string into a new document
	var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
	for (var n = 0; n < pageArray.length; n++) {
		d.insertPages( {
			nPage: d.numPages-1,
			cPath: this.path,
			nStart: pageArray[n],
			nEnd: pageArray[n],
		} );
	}

	// remove the initial blank page
	d.deletePages(0);
}

The script is pretty straightforward: We iterate over all pages, and on each page, we loop over all words until we find the word that we are looking for. When we find it, we add the page number to an array of page numbers and move on to the next page.

If, after all this looping, we have information in this array of page numbers, we process that list by creating a new document (which will add a blank page – a PDF document always has to have at least one page), and then we add each page from the original document that we find in the array. All that’s left now is to remove that initial blank page.

So, let’s convert this into an Action. In Acrobat XI Pro (this will not work in Standard, which does not support Actions), select “Tools>Action Wizard>Create New Action”. This will create an empty Action. To add JavaScript to our Action, select the “Execute JavaScript” option under “More Tools” and move it to the right side (e.g. by clicking on the arrow button).

Once the “Execute JavaScript” step is on the right side, click on the “Specify Settings” button and paste the script from above into the editor. Once the script is part of the Action, you can prevent the editor from popping up every time you run the Action by deselecting “Prompt User” for this action step.

Save the action, give it a meaningful name and you are ready to execute it.

You can download the action here: ExtractPagesWithString.sequ. Once downloaded, just double-click on it to install it in Acrobat Pro. Again, this will not work with Acrobat Standard or the free Adobe Reader.


122 Responses to Extract PDF Pages Based on Content

Elizabeth Celuck says:

June 10, 2014 at 2:24 pm

Your script is exactly what I have been searching for, so thank you for sharing it! I am getting an error message saying it is corrupt when I click on it from the download location. I also tried copying the code and pasting into notepad, saving it as an sequ, and then opening it, but still get a corrupt code error. I would appreciate any assistance you can offer. Thanks!
Joe Barry says:

October 16, 2014 at 5:49 pm

Hello,

Good code snippet.

How might we then save the .tmp files that pop up? We’d like this to be more of an operating system script that saves a new file with a name of “filename+new”, suppresses any preview and commits the files to the operating system as files.
Karl Heinz Kremer says:

October 30, 2014 at 5:37 pm

Elizabeth, you should be able to create a new action based on the instructions I’ve provided. You cannot just save the code snippet as a SEQU file, you will have to create a new Action, add a JavaScript step and then use the code from above for that JavaScript processing step.
Karl Heinz Kremer says:

October 30, 2014 at 5:40 pm

Joe, that’s not what an Acrobat Action is about: An Action will always run in Acrobat and will display the processed file. If you want to do this from outside of Acrobat, you will have to write an application that “remote controls” Acrobat e.g. via the IAC interface using VB. Take a look at my VBA and VBScript related posts for more information. You would have to use the JSObject to use the JavaScript interface from VB or VBScript.
JohnR says:

November 4, 2014 at 7:14 pm

Great idea on the posted code. I have implemented per your instructions, the code runs and says that it has executed successfully, but no document is created. The search words are correct and are simply replaced in the ‘Total’ text from the script, but nothing appears to happen. The debugger was no help either. Suggestions?
Nicola F. says:

January 5, 2015 at 11:39 am

Thank you so much for this, it opened me a whole new world!

I got a question: is there a simple command to highlight somehow the word after the script finds it!?
Something like:
this.highlightPageNthWord(p, n) !?

I just want my eyes to find it quickly when I look at the pdf after the script is executed.
Thanks in advance!
Karl Heinz Kremer says:

January 5, 2015 at 3:14 pm

Nicola, look at the Doc.selectPageNthWord() method in the API documentation.
Nicola F. says:

January 5, 2015 at 10:11 pm

Karl, thanks for the quick answer!

I checked the doc, but what you suggested seems no good for me. Or, I’m doing it wrong.

While reading the manual I found the addAnnot command to add a Highlight, so, I did my own script to do this:
1- Look for several words
2-When found, highlight them
3-Delete pages where there are no matching words
4-Save the modifed doc with another name

And, it works! But, it’s very slow.
A 10 page pdf where the script finds 12 matching words takes 180 sec to process, while it takes only 2 sec if I skip step 2! And I have hundreds pages to process 🙁

Could these few lines
this.addAnnot({
	page: nth_page,
	type: "Highlight",
	quads: this.getPageNthWordQuads(nth_page, nth_word)
});
repeated 12 times make such a huge difference!?

Thanks again
Nicola F. says:

January 6, 2015 at 12:07 am

Nevermind, for some reason I can’t understand, it didn’t like to call addAnnot during the search, so I stored the pages and quads in 2 arrays. At the end of the search, I added all the necessary annotations together.
Now I process approx 2000 pages in 8 minutes. Sounds good enough to me! Thanks!
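
A minimal sketch of the two-pass approach described above: first collect the page and quads of every match, then add all the highlights afterwards. The function names `collectMatches` and `highlightMatches` are made up for this illustration, and the document is passed in as a parameter so that the same code can be called with `this` inside Acrobat. It has not been timed, so whether it gives the same speedup is an open question.

```javascript
// Pass 1: collect the location of every matching word without modifying the document.
function collectMatches(doc, word) {
    var matches = [];
    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p);
        for (var n = 0; n < numWords; n++) {
            if (doc.getPageNthWord(p, n) == word) {
                matches.push({ page: p, quads: doc.getPageNthWordQuads(p, n) });
            }
        }
    }
    return matches;
}

// Pass 2: add one Highlight annotation per collected match.
function highlightMatches(doc, matches) {
    for (var i = 0; i < matches.length; i++) {
        doc.addAnnot({
            page: matches[i].page,
            type: "Highlight",
            quads: matches[i].quads
        });
    }
    return matches.length;
}

// In Acrobat (e.g. in an "Execute JavaScript" step):
// highlightMatches(this, collectMatches(this, "Total"));
```

Keeping the search and the annotation step apart also makes it easy to reuse the collected page numbers, e.g. to delete pages without matches.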
Stephanie A says:

January 29, 2015 at 6:39 pm

You are quite literally my favorite person today. You have taken hours off my work week. Thank you!!!!!
Jason Pretorius says:

March 20, 2015 at 7:26 am

I’m not a developer/coder at all, and this literally saved my life today.

If I could, I would be buying you a beer right now.

Thanks.
Karl Heinz Kremer says:

March 20, 2015 at 1:16 pm

Jason, just keep me in mind for any professional needs around PDF you may come across in the future. I can only write this blog because nice people are hiring me for PDF related consulting jobs 🙂
adrian says:

June 11, 2015 at 4:00 pm

Is there any way I can edit this so that it deletes the pages with the specified string?
Jeff B says:

June 15, 2015 at 5:07 pm

Hi Karl,

Your script is exactly what we were looking for but for some reason I can not get it to work. We have a 1622 page document in Acrobat Pro. Each page has either “page 1 of 1” or “page 1 of 2” or “page 2 of 2” at the bottom. We need to extract all the “page 1 of 1” pages from the document into a new document. I have copy and pasted your script and replaced where you have “title” with “page 1 of 1” . The script seems to run fine but the newly made document is the same as the previous document. Any ideas? Thanks.
Karl Heinz Kremer says:

June 15, 2015 at 5:24 pm

Jeff, the “word finder” does just that: it returns one word at a time. You will have to do a bit more to get the full string containing all four parts (“page”, “1”, “of”, and “1”). There is a method to get the location of the “words”; you may have to use that to get things into the correct order.
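
As a starting point only, here is a sketch that looks for a phrase as a run of consecutive words. It assumes that the word finder reports the words of the phrase in reading order, which is not guaranteed; if it does not, the word locations have to be taken into account as well. The function names `pageContainsPhrase` and `findPagesWithPhrase` are made up for this illustration.

```javascript
// Returns true if the words in phraseWords appear as consecutive words on page p.
// Assumption: the word finder returns these words in reading order.
function pageContainsPhrase(doc, p, phraseWords) {
    var numWords = doc.getPageNumWords(p);
    for (var n = 0; n + phraseWords.length <= numWords; n++) {
        var k = 0;
        while (k < phraseWords.length &&
               doc.getPageNthWord(p, n + k) == phraseWords[k]) {
            k++;
        }
        if (k == phraseWords.length) {
            return true;
        }
    }
    return false;
}

// Collects the (zero-based) page numbers of all pages that contain the phrase.
function findPagesWithPhrase(doc, phraseWords) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        if (pageContainsPhrase(doc, p, phraseWords)) {
            pages.push(p);
        }
    }
    return pages;
}

// In Acrobat:
// var pageArray = findPagesWithPhrase(this, ["page", "1", "of", "1"]);
```

The resulting array can be used in place of `pageArray` in the original script.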
Jeff B says:

June 15, 2015 at 6:23 pm

Thanks for the quick reply. Unfortunately your answer is out of my expertise. Unless you have a webpage to point me to. Thanks.
Karl Heinz Kremer says:

June 15, 2015 at 6:50 pm

Jeff, no, I don’t have any instructions that would cover that. However, if you need help implementing this, I am available. This is actually something I’ve done a few times for my customers. You can find my email address on the “About” page.
Praj says:

July 5, 2015 at 6:58 am

Hello. 🙂
I was searching for a method to extract PDF pages containing the same words.
This method is very helpful.
Thank you very much. 🙂
Michael Harp says:

July 27, 2015 at 7:12 pm

Is it possible to DELETE all of the pages from a PDF document that includes a specific string of text? I have a 900+ page document that I don’t really need to extract every page that includes certain text, I need to delete the 200+ pages that includes one specific string of text that doesn’t appear anywhere else in the document. Any thoughts?
Karl Heinz Kremer says:

July 27, 2015 at 9:50 pm

Yes, it’s certainly possible. I would start to process the document from the last page to the first, and then whenever you find the string, you call Doc.deletePages(). See here for documentation for this API function: http://help.adobe.com/livedocs/acrobat_sdk/11/Acrobat11_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat11_HTMLHelp&file=JS_API_AcroJS.89.458.html
Cassia says:

July 30, 2015 at 2:33 pm

Hello Karl,
Thank you for this script! It is fantastic.
Could you please post the full script to save the extracted documents, with new filenames, in new folders? I see that there is some reference made to this above. However, I am not a programmer, and cannot figure out how to implement it.
Karl Heinz Kremer says:

August 4, 2015 at 3:05 pm

Cassia, that’s a bit too much to share in a free blog post. If you do need help implementing such a script, that’s what I do for a living 🙂 If you need professional help, feel free to get in touch with me via email. My email address is on my “About” page.
T. says:

August 6, 2015 at 1:54 pm

How would I search for all forms of “total” (e.g., “total” and “totaling”)?

Or, how would I search for two words (if easier than the above), such as “total” and “totaling”?

Thank you!
Malcolm says:

September 22, 2015 at 11:05 am

thanks for creating this script – has saved me a few hours work.

just one question – is it possible to ignore case in the search ??
Jason says:

September 23, 2015 at 12:55 pm

Hi! Awesome script. In case anyone wants it, I adjusted the script as follows to prompt the user for the desired search term instead of it being hard-coded into the script:

Changed the below line
var stringToSearchFor = "Total";

To this
var stringToSearchFor = app.response("Enter search term");

I also noticed that this search ignores characters like the $ character. I also figured out that it ‘IS’ case sensitive, and doesn’t work on strings inside or bumped up against other words without a space in between.

Is there a complete list somewhere that shows what combinations, characters, etc. it will or will not find? Or is there a bit of code that would adjust what it will or will not find?

Thanks! ~JTC~
Karl Heinz Kremer says:

September 23, 2015 at 1:41 pm

Jason, I am not aware of such a list. Take a look at the documentation: http://help.adobe.com/livedocs/acrobat_sdk/11/Acrobat11_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat11_HTMLHelp&file=JS_API_AcroJS.89.492.html

There is an option to not strip out punctuation marks and whitespace. That may give you what you want.
Dr. Stephanie Rollins says:

September 30, 2015 at 8:54 am

This script is awesome! How can I delete the pages instead of extracting them. I’ve searched everywhere and hoping you can help. Is there a delete command I can insert in this script?

When I run this script, the pages extract, however, the pages with the searched word still remain in the original document (no blank pages).

Hoping you can help!! I work for the government, so don’t have $$ to hire a consultant. Trying really hard to figure this out on my own, but I’m stumped!
Jason says:

September 30, 2015 at 11:42 am

Karl,

Thanks for your response. I’ll keep experimenting and post back if I figure anything additional out. I really appreciate the information you provide. Helps a lot of people out!

~JTC~
Brian Borgstrøm says:

November 21, 2015 at 6:06 am

Hi Karl,

Thank you so much for this script, it does almost everything I need it to.
Is there a way to get the script search for more than just one word? I specifically want it to search for two-word phrases but I can’t get the script to do that for me. I think it’s because it doesn’t include blank spaces in the search.

Thanks again,
Brian
Karl Heinz Kremer says:

November 24, 2015 at 9:03 am

Brian, the “word finder” can only search for one word at a time. You would have to implement your own method of searching for longer strings.
Sean Osterhout says:

December 1, 2015 at 9:34 am

Could this be modified to extract groups of pages? I have an 800 page report that we want separated into 3-page documents, every third page. I have a script that inserts blank pages for when we print:

/* Add blank pages every 3 */
/* To change number of pages between blank, change all “3” to the desired increment */

for (var i=this.numPages-3; i>=0; i-=3) {
	var Rect = this.getPageBox("Crop", i);
	this.newPage(i+3, Rect[2], Rect[1]);
}

Now I need to get the printed bills separated so I save them digitally to our customer’s files. Can you point me in the right direction?
Karl Heinz Kremer says:

December 1, 2015 at 12:37 pm

Sean, to extract pages you would use a different approach. The loop could be the same, but in the loop, you would use the “Doc.extractPages()” method. See here for more information: http://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/index.html#t=Acro12_MasterBook%2FJS_API_AcroJS%2FJavaScript_API.htm%23TOC_extractPagesbc-423&rhtocid=_6_1_8_39_32
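
A rough sketch of how that could look for groups of three pages. The helper names `pageRanges` and `extractGroups` and the output file naming are made up for this illustration. Keep in mind that `Doc.extractPages()` with a `cPath` only works in a privileged context such as an Action or the JavaScript console.

```javascript
// Splits numPages pages into consecutive ranges of groupSize pages
// (the last range may be shorter). Page numbers are zero-based.
function pageRanges(numPages, groupSize) {
    var ranges = [];
    if (!(groupSize >= 1)) {
        return ranges;
    }
    for (var start = 0; start < numPages; start += groupSize) {
        ranges.push({
            nStart: start,
            nEnd: Math.min(start + groupSize, numPages) - 1
        });
    }
    return ranges;
}

// Extracts every range into its own file, named baseName_1.pdf, baseName_2.pdf, ...
// (hypothetical naming; a relative cPath is resolved relative to the current document).
function extractGroups(doc, groupSize, baseName) {
    var ranges = pageRanges(doc.numPages, groupSize);
    for (var i = 0; i < ranges.length; i++) {
        doc.extractPages({
            nStart: ranges[i].nStart,
            nEnd: ranges[i].nEnd,
            cPath: baseName + "_" + (i + 1) + ".pdf"
        });
    }
    return ranges.length;
}

// In Acrobat:
// extractGroups(this, 3, "bill");
```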
Maria Majka says:

May 20, 2016 at 3:41 pm

This article has been an incredibly helpful tool for me! Thanks so very much for sharing your knowledge in a clear, concise manner (I know nothing about scripts and you made this simple). This is saving me hours of work in extracting multiple pages.
Duncan Marr says:

June 15, 2016 at 1:37 am

I have an 8000 page PDF. Every even page is addressed and every odd page is not addressed (These need to be kept together). I need to extract all those pages where there are multiples that share the same name and address (including the corresponding non-addressed page) into one file, in order, and all those that only appear once into another file. Is this possible via a script?
Karl Heinz Kremer says:

June 24, 2016 at 2:38 pm

Duncan, it may be possible to do this using a script, but it depends on the actual PDF file and how it was generated. Even if it is possible, it requires quite a bit of scripting. I’ve done similar projects where pages needed to be bundled and extracted as individual documents, so if you need professional help, feel free to contact me via email to ask about my consulting services. My email address is on my “About” page: http://khkonsulting.com/about
ash says:

July 20, 2016 at 4:05 am

page extracted but in different file. how can i combine pages. i selected different pdfs i want the result to be combined in one pdf.
Karl Heinz Kremer says:

July 20, 2016 at 12:24 pm

ash, you cannot do this in one operation, you need to extract first and then assemble – or, you can keep track of which pages you want, and then remove all pages from your document that you do not need.
Forrest says:

September 12, 2016 at 7:38 pm

Hi Karl

Thanks for this posting – although I’m having a very odd issue. I have Adobe Acrobat Pro XI, and for some reason when I use your script the “stringToSearchFor” must start with the letter V. If I try any other word, it does not work. Any ideas?

Thanks!
Karl Heinz Kremer says:

September 19, 2016 at 11:23 am

Forrest, sorry, I don’t have any ideas why that would be. Did you make changes to my code?
Josh says:

September 27, 2016 at 11:43 am

Karl, is there a way to modify this code to search for a partial word?
Kendra says:

September 27, 2016 at 6:58 pm

Is it possible to look to a certain location on the PDF for a word (in my case a loan number) and include that word in the filename when extracting/splitting?
Karl Heinz Kremer says:

September 28, 2016 at 5:31 pm

Kendra, you can try to extract a small portion of the document by cropping the page first to your target area, then getting all words in that target area while assembling e.g. your loan number, and then undoing the crop again to go back to your original page. This page has information about how to do that: https://answers.acrobatusers.com/Reverse-Crop-With-Javascript-q299707.aspx

You can only use that information as part of a filename if you are saving the document (or splitting it) via JavaScript.
Karl Heinz Kremer says:

September 28, 2016 at 5:32 pm

Josh, to match a partial word, you would need to provide your own matching algorithm. You can e.g. use regular expressions to do that. The word finder will always return one word, and you would have to implement the logic to match your partial word.
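
For illustration, a sketch of what matching with a regular expression could look like. The function name `findPagesMatching` is made up for this example; the loop is the same as in the original script, only the comparison is replaced by a regular expression test.

```javascript
// Returns the (zero-based) page numbers of all pages that contain at least
// one word matching the regular expression. Do not use the "g" flag here,
// because it makes test() remember its position between calls.
function findPagesMatching(doc, pattern) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p);
        for (var n = 0; n < numWords; n++) {
            if (pattern.test(doc.getPageNthWord(p, n))) {
                pages.push(p);
                break;
            }
        }
    }
    return pages;
}

// In Acrobat:
// var pageArray = findPagesMatching(this, /total/i);   // "Total", "totaling", "Subtotal"
// var pageArray = findPagesMatching(this, /^total/i);  // only words starting with "total"
```

The `i` flag makes the match case-insensitive, and the `^` anchor restricts the match to the beginning of a word.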
Vanessa says:

December 14, 2016 at 3:57 pm

Brilliant! But how do I get this to search for more than one string at a time and output all the pages in one shot?
Karl Heinz Kremer says:

December 14, 2016 at 6:08 pm

Vanessa, that’s just standard JavaScript programming. You need to use an “or” construct to search for one or another string (or a third a fourth or a fifth and so on):

var nthWord = this.getPageNthWord(p, n);
if (nthWord == stringToSearchFor_1 || nthWord == stringToSearchFor_2 || nthWord == stringToSearchFor_3) {
	// ...

I’ve pulled out getting the nth word from the if statement so that I don’t have to call it multiple times. I assign it to a variable, and then just compare that variable to all the words I am looking for.
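
When the list of words gets longer, the chain of “or” comparisons can be replaced by a lookup in an array. This is only a sketch; the names `matchesAny` and `findPagesWithAnyWord` and the example words are made up for this illustration.

```javascript
// Returns true if word is equal to one of the entries in list.
function matchesAny(word, list) {
    for (var i = 0; i < list.length; i++) {
        if (word == list[i]) {
            return true;
        }
    }
    return false;
}

// Same loop as in the original script, but a page is collected as soon as
// one of its words is found in the list.
function findPagesWithAnyWord(doc, list) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p);
        for (var n = 0; n < numWords; n++) {
            if (matchesAny(doc.getPageNthWord(p, n), list)) {
                pages.push(p);
                break;
            }
        }
    }
    return pages;
}

// In Acrobat:
// var pageArray = findPagesWithAnyWord(this, ["Total", "Subtotal", "Balance"]);
```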
Vanessa says:

December 15, 2016 at 2:39 pm

YOU ARE AMAZING. THANK YOU!!!!!!!!!!!! you just saved me hours and hours of work <3
Brandon says:

February 9, 2017 at 12:38 am

Hi. I had to delete the Actions (Find and Highlight, Extract Highlighted) from my Adobe, but now I’m getting an error message stating “Unable to Import the Action ‘ExtractPagesWithString’. The file is either invalid or corrupt.” I have a huge project that will require 4400 pages to be marked and extracted out of 14000. I can’t figure this out. Thank you in advance!
Karl Heinz Kremer says:

February 9, 2017 at 7:02 pm

Brandon, which version of Adobe Acrobat are you using? This should work without problems in any recent version.
srihari says:

March 20, 2017 at 5:27 am

Hi. I had to delete folios in PDF. I am currently using edit document text in Adobe X pro. If few pages I can do it manually but for more pages its tough. So can I have a script to remove the folios in pdf
Karl Heinz Kremer says:

March 20, 2017 at 8:26 pm

Srihari, with the information provided here and some basic JavaScript knowledge, you should be able to create this script yourself. If this is not enough, I can certainly help you via my professional consulting services. If you are interested in that, feel free to get in touch with me via email.
srihari says:

March 20, 2017 at 10:28 pm

Thank you Karl. If you could provide me basic script for folio I can develop it and use it.
Stanley J says:

April 26, 2017 at 11:43 am

Hi, this has really helped. Thank you so much. How would the script look to extract content using a date format (eg. 06JUL17)? Im having difficulty with this. Thanks
Karl Heinz Kremer says:

May 3, 2017 at 1:14 pm

Stanley, if you already know the exact string, you can just adjust this one line:

var stringToSearchFor = "06JUL17";

This should do the job. If the date is not fixed, you need to use the util.printd() method to create the string to search for. E.g. something like this for today’s date:

var today = new Date();
var stringToSearchFor = util.printd("ddmmmyy", today).toUpperCase();
Stanley J says:

May 8, 2017 at 9:48 am

Karl,

When I use :

var stringToSearchFor = "06JUL17";

The JavaScript runs and states “completed” but does not create a new (temp) file with the “06JUL17” pages. I’ve tried this several times. It seems the only time the new extracted pages are created is if I use a purely alphabetical search string and not an alphanumeric one like 06JUL17. Your thoughts?

Thanks
Hermie says:

May 26, 2017 at 7:02 pm

Hi Carl, Running the script on Acrobat Pro DC. It says completed, but where is the extracted file saved? What’s the default location? Thanks!
Rafique Khan says:

May 30, 2017 at 10:45 pm

Karl, how can I modify this file to use it on a folder and use the same file name with just appending an extra string. I would really appreciate your help.
Karl Heinz Kremer says:

June 13, 2017 at 1:22 pm

Rafique, can you please elaborate on what it is you want to do. From your short description, it’s not clear to me.
Karl Heinz Kremer says:

June 13, 2017 at 1:24 pm

Hermie, the file does not get saved, you need to do that. It will be open in Acrobat (you should have two files open after the script runs: The original PDF file and the one with the extracted pages).
elcartu says:

June 22, 2017 at 6:28 am

Hola, is there any possibility to extract pages based on variable content taken from an external csv/txt file or similar?

For example, extract the pages that contain one of the references listed in this csv/txt file, where the txt file contains:
"f34";"r45";"k43"
Karl Heinz Kremer says:

June 22, 2017 at 10:08 am

elcartu, you can certainly do that. It’s just a matter of reading the text file into Acrobat via util.readFileIntoStream(), then processing the stream and parsing out your CSV data. Other than that, it’s just plain JavaScript. The actual implementation is a bit outside of the scope of what I can do here on my blog, so if you need help with this, you can contact me via email for my consulting services.
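
To give an idea of the parsing part, here is a small sketch that turns the content of such a file into an array of search terms. The function name `parseSearchTerms` and the file path in the usage comment are made up for this illustration, and the sketch only handles simple lists like the one above (no separators or escaped quotes inside the quoted values).

```javascript
// Splits text like "f34";"r45";"k43" into ["f34", "r45", "k43"].
// Accepts semicolons, commas and line breaks as separators and removes
// surrounding white space and double quotes from every entry.
function parseSearchTerms(text) {
    var terms = [];
    var fields = text.split(/[;,\r\n]+/);
    for (var i = 0; i < fields.length; i++) {
        var term = fields[i].replace(/^\s*"?/, "").replace(/"?\s*$/, "");
        if (term.length > 0) {
            terms.push(term);
        }
    }
    return terms;
}

// In Acrobat (privileged context, e.g. an Action; the path is hypothetical):
// var stream = util.readFileIntoStream("/c/temp/terms.txt");
// var terms = parseSearchTerms(util.stringFromStream(stream, "utf-8"));
```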
elcartu says:

June 22, 2017 at 1:23 pm

Thanks, I have nearly done it… with the action “Find, Highlight, and Extract Words” from https://acrobatusers.com/actions-exchange… I need to do some extra processing, but it is enough for me for now.
Adam says:

July 18, 2017 at 5:14 pm

Karl, thanks for keeping up with these replies years after the original post. Let’s say I want to delete (instead of extract) all pages from a PDF that contain a certain string. Would it be easy for you to throw something together to do that?
Karl Heinz Kremer says:

July 19, 2017 at 10:22 am

Adam, the key for deleting pages is that you need to process the file in reverse page order (starting with the last page and ending with the first). Something like this should work (disclaimer: I did not try this, I just modified the original version of the script in a way I think will work):

var stringToSearchFor = "Total";
for (var p = this.numPages-1; p >= 0; p--) {
	// iterate over all words
	for (var n = 0; n < this.getPageNumWords(p); n++) {
		if (this.getPageNthWord(p, n) == stringToSearchFor) {
			if (this.numPages > 1) {
				this.deletePages(p);
			}
			else {
				app.alert("Cannot delete last remaining page in document");
			}
			break;
		}
	}
}
Jo-an says:

August 28, 2017 at 7:59 pm

Hi Karl,

Thanks for the post. It’s very useful. I took the liberty to modify the script to search for 5 strings. In my case, the 4th and 5th strings could have multiple occurrences in the PDF, and I only need one of the pages that contain those strings. I also know that the 1st, 2nd, and 3rd string always come before 4th and 5th. I am wondering, is there a way to modify the script that it only outputs 1 result when it comes to multiple pages containing the same string?

Thank you in advance.

The script I modified is as follows:

// Iterates over all pages and find a given string and extracts all
// pages on which that string is found to a new file.

var pageArray = [];

var stringToSearchFor1 = "A";
var stringToSearchFor11 = "B";
var stringToSearchFor12 = "C";

var stringToSearchFor2 = "D";
var stringToSearchFor21 = "E";

var stringToSearchFor3 = "A";
var stringToSearchFor31 = "B";
var stringToSearchFor32 = "C";
var stringToSearchFor33 = "D";

var stringToSearchFor4 = "E";
var stringToSearchFor41 = "F";
var stringToSearchFor42 = "G";
var stringToSearchFor43 = "H";

var stringToSearchFor5 = "I";
var stringToSearchFor51 = "J";
var stringToSearchFor52 = "K";
var stringToSearchFor53 = "L";

for (var p = 0; p < this.numPages; p++)
{
// iterate over all words
for (var n = 0; n 0) {
// extract all pages that contain the string into a new document
var d = app.newDoc(); // this will add a blank page – we need to remove that once we are done
for (var n = 0; n < pageArray.length; n++)
{
d.insertPages( {
nPage: d.numPages-1,
cPath: this.path,
nStart: pageArray[n],
nEnd: pageArray[n],
} );
}
// remove the first page
d.deletePages(0);
}
Karl Heinz Kremer says:

September 29, 2017 at 10:18 am

Jo-an, you need to use a logical “AND” operation. This is standard JavaScript, and has nothing to do with Acrobat specifically. Lookup logical operations in the JavaScript documentation (e.g. here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Logical_Operators)
Soren Eustis says:

October 5, 2017 at 1:38 pm

Wonderful script! Is there an easy way to reference a text file (.csv) with a list of search words. I have a large PDF that is a group of class assignments. I mail-merged the assignment so that each student has their username on the top right of each page. I would like to have Acrobat find each of these usernames, extract the pages with the current index search term, and name the new PDF username.pdf. Thoughts?
Jhuanderson Maci says:

October 12, 2017 at 2:51 pm

Hey Karl,

This is great! I modified the code so I can add multiple string in an array. I was wondering if you could give advise to be able to extract based on two strings. Ex: instead of searching “Total”, it would search “Total Diff”.
Julian says:

October 17, 2017 at 12:37 pm

Hi Karl,

Love many of your posts, and thanks so much for posting this script – it’s getting me 99% of the way to where I want to be. My current challenge is saving the resulting file once it’s produced.

I’ve modified your script such that it looks through an array of unique IDs, extracting just those pages on which the various unique IDs appear. This works like a charm and the 4,000 page (20,405 KB) PDF I have is being filtered to exactly the 1,280 pages I need.

The challenge is that the .tmp file that is created won’t save. When I try to save or save as, I eventually get a pop up error message that says “Out of memory.” When I located the .tmp file in the local temp folder on my windows machine, I found that in addition to a file of the same name as what was appearing in the new file your script creates “A9RE459.tmp” (4 KB) there was another file with the same creation time with the name “A9RE45A.tmp” (2,534,408 KB).

Any thoughts on how to troubleshoot this would be greatly appreciated.
Karl Heinz Kremer says:

October 20, 2017 at 11:18 am

Julian, this is a hard one… Because we don’t know why exactly Acrobat is failing (the error message may not even be accurate), I would try to process the 4000 page document in multiple batches (e.g. 1000 pages at a time), and then concatenate the output files.
Karl Heinz Kremer says:

October 20, 2017 at 11:23 am

Jhuanderson, you will have to manually find the two strings that are in the same area of the page. If both are on the same line, it’s pretty straight forward, but when there is a line break between the two words, it gets very complex. Take a look at the Doc.getPageNthWordQuads() function: http://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/index.html#t=Acro12_MasterBook%2FJS_API_AcroJS%2FDoc_methods.htm%23TOC_getPageNthWordQuadsbc-55&rhtocid=_6_1_8_23_1_54
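
For the simple case where both words are on the same line, a sketch could look like the following. It assumes an unrotated page with left-to-right text, uses only the first quad of every word, and the names `wordBounds`, `follows` and `pageHasWordPair` as well as the `maxGap` parameter (the largest allowed horizontal distance between the two words, in points) are made up for this illustration.

```javascript
// Bounding box of a word, computed from its first quad (8 numbers = 4 corner points).
// Using min/max avoids relying on a particular order of the corners.
function wordBounds(quads) {
    var q = quads[0];
    var xs = [q[0], q[2], q[4], q[6]];
    var ys = [q[1], q[3], q[5], q[7]];
    return {
        left: Math.min.apply(null, xs),
        right: Math.max.apply(null, xs),
        bottom: Math.min.apply(null, ys),
        top: Math.max.apply(null, ys)
    };
}

// True if b is on (roughly) the same line as a and starts within maxGap to the right of a.
function follows(a, b, maxGap) {
    var overlap = Math.min(a.top, b.top) - Math.max(a.bottom, b.bottom);
    var minHeight = Math.min(a.top - a.bottom, b.top - b.bottom);
    var gap = b.left - a.right;
    return overlap > 0.5 * minHeight && gap >= -1 && gap <= maxGap;
}

// True if page p has an occurrence of "first" directly followed by "second",
// regardless of the order in which the word finder reports the two words.
function pageHasWordPair(doc, p, first, second, maxGap) {
    var firsts = [];
    var seconds = [];
    var numWords = doc.getPageNumWords(p);
    for (var n = 0; n < numWords; n++) {
        var word = doc.getPageNthWord(p, n);
        if (word == first) {
            firsts.push(wordBounds(doc.getPageNthWordQuads(p, n)));
        }
        if (word == second) {
            seconds.push(wordBounds(doc.getPageNthWordQuads(p, n)));
        }
    }
    for (var i = 0; i < firsts.length; i++) {
        for (var j = 0; j < seconds.length; j++) {
            if (follows(firsts[i], seconds[j], maxGap)) {
                return true;
            }
        }
    }
    return false;
}

// In Acrobat:
// if (pageHasWordPair(this, p, "Total", "Diff", 10)) { pageArray.push(p); }
```

A suitable value for `maxGap` depends on the font size and spacing used in the document, so it needs some experimenting.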
Karl Heinz Kremer says:

October 20, 2017 at 11:25 am

Soren, there is no easy way to do that. You would have to open the CSV file as a text stream, convert that stream to a string and then parse the CSV data. This is how you would start: http://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/index.html#t=Acro12_MasterBook%2FJS_API_AcroJS%2Futil_methods.htm%23TOC_readFileIntoStreambc-6&rhtocid=_6_1_8_78_0_5
Prakash Dara says:

November 2, 2017 at 3:14 am

Hi,
I Want to Extract Particular String from pdf and also particular Page

Prakash Dara
Karl Heinz Kremer says:

November 2, 2017 at 10:14 am

Prakash, if you need to search for a string containing multiple words, you need to do a lot more in order to test for proximity of words. That’s a much larger and much more complex project. If you need help with that, and you are interested in my professional services, feel free to get in touch with me via email. My email address is on the “About” page.
Adriana Rojas says:

November 22, 2017 at 3:18 pm

Hi Karl, thank you for posting this info. I’m not a coder or programmer, just a heads up. I’m close to getting the results I want (searching for pages with specific text and deleting them from PDF) – 1) regarding the string I’m searching for, can I use symbols? the exact term I want to search for is “(0)” (paren and number zero) and 2) I modified your script from page extraction to deletion as you recommended above however, it’s either not working or working VERY slowly (PDF is only 189 pgs). Any ideas what I can do to reduce the time?
Karl Heinz Kremer says:

November 23, 2017 at 12:21 pm

Adriana, this process is very slow. There is no way around it. It’s not using the “normal” search/find function, so you cannot compare the execution speed with how fast search/find would find something. Take a look at the “bStrip” parameter that controls if punctuation marks will be removed or not: https://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/index.html#t=Acro12_MasterBook%2FJS_API_AcroJS%2FDoc_methods.htm%23TOC_getPageNthWordbc-54&rhtocid=_6_1_8_23_1_53
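
A sketch of how the “bStrip” parameter could be used: passing `false` as the third argument to `getPageNthWord()` keeps punctuation and white space, and the search then checks whether the raw word contains the text. How exactly the word finder splits something like “(0)” into words is not something this sketch can promise, so the result needs to be verified on the actual document. The function name `findPagesContaining` is made up for this illustration.

```javascript
// Returns the (zero-based) page numbers of all pages on which at least one
// unstripped word contains the given text.
function findPagesContaining(doc, text) {
    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p);
        for (var n = 0; n < numWords; n++) {
            // third argument false: do not strip punctuation and white space
            var rawWord = doc.getPageNthWord(p, n, false);
            if (rawWord.indexOf(text) != -1) {
                pages.push(p);
                break;
            }
        }
    }
    return pages;
}

// In Acrobat:
// var pageArray = findPagesContaining(this, "(0)");
```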
Ysh says:

February 27, 2018 at 1:20 pm

Karl,

Thank you for this amazing script. I think I found the answer to my question in the comments but want to double check. If I am looking for a string that has multiple words (i.e. “No Change”) would this script not work?

I also saw solutions for nth word. I do not quite understand this, but will read more if it is a workaround for my issue.
Karl Heinz Kremer says:

February 27, 2018 at 1:47 pm

Ysh, this solution will only work for one word. If you want to use more than one word, you will have to implement a way to group words based on their location. Acrobat may not return two consecutive words in the document in that order when you iterate over all words.
Sylvia says:

March 5, 2018 at 2:00 pm

Hi Karl, Is there a way to create a folder for all of the extracted pages. I have about 400 separate pdfs, the extracted pdfs are basically stacked on each other on my desktop waiting for me to save. I would like to do that task at a later time, so I just want to store them in a folder on my desktop for now. If there is no easy way I guess I could always combine all the files then extract to have one extracted file instead of 400.
Alex says:

March 7, 2018 at 4:52 pm

Karl,
I have about 1,000 pdf files and each file has about 50 pages. I want to split/extract the pages out of each file into its own file (should be 1-3 pages). The pdf file contains Contract Name. I want the file to print every time it finds a new contract name (the contract name is to the right of “contract name: ”). It is usually 1 contract per page, but some contracts may have up to 3 pages (could be more, but that is what I found so far). I also want to use the contract name as part of the new file name. How can I do this? I have adobe acrobat and ms office 2010. I’m very familiar with vba but I am open to doing it with another language/technology. Any help is appreciated.
Drew says:

March 8, 2018 at 2:30 pm

So how do I add the code below to your original code to enable searching for multiple strings?
var nthWord = this.getPageNthWord(p, n);
if (nthWord == stringToSearchFor_1 || nthWord == stringToSearchFor_2 || nthWord == stringToSearchFor_3) {
// …
Karl Heinz Kremer says:

March 8, 2018 at 2:48 pm

Drew, there is nothing to “enable” – the JavaScript API does not allow you to search for multiple words in one operation. You will always just get one word at a time. You will have to implement the logic to match n words on your own. The problem you will have to deal with is that words that seem to be right next to each other may not get reported in the correct sequence by Acrobat. This means that you will have to use the location of a word to verify that you are indeed dealing with e.g. the three words you are looking for.
Karl Heinz Kremer says:

March 8, 2018 at 2:49 pm

Alex, this is a little bit too complex for a reply to a blog post. I’ve done this a number of times, and what it boils down to is that you create a folder level JavaScript function that does the heavy lifting in JavaScript, and you then call this function from your VBA application. If you need professional help with this, feel free to get in touch with me via email. My email address is on my “About” page.
Drew says:

March 8, 2018 at 2:52 pm

I am only looking for one word scripts. So basically, pull the pdf if it contains the word “Cat” or “Dog” or “Fish”.
Karl Heinz Kremer says:

March 9, 2018 at 8:26 am

Sylvia, you cannot create a folder from within Acrobat. You would need to create that folder before you run the script and then use that folder as part of your output path.
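As a rough illustration of using an existing folder as part of the output path, here is a minimal sketch. The helper name buildOutputPath and the folder "/c/temp/extracted" are made up for this example; Doc.saveAs() is the Acrobat call that would write the file, and with a cPath it only runs in a privileged context (a batch or console event).

```javascript
// Hypothetical helper: derive an output file path in an existing folder
// from the path of the source document (device-independent path syntax).
function buildOutputPath(outputFolder, sourcePath, suffix) {
	// take the file name without its folder and without the ".pdf" extension
	var fileName = sourcePath.substring(sourcePath.lastIndexOf("/") + 1);
	var baseName = fileName.replace(/\.pdf$/i, "");
	// make sure the folder ends with exactly one slash
	var folder = outputFolder.replace(/\/*$/, "/");
	return folder + baseName + suffix + ".pdf";
}

// In Acrobat, with "d" being the new document created by the script:
// d.saveAs({ cPath: buildOutputPath("/c/temp/extracted", this.path, "_extracted") });
// d.closeDoc(true);
```

The folder itself still has to exist before the script runs.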
Karl Heinz Kremer says:

March 9, 2018 at 8:31 am

Drew, In that case, what you had in your first comment should work – just replace the original line that checks for a match (‘if (this.getPageNthWord(p, n) == stringToSearchFor) {‘) with the two lines you posted earlier.
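For a longer list of single words, the chain of || comparisons can be swapped for a loop over an array. This is only a sketch: the function name findPagesWithAnyWord is made up, and the "doc" parameter stands for the document (this) in the original script.

```javascript
// Returns the zero-based numbers of all pages of "doc" that contain at
// least one of the words in "words" (exact, case-sensitive match).
function findPagesWithAnyWord(doc, words) {
	var pages = [];
	for (var p = 0; p < doc.numPages; p++) {
		var numWords = doc.getPageNumWords(p); // ask only once per page
		var found = false;
		for (var n = 0; n < numWords && !found; n++) {
			var word = doc.getPageNthWord(p, n);
			for (var w = 0; w < words.length; w++) {
				if (word == words[w]) {
					found = true;
					break;
				}
			}
		}
		if (found) {
			pages.push(p);
		}
	}
	return pages;
}

// In Acrobat: var pageArray = findPagesWithAnyWord(this, ["Cat", "Dog", "Fish"]);
```

The resulting array can be fed into the extraction loop of the original script unchanged.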
Drew says:

March 9, 2018 at 1:46 pm

Something like this?

// Iterates over all pages and find a given string and extracts all
// pages on which that string is found to a new file.

var pageArray = [];

var stringToSearchFor_1 = "Dog";
var stringToSearchFor_2 = "Cat";

for (var p = 0; p < this.numPages; p++) {
	// iterate over all words
	for (var n = 0; n < this.getPageNumWords(p); n++) {
		var nthWord = this.getPageNthWord(p, n);
		if (nthWord == stringToSearchFor_1 || nthWord == stringToSearchFor_2) {
			pageArray.push(p);
			break;
		}
	}
}

if (pageArray.length > 0) {
	// extract all pages that contain the string into a new document
	var d = app.newDoc(); // this will add a blank page – we need to remove that once we are done
	for (var n = 0; n < pageArray.length; n++) {
		d.insertPages( {
			nPage: d.numPages-1,
			cPath: this.path,
			nStart: pageArray[n],
			nEnd: pageArray[n],
		} );
	}

	// remove the first page
	d.deletePages(0);

}
Drew says:

March 9, 2018 at 1:48 pm

// Iterates over all pages and find a given string and extracts all
// pages on which that string is found to a new file.

var pageArray = [];

var stringToSearchFor_1 = "EQTY";
var stringToSearchFor_2 = "StkComp";

for (var p = 0; p < this.numPages; p++) {
	// iterate over all words
	for (var n = 0; n < this.getPageNumWords(p); n++) {
		var nthWord = this.getPageNthWord(p, n);
		if (nthWord == stringToSearchFor_1 || nthWord == stringToSearchFor_2) {
			pageArray.push(p);
			break;
		}
	}
}

if (pageArray.length > 0) {
	// extract all pages that contain the string into a new document
	var d = app.newDoc(); // this will add a blank page – we need to remove that once we are done
	for (var n = 0; n < pageArray.length; n++) {
		d.insertPages( {
			nPage: d.numPages-1,
			cPath: this.path,
			nStart: pageArray[n],
			nEnd: pageArray[n],
		} );
	}

	// remove the first page
	d.deletePages(0);

}
Denis says:

March 21, 2018 at 11:37 am

Hi, I’m wondering if the “keyword” must be only letters as opposed to numbers or a combination of alphanumeric? I am searching documents for specific letter/number combinations… When I’ve used a specific word it works perfectly.

Or does this have something to do with my source document.

Thanks in advance!
Karl Heinz Kremer says:

March 21, 2018 at 1:22 pm

Denis, Acrobat uses different methods to determine where a “word” ends. It is possible that whatever combination of letters and numbers you have does not qualify as a word in Acrobat. Without seeing the document, it’s impossible to tell what the reason is. You can let Acrobat enumerate all words on a page to see what it actually considers to be a word. The following script will print all words to the console (this may overload the console window, depending on how much text you have on the page):

for (var i=0; i
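A minimal sketch of such a word listing, written as a small helper. The name listWords is made up for this example, and passing false as the third argument of getPageNthWord() (the bStrip parameter) is a choice made here so that punctuation stays visible.

```javascript
// Collects every "word" that Acrobat reports for one page.
// bStrip = false keeps punctuation and whitespace attached to the words.
function listWords(doc, pageNum) {
	var words = [];
	var numWords = doc.getPageNumWords(pageNum);
	for (var i = 0; i < numWords; i++) {
		words.push(doc.getPageNthWord(pageNum, i, false));
	}
	return words;
}

// In the Acrobat JavaScript console, for the page currently shown:
// console.println(listWords(this, this.pageNum).join("|"));
```

Joining with a visible separator such as "|" makes it easier to see where Acrobat splits a letter/number combination.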
James Daniel says:

April 4, 2018 at 7:33 am

Hi Karl, many thanks for your effort on this script. I have a long list of terms to search for, and the results for each term need to be shown in a separate file. I have very little experience with JavaScript but thought storing the list of terms in an array and then looping through that array might do the trick. I came up with the following but seem to be having no luck – any insight much appreciated!
// Iterates over all pages and find a given string and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringsToSearchFor = ["W11804", "W11707", "W11003"];
for (var i = 0; i < stringsToSearchFor.length; i++)
{
	for (var p = 0; p < this.numPages; p++) {
		// iterate over all words
		for (var n = 0; n < this.getPageNumWords(p); n++) {
			if (this.getPageNthWord(p, n) == stringsToSearchFor[i]) {
				pageArray.push(p);
				break;
			}
		}
	}

	if (pageArray.length > 0) {
		// extract all pages that contain the string into a new document
		var d = app.newDoc();    // this will add a blank page – we need to remove that once we are done
		for (var n = 0; n < pageArray.length; n++) {
			d.insertPages( {
				nPage: d.numPages-1,
				cPath: this.path,
				nStart: pageArray[n],
				nEnd: pageArray[n],
			} );
		}
		// remove the first page
		d.deletePages(0);
	}
}
Karl Heinz Kremer says:

April 4, 2018 at 12:26 pm

James, something like the following should work in your case:


var stringsToSearchFor = ["W11804", "W11707", "W11003"];

for (var i = 0; i < stringsToSearchFor.length; i++) {
	var d = app.newDoc(); // this will add a blank page - we need to remove that once we are done
	for (var p = 0; p < this.numPages; p++) {
		// iterate over all words
		for (var n = 0; n < this.getPageNumWords(p); n++) {
			// extract all pages that contain the string into a new document
			if (this.getPageNthWord(p, n) == stringsToSearchFor[i]) {
				d.insertPages({
					nPage: d.numPages - 1,
					cPath: this.path,
					nStart: p,
					nEnd: p
				});
				break; // one match is enough - do not insert the same page again
			}
		}
	}
	// remove the first (blank) page, but only if at least one page was
	// inserted - a document cannot lose its only page
	if (d.numPages > 1) {
		d.deletePages(0);
	}
}
VaibhavVyas says:

April 11, 2018 at 2:32 pm

Karl,
I need to remove pages from a PDF if they match certain strings. There may be multiple strings. Can you please help with the code for this?
Karl Heinz Kremer says:

April 11, 2018 at 2:49 pm

VaibhavVyas – this simple solution can only search for one “word” at a time. If you need something that can identify multiple strings, the solution becomes much more complex – too complex for a blog post (or a comment to a blog post).
Vaibhav Vyas says:

April 11, 2018 at 3:03 pm

Hey Karl,
I understand, but what would be the command to delete pages from a PDF rather than extracting them?
Appreciate your quick response.
Karl Heinz Kremer says:

April 11, 2018 at 3:09 pm

Take a look at the documentation for Doc.deletePages(): https://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/index.html#t=Acro12_MasterBook%2FJS_API_AcroJS%2FDoc_methods.htm%23TOC_deletePagesbc-20&rhtocid=_6_1_8_23_1_19
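A minimal sketch of how the extraction script might be turned into a deletion script with Doc.deletePages(). The function name deletePagesContaining is made up for this example, and "doc" stands for the document (this). Pages are deleted from the back so that the page numbers collected earlier stay valid, and the sketch refuses to delete every page because a PDF must keep at least one.

```javascript
// Deletes every page of "doc" that contains "word" and returns the
// (original) page numbers that were removed.
function deletePagesContaining(doc, word) {
	var hits = [];
	for (var p = 0; p < doc.numPages; p++) {
		var numWords = doc.getPageNumWords(p);
		for (var n = 0; n < numWords; n++) {
			if (doc.getPageNthWord(p, n) == word) {
				hits.push(p);
				break;
			}
		}
	}
	// a PDF must keep at least one page, so refuse to delete everything
	if (hits.length >= doc.numPages) {
		return [];
	}
	// delete from the back so that earlier page numbers stay valid
	for (var i = hits.length - 1; i >= 0; i--) {
		doc.deletePages({ nStart: hits[i], nEnd: hits[i] });
	}
	return hits;
}

// In Acrobat: deletePagesContaining(this, "Total");
```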
Vaibhav Vyas says:

April 11, 2018 at 3:10 pm

Thanks. Instead of a single word, I have a string that has multiple words. Can you help with that?
Karl Heinz Kremer says:

April 11, 2018 at 3:15 pm

That is where it gets complicated, and you are on your own (unless you want to pay me to work on a solution). You will have to look for the words in your string individually, and then, based on the location of each word (which you can get via Doc.getPageNthWordQuads – https://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/index.html#t=Acro12_MasterBook%2FJS_API_AcroJS%2FDoc_methods.htm%23TOC_getPageNthWordQuadsbc-55&rhtocid=_6_1_8_23_1_54 ) determine if they are in fact in the correct sequence, without anything else in between.
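To give an idea of what such a location check could look like, here is a rough sketch, not a complete solution. It assumes that getPageNthWordQuads() returns a list of quads with eight numbers each (four x/y corner pairs), that the phrase sits on one horizontal line, and that a small gap (maxGap, in points) between two words means nothing else is in between. All function names are made up for this example.

```javascript
// Bounding box of a word, computed from its quads without relying on corner order.
function wordBox(doc, p, n) {
	var quads = doc.getPageNthWordQuads(p, n);
	var box = { left: Infinity, right: -Infinity, bottom: Infinity, top: -Infinity };
	for (var q = 0; q < quads.length; q++) {
		for (var k = 0; k < 8; k += 2) {
			box.left = Math.min(box.left, quads[q][k]);
			box.right = Math.max(box.right, quads[q][k]);
			box.bottom = Math.min(box.bottom, quads[q][k + 1]);
			box.top = Math.max(box.top, quads[q][k + 1]);
		}
	}
	return box;
}

// True if box b sits on the same line as box a and directly to its right.
function follows(a, b, maxGap) {
	var sameLine = Math.abs(a.bottom - b.bottom) < (a.top - a.bottom) / 2;
	var gap = b.left - a.right;
	return sameLine && gap >= 0 && gap <= maxGap;
}

// Does page p contain the words in "phrase" (an array) next to each other?
function pageContainsPhrase(doc, p, phrase, maxGap) {
	var numWords = doc.getPageNumWords(p);
	var words = [];
	for (var n = 0; n < numWords; n++) {
		words.push({ text: doc.getPageNthWord(p, n), box: wordBox(doc, p, n) });
	}
	for (var s = 0; s < words.length; s++) {
		if (words[s].text != phrase[0]) continue;
		var current = words[s];
		var matched = 1;
		while (matched < phrase.length) {
			var next = null;
			// the next word may be anywhere in Acrobat's word order
			for (var c = 0; c < words.length; c++) {
				if (words[c].text == phrase[matched] &&
						follows(current.box, words[c].box, maxGap)) {
					next = words[c];
					break;
				}
			}
			if (!next) break;
			current = next;
			matched++;
		}
		if (matched == phrase.length) return true;
	}
	return false;
}

// In Acrobat: pageContainsPhrase(this, 0, ["No", "Change"], 6);
```

Phrases that wrap to the next line, rotated pages, or unusual word spacing would need more work than this sketch does.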
Pasquale Ceglie says:

June 6, 2018 at 6:20 am

I have a PDF with 2300 pages. Six hours after starting the script, the application is still in “not responding” mode. Suggestions?

Thanks
Karl Heinz Kremer says:

June 6, 2018 at 8:45 am

Pasquale, break up the document into smaller documents (e.g. 500 pages each) and then process each of these smaller documents, and then combine the individual outputs into the final output document.
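The page ranges for such a split could be computed as in the following sketch. The helper name chunkRanges and the output location "/c/temp/part_" are made up for this example; Doc.extractPages() with a cPath needs a privileged context.

```javascript
// Splits "numPages" pages into consecutive ranges of at most "chunkSize" pages.
function chunkRanges(numPages, chunkSize) {
	var ranges = [];
	for (var start = 0; start < numPages; start += chunkSize) {
		ranges.push({
			nStart: start,
			nEnd: Math.min(start + chunkSize, numPages) - 1
		});
	}
	return ranges;
}

// In Acrobat, each range could then be written to its own file:
// var ranges = chunkRanges(this.numPages, 500);
// for (var i = 0; i < ranges.length; i++) {
//     this.extractPages({ nStart: ranges[i].nStart, nEnd: ranges[i].nEnd,
//                         cPath: "/c/temp/part_" + i + ".pdf" });
// }
```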
Tess says:

June 18, 2018 at 4:20 pm

Karl. You are a hero among mere mortals. FOUR YEARS LATER and you are responding to comments?!?!?!! If only the rest of the internets could be so awesome.
Oh, also – your scripts have made my life SO much easier. I first used your extract script a couple of years ago and have it bookmarked. I came back here today with the same question many had – how to DELETE the pages with the specified string. Sure enough, there it was. And it works beautifully, Karl. Beautifully.
You have not only saved me time and tedious page-extraction/deletion, you have gotten me interested in other things I could try to make Adobe’s PDFs do! But I know my limits, and will certainly work hard to advocate for you as a consultant should the time come.
Karl Heinz Kremer says:

June 18, 2018 at 4:26 pm

Hi Tess, I am glad that the information I provide is helpful to you. BTW: I also provide training, so if you want to learn more about scripting for Acrobat, or Acrobat and PDF in general, I can certainly help you with that as well 🙂
shiva says:

September 21, 2018 at 12:44 am

How can I run this code through VBS? I have Adobe Pro, but I want to use the VBS or VBA option.
Karl Heinz Kremer says:

September 27, 2018 at 8:25 am

Shiva, this is JavaScript, so it will not work directly in VBA – you can, however, use the JSObject in the IAC API to execute JavaScript from within VBA. Please see my VBA/JSObject related posts.
Abdul says:

October 21, 2018 at 3:27 am

Dear Karl,
Thank you for this article. It is excellent.

I want to break the PDF up at a specific text and save each part as an individual file. Where do I need to change the JavaScript?
Karl Heinz Kremer says:

October 22, 2018 at 4:39 pm

Abdul, this would require a lot of changes. If you are familiar with how to program for Acrobat, then this blog post should give you all the information that is needed to complete such a solution. If you think this is too complex, then unfortunately, somebody has to create this solution for you.
Ian says:

October 23, 2018 at 4:33 pm

Hello Karl,

I am not a programmer. I happened across your page after hours of unsuccessful searching. I have a single large PDF containing about 50 resumes. I am looking to extract each resume into a separate file based on specific ones I select. Resumes typically range from 1 to 3 pages. The name on each page is what I use to identify each resume. So I may have “John Smith” on one resume, “Mary-Anne Jones” on another, and “Michael J. Jackson” on a third. Can you tell me what the code should look like, please?
Your time is appreciated!
Ian Dey says:

October 24, 2018 at 2:52 pm

Hey Karl,
Okay, I have played around with your code and realize that it only finds single words, not multiple words entered in a single string. I would like to search for and find one or more email addresses. Can you help me figure out what changes would have to be made to the code to find “fname.lname@domain.com”, please? The code does not seem to recognize the special characters or spaces.
Appreciate your help.
Karl Heinz Kremer says:

October 24, 2018 at 3:30 pm

Ian, resumes are usually not standardized, so you would not know where one starts and where it ends again. This makes this approach pretty much impossible. One option would be to “mark” every first page of a resume (e.g., by using an annotation). You would then still need to figure out what the name is. If you use the highlight annotation to mark the start of a resume, you can actually configure Acrobat to copy the highlighted text into the annotation. This would then allow you to extract resumes, using the applicant’s name as part of the filename. This is not a trivial task, and unfortunately, too complex to just whip up some code. If you are interested in my professional services to help you with this, feel free to contact me via email. My email address is on the “About” page.
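Just to illustrate the first step of the annotation idea, the sketch below turns highlight annotations into page ranges. It assumes exactly one highlight on the first page of each resume and that the highlighted name has been copied into the annotation's contents; the function name sectionsFromHighlights is made up. Extracting and naming the files is left out.

```javascript
// Turns highlight annotations into page ranges: every highlight marks the
// first page of a section, and a section runs until the next marker.
function sectionsFromHighlights(doc) {
	var annots = doc.getAnnots() || []; // getAnnots() returns null if there are none
	var markers = [];
	for (var i = 0; i < annots.length; i++) {
		if (annots[i].type == "Highlight") {
			markers.push({ page: annots[i].page, name: annots[i].contents });
		}
	}
	// process the markers in page order
	markers.sort(function (a, b) { return a.page - b.page; });
	var sections = [];
	for (var m = 0; m < markers.length; m++) {
		var last = (m + 1 < markers.length) ? markers[m + 1].page - 1 : doc.numPages - 1;
		sections.push({ name: markers[m].name, nStart: markers[m].page, nEnd: last });
	}
	return sections;
}

// In Acrobat: var sections = sectionsFromHighlights(this);
```

Each entry then has a name plus the nStart/nEnd values that Doc.extractPages() or Doc.insertPages() expect.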
Karl Heinz Kremer says:

October 24, 2018 at 3:33 pm

Ian, when you look at the API documentation for the getPageNthWord() function, you will see that Acrobat only allows us to compare one word at a time. To process more complex strings, you will need to handle this complexity in your code. As far as special characters go, you can specify if you want to receive them, or if Acrobat should drop them – this is a parameter to getPageNthWord().
Ian Dey says:

October 25, 2018 at 2:42 pm

Thanks Karl! In the case of special characters, if I use the “false” parameter with the getPageNthWord() function, will Acrobat then look at the email address “fname.lname@domain.com” as being one word in the following line:
if (this.getPageNthWord(p, n, false) == stringToSearchFor)
When I try it, it returns nothing.
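One possible explanation, offered here as an assumption rather than a confirmed answer, is that Acrobat still splits the address into several words at the punctuation, so no single word ever equals the whole address. A sketch of a workaround under that assumption: glue the unstripped words of a page back together and search the resulting text. It also assumes the pieces are reported in reading order; the function names are made up.

```javascript
// Rebuilds the text of a page by concatenating all words with bStrip = false,
// so that punctuation such as "." and "@" is kept.
function getPageText(doc, p) {
	var text = "";
	var numWords = doc.getPageNumWords(p);
	for (var n = 0; n < numWords; n++) {
		text += doc.getPageNthWord(p, n, false);
	}
	return text;
}

// True if the rebuilt page text contains "searchText" anywhere.
function pageContainsText(doc, p, searchText) {
	return getPageText(doc, p).indexOf(searchText) != -1;
}

// In Acrobat: if (pageContainsText(this, p, "fname.lname@domain.com")) { pageArray.push(p); }
```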
Diane W says:

December 5, 2018 at 9:25 am

Is there any way to use the script above to extract the page with the search word and then the page following it? I have a 30,000-page document and need it sorted by associate, but each sub-document is 2 pages long. The second page, however, doesn’t have the search words on it.
Karl Heinz Kremer says:

December 5, 2018 at 9:30 am

Diane, the “Doc.insertPages()” function takes a start page and an end page for the pages that will be inserted into the document. In the code I provided, both are set to “pageArray[n]” – if you set nEnd to this plus one, you will extract the page with the search term on it, and the following page:
nEnd: pageArray[n]+1,
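If the search term can also show up on the very last page, the plus one would point past the end of the document. A small sketch of a guard for that case, with a made-up helper name:

```javascript
// Page range for a match on page "page" plus the page after it, without
// running past the last page of a document with "numPages" pages.
function rangeWithFollowingPage(page, numPages) {
	return { nStart: page, nEnd: Math.min(page + 1, numPages - 1) };
}

// In the extraction loop:
// var r = rangeWithFollowingPage(pageArray[n], this.numPages);
// d.insertPages({ nPage: d.numPages - 1, cPath: this.path, nStart: r.nStart, nEnd: r.nEnd });
```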
Sitakanta says:

February 6, 2019 at 2:04 pm

Hi Karl,
Hope you are doing well.

I would like a little help extracting highlighted pages from a PDF into a separate PDF file.

In Acrobat Pro DC, can this be done by a custom action or JavaScript?
Sitakanta
Karl Heinz Kremer says:

March 6, 2019 at 9:01 am

Sitakanta, no, you only have access to the page selection from within a custom plug-in, not JavaScript or an Action.
Marvin says:

March 25, 2019 at 6:36 pm

Karl,

How can I add a blank page after the one highlighted in Page Thumbnails? Can I have the background of the new page(s) yellow? All I can find is how to add pages before, after and every other page.

Thanks!
Karl Heinz Kremer says:

April 25, 2019 at 1:36 pm

Marvin, you would need a custom plug-in for that, and that’s where it gets expensive.
AC says:

September 6, 2019 at 10:59 am

Hello there,
Is there a way to use the code to split pages based on character count? I have been trying to figure this out using this code for a project, but without success.
Thanks!
Karl Heinz Kremer says:

September 20, 2019 at 1:38 pm

Not really. You may be able to get all the text of a page and count the characters, but that will not be completely accurate (it does not account for spaces and other whitespace for example). If you just need a rough number, then that should work.
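A minimal sketch of such a rough count, with a made-up function name. It adds up the lengths of all words on a page; whitespace is not counted, and with the default bStrip setting punctuation is dropped as well.

```javascript
// Approximate character count of a page: the summed length of all words.
function countPageCharacters(doc, p) {
	var count = 0;
	var numWords = doc.getPageNumWords(p);
	for (var n = 0; n < numWords; n++) {
		count += doc.getPageNthWord(p, n).length;
	}
	return count;
}

// In Acrobat: console.println(countPageCharacters(this, this.pageNum));
```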
HABIB says:

September 24, 2019 at 3:15 am

Dear Karl Heinz Kremer,

How do I search for more than one string at a time, where some strings are in brackets and some are in CAPITAL letters {Example: I want to search for LAND LORD, landlord, land lord, (landlord)}, and output all the pages in one shot?

Kindly give the complete script.

Thanks
Karl Heinz Kremer says:

September 27, 2019 at 1:28 pm

HABIB, searching for more than one string gets complicated. You can do this with regular expressions and some boolean constructs.
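As a rough sketch of the regular expression idea, not a complete script: the words of a page are joined with single spaces and tested against one pattern that covers the spelling variants. This assumes Acrobat reports the words in reading order, which is not guaranteed; the names pageMatches and landlordPattern are made up.

```javascript
// Joins the (stripped) words of a page with single spaces and tests the
// result against a regular expression.
function pageMatches(doc, p, regex) {
	var words = [];
	var numWords = doc.getPageNumWords(p);
	for (var n = 0; n < numWords; n++) {
		words.push(doc.getPageNthWord(p, n));
	}
	return regex.test(words.join(" "));
}

// "landlord", "LAND LORD", "land lord" and "(landlord)" in any letter case;
// brackets are punctuation and are dropped by the default bStrip = true
var landlordPattern = /\bland ?lord\b/i;

// In Acrobat: if (pageMatches(this, p, landlordPattern)) { pageArray.push(p); }
```

Several patterns can be combined with || or && in the if statement.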
Karlie Lachendro says:

October 18, 2023 at 9:18 am

Hello 🙂 Do you happen to have any scripts that would allow me to extract pages and title them by a certain feature in the PDF? For example, I have a large file with W2s for all my employees that I am trying to separate by page and save by SSN, so I can easily figure out which document belongs to which employee.
Am I in LaLa Land thinking this is a possibility?
Karl Heinz Kremer says:

November 28, 2023 at 11:09 am

This can be done (as long as the document in question is willing to give up that information), but I don’t have anything that would do that out of the box. There is always custom development.
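To show the general shape such a custom solution might take, here is a sketch under several assumptions: the number appears on the page as text in the form 123-45-6789, Acrobat reports its pieces in reading order, and every page is its own document. The function name findSsnOnPage and the folder "/c/temp/w2/" are made up, and Doc.extractPages() with a cPath needs a privileged context.

```javascript
// Looks for something shaped like 123-45-6789 in the text of a page and
// returns it, or null if there is none.
function findSsnOnPage(doc, p) {
	var text = "";
	var numWords = doc.getPageNumWords(p);
	for (var n = 0; n < numWords; n++) {
		text += doc.getPageNthWord(p, n, false); // bStrip = false keeps the hyphens
	}
	var match = /\b\d{3}-\d{2}-\d{4}\b/.exec(text);
	return match ? match[0] : null;
}

// In Acrobat, one file per page, written to an already existing folder:
// for (var p = 0; p < this.numPages; p++) {
//     var ssn = findSsnOnPage(this, p);
//     if (ssn) {
//         this.extractPages({ nStart: p, nEnd: p, cPath: "/c/temp/w2/" + ssn + ".pdf" });
//     }
// }
```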




	

	
Contact
			KHKonsulting LLC

khkonsulting@khk.net