Extract PDF Pages Based on Content

How would we identify pages in a PDF document that contain a certain word and extract those pages into a new document? This can be done with a few lines of JavaScript, and there are different ways to run it: we can create a folder level JavaScript and install it in one of Acrobat’s JavaScript folders (see here for more information about how to identify the folder where to install such a script), or we can create an Action that executes the JavaScript. In the past I’ve written about how to create folder level scripts (e.g. here), so let’s create an Action today.

Here is the script that we will be using:

// Iterates over all pages, looks for a given string, and extracts all
// pages on which that string is found to a new document.

var pageArray = [];

var stringToSearchFor = "Total";

for (var p = 0; p < this.numPages; p++) {
	// iterate over all words
	for (var n = 0; n < this.getPageNumWords(p); n++) {
		if (this.getPageNthWord(p, n) == stringToSearchFor) {
			pageArray.push(p);
			break;
		}
	}
}

if (pageArray.length > 0) {
	// extract all pages that contain the string into a new document
	var d = app.newDoc();    // this will add a blank page - we need to remove that once we are done
	for (var n = 0; n < pageArray.length; n++) {
		d.insertPages( {
			nPage: d.numPages-1,
			cPath: this.path,
			nStart: pageArray[n],
			nEnd: pageArray[n],
		} );
	}

    // remove the first page
    d.deletePages(0);
    
}

The script is pretty straightforward: We iterate over all pages, and on each page we loop over all words until we find the word we are looking for. When we find it, we add the page number to an array of page numbers and move on to the next page.

If, after all this looping, we have information in this array of page numbers, we process that list by creating a new document (which will add a blank page – a PDF document always has to have at least one page), and then we add each page from the original document that we find in the array. All that’s left now is to remove that initial blank page.
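
The page-collection loop described above can also be packaged as a function, which makes it easier to search for several words at once or to ignore case. The following is a minimal sketch and not part of the original script: findPagesWithWords and its parameters are hypothetical names, and it relies only on numPages, getPageNumWords() and getPageNthWord(), which the script above already uses.

```javascript
// Sketch: return the zero-based numbers of all pages that contain at least
// one of the given words. "doc" is expected to be an Acrobat Doc object
// (e.g. "this" inside an Action step). The function name and its
// parameters are hypothetical, chosen for this example only.
function findPagesWithWords(doc, words, ignoreCase) {
    // Normalize the search terms once, outside the page loop.
    var wanted = [];
    for (var i = 0; i < words.length; i++) {
        wanted.push(ignoreCase ? words[i].toLowerCase() : words[i]);
    }

    var pages = [];
    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p);
        for (var n = 0; n < numWords; n++) {
            var word = doc.getPageNthWord(p, n);
            if (ignoreCase) {
                word = word.toLowerCase();
            }
            if (wanted.indexOf(word) !== -1) {
                pages.push(p);
                break; // one hit per page is enough
            }
        }
    }
    return pages;
}

// Possible use inside the Action, replacing the first loop of the script:
// var pageArray = findPagesWithWords(this, ["Total", "Totals"], true);
```

The rest of the script (creating the new document and inserting the pages) would stay the same. Note that this still compares whole words as returned by the word finder, so it does not match partial words or multi-word phrases.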

So, let’s convert this into an Action. In Acrobat XI Pro (this will not work in Standard, which does not support Actions), select “Tools>Action Wizard>Create New Action”. This will create an empty action. To add JavaScript to our Action, select the “Execute JavaScript” option under “More Tools” and move it to the right side (e.g. by clicking on the arrow button).


Once the “Execute JavaScript” step is on the right side, click on the “Specify Settings” button and paste the script from above into the editor. Once the script is part of the Action, you can prevent the editor from popping up every time you run the Action by deselecting “Prompt User” for this action step.

Save the action, give it a meaningful name and you are ready to execute it.

You can download the action here: ExtractPagesWithString.sequ. Once downloaded, just double-click on it to install it in Acrobat Pro. Again, this will not work with Acrobat Standard or the free Adobe Reader.


63 Responses to Extract PDF Pages Based on Content

  1. Elizabeth Celuck says:

    Your script is exactly what I have been searching for, so thank you for sharing it! I am getting an error message saying it is corrupt when I click on it from the download location. I also tried copying the code and pasting into notepad, saving it as an sequ, and then opening it, but still get a corrupt code error. I would appreciate any assistance you can offer. Thanks!

  2. Joe Barry says:

    Hello,

    Good code snippet.

    How might we then save down the .tmp files that pop up ? We’d like this to be more of an operating system script that saves a new file with a name of “filenane+new”, suppress any preview and commit the files to the operating system as files.

  3. Karl Heinz Kremer says:

    Elizabeth, you should be able to create a new action based on the instructions I’ve provided. You cannot just save the code snippet as a SEQU file, you will have to create a new Action, add a JavaScript step and then use the code from above for that JavaScript processing step.

  4. Karl Heinz Kremer says:

    Joe, that’s not what an Acrobat Action is about: An Action will always run in Acrobat and will display the processed file. If you want to do this from outside of Acrobat, you will have to write an application that “remote controls” Acrobat e.g. via the IAC interface using VB. Take a look at my VBA and VBScript related posts for more information. You would have to use the JSObject to use the JavaScript interface from VB or VBScript.

  5. JohnR says:

    Great idea on the posted code. I have implemented per your instructions, the code runs and says that it has executed successfully, but no document is created. The search words are correct and are simply replaced in the ‘Total’ text from the script, but nothing appears to happen. The debugger was no help either. Suggestions?

  6. Nicola F. says:

    Thank you so much for this, it opened me a whole new world!

    I got a question: is there a simple command to highlight somehow the word after the script finds it!?
    Something like:
    this.highlightPageNthWord(p, n) !?

    I just want my eyes to find it quickly when I look at the pdf after the script is executed.
    Thanks in advance!

  7. Karl Heinz Kremer says:

    Nicola, look at the Doc.selectPageNthWord() method in the API documentation.

  8. Nicola F. says:

    Karl, thanks for the quick answer!

    I checked the doc, but what you suggested seems no good for me. Or, I’m doing it wrong.

    While reading the manual I found the addAnnot command to add a Highlight, so, I did my own script to do this:
    1- Look for several words
    2-When found, highlight them
    3-Delete pages where there are no matching words
    4-Save the modifed doc with another name

    And, it works! But, it’s very slow.
    A 10 page pdf where the script finds 12 matching words takes 180 sec to process, while it takes only 2 sec if I skip step 2! And I have hundreds pages to process 🙁

    Could these few lines
    this.addAnnot({
    page: nth_page,
    type: "Highlight",
    quads: this.getPageNthWordQuads(nth_page, nth_word)
    });
    repeated 12 times make such a huge difference!?

    Thanks again

  9. Nicola F. says:

    Nevermind, for some reason I can’t understand, it didn’t like to “addNote” during the search, so, I stored the pages and quads into 2 vectors. At the end of the search, I did all the necessary addNotes together.
    Now I process approx 2000 pages in 8 minutes. Sounds good enough to me! Thanks!

  10. Stephanie A says:

    You are quite literally my favorite person today. You have taken hours off my work week. Thank you!!!!!

  11. Jason Pretorius says:

    I’m not a developer/coder at all, and this literally saved my life today.

    If I could, I would be buying you a beer right now.

    Thanks.

  12. Karl Heinz Kremer says:

    Jason, just keep me in mind for any professional needs around PDF you may come across in the future. I can only write this blog because nice people are hiring me for PDF related consulting jobs 🙂

  13. adrian says:

    Is there any way I can edit this so that it deletes the pages with the specified string?

  14. Jeff B says:

    Hi Karl,

    Your script is exactly what we were looking for but for some reason I can not get it to work. We have a 1622 page document in Acrobat Pro. Each page has either “page 1 of 1” or “page 1 of 2” or “page 2 of 2” at the bottom. We need to extract all the “page 1 of 1” pages from the document into a new document. I have copy and pasted your script and replaced where you have “title” with “page 1 of 1” . The script seems to run fine but the newly made document is the same as the previous document. Any ideas? Thanks.

  15. Karl Heinz Kremer says:

    Jeff, the “word finder” does just that, it returns one word at a time. You will have to do a bit more to get the full string containing all four parts (“page”, “1”, “of”, and “1”). There is a method to get the location of the “words”, you may have to use that to get things into the correct order.

  16. Jeff B says:

    Thanks for the quick reply. Unfortunately your answer is out of my expertise. Unless you have a webpage to point me to. Thanks.

  17. Karl Heinz Kremer says:

    Jeff, no, I don’t have any instructions that would cover that. However, if you need help implementing this, I am available. This is actually something I’ve done a few times for my customers. You can find my email address on the “About” page.

  18. Praj says:

    Hello. 🙂
    I was searching the method to extract pdf pages having same words.
    This method is very helpful.
    Thank you very much. 🙂

  19. Michael Harp says:

    Is it possible to DELETE all of the pages from a PDF document that includes a specific string of text? I have a 900+ page document that I don’t really need to extract every page that includes certain text, I need to delete the 200+ pages that includes one specific string of text that doesn’t appear anywhere else in the document. Any thoughts?

  20. Karl Heinz Kremer says:

    Yes, it’s certainly possible. I would start to process the document from the last page to the first, and then whenever you find the string, you call Doc.deletePages(). See here for documentation for this API function: http://help.adobe.com/livedocs/acrobat_sdk/11/Acrobat11_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat11_HTMLHelp&file=JS_API_AcroJS.89.458.html

  21. Cassia says:

    Hello Karl,
    Thank you for this script! It is fantastic.
    Could you please post the full script to save the extracted documents, with new filenames, in new folders? I see that there is some reference made to this above. However, I am not a programmer, and cannot figure out how to implement it.

  22. Karl Heinz Kremer says:

    Cassia, that’s a bit too much to share in a free blog post. If you do need help implementing such a script, that’s what I do for a living 🙂 If you need professional help, feel free to get in touch with me via email. My email address is on my “About” page.

  23. T. says:

    How would I search for all forms of “total” (e.g., “total” and “totaling”)?

    Or, how would I search for two words (if easier than the above), such as “total” and “totaling”?

    Thank you!

  24. Malcolm says:

    thanks for creating this script – has saved me a few hours work.

    just one question – is it possible to ignore case in the search ??

  25. Jason says:

    Hi! Awesome script. In case anyone wants it, I adjusted the script as follows to prompt the user for the desired search term instead of it being hard-coded into the script:

    Changed the below line
    var stringToSearchFor = "Total";

    To this
    var stringToSearchFor = app.response("Enter search term");

    I also noticed that this search ignores characters like the $ character. I also figured out that it ‘IS’ case sensitive, and doesn’t work on strings inside or bumped up against other words without a space in between.

    Is there a complete list somewhere that shows what combinations, characters, etc. it will or will not find? Or is there a bit of code that would adjust what it will or will not find?

    Thanks! ~JTC~

  26. Karl Heinz Kremer says:

    Jason, I am not aware of such a list. Take a look at the documentation: http://help.adobe.com/livedocs/acrobat_sdk/11/Acrobat11_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat11_HTMLHelp&file=JS_API_AcroJS.89.492.html

    There is an option to not strip out punctuation marks and whitespace. That may give you what you want.

  27. Dr. Stephanie Rollins says:

    This script is awesome! How can I delete the pages instead of extracting them. I’ve searched everywhere and hoping you can help. Is there a delete command I can insert in this script?

    When I run this script, the pages extract, however, the pages with the searched word still remain in the original document (no blank pages).

    Hoping you can help!! I work for the government, so don’t have $$ to hire a consultant. Trying really hard to figure this out on my own, but I’m stumped!

  28. Jason says:

    Karl,

    Thanks for your response. I’ll keep experimenting and post back if I figure anything additional out. I really appreciate the information you provide. Helps a lot of people out!

    ~JTC~

  29. Brian Borgstrøm says:

    Hi Karl,

    Thank you so much for this script, it does almost everything I need it to.
    Is there a way to get the script search for more than just one word? I specifically want it to search for two-word phrases but I can’t get the script to do that for me. I think it’s because it doesn’t include blank spaces in the search.

    Thanks again,
    Brian

  30. Karl Heinz Kremer says:

    Brian, the “word finder” can only search for one word at a time. You would have to implement your own method of searching for longer strings.

  31. Sean Osterhout says:

    Could this be modified to extract groups of pages? I have an 800 page report that we want separated into 3-page documents, every third page. I have a script that inserts blank pages for when we print:

    /* Add blank pages every 3 */
    /* To change number of pages between blank, change all “3” to the desired increment */

    for (var i=this.numPages-3; i>=0; i-=3) {
    var Rect = this.getPageBox("Crop", i);
    this.newPage(i+3, Rect[2], Rect[1]);
    }

    Now I need to get the printed bills separated so I save them digitally to our customer’s files. Can you point me in the right direction?

  32. Karl Heinz Kremer says:

    Sean, to extract pages you would use a different approach. The loop could be the same, but in the loop, you would use the “Doc.extractPages()” method. See here for more information: http://help.adobe.com/en_US/acrobat/acrobat_dc_sdk/2015/HTMLHelp/index.html#t=Acro12_MasterBook%2FJS_API_AcroJS%2FJavaScript_API.htm%23TOC_extractPagesbc-423&rhtocid=_6_1_8_39_32

  33. Maria Majka says:

    This article has been an incredibly helpful tool for me! Thanks so very much for sharing your knowledge in a clear, concise manner (I know nothing about scripts and you made this simple). This is saving me hours of work in extracting multiple pages.

  34. Duncan Marr says:

    I have an 8000 page PDF. Every even page is addressed and every odd page is not addressed (These need to be kept together). I need to extract all those pages where there are multiples that share the same name and address (including the corresponding non-addressed page) into one file, in order, and all those that only appear once into another file. Is this possible via a script?

  35. Karl Heinz Kremer says:

    Duncan, it may be possible to do this using a script, but it depends on the actual PDF file and how it was generated. Even if it is possible, it requires quite a bit of scripting. I’ve done similar projects where pages needed to be bundled and extracted as individual documents, so if you need professional help, feel free to contact me via email to ask about my consulting services. My email address is on my “About” page: http://khkonsulting.com/about

  36. ash says:

    page extracted but in different file. how can i combine pages. i selected different pdfs i want the result to be combined in one pdf.

  37. Karl Heinz Kremer says:

    ash, you cannot do this in one operation, you need to extract first and then assemble – or, you can keep track of which pages you want, and then remove all pages from your document that you do not need.

  38. Forrest says:

    Hi Karl

    Thanks for this posting – although I’m having a very odd issue. I have Adobe Acrobat Pro XI, and for some reason when I use your script the “stringToSearchFor” must start with the letter V. If I try any other word, it does not work. Any ideas?

    Thanks!

  39. Karl Heinz Kremer says:

    Forrest, sorry, I don’t have any ideas why that would be. Did you make changes to my code?

  40. Josh says:

    Karl, is there a way to modify this code to search for a partial word?

  41. Kendra says:

    Is it possible to look to a certain location on the PDF for a word (in my case a loan number) and include that word in the filename when extracting/splitting?

  42. Karl Heinz Kremer says:

    Kendra, you can try to extract a small portion of the document by cropping the page first to your target area, then getting all words in that target area while assembling e.g. your loan number, and then undoing the crop again to go back to your original page. This page has information about how to do that: https://answers.acrobatusers.com/Reverse-Crop-With-Javascript-q299707.aspx

    You can only use that information as part of a filename if you are saving the document (or splitting it) via JavaScript.

  43. Karl Heinz Kremer says:

    Josh, to match a partial word, you would need to provide your own matching algorithm. You can e.g. use regular expressions to do that. The word finder will always return one word, and you would have to implement the logic to match your partial word.

  44. Vanessa says:

    Brilliant! But how do I get this to search for more than one string at a time and output all the pages in one shot?

  45. Karl Heinz Kremer says:

    Vanessa, that’s just standard JavaScript programming. You need to use an “or” construct to search for one or another string (or a third a fourth or a fifth and so on):


    var nthWord = this.getPageNthWord(p, n);
    if (nthWord == stringToSearchFor_1 || nthWord == stringToSearchFor_2 || nthWord == stringToSearchFor_3) {
    // ...

    I’ve pulled out getting the nth word from the if statement so that I don’t have to call it multiple times. I assign it to a variable, and then just compare that variable to all the words I am looking for.

  46. Vanessa says:

    YOU ARE AMAZING. THANK YOU!!!!!!!!!!!! you just saved me hours and hours of work <3

  47. Brandon says:

    Hi. I had to delete the Actions (Find and Highlight, Extract Highlighted) from my Adobe, but now I’m getting an error message stating “Unable to Import the Action “ExtractPagesWithString’. The file is either invalid or corrupt. I have a huge project that will require 4400 pages to be marked and extracted out of 14000. I can’t figure this out. Thank you in advance!

  48. Karl Heinz Kremer says:

    Brandon, which version of Adobe Acrobat are you using? This should work without problems in any recent version.

  49. srihari says:

    Hi. I had to delete folios in PDF. I am currently using edit document text in Adobe X pro. If few pages I can do it manually but for more pages its tough. So can I have a script to remove the folios in pdf

  50. Karl Heinz Kremer says:

    Srihari, with the information provided here and some basic JavaScript knowledge, you should be able to create this script yourself. If this is not enough, I can certainly help you via my professional consulting services. If you are interested in that, feel free to get in touch with me via email.

  51. srihari says:

    Thank you Karl. If you could provide me basic script for folio I can develop it and use it.

  52. Stanley J says:

    Hi, this has really helped. Thank you so much. How would the script look to extract content using a date format (eg. 06JUL17)? Im having difficulty with this. Thanks

  53. Karl Heinz Kremer says:

    Stanley, if you already know the exact string, you can just adjust this one line:


    var stringToSearchFor = "06JUL17";

    This should do the job. If the date is not fixed, you need to use the util.printd() method to create the string to search for. E.g. something like this for today’s date:


    var today = new Date();
    var stringToSearchFor = util.printd("ddmmmyy", today).toUpperCase();

  54. Stanley J says:

    Karl,

    When I use :

    var stringToSearchFor = "06JUL17";

    The JavaScript runs and states “completed” but does not create a new (temp) file with the “06JUL17” pages. I’ve tried this several times. It seems the only time the new extracted pages are created is if I use a purely alphabetical search string and not an alphanumeric one like 06JUL17. Your thoughts?

    Thanks

  55. Hermie says:

    Hi Carl, Running the script on Acrobat Pro DC. It says completed, but where is the extracted file saved? What’s the default location? Thanks!

  56. Rafique Khan says:

    Karl, how can I modify this file to use it on a folder and use the same file name with just appending an extra string. I would really appreciate your help.

  57. Karl Heinz Kremer says:

    Rafique, can you please elaborate on what it is you want to do. From your short description, it’s not clear to me.

  58. Karl Heinz Kremer says:

    Hermie, the file does not get saved, you need to do that. It will be open in Acrobat (you should have two files open after the script runs: The original PDF file and the one with the extracted pages).

  59. elcartu says:

    Hello, is there any possibility to extract pages based on variable content read from an external csv/txt file or similar?

    For example, extract the pages that have the ref “X” from this csv/txt file, where the txt file contains:
    "f34";"r45";"k43"

  60. Karl Heinz Kremer says:

    elcartu, you can certainly do that. It’s just a matter of reading the text file into Acrobat via util.readFileIntoStream(), then processing the stream and parsing out your CSV data. Other than that, it’s just plain JavaScript. The actual implementation is a bit outside of the scope of what I can do here on my blog, so if you need help with this, you can contact me via email for my consulting services.

  61. elcartu says:

    Thanks, I nearly have it working with the action “Find, Highlight, and Extract Words” from https://acrobatusers.com/actions-exchange… I need to do some extra processing, but it is enough for me for now.

  62. Adam says:

    Karl, thanks for keeping up with these replies years after the original post. Let’s say I want to delete (instead of extract) all pages from a PDF that contain a certain string. Would it be easy for you to throw something together to do that?

  63. Karl Heinz Kremer says:

    Adam, the key for deleting pages is that you need to process the file in reverse page order (starting with the last page and ending with the first). Something like this should work (disclaimer: I did not try this, I just modified the original version of the script in a way I think will work):


    var stringToSearchFor = "Total";

    for (var p = this.numPages-1; p >= 0; p--) {
        // iterate over all words
        for (var n = 0; n < this.getPageNumWords(p); n++) {
            if (this.getPageNthWord(p, n) == stringToSearchFor) {
                if (this.numPages > 1) {
                    this.deletePages(p);
                }
                else {
                    app.alert("Cannot delete last remaining page in document");
                }
                break;
            }
        }
    }
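
Several of the questions above ask how to match a multi-word phrase such as “page 1 of 1”, which the single-word comparison cannot do. One possible approach, sketched here under assumptions and not taken from any of the replies, is to compare runs of consecutive words against the parts of the phrase. findPagesWithPhrase is a hypothetical name. The sketch assumes that the word finder returns the words of the phrase in reading order and as separate words, which depends on how the PDF was generated, so results should be checked against the actual document.

```javascript
// Sketch: return the zero-based numbers of all pages on which the words of
// "phrase" appear consecutively. "doc" is expected to be an Acrobat Doc
// object. The function name is hypothetical, chosen for this example only.
function findPagesWithPhrase(doc, phrase) {
    // Trim the phrase and split it into its individual words.
    var parts = phrase.replace(/^\s+|\s+$/g, "").split(/\s+/);
    var pages = [];
    if (parts[0] === "") {
        return pages; // empty phrase: nothing to search for
    }

    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p);
        // Only consider start positions that leave room for the whole phrase.
        for (var n = 0; n + parts.length <= numWords; n++) {
            var k = 0;
            while (k < parts.length &&
                   doc.getPageNthWord(p, n + k) === parts[k]) {
                k++;
            }
            if (k === parts.length) {
                pages.push(p);
                break; // one hit per page is enough
            }
        }
    }
    return pages;
}

// Possible use inside the Action:
// var pageArray = findPagesWithPhrase(this, "page 1 of 1");
```

If the words on a page do not come back in reading order, this comparison will miss the phrase, and the word positions would have to be taken into account as described in the replies above.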
