How To: Extract Highlighted Text from a PDF File

Some people enjoy reading on paper, not only because they can annotate and highlight text easily, but also because they actually like their own handwriting. If you are not one of them, the following guide may help you. I will show how to extract all the highlighted text and the annotations from a PDF using Acrobat Professional. I did extensive research (i.e., I tried many different keywords in Google!) before I understood how to extract the annotations from a PDF file, and I did not find many useful articles on the internet. It turned out to be an easier process than most of the sites I visited described, so I hope that Google ranks this page well!

The basic requirement

Acrobat Pro cannot extract highlighted text directly; what it can extract is the text inside a comment box (i.e., an annotation). Therefore, the first thing you have to do is change one of the preferences of Acrobat Pro so that the text you highlight is copied into a comment box or pop-up. To activate this feature:

  1. Open Acrobat Pro and go to Edit and then Preferences (or press CTRL + K).
  2. Choose the Commenting category.
  3. Activate the option “Copy selected text into Highlight, Cross-Out, and Underline comment pop-ups”.
  4. Click OK.

From now on, every time you highlight text in Acrobat Pro, a comment pop-up will be created, and the highlighted text will be copied into that pop-up without any additional action.

The script that does the job

Assuming that you have a PDF file in which you highlighted text after activating the option described above, you are now ready to extract all the highlighted portions of the file.

Adobe extended JavaScript to perform some specific actions on PDF files. Fortunately (for those who highlight in PDF files), one of those actions is the extraction of annotations from a PDF. A script to get annotation data from a PDF can be found in Adobe’s guide to developing JavaScript applications (here). I made some minor changes to that script. The result is the following:

var annots = this.getAnnots({nSortBy: ANSB_ModDate, bReverse: true}); // sort by date, reversed
console.println("\nAnnot Report for document: " + this.documentFileName);
if ( annots != null ) {
    console.show(); // show the console window
    console.println("Number of Annotations: " + annots.length);
    var msg = "$$%s"; // $$ marks where each annotation begins
    for (var i = 0; i < annots.length; i++) {
        console.println(util.printf(msg, annots[i].contents));
    }
    console.println(" ");
} else {
    console.println(" No annotations in this document.");
}

This script gets all the annotations from a PDF. The first line of the script indicates that the annotations are sorted by date and in reverse order. Basically, this means that the first text you get is the first text you highlighted (based on the timestamp of the highlight). If you choose not to use the reverse order, then the first text you get is the last one you highlighted.

I modified line 6 to add two symbols, $$, before each portion of highlighted text. I use these two symbols to identify where each highlighted text (annotation) begins. Afterwards, I can automatically improve the format of the annotations using MS Word: I find and replace each $$ with a double paragraph break (i.e., ^p^p), so the annotations end up separated by one blank line (this is done using what I explained here).

I use this version of the script when I want to extract text from a double-column document. The timestamp of each annotation is the only thing that tells the script the order of the annotations. The obvious limitation of this approach is that you have to be careful to highlight the text in order. That is, if after reading the last page you go back to the first page to highlight a portion of the text, that portion will appear after the last highlighted text of the last page. The only thing you can do when you insert a new annotation in the middle of the document is to make all the later annotations again (so that the timestamps are updated and in order).
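
To see why highlighting out of order matters, here is a small simulation in plain JavaScript. It runs outside Acrobat (for example in Node or a browser console). The sample objects and their fields page, modDate, and contents are made-up stand-ins for real annotations, and sortByTimestamp is a hypothetical helper that simply mimics an oldest-first ordering; it is only a sketch of the behaviour described above, not what Acrobat does internally.

```javascript
// Made-up annotations: the last one was highlighted on page 1,
// but only after the reader had already highlighted text on page 2.
var sample = [
  { page: 1, modDate: new Date(2012, 8, 28, 10, 0), contents: "first idea on page 1" },
  { page: 2, modDate: new Date(2012, 8, 28, 10, 5), contents: "idea on page 2" },
  { page: 1, modDate: new Date(2012, 8, 28, 10, 9), contents: "late highlight on page 1" }
];

// Hypothetical helper: oldest highlight first, ignoring the page number.
function sortByTimestamp(annots) {
  return annots.slice().sort(function (a, b) {
    return a.modDate - b.modDate;
  });
}

var ordered = sortByTimestamp(sample).map(function (a) {
  return a.contents;
});

// The late highlight from page 1 is listed after the text from page 2.
console.log(ordered.join("\n"));
```

In this simulation the text highlighted last is listed last, even though it belongs to page 1.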

When I am extracting text from single-column documents, I replace the first line of the script with the following line:

var annots = this.getAnnots({nSortBy: ANSB_Page});

With this line, the annotations will be ordered by page and by their vertical position on the page. Therefore, you do not have to worry about the timestamps. Highlighting text in a single-column document is surely less stressful!

Extracting all the annotations

Now you are ready to get all the annotations. The last steps are as follows:

  1. Open the PDF (the one with highlighted text and annotations!) with Acrobat Pro.
  2. Press CTRL + J. This will open the JavaScript console.
  3. Paste the script discussed above.
  4. Select (highlight) the text of the script.
  5. Press CTRL + Enter.

After the console finishes running the script, you will get all the annotations (the highlighted text and also the comments you inserted using sticky notes) from the PDF.
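
If you want only the highlighted text and not the sticky notes, one possible variation is to check the type of each annotation before printing it. The following is a sketch under stated assumptions: highlightLines is a hypothetical helper name, and it assumes that highlight annotations report the type "Highlight". It also skips annotations whose contents are empty.

```javascript
// Hypothetical helper: keep only highlight annotations that carry text.
// Assumes each annotation exposes `type` and `contents`; the type value
// "Highlight" is an assumption to verify in your version of Acrobat.
function highlightLines(annots) {
  var lines = [];
  if (annots == null) {
    return lines; // nothing to report when there are no annotations
  }
  for (var i = 0; i < annots.length; i++) {
    var text = annots[i].contents;
    if (annots[i].type === "Highlight" && text) {
      lines.push("$$" + text); // same $$ marker as in the script above
    }
  }
  return lines;
}

// Inside the Acrobat console, where `this` is the open document, it could be used as:
// console.println(highlightLines(this.getAnnots({nSortBy: ANSB_Page})).join("\n"));
```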

What I do next is copy the annotations into MS Word in order to eliminate the line breaks that may be splitting sentences because of the column layout. I also use the $$ that the script adds to each annotation (line 6) to identify where I should insert a line break. This line break helps me easily identify each annotation when I read the summary of the document, which is created automatically from what I highlighted.
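
For readers who prefer not to use MS Word, the same clean-up can be sketched in plain JavaScript. This is only an illustration: formatAnnotations is a hypothetical helper, and it assumes that the text passed to it starts at the first $$ (that is, the report header lines have already been removed) and that $$ never appears inside a highlighted passage.

```javascript
// Hypothetical helper: turn the raw console output into one paragraph per annotation.
function formatAnnotations(raw) {
  return raw
    .split("$$") // each annotation starts with $$
    .map(function (chunk) {
      // collapse line breaks and repeated spaces inside one annotation
      return chunk.replace(/\s+/g, " ").trim();
    })
    .filter(function (chunk) {
      return chunk.length > 0; // drop the empty piece before the first $$
    })
    .join("\n\n"); // one blank line between annotations
}

// Made-up console output with an "artificial" line break in the first annotation.
var raw = "$$first highlight split\nover two lines\n$$second highlight\n";
console.log(formatAnnotations(raw));
```

Note that this sketch joins broken lines with a space, so words hyphenated at a column edge would still need to be fixed by hand.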

23 thoughts on “How To: Extract Highlighted Text from a PDF File”

  1. Florian

    Hey, you have written a really nice tutorial about the extraction of annotations.
    I tried to do it too, but I failed.
    Can you be more precise about how to execute the JavaScript in Acrobat Pro?
    When I press Ctrl+J I get this: imageshack.us/photo/my-images/706/94976279.png
    When I execute it with Ctrl+Enter, I get the annotations in the script window.
    Is there a way to save them in a text file?
    Additionally, I have noticed that I get many line breaks, even in the middle of my highlighted phrases. Do you know how to reduce them?

    1. Francisco Morales Post author

      Hello!
      I think you are doing it right. The text that you highlighted will appear just below the script in the JavaScript Console of Acrobat Pro. I do not know if you can take that text automatically to some other software or format. I just copy and paste the text into MS Word or some other text editor.

      Regarding the line breaks, I added the symbols “$$” to the script in order to solve that problem. Each annotation (i.e., highlighted text) will start with $$. Therefore, when I paste the text into MS Word I can identify where I should place a line break. This actually allows you to delete the line breaks that result from having the text in two-column format. In another tutorial I explain how to delete these “artificial” line breaks (https://franciscomoralesdotorg.wordpress.com/2012/09/28/copy-and-paste-2-column-text-from-a-pdf-to-ms-word/). After you delete the “artificial” breaks, you can insert a line break for every $$ in the text (i.e., find $$ and replace with ^p^p in MS Word).

      Let me know if you have any questions!

  2. Joe

    I have some documents that were highlighted before activating the “copy the text that you highlight into a comment box or pop-up” feature. Is there a way to copy the highlighted text into comments using a script?

    1. Francisco Morales Post author

      Hi!

      I am not aware of anything like that. In my case, I just replace the old highlighted parts with new ones (with the option mentioned above turned on, of course). Assuming there is a script that can do what you want, I guess there would be a problem when the text is in two columns, because you use the timestamp to order the sequence of the highlighted text.

      Let me know if you find how to do this!

  3. swert

    done this script with PDFXchange viewer free version; it attaches a file containing only the comments to the pdf; myCommentList may be .doc, .txt, .rtf, in UTF-8.
    comments are arranged in page order

    var annots = this.getAnnots();
    var cMyC = "Comment";
    // append the contents of every annotation to one string
    for ( var i=0; i<annots.length; i++ ) {
        cMyC += ( "\•" + annots[i].contents+"\"");
    }

    // attach the collected comments to the PDF as a file, then export it
    this.createDataObject({cName: "myCommentList.doc", cValue: cMyC});
    this.exportDataObject({cName: "myCommentList.doc", nLaunch: 0});

    I am no scripter/developer, yet this function should have a dedicated button in any pdf reader.

  4. Paul

    Hello,

    Excellent tool. I am facing one problem and that is that the script correctly indicates the number of annotations but it only returns $$, not the actual content of the annotation itself.

    Any help would be great!
    Paul

    1. Francisco Morales Post author

      Hi Paul,
      Did you check the option “Copy selected text into Highlight, Cross-Out, and Underline comment pop-ups” before highlighting text? It is the only thing that I can think of regarding your problem.
      Francisco

  5. Tom

    Hello! I use sumnotes to extract highlights and images from PDF documents. It really helps me do projects without getting stressed. I deal with a lot of PDF documents every day at university, and sumnotes helps. It is available at http://www.sumnotes.net. It runs entirely in the cloud, so no download is needed. It supports the most widely used operating systems and devices. There are several videos on YouTube that demonstrate how to use sumnotes. Just search for sumnotes and you will find them immediately. I hope you will like sumnotes!

  6. Pingback: Copy highlighted text into comments from a PDF file | Scrambled ideas for the end of the week

  7. PdfHighlights

    http://www.pdfhighlights.com extracts all your highlights and annotations from multiple PDFs into a single report from which you can click through to the PDF containing each annotation.

    If you sync your tablet/ipad with your desktop using something like dropbox, then all your ipad pdf annotations will be reportable from your desktop.

  8. S.T.Randall

    This is very cool, but when I copy this code in word for word, I get this error message:

    TypeError: Report is not a constructor
    1:Console:Exec
    undefined

    And I can find no explanation of this error anywhere. Admittedly JS is not my first language, but, as I understand it, when Javascript complains something is “not a constructor” it’s because functions are being called as procedures or vice versa … or some other kind of simple mistype in the call.

    All of the positive replies responding to this post suggest the code works as printed, so I’m thinking there must be some bit-flip in my version of the Acrobat or JS that’s choking on this otherwise flawless code.

    Any insights would be appreciated.

    (This utility is the kind of thing that is SO useful it really should be a built in feature of Acrobat instead of something we should have to build or pay extra for other software to do, IMHO.)

  9. pdfcommentextractor

    Dear Francisco,

    I am really thankful to you for this post and the linked one from Valerio Biscione for giving me some pointers to develop an app.

    I have been an ‘extensive highlighter’ on my Android tablet using the Xodo Docs application :). But I was not able to find even a single application that can extract the comments from my highlights,
    because on Android you cannot set the ‘Copy selected text into Highlight … comment pop-ups’ option. So I had been browsing for a solution for the past 6 or 7 months.

    Finally I decided to write one myself 🙂 hosted at https://pdfcommentextractor.wordpress.com/

    I have added the following features in it:

    1. Provision to copy old highlight texts to comment pop-ups retroactively (that is, when you had not made the setting explained above before making the comment).
    2. Provision to copy highlight texts to comment pop-ups for highlights made from a tablet.
    3. Provision to specify delimiters in the comment generator.
    4. Single file processing and bulk processing.
    5. MY FAVORITE: Provision to split different colour highlights to different files.

    If someone is still badly looking for this app, you can try it.

    Caveat: it is not free… but not too costly either.

    I would appreciate it if Francisco could review it. I can send you a free copy of the app for review, if you want. Please mail me at wowpdfextractor@gmail.com

    Thanks,
    Alex
