Some people enjoy reading on paper, not only because they can annotate and highlight text easily, but also because they actually like their handwriting. If you are not one of them, the following guide may help you. I will show how to extract all the highlighted text and the annotations from a PDF using Acrobat Professional. I did extensive research (i.e. I tried many different keywords in Google!) before understanding how to extract the annotations from a PDF file, and I did not find many useful articles on the internet. It turned out to be an easier process than what most of the sites I visited described, so I hope that Google ranks this page well!
The basic requirement
Acrobat Pro cannot directly extract the highlighted text; what it can extract is the text inside a comment box (i.e. an annotation). Therefore, the first thing you have to do is change one of the preferences of Acrobat Pro so that the text you highlight is copied into a comment box or pop-up. To activate this feature:
- Open Acrobat Pro and go to Edit and then Preferences (or press CTRL + K).
- Choose the Commenting category.
- Activate the option “Copy selected text into Highlight, Cross-Out, and Underline comment pop-ups”.
- Click OK.
From now on, every time you highlight text in Acrobat Pro, a comment pop-up will be created, and the highlighted text will be copied into that pop-up without any additional action.
The script that does the job
Assuming that you have a PDF file in which you highlighted text after activating the preference described above, you are now ready to get all the highlighted portions of the PDF file.
Adobe extended JavaScript to perform some specific actions on PDF files. Fortunately (for those who highlight in PDF files), one of those actions is the extraction of annotations from a PDF. A script to get annotation data from a PDF can be found in Adobe’s guide to developing JavaScript applications. I made some minor changes to that script. The result is the following:
var annots = this.getAnnots({nSortBy: ANSB_ModDate, bReverse: true});
console.println("\nAnnot Report for document: " + this.documentFileName);
if ( annots != null ) {
    console.show();
    console.println("Number of Annotations: " + annots.length);
    var msg = "$$%s";
    for (var i = 0; i < annots.length; i++)
        console.println(util.printf(msg, annots[i].contents));
    console.println(" ");
} else
    console.println(" No annotations in this document.");
This script gets all the annotations from a PDF. The first line of the script specifies that the annotations are sorted by modification date, in reverse order. Basically, this means that the first text you get is the first text that you highlighted (based on the timestamp of the highlight). If you choose not to use the reverse order, then the first text that you get is the last one that you highlighted.
I modified line 6 to add two symbols before each portion of highlighted text: $$. I use these two symbols to identify where each highlighted text (annotation) begins. Afterwards, I can automatically improve the format of the annotations using MS Word: I find and replace each $$ with a double paragraph break (i.e., ^p^p), so the annotations end up separated by one blank line (I explain the method in another tutorial).
I use this version of the script when I want to extract text from a double-column document. The timestamp of each annotation is the only thing that helps the script to identify the order of the annotations. Obviously, the limitation of this approach is that you have to be careful and highlight the text in order. That is, if after reading the last page you go back to the first to highlight a portion of the text, then that portion will appear next to the last highlighted text of the last page. The only thing you can do when you insert a new annotation in the middle of the document is to make all the annotations again (so the timestamps will be updated and in order).
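To see why the highlighting order matters, here is a minimal sketch of timestamp ordering written as plain JavaScript. It assumes that each annotation exposes a modDate property holding a Date, as Acrobat annotation objects are documented to do. The helper name oldestFirst and the sample objects are hypothetical and only serve as illustration.

```javascript
// Hypothetical helper: returns a copy of the annotations ordered from the
// oldest modification date to the newest. Assumes each item has a modDate
// property that is a Date object.
function oldestFirst(annots) {
    var copy = [];
    for (var i = 0; i < annots.length; i++) {
        copy.push(annots[i]);
    }
    copy.sort(function (a, b) {
        return a.modDate.getTime() - b.modDate.getTime();
    });
    return copy;
}

// Illustration with made-up data: a highlight added on page 1 after the
// whole document was read still sorts last, because only the timestamp counts.
var sample = [
    { page: 0, contents: "added late on page 1", modDate: new Date(2012, 9, 1, 12, 0) },
    { page: 0, contents: "first highlight", modDate: new Date(2012, 9, 1, 10, 0) },
    { page: 5, contents: "last page highlight", modDate: new Date(2012, 9, 1, 11, 0) }
];
var ordered = oldestFirst(sample);
```

In the Acrobat console, the same helper could presumably be applied to the array returned by this.getAnnots() when you prefer not to depend on the direction of bReverse.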
When I am extracting text from single-column documents, I replace the first line of the script with the following line:
var annots = this.getAnnots({nSortBy: ANSB_Page});
With this line, the annotations will be ordered by page and by their vertical position on the page. Therefore, you do not have to worry about the timestamp. Highlighting text in a single-column document is surely less stressful!
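When the annotations are sorted by page, it can also help to print the page number next to each one. The following is only a sketch: it assumes that each annotation has a page property counted from 0 (which is how Acrobat documents it) and a contents property. The function name formatWithPages and the "[p. N]" label are hypothetical choices of mine.

```javascript
// Hypothetical helper: builds one output line per annotation, prefixed with
// the $$ marker and a human-readable page number.
function formatWithPages(annots) {
    var lines = [];
    for (var i = 0; i < annots.length; i++) {
        // Assumption: page is 0-based, so add 1 for the printed page number.
        var pageLabel = annots[i].page + 1;
        lines.push("$$[p. " + pageLabel + "] " + annots[i].contents);
    }
    return lines;
}

// Possible use inside the Acrobat JavaScript console (not run here):
// var annots = this.getAnnots({nSortBy: ANSB_Page});
// if (annots != null) console.println(formatWithPages(annots).join("\n"));
```

The $$ marker is kept at the start of each line so that the later find-and-replace step in MS Word works unchanged.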
Extracting all the annotations
Now you are ready to get all the annotations. The last steps are as follows:
- Open the PDF (the one with highlighted text and annotations!) with Acrobat Pro.
- Press CTRL + J. This will open the JavaScript console.
- Paste the script discussed above.
- Select (highlight) the text of the script.
- Press CTRL + Enter.
After the console finishes running the script, you will get all the annotations (highlighted text and also the comments you inserted using sticky notes) from the PDF.
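If the output shows only $$ with nothing after it, the highlight was probably made before the commenting preference was activated, so its pop-up is empty. A small sketch to spot those cases follows. It assumes each annotation has a contents property; findEmptyAnnots is a hypothetical name, not an Acrobat function.

```javascript
// Hypothetical helper: returns the annotations whose contents are missing,
// empty, or made only of whitespace.
function findEmptyAnnots(annots) {
    var empty = [];
    for (var i = 0; i < annots.length; i++) {
        var text = annots[i].contents;
        if (!text || String(text).replace(/\s+/g, "") === "") {
            empty.push(annots[i]);
        }
    }
    return empty;
}

// Possible use inside the Acrobat JavaScript console (not run here):
// var annots = this.getAnnots();
// if (annots != null)
//     console.println("Empty annotations: " + findEmptyAnnots(annots).length);
```

A count greater than zero suggests that some highlights need to be made again with the preference turned on.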
What I do next is copy the annotations into MS Word in order to eliminate the line breaks that may be splitting sentences because of the column layout. I also use the $$ that the script adds to each annotation (line 6) to identify where I should insert a line break. This line break helps me to easily identify each annotation when I read the summary of the document that is created automatically from what I highlighted.
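The same clean-up can be sketched in JavaScript for those who prefer not to use MS Word. This is a rough equivalent, not the method described above: it assumes that the pasted console output contains only annotation text, that $$ never appears inside an annotation, and that words hyphenated across lines can be left as they are. The function name reflowAnnotations is hypothetical.

```javascript
// Hypothetical helper: joins the lines inside each annotation and separates
// the annotations with one blank line, using $$ as the delimiter.
function reflowAnnotations(text) {
    var parts = text.split("$$");
    var result = [];
    for (var i = 0; i < parts.length; i++) {
        // Turn line breaks and runs of whitespace into single spaces,
        // then drop a leading or trailing space.
        var cleaned = parts[i].replace(/\s+/g, " ").replace(/^ | $/g, "");
        if (cleaned !== "") {
            result.push(cleaned);
        }
    }
    return result.join("\n\n");
}

// Made-up console output with "artificial" line breaks inside annotations.
var pasted = "$$first line\nsecond line\n$$another\r\nnote\n";
var tidy = reflowAnnotations(pasted);
```

Any text that appears before the first $$ (such as the report header) is kept as its own paragraph, so it may be simpler to paste only the annotation lines.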
Hey, you have written a really nice tutorial about the extraction of annotations.
I tried to do it too, but I failed.
Can you be more precise about how to execute the JavaScript in Acrobat Pro?
When I press Ctrl+J I get this: imageshack.us/photo/my-images/706/94976279.png
By executing with Ctrl+Enter I get the annotations in the script window.
Is there a way to save them in a text file?
Additionally, I’ve noticed that I get many line breaks, even within my highlighted phrases. Do you know how to reduce them?
Hello!
I think you are doing it right. The text that you highlighted will appear just below the script in the JavaScript Console of Acrobat Pro. I do not know if you can take that text automatically to some other software or format. I just copy and paste the text into MS Word or some other text editor.
Regarding the line breaks, I added the symbols “$$” to the script in order to solve that problem. Each annotation (i.e. highlighted text) will start with $$. Therefore, when I paste the text into MS Word I can identify where I should place a line break. This actually allows you to delete those line breaks that result from having the text in two-column format. In another tutorial I explain how to delete these “artificial” line breaks (https://franciscomoralesdotorg.wordpress.com/2012/09/28/copy-and-paste-2-column-text-from-a-pdf-to-ms-word/). After you delete the “artificial” breaks, you can insert a line break for every $$ in the text (i.e. find $$ and replace with ^p^p in MS Word).
Let me know if you have any questions!
What about the PDF Highlights Extractor utility?
http://sourceforge.net/projects/pdfhex
I have some documents that were highlighted before activating the “copy the text that you highlight into a comment box or pop-up” feature. Is there a way to copy the highlighted text into comments using a script?
Hi!
I am not aware of anything like that. In my case, I just replace the old highlighted parts with new ones (with the aforementioned setting turned on, of course). Assuming there is a script that can do what you want, I guess there would be a problem when the text is in two columns, because the timestamp is used to order the sequence of the highlighted text.
Let me know if you find how to do this!
Yes, if you still need it contact me and I will give you the script (for free)
here: valerio.biscione@gmail.com
Joe, I have just written a post with the code to do exactly what you ask. I will put it here, hoping not to look like a spammer. I also thank Francisco Morales for the good hint in writing the code.
http://biscionevalerio.wordpress.com/2014/07/22/copy-highlighted-text-into-comments-from-a-pdf-file/
Thanks for the script!
It worked just fine for me. Thanks a lot Francisco. This is really awesome.
I am glad to read that!
I wrote this script with the free version of PDFXchange viewer; it attaches a file containing only the comments to the PDF. myCommentList may be .doc, .txt, or .rtf, in UTF-8.
Comments are arranged in page order.
var annots = this.getAnnots();
var cMyC = "Comment";
for ( var i=0; i<annots.length; i++ )
cMyC += ( "\•" + annots[i].contents+"\"");
this.createDataObject({cName: "myCommentList.doc", cValue: cMyC});
this.exportDataObject({cName: "myCommentList.doc", nLaunch: 0});
I am no scripter/developer, yet this function should have a dedicated button in any pdf reader.
Hello,
Excellent tool. I am facing one problem and that is that the script correctly indicates the number of annotations but it only returns $$, not the actual content of the annotation itself.
Any help would be great!
Paul
Hi Paul,
Did you check the option “Copy selected text into Highlight, Cross-Out, and Underline comment pop-ups” before highlighting text? It is the only thing that I can think of regarding your problem.
Francisco
There is a simple yet robust highlight extracting tool available for free at: http://www.sumnotes.net
Thanks Francisco, awesome tool! Is there a way to indicate also the page number of the annotation?
Martin
Hello! I use sumnotes to extract highlights and images from PDF documents. It really helps me do projects without getting stressed. I deal with a lot of PDF documents every day at university and sumnotes helps a lot. It is available at http://www.sumnotes.net. It is entirely cloud-based, so no download is needed. It supports the most widely used operating systems and devices. There are several videos on YouTube that demonstrate how to use sumnotes. Just search for sumnotes and you will find them immediately. I hope you will like sumnotes!
Works great! Thanks from Mexico City
http://www.pdfhighlights.com extracts all your highlights and annotations from multiple PDFs into a single report from which you can click through to the PDF containing each annotation.
If you sync your tablet/ipad with your desktop using something like dropbox, then all your ipad pdf annotations will be reportable from your desktop.
Thank you so much for this, I use it all the time!
This is very cool, but when I copy this code in word for word, I get this error message:
TypeError: Report is not a constructor
1:Console:Exec
undefined
And I can find no explanation of this error anywhere. Admittedly JS is not my first language, but, as I understand it, when Javascript complains something is “not a constructor” it’s because functions are being called as procedures or vice versa … or some other kind of simple mistype in the call.
All of the positive replies responding to this post suggest the code works as printed, so I’m thinking there must be some bit-flip in my version of the Acrobat or JS that’s choking on this otherwise flawless code.
Any insights would be appreciated.
(This utility is the kind of thing that is SO useful it really should be a built in feature of Acrobat instead of something we should have to build or pay extra for other software to do, IMHO.)
Dear Francisco,
I am really thankful to you for this post, and to Valerio Biscione for the linked one, for giving me some pointers to develop an app.
I have been an ‘extensive highlighter’ on my Android tablet using the xodo docs application :). But I was not able to find even a single application that can extract the comments from my highlights.
That is because on Android you cannot set the ‘Copy selected text to highlight … comment pop-ups’ option. So I had been searching for a solution for the past 6 or 7 months.
Finally I decided to write one myself 🙂 hosted at https://pdfcommentextractor.wordpress.com/
I have added the following features in it:
1. Provision to copy old highlight texts to comment pop-ups retroactively (that is, when you had not enabled the setting explained above before making the highlight).
2. Provision to copy highlight texts to comment pop ups for highlights made from a tablet.
3. Provision to specify delimiters in the comment generator.
Single file processing and bulk processing
4. MY FAVORITE: Provision to split different colour highlights to different files
If someone is still looking hard for an app like this, you can try it.
Caveat: it is not free, but it is not too costly either.
I would appreciate it if Francisco could review it. I can send you a free copy for review, if you want. Please mail me at wowpdfextractor@gmail.com
Thanks,
Alex
Zot file (free) http://zotfile.com/ with Zotero (free) does a very good job of extracting annotations. This vid shows how to do it https://www.youtube.com/watch?v=4aDvAPLZwCY
very clean and simple no need for home brew javascript
thanks man this awesome code just did the job! thanks for sharing 🙂