pdf-urls - Extract all URLs from a PDF file.

Age	Commit message	Author
2019-11-02	Add license (GNU GPLv3+)	Teddy Wing

2019-11-02	get_urls_from_pdf: Extract link annotation check to a function	Teddy Wing
	Give this condition a more descriptive name.
2019-11-02	get_urls_from_pdf: Add a short doc string	Teddy Wing

2019-11-02	main: Accept a file path as a command line argument	Teddy Wing

2019-11-02	get_urls_from_pdf: Allow out of order URLs in test	Teddy Wing
	For now I'm going to allow URLs to be printed out of their apparent visual order. Change the test so that it passes.
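One way to make a test tolerate out-of-order URLs is to sort both lists before comparing them. This is a minimal sketch, not the project's actual test: the helper name `sorted` is hypothetical, and the two URLs are taken from the failure output in the earlier test commit.

```rust
/// Returns a sorted, owned copy of the URLs so that two lists can be
/// compared without regard to their original order.
/// (`sorted` is a hypothetical helper, not part of the project.)
fn sorted(urls: &[&str]) -> Vec<String> {
    let mut owned: Vec<String> = urls.iter().map(|u| u.to_string()).collect();
    owned.sort();
    owned
}

fn main() {
    let expected = [
        "http://www.gutenberg.org/ebooks/11",
        "https://science.nasa.gov/news-article/black-hole-image-makes-history",
    ];
    // Same URLs, different order.
    let actual = [
        "https://science.nasa.gov/news-article/black-hole-image-makes-history",
        "http://www.gutenberg.org/ebooks/11",
    ];

    // Passes because both sides are sorted before the comparison.
    assert_eq!(sorted(&expected), sorted(&actual));
}
```

Sorting keeps duplicates, so this comparison still fails if one side contains a URL more times than the other.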
2019-11-02	get_urls_from_pdf: Remove duplicate URLs	Teddy Wing
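A sketch of one possible way to remove duplicate URLs while keeping the first occurrence of each. The commit does not say which approach it uses, so the function name `dedup_urls` and the `HashSet` technique are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Drops repeated URLs, keeping the first occurrence of each one.
/// (`dedup_urls` is a hypothetical name, not taken from the project.)
fn dedup_urls(urls: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    urls.into_iter()
        // `insert` returns false when the value was already in the set.
        .filter(|url| seen.insert(url.clone()))
        .collect()
}

fn main() {
    let urls = vec![
        "https://example.org/a".to_string(),
        "https://example.org/b".to_string(),
        "https://example.org/b".to_string(),
        "https://example.org/a".to_string(),
    ];
    println!("{:?}", dedup_urls(urls));
}
```

`Vec::dedup` would be shorter, but it only removes consecutive duplicates, so it would miss a URL that reappears later in the list.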

2019-11-02	get_urls_from_pdf: Test extracted URLs	Teddy Wing
	Add a test with a simple text-only PDF with three URLs.
	Currently I'm getting the following failure, so visibly the order is not necessarily the same as the visible order, and multi-line hyperlinks can be encoded as two link areas:
	---- tests::get_urls_from_pdf_extracts_urls_from_pdf stdout ----
	thread 'tests::get_urls_from_pdf_extracts_urls_from_pdf' panicked at 'assertion failed: `(left == right)`
	  left: `["http://www.gutenberg.org/ebooks/11", "https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h/images/i002.jpg", "https://science.nasa.gov/news-article/black-hole-image-makes-history"]`,
	 right: `["http://www.gutenberg.org/ebooks/11", "https://science.nasa.gov/news-article/black-hole-image-makes-history", "https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h/images/i002.jpg", "https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h/images/i002.jpg"]`', src/lib.rs:65:9
2019-11-02	get_urls_from_pdf: Return a `Vec<String>` instead of printing	Teddy Wing
	Facilitate testing by returning a vec of URLs instead of printing them directly to STDOUT.
2019-11-02	main: Handle error from `get_urls_from_pdf`	Teddy Wing

2019-11-02	get_urls_from_pdf: Remove `return`s to fix URL output	Teddy Wing
	Turns out when I removed the `unwrap`s in 92f8f57b76b32c3d3e52d4b61dcdf25969f47ab7, the `return`s I added to the `match` expressions caused the loops to exit early without iterating over all the objects in the PDF. Remove the `return`s and fix up the expression return types to get URLs printing again.
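The early-exit problem described above can be reproduced without a PDF library. In this sketch `read_object` is a hypothetical stand-in for reading one PDF object, and using `continue` in the error arm is only one possible fix; the actual commit may have restructured the `match` expressions differently.

```rust
/// Hypothetical stand-in for reading one PDF object: odd ids fail.
fn read_object(id: u32) -> Result<String, String> {
    if id % 2 == 0 {
        Ok(format!("object {}", id))
    } else {
        Err(format!("cannot read object {}", id))
    }
}

/// Buggy shape: `return` in the `Err` arm leaves the whole function
/// at the first unreadable object, so later objects are never visited.
fn collect_with_return(ids: &[u32]) -> Vec<String> {
    let mut found = Vec::new();
    for &id in ids {
        let object = match read_object(id) {
            Ok(object) => object,
            Err(_) => return found,
        };
        found.push(object);
    }
    found
}

/// One possible fix: `continue` skips the unreadable object and
/// keeps iterating over the rest.
fn collect_with_continue(ids: &[u32]) -> Vec<String> {
    let mut found = Vec::new();
    for &id in ids {
        let object = match read_object(id) {
            Ok(object) => object,
            Err(_) => continue,
        };
        found.push(object);
    }
    found
}

fn main() {
    let ids = [2, 3, 4];
    println!("{:?}", collect_with_return(&ids));
    println!("{:?}", collect_with_continue(&ids));
}
```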
2019-11-02	get_urls_from_pdf: Remove `unwrap`s and replace with an error type	Teddy Wing
	Create a custom error type to use instead of the `unwrap`s.
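A minimal sketch of what a custom error type replacing `unwrap` calls can look like, using only the standard library. The variant names (`Io`, `MissingUri`) and the helper `read_pdf_bytes` are assumptions for illustration and are not taken from the project's source.

```rust
use std::fmt;
use std::fs;
use std::io;

/// Hypothetical error type; the variants are illustrative only.
#[derive(Debug)]
enum Error {
    /// The file could not be read.
    Io(io::Error),
    /// A link action had no URI entry.
    MissingUri,
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(err) => write!(f, "I/O error: {}", err),
            Error::MissingUri => write!(f, "link action has no URI"),
        }
    }
}

impl std::error::Error for Error {}

/// Lets `?` convert an `io::Error` into our `Error` automatically.
impl From<io::Error> for Error {
    fn from(err: io::Error) -> Self {
        Error::Io(err)
    }
}

/// Reads a file without `unwrap`: failures travel back to the caller.
fn read_pdf_bytes(path: &str) -> Result<Vec<u8>, Error> {
    let bytes = fs::read(path)?;
    Ok(bytes)
}

fn main() {
    match read_pdf_bytes("does-not-exist.pdf") {
        Ok(bytes) => println!("read {} bytes", bytes.len()),
        Err(err) => eprintln!("error: {}", err),
    }
    println!("{}", Error::MissingUri);
}
```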
2019-11-01	lib: Use `std::str`	Teddy Wing
	Get rid of `::str`-prefixed calls.
2019-11-01	get_urls_from_pdf: Change argument type to `AsRef<Path>`	Teddy Wing
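Taking `AsRef<Path>` lets callers pass a `&str`, `String`, `&Path`, or `PathBuf` without converting first. The sketch below shows the pattern on a hypothetical helper, `extension_of`, rather than on `get_urls_from_pdf` itself.

```rust
use std::path::{Path, PathBuf};

/// Returns the file extension of `path`, if it has one.
/// (`extension_of` is a hypothetical helper used to show the pattern.)
fn extension_of<P: AsRef<Path>>(path: P) -> Option<String> {
    path.as_ref()
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| ext.to_string())
}

fn main() {
    // All of these argument types satisfy `AsRef<Path>`.
    println!("{:?}", extension_of("paper.pdf"));
    println!("{:?}", extension_of(String::from("paper.pdf")));
    println!("{:?}", extension_of(Path::new("dir/paper.pdf")));
    println!("{:?}", extension_of(PathBuf::from("dir/paper.pdf")));
}
```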

2019-11-01	get_urls_from_pdf: Take PDF path as an argument	Teddy Wing

2019-11-01	main: Move URL extraction code into lib.rs	Teddy Wing

2019-11-01	main: Remove unused `id` variable	Teddy Wing

2019-11-01	main: Remove `dbg!` statements	Teddy Wing

2019-11-01	main: Output URLs to STDOUT	Teddy Wing

2019-11-01	Find the PDF object that URLs are stored in	Teddy Wing
	Thanks to plinth (https://stackoverflow.com/users/20481/plinth) on Stack Overflow, learned that URLs are stored in /A entries in a PDF:
	> To get the link to go somewhere you'll need either a /Dest or an /A
	> entry in the link annot (but not both). /Dest is an older artifact for
	> page-level navigation - you won't use this. Instead, use the /A entry
	> which is an action dictionary. So if you wanted to navigate to the url
	> http://www.google.com, you would make your annotation look like this:
	>
	> << /Type /Annot /Subtype /Link /Rect [ x1 y1 x2 y2 ]
	> /A << /Type /Action /S /URI /URI (http://www.google.com) >>
	> >>
	https://stackoverflow.com/questions/19492229/add-a-hyperlink-into-a-pdf-document/19496996#19496996
	To extract URLs, find the /A objects and get the text value of their `URI` fields.
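The lookup described above can be sketched on a toy object model. The `Object` enum, the `dict` helper, and `uri_of` below are hypothetical stand-ins written for this example; they are not the `lopdf` types or API, and a real PDF may also store /A as an indirect reference that has to be resolved first.

```rust
use std::collections::HashMap;

/// Toy stand-in for a PDF object. This is NOT the `lopdf` object type.
enum Object {
    Name(String),
    Text(String),
    Dictionary(HashMap<String, Object>),
}

/// Hypothetical helper that builds a dictionary from key/value pairs.
fn dict(entries: Vec<(&str, Object)>) -> Object {
    Object::Dictionary(
        entries
            .into_iter()
            .map(|(key, value)| (key.to_string(), value))
            .collect(),
    )
}

/// Returns the URI from an annotation's /A action dictionary,
/// but only when the action's /S entry is the name `URI`.
fn uri_of(annotation: &Object) -> Option<&str> {
    let annotation = match annotation {
        Object::Dictionary(entries) => entries,
        _ => return None,
    };
    let action = match annotation.get("A")? {
        Object::Dictionary(entries) => entries,
        _ => return None,
    };
    match action.get("S")? {
        Object::Name(kind) if kind == "URI" => {}
        _ => return None,
    }
    match action.get("URI")? {
        Object::Text(uri) => Some(uri.as_str()),
        _ => None,
    }
}

fn main() {
    // Mirrors the annotation quoted in the Stack Overflow answer.
    let annotation = dict(vec![
        ("Type", Object::Name("Annot".to_string())),
        ("Subtype", Object::Name("Link".to_string())),
        (
            "A",
            dict(vec![
                ("Type", Object::Name("Action".to_string())),
                ("S", Object::Name("URI".to_string())),
                ("URI", Object::Text("http://www.google.com".to_string())),
            ]),
        ),
    ]);
    println!("{:?}", uri_of(&annotation));
}
```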
2019-10-30	Inspect an example PDF file using the 'lopdf' crate	Teddy Wing
	Walk the different objects in the PDF to discover how hyperlinks are stored and how I can access them.
2019-10-30	New Rust 1.38.0 project	Teddy Wing
	$ rustc --version
	rustc 1.38.0 (625451e37 2019-09-23)
	$ cargo init --bin