Age  Commit message  Author
2019-11-02  Add license (GNU GPLv3+)  Teddy Wing
2019-11-02  get_urls_from_pdf: Extract link annotation check to a function  Teddy Wing
Give this condition a more descriptive name.
2019-11-02  get_urls_from_pdf: Add a short doc string  Teddy Wing
2019-11-02  main: Accept a file path as a command line argument  Teddy Wing
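A minimal sketch of reading a file path from the command line. The helper name `pdf_path_from_args` is hypothetical and not taken from the repository; it takes any iterator of arguments so that it can be exercised without a real process environment.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Returns the first positional argument as a path, if one was given.
/// `args` is expected to start with the program name, the way
/// `std::env::args_os()` does. (Hypothetical helper for illustration.)
fn pdf_path_from_args<I>(mut args: I) -> Option<PathBuf>
where
    I: Iterator<Item = OsString>,
{
    // Skip the program name.
    args.next();
    // The next item, if present, is treated as the PDF path.
    args.next().map(PathBuf::from)
}
```

In a `main` function this could be called as `pdf_path_from_args(std::env::args_os())`, printing a usage message when it returns `None`.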
2019-11-02  get_urls_from_pdf: Allow out of order URLs in test  Teddy Wing
For now I'm going to allow URLs to be printed out of their apparent visual order. Change the test so that it passes.
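One way a test can tolerate out-of-order results is to compare sorted copies of both lists. This is only a sketch of the idea; the helper name `same_urls_ignoring_order` is an assumption, and the commit may have changed the test differently (for example by reordering the expected values).

```rust
/// Returns true when both slices hold the same URLs, ignoring order.
/// Duplicates still count: ["a", "a"] does not match ["a"].
/// (Hypothetical helper for illustration.)
fn same_urls_ignoring_order(left: &[String], right: &[String]) -> bool {
    // Sort references so the inputs themselves are left untouched.
    let mut left: Vec<&String> = left.iter().collect();
    let mut right: Vec<&String> = right.iter().collect();
    left.sort();
    right.sort();
    left == right
}
```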
2019-11-02  get_urls_from_pdf: Remove duplicate URLs  Teddy Wing
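The log does not say how duplicates are removed. As a sketch under that caveat, the following hypothetical `dedup_urls` keeps the first occurrence of each URL and preserves the order in which URLs were found. Note that `Vec::dedup` alone would only drop consecutive repeats.

```rust
use std::collections::HashSet;

/// Removes repeated URLs, keeping the first occurrence of each.
/// (Hypothetical helper; the repository may sort and dedup instead.)
fn dedup_urls(urls: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    urls.into_iter()
        // `insert` returns false when the value was already present.
        .filter(|url| seen.insert(url.clone()))
        .collect()
}
```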
2019-11-02  get_urls_from_pdf: Test extracted URLs  Teddy Wing
Add a test with a simple text-only PDF with three URLs. The test currently fails as shown below, so evidently the extracted order is not necessarily the visible order, and a multi-line hyperlink can be encoded as two link areas:

    ---- tests::get_urls_from_pdf_extracts_urls_from_pdf stdout ----
    thread 'tests::get_urls_from_pdf_extracts_urls_from_pdf' panicked at 'assertion failed: `(left == right)`
      left: `["http://www.gutenberg.org/ebooks/11", "https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h/images/i002.jpg", "https://science.nasa.gov/news-article/black-hole-image-makes-history"]`,
     right: `["http://www.gutenberg.org/ebooks/11", "https://science.nasa.gov/news-article/black-hole-image-makes-history", "https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h/images/i002.jpg", "https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h/images/i002.jpg"]`', src/lib.rs:65:9
2019-11-02  get_urls_from_pdf: Return a `Vec<String>` instead of printing  Teddy Wing
Facilitate testing by returning a vec of URLs instead of printing them directly to STDOUT.
2019-11-02  main: Handle error from `get_urls_from_pdf`  Teddy Wing
2019-11-02  get_urls_from_pdf: Remove `return`s to fix URL output  Teddy Wing
Turns out when I removed the `unwrap`s in 92f8f57b76b32c3d3e52d4b61dcdf25969f47ab7, the `return`s I added to the `match` expressions caused the loops to exit early without iterating over all the objects in the PDF. Remove the `return`s and fix up the expression return types to get URLs printing again.
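The failure mode described here can be reproduced in miniature. The sketch below is not the repository's code: the objects are modelled as `Option<&str>` (a URI or nothing), and both function names are made up for illustration. In the first version a `return` inside the `match` leaves the whole function at the first object without a URI; the second skips that object and keeps iterating.

```rust
/// Buggy shape: `return` exits the function, not just this iteration.
fn urls_with_early_return(objects: &[Option<&str>]) -> Vec<String> {
    let mut urls = Vec::new();
    for &object in objects {
        let uri = match object {
            Some(uri) => uri,
            // Stops at the first object without a URI.
            None => return urls,
        };
        urls.push(uri.to_string());
    }
    urls
}

/// Fixed shape: objects without a URI are skipped with `continue`.
fn urls_skipping_non_links(objects: &[Option<&str>]) -> Vec<String> {
    let mut urls = Vec::new();
    for &object in objects {
        match object {
            Some(uri) => urls.push(uri.to_string()),
            None => continue,
        }
    }
    urls
}
```

Given the objects `[Some("a"), None, Some("b")]`, the first function yields only `"a"`, while the second yields both `"a"` and `"b"`.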
2019-11-02  get_urls_from_pdf: Remove `unwrap`s and replace with an error type  Teddy Wing
Create a custom error type to use instead of the `unwrap`s.
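A sketch of what replacing `unwrap`s with a custom error type can look like. The log does not show the actual type, so the name `ExtractError`, its variants, and the helper `uri_from_bytes` are all assumptions. The `From` implementations are what let the `?` operator convert underlying errors automatically.

```rust
use std::fmt;
use std::io;
use std::str::Utf8Error;

/// Hypothetical error type; the variants are assumptions.
#[derive(Debug)]
enum ExtractError {
    Io(io::Error),
    InvalidUtf8(Utf8Error),
}

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExtractError::Io(err) => write!(f, "I/O error: {}", err),
            ExtractError::InvalidUtf8(err) => write!(f, "invalid UTF-8: {}", err),
        }
    }
}

impl std::error::Error for ExtractError {}

impl From<io::Error> for ExtractError {
    fn from(err: io::Error) -> Self {
        ExtractError::Io(err)
    }
}

impl From<Utf8Error> for ExtractError {
    fn from(err: Utf8Error) -> Self {
        ExtractError::InvalidUtf8(err)
    }
}

/// Decodes raw bytes as a URI string, returning an error instead of panicking.
fn uri_from_bytes(bytes: &[u8]) -> Result<&str, ExtractError> {
    // `?` converts the `Utf8Error` through the `From` impl above.
    Ok(std::str::from_utf8(bytes)?)
}
```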
2019-11-01  lib: Use `std::str`  Teddy Wing
Get rid of `::str`-prefixed calls.
2019-11-01  get_urls_from_pdf: Change argument type to `AsRef<Path>`  Teddy Wing
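Accepting `AsRef<Path>` lets callers pass a `&str`, a `String`, or a `PathBuf` without converting first. The function below is a hypothetical stand-in that only inspects the path; it is not `get_urls_from_pdf` itself, but it uses the same kind of generic bound.

```rust
use std::path::Path;

/// Returns true when the path ends in a `.pdf` extension (case-insensitive).
/// (Hypothetical helper showing the `AsRef<Path>` bound.)
fn has_pdf_extension<P: AsRef<Path>>(path: P) -> bool {
    path.as_ref()
        .extension()
        .map_or(false, |ext| ext.eq_ignore_ascii_case("pdf"))
}
```

All of `has_pdf_extension("a.pdf")`, `has_pdf_extension(String::from("a.pdf"))`, and `has_pdf_extension(PathBuf::from("a.pdf"))` compile against this one signature.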
2019-11-01  get_urls_from_pdf: Take PDF path as an argument  Teddy Wing
2019-11-01  main: Move URL extraction code into lib.rs  Teddy Wing
2019-11-01  main: Remove unused `id` variable  Teddy Wing
2019-11-01  main: Remove `dbg!` statements  Teddy Wing
2019-11-01  main: Output URLs to STDOUT  Teddy Wing
2019-11-01  Find the PDF object that URLs are stored in  Teddy Wing
Thanks to plinth (https://stackoverflow.com/users/20481/plinth) on Stack Overflow, learned that URLs are stored in /A entries in a PDF:

> To get the link to go somewhere you'll need either a /Dest or an /A
> entry in the link annot (but not both). /Dest is an older artifact for
> page-level navigation - you won't use this. Instead, use the /A entry
> which is an action dictionary. So if you wanted to navigate to the url
> http://www.google.com, you would make your annotation look like this:
>
> << /Type /Annot /Subtype /Link /Rect [ x1 y1 x2 y2 ]
> /A << /Type /Action /S /URI /URI (http://www.google.com) >>
> >>

https://stackoverflow.com/questions/19492229/add-a-hyperlink-into-a-pdf-document/19496996#19496996

To extract URLs, find the /A objects and get the text value of their `URI` fields.
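The lookup described here can be sketched without a PDF library. The `Obj` enum below is a deliberately tiny stand-in for PDF objects and is not the lopdf API; the function name `uri_of_link_annotation` is likewise an assumption. It walks the same path as the quoted dictionary: a `/Subtype /Link` annotation, its `/A` action dictionary, an `/S /URI` action type, and finally the `/URI` string.

```rust
use std::collections::HashMap;

/// Minimal stand-in for PDF objects (NOT the lopdf types).
#[derive(Debug, Clone, PartialEq)]
enum Obj {
    Name(String),
    Str(String),
    Dict(HashMap<String, Obj>),
}

/// Returns the URI of a link annotation, or `None` if the dictionary
/// is not a link annotation with a URI action.
fn uri_of_link_annotation(annot: &HashMap<String, Obj>) -> Option<&str> {
    // Only /Subtype /Link annotations are considered.
    match annot.get("Subtype") {
        Some(Obj::Name(name)) if name == "Link" => {}
        _ => return None,
    }
    // The /A entry holds the action dictionary.
    let action = match annot.get("A") {
        Some(Obj::Dict(dict)) => dict,
        _ => return None,
    };
    // The action must be a URI action: /S /URI.
    match action.get("S") {
        Some(Obj::Name(name)) if name == "URI" => {}
        _ => return None,
    }
    // The /URI entry carries the target as a string.
    match action.get("URI") {
        Some(Obj::Str(uri)) => Some(uri.as_str()),
        _ => None,
    }
}
```

An annotation that uses `/Dest` instead of `/A` has no action dictionary, so this sketch returns `None` for it.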
2019-10-30  Inspect an example PDF file using the 'lopdf' crate  Teddy Wing
Walk the different objects in the PDF to discover how hyperlinks are stored and how I can access them.
2019-10-30  New Rust 1.38.0 project  Teddy Wing
    $ rustc --version
    rustc 1.38.0 (625451e37 2019-09-23)
    $ cargo init --bin