Scraping Book Prices

I have many out-of-print books, so I'm planning a project to query AbeBooks.com or other booksellers for current collectible book prices. So far I've found one article on this subject, by Ricardo Avila.

According to Avila, ABE does not have a documented API, so I'm probably looking at web scraping. Does anyone have experience or advice with web scraping, especially for used book prices?

Check the legal stuff first. Some sites prohibit scraping. It's such a contentious area legally that I've never actually gone ahead and done it. (My two cents: get written permission from any site you want to scrape, and check with your lawyer too!)

As for tools, there are free libraries for both Python (for example, BeautifulSoup) and Java (for example, HtmlUnit), among others.
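For instance, a minimal BeautifulSoup sketch might look like the one below. The search URL, its parameters, and the CSS class name are only guesses for illustration; you'd have to inspect the real AbeBooks result pages (and confirm you're allowed to fetch them) before trusting any of it.

```python
# Rough sketch only: the search URL, the "tn" parameter, and the
# "item-price" class are assumptions for illustration, not the
# site's actual page structure.
import requests
from bs4 import BeautifulSoup

def fetch_prices(title):
    # Hypothetical search URL -- check the real site's URL format.
    url = "https://www.abebooks.com/servlet/SearchResults"
    resp = requests.get(url, params={"tn": title}, timeout=30)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    prices = []
    # Hypothetical class name -- inspect the page to find the real one.
    for tag in soup.find_all("span", class_="item-price"):
        prices.append(tag.get_text(strip=True))
    return prices

if __name__ == "__main__":
    print(fetch_prices("The Moviegoer"))
```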


If a book price has been used, is it harder to scrape off the page? :stuck_out_tongue_winking_eye:

You are really hoping that the pages are generated in a way that allows you to locate the information consistently. If they are, your job is to slurp the page, suck out the good bits, and spit the rest away. Then do that all over again. Tedious, but doable.

Good regex-savvy tools will make this job easier; a rough sketch of the idea is below.
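For example, if all you want is the dollar amounts, even a crude regex pass over the saved HTML can get you started (a sketch assuming US-style prices such as $12.50):

```python
# Crude first pass: pull anything that looks like a US-dollar price
# out of raw HTML saved to disk. Assumes "$12.50"-style price text.
import re

html = open("results.html", encoding="utf-8").read()
prices = re.findall(r"\$\d{1,3}(?:,\d{3})*\.\d{2}", html)
print(prices)
```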


Although I use regex all the time, I don't see a good reason to use it here when there are free, widely used HTML libraries that actually understand HTML structure.
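A quick illustration of why, using made-up HTML: attribute order and extra attributes don't bother a parser, but they easily break a regex keyed to one exact form.

```python
# Two listings with the same meaning but slightly different markup.
import re
from bs4 import BeautifulSoup

html = (
    '<span class="price" data-currency="USD">$42.00</span>'
    '<span data-currency="USD" class="price">$17.50</span>'
)

# A regex keyed to one attribute order misses the second listing...
print(re.findall(r'<span class="price"[^>]*>([^<]+)</span>', html))
# ['$42.00']

# ...while a parser that understands the structure finds both.
soup = BeautifulSoup(html, "html.parser")
print([tag.get_text() for tag in soup.find_all("span", class_="price")])
# ['$42.00', '$17.50']
```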

Assuming you're willing to take the legal risk and you have a Mac, you could also use PDFPenPro, the PDF utility, and not write a lick of code or a regex. You can point PDFPenPro at a URL and it will crawl the entire site (to the depth you specify) and create a PDF of every page.
