15 Oct

Scrape Data from Google Patents Website

Posted By admin

Project Title: Scrape Data from Google Patents Website

Project Description:
A friend of mine, Alminas, had been in touch with you earlier. I have a request related to his earlier query with a slightly enlarged scope of work.

I need your services to scrape data from the Google Patents website. I have the patent numbers for 1.34 million patents, e.g. 3930295.

For the above example, the data to be scraped can be found at the following URL: http://www.google.com/patents/US3930295

(Note that this is “http://www.google.com/patents/US” concatenated to the patent number.)
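As a minimal sketch of the URL rule described above, assuming the patent numbers arrive as plain digit strings or integers (the names `BASE_URL` and `patent_url` are made up for illustration):

```python
# Prefix taken from the example URL in the project description.
BASE_URL = "http://www.google.com/patents/US"


def patent_url(patent_number):
    """Build the page URL by appending a bare US patent number to the prefix."""
    number = str(patent_number).strip()
    # Assumption: input numbers are digits only, like 3930295.
    if not number.isdigit():
        raise ValueError(f"expected a numeric patent number, got {number!r}")
    return BASE_URL + number


print(patent_url("3930295"))
```

Rejecting non-numeric input early is a design choice here; it keeps a malformed line in the input list from silently producing a bad URL.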

Now coming to the data that I would like to be scraped.

There are two different types of data which will go into two different files.

First file:
I would like the Publication Date, Filing Date, and Priority Date for this patent. I have done the scraping myself for around 200,000 cases and know that the table is sometimes slightly different and may be missing one or more of these fields. For example, sometimes the Filing Date may be missing. The scraping algorithm must take this into account. In such a case the Filing Date should be left empty, but all other information should still be grabbed.

For the above example, from the above webpage the following should be written to the output file:

3930295|Jan 6, 1976|Mar 4, 1974|Mar 4, 1974

A different format is also fine, as long as it is usable. Ultimately this file should have exactly as many rows as the number of patents that I give you.
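One possible way to handle the missing-field requirement is sketched below. It assumes the scraper has already pulled whatever labels it found on the page into a dictionary; the function name `format_dates_row` and the dictionary layout are illustrative assumptions, not part of the request.

```python
# Column order follows the request: Publication, Filing, Priority.
FIELDS = ("Publication Date", "Filing Date", "Priority Date")


def format_dates_row(patent_number, scraped):
    """Return one pipe-delimited row; any label not found becomes an empty field.

    `scraped` is a hypothetical dict of label -> text as found on the page.
    """
    values = [(scraped.get(label) or "").strip() for label in FIELDS]
    return "|".join([str(patent_number)] + values)


# All three dates present (the example patent, number 3930295).
print(format_dates_row("3930295", {
    "Publication Date": "Jan 6, 1976",
    "Filing Date": "Mar 4, 1974",
    "Priority Date": "Mar 4, 1974",
}))

# Filing Date missing: the field is left empty, the rest is kept.
print(format_dates_row("3930295", {
    "Publication Date": "Jan 6, 1976",
    "Priority Date": "Mar 4, 1974",
}))
```

Because a row is produced even when the dictionary is empty, the output file ends up with one row per input patent, which is what the row-count requirement asks for.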

Second file: If you open the previously mentioned URL and scroll to the bottom of the webpage, you will see a table under “REFERENCED BY”.

I would like to grab all the rows of this table. For the columns, I only need the following: Citing Patent, Filing Date, Publication Date.

For the above example, the following entries should be created:
Patent    Citing Patent   HasStar   Filing date     Publication date
3930295   US4066204       YES       Jun 14, 1976    Jan 3, 1978
3930295   US4342090       YES       Jun 27, 1980    Jul 27, 1982
3930295   US4346874       YES       May 27, 1980    Aug 31, 1982
3930295   US4587703       YES       Jan 16, 1985    May 13, 1986
3930295   US5034802       YES       Dec 11, 1989    Jul 23, 1991
3930295   US6698088       YES       Feb 1, 2001     Mar 2, 2004
3930295   US6864570       NO        Jun 8, 2001     Mar 8, 2005
3930295   US7181835       NO        Jan 15, 2004    Feb 27, 2007
3930295   US7727804       NO        Jun 6, 2007     Jun 1, 2010
3930295   US8318579       NO        Dec 1, 2011     Nov 27, 2012

(The first row is just the header, included for reference.) Again, the exact format of this file can be different. Please note a few things. There are multiple entries in this file corresponding to each patent (or webpage). The number of entries is equal to the number of rows in the REFERENCED BY table. Sometimes the patent numbers have an asterisk at the end of the number. That has to be removed. When there is an asterisk, an indicator (HasStar) should capture that it was there. (In my example above, I write YES or NO. In fact 1 or 0 is preferred.)
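The asterisk rule could be sketched roughly as follows, assuming each table row has already been extracted into plain strings. The helper names `parse_citing_patent` and `citation_row` and the pipe-delimited output are illustrative choices only; HasStar is written as 1 or 0, as preferred above.

```python
def parse_citing_patent(cell_text):
    """Strip a trailing asterisk and report whether one was present (1 or 0)."""
    text = cell_text.strip()
    has_star = 1 if text.endswith("*") else 0
    return text.rstrip("*").strip(), has_star


def citation_row(patent_number, citing_cell, filing_date, publication_date):
    """Build one output row for a single REFERENCED BY table entry."""
    citing, has_star = parse_citing_patent(citing_cell)
    return "|".join([
        str(patent_number),
        citing,
        str(has_star),
        filing_date.strip(),
        publication_date.strip(),
    ])


print(citation_row("3930295", "US4066204 *", "Jun 14, 1976", "Jan 3, 1978"))
print(citation_row("3930295", "US6864570", "Jun 8, 2001", "Mar 8, 2005"))
```

A patent with no REFERENCED BY table simply never calls `citation_row`, so it contributes no rows to the second file.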

For some of the patents (about 20%) the REFERENCED BY table will not exist on the webpage. For those patents there will be no entry in the second file. In an additional 70% of the cases, the number of rows in the REFERENCED BY table will be fewer than 20. In the remaining 10% of the cases, it will be more than 20. The estimated number of rows in the second file is 12 million. (I know this because I have a good estimate of the number of citing patents for each patent.)
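To put these percentages in absolute terms, here is a quick back-of-the-envelope calculation using only the figures quoted above (1.34 million patents, 12 million estimated rows, and the 20% / 70% / 10% split). The function name `workload_estimate` is made up for illustration, and the averages are rough derived figures, not numbers from the request itself.

```python
def workload_estimate(total_patents=1_340_000, estimated_rows=12_000_000):
    """Turn the quoted percentages into patent counts and average row counts."""
    no_table = total_patents * 20 // 100       # pages without a REFERENCED BY table
    under_20 = total_patents * 70 // 100       # tables with fewer than 20 rows
    over_20 = total_patents * 10 // 100        # tables with more than 20 rows
    with_table = under_20 + over_20
    return {
        "no_table": no_table,
        "under_20": under_20,
        "over_20": over_20,
        "avg_rows_per_patent": estimated_rows / total_patents,
        "avg_rows_per_table": estimated_rows / with_table,
    }


for name, value in workload_estimate().items():
    print(name, value)
```

On these figures, roughly 268,000 pages have no table, and the average comes to about 9 rows per patent overall, or about 11 rows per page that does have a table.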

Could you please let me know how much it will cost for this work? To repeat, I need this for 1.34 million patents and there should be two different files created for the two different types of data scraped. Please let me know if you have any questions.

For similar work requirements, feel free to email us at info@webscrapingexpert.com

Comments
  • 5 years ago Henry Mills

    Retrieve business listings from Google Maps of some 12-15 cities with Company Name | Address | Contact information | etc.

    Website: https://maps.google.com/

    How does your pricing structure work?

  • 5 years ago BEATRICE ARNOLD

    I need a database of universities listed in the UK, scraped from Google. What is the cost?

    Please advise.

  • 4 years ago FLORENCE Jackson

    Do you scrape Google Reviews? Please give brief on your capabilities.
