Scrape Data from Google Patents Website
A friend of mine, Alminas, had been in touch with you earlier. I have a request related to his earlier query with a slightly enlarged scope of work.
I need your services to scrape data from the Google Patents website. I have 1.34 million patents, each identified by its patent number, e.g. 3930295.
For the above example, the data to be scraped can be found at the following URL: http://www.google.com/patents/US3930295
(Note that this is “http://www.google.com/patents/US” concatenated to the patent number.)
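As a minimal sketch of the URL rule just described, the Python helper below concatenates the fixed prefix with a patent number. The name `patent_url` and the digits-only check are my own assumptions for illustration, not part of the request.

```python
# Prefix taken from the request; everything else here is illustrative.
BASE_URL = "http://www.google.com/patents/US"


def patent_url(patent_number):
    """Return the page URL for a US patent number given as int or str."""
    number = str(patent_number).strip()
    # Assumption: the input list contains plain numeric patent numbers only.
    if not number.isdigit():
        raise ValueError(f"expected a numeric patent number, got {patent_number!r}")
    return BASE_URL + number


print(patent_url(3930295))
```

For the example patent this prints the URL quoted above.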
Now coming to the data that I would like to be scraped.
There are two different types of data which will go into two different files.
First file: I would like the Publication Date, Filing Date, and Priority Date for each patent. I have done the scraping myself for around 200,000 cases and know that the table is sometimes slightly different and may be missing one or more of these fields. For example, the Filing Date is sometimes missing. The scraping algorithm must take this into account: in such a case the Filing Date should be left empty, but all other information should still be grabbed.
For the above example, from the above webpage the following should be written to the output file:
3930295|Jan 6, 1976|Mar 4, 1974|Mar 4, 1974
A different format is also fine, as long as it is usable. Ultimately this file should have exactly as many rows as the number of patents that I give you.
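To make the missing-field rule concrete, here is a small Python sketch of how one row of the first file could be built. It assumes the scraper has already collected the dates into a dictionary keyed by label; the names `FIELDS` and `dates_row` and the dictionary shape are hypothetical, and the pipe delimiter simply follows the sample row above.

```python
# Column order follows the sample row: publication, filing, priority.
FIELDS = ("Publication Date", "Filing Date", "Priority Date")


def dates_row(patent_number, scraped):
    """Build one pipe-delimited row; missing or None dates become empty strings.

    `scraped` is assumed to map a field label to the text found on the page.
    """
    values = [(scraped.get(field) or "").strip() for field in FIELDS]
    return "|".join([str(patent_number)] + values)


complete = {
    "Publication Date": "Jan 6, 1976",
    "Filing Date": "Mar 4, 1974",
    "Priority Date": "Mar 4, 1974",
}
no_filing = {"Publication Date": "Jan 6, 1976", "Priority Date": "Mar 4, 1974"}

print(dates_row(3930295, complete))
print(dates_row(3930295, no_filing))
```

Because every patent produces exactly one row, even when all three dates are absent, the file keeps one row per input patent.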
Second file: If you open the URL mentioned above and scroll to the bottom of the page, you will see a table under “REFERENCED BY”.
I would like to grab all the rows of this table. For the columns, I only need the following: Citing Patent, Filing Date, Publication Date.
For the above example, the following entries should be created:
Patent   Citing Patent  HasStar  Filing date   Publication date
3930295  US4066204      YES      Jun 14, 1976  Jan 3, 1978
3930295  US4342090      YES      Jun 27, 1980  Jul 27, 1982
3930295  US4346874      YES      May 27, 1980  Aug 31, 1982
3930295  US4587703      YES      Jan 16, 1985  May 13, 1986
3930295  US5034802      YES      Dec 11, 1989  Jul 23, 1991
3930295  US6698088      YES      Feb 1, 2001   Mar 2, 2004
3930295  US6864570      NO       Jun 8, 2001   Mar 8, 2005
3930295  US7181835      NO       Jan 15, 2004  Feb 27, 2007
3930295  US7727804      NO       Jun 6, 2007   Jun 1, 2010
3930295  US8318579      NO       Dec 1, 2011   Nov 27, 2012
(The first row is just the header, included for reference.) Again, the exact format of this file can be different. Please note a few things. There are multiple entries in this file corresponding to each patent (or webpage). The number of entries is equal to the number of rows in the REFERENCED BY table. Sometimes the patent numbers have an asterisk at the end of the number. That has to be removed. When there is an asterisk, an indicator (HasStar) should capture that it was there. (In my example above, I write YES or NO. In fact 1 or 0 is preferred.)
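The asterisk handling described above could look roughly like the following Python sketch. The function names `split_star` and `citation_row` are hypothetical, the pipe delimiter is borrowed from the first file's sample row, and the code assumes the asterisk, if present, is the last non-space character of the citing patent cell.

```python
def split_star(citing_patent):
    """Return (patent number without asterisk, 1 if an asterisk was present else 0)."""
    text = citing_patent.strip()
    has_star = 1 if text.endswith("*") else 0
    # Drop the trailing asterisk and any space that preceded it.
    return text.rstrip("*").strip(), has_star


def citation_row(patent_number, citing_patent, filing_date, publication_date):
    """Build one pipe-delimited row of the second file for one REFERENCED BY entry."""
    number, has_star = split_star(citing_patent)
    return "|".join(
        [str(patent_number), number, str(has_star),
         filing_date.strip(), publication_date.strip()]
    )


print(citation_row(3930295, "US4066204 *", "Jun 14, 1976", "Jan 3, 1978"))
print(citation_row(3930295, "US6864570", "Jun 8, 2001", "Mar 8, 2005"))
```

Calling `citation_row` once per table row yields as many entries per patent as the REFERENCED BY table has rows, and none for a patent whose page has no such table.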
For some of the patents (about 20%) the REFERENCED BY table will not exist on the webpage. For those patents there will be no entry in the second file. In an additional 70% of the cases, the REFERENCED BY table will have fewer than 20 rows. In the remaining 10% of the cases, it will have more than 20. The estimated number of rows in the second file is 12 million. (I know this because I have a good estimate of the number of citing patents for each patent.)
Could you please let me know how much it will cost for this work? To repeat, I need this for 1.34 million patents and there should be two different files created for the two different types of data scraped. Please let me know if you have any questions.