Hit_scraper with the hit export script added, CUZ IT'S MORE CONVENIENT! Here are two guides, one on mturkforum.com and one on mturkgrind.com.
Additionally, there is a script by clickhappier located here that uses scraper's blocklist to block hits on the regular mturk search results interface as well.
v2.0 Major Update
I've been doing a lot of work (with copious help from clickhappier and others), and I figure enough's been done to go for a major version release. You'll find the changelog down below, but a rundown of all the features follows.
What is Hit Scraper WITH EXPORT and why should I download it?
Hit Scraper WITH EXPORT (hereafter referred to as HS) at its core is really just a different way of looking at mturk pages. Its purpose was to take the place of several other scripts people were using every day, and to make a unified, easy-to-understand interface that everyone can use with minimal training. That being said, HS still has a ton of features to enhance your turking and make your life a lot easier.
How do I use HS?
To use HS, you need to visit this URL. Bookmark it so you don't forget. If HS doesn't load right away, try refreshing a few times. If it still doesn't load, there might be an issue and I'll try to see if I can figure it out.
When you get to that page, you'll see the main interface. This should be pre-populated with some default data. You can start right away by clicking "Start", or you can customize it as shown below.
| Option | Default | Description |
|---|---|---|
| Auto-refresh delay | 0 | How many seconds elapse before the page starts scraping again. 0 means manual scrape only. E.g., 10 = scrape 10 seconds after the last scrape finished |
| Pages to scrape | 3 | How many pages you want HS to look at |
| Correct for skips | No | If you have a lot of entries on your blocklist, many scraped hits may get filtered out and leave you with few results. "Correct for skips" will search additional pages to "fill up" your results. If correct for skips is off, it will ONLY search the number of pages you select in "pages to scrape" |
| Minimum batch size | 100 (not specified) | For searching for batches. This does not matter unless you sort by most available. |
| Minimum reward | None | Minimum dollar reward you want HS to show. E.g., 1 = don't show hits under $1; .2 = don't show hits under $0.20 |
| Qualified | Yes if logged in, No if logged out | If yes, only show hits you're qualified for. If no, show all hits regardless of whether you qualify |
| Masters Require | No | If yes, only show masters hits. If no, show all hits |
| Masters Show | Show | If set to "Show", it will show both masters and non-masters hits (not applicable if you don't have masters and have "qualified" checked). If set to "Hide", it will remove masters hits from the results |
| Sort types | Latest | Latest sorts by time created, newest first. Most available is by number of hits available, most first. Reward is by monetary reward, highest first. Title is alphabetical by title, A first |
| Invert | No | Reverses the order of the sort type. Latest = oldest hits first; Most available = fewest hits available first; Reward = lowest reward first; Title = Z first |
| New HIT Highlighting | 300 | Hits that are new to the scrape show up in bold. This number determines how long they will remain that way, in seconds. |
| Sound on new hit | No | Play a sound when a new hit is discovered. The sound is only played once for each "screen" of new hits. For example, if two new hits are found in one scrape, the sound will play once. If one of the hits goes away, but the other remains, and it's still new based on the New HIT Highlighting number, the sound will not play because it already has. |
| Ding | Ding | Which sound you want to hear: the old-style "Ding", or the new-style "Squee" (best pony approved) |
| Sort by TO | No | Sorts hits by TO with lowest numbers on top, highest numbers below them, and "No TO" requesters at the very bottom. I've tried altering this order, but it doesn't seem to work properly. If I get it working, I'll add it into 2.0.1 |
| Minimum TO | None | Allows you to set a minimum TO threshold (0-5). Any hits with a "Pay" TO below that threshold will be hidden. You can click on the "Show hits below TO threshold" button to see them. This button only appears if you're using this option. |
| Hide no TO | No | Hides requesters who do not have a TO (not recommended) |
| Search Terms | None | Allows you to search mturk for given terms. This is the same as searching the mturk interface. All results will contain one or more of your terms. |
| Use includelist | No | Allows you to only show requesters on your "include list". You must have an include list set before using this option or you will get no results. It will do normal searches, but any requester NOT on your include list will be ignored. |
| Use blocklist | Yes | Enables/disables the blocklist. If you are not using the blocklist, any hits that WOULD have been blocked are outlined in red. |
| Start | Button | Starts scraping |
| Hide Settings | Button | Hides everything above the buttons to give you more room. It's a toggle, so clicking it once will hide, and clicking it again will show. |
| Edit Blocklist | Button | Opens the blocklist for manual editing if you need to remove a name or something. Blocklist and include list items are delimited by the ^ symbol. |
| Edit Includes | Button | Opens the include list for manual editing to add or remove requesters. Blocklist and include list items are delimited by the ^ symbol. |
| Show hits below TO threshold | Hidden Button | See Minimum TO |
| Stopped | Status message | Shows you the status of hit scraper: whether it's stopped, scraping, running, waiting, etc. |
| Status messages: None | Status message | Very "dumb" status message indicator that tries to shed some light on why some things work and others don't, and on why hit scraper's doing something it "shouldn't be". |
Some of the elements in the settings list have informative mouseover text as well.
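To make the "New HIT Highlighting" and "Sound on new hit" rules above concrete, here is a minimal sketch of that bookkeeping. It is not the actual HS code: `createNewHitTracker`, the hit ids, and the choice to remember a hit forever once seen are all assumptions made for illustration.

```javascript
// Hypothetical tracker; names are illustrative, not taken from the HS source.
function createNewHitTracker(highlightSeconds) {
  const firstSeen = new Map(); // hit id -> time (ms) the hit was first scraped

  // Process one scrape: `hitIds` are the hits found, `nowMs` is the current time.
  return function processScrape(hitIds, nowMs) {
    let discovered = 0;
    for (const id of hitIds) {
      if (!firstSeen.has(id)) {
        firstSeen.set(id, nowMs);
        discovered += 1;
      }
    }
    // A hit stays bold until the highlighting window has elapsed.
    const bold = hitIds.filter(
      (id) => nowMs - firstSeen.get(id) < highlightSeconds * 1000
    );
    // One sound per scrape, and only if something was seen for the first time.
    return { bold, playSound: discovered > 0 };
  };
}

const processScrape = createNewHitTracker(300);
const first = processScrape(["A", "B"], 0);       // two new hits, one sound
const second = processScrape(["B"], 60000);       // B still bold, no sound
const third = processScrape(["B", "C"], 400000);  // B expired, C is new
```

In this sketch the sound depends only on hits that were never seen before, which is why a hit that is still bold from an earlier scrape does not trigger it again.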
The hit table comes under the status information. It's laid out like so:
| Column | Links to | Description | Mouseover |
|---|---|---|---|
| Requester | Requester Page | Shows the requester name and links to their page. R and T buttons allow for blocking Requester and Title respectively | None |
| Title | Hit preview page OR requester page | Shows the hit preview page if one can be created/viewed, OR the requester page if one cannot. Will note if the requester link is substituted. VB and IRC buttons open the hit exporter for forums and IRC respectively | Description of hit |
| Reward | None | Shows how much the hit pays | None |
| HITs Available | None | Shows how many hits are available at the time the page was scraped | None |
| TO pay | Requester TO page | Shows the TO value for "pay" for that requester | Shows all TO ratings, number of reviews, and number of TOS flags for that requester |
| Accept HIT | Requester Preview and Accept (PANDA) page OR requester page | Similarly to the "title", it shows the panda link OR the requester page. See "title" to know if the requester link is substituted | None |
| M? | None | N means a non-masters hit, Y means a masters hit | Shows all qualifications for the hit |
| R | HitDB search for requester OR nothing | If green, you've done a hit that matches that requester name, click it to view. If red, you haven't, and clicking does nothing | None |
| T | HitDB search for title OR nothing | If green, you've done a hit that matches that title, click it to view. If red, you haven't, and clicking does nothing | None |
| Not Qualified | None | Shows hits you are not qualified for. Only shows up if there are non-qual'd hits | None |
v2.1: Fixed a bug with the IRC export
v2.2: Added failover to TO with IRC export
That should give you the info you need to get started. Below is the changelog from v1.6 to v2.0:
Changelog:
- Added in "use blocklist" feature to use/ignore blocklist
- Fixed a MAJOR bug that's been around forever where using HS when logged out resulted in unpredictable hit linking. Hopefully that's fixed for good
- Separated out the new "Squee" and the old "Ding" so that people can pick-and-choose
- Added in "save state", where HS will remember the values you had entered last time you hit "start", and bring them up next time you load. Note: You may need to set up your defaults on your first run
- Not using blocklist results in hits that would have been blocked being highlighted in red
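As a rough illustration of the "use blocklist" behavior, the sketch below parses a ^-delimited list and either drops blocked hits or keeps them with a flag that the interface could use for the red outline. This is an assumption-laden sketch, not the HS source: the function names, the `wouldBeBlocked` field, the exact-match rule, and the sample requesters are all hypothetical.

```javascript
// Split a "^"-delimited list into trimmed, lower-cased entries.
function parseList(text) {
  return text
    .split("^")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry.length > 0);
}

// Assumption: a hit is blocked when its requester or title matches an entry exactly,
// ignoring case.
function isBlocked(hit, blocklist) {
  return (
    blocklist.includes(hit.requester.toLowerCase()) ||
    blocklist.includes(hit.title.toLowerCase())
  );
}

// Blocklist on: blocked hits are dropped.
// Blocklist off: they are kept and flagged so they can be outlined in red.
function applyBlocklist(hits, blocklistText, useBlocklist) {
  const blocklist = parseList(blocklistText);
  const result = [];
  for (const hit of hits) {
    const blocked = isBlocked(hit, blocklist);
    if (blocked && useBlocklist) continue;
    result.push({ ...hit, wouldBeBlocked: blocked });
  }
  return result;
}

// Made-up sample data.
const sampleHits = [
  { requester: "Acme, Inc.", title: "Transcribe a receipt" },
  { requester: "Some Lab", title: "Short survey" },
];
const storedList = "acme, inc.^Another Requester";
const hiddenView = applyBlocklist(sampleHits, storedList, true);   // only "Some Lab" remains
const flaggedView = applyBlocklist(sampleHits, storedList, false); // both remain, first is flagged
```

Splitting on ^ rather than on commas means a requester name that itself contains a comma survives the round trip intact.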
Major thanks goes out to clickhappier, my main bug tester/fixer/motivator/pain-in-my-side-trying-to-get-me-to-fix-stuff, for keeping me working on this even when I was ready to quit turking altogether, and for learning along with me about all of this stuff :). Also Kerek, who contributed the code to hide the settings, as well as a lot of the back-end stuff.
Older updates and such are below.
v1.6:
- Ponified the "ding" (most important)
- Changed "Sort Types" to a dropdown instead of radio buttons
- Split "masters" into "require" and "show". "Require" restricts results to Masters hits only (same as checking the box on mturk). "Show" lets you show or hide masters hits if they come up in the search (for when logged out, thanks to Kerek for the suggestion)
- Added "hide" button to hide the interface (thanks to Kerek for the code and suggestion)
- Added a very preliminary sort by TO, due to extremely popular demand. See below for notes
- Added a very preliminary minimum TO, due to extremely popular demand. See below for notes.
Notes:
- For the "sort by TO", again it's very preliminary. It will sort the table as the TO results come in, which can result in the table changing after it's been populated if your system is slow. Keep that in mind, there's no way around that if you want to sort by TO. It also places the requesters with no TO data on the bottom. I couldn't put them on the top for some reason, so they're down there.
- For the "Minimum TO": It operates as you'd expect. Put in a number between 0 and 5, and it will remove all items which are below that number, except for "No data" or "TO down", which are not removed. If you want to see the items again, click the "Show hits below TO threshold" button. That will bring them back, but won't hide them again. I didn't think that would be necessary.
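The two TO rules can be sketched roughly as follows. This is a simplified illustration, not the HS code: it assumes every hit already carries a `pay` value (the requester's TO pay rating, or `null` for "No data" / "TO down"), and `sortByTo` and `splitByMinimumTo` are hypothetical names.

```javascript
// `pay` is the requester's TO pay rating (0-5), or null when there is no TO data.
function sortByTo(hits) {
  return [...hits].sort((a, b) => {
    if (a.pay === null && b.pay === null) return 0;
    if (a.pay === null) return 1;  // no-TO requesters sink to the bottom
    if (b.pay === null) return -1;
    return a.pay - b.pay;          // lowest rating first, per the settings table
  });
}

// Hits below the threshold are set aside rather than discarded, so a
// "Show hits below TO threshold" button can bring them back.
function splitByMinimumTo(hits, minimumTo) {
  const shown = [];
  const hidden = [];
  for (const hit of hits) {
    if (hit.pay !== null && hit.pay < minimumTo) hidden.push(hit);
    else shown.push(hit); // hits without TO data are never hidden by this rule
  }
  return { shown, hidden };
}

// Made-up sample data.
const toSample = [
  { requester: "A", pay: 4.2 },
  { requester: "B", pay: null },
  { requester: "C", pay: 2.1 },
  { requester: "D", pay: 3.5 },
];
const sortedHits = sortByTo(toSample);               // C, D, A, B
const { shown, hidden } = splitByMinimumTo(toSample, 3); // shown: A, B, D; hidden: C
```

Because TO data arrives asynchronously in the real script, a re-sort like this would have to run each time a rating comes in, which matches the table "changing after it's been populated".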
There are still apparently some issues with items duplicating and such. I can't seem to reproduce these issues, so I can't test for them. They're fringe cases regardless as far as I can tell. They shouldn't really matter to the majority of people, so I'll solve them as they come up but I'm not gonna spend a huge amount of time troubleshooting.
v1.5:
- Added a new column, "M?", which shows if hits are Masters or not (more useful for those of us with Masters, but useful for both)
- Added qual listing, mouseover the "M?" column to see the quals for that hit.
- Added TO listing, mouseover TO link to see all the ratings
- Minor tweaking, some verbiage fixes, stuff you probably won't see/notice
v1.5.1: Fixed "notqualified" link processing
v1.5.2: Hopefully fixed the firefox issue, changed the way values are stored/recalled. This will have the unfortunate effect of clearing everyone's blocklist, but hopefully this will not change in the future.
v1.5.3: Fixed storage to properly handle requesters with commas in their names, I didn't realize.
v1.5.4: Updated to fix it not scraping when you're logged out, because apparently that's a thing people do.
v1.5.5: Updated to fix 1.5.4 again, because Amazon changed the way links work. Now, if you're not qualified (i.e., logged out), clicking the title (and/or exporting) should give you a link to the requester page, instead of a non-functioning link.
v1.5.6: Fixed 1.5.5 again, hopefully now it'll work when you're logged out again. Made it so "qualified" is not default when not logged in, hopefully fixed "duplicate" issue (where a hit will sometimes show up twice)
v1.5.7: https://www.youtube.com/watch?v=ipADNlW7yBM
v1.5.9: added in IRC exporter based on clickhappier's and cristo's work, changed colors of M? columns to make a little more sense.
v1.4:
- Ability to block by title and requester (so you can block individual hits you've done)
- Ability to view only certain requesters with Include list (Must add requesters to list individually for the moment, if there's a desire I'll add in a button like the blocklist)
- Ability to make scraper make a "ding" noise when it finds new work.
- Tied in with HitDB so clicking the R/T at the end will show you the work you've done for that requester (only for green items, might not work on firefox)
- Added A-Z sort
- Added inverse sort
- Added checkbox for "Correct For Skips" (mouseover the checkbox to see what it does, or try it out! On by default, will change to off by default if necessary).
- Re-organized a bit of the header section with some | characters to separate things
- Added some helpful "status" messages to explain some things a bit (e.g., why it's scraping more than the pages you told it to)
- Moved the status messages to below the header
- Made it pull the blocklist every time you run it so you can have multiple instances and they'll work together properly.
v1.4.1: Initial theming support. Put all the color values up at the top of the code, with descriptors, so they can be changed easily
v1.4.2: Nothing really. Just a bit for some of our...friends...You shouldn't see anything different really.
v1.4.3: Reverted v1.4.2
v1.4.4: Added another descriptive status message.
Older update logs:
Updated to fix an issue with the export not getting the proper quals for the proper hit.
Updated so it wouldn't clobber the normal hit export script
Updated to fix a bug, and now the requester list is case insensitive.
Added description as mouseover text for title link. Hold the mouse over the title to see it.
1.3.0.10: Added ability to block requesters dynamically, and revert to the blocklist set in the code. Default blocklist contains:
"oscar smith", "Diamond Tip Research LLC", "jonathon weber", "jerry torres", "Crowdsource", "we-pay-you-fast", "turk experiment", "jon brelig"
To clear any of those from the default, just remove them from the code (line 18, remove the " marks and comma as well). To add a requester to the block list, click the "BLOCK" button next to their name. To reset to default, click "Reset blocklist" at the top.
1.3.0.11: Added a line (line 24) to change the hit export to text symbol to whatever you'd like.
1.3.0.12: Changed such that the "reset blocklist" is now a confirm dialog in case you misclick.
1.3.0.13: Fixed an error with hits that have no TO data.
1.3.0.14: Initial method of editing the existing blocklist to add/remove requesters manually. I'd like a better way of doing it, expect that to be coming.
1.3.0.15: Added "hits available" to default template per request.
1.3.1.0: Major release because of all the changes so far. This one has logical updating of the block list. What's that mean? It means when you click "Edit Blocklist" you'll get a textarea you type in. Remove requesters, add requesters, whatever you'd like. Then just click save and it saves.
1.3.1.1: Updated with Miku's new API link.
1.3.1.2: Fixed correct for skips to accurately reflect the pages you select.
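For readers wondering how "correct for skips" relates to "pages to scrape", here is one way the idea could work. It is only a sketch under stated assumptions, not the HS implementation: it assumes 10 results per search page, treats "filled up" as having as many unblocked hits as the selected pages would normally hold, and uses hypothetical `fetchPage` and `isBlocked` callbacks in place of real page requests.

```javascript
const RESULTS_PER_PAGE = 10; // assumption, not taken from the HS source

// fetchPage(n) returns the hits on page n (an empty array past the last page).
// isBlocked(hit) says whether the blocklist would skip that hit.
function scrape(fetchPage, isBlocked, pagesToScrape, correctForSkips) {
  const target = pagesToScrape * RESULTS_PER_PAGE;
  const kept = [];
  let pagesScraped = 0;
  while (true) {
    const hits = fetchPage(pagesScraped + 1);
    if (hits.length === 0) break; // ran out of search results
    pagesScraped += 1;
    for (const hit of hits) {
      if (!isBlocked(hit)) kept.push(hit);
    }
    const reachedPageLimit = pagesScraped >= pagesToScrape;
    const filled = kept.length >= target;
    // Without correction, stop at the page limit. With it, keep going until filled.
    if (reachedPageLimit && (!correctForSkips || filled)) break;
  }
  return { hits: kept, pagesScraped };
}

// Fake search: 5 pages of 10 hits, with every even-numbered hit blocked.
function fakeFetchPage(page) {
  if (page > 5) return [];
  return Array.from({ length: RESULTS_PER_PAGE }, (_, i) => ({
    id: (page - 1) * RESULTS_PER_PAGE + i,
  }));
}
const blockEven = (hit) => hit.id % 2 === 0;

const plainRun = scrape(fakeFetchPage, blockEven, 2, false);    // 2 pages, 10 hits kept
const correctedRun = scrape(fakeFetchPage, blockEven, 2, true); // 4 pages, 20 hits kept
```

With half the hits blocked, the corrected run has to read twice as many pages to return the same number of results, which is the kind of situation the extra status message about scraping more pages than requested would explain.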