Outsourcing Data Annotation? Watch out for these traps

Watch out for these traps when choosing a Data Annotation partner for your Machine Learning projects.

Traffic - Object Detection
Bounding Box labels on cars and pedestrians

Outsourcing Data Annotation is a proven way for teams to boost productivity, decrease development time and stay ahead of the competition. If you are looking to outsource your Data Annotation work and are concerned about picking the right partner, you are not alone.

Data is key when it comes to training an ML/AI model. When building computer vision models, knowing what data to annotate and how to annotate it can vastly improve the performance of the trained model.

Good data makes a world of difference.

Why outsource at all?

There are many reasons why outsourcing Data Annotation might be good for you. Here are some of the big ones: 

Lower costs

Labor costs can vary wildly between countries. By outsourcing, you take advantage of a global pool of talent while keeping your costs low. You also do not have to worry about overhead costs such as renting office space and procuring equipment.

Project Management

Getting huge volumes of data annotated correctly and without delays requires a lot of experience and expertise. You already have a lot to worry about in your own business. Why add another potential pain point?

Access to a Larger Talent Pool

By outsourcing your data annotation work, you get access to experienced, high-quality candidates from all over the world. Regions like India and South America offer talented young people waiting for the right opportunity.

Scale

While you may be able to get small volumes of data labeled in-house, if you want to scale out and get your ML models working really well, outsourcing your data annotation work might be the only option for you.

Crowdsourcing your Data Annotation work

Crowdsourcing solutions like Amazon Mechanical Turk allow you to distribute your data annotation work to the public, mostly part-time workers trying to supplement their regular income. While these are cheaper than managed Data Annotation solutions, there are a few things to consider before you take that route.

Label Mismatch
Most crowdsourced platforms have a high turnover rate among their workers. When many workers each label a small volume of data, there is a high probability of label mismatch, because each worker interprets the instructions slightly differently. Inconsistent labels are one of the biggest reasons why an ML model might under-perform.
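
One common way to put a number on label mismatch is to have two annotators label the same items and compute their agreement. The sketch below uses Cohen's kappa, which corrects raw agreement for the agreement expected by chance. It is only an illustration: the function name and the sample labels are made up for this example and are not taken from any particular platform.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both chose the same class.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's class frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical class for every item
    return (observed - expected) / (1 - expected)


# Hypothetical class labels from two annotators on the same six objects.
annotator_1 = ["car", "car", "pedestrian", "car", "pedestrian", "car"]
annotator_2 = ["car", "pedestrian", "pedestrian", "car", "pedestrian", "car"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa close to 1 means the annotators agree far more often than chance would predict; values that drift lower as new workers join a project are a hint that the instructions are being read inconsistently.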

Data Annotation Instructions
With a crowdsourced workforce, it can be difficult to properly explain your requirements for a data annotation project.

Quality Assurance
Because the workforce is composed of independent workers, it is very difficult to ensure a required level of label quality. Managed solutions usually offer guaranteed Quality Assurance, which can be the difference between a good dataset and an average one.

Security
If your data is sensitive, it is legally very difficult for a crowdsourcing platform to execute NDAs with all of its workers. It’s also feasible for potential competitors to infiltrate a crowdsourced platform and get an idea of what is being annotated in the industry. Security is a hot topic, and if you are concerned about the safety of your data, crowdsourcing might not be suitable for you.

Having said that, if your data volumes are low, your thresholds for label quality are not that high, and your data is not very sensitive, crowdsourcing might be a good solution for you, especially since it is usually the cheapest option available.

Managed Solutions are more expensive than crowdsourcing, but might be worth the extra penny.

Managed Data Annotation Solutions

If your requirements are not suitable for crowdsourcing, Managed Data Annotation is the way to go. You do not need to worry about security as you can tie down all stakeholders with NDAs. You do not need to worry about label quality because you can hold the vendor responsible by writing a minimum Quality requirement into an SLA. And you do not need to worry about explaining your requirements to new labelers because the same team of annotators works on your entire project.

Any partner you choose should allow you to do all of the above. In addition, keep the following in mind when selecting a Data Annotation vendor.

Data Annotation Tool

What annotation tool does the vendor use? The best annotation tools are user-friendly, minimize human involvement, and maximize efficiency while maintaining data quality.

Our own Annotation tool here at Mindkosh makes it easy for us to manage projects and monitor labeler productivity, while providing simple tools to label the data and flag issues with labeled data.

Communication

One of the most common problems that beset a data annotation project is unclear communication. This could involve the requirements and instructions for label classes, or further clarifications when labelers face problems. Does the vendor talk to you often? How does it facilitate communication between you and the labelers?

A good way to solve this problem can be to maintain Slack channels with all stakeholders as members. This is the approach we adopt here at Mindkosh.

Mindkosh annotation tool workspace
Visionkosh, our internal Data Annotation tool for images, keeps communication lines open between all stakeholders.

Quality Assurance

When looking at a company's expertise, make sure that they will be able to do the job right the first time. How a vendor performs its QA is a very good indication of whether it will be able to deliver your labeled data properly. How is the data reviewed? How is the quality determined?
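
For bounding box projects, one way a review step might measure quality is to have a reviewer correct a sample of boxes and compare them with the labeler's originals using Intersection over Union (IoU). The sketch below is an assumption about how such a check could look, not a description of any vendor's actual QA process; the function names, the 0.8 threshold, and the sample boxes are all hypothetical. Boxes are given as (x_min, y_min, x_max, y_max) in pixels.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def review_pass_rate(labeled, reviewed, min_iou=0.8):
    """Fraction of labeled boxes that match the reviewer's box closely enough."""
    if len(labeled) != len(reviewed) or not labeled:
        raise ValueError("Need one reviewed box for every labeled box")
    passed = sum(iou(l, r) >= min_iou for l, r in zip(labeled, reviewed))
    return passed / len(labeled)


# Hypothetical sample: three boxes drawn by a labeler and corrected by a reviewer.
labeler_boxes = [(10, 10, 110, 110), (50, 50, 150, 150), (0, 0, 40, 40)]
reviewer_boxes = [(10, 10, 110, 110), (60, 60, 160, 160), (0, 0, 40, 44)]
rate = review_pass_rate(labeler_boxes, reviewer_boxes, min_iou=0.8)
print(f"Pass rate on reviewed sample: {rate:.0%}")
```

A pass rate like this, measured on a random sample of each delivery, is the kind of number you can ask a vendor to report and to commit to in an SLA.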

Scalability

Projects are not set in stone. You might need more data annotated than you originally anticipated. Your service provider should be able to scale up, without sacrificing the quality. Just like there might be a need to scale up, you might also want to scale down. Your service provider should give you control over the scale.

We recently worked on a project in the autonomous industry, where the requirements were fuzzy. The customer needed to scale up labeling close to their demos and scale down at other times. We helped them label hundreds of thousands of images at varying scales over a period of a few months.

Security

One of the biggest reasons companies feel reluctant to outsource is security. Your vendor should have proper IT processes in place to make sure there are no data leaks in the organization. Ask them how they actually label data: who comes into contact with it, in what ways it could potentially be leaked, and whether they keep backups of the data.

If you have to comply with certain standards within your business, such as GDPR or PCI DSS, you might have specific requirements about where your data lives. Your vendor should be capable of setting up their data annotation tool in a variety of architectures.

We recently worked on a data annotation project for a partner who could not allow their data to be moved out of their data centers. We worked with the partner to set up our tool on their server, and let our labelers access it via Remote Desktop. In addition, we put our labelers behind a VPN with strict network requirements to add an extra layer of security.

Mindkosh Cost Estimator Tool
You can use our online Cost Estimator tool to estimate your project costs.

Pricing

Pricing for data annotation projects can be complicated. The amount of effort required can vary a lot between projects, which is why it is hard to accurately predict the cost of a project beforehand. The two most common pricing models are the following.

Per data-point model - You pay a fixed amount of money for each label and each image. If you know your requirements beforehand and do not expect the amount of work to vary wildly between different data-points, this can be a good option, because it tells you exactly how much your project will cost you.

You can get a general idea about your project costs by entering some basic information into our Cost Estimator Tool.

Per hour model - You pay for the amount of work done by the vendor on your project. The vendor will usually give you quotes for the per-hour rate as well as the expected average time to label each data-point. This can be a good idea if your project is long and your requirements are not set in stone.
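
To compare the two models for a given project, a rough back-of-the-envelope calculation can help. The sketch below assumes the per data-point price is a fixed amount per image plus a fixed amount per label, and that the per-hour quote comes with an average labeling time per image. All prices, rates, and volumes are invented for illustration and are not real quotes.

```python
def per_datapoint_cost(num_images, labels_per_image, price_per_image, price_per_label):
    """Total cost when paying a fixed price for each image and each label."""
    return num_images * (price_per_image + labels_per_image * price_per_label)


def per_hour_cost(num_images, avg_seconds_per_image, hourly_rate):
    """Total cost when paying for the time spent labeling."""
    hours = num_images * avg_seconds_per_image / 3600
    return hours * hourly_rate


# Hypothetical project: 10,000 images with about 5 bounding boxes each.
fixed = per_datapoint_cost(
    num_images=10_000, labels_per_image=5, price_per_image=0.02, price_per_label=0.03
)
hourly = per_hour_cost(num_images=10_000, avg_seconds_per_image=90, hourly_rate=6.0)

print(f"Per data-point model: {fixed:.2f}")
print(f"Per hour model:       {hourly:.2f}")
```

The per-hour estimate is only as good as the average time per image: if the images turn out to be denser or harder than the sample used for the quote, the hourly total rises while the per data-point total stays fixed.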

Outsourcing your Data Annotation for the first time can feel daunting because of all the different aspects you have to keep in mind. The most important quality of a good Data vendor is that it will help you throughout this process. We hope that this article helped quell some of your doubts.

We are always happy to hear from you. If you have any questions or are feeling uneasy about your data annotation project, reach out to us: write me an email at psingh@mindkosh.com, chat with us through the chat-bot on Mindkosh.com, or give us a call at the numbers given on the contact page!

