Web crawling dilemma: Build or Buy?

The last couple of years have been the years of the Data Revolution. While 2015 set the platform, 2016 saw intrigue and early adoption, and 2017 has started to see Data Analytics spread from Enterprises to SMBs and Startups. The year ahead and the one after it will see it reach the last mile. Every business today is thinking about leveraging data to get ahead, and if the business needs to crawl web data, then the ‘build vs buy’ question for a web crawling solution is virtually a no-brainer.

Crawled web data has become a critical component of big data analytics operations, from Enterprises to Startups. Crawling the web, then cleaning and structuring the data at scale, poses its own complex challenges from both legal and technical perspectives.
Many would argue that web crawling is an illegal and unethical practice, but that is a topic of discussion for another day (a US court just allowed a Startup to crawl LinkedIn’s publicly available data).
Technically, developing crawling capabilities internally and maintaining them is not as easy as it sounds. Unless you’re doing it at a massive scale (at least gigabytes of data every day), building your own web data crawling solution is both inefficient and not cost effective.

 

Building your own Web Crawling Solution

You need to calculate the true cost of ownership of developing and maintaining an internal web crawling solution before you start building one.

  • Development

In a world driven by technology, where more and more people and companies are contributing to open source, the temptation to develop a proprietary web crawling solution is hard to resist. Building a crawling framework from scratch or modifying an existing solution means hiring at least one full-time employee or diverting an existing employee from current projects. The goal is to deliver a stable and robust solution with comprehensive coverage, granularity, and scalability; this usually takes a couple of months and represents a substantial cost.

  • Maintenance

The Internet keeps changing, and web crawlers need constant updates. Requirements keep changing, new data sources keep getting added, and the crawlers have to be constantly redesigned to meet your specific needs. A full-time person is required to maintain the crawling infrastructure.

  • Computing Infrastructure

The crawler has to run on a dedicated machine 24/7. The crawled data has to be stored and crunched, and the data or analytics have to be made highly available. Depending on your requirements, this can easily set you back hundreds of dollars.

  • Scaling

If you’re not in the web crawling business, chances are that your crawlers were built in a way that makes them hard to scale. With time, you’ll have to add more crawlers, add more sources to crawl, add more data points per crawl, filter content, and the list goes on. Servers and databases have to be replicated, and more processes have to be scheduled and automated. Data cleaning and structuring alone is 70% of the work. Indexing and data backups add to the job. Over time, you’ll need to hire even more developers to build a more robust solution that can keep up with the dynamic nature of the web.
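To make the cost of ownership concrete, here is a rough sketch of how the four items above could be added up and compared against a subscription. Every figure and name in it (the salary, the build time, the infrastructure bill, the subscription price) is a made-up placeholder for illustration, not a quote from any vendor or a benchmark; plug in your own numbers.

```python
from dataclasses import dataclass


@dataclass
class BuildCosts:
    """Hypothetical inputs for an in-house crawler; all values are assumptions."""
    developer_monthly_salary: float  # cost of one full-time developer per month
    initial_build_months: float      # time to reach a stable first version
    maintainer_fraction: float       # share of a full-time person spent on upkeep
    infrastructure_monthly: float    # servers, storage, and backups per month


def build_tco(costs: BuildCosts, horizon_months: int) -> float:
    """Total cost of building and running the crawler over the horizon."""
    development = costs.developer_monthly_salary * costs.initial_build_months
    maintenance = (
        costs.developer_monthly_salary * costs.maintainer_fraction * horizon_months
    )
    infrastructure = costs.infrastructure_monthly * horizon_months
    return development + maintenance + infrastructure


def buy_tco(subscription_monthly: float, horizon_months: int,
            setup_fee: float = 0.0) -> float:
    """Total cost of buying data or API access over the same horizon."""
    return setup_fee + subscription_monthly * horizon_months


if __name__ == "__main__":
    # Placeholder numbers only: two months to build, one full-time maintainer.
    costs = BuildCosts(
        developer_monthly_salary=6000.0,
        initial_build_months=2,
        maintainer_fraction=1.0,
        infrastructure_monthly=300.0,
    )
    print("Build:", build_tco(costs, horizon_months=12))
    print("Buy:  ", buy_tco(subscription_monthly=1000.0, horizon_months=12))
```

The sketch deliberately leaves out scaling costs (extra developers, replicated servers), so for a growing crawl it will tend to understate the build side.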

 

Buying Crawled Data

It makes sense to rely on a vendor for efficient and scalable web data acquisition. Still, the cost could remain prohibitive even if it’s more affordable than developing the capability internally.

  • Web crawling Service Providers

There are many companies that provide custom web crawling services. Make sure that you decide on the value of the data before agreeing on a price for the crawler. You will have the option to buy the web crawling scripts and then use and maintain them yourself, or to just buy the data. Also make sure that you get a structured stream of data delivered to your database or application.

  • Datasets

Market research firms, data providers, and web crawling service providers also sell crawled datasets to companies and individuals.

  • API Data Access

Using an API enables you to consume custom data streams and filtered data points. The API provider has an engineering team that develops the crawlers and maintains their stability and coverage. API service providers typically crawl high-demand data at scale and sell it to multiple customers, thereby reducing prices. There are tiers with segmented data access or pay-per-use plans, which enable you to choose a plan that you can afford.

  • Pricing

Depending on the complexity of the scraping and the depth of the data, datasets may be cheap or costly. The seller might offer upfront one-time pricing, annual charges for API access, or a pay-per-use plan.

  • Support

Many companies endure the painful process of selecting the right web crawling service – only to discover the cost of support and inadequate response time make the entire endeavor impractical.

 

StartupFlux API

Here’s what you can do with the StartupFlux APIs (demo coming soon):
Build powerful applications or integrate StartupFlux into your system, processes, web and mobile applications with the REST API. The StartupFlux API is a read-only RESTful service that enables developers to leverage the same data that powers https://startupflux.com

API Company Endpoint

  • 40+ data points extracted and analyzed
  • Search by Keywords, Company name, and URL
  • Search by Technology Stack
  • Filter By Business Analytics (Business Model, Value Chain, Business Lifecycle, Customer Category, Product type)
  • Narrow Down Results by Industry, Sub-sector, location, employee count, and Funding Details
  • Sort by Web Ranking, Social Stats, Company Age
  • Benchmark Ratings (Growth Score, Sustainability Score, Market momentum, Competitive Index)
  • Find Media mentions and Job postings
  • Find Market Sentiment of Companies
  • View Social Media Engagement and Stats
  • Keep track of Blogs a company is publishing
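As a rough illustration of what searching, filtering, and sorting against a read-only REST endpoint like this might look like, here is a minimal sketch that only builds the request URL. The base URL, the `/companies` path, and every parameter name (`q`, `industry`, `location`, `sort`) are hypothetical placeholders chosen for this example; they are not the documented StartupFlux API, so check the actual API reference before relying on any of them.

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder host and path: NOT the real StartupFlux endpoint.
BASE_URL = "https://api.example.com/v1"


def company_search_url(
    keywords: str,
    industry: Optional[str] = None,
    location: Optional[str] = None,
    sort_by: Optional[str] = None,
) -> str:
    """Build a GET URL for a hypothetical company search endpoint.

    All query parameter names here are assumptions made for illustration.
    """
    params = {"q": keywords}
    if industry:
        params["industry"] = industry  # narrow down results by industry
    if location:
        params["location"] = location  # narrow down results by location
    if sort_by:
        params["sort"] = sort_by       # e.g. web ranking or company age
    return f"{BASE_URL}/companies?{urlencode(params)}"


if __name__ == "__main__":
    print(company_search_url("web crawling", industry="analytics",
                             sort_by="web_ranking"))
```

Because the service is described as read-only, a client would only ever need GET requests like the one this URL represents.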

API Content Endpoint

  • Extracted & Analyzed from 400+ sources daily
  • Search by Keywords, Entities (People & Organizations) & Topics
  • Filter By Concepts (Acquisition, Funding, PR, Blog Post)
  • Narrow Down Results to a specific Location
  • Read complete news/articles without leaving the tab
  • Stay updated on the topics, categories, companies that matter to you
  • Filter Stories by Sentiment towards People & Organizations
  • View Social Media Engagement, Hashtags, and Influencers
  • Keep track of News outlets, Blogs & Writers to see who is publishing what

 

Interested in Building Amazing apps with our data?
Need specific Web Crawlers?

Write to [email protected]