Sunday, January 30, 2022 // Contact: Bob Bragg-IG //Weekly Sponsor: T&R
Web Scraping: Gather that data 101
The internet is a sea of data, and we can go about our travels picking up pieces of information by hand (copy & paste) or with a good old-fashioned fishing trawler (a scraper). The keys to data collection are speed, accuracy, and quantity. Is there a learning curve? Sure, but take a class (Intro to Web Scraping w/ Python) and/or strap yourself in front of "YouTube University". When it comes to Python coding tutorials you're bound to hit library dependency issues (GitHub comments and Google are golden), so be patient, stay humble, and ask questions. Browser extensions and premade extractors are available too - those are options. This is a skill most people can learn; it will save you hours of data collection and let you build a scalable solution.
Web scraping is a method of extracting data from structured sources. You'll receive a large amount of raw information that then needs to be parsed and formatted.
For whatever reason, you need data. Cool story - how do I get it? Most of you will open a browser, find what you want, and copy & paste. That works, but it's repetitive and time-consuming, and you'll probably get bored out of your mind. In my opinion that's fine for small, quick projects. For multi-day tasks, though, an automated solution makes more sense.
We briefly discussed that automation is "the way", but what are we after? Having a series of reproducible processes is good practice in general. Once you establish the need for a scraper, you'll focus on specific structured elements at the place of extraction. Depending on the method, you can put it on a schedule - set it, forget it, and collect.
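A minimal "set it, forget it" sketch of that schedule idea, using only the standard library. The `fetch_prices` job below is a hypothetical stand-in for a real scraper; in practice you'd more likely hand the job to cron or a cloud scheduler.

```python
import time

def run_on_schedule(job, interval_seconds, max_runs):
    """Call job() every interval_seconds, max_runs times, collecting results."""
    results = []
    for run in range(max_runs):
        results.append(job())
        if run < max_runs - 1:
            time.sleep(interval_seconds)
    return results

def fetch_prices():
    # Placeholder job: a real one would request a page and parse its fields.
    return {"widget": 9.99}

# Run the job twice, one second apart (use e.g. 86400 for daily collection).
snapshots = run_on_schedule(fetch_prices, interval_seconds=1, max_runs=2)
print(len(snapshots))  # 2
```

Each run's output lands in `results`, which is the "collect it" part - you'd normally write those snapshots to a file or database instead of keeping them in memory.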
Again, what sort of information is useful after extraction and parsing? A lot - our world is digital and heavy on e-commerce, opinions (social media), news, and the brokering of accumulated personal data. We'll refer to these different types of information as data sets, and all of them are ripe for the picking. It's really up to your creativity. Some use cases: user sentiment (product reviews), news articles, statistics, journalism, SEO, competitor analysis, risk management, real estate, academic research, and everything else.
What is this "data extraction" or "scraping"? Plain and simple, it's when code grabs specific fields within a website, database, form, or directory, and does so in an efficient, timely, and scalable way. Webpages are for humans, but code is for software. Any product meant to present information will be organized to some degree and have a theme or style. The fields within the syntax of that presentation are where we point the scrapers to retrieve the data. Think of each field as a bucket and the syntax as the order the buckets go in. If you know which buckets hold your data, you can go directly to that place and snag it.
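Here's the bucket idea as a minimal sketch, using Python's standard-library `html.parser`. The product listing below is made-up sample data, and the `name`/`price` class names are assumptions; a real page will use different tags and classes, so you'd adjust the matching logic to point at its buckets.

```python
from html.parser import HTMLParser

# Hypothetical sample page: each span class is a "bucket" holding one field.
SAMPLE_HTML = """
<div class="product"><span class="name">Widget</span><span class="price">$9.99</span></div>
<div class="product"><span class="name">Gadget</span><span class="price">$19.50</span></div>
"""

class ProductScraper(HTMLParser):
    """Collects (name, price) pairs from the spans it is pointed at."""
    def __init__(self):
        super().__init__()
        self.current = None   # which bucket we're inside, if any
        self.rows = []        # extracted (name, price) tuples
        self._name = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "span" and css_class in ("name", "price"):
            self.current = css_class

    def handle_data(self, data):
        if self.current == "name":
            self._name = data.strip()
        elif self.current == "price":
            self.rows.append((self._name, data.strip()))
        self.current = None

scraper = ProductScraper()
scraper.feed(SAMPLE_HTML)
print(scraper.rows)  # [('Widget', '$9.99'), ('Gadget', '$19.50')]
```

For live pages you'd fetch the HTML first (e.g. with the `requests` library) and likely parse with `BeautifulSoup`, but the principle is the same: know which buckets hold the data, then go straight to them.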
There are a bunch of ways to go about this. Scrapers come in all shapes, sizes, and price points. Four commonly used types are ready-made software, homebrewed scripts, browser extensions, and cloud services. Open-source scraping tools are available on GitHub, and YouTube is loaded with tutorials. Search "web scraping online" and you'll find a list of companies offering hosted solutions - try them and read their tutorials.
Okay team - this is a good start. Go take the class and get dangerous. I’ll release a series of these going into more depth in stages. Enjoy.
About this Product
These open-source products are reviewed by analysts at InfoDom Securities and provide possible context about current media trends in the realm of cyber security. The stories selected cover a broad array of cyber threats and are intended to aid readers in framing key publicly discussed threats and overall situational awareness. InfoDom Securities does not endorse any third-party claims made in their original material or related links on their sites; the opinions expressed by third parties are theirs alone. Contact InfoDom Securities at dominanceinformation@gmail.com