
Lesson 1 - Finding Data

Lesson Objectives

  • Recognize the importance of ethical considerations in data collection, especially concerning privacy and anonymization.
  • Learn to identify and utilize various data repositories and search platforms for sourcing data relevant to specific research areas.
  • Develop skills in crafting search strategies that incorporate multiple search terms and filters to refine source selection.
  • Gain proficiency in selecting and evaluating data from diverse sources to ensure a balanced and comprehensive dataset.
  • Master techniques in organizing and analyzing data to highlight relevant patterns and terms that will inform the sonification process.

Part 1: Collecting Data

Brief caveat: The information we used to construct our use-case was collected from secondary sources, and therefore contained no identifying information for the individual victims affected. Only companies whose information was compromised were named. If you are collecting any kind of sensitive data, it is imperative that you undertake the appropriate ethical considerations. This can involve discussing privacy concerns and/or anonymizing data where possible.
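
If the data you collect does contain personal identifiers, one common safeguard is to pseudonymize them before analysis. Below is a minimal sketch of that idea in Python, assuming a hypothetical CSV with name and email columns; it is illustrative only and not a substitute for a proper ethics review.

```python
import hashlib

import pandas as pd

# Secret salt; keep this out of version control and replace with your own value.
SALT = "replace-with-a-secret-salt"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a stable, non-reversible token."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]

# Hypothetical input file and column names -- adjust to your own data.
df = pd.read_csv("raw_survey_responses.csv")
for column in ["name", "email"]:
    df[column] = df[column].astype(str).map(pseudonymize)

df.to_csv("anonymized_survey_responses.csv", index=False)
```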

Use case: tracing discourse on data breaches

Choosing Your Data Source

Start by pinpointing where you’ll source your data. This will depend heavily on the kind of data you want to work with. There are a number of searchable data repositories available for public use, such as Dryad, FigShare, Kaggle, Google Dataset Search, and many more that are just a web search away. These can be searched for all kinds of datasets that could produce novel and varied understandings when sonified.

If, like us, you are interested in datasets related to breaches specifically, you can skip straight to these datasets that we found:

  1. Data from: Health IT, hacking, and cybersecurity: national trends in data breaches of protected health information

  2. For-profit versus non-profit cybersecurity posture: breach types and locations in healthcare organisations

  3. Data Breaches Dataset

Why Reuse Data?

Reusing datasets can save time and resources if you’re just looking to get a handle on the process of sonification itself. It can also lead to new discoveries that were not envisioned by the original collectors, particularly because sonification is not a commonly deployed data presentation technique.
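
If you do reuse one of the tabular datasets above, a sensible first step is simply to load and inspect it. The sketch below assumes a hypothetical local CSV filename; substitute whichever file you actually downloaded.

```python
import pandas as pd

# Placeholder filename: substitute the dataset you downloaded from the repository.
breaches = pd.read_csv("data_breaches.csv")

# Quick orientation before committing to the dataset for sonification.
print(breaches.shape)               # rows x columns
print(breaches.columns.tolist())    # available fields
print(breaches.head())              # first few records
```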

However, if you are feeling ambitious and do want to start from scratch, you will likely need to source your material from academic databases or perform some primary source web-scraping.

For our example study, the research team used Factiva, a versatile platform known not only for its business news but also for its extensive international newspaper, trade journal, blog, and website archives. For our purposes, Factiva offered two main benefits:

  1. Broad international coverage.

  2. Coverage of diverse topics, such as data breach events and geopolitical crises.

The main takeaway: Choose a source that aligns with your research objectives and provides vast and relevant coverage.

Part 2: Crafting Your Search Strategy

In our use-case, this step required us to develop a framework to identify key terms and phrases in order to filter the most relevant coverage from our database.

We also combined search terms; for instance, “MyFitnessPal” and “data breach” were paired to further narrow down our potential sources.

It also proved important to restrict the date range so that we captured the year of coverage following the public release of the data breach information.
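
Factiva applies these filters through its own search interface, but the same logic can be expressed in code once you have exported your results. The sketch below is illustrative only: it assumes a hypothetical CSV export with headline, body, source, and date columns, and the disclosure date shown is an assumption you should replace for your own case.

```python
import pandas as pd

# Hypothetical export of article metadata with headline, body, source, and date columns.
articles = pd.read_csv("factiva_export.csv", parse_dates=["date"])

# Combine search terms: keep only articles in which every term appears.
terms = ["myfitnesspal", "data breach"]
text = (articles["headline"].fillna("") + " " + articles["body"].fillna("")).str.lower()
matches_terms = text.apply(lambda t: all(term in t for term in terms))

# Restrict the date range to one year of coverage from the public disclosure.
# The disclosure date below is an assumption; substitute the date for your case.
disclosure = pd.Timestamp("2018-03-29")
in_window = articles["date"].between(disclosure, disclosure + pd.DateOffset(years=1))

relevant = articles[matches_terms & in_window]
relevant.to_csv("filtered_articles.csv", index=False)
print(f"{len(relevant)} articles match the combined terms within the window")
```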

Because our use-case relied on diverse source coverage, we were careful to allow room for reporting that provided minority and globally informed perspectives. Some of these reports did not agree with the scientific majority, but we felt they were worth including in order to bring a variety of perspectives to bear on our collection.

Part 3: Filtering and Selecting Relevant Data

From your search results, examine the most frequently published sources about your topic within your specified timeframe. While renowned sources like the Associated Press, BBC, and The New York Times are great, don’t shy away from including other international and local sources when they add value.

Aim for a diverse set of views rather than focusing solely on the reputability of sources. Manually review the articles to ensure that they 1) are relevant to your case, 2) contain substantial information, and 3) are original or add new information if republished.

For our study, the team selected around 5 articles from each source, amassing a total of 320 primary sources. Your strategy may look different, but it should generally follow a set of guidelines that will enable you to collect the most relevant and useful sources possible.
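
To make this selection step concrete, here is a minimal sketch of how source frequency and a per-source cap might be computed, assuming the hypothetical filtered export from the previous sketch. It is not the exact procedure our team followed, which relied on manual review.

```python
import pandas as pd

# Filtered results from the previous step, saved as a hypothetical CSV.
relevant = pd.read_csv("filtered_articles.csv", parse_dates=["date"])

# How frequently does each outlet publish on the topic within the window?
counts = relevant["source"].value_counts()
print(counts.head(10))

# Keep the most active outlets, then cap each at roughly five articles,
# mirroring the ~5-per-source target used in our study.
top_sources = counts.head(20).index
shortlist = (
    relevant[relevant["source"].isin(top_sources)]
    .sort_values("date")
    .groupby("source")
    .head(5)
)
print(f"{len(shortlist)} candidate articles to review manually")
```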

Part 4: Organizing and Analyzing Your Data

Create a summary for each primary source. Note down key descriptions and metaphors that characterize the categories relevant to your study. Our team looked into perpetrators, breach framing, perceived risk, victims, and data, among others. Track your observations in a structured format, such as an Excel spreadsheet. This aids in capturing language patterns, expressions, and specific terminology. For a nuanced understanding, our team preferred manual methods over automated ones, as they emphasize depth and the interconnections between sources.
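
As one possible way to set up such a tracking sheet, the sketch below builds a spreadsheet whose columns mirror the categories listed above. The column names, example row, and output filename are all assumptions; adapt them to your own study.

```python
import pandas as pd

# Columns mirror the analytic categories described above; adjust to your study.
columns = [
    "source", "headline", "date", "summary",
    "perpetrators", "breach_framing", "perceived_risk", "victims", "data",
    "notable_metaphors",
]

# A single hypothetical row showing the level of detail recorded per article.
example_rows = [{
    "source": "Example Wire",
    "headline": "Example headline",
    "date": "2018-04-01",
    "summary": "One- or two-sentence summary of the article.",
    "perpetrators": "How the attacker is described.",
    "breach_framing": "e.g. containment, crisis, negligence.",
    "perceived_risk": "How severe the coverage makes the breach sound.",
    "victims": "Who is framed as harmed.",
    "data": "What data the article says was exposed.",
    "notable_metaphors": "e.g. 'break-in', 'leak'.",
}]

tracker = pd.DataFrame(example_rows, columns=columns)

# Writing .xlsx files via pandas requires the openpyxl package.
tracker.to_excel("source_tracker.xlsx", index=False)
```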