Saturday, July 19, 2025
Earth-News
Lessons Learned from Scaling to Multi-Terabyte Datasets

June 20, 2024
in Technology

This post walks through some of the lessons I’ve learned while working with multi-terabyte datasets. It focuses on the challenges you’re likely to face as your dataset scales up and on the things I’ve done to overcome them. I hope you’re waiting for something to finish running while reading this!

Remember, this is not a rigid guide. It’s about introducing concepts and explaining why you should start applying them. Plenty of other tools can surpass the ones I’ve used, and I strongly encourage you to explore them on your own.

I’ve divided this post into two sections: scaling on single machines and multi-machine scaling. The goal is to maximize your available resources and reach your goals as quickly as possible.

Lastly, I want to emphasize that no optimization or scaling can compensate for a flawed algorithm. Evaluating your algorithm before scaling up should always be your first step.

Scaling on a Single Machine

Joblib

Compute is the first bottleneck that comes to mind when scaling. Scaling up computations can be done in several different practical ways. If you’re a data scientist or a machine learning engineer, you might already be familiar with Joblib, a library used to run code in parallel (among other things). It is often used in other libraries, such as scikit-learn or XGBoost.

The process of parallelizing something using Joblib is simple, as follows (modified for clarity from the Joblib docs):

from joblib import Parallel, delayed
from math import sqrt

parallel_mapper = Parallel(n_jobs=-1)  # n_jobs=-1 uses all CPU cores
delayed_func = delayed(sqrt)
jobs = [delayed_func(x**2) for x in range(10)]
parallel_mapper(jobs)
# [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

Joblib is a great way to scale up parallel workloads; its use inside scikit-learn and other tools shows how reliable it is. And that’s before considering its other excellent features, such as memoization and fast compressed persistence. Even on its own, it’s an easy way to make a function run across all your CPU cores.
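As a quick sketch of those extra features (the cache directory and file names here are my own choices, not anything prescribed by the docs): joblib.Memory memoizes a function’s results on disk, and joblib.dump/joblib.load handle compressed persistence.

```python
from joblib import Memory, dump, load

# Memoization: results are cached on disk, keyed by the function's
# arguments, so repeated calls skip recomputation entirely.
memory = Memory("./joblib_cache", verbose=0)  # cache dir is arbitrary

@memory.cache
def expensive_square(x):
    return x ** 2

result = expensive_square(12)  # computed once...
result = expensive_square(12)  # ...then served from the on-disk cache

# Compressed persistence: serialize any Python object to disk.
dump({"rows": 10_000}, "meta.joblib", compress=3)
restored = load("meta.joblib")
```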

GNU Parallel

GNU Parallel is a powerful tool for preprocessing or extracting data from the CLI. It differs from Joblib in that it lives outside your scripts: anything you can run from the command line, including other Python scripts, can be parallelized with it. One of the most common use cases is decompressing many files simultaneously. Here’s how I would do it:

> ls
random_0.zip  random_2.zip  random_4.zip  random_6.zip  random_8.zip
random_1.zip  random_3.zip  random_5.zip  random_7.zip  random_9.zip
…

> mkdir output
> ls | parallel --eta --bar "unzip -q {} -d output/"
100% 10:0=0s random_9.zip

> ls output/
random_0.txt  random_2.txt  random_4.txt  random_6.txt  random_8.txt
random_1.txt  random_3.txt  random_5.txt  random_7.txt  random_9.txt
…

These commands are pretty straightforward if you have used a Linux terminal before. The main part to focus on is piping the file names to parallel so that unzip can decompress them.

For any task, once you have a bash command that runs on a single file, you can parallelize it with only a slight modification. By default, parallel uses all available CPU cores, and it can even execute commands on multiple machines over ssh, meaning it can serve as an ad hoc computing cluster.

Another use case is downloading a large number of files. With wget, parallel, and a list of URLs, it’s easy to write a quick one-liner that downloads all the files in parallel. Other tools, such as axel and aria2c, can do this just as well, but I’ve found this approach works better when I need to download many smaller files.

A quick note: while you can use this to download many files, be aware that opening many connections can strain servers, causing network congestion and reduced performance for other users, or even be mistaken for a DoS attack. The increased load can be particularly problematic for smaller websites or servers with limited bandwidth. Famously, aria2c has rejected proposals to raise its maximum number of connections above 16, even though computers have gotten faster and network bandwidth has increased dramatically. I agree with that decision, and I ask you to act responsibly when downloading.

One more point: while Parallel gets things working quickly, bash commands can be hard to maintain, especially for a beginner on a team that is otherwise focused on Python or another traditional programming language. For that reason, I generally recommend keeping Parallel for one-off tasks rather than writing complex ETL pipelines in bash. Maintainable code is second only to no code at all.

Scaling to Multiple Machines

When to Start Using Multiple Machines

One key sign that it makes sense to switch to multiple machines (think Spark or, my favourite, Dask) is that computation is taking too long for your use case, whether that’s experiments, data processing, or anything else. The worst I’ve estimated is jobs that would take months or even a year to finish on a single instance, even on AWS’s u-24tb1.112xlarge (a beast of a machine). I’m against waste of any kind, and in my opinion, the better you can utilize the resources available, the better.

By switching to multiple smaller machines, you gain several performance benefits over a single larger instance. Depending on your scaling solution, horizontal scaling offers almost linear growth in CPU, memory, and network bandwidth with the number of instances you use.

Most reasonably large EC2 instances offer up to 10 Gbit/s of network bandwidth, which helps alleviate IO bottlenecks, especially if you’re rapidly streaming data to or from S3. If your workload needs data coming in at 50 Gbit/s, you can either use one m7i.48xlarge instance, which costs $9.6768 hourly and runs at 50 Gbit/s, or four m7i.8xlarge instances, which cost $1.6128 hourly each ($6.4512 hourly total) for the same aggregate bandwidth.
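The comparison works out as follows (prices as quoted above; the 12.5 Gbit/s per-instance figure is my own reading of the m7i.8xlarge spec, implied by four of them matching 50 Gbit/s):

```python
# One m7i.48xlarge vs. four m7i.8xlarge, both totalling 50 Gbit/s.
single_cost = 9.6768      # $/hour for one m7i.48xlarge @ 50 Gbit/s
quad_cost = 4 * 1.6128    # $/hour for four m7i.8xlarge @ 12.5 Gbit/s each
savings = single_cost - quad_cost

print(f"four smaller instances: ${quad_cost:.4f}/h "
      f"(${savings:.4f}/h cheaper for the same bandwidth)")
```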

I picked networking speed and cost as the two metrics to focus on here, but if you want to maximize memory and CPU, compare against the previously mentioned u-24tb1.112xlarge. For the same cost, you can rent 135 m7i.8xlarge instances. That gives you 4,320 CPUs (~10x the instance), 17.28 TB of RAM, and 1,687.5 Gbit/s of network bandwidth (~17x the instance)! The RAM is lower only because I scaled with a general-purpose instance rather than a memory-optimized one; using the memory-optimized equivalent, you can get 34.56 TB of RAM, with all the other benefits of multiple machines (redundancy, finer control over instance size, etc.).

Moreover, with the correct backend, I can scale to as many instances as my use case, orchestration tool, or accounting department will allow. This level of scalability is a crucial advantage, enabling you to meet the demands of your workload without being limited by the capabilities of a single instance.

As with everything, each approach has its trade-offs. It’s your job to evaluate the pros and cons of each solution and determine what works best for your use case. Minimizing cost while maximizing performance is a good exercise for building intuition here.

However, given these benefits, I only recommend using multiple instances once you understand the bottlenecks you face. I have seen teams start to scale and over-engineer their approach to computing before understanding their use case. I may even have been part of those teams before learning my lesson. In some cases, well-written CLI tools can process data faster than an entire Spark cluster.

Different Computing Models

For Embarrassingly Parallel Workloads

Embarrassingly parallel workloads are generally the easiest to scale. We’ve already covered scaling up compute with Joblib or Parallel, but what about scaling to multiple machines? Quite a few tools can do this; for one-off embarrassingly parallel workloads, I recommend AWS Batch or AWS Lambda. Batch is scalable, and with spot pricing you can finish most tasks at a fraction of the cost of on-demand instances and in a fraction of the time a single machine would take. Other tools exist (GCP’s Cloud Run, for example), but for longer-running tasks I can only vouch for AWS Batch, since that’s what I’ve used in the past.

Since setting up the cluster can be time-consuming and is out of the scope of this post, I’ve included a link here in case you’re interested in exploring this yourself.

One caveat worth mentioning: your job’s overall throughput will be limited by read and write speeds more than by compute. If you’re reading from or writing to a database, the database is likely to become the bottleneck (or even crash). S3 is a viable option for reading and writing since it’s designed to scale, but it still has limits: 3,500 writes and 5,500 reads per second per partitioned prefix. S3’s scaling is designed to be invisible to the user, so you may have little control over how it adapts to the increased throughput.
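One common workaround, sketched below under my own naming (nothing here is prescribed by the S3 docs), is to spread objects across many key prefixes, since each distinct prefix gets its own request-rate budget:

```python
import hashlib

def sharded_key(key: str, shards: int = 16) -> str:
    """Prepend a stable, hash-derived shard prefix to an S3 key.

    With 16 prefixes, the aggregate ceiling grows to roughly
    16 x 3,500 writes/s and 16 x 5,500 reads/s.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02d}/{key}"

print(sharded_key("results/part-0001.parquet"))
```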

Once the data is in S3 (or whatever service you use), you can transfer it wherever needed.

This setup is quite tedious but scales well for one-off tasks. With a few iterations, you can reduce the setup time to a few minutes, depending on how well you’ve automated the process and your team’s needs. Generally, I’ve found that the setup time is worth it for the computing and engineering time saved, but you can understand my hesitation in using this for every task.

Analytical Workloads

Analytical workloads are a bit more challenging to scale, because you’re generally working with a single dataset and performing many operations on it, often interactively (in a Jupyter notebook, say). My go-to tool for scaling analytical workloads is Dask, with Spark as the main alternative. Both are open-source tools that let you scale workloads to multiple machines, each with its pros and cons. Both can also run locally, and their DataFrame implementations (Dask DataFrame and Spark DataFrame) can be used to scale up existing workloads.

Dask is much easier to set up and install: I can get it running locally in a few minutes with a single command (pip install "dask[complete]", by the way). Spark requires more setup, and I’ve found running it on my local machine more challenging. Dask also has the advantage that any data scientist who knows Pandas or NumPy can pick it up quickly, whereas Spark is an entirely different skill set, and Dask integrates better with several PyData tools, so you can take advantage of them immediately. That said, Spark and its ecosystem are much more mature, and your team has likely already invested time in getting a Spark cluster up and running. I run into the occasional bug or performance issue with Dask, while Spark’s maturity makes it much more stable. Dask is also less suited to longer-running computations.

Given this, my general recommendation is:

If you’re a small team or startup with no big-data or distributed-computing infrastructure, I recommend at least experimenting with Dask, regardless of the team’s experience with Spark. In the time it takes to get Spark running locally, you could have validated your use case with Dask, and your team will be able to leverage other tools in the PyData space.

If you’re already part of a larger organization that uses Spark or other significant data infrastructure, it makes sense to stick with it unless you have a compelling reason not to. I recommend Eric Dill’s talk “Is Spark Still Relevant?” for why larger organizations prefer Spark over more modern tools; it is five years old, so some talking points may be outdated. That said, you should still try Dask, since you can use both.

Conclusion

In conclusion, managing and scaling multi-terabyte datasets requires a deep understanding of both your data and the tools at your disposal. By leveraging Joblib and GNU Parallel for single-machine scaling, you can maximize the efficiency of your computational resources. When scaling beyond a single machine is necessary, AWS Batch, Dask, and Spark provide robust solutions for various workloads, from embarrassingly parallel tasks to complex analytical operations.

The key takeaway is to optimize your algorithms before scaling, ensuring you’re not merely amplifying inefficiencies. Actively exploring and adapting new tools can significantly improve your performance and cost-effectiveness. Successful scaling is as much about strategic planning and resource management as raw computational power. Embrace the learning curve, and you’ll be well-equipped to handle even the largest datasets confidently.

Copyright for syndicated content belongs to the linked source: Hacker News – https://v2thegreat.com/2024/06/19/lessons-learned-from-scaling-to-multi-terabyte-datasets/

Tags: learned, lessons, technology
© 2023 earth-news.info
