Big Data News Weekly
🤖 Generative AI for Data Science
🦾Plus: Tesla robotaxi launch 🚕, 👀 Apple and Meta blast EU fines

Hey folks! Let’s get into Big Data and AI craziness…
In today's edition: What's Shaping the Future of Data?
🧪 How (not) to do a data science project
📊 Choosing the Right ML Model for Your Data
🤖 A practical guide to fine-tuning embedding models
❄️ Apache Iceberg Internals Dive Deep On Performance
🤖 Microsoft’s new AI agents, workplace AI research
🎨 OpenAI unlocks powerful image creation via API
👀 Apple and Meta blast EU fines
💡 AI Tutorial: Use AI to stop AI-assisted cheating during remote interviews
🤖 AI Tools and Data Tools to check out

In order for people to find value in your work in data science (and want to hire you!), you need to show them what you've worked on. In practice, this will mostly be school projects or side projects. But when someone starts working on their first data science projects, they inevitably fall into basic traps that make their project look boring/useless/ugly, even if the project is an amazing idea. Here, I want to share what I think would make a great data science project (not enterprise production-level, but mostly side projects)…

This detailed guide examines the basic steps and requirements to consider when selecting the most appropriate machine learning model. Whether you are a beginner or are deepening your knowledge through a machine learning course, it is meant as a resource for making informed decisions about model choice.

Should one teach coding in a required introductory statistics and data science class for non-major students?…With the release of large language models that write code, we saw an opportunity for a middle ground, which we tried in Fall 2023 in a required introductory data science course in our school’s full-time MBA program. We taught students how to write English prompts to the artificial intelligence tool GitHub Copilot that could be turned into R code and executed.

In this report, we try to answer questions such as: if and when should you fine-tune embedding models, and what are the qualities of a good fine-tuning dataset? We'll deal with the embedding part of the retrieval pipeline, which means any changes or updates will require re-ingestion of the data, unlike reranking.
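To see why an embedding change forces re-ingestion while a reranker change does not, here is a minimal toy sketch. It is not taken from the report: `embed`, `ingest`, `retrieve`, and `rerank` are hypothetical names, and the "model" is just a bag-of-words counter standing in for a real embedding model. The point is only where each piece runs: document vectors are computed once at ingestion, while reranking happens at query time.

```python
import math
from collections import Counter

DOCS = [
    "the history of apache iceberg",
    "fine-tuning embedding models",
    "a guide to reranking results",
]

def embed(text, version=1):
    # Toy stand-in for an embedding model (assumption, not a real model).
    # version=2 mimics a fine-tuned model by ignoring a few stop words.
    stop = {"the", "a", "of"} if version == 2 else set()
    return Counter(w for w in text.lower().split() if w not in stop)

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ingest(docs, version):
    # Document vectors are computed once, at ingestion, and stored.
    return {"version": version, "vectors": [embed(d, version) for d in docs]}

def retrieve(index, docs, query, k=2):
    # The query must be embedded with the same model version as the index.
    q = embed(query, index["version"])
    order = sorted(
        range(len(docs)),
        key=lambda i: cosine(q, index["vectors"][i]),
        reverse=True,
    )
    return [docs[i] for i in order[:k]]

def rerank(candidates, query):
    # Runs at query time on the retrieved candidates only, so swapping
    # this function does not touch the stored vectors.
    q_words = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )

index_v1 = ingest(DOCS, version=1)
index_v2 = ingest(DOCS, version=2)  # new embedding model: full re-ingestion
candidates = retrieve(index_v1, DOCS, "embedding models")
final = rerank(candidates, "embedding models")
```

Moving from `index_v1` to `index_v2` means recomputing every stored vector, whereas replacing `rerank` leaves the index untouched.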

In this blog post I will go over how Apache Iceberg contributes to the performance of compute engines. Apache Iceberg is an ACID table format designed for large-scale analytics workloads. While its consistency and schema evolution features were covered in a previous post, its impact on query performance can be equally transformative.
👨‍💻 Data Tools, Libraries
VectorChord (GitHub Repo)
VectorChord is a PostgreSQL extension designed for scalable, high-performance, and disk-efficient vector similarity search.
Sequin (GitHub Repo)
Sequin is a tool for change data capture in Postgres that supports native sinks, making it easy to stream Postgres rows and changes to streaming platforms and queues.
Skia Canvas (GitHub Repo)
Skia Canvas is a browser-less implementation of the HTML Canvas drawing API for Node.js. It produces very similar results to Chrome's canvas element.
AI News:

Microsoft just released two new Copilot agents, Researcher and Analyst, alongside its 2025 Work Trend Index report — which maps out the rise of AI-centric, human-led “Frontier Firms” set to reshape the workplace. Researcher and Analyst bring deep reasoning to M365 Copilot for complex research and data science tasks like forecasting.

OpenAI just launched its advanced image generation model, gpt-image-1, to developers via API — bringing the viral success of ChatGPT's image capabilities to third-party applications and platforms. The gpt-image-1 model powers ChatGPT's image generation feature, which produced over 700 million images in just one week after its launch in March.

More than 30 AI experts and ex-OpenAI staffers published an open letter urging the attorneys general of Delaware and California to block OpenAI’s restructuring, warning it would undermine its original mission to benefit humanity. Nine former OpenAI employees joined notable figures like AI ‘godfather’ Geoffrey Hinton in calling to block the startup’s transition from nonprofit to for-profit.

Tesla has started testing its autonomous ride-hail service with employees in Austin and the Bay Area ahead of the company’s planned robotaxi launch this summer. “FSD Supervised ride-hailing service is live for an early set of employees in Austin & San Francisco Bay Area,” the company posted Wednesday on X.

The European Commission has slapped Apple and Meta with a combined fine of around $800M for violating the EU's Digital Markets Act (DMA), making them the first tech companies to be penalized under the law, which is designed to enforce fair business practices.
AI Tutorial
Use AI to stop AI-assisted cheating during remote interviews

You can leverage AI-powered detection to ensure fair and reliable remote interviews in 3 easy steps:
1. Go to the Sherlock website and sign up
2. Add your meeting link (Zoom, Teams, or Google Meet)
3. Share the secure Sherlock interview link with the candidate
The tool monitors audio, video, and screen activity, providing real-time alerts and detailed reports—saving you time and resources wasted on fraudulent candidates, and helping you gain confidence in your hiring process and make better hiring decisions.
🔥 Top AI tools to increase productivity:
Trolly AI: Revolutionizing SEO Content Creation with Advanced AI Technology.
pre.dev accelerates idea to development.
AskGPT extension enhances web browsers by providing AI-powered summaries and insights directly on web pages.
Editby - Create content for your blog, newspaper, newsletter, press notes, social networks etc. with AI.
Data Analyst AI connects Google Analytics with ChatGPT, delivering AI-powered eCommerce insights and automated weekly reports.
Videotok - Create viral TikToks and Reels from text to Video with AI
Maze Guru is a conversational AI tool for generating videos and images
View our database of all the best AI tools for your needs: aitoolsup.com
A.I. Generated Image of the Day
👀 If Godzilla arose in Arrakis..
