
We are building a cutting-edge data transformation platform designed to convert unstructured data into structured, machine-readable formats for AI and analytics workflows. This platform processes vast volumes of heterogeneous data, applying advanced parsing methods, enrichment techniques, and LLM-powered extraction to produce high-quality datasets for downstream applications. A key component of this system is the Document Retrieval REST service, which serves as the interface for accessing the transformed data. This service ensures seamless integration with data pipelines and provides performant, scalable document retrieval capabilities.
As a cornerstone of the organization’s AI ecosystem, the system provides reliable data ingestion, model-ready representations, and consistent information flow across multiple products and internal services. You will play a pivotal role in developing and optimizing next-generation data processing pipelines and maintaining the Document Retrieval REST service that powers intelligent applications and automated decision-making at scale.
Your Role and Impact
As a Senior Python Engineer, you will be instrumental in designing, developing, and optimizing high-performance data transformation pipelines and the Document Retrieval REST service. You will leverage your expertise in software engineering and AI/NLP methodologies to deliver scalable, accurate, and robust solutions. This role offers the opportunity to apply advanced language-model technologies in regulated, data-intensive environments while contributing to the broader AI ecosystem.
Key Responsibilities
Data Pipelines: Design, build, and maintain modular, high-throughput pipelines for ingesting and transforming diverse data types using both traditional parsing methods and AI-powered techniques.
Document Retrieval Service: Own the development and maintenance of the Document Retrieval REST service, ensuring seamless integration with data pipelines and high-performance document retrieval.
Database Optimization: Manage and optimize the MongoDB database, ensuring scalability, performance, and efficient document retrieval through techniques like index optimization.
Statistical Analysis & Governance: Perform statistical analysis on ingested and retrieved data, enforce governance policies, and manage entitlements for data pipelines.
Metadata Normalization: Keep metadata structures normalized while accommodating a broad range of data sources.
AI/NLP Integration: Engineer and optimize prompts, extraction logic, and workflows using modern LLM orchestration frameworks for accurate interpretation of complex data.
Algorithm Development: Implement algorithms for parsing, segmentation, vectorization, and structured output generation using standard data-processing libraries.
Cloud Deployment: Build and deploy solutions on major cloud platforms, ensuring scalability and reliability.
Production-Grade Code: Write resilient, production-ready code for Linux-based environments.
Cross-Functional Collaboration: Work within an Agile team alongside analysts, domain experts, and platform engineers to deliver high-quality solutions.
Code Quality: Maintain high standards for code quality, testing, and documentation.
Required Qualifications (Must-Haves)
Python Expertise: 5+ years of professional Python experience, with a strong command of OOP principles, design patterns, and large-scale system development.
Data Processing: Solid experience with data-processing and numerical libraries (e.g., Pandas, NumPy).
AI/LLM Integration: Hands-on experience with LLM-integrated application frameworks and AI-powered workflows.
Data Transformation: Proven expertise in parsing and transforming data across multiple formats (structured, semi-structured, unstructured).
Cloud Platforms: Experience deploying solutions on major cloud providers (e.g., AWS, Azure, GCP).
Database Skills: Strong database proficiency, including experience integrating with and optimizing databases (NoSQL preferred).
Linux Proficiency: Comfortable working in Linux/Unix environments with shell scripting.
Preferred Qualifications (Nice-to-Haves)
Domain Experience: Background in data-intensive or financial industries.
Business Data Familiarity: Experience working with structured and semi-structured business data.
NLP Expertise: Familiarity with NLP toolkits, embedding techniques, and text-processing pipelines.
ETL Systems: Experience building and maintaining large-scale ETL or data-processing systems.
Enterprise Infrastructure: Exposure to enterprise tools like schedulers, monitoring systems, and authentication frameworks.
Testing Frameworks: Experience with automated testing frameworks for data pipelines and APIs.
Soft Skills
Communication: Clear and structured communication skills to collaborate effectively with cross-functional stakeholders.
Problem-Solving: Strong analytical mindset with a focus on ownership and accountability.
Team Collaboration: Ability to work effectively in Agile teams with diverse stakeholders.
Why Join Us?
This is an exciting opportunity to work on a transformative AI-driven platform that directly impacts intelligent decision-making at scale. You will collaborate with a talented team, tackle complex technical challenges, and contribute to the future of AI-powered data processing.
The estimated base salary range for this position is $175,000 to $250,000, which is specific to New York and may change in the future. Millennium offers a total compensation package that includes a base salary, a discretionary performance bonus, and a comprehensive benefits package. When finalizing an offer, we consider an individual’s experience level and the qualifications they bring to the role to formulate a competitive total compensation package.