Dive into a curated selection of the most influential research papers on Data Engineering. This collection covers groundbreaking approaches, methodologies, and applications that are shaping the future of this critical field. Expand your knowledge and keep up with the latest trends and innovations in Data Engineering.
Patrick Petersen, Hanno Stage, Jacob Langner + 4 more
2022 IEEE International Symposium on Systems Engineering (ISSE)
This paper aims to take a step towards the introduction of a data engineering process in data-driven automotive systems engineering by putting a spotlight on developing well-designed data sets as the central element for training and validating AI-based software.
This work outlines three diverse applications to the economics of information; to life-cycle employment, earnings, and spending; and to public policy analysis and provides a general overview of the engineering process.
Yao Fu, Rameswar Panda, Xinyao Niu + 4 more
ArXiv
It is demonstrated that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K, which outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
This article shows that open-source data sets are the rocket fuel for research and innovation at even some of the largest AI organizations, and analysis of nearly 2000 research publications from Facebook, Google and Microsoft over the past five years shows the widespread use and adoption of open data sets.
Andrew Caplin
Journal of Economic Literature
Cognitive economics studies imperfect information and decision-making mistakes. A central scientific challenge is that these can’t be identified in standard choice data. Overcoming this challenge calls for data engineering, in which new data forms are introduced to separately identify preferences, beliefs, and other model constructs. I present applications to traditional areas of economic research, such as wealth accumulation, earnings, and consumer spending. I also present less traditional applications to assessment of decision-making skills, and to human–AI interactions. Methods apply both t...
This research presents a meta-modelling architecture that automates the labor-intensive, and therefore slow and expensive, process of manually cataloging individual pieces of data to provide insights about their owners.
Claudia M. Eckert
journal unavailable
Important shortcomings in the sociotechnical processes that undergo changes as digitalization is brought into mature engineering organizations are revealed and a lack of knowledge on multiple levels of the data analysis process and the ethical implications this could have are pointed to.
F. Giorgi, Carmine Ceraolo, D. Mercatelli
Life
An historical chronicle of how R became what it is today is provided, describing all its current features and capabilities, and the role of R in science in general as a driver for reproducibility is discussed.
Y. Papakonstantinou, Michael Armbrust, A. Ghodsi + 23 more
journal unavailable
ETL and ELT are required to prepare comprehensive, clean, and correct derived data that can fuel successful analytics and ML. Based on our observations from thousands of customers processing data in the cloud at Databricks, the preparation of derived data typically involves a complex DAG of transformations, which are split into two activities: (a) Ingestion: At the sources of the DAG, raw data are fetched from streaming platforms, like Apache Kafka and Amazon Kinesis, and from cloud storage that stages incoming data. This data is typically in blob stores such as AWS S3. The majority of ou...
Santhosh Bussa
Journal of Sustainable Solutions
This discussion summarizes the challenges implicated, including scale and security, outlines strategies for workflow optimization, and elaborates on some findings using data tables and practical code snippets, which brings actionable insights for both practitioners and researchers.
Qiao Dong, Xueqin Chen, S. Dong + 1 more
IEEE Transactions on Intelligent Transportation Systems
This paper summarized and discussed more than 40 types of data analysis methods, including statistical tests, experimental design, regressions, count data models, survival analysis, stochastic process models, supervised learning, unsupervised learning, reinforcement learning, and Bayesian analysis, as applied in pavement engineering.
Yu Wang, Wengang Zhang, Xiaohui Qi + 1 more
Georisk: Assessment and Management of Risk for Engineered Systems and Geohazards
Various topics are covered in this special issue, including Bayesian learning of unconfined compressive strength of rock, machine learning of geological details from borehole logs for the development of the high-resolution subsurface geological profile, determination of optimal sampling locations using Gaussian Process Regression (GPR), and Gaussian mixture model for estimating debris-flow exceedance probability.
E. Zeydan, J. Mangues-Bafalluy
IEEE Access
There is still a significant gap in applying recent developments in the evolving data engineering world to the telecommunication domain, and several recommendations for early adoption of these technologies and frameworks in telecommunication infrastructures and platforms are proposed.
Ismail Setiawan
Widya Accarya
Expertise in data analysis is only a basic skill for a data engineer. Statistical expertise is used to process, read, and tag data, as well as to categorize it, because it is closely tied to the models built to test algorithms at the data scientist level. Models built in the data scientist phase are used as tools in the business intelligence phase. In this final stage, the actions taken must deliver a positive impact and substantial benefit to the organization.
N. Kilaru
International Journal for Research Publication and Seminar
The paper explores using WMS to incorporate ADP and AFT when implementing the entire data science pipeline, from data acquisition to deployment of the final model, to suggest that using data engineering approaches saves time and resources while performing data pre-processing and analysis, improves the quality and reliability of analytics findings and outputs, and is an essential component of contemporary analytical pipelines.
J. Hellerstein, Aditya G. Parameswaran
1st International Workshop on Data Systems Education
In the Spring of 2021, a pilot edition of a new Data Engineering course at Berkeley was launched, targeted at the authors' burgeoning Data Science major, focusing on fluency of data models, languages and transformation tasks.
M. Daradkeh, Shadi Atalla
2023 International Conference on Information Technology (ICIT)
A data science and data engineering approach for automated generation of data stories and an engineering development approach to drive the growth of data storytelling tools and industry ecosystem are presented.
Jolanta Brzozowska, Jakub Pizoń, Gulzhan Baytikenova + 3 more
Applied Computer Science
The paper presents a CRISP-DM based model for data mining in the process of predicting assembly cycle time. The model will be part of a methodology for estimating the assembly time of a finished product at the quotation stage, without the detailed technology of the product being known.
S. Glotzer
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
This talk discusses the applications of data science and data-driven thinking to molecular and materials simulation and presents applications of machine learning to automated structure identification of complex colloidal crystals, high-throughput mapping of phase diagrams, the study of kinetic pathways between fluid and solid phases, and the discovery of previously elusive design rules and structure-property relationships.
J. Greenhalgh, Apoorv Saraogee, Philip A. Romero
Protein Engineering
Instead of physically modeling the relationships between protein sequence, structure, and function, data-driven methods use ideas from statistics and machine learning to infer these complex relationships from data.
Suraj Gupta, D. Aga, A. Pruden + 2 more
Environmental science & technology
An overview of data analytics frameworks suitable for various Environmental Science and Engineering (ESE) research applications is provided, and a path to advance the incorporation of data analytics approaches in ESE research and application is proposed.
Ioannis Foufoulas, A. Simitsis
2023 IEEE 39th International Conference on Data Engineering (ICDE)
This tutorial presents recent advancements in the problem of efficient UDF execution in modern data engines, involving a broad scope of solutions ranging from algebraic, cost-based optimization to low level, physical query optimization, compilation, and execution.
N. K. Sahu, M. Patnaik, Itu Snigdh
journal unavailable
This chapter discusses various data types and their techniques for applying to feature engineering and focuses on the implementation of various data techniques for feature extraction.
N. Polyzotis, M. Zaharia
ArXiv
This paper discusses several lessons from data and ML engineering that could be interesting to apply in data-centric AI, based on the experience of building data and ML platforms that serve thousands of applications at a range of organizations.
Jianxun Xing
E3S Web of Conferences
China is currently encouraging the engineering application and promotion of BIM techniques in the construction industry, which has effectively improved the design, construction, operation, and maintenance technology and quality of large construction projects.
Daniel Tebernum, Marcel Altendeitering, F. Howar
journal unavailable
A data engineering reference model (DERM) is developed, which outlines the important building-blocks for handling data along the data lifecycle and derived six research gaps that need further attention for establishing a practically-grounded engineering process.
F. Chiarello, E. Coli, Vito Giordano + 2 more
Proceedings of the Design Society
Insight shows that ED studies have a great potential in the usage of many data sources, but also that there exist some gaps to be solved in order to reach a more effective data usage in the context of ED.
F. Chirigati, Rémi Rampin, Aécio Santos + 2 more
Proc. VLDB Endow.
This work describes the system architecture and how users can explore datasets through a rich set of queries and presents case studies which show how Auctus supports data augmentation to improve machine learning models as well as to enrich analytics.
B. Yang, R. Nazari, D. Elmo + 2 more
IOP Conference Series: Earth and Environmental Science
This paper aims to fill the gap by providing a set of guidelines on the necessary data preparation steps for applying machine learning to rock engineering problems, thereby helping rock engineers improve the performance of their machine learning models.
Wentai Zhang, Quan Chen, Can Koz + 6 more
Volume 3A: 48th Design Automation Conference (DAC)
A constrained data synthesis method to generate an arbitrarily large set of synthetic training drawings using only a handful of labeled examples based on the randomization of the dimension sets subject to two major constraints to ensure the validity of the synthetic drawings is presented.
Wentai Zhang, Joe Joseph, Quan Chen + 7 more
J. Comput. Inf. Sci. Eng.
A constrained data synthesis method to generate an arbitrarily large set of synthetic training drawings using only a handful of labeled examples based on the randomization of the dimension sets subject to two major constraints to ensure the validity of the synthetic drawings is presented.
Christian C. Nadell, Gregory P. Spell, Mark Jeiran + 1 more
journal unavailable
This work explores critical axes of variation using standard CNN architectures, evaluating a large UE training set on a real IR validation set, and provides guidelines for variation in many of these critical dimensions for multiple machine learning problems.
R. Torkar, Carlo A. Furia, R. Feldt
2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)
n/a
Xiangyan Sun, Ke Liu, Yuquan Lin + 9 more
ArXiv
An end-to-end retrosynthesis system that can propose complete retrosynthesis routes for organic compounds rapidly and reliably is developed, showing satisfying functionality and a potential productivity boost in real-life use cases.
Rastislav Funta
Masaryk University Journal of Law and Technology
The purpose of this paper is to provide answers on whether a regulation aimed at preventing abuses is necessary or whether an obligation to publish the search algorithm may be advocated.
Aleksandrov Aleksandr Anatolyevich, Pavlov Andrey Mikhailovich
journal unavailable
This work considers the problem of maximally simplifying information visualization and automating design and production processes using "big data," which helps to analyze trends based on analytics and to forecast stocks, production volumes, service life, equipment operation cycles, and more.
Lorna M. Smith
journal unavailable
This guide to finding scientific and engineering raw data for analysis, comparison and research helps scientists and engineers find the best sources of data for their research.
F. Chirigati, Rémi Rampin, Aécio S. R. Santos + 2 more
ArXiv
This demo presents the ongoing effort to develop a dataset search engine tailored for data augmentation, named Auctus, which automatically discovers datasets on the Web and, from existing dataset search engines, infers consistent metadata for indexing and supports join and union search queries.
Vitor Pinheiro de Almeida, Júlio Gonçalves Campos, Elvismary Molina de Armas + 4 more
journal unavailable
INSIDE is presented, a system that enables Semantic Interoperability for Engineering Data Integration and represents queries to one or multiple databases through the concept of data services, where each service is defined using an ontology.
Jiacheng Li
Financial Engineering and Risk Management
This paper briefly introduces the concepts of financial statistics and big data technology, and analyzes the current state of application and strategies for effective application, in order to make the use of big data technology in financial statistics more effective.
Xinze Li, Baixi Zou
ArXiv
An implementation of an automated data engineering pipeline for anomaly detection of IoT sensor data is studied and proposed and involves the use of IoT sensors, Raspberry Pis, Amazon Web Services, and multiple machine learning techniques with the intent to identify anomalous cases for the smart home security system.
Navodini Wijethilake, D. Meedeniya, Charith D. Chitraranjan + 3 more
IEEE Access
The prognostic parameters acquired are explored, utilizing diagnostic imaging techniques and genomic platforms for survival or risk estimation of glioma patients and the techniques, learning and statistical analysis algorithms used for prognosis prediction are reviewed.
Yu-Fei Ao, Mark Dörr, Marian J. Menke + 3 more
ChemBioChem
This concept article reviews machine learning models that have been developed to assess enzyme‐substrate‐catalysis performance relationships aiming to improve enzymes through data‐driven protein engineering and prospect the future development of this field to provide additional strategies and tools for achieving desired activities and selectivities.
Vignesh Gopakumar, S. Pamela, D. Samaddar
ArXiv
It is demonstrated how PINNs can be forced to converge better towards the solution, by way of feeding in sparse or coarse data as a regulator.
Xiong Qiu, Pengtong Fan, Bingfeng Xie
Journal of Electronics and Information Science
This project focuses on software engineering technology in the context of big data, which can improve the work efficiency of enterprises and also play a significant role in the economic development of society.
Steven Herbert
Data-Centric Engineering
This work focuses on quantum Monte Carlo integration as a likely source of (relatively) near-term quantum advantage, but also discusses some other ideas that have garnered widespread interest.
The Bulletin of the Technical Committee on Data Engineering is published quarterly and is distributed to all TC members. Its scope includes the design, implementation, modelling, theory and application of database systems and their technology. Letters, conference information, and news should be sent to the Editor-in-Chief. Papers for each issue are solicited by and should be sent to the Associate Editor responsible for the issue. Opinions expressed in contributions are those of the authors and do not necessarily reflect the positions of the TC on Data Engineering, the IEEE Computer Society, or...
Sendong Zhao, Aobo Wang, Bing Qin + 1 more
Bioinformatics
A BERT-based evidence extraction model is proposed to extract evidence from literature in response to queries, and a dataset with 1 million examples of biomedical evidence is created, 10,000 of which are manually annotated.
Lipsa Das, Laxmi Ahuja, V. Chauhan + 1 more
2022 2nd International Conference on Innovative Practices in Technology and Management (ICIPTM)
This paper presents a feature engineering concept for predicting customer behavior, covering the different stages of the feature creation process; once the features are created, a machine learning algorithm can be applied in any domain.
Mike Tamir, Steven Miller, A. Gagliardi
Labor: Personnel Economics eJournal
Businesses are quickly realizing that data scientists can only go so far without the team in place to support their day-to-day work, but more importantly to operationalize their work.