Friday, February 26, 2010
OOSE PDFs
Wednesday, February 17, 2010
What are the different architectures of a data warehouse?
1. Top-down - (Bill Inmon)
2. Bottom-up - (Ralph Kimball)
What are the steps to build a data warehouse?
What are the types of dimension tables?
What is a dimension table?
Components of a data warehouse
Overall Architecture
The data warehouse architecture is based on a relational database management system server that functions as the central repository for informational data. Operational data and processing is completely separated from data warehouse processing. This central information repository is surrounded by a number of key components designed to make the entire environment functional, manageable and accessible by both the operational systems that source data into the warehouse and by end-user query and analysis tools.
Typically, the source data for the warehouse comes from the operational applications. As the data enters the warehouse, it is cleaned up and transformed into an integrated structure and format. The transformation process may involve conversion, summarization, filtering and condensation of data. Because the data contains a historical component, the warehouse must be capable of holding and managing large volumes of data as well as different data structures for the same database over time.
The next sections look at the seven major components of data warehousing:
Data Warehouse Database
The central data warehouse database is the cornerstone of the data warehousing environment. This database is almost always implemented on the relational database management system (RDBMS) technology. However, this kind of implementation is often constrained by the fact that traditional RDBMS products are optimized for transactional database processing. Certain data warehouse attributes, such as very large database size, ad hoc query processing and the need for flexible user view creation including aggregates, multi-table joins and drill-downs, have become drivers for different technological approaches to the data warehouse database. These approaches include:
- Parallel relational database designs for scalability that include shared-memory, shared disk, or shared-nothing models implemented on various multiprocessor configurations (symmetric multiprocessors or SMP, massively parallel processors or MPP, and/or clusters of uni- or multiprocessors).
- An innovative approach to speed up a traditional RDBMS by using new index structures to bypass relational table scans.
- Multidimensional databases (MDDBs) that are based on proprietary database technology; conversely, a dimensional data model can be implemented using a familiar RDBMS. Multi-dimensional databases are designed to overcome any limitations placed on the warehouse by the nature of the relational data model. MDDBs enable on-line analytical processing (OLAP) tools that architecturally belong to a group of data warehousing components jointly categorized as the data query, reporting, analysis and mining tools.
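As a small illustration of the last point, here is a minimal sketch of a dimensional (star schema) model implemented on an ordinary RDBMS, using Python's built-in sqlite3 module; the table and column names are invented for illustration. A fact table is joined to its dimension tables and rolled up with an aggregate query.

import sqlite3

# Minimal star schema: one fact table surrounded by dimension tables.
# Table and column names are illustrative only.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, units INTEGER, revenue REAL);
""")

con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
con.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, 2010, 1), (2, 2010, 2)])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 10, 100.0), (2, 1, 5, 75.0), (1, 2, 7, 70.0)])

# A typical warehouse query: multi-table join plus aggregation (a simple roll-up).
for row in con.execute("""
    SELECT d.year, d.month, p.category, SUM(f.revenue) AS total_revenue
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id    = f.date_id
    GROUP BY d.year, d.month, p.category
    ORDER BY d.year, d.month
"""):
    print(row)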
Sourcing, Acquisition, Cleanup and Transformation Tools
A significant portion of the implementation effort is spent extracting data from operational systems and putting it in a format suitable for informational applications that run off the data warehouse.
The data sourcing, cleanup, transformation and migration tools perform all of the conversions, summarizations, key changes, structural changes and condensations needed to transform disparate data into information that can be used by the decision support tools. They produce the programs and control statements, including the COBOL programs, MVS job-control language (JCL), UNIX scripts, and SQL data definition language (DDL), needed to move data into the data warehouse from multiple operational systems. These tools also maintain the meta data. The functionality includes:
- Removing unwanted data from operational databases
- Converting to common data names and definitions
- Establishing defaults for missing data
- Accommodating source data definition changes
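A minimal sketch of these cleanup steps is shown below; the field names, rename rules and defaults are made up for illustration, not taken from any particular tool.

# Illustrative cleanup/transformation step: field names, rules and defaults are assumptions.
COLUMN_MAP = {"cust_nm": "customer_name", "cust_no": "customer_id"}   # common names and definitions
DEFAULTS   = {"country": "US"}                                        # defaults for missing data

def transform(record):
    """Rename columns, drop unwanted fields and fill defaults for one source record."""
    out = {COLUMN_MAP.get(k, k): v for k, v in record.items() if k != "internal_flag"}
    for field, default in DEFAULTS.items():
        out.setdefault(field, default)
    return out

source_rows = [
    {"cust_no": 42, "cust_nm": "Acme", "internal_flag": "X"},
    {"cust_no": 43, "cust_nm": "Globex", "country": "DE"},
]
print([transform(r) for r in source_rows])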
The data sourcing, cleanup, extract, transformation and migration tools have to deal with some significant issues including:
- Database heterogeneity. DBMSs are very different in data models, data access language, data navigation, operations, concurrency, integrity, recovery etc.
- Data heterogeneity. This is the difference in the way data is defined and used in different models - homonyms, synonyms, unit compatibility (U.S. vs metric), different attributes for the same entity and different ways of modeling the same fact.
These tools can save a considerable amount of time and effort. However, significant shortcomings do exist. For example, many available tools are generally useful for simpler data extracts. Frequently, customized extract routines need to be developed for the more complicated data extraction procedures.
Meta data
Meta data is data about data that describes the data warehouse. It is used for building, maintaining, managing and using the data warehouse. Meta data can be classified into:
- Technical meta data, which contains information about warehouse data for use by warehouse designers and administrators when carrying out warehouse development and management tasks.
- Business meta data, which contains information that gives users an easy-to-understand perspective of the information stored in the data warehouse.
Equally important, meta data provides interactive access to users to help them understand content and find data. One issue with meta data is that the meta data gathering capabilities of many data extraction tools remain fairly immature. Therefore, there is often a need to create a meta data interface for users, which may involve some duplication of effort.
Meta data management is provided via a meta data repository and accompanying software. Meta data repository management software, which typically runs on a workstation, can be used to map the source data to the target database; generate code for data transformations; integrate and transform the data; and control moving data to the warehouse.
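As a rough sketch of what such source-to-target mappings might look like, the entries below are hypothetical technical meta data records; real repository products store far richer technical and business meta data and generate transformation code from it.

# Hypothetical technical meta data entries: source-to-target mappings with transformation rules.
mappings = [
    {"source": "orders.ord_dt",  "target": "fact_sales.date_id",
     "rule": "convert DD-MON-YY to surrogate date key"},
    {"source": "orders.amt_usd", "target": "fact_sales.revenue",
     "rule": "cast to DECIMAL(12,2)"},
]

def lineage(target_column):
    """Answer the user question 'where does this warehouse column come from?'."""
    return [m for m in mappings if m["target"] == target_column]

print(lineage("fact_sales.revenue"))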
As users' interactions with the data warehouse increase, their approaches to reviewing the results of their requests for information can be expected to evolve from relatively simple manual analysis for trends and exceptions to agent-driven initiation of the analysis based on user-defined thresholds. The definition of these thresholds, the configuration parameters for the software agents using them, and the information directory indicating where the appropriate sources for the information can be found are all stored in the meta data repository as well.
Access Tools
The principal purpose of data warehousing is to provide information to business users for strategic decision-making. These users interact with the data warehouse using front-end tools. Many of these tools require an information specialist, although many end users develop expertise in the tools. Tools fall into four main categories: query and reporting tools, application development tools, online analytical processing tools, and data mining tools.
Query and Reporting tools can be divided into two groups: reporting tools and managed query tools. Reporting tools can be further divided into production reporting tools and report writers. Production reporting tools let companies generate regular operational reports or support high-volume batch jobs such as calculating and printing paychecks. Report writers, on the other hand, are inexpensive desktop tools designed for end-users.
Managed query tools shield end users from the complexities of SQL and database structures by inserting a metalayer between users and the database. These tools are designed for easy-to-use, point-and-click operations that either accept SQL or generate SQL database queries.
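A toy example of such a metalayer follows; the business terms, table names and join column are invented. The user picks business names, and the tool generates the SQL behind the scenes.

# Hypothetical metalayer: business terms mapped to physical tables and expressions.
METALAYER = {
    "Revenue":  ("fact_sales", "SUM(revenue)"),
    "Category": ("dim_product", "category"),
}

def build_query(measure, by):
    """Generate SQL for a point-and-click request such as 'Revenue by Category'."""
    fact, measure_expr = METALAYER[measure]
    dim, dim_col = METALAYER[by]
    return (f"SELECT {dim_col}, {measure_expr} "
            f"FROM {fact} JOIN {dim} USING (product_id) "
            f"GROUP BY {dim_col}")

print(build_query("Revenue", by="Category"))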
Often, the analytical needs of the data warehouse user community exceed the built-in capabilities of query and reporting tools. In these cases, organizations will often rely on the tried-and-true approach of in-house application development using graphical development environments such as PowerBuilder, Visual Basic and Forte. These application development platforms integrate well with popular OLAP tools and access all major database systems including Oracle, Sybase, and Informix.
OLAP tools are based on the concepts of dimensional data models and corresponding databases, and allow users to analyze the data using elaborate, multidimensional views. Typical business applications include product performance and profitability, effectiveness of a sales program or marketing campaign, sales forecasting and capacity planning. These tools assume that the data is organized in a multidimensional model.
A critical success factor for any business today is the ability to use information effectively. Data mining is the process of discovering meaningful new correlations, patterns and trends by digging into large amounts of data stored in the warehouse using artificial intelligence, statistical and mathematical techniques.
Data Marts
The concept of a data mart is causing a lot of excitement and attracts much attention in the data warehouse industry. Mostly, data marts are presented as an alternative to a data warehouse that takes significantly less time and money to build. However, the term data mart means different things to different people. A rigorous definition of the term is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users. A data mart might, in fact, be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed on the data warehouse itself rather than in a physically separate store of data. In most instances, however, the data mart is a physically separate store of data resident on a separate database server, often on a local area network serving a dedicated user group. Sometimes the data mart simply comprises relational OLAP technology that creates a highly denormalized dimensional model (e.g., star schema) implemented on a relational database. The resulting hypercubes of data are used for analysis by groups of users with a common interest in a limited portion of the database.
These types of data marts, called dependent data marts because their data is sourced from the data warehouse, have a high value because no matter how they are deployed and how many different enabling technologies are used, different users are all accessing the information views derived from the single integrated version of the data.
Unfortunately, the misleading statements about the simplicity and low cost of data marts sometimes result in organizations or vendors incorrectly positioning them as an alternative to the data warehouse. This viewpoint defines independent data marts that, in fact, represent fragmented point solutions to a range of business problems in the enterprise. This type of implementation should rarely be deployed in the context of an overall technology or applications architecture. Indeed, it is missing the ingredient that is at the heart of the data warehousing concept -- that of data integration. Each independent data mart makes its own assumptions about how to consolidate the data, and the data across several data marts may not be consistent.
Moreover, the concept of an independent data mart is dangerous -- as soon as the first data mart is created, other organizations, groups, and subject areas within the enterprise embark on the task of building their own data marts. As a result, you create an environment where multiple operational systems feed multiple non-integrated data marts that are often overlapping in data content, job scheduling, connectivity and management. In other words, you have transformed a complex many-to-one problem of building a data warehouse from operational and external data sources to a many-to-many sourcing and management nightmare.
Data Warehouse Administration and Management
Data warehouses tend to be as much as 4 times as large as related operational databases, reaching terabytes in size depending on how much history needs to be saved. They are not synchronized in real time to the associated operational data but are updated as often as once a day if the application requires it.
In addition, almost all data warehouse products include gateways to transparently access multiple enterprise data sources without having to rewrite applications to interpret and utilize the data. Furthermore, in a heterogeneous data warehouse environment, the various databases reside on disparate systems, thus requiring inter-networking tools. The need to manage this environment is obvious.
Managing data warehouses includes security and priority management; monitoring updates from the multiple sources; data quality checks; managing and updating meta data; auditing and reporting data warehouse usage and status; purging data; replicating, subsetting and distributing data; backup and recovery and data warehouse storage management.
Information Delivery System
The information delivery component is used to enable the process of subscribing for data warehouse information and having it delivered to one or more destinations according to some user-specified scheduling algorithm. In other words, the information delivery system distributes warehouse-stored data and other information objects to other data warehouses and end-user products such as spreadsheets and local databases. Delivery of information may be based on time of day or on the completion of an external event. The rationale for the delivery systems component is based on the fact that once the data warehouse is installed and operational, its users don't have to be aware of its location and maintenance. All they need is the report or an analytical view of data at a specific point in time. With the proliferation of the Internet and the World Wide Web such a delivery system may leverage the convenience of the Internet by delivering warehouse-enabled information to thousands of end-users via the ubiquitous world wide network.
In fact, the Web is changing the data warehousing landscape since, at a very high level, the goals of both the Web and data warehousing are the same: easy access to information. The value of data warehousing is maximized when the right information gets into the hands of those individuals who need it, where they need it and when they need it most. However, many corporations have struggled with complex client/server systems to give end users the access they need. The issues become even more difficult to resolve when the users are physically remote from the data warehouse location. The Web removes a lot of these issues by giving users universal and relatively inexpensive access to data. Couple this access with the ability to deliver required information on demand, and the result is a web-enabled information delivery system that allows users dispersed across continents to perform sophisticated business-critical analysis and to engage in collective decision-making.
Core software engineering principles
REQUIREMENT ELICITATION
INITIATION - ask questions
PLANNING PRACTICES
Communication practices principles
Tuesday, February 16, 2010
Prototype model
Waterfall model
Framework activities
Difference Between IPv4 and IPv6
IPv4
- Source and destination addresses are 32 bits (4 bytes) in length.
- IPSec support is optional.
- IPv4 header does not identify packet flow for QoS handling by routers.
- Both routers and the sending host fragment packets.
- Header includes a checksum.
- Header includes options.
- Address Resolution Protocol (ARP) uses broadcast ARP Request frames to resolve an IP address to a link-layer address.
- Internet Group Management Protocol (IGMP) manages membership in local subnet groups.
- ICMP Router Discovery is used to determine the IPv4 address of the best default gateway, and it is optional.
- Broadcast addresses are used to send traffic to all nodes on a subnet.
- Must be configured either manually or through DHCP.
- Uses host address (A) resource records in Domain Name System (DNS) to map host names to IPv4 addresses.
- Uses pointer (PTR) resource records in the IN-ADDR.ARPA DNS domain to map IPv4 addresses to host names.
- Must support a 576-byte packet size (possibly fragmented).
IPv6
- Source and destination addresses are 128 bits (16 bytes) in length.
- IPSec support is required.
- IPv6 header contains Flow Label field, which identifies packet flow for QoS handling by router.
- Only the sending host fragments packets; routers do not.
- Header does not include a checksum.
- All optional data is moved to IPv6 extension headers.
- Multicast Neighbor Solicitation messages resolve IP addresses to link-layer addresses.
- Multicast Listener Discovery (MLD) messages manage membership in local subnet groups.
- ICMPv6 Router Solicitation and Router Advertisement messages are used to determine the IP address of the best default gateway, and they are required.
- IPv6 uses a link-local scope all-nodes multicast address.
- Does not require manual configuration or DHCP.
- Uses host address (AAAA) resource records in DNS to map host names to IPv6 addresses.
- Uses pointer (PTR) resource records in the IP6.ARPA DNS domain to map IPv6 addresses to host names.
- Must support a 1280-byte packet size (without fragmentation).
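The difference in address length above can be seen directly with Python's standard ipaddress module; the addresses used here are the reserved documentation prefixes.

import ipaddress

v4 = ipaddress.ip_address("192.0.2.1")        # documentation prefix, 32-bit address
v6 = ipaddress.ip_address("2001:db8::1")      # documentation prefix, 128-bit address

print(v4.version, v4.max_prefixlen)   # 4 32
print(v6.version, v6.max_prefixlen)   # 6 128
print(v6.is_link_local)               # False; fe80::/10 addresses would report True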
Virtual circuit switching - advantages and disadvantages
Virtual circuit switching is a packet switching methodology whereby a path is established between the source and the final destination through which all the packets will be routed during a call. This path is called a virtual circuit because, to the user, the connection appears to be a dedicated physical circuit. However, other communications may also be sharing parts of the same path.
Before the data transfer begins, the source and destination identify a suitable path for the virtual circuit. All intermediate nodes between the two points add a routing entry for the call to their routing tables. Additional parameters, such as the maximum packet size, are also exchanged between the source and the destination during call setup. The virtual circuit is cleared after the data transfer is completed.
Virtual circuit packet switching is connection-oriented. This is in contrast to datagram switching, which is a connectionless packet switching methodology.
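A minimal simulation of this idea follows; the node names and identifiers are invented. During call setup every switch on the path records a virtual-circuit entry, and data packets then carry only the short VC identifier rather than the full destination address.

# Toy virtual-circuit setup and forwarding; names and identifiers are illustrative.
class Switch:
    def __init__(self, name):
        self.name = name
        self.vc_table = {}          # vc_id -> next hop, recorded at call setup

    def setup(self, vc_id, next_hop):
        self.vc_table[vc_id] = next_hop

    def forward(self, vc_id, packet):
        next_hop = self.vc_table[vc_id]
        print(f"{self.name}: VC {vc_id} -> {next_hop}: {packet}")
        return next_hop

# Call setup along the chosen path A -> B -> C for virtual circuit 7.
path = [Switch("A"), Switch("B"), Switch("C")]
for sw, nxt in zip(path, ["B", "C", "destination"]):
    sw.setup(vc_id=7, next_hop=nxt)

# Data transfer: every packet follows the same pre-established route.
for sw in path:
    sw.forward(7, "payload-1")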
Advantages of virtual circuit switching are:
- Packets are delivered in order, since they all take the same route.
- The overhead in the packets is smaller, since there is no need for each packet to contain the full address.
- The connection is more reliable: network resources are allocated at call setup, so even during times of congestion, provided that a call has been set up, the subsequent packets should get through.
- Billing is easier, since billing records need only be generated per call and not per packet.
Disadvantages of a virtual circuit switched network are:
- The switching equipment needs to be more powerful, since each switch needs to store details of all the calls that are passing through it and to allocate capacity for any traffic that each call could generate.
- Resilience to the loss of a trunk is more difficult, since if there is a failure all the calls must be dynamically re-established over a different route.
Monday, February 15, 2010
SONET Layers:
SONET has four optical interface layers. They are:
· Path Layer
· Line Layer
· Section Layer
· Photonic Layer
Path Layer: The Path Layer deals with the transport of services between path-terminating equipment (PTE). The main function of the Path Layer is to map the signals into the format required by the Line Layer. It also reads, interprets, and modifies the path overhead for performance monitoring and automatic protection switching.
Line Layer: The Line Layer deals with the transport of the Path Layer payload and its overhead across the physical medium. The main function of the Line Layer is to provide synchronization and to perform multiplexing for the Path Layer. Its main functions are:
· Protection switching
· Synchronization
· Multiplexing
· Line maintenance
· Error monitoring
Section Layer: The Section Layer deals with the transport of an STS-N frame across the physical medium. Its main functions are:
· Framing
· Scrambling
· Error monitoring
· Section maintenance
Photonic Layer: The Photonic Layer mainly deals with the transport of bits across the physical medium. Its main functions are to specify:
· Wavelength
· Pulse shape
· Power levels
Introduction to SONET:
Synchronous optical network (SONET) is a standard for optical telecommunications transport. It was formulated by the Exchange Carriers Standards Association (ECSA) for the American National Standards Institute (ANSI), which sets industry standards in the United States for telecommunications and other industries.
The increased configuration flexibility and bandwidth availability of SONET provides significant advantages over the older telecommunications system. These advantages include the following:
· Reduction in equipment requirements and an increase in network reliability.
· Provision of overhead and payload bytes: the overhead bytes permit management of the payload bytes on an individual basis and facilitate centralized fault sectionalization
· Definition of a synchronous multiplexing format for carrying lower-level digital signals (such as DS-1, DS-3) and a synchronous structure that greatly simplifies the interface to digital switches, digital cross-connect switches, and add-drop multiplexers
· Availability of a set of generic standards that enable products from different vendors to be connected
· Definition of a flexible architecture capable of accommodating future applications, with a variety of transmission rates
In brief, SONET defines optical carrier (OC) levels and electrically equivalent synchronous transport signals (STSs) for the fiber-optic-based transmission hierarchy.
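For example, each STS/OC level is a multiple of the STS-1 base rate of 51.84 Mbit/s, which a short calculation makes concrete:

# SONET line rates: STS-N / OC-N = N x 51.84 Mbit/s (STS-1 base rate).
STS1_MBPS = 51.84
for n in (1, 3, 12, 48, 192):
    print(f"OC-{n}: {n * STS1_MBPS:.2f} Mbit/s")
# OC-1: 51.84, OC-3: 155.52, OC-12: 622.08, OC-48: 2488.32, OC-192: 9953.28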
ATM Congestion control
Due to the unpredictable traffic pattern, congestion is unavoidable. When the total input rate is greater than the output link capacity, congestion happens. Under a congestion situation, the queue length may become very large in a short time, resulting in buffer overflow and cell loss. So congestion control is necessary to ensure that users get the negotiated QoS.
This study focuses on two major congestion-control algorithms, both intended for ABR sources. The binary feedback scheme (EFCI) uses a single bit to indicate that congestion has occurred. A switch may detect congestion on a link if its queue length exceeds a certain level; accordingly, the switch sets the congestion bit to 1. When the destination receives data cells with the EFCI bit set to 1, it sets the CI bit of the backward RM cell to 1, indicating congestion. When the source receives a backward RM cell with the CI bit set to 1, it has to decrease its rate. Because EFCI only tells the source to increase or decrease its rate, the method converges slowly. The Explicit Rate Indication for Congestion Avoidance (ERICA) algorithm solves this problem by allowing each switch to write the desired rate explicitly into passing RM cells; the source then adjusts its rate according to the backward RM cells.
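A rough sketch of the binary feedback idea follows; the increase step and decrease factor are illustrative assumptions, not the ATM Forum's standard parameters. The source nudges its allowed cell rate up while no congestion is reported and cuts it back when a backward RM cell arrives with CI = 1.

# Illustrative binary-feedback (EFCI-style) rate adaptation for an ABR source.
# The increase step and decrease factor below are assumptions, not standard values.
PCR, MCR = 1000.0, 10.0          # peak and minimum cell rates (cells/s)
acr = 100.0                      # current allowed cell rate

def on_backward_rm(ci_bit):
    """Adjust the allowed cell rate when a backward RM cell is received."""
    global acr
    if ci_bit:                   # congestion indicated: multiplicative decrease
        acr = max(MCR, acr * 0.8)
    else:                        # no congestion: additive increase
        acr = min(PCR, acr + 10.0)
    return acr

for ci in [0, 0, 0, 1, 1, 0]:
    print(on_backward_rm(ci))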
ATM QoS Priority Scheme
Each service category in ATM has its own queue. There are mainly two schemes for queue service. In the round-robin scheme, all queues have the same priority and therefore have the same chance of being serviced; the link's bandwidth is equally divided amongst the queues being serviced. The other scheme is weighted round-robin, which is somewhat similar to WFQ in IP networks: queues are serviced depending on the weights assigned to them. Weights are determined according to the Minimum Guaranteed Bandwidth attribute of each queue in each ATM switch. This scheme ensures that the guaranteed bandwidth is reserved for important applications such as the CBR service category.
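A simplified weighted round-robin pass over the per-category queues might look like this; the weights and queue contents are invented for illustration.

from collections import deque

# Illustrative per-service-category queues and weights (values are assumptions).
queues = {
    "CBR":     deque(["c1", "c2", "c3", "c4"]),
    "rt-VBR":  deque(["v1", "v2"]),
    "UBR":     deque(["u1", "u2", "u3"]),
}
weights = {"CBR": 3, "rt-VBR": 2, "UBR": 1}   # cells served per round

def wrr_round():
    """Serve each queue up to its weight in one scheduling round."""
    served = []
    for name, q in queues.items():
        for _ in range(min(weights[name], len(q))):
            served.append((name, q.popleft()))
    return served

print(wrr_round())   # CBR gets 3 slots, rt-VBR 2, UBR 1 in this round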
ATM Service Categories
Providing desired QoS for different applications is very complex. For example, voice is delay-sensitive but not loss-sensitive, data is loss- sensitive but not delay-sensitive, while some other applications may be both delay-sensitive and loss-sensitive.
To make it easier to manage, the traffic in ATM is divided into five service categories according to various combinations of requested QoS:
· CBR: Constant Bit Rate
CBR is the service category for traffic with rigorous timing requirements like voice, and certain types of video. CBR traffic needs a constant cell transmission rate throughout the duration of the connection.
· rt-VBR: Real-Time Variable Bit Rate
This is intended for variable-bit-rate traffic, e.g., certain types of video, with stringent timing requirements.
· nrt-VBR: Non-Real-Time Variable Bit Rate
This is for bursty sources such as data transfer, which do not have strict time or delay requirements.
· UBR: Unspecified Bit Rate
This is ATM’s best-effort service, which does not provide any QoS guarantees. This is suitable for non-critical applications that can tolerate or quickly adjust to loss of cells.
· ABR: Available Bit Rate
ABR is commonly used for data transmissions that require a guaranteed QoS, such as low probability of loss and error. Small delay is also required for some application, but is not as strict as the requirement of loss and error. Due to the burstiness, unpredictability and huge amount of the data traffic, sources implement a congestion control algorithm to adjust their rate of cell generation. Connections that adjust their rate in response to feedback may expect a lower CLR and a fair share of available bandwidth.
The available bandwidth at an ABR source at any point in time depends on how much bandwidth remains after the CBR and VBR traffic have been allocated their share of bandwidth. Figure 1 explains this concept.
ATM Traffic Descriptors
The ability of a network to guarantee QoS depends on the way in which the source generates cells (uniformly or in a bursty way) and also on the availability of network resources, e.g., buffers and bandwidth. The connection contract between user and network will thus contain information about the way in which traffic will be generated by the source. A set of traffic descriptors is specified for this purpose. Policing algorithms check whether the source abides by the traffic contract. The network only provides the QoS for cells that do not violate these specifications.
The following are traffic descriptors specified for an ATM network.
· Peak Cell Rate (PCR):
The maximum instantaneous rate at which the user will transmit.
· Sustained Cell Rate (SCR):
The average rate as measured over a long interval.
· Burst Tolerance (BT):
The maximum burst size that can be sent at the peak rate.
· Maximum Burst Size (MBS):
The maximum number of back-to-back cells that can be sent at the peak cell rate.
· Minimum Cell Rate (MCR):
The minimum cell rate desired by a user.
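Policing of these descriptors is commonly done with the Generic Cell Rate Algorithm (GCRA); a minimal sketch of its virtual-scheduling form follows, with illustrative parameter values. A cell conforms if it does not arrive earlier than the theoretical arrival time minus the tolerance.

# Generic Cell Rate Algorithm (virtual scheduling form), used to police PCR/SCR.
# Parameter values below are illustrative only.
def make_gcra(increment, limit):
    """increment I = 1/rate (seconds per cell), limit L = tolerance (seconds)."""
    tat = 0.0                                  # theoretical arrival time
    def conforms(arrival):
        nonlocal tat
        if arrival < tat - limit:              # too early: non-conforming cell
            return False
        tat = max(arrival, tat) + increment    # schedule next expected arrival
        return True
    return conforms

police = make_gcra(increment=0.01, limit=0.005)    # ~100 cells/s with 5 ms tolerance
for t in [0.00, 0.002, 0.013, 0.020, 0.021]:
    print(t, police(t))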
ATM QoS parameter
The primary objective of ATM is to provide QoS guarantees while transferring cells across the network. There are mainly three QoS parameters specified for ATM, and they are indicators of the performance of the network:
- Cell Transfer Delay (CTD):
The delay experienced by a cell between the time the first bit of the cell is transmitted by the source and the time the last bit of the cell is received by the destination. This includes propagation delay, processing delay and queuing delays at switches. Maximum Cell Transfer Delay (Max CTD) and Mean Cell Transfer Delay (Mean CTD) are used.
- Peak-to-peak Cell Delay Variation (CDV):
The difference of the maximum and minimum CTD experienced during the connection. Peak-to-peak CDV and Instantaneous CDV are used.
- Cell Loss Ratio (CLR):
The fraction of cells lost in the network due to error or congestion and therefore not received by the destination. The CLR value is negotiated between user and network during the call setup process and is usually in the range of 10^-1 to 10^-15.
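Given a set of measured cell transfer delays, these parameters reduce to simple statistics; the sample values below are made up for illustration.

# Illustrative CTD/CDV/CLR calculation from measured cell delays (values are made up).
delays_ms = [1.9, 2.1, 2.0, 2.6, 2.2]          # per-cell transfer delays

max_ctd  = max(delays_ms)
mean_ctd = sum(delays_ms) / len(delays_ms)
peak_to_peak_cdv = max_ctd - min(delays_ms)     # difference of maximum and minimum CTD

cells_sent, cells_lost = 1_000_000, 3
clr = cells_lost / cells_sent                   # cell loss ratio

print(max_ctd, round(mean_ctd, 2), round(peak_to_peak_cdv, 2), clr)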
How ATM Works
· An ATM network uses fixed-length cells to transmit information. Each cell consists of 48 bytes of payload and 5 bytes of header. Transmitting the necessary number of cells per unit time provides the flexibility needed to support variable transmission rates.
· An ATM network is connection-oriented. It sets up a virtual channel connection (VCC) going through one or more virtual paths (VP) and virtual channels (VC) before transmitting information. Cells are switched according to the VP or VC identifier (VPI/VCI) value in the cell header, which is originally set at connection setup and is translated into a new VPI/VCI value as the cell passes each switch.
· ATM resources such as bandwidth and buffers are shared among users; they are allocated to a user only when that user has something to transmit. The bandwidth is allocated according to the application traffic and QoS request at the signaling phase, so the network uses statistical multiplexing to improve the effective throughput.
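The fixed cell format described above can be sketched as follows. This is only an illustration: the HEC byte is left as a zero placeholder instead of computing the real CRC-8 over the first four header bytes, and the translation table at the end is purely hypothetical.

# Sketch of a 53-byte ATM cell: 5-byte UNI header (GFC, VPI, VCI, PTI, CLP, HEC) + 48-byte payload.
def build_cell(vpi, vci, payload, gfc=0, pti=0, clp=0):
    assert len(payload) == 48, "ATM payload is always exactly 48 bytes"
    # Bit layout of the first four header bytes: GFC(4) VPI(8) VCI(16) PTI(3) CLP(1).
    word = (gfc << 28) | (vpi << 20) | (vci << 4) | (pti << 1) | clp
    header = word.to_bytes(4, "big") + bytes([0])      # fifth byte would carry the HEC
    return header + payload

cell = build_cell(vpi=1, vci=42, payload=bytes(48))
print(len(cell))          # 53 bytes: 5 header + 48 payload

# At each switch the VPI/VCI is looked up and rewritten (label swapping):
translation = {(1, 42): (7, 99)}                       # (in VPI, in VCI) -> (out VPI, out VCI)
print(translation[(1, 42)])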