Achieving Performance Speed-up in FPGA Based Bit-Parallel Multipliers using Embedded Primitive and Macro support

The MPT network layer multipath communication library is a novel solution for several problems including IPv6 transition, reliable data transmission using TCP, real-time transmission using UDP and also wireless network layer routing problems. MPT can provide an IPv4 or an IPv6 tunnel over one or more IPv4 or IPv6 communication channels. MPT can also aggregate the capacity of multiple physical channels. In this paper, the channel aggregation capability of the MPT library is measured up to twelve 100Mbps speed channels. Different scenarios are used: both IPv4 and IPv6 are used as the underlying and also as the encapsulated protocols and also both UDP and TCP are used as transport protocols. In addition, measurements are taken with both 32-bit and 64-bit version of the MPT library. In all cases, the number of the physical channels is increased from 1 to 12 and the aggregated throughput is measured. Keywords—channel capacity aggregation, network layer multipath communication, performance analysis, TCP/IP protocol stack, tunneling


I. INTRODUCTION
The multiplier circuit is one of the fundamental components used in digital signal processing (DSP) [1] [2] [3] [4]. The field of DSP has always been driven by advances in scaled very-large-scale-integration (VLSI) technologies. The goal of digital design is to maximize performance while keeping cost down [5]. In the context of general digital design, performance is measured in terms of the amount of hardware circuitry and resources required; the speed of execution (throughput and clock rate); and the amount of power dissipated. There is always an application-driven tradeoff between these parameters. It is, therefore, desirable to have an efficient realization of these circuits for use in different DSP systems [6] [7].
DSP algorithms have traditionally been implemented using general purpose processors or DSP processors.
However, with the current trend moving back towards hardware-intensive processing, it becomes important for designers to give serious thought to the underlying implementation platform [8]. Applications demanding increased performance mainly use application-specific integrated circuits (ASICs) or structured ASICs [2]. The main attraction of ASICs is that the architecture can be developed specifically to meet the performance requirement. However, the non-recurring engineering (NRE) costs associated with ASICs have confined their use to high-volume markets. Field programmable gate arrays (FPGAs) provide an alternative to ASICs. They avoid the high NRE costs by giving the user the flexibility to configure the device in the field [4], [9]. Other advantages include large-scale integration [4], [10], lower energy requirements [11], [12], and the availability of several on-board intellectual property (IP) cores [13].
Design for FPGAs differs dramatically from general VLSI design [14]. The design process proceeds through phases such as design entry, synthesis, translation, mapping and place & route (PAR). Design entry is the only manual phase in the entire design flow. Therefore, using FPGAs as an implementation platform requires programming the desired functionality in a hardware description language (HDL), as it is the most widely used design entry method [15]. The rest of the design process is automated, and there is strong computer aided design (CAD) support for synthesis and implementation. However, even sophisticated CAD tools are often not good enough to meet some design constraints if an arbitrary coding style is used [16]. A popular guideline for writing functional synthesizable HDL code is the RTL guideline, where RTL stands for register transfer level, signifying that data transfer should occur through registers only. These guidelines adhere to synchronous design practices and describe the regulation of data flow and how data is processed [17], rather than which part of the FPGA fabric processes the data. In effect, such code is purely inferential and relies strongly on the software environment, which distributes the logic as per the design goal. Thus, in order to use embedded primitive and macro resources effectively, the design entry needs to be modified.
There has been considerable work regarding the implementation of multipliers on FPGAs [17]-[33]. These efforts mainly focus on modifying the multiplier architecture to achieve performance improvement. However, there has been very limited effort in improving performance by using embedded FPGA resources [34]-[36]. In this paper we carry out technology dependent optimizations of fixed-point multipliers by modifying the coding strategy at the design entry phase. This is achieved by writing functional and synthesizable code that involves direct primitive and macro instantiations. This requires detailed information about the FPGA target family that is being used and the primitives that it supports. In our study different primitives have been used, and system functionality has been distributed in a way that maps directly onto these components, rather than writing functional code and allowing the synthesizer to distribute the logic through inference. The study focuses on the Spartan-6, Virtex-4 and Virtex-5 families. Detailed analysis is carried out and it is concluded that by using primitive instantiations a considerable improvement in performance can be achieved. This is achieved without having to alter the data-time relation of the algorithm under consideration. The only tradeoff is that the design entry becomes more complicated.
The rest of the paper is organized as follows. Section II briefly discusses the fixed-point bit-parallel multipliers that have been considered in this work. Section III lists the primitives that have been used in this work, with a brief description of each. Section IV carries out the actual synthesis and implementation. Conclusions are drawn in Section V and references are listed at the end.

II. BIT-PARALLEL MULTIPLIERS
In parallel multipliers, the number of partial products to be added is the main parameter that determines the performance of the multiplier. Bit-parallel multipliers process one whole word of the input sample each clock cycle and are ideal for high-speed applications. The multiplication process is carried out as shown in figure 1. In this paper three different bit-parallel multipliers are considered, viz. parallel ripple-carry array (RCA) multipliers, parallel carry-save array (CSA) multipliers and Baugh-Wooley (BW) multipliers. The details of these multipliers can be found in [5]. The operands in each case are assumed to be in fixed-point 2's complement representation. Such a representation ensures a correct final result even if there is an intermediate overflow [5].
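The partial-product view of 2's complement multiplication can be illustrated with a short behavioral sketch. It is not a model of any of the three array structures; it only shows, under the assumption of n-bit signed operands and a 2n-bit accumulator, that accumulating the rows modulo 2^(2n) still yields the correct product. The function names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def partial_products(a, b, n):
    """Return the n partial-product rows for n-bit two's complement a and b.

    Row i equals a * b_i * 2**i; the row selected by the sign bit of b
    carries a negative weight.
    """
    a_signed = to_signed(a, n)
    b_bits = b & ((1 << n) - 1)
    rows = []
    for i in range(n):
        b_i = (b_bits >> i) & 1
        weight = -(1 << i) if i == n - 1 else (1 << i)
        rows.append(a_signed * b_i * weight)
    return rows


def multiply(a, b, n):
    """Accumulate the rows in a 2n-bit register that wraps on overflow."""
    mask = (1 << (2 * n)) - 1
    acc = 0
    for row in partial_products(a, b, n):
        acc = (acc + row) & mask  # intermediate wrap-around is harmless
    return to_signed(acc, 2 * n)
```

For example, `multiply(-3, 5, 4)` accumulates the rows -3 and -12 and returns -15.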

III. FPGA PRIMITIVES
Primitives are the components that make up an FPGA. The exact nature of a primitive may vary from family to family. In this section we briefly describe the primitives that are used in this work. These belong to the Spartan-6, Virtex-4 and Virtex-5 families.

A. BUFG [38]
This design element is a high-fan-out global clock buffer that connects signals to the global routing resources for low-skew distribution of the signal. BUFGs are typically used on clock nets as well as other high fan-out nets such as sets/resets and clock enables. The primitive is supported by all three families under consideration.

B. FDSE [38]
FDSE is a single D-type flip-flop with clock enable and synchronous set. The synchronous set input, when high, overrides the clock enable input and sets the output high during the low-to-high clock transition. The data is loaded into the flip-flop when set is low and clock enable is high during the low-to-high clock transition. The primitive is supported by all three families under consideration.
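The priority between set and clock enable described above can be captured in a small behavioral model. This is a sketch only: the class and method names are hypothetical, and the power-up value `init` is a parameter of the model, not a statement about the device default.

```python
class FDSE:
    """Behavioral model of a D flip-flop with clock enable and synchronous set."""

    def __init__(self, init=0):
        # Power-up state of the model (an assumption, chosen by the caller).
        self.q = init

    def clock_edge(self, d, ce, s):
        """Apply one low-to-high clock transition and return the new output."""
        if s:
            self.q = 1          # synchronous set overrides clock enable
        elif ce:
            self.q = d & 1      # load data only when enabled
        # otherwise the previous value is held
        return self.q
```

Calling `clock_edge(d=0, ce=0, s=1)` sets the output high even though the clock enable is low, which matches the override behavior in the description.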

C. LUT4_L [38]
This design element is a 4-bit look-up table (LUT) with a local output that is used to connect to another output within the same configurable logic block (CLB). The primitive is supported by all three families under consideration.

D. LUT6_2 [38]
This design element is a 6-input, 2-output LUT that can implement any two 5-input logic functions with shared inputs, or implement a 6-input logic function and a 5-input logic function with shared inputs and shared logic values. The primitive is not supported by the Virtex-4 logic family.
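To make the dual-output behavior concrete, the following sketch models a 6-input, 2-output LUT as a 64-bit truth table and packs the sum and carry of a full adder into one such table. The function names are hypothetical; the model assumes that the 5-input output reads the lower 32 table bits and that the 6-input output reads the upper 32 bits when the sixth input is tied high.

```python
def lut6_2(init, i0, i1, i2, i3, i4, i5):
    """Return (o6, o5) for a 64-bit truth table `init`."""
    addr5 = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3) | (i4 << 4)
    addr6 = addr5 | (i5 << 5)
    o6 = (init >> addr6) & 1   # full 6-input function
    o5 = (init >> addr5) & 1   # 5-input function from the lower half
    return o6, o5


def full_adder_init():
    """Build a table so that, with i5 = 1, o6 is the sum and o5 the carry."""
    init = 0
    for addr in range(32):
        a, b, c = addr & 1, (addr >> 1) & 1, (addr >> 2) & 1
        s = a ^ b ^ c
        cout = (a & b) | (a & c) | (b & c)
        init |= cout << addr        # lower half feeds o5
        init |= s << (addr + 32)    # upper half feeds o6 when i5 = 1
    return init
```

With `init = full_adder_init()`, the call `lut6_2(init, a, b, c, 0, 0, 1)` returns the sum and carry of the three input bits from a single table.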

E. CARRY4 [38]
This primitive represents the fast carry logic for a slice. The carry chain consists of a series of four multiplexers and four XOR gates that connect to the other LUTs in the slice via dedicated routes to form more complex functions. The fast carry logic is useful for building arithmetic functions such as adders, counters and subtractors. The primitive is not supported by the Virtex-4 logic family.
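The chain of four multiplexers and four XOR gates can be sketched behaviorally as follows. The signal names (`s` for the select/propagate bits, `di` for the data inputs, `ci` for the carry in) and the `add4` helper are assumptions made for illustration; in a real design the propagate bits would be produced by the LUTs in the slice.

```python
def carry4(s, di, ci):
    """Model a 4-stage carry chain; s and di are 4-element bit lists, LSB first."""
    o, co = [], []
    carry = ci
    for k in range(4):
        o.append(s[k] ^ carry)             # XOR gate: sum bit
        carry = carry if s[k] else di[k]   # multiplexer: propagate or generate
        co.append(carry)
    return o, co


def add4(a, b, cin=0):
    """Add two 4-bit values using the carry chain model."""
    a_bits = [(a >> k) & 1 for k in range(4)]
    b_bits = [(b >> k) & 1 for k in range(4)]
    propagate = [x ^ y for x, y in zip(a_bits, b_bits)]  # would sit in LUTs
    o, co = carry4(propagate, a_bits, cin)
    total = sum(bit << k for k, bit in enumerate(o))
    return total, co[3]
```

When a propagate bit is 0 the two operand bits are equal, so selecting either operand bit as the new carry gives the generate term.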

F. MULT_AND [38]
MULT_AND is an AND component used exclusively for building faster and smaller multipliers. The primitive is only supported by the Virtex-4 logic family.

G. MUXCY_L [38]
This primitive is a 2-to-1 multiplexer for carry logic and is used to implement a 1-bit high-speed carry propagate function. The primitive is only supported by the Virtex-4 logic family.

H. XORCY [38]
XORCY is a special XOR element with general output that generates faster and smaller arithmetic functions. The primitive is only supported by the Virtex-4 logic family.

I. DSP48 [38]
This design element is a versatile, scalable, hard IP block that allows for the creation of compact, high-speed, arithmetic-intensive operations, such as those seen in many DSP algorithms. The functions available within the block include multiplication, addition, subtraction, accumulation, shifting, logical operations, and pattern detection. The primitive is supported by all three families under consideration.

A. Methodology
The implementation in this work targets three different FPGA families, viz. Spartan-6, Virtex-4 and Virtex-5. Only the LX series has been considered, as it is apt for general logic applications. The implementation is carried out for an input operand length varying from 4 to 32 bits. The parameters considered are resource utilization, timing and dynamic power dissipation. Resource utilization is considered in terms of the on-chip FPGA components used. Timing refers to the clock speed of a design and is limited by the setup time of the input/output registers, the propagation and routing delays associated with the critical path, the clock-to-output time associated with the flip-flops, and the skew between the launch (input) register and the capture (output) register. Timing analysis is done to provide information about the speed/throughput of the system. Dynamic power dissipation is related to the charging and discharging of node capacitances along the different switching elements. To ensure a fair comparison, similar test benches have been used for all the implemented designs, i.e. the input statistics remain the same in each case. The initial design entry is done using VHDL. The coding strategy is based on instantiation of the different primitives listed in section III. However, for comparison we have also followed the conventional inferential approach. The constraints relating to the period and offsets are duly provided and complete timing closure is ensured. The design synthesis, mapping and translation are carried out in Xilinx ISE 12.1, and the simulator database is then analyzed for on-chip resources, throughput and timing metrics. Power metrics are obtained using the XPower Analyzer.
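The timing components listed above combine into a minimum clock period in the usual register-to-register way. The following is a minimal sketch under that textbook assumption; the function name, the sign convention for skew and the numbers in the usage note are hypothetical and are not taken from the reported results.

```python
def max_clock_mhz(t_clk_to_q, t_logic, t_route, t_setup, t_skew=0.0):
    """Estimate the maximum clock frequency in MHz from delays given in ns.

    t_skew is the capture clock arrival time minus the launch clock arrival
    time; a positive value relaxes the constraint on the data path.
    """
    period_ns = t_clk_to_q + t_logic + t_route + t_setup - t_skew
    return 1000.0 / period_ns
```

For instance, a path with 0.5 ns clock-to-output, 2.0 ns logic delay, 2.0 ns route delay and 0.5 ns setup time has a 5 ns period, i.e. a 200 MHz limit.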

B. Experimental results
As mentioned earlier, for each implementation a traditional inferential coding strategy is followed. Synthesis based on this coding strategy utilizes the FPGA resources as general logic elements. This serves as a standard against which the other implementations are compared. Metrics associated with the instantiation of various primitives are named after the primitives used. Tables 1, 2 and 3 give a comparison of the on-chip resources utilized by different primitives for an input word-length of 16 bits. The architectures considered are the bit-parallel RCA, CSA and BW multipliers. The target device is XC6SLX16 from Spartan-6.
It is observed that by instantiating primitives and macro blocks there is a considerable reduction in the on-chip resources utilized by a particular structure. This is achieved without having to modify any architectural details. The most area-efficient structure is obtained with the LUT6_2 primitive because of its ability to implement both sum and carry in a single LUT. LUT4_L uses two different 4-input LUTs to implement the sum and the carry parts in each processing cell of the array. The CARRY4 and DSP48 primitives provide fast carry logic for each row. Their inclusion mainly affects the timing properties of the structure. However, there is still some reduction in the slices utilized when compared to the basic structure generated through the inferential coding style. Further analysis is carried out for different multiplier structures for varying word-lengths and different target families. The metrics obtained from the synthesizer database are then plotted as a function of operand word-length and are presented in figures 2, 3 and 4. For simplicity we have considered only the occupied slices in each case. The Virtex-4 family does not support the LUT6_2 primitive, which hence does not appear in the plot. Further, the fast carry logic in this family is implemented using a combination of MULT_AND and MUXCY_L primitives. It is observed that in each case there is a substantial reduction in area when the structures are generated through instantiation of different primitive components. Different primitives also give different area performance depending upon the logic they implement. If area is the parameter of interest, LUT6_2 gives the best performance.
Although the use of the LUT4_L and LUT6_2 primitives reduces the overall logic used, the logic associated with the critical path of the structure is increased. This is indicated by the increase in the number of logic levels in the critical path. As a result, the logic delay and the associated route delay increase. However, the fast carry logic associated with the CARRY4 primitive makes the addition process fast, resulting in reduced route delays. For Virtex-4 devices the fast carry logic is implemented using a combination of MULT_AND, MUXCY_L and XORCY primitives. The use of CARRY4 logic enhances the speed only in the case of RCA multipliers, as their critical path is limited by the rippling of the generated carry in each cell. With CSA and BW multipliers, however, there is no rippling of the carry in the main structure. The only part of the multiplier that is enhanced using the fast carry logic is the vector merging adder (VMA). Tables 4, 5 and 6 provide a comparison of the maximum achievable clock rates post implementation for a word length of 16 bits. The target family is Spartan-6. The structures generated through instantiation of different primitives tend to have better timing closure in terms of the relationship between an external clock pad and its associated data-in or data-out pad. This is indicated by the offset-in and offset-out metrics from the timing database of the synthesizer. The values are included in the tables and indicate that better timing behavior is achieved with primitive instantiations. The results also indicate that CSA and BW multipliers have higher operating frequencies than the RCA multiplier structures. Further analysis is carried out by plotting the maximum achievable speed against the operand word length for different structures and for different target families. The results are shown in figures 5, 6 and 7. Again, for simplicity only the maximum achievable speeds have been considered.
It is observed from the plots that the use of fast carry logic results in faster execution and thus higher clock frequencies are achievable. The effect is more prominent in the RCA multiplier as the carry rippling is completely eliminated.
Finally, dynamic power dissipation for the different structures is considered. Because an FPGA is programmable, it is only natural to look into minimizing the power dissipated. The dynamic power dissipation in a CMOS circuit is a function of the square of the supply voltage (V^2), the clock frequency (f_clk), the switching activity (α), the total capacitance seen by a particular node (C_L) and the number of elements used (σ). The analysis was done for a constant supply voltage and at the maximum operating frequency for each structure. To ensure a reasonable comparison, the test vectors provided during post-route simulation were selected to represent the worst-case scenario for data coming into the multiplier block. The same test bench was used for all the synthesized structures. The design node activity from the simulator database, along with the power constraint file (PCF), was used for power analysis in the XPower Analyzer tool. Table 7 shows the power dissipated in various resources for the RCA multiplier for an operand length of 16 bits. The targeted device is Spartan-6. Tables 8 and 9 show the same metrics for the CSA and BW structures. The power dissipated in the clocking resources varies with the clock activity (clock frequency) as provided in the PCF. Since each structure is operated at its maximum operating frequency, the power dissipated by the clock varies accordingly and has a maximum value for the multiplier based on CARRY4 and DSP48 primitives. However, the capacitance C_L, which needs to be driven at each toggling node, varies with the type, fan-out, and capacitance of the logic and routing resources used in the design. The use of primitives through instantiation reduces the fan-out of the non-clocking nets. This is indicated in table 10, where the average fan-out of non-clocking nets for different multipliers using different primitives is listed for a 16-bit operand word-length. In addition, there is a reduction in the number of elements (σ) utilized by the different multiplier structures when designed using direct instantiation of primitives. Thus, the power dissipated in the logic is reduced and has a minimum value for the CARRY4 and DSP48 primitives. The reduction in the power dissipation in the signals and I/Os indicates that primitive instantiation also tends to relax the signal transition rates for the duration of operation.
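The dependence on these quantities is often written as a first-order sum over the switching elements. The expression below is that common textbook approximation, stated here as an assumption; the per-element index i is introduced only for illustration and is not notation used in the analysis above.

```latex
P_{\mathrm{dyn}} \approx \sum_{i=1}^{\sigma} \alpha_i \, C_{L,i} \, V^{2} \, f_{\mathrm{clk}}
```

Under this approximation, lowering the fan-out reduces the capacitance terms and lowering the element count shortens the sum, which is consistent with the trends described for the instantiated designs.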
Further analysis is carried out by plotting the total dynamic power dissipation as a function of input word-length for different multiplier structures and for different FPGA families. The results are shown in figures 8, 9 and 10. For DSP systems it is more appropriate to quantify the power efficiency through energy analysis [39]. This gives an idea of the power requirements of a design at a lower level. Three energy-related parameters are defined for the different multiplier designs: energy per operation (EOP), which is the average amount of energy required to compute one operation; energy throughput (ET), which is the energy dissipated for every output bit processed; and energy density (ED), which is the energy dissipated per FPGA slice. Tables 11, 12 and 13 provide these metrics for the different designs. The input operand length is 16 bits and the target device is from Spartan-6. In each case the critical path delay is taken as the approximate time to complete one operation. Further analysis is carried out by plotting the energy metrics as a function of operand word length for the different multipliers. The plots appear in figures 11, 12 and 13. The target device in each case is XC6SLX16 from Spartan-6. The plots clearly reveal that the structures based on primitive instantiations have high power efficiency. The energy requirement is lowest for the structures based on the CARRY4 and DSP48 primitives. Also, note that the effect is more prominent for RCA multipliers as the entire structure is synthesized using the CARRY4 primitive, whereas in the CSA and BW multipliers only the VMA part is based on the fast carry logic.
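The three energy metrics can be derived from the dynamic power, the critical path delay, the number of output bits and the slice count. The sketch below assumes that EOP is the product of power and critical path delay, and that ET and ED divide EOP by the output bits and the occupied slices respectively; the function name and the sample numbers are hypothetical and do not come from the tables.

```python
def energy_metrics(power_mw, critical_path_ns, output_bits, slices):
    """Return EOP, ET and ED in picojoules (mW times ns equals pJ)."""
    eop_pj = power_mw * critical_path_ns      # energy per operation
    return {
        "EOP_pJ": eop_pj,
        "ET_pJ_per_bit": eop_pj / output_bits,    # energy per output bit
        "ED_pJ_per_slice": eop_pj / slices,       # energy per occupied slice
    }
```

As an illustration, a design dissipating 20 mW with a 10 ns critical path, 32 output bits and 100 slices would have an EOP of 200 pJ, an ET of 6.25 pJ per bit and an ED of 2 pJ per slice.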

V. CONCLUSIONS AND FUTURE SCOPE
This paper implemented bit-parallel fixed-point multipliers in three different structures. The hardware implementations presented in this paper were based on the use of various built-in primitives and macro blocks inherent to modern FPGAs. The analysis and the experimental results presented in this paper clearly indicate that a considerable improvement in performance is achievable by using these primitives. Further, the design entry used in this paper was based on instantiation rather than inference. By using a coding strategy based on instantiation, the on-chip FPGA components can be used in a manner that fully utilizes their potential. A judicious choice of primitives will also ensure that a particular performance parameter is enhanced as required by a particular application. This paper deliberately ruled out any architectural modification that may be carried out at the top level of the design. The idea was to present a clear-cut analysis that provides insight into the performance speed-up that may be achieved by utilizing the extensive primitive support provided by FPGA families. Currently the authors are working on achieving a performance speed-up by using a combination of architectural modifications and embedded primitives in FPGAs.
REFERENCES
Abstract-The MPT network layer multipath communication library is a novel solution for several problems including IPv6 transition, reliable data transmission using TCP, real-time transmission using UDP and also wireless network layer routing problems. MPT can provide an IPv4 or an IPv6 tunnel over one or more IPv4 or IPv6 communication channels. MPT can also aggregate the capacity of multiple physical channels. In this paper, the channel aggregation capability of the MPT library is measured up to twelve 100Mbps speed channels. Different scenarios are used: both IPv4 and IPv6 are used as the underlying and also as the encapsulated protocols and also both UDP and TCP are used as transport protocols. In addition, measurements are taken with both 32-bit and 64-bit version of the MPT library. In all cases, the number of the physical channels is increased from 1 to 12 and the aggregated throughput is measured.
Keywords-channel capacity aggregation, network layer multipath communication, performance analysis, TCP/IP protocol stack, tunneling

I. INTRODUCTION
Multipath communication is a hot research topic today. Different solutions have been invented: the multipath technology can be used in different layers (link layer, network layer, transport layer); see our short survey in the next section. Here we focus on the MPT network layer multipath communication library [1], which was developed at the Faculty of Informatics, University of Debrecen, Debrecen, Hungary. It can be freely downloaded for 32-bit and 64-bit Linux operating systems as well as for Raspberry Pi from [2]. It makes it possible to aggregate the transmission capacity of multiple interfaces of a device. Its performance, especially its channel aggregation capability, was analyzed for two channels in [3] and for four channels in [4] using serial links with a speed of a few megabits per second.
We measured the channel aggregation capability of the MPT network layer multipath communication library using a significantly increased number of physical channels and transmission speed compared to the earlier tests of other researchers [3] and [4]. Our preliminary results concerning the 32-bit version of the MPT library, measured by the industry-standard iperf tool using the UDP transport layer protocol, were published in our conference paper [5], which is now extended with the HTTP measurements (using TCP) and with the testing of the 64-bit version of the MPT library.
Manuscript received February 26, 2015, revised May 9, 2015.
G. Lencse is with the Department of Telecommunications, Széchenyi István University, Győr, Hungary (phone: +36-96-613-665, fax: +36-96-613-646, e-mail: lencse@sze.hu).
Á. Kovács is with the Department of Telecommunications, Széchenyi István University, Győr, Hungary (e-mail: kovacs.akos@sze.hu).
The remainder of this paper is organized as follows. First, the different multipath solutions are surveyed in a nutshell. Second, a brief introduction is given to the MPT network layer multipath communication library. Third, our test environment is described. Fourth, our experiments are described, and the results of our high number of measurements are presented and discussed. Fifth, the directions of our future research are outlined. Finally, our conclusion is given.

II. A SHORT SURVEY OF MULTIPATH SOLUTIONS

A. Multipath TCP - a Transmission Layer Solution
Multipath TCP [6] is probably the most well-known multipath solution. MPTCP uses multiple TCP sub-flows on top of potentially disjoint paths, see Fig. 1. Therefore it can be used for the aggregation of the transmission capacity of the underlying paths. Its channel aggregation can be very efficient: a single data stream was transmitted at a rate of 50Gbps over six 10Gbps Ethernet links using MPTCP [7]. MPTCP is actively researched and analyzed from different viewpoints; see e.g. [8] and its references, or count the Google Scholar hits for "Multipath TCP".
However, multipath TCP has its limitations and drawbacks, too. TCP provides reliable byte stream transmission, which is appropriate for several applications such as web browsing, sending or downloading e-mails, etc. However, its retransmission mechanism is undesirable for other applications such as IP telephony, video conferencing or other real-time communications, where some packet loss (at a low ratio) can be better tolerated than the high delays caused by TCP retransmissions. Consequently, multipath TCP is not suitable for these types of applications.

B. MPT Library - the Only Network Layer Solution
The MPT network layer multipath communication library [1] uses UDP/IP protocols on top of each link layer connection and creates an IP tunnel over them. Thus both TCP and UDP can be used over the IP tunnel, see Fig. 2. Therefore retransmissions can be omitted if they are not required. This design makes MPT more general than MPTCP and thus opens up more areas of application for MPT.
The MPT library may be used for many different purposes including file and stream transmission [4], cognitive infocommunication [9], wireless network layer roaming problems [10] and changing the communication interfaces (using different transmission technologies) without packet loss [11] (it is also called vertical handover between 3G and WiFi).For further publications about MPT, see [12] and [13].
As far as we know, MPT is the only network layer multipath communication solution.

C. OLiMPS - a Link Layer Solution
The OpenFlow Link-Layer Multipath Switching [14] is a novel solution that uses the logic of the link layer, that is, it calculates routes as if the nodes were connected with LANs; however, it can also operate over WANs [15].

D. Other Similar Solutions
There are some other solutions, which deal with multiple interfaces, however they are not always real multipath solutions.
The Multiple Interfaces Working Group of IETF has already produced many useful documents [16]. They focus on the problem that a host has multiple interfaces which are connected to different provisioning domains [17] and the interfaces can be simultaneously used for communication. It is not necessarily a multipath solution: for example, one application may use the first interface, and another one may use the second one.
Proxy Mobile IPv6 [18] allows a mobile node to connect to the same PMIPv6 domain through different interfaces.The NETEXT Working Group of IETF proposed a draft RFC [19] which specifies protocol extensions to PMIPv6 to distribute specific traffic flows on different physical interfaces.

III. MPT IN A NUTSHELL
A. The Architecture of MPT
Fig. 2 shows the layered architecture of the MPT network layer multipath communication library. The most important difference from MPTCP is that MPT creates a new logical interface on the endpoint host, through which the applications can communicate; therefore the applications can use any transport layer protocol, either TCP or UDP, whichever is appropriate for them. The MPT software processes the packets from the tunnel interface. MPT makes a packet-by-packet decision about which path to choose, then encapsulates the packet into a new UDP/IP packet and finally sends it out through the appropriate link-layer interface [1].
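The packet-by-packet path decision can be pictured with a toy scheduler. The policy that MPT actually uses is not described in this section, so round-robin is assumed here purely for illustration, and the class name, the interface names and the dictionary standing in for a UDP/IP packet are all hypothetical.

```python
from itertools import cycle


class RoundRobinTunnel:
    """Toy model: pick a path per packet, then wrap the packet for that path."""

    def __init__(self, paths):
        # Assumed policy: rotate over the configured physical paths.
        self._next_path = cycle(paths)

    def send(self, tunnel_packet: bytes):
        """Choose the next path and return the encapsulated packet."""
        path = next(self._next_path)
        # The original tunnel packet becomes the payload of a new outer packet.
        return {"path": path, "udp_payload": tunnel_packet}
```

With three configured paths, six consecutive calls to `send` would use each path twice, which is the simplest way to spread traffic evenly over channels of equal speed.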

B. The Configuration and Usage of the MPT Library
The MPT library distribution contains an easy to follow user guide [20]. To be able to use MPT between two computers, the software must be installed on both of them. One of them should be configured as server and the other one as client, but the applications see the connection as completely symmetrical. The MPT library has simple and straightforward configuration files where the different parameters (e.g. the number of physical connections, the Linux network interface names and IP addresses for each channel, the name of the tunnel interface, etc.) can be set. When both sides are configured and the MPT software is started on both computers, the applications can use the tunnel interfaces for communication in the usual way. The MPT library distributes the user's traffic over all the configured physical channels, so the user can take advantage of the multiple network interfaces.
Fig. 2. The layered architecture of the MPT software [3]

IV. TEST ENVIRONMENT

A. Hardware and Basic Configuration
Two DELL Precision Workstation 490 computers were used for our tests. Their basic configuration was:
• DELL 0GU083 motherboard with Intel 5000X chipset
• Two Intel Xeon 5140 2.33GHz dual core processors
• 8x2GB 533MHz DDR2 SDRAM (accessed quad channel)
• Broadcom NetXtreme BCM5752 Gigabit Ethernet controller (PCI Express, integrated)
Three Intel PT Quad 1000 type four port Gigabit Ethernet controllers were added to each computer. The 3x4=12 Gigabit Ethernet ports were used for the measurements and the integrated one was used for control purposes. The computers were interconnected by a Cisco Catalyst 2960 switch limiting the transmission speed to 100Mbps and separating the 12 physical connections by VLANs.
In our experiments, both IPv4 and IPv6 were used as the underlying and as the tunnel IP version (that is, 2x2 series of experiments). Fig. 3 shows the topology and the IP address configuration of the test network used in the IPv4 tunnel over
The version of the MPT library can be identified by the name of the file, which contains the date in the YYYY-MM-DD format: mpt-lib-2014-03-25.tar.gz was used first. This version of the MPT library contained precompiled 32-bit executables with statically linked libraries, thus we did not need to compile it. The contents of the following two configuration files were set as follows. (Their path is relative to the installation directory of MPT.) The beginning of the conf/interface.conf file was: # The number of the interfaces 65020 # The local cmd port number
And it was similar for all the other interfaces, which we do not list to save space. The different types of tunnels were specified in separate connection files. The IPv4 tunnel over IPv4 paths was defined in the conf/connections/IPv4overIPv4.conf file:
It was also set in the same manner for all the other paths of this connection and for the other connections as well. Note that the configuration files followed a strict format; even the comment-only lines had to be present. In [5] we recommended replacing this with the commonly used free-style configuration files with keyword parsing. The authors of MPT responded quickly and keyword parsing is provided in the most current version of MPT [2].

V. EXPERIMENTS AND RESULTS
The channel aggregation capability of the MPT library was measured with two different methods: using the industrial de facto standard iperf, and file transfer by the wget Linux program over the HTTP1 protocol. These two methods were selected because iperf uses UDP and wget uses TCP as the transport layer protocol. As mentioned before, both IPv4 and IPv6 were used as the IP protocol for the tunnel and also as the IP protocol for the underlying channels. In addition, both the 32-bit and 64-bit versions of the MPT library were tested. This means altogether 2x2x2x2=16 series of measurements, where the number of physical channels was increased from 1 to 12. Thus we performed 16x12=192 different tests. The tests were automated by scripts. Due to space limitations, we cannot include the complete measurement scripts, only the key commands. The ones below belong to the IPv4 tunnel over IPv4 measurements. The iperf command was:
The wget command downloaded the file but did not write it to the hard disk; instead it discarded it to /dev/null so that the disk writing speed would not influence our measurement results. The file named 1GB was also placed on a RAM drive at the server computer to eliminate reading from the hard disk.
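The size of the test matrix described above can be checked by enumerating it. The sketch below is only a bookkeeping aid: the label strings and variable names are hypothetical, and it does not reproduce the measurement scripts or the commands themselves.

```python
from itertools import product

TUNNEL_IP = ("IPv4", "IPv6")        # encapsulated protocol
UNDERLYING_IP = ("IPv4", "IPv6")    # protocol of the physical channels
TOOL = ("iperf/UDP", "wget/TCP")    # measurement method and transport
MPT_BUILD = ("32-bit", "64-bit")    # version of the MPT library
CHANNELS = range(1, 13)             # 1 to 12 physical channels

# Every combination of the four two-valued factors is one series.
series = list(product(TUNNEL_IP, UNDERLYING_IP, TOOL, MPT_BUILD))

# Each series is repeated for every channel count.
tests = [(s, n) for s in series for n in CHANNELS]
```

Here `len(series)` is 16 and `len(tests)` is 192, matching the counts given in the text.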
The results of our measurements using the 32-bit MPT library are discussed first in detail and the 64-bit results are presented later. Within the 32-bit results, we begin with the iperf measurements, which are presented first and then discussed.

A. Results of the Iperf Measurements
The results of the iperf test are shown in Fig. 4. Whereas two of them (IPv4 over IPv4 and IPv6 over IPv4) are nearly linear in the whole range, the two other ones (IPv4 over IPv6 and IPv6 over IPv6) are nearly linear until 7 NICs and then they show saturation or even a small degradation until the end of the range. Our results suggest that only the version of the underlying IP protocol makes a significant difference in the channel capacity aggregation performance of the MPT library and the version of the encapsulated IP has only a minor influence on it.
When the underlying protocol was IPv4, the throughput was linear up to 12 NICs, which means that the throughput aggregation capability of the MPT library proved to be very good, and we could not reach the limits of the MPT library. (These
When the underlying protocol was IPv6, the performance limit of the system was reached at 7 NICs. The maximum values were 74MB/s and 72MB/s in the case of the IPv4 over IPv6 and IPv6 over IPv6 tests, respectively. (A further increase of the number of NICs resulted in some degradation of the throughput; the respective values were 70MB/s and 67MB/s at 12 NICs.) Note that this is the performance of our system composed of the above described hardware and software. We asked ourselves whether it was a built-in limit of the MPT library or the performance limit of the hardware that we used for testing. The version of the upper IP protocol made no significant difference, therefore we include only the two significantly different cases.
The CPU utilization of the MPT client during the IPv4 over IPv4 measurements is shown in Fig. 5. Even though the time scale is not presented (because no timestamps were logged with the CPU utilization values), the 12 measurements can be easily identified: they are separated by gaps with 0% CPU usage between them. The CPU utilization shows some fluctuations, but its nearly linear growth can be well observed. It reached the 160-180% interval at 12 NICs. It was checked that the CPU utilization of the iperf program was always under 50%, thus there was free CPU capacity available from the 400% of the four CPU cores. The CPU utilization of the MPT client during the IPv6 over IPv6 measurements is shown in Fig. 6. It reached 160% at 7 NICs and it fluctuated around 160% for higher numbers of NICs. There is a visible correspondence between the CPU utilization and the throughput, see Fig. 4.
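The saturation reported for the IPv6 underlying protocol can be made more tangible by dividing the quoted throughput values by the number of NICs. The sketch below does only this arithmetic; the throughput figures are the ones given in the text, while the dictionary layout and the function name per_nic_throughput are our own illustrative choices.

```python
# Throughput values (MB/s) quoted in the text: {test: {number of NICs: MB/s}}
REPORTED = {
    "IPv4 over IPv6": {7: 74.0, 12: 70.0},
    "IPv6 over IPv6": {7: 72.0, 12: 67.0},
}


def per_nic_throughput(throughput_mb_s, nics):
    """Average throughput contributed by each NIC, in MB/s."""
    return throughput_mb_s / nics


for test, values in REPORTED.items():
    for nics, throughput in sorted(values.items()):
        share = per_nic_throughput(throughput, nics)
        print(f"{test}: {nics} NICs -> {share:.2f} MB/s per NIC")
```

At 7 NICs each channel still carries more than 10MB/s on average, whereas at 12 NICs the average per-channel share falls below 6MB/s, which is consistent with a CPU-bound rather than a link-bound limit.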
2) Measurements with faster CPUs: The Intel Xeon 5140 2.33GHz dual core processors of the test computers were replaced by Intel Xeon 5160 3GHz dual core processors. The IPv6 tunnel over IPv6 paths experiments were repeated with the faster CPUs. Fig. 7 shows the throughput results. It can be observed that the faster CPUs made it possible to fully utilize the capacity of 8 NICs and the degradation started from 9 NICs. This result convinced us that the aggregation capability of MPT does not have a built-in limit; rather, it depends on the performance of the CPUs. However, a question now arises: why could MPT not increase its CPU utilization above 180% while there was still free CPU capacity? The answer is that MPT was written as a serial program and thus it is not able to fully utilize the available processing power of the multiple CPU cores. (The higher than 100% utilization is probably achieved by the overlapping of sending and receiving packets.) We believe that it would be worth improving MPT in this respect, because the current trend in the evolution of CPUs is that the number of cores is increased instead of the clock speed.
After the completion of these measurements, the original Intel Xeon 5140 2.33GHz dual core processors were put back into the test computers and they were used in all the following experiments.

C. Investigation of the IPv4 Performance Limit
As can be seen in Fig. 4, the throughput scaled up nearly linearly up to 12 NICs when the underlying protocol was IPv4. We were interested in the performance limit of the system, but we could not insert more NICs into our Dell computers as they had only 3 PCI Express slots. Therefore, we increased
The results are shown in Fig. 8. In both tests, the throughput reached its maximum value (of 158MB/s and 151MB/s when the tunnel protocol was IPv4 and IPv6, respectively) at 2 NICs and it degraded for higher numbers of NICs (down to 118MB/s and 120MB/s at 8 NICs), but it still remained higher than the throughput of a single NIC. This corresponds to the values of the CPU utilization in Fig. 9. (The graph actually shows the CPU utilization of the IPv4 over IPv4 case, but the CPU utilization of the IPv6 over IPv4 case looked the same, thus we did not include it.)

D. Results of the Wget Measurements
The results are shown in Fig. 10. Unlike with iperf, performance limits can be observed in each graph, and there are also differences between the first two graphs. The HTTP performance of the IPv4 tunnel over IPv4 shows some saturation at 11 and 12 NICs, but the performance is still growing. The HTTP performance of the IPv6 tunnel over IPv4 shows not only saturation but a definite degradation at the end of the graph (from 100MB/s at 10 NICs to 90MB/s at 12 NICs). The HTTP throughput of the IPv4 tunnel over IPv6 reaches its maximum value of 70MB/s at 7 NICs, and it degrades for higher numbers of NICs (its value is 60MB/s
Our HTTP throughput results confirm that the version of the underlying IP protocol makes the major difference in the channel capacity aggregation performance of the MPT library, but they indicate that the version of the encapsulated IP may also have a minor influence on it. However, the results of the wget measurements differ from the results of the iperf measurements because now we could reach the performance limits of our test system even when the underlying protocol was IPv4. This is very likely caused by the higher CPU usage of the TCP protocol stack compared with that of the much simpler UDP. When the underlying protocol was IPv6, we reached the HTTP performance limit of the system at 7 NICs. A further increase of the number of NICs resulted in some degradation of the throughput.

E. Results with the 64-bit MPT Library
The authors of the MPT library published the precompiled 64-bit version after the completion of our measurements for [5]. There we mentioned our intention of testing the 64-bit version to see if there is a difference in the performance of the 32-bit and the 64-bit version of the MPT library. We expected that the 64-bit version might handle the 128-bit long IPv6 addresses more effectively. The 64-bit results are presented in the same order as the 32-bit ones: first the iperf results and then the wget results.
1) Results of the iperf measurements: The results of the 64-bit iperf test are shown in Fig. 11. When IPv4 was used as the underlying protocol, the throughput scaled up nearly linearly up to 12 NICs, as we expected. When IPv6 was used as the underlying protocol, the throughput reached its maximum value at 8 NICs. In the IPv4 over IPv6 case, the maximum value of the throughput was 81MB/s at 8 NICs, which is only 7MB/s higher than in the 32-bit case, where the maximum throughput of 74MB/s (see Fig. 4) was already reached at 7 NICs.
The 64-bit library did not result in the convincing performance improvement that we had expected.
2) Results of the wget measurements: The results of the 64-bit wget test are shown in Fig. 12. The graphs are rather similar to the graphs of the 32-bit case (see Fig. 10), though the throughput results are somewhat better here. The HTTP performance of the IPv4 over IPv4 tunnel is linear up to 11 NICs (instead of 10). The HTTP performance of the IPv6 tunnel over IPv4 shows no performance degradation for 11 and 12 NICs, which is an advantage of the 64-bit version over the 32-bit version of the MPT library. The HTTP performance of the IPv4 tunnel over IPv6 reaches its maximum value at 7 NICs. The throughput curve of the 32-bit test has its maximum at the same place (Fig. 10), but here the maximum value is a little higher: 74.4MB/s instead of 70MB/s. The linear degradation here is also a bit milder than in the 32-bit case. The HTTP performance of the IPv6 tunnel over IPv6 is also somewhat better, but rather similar to that of the 32-bit case.
Though the 64-bit version of the MPT library did not fulfill our performance expectations, the 64-bit results are never worse than those of the 32-bit version, and in many cases the 64-bit version brings a slight performance increase.

VI. DIRECTIONS OF OUR FUTURE RESEARCH
So far, we have tested the performance and throughput aggregation capability of the MPT library in itself. We also plan to compare them with those of the standard MPTCP.
As the most important advantage of MPT over MPTCP is that MPT uses UDP/IP and is therefore much more suitable for use with real-time applications because of the elimination of TCP retransmissions, we also plan to test it with real-time applications.
We also intend to test MPT as a tunneling tool. MPT seems to be a universal tunneling software in the context of IPv6 transition since it can be used as either an IPv4 or an IPv6 tunnel over either IPv4 or IPv6 connections.

VII. CONCLUSION
The throughput aggregation performance of the MPT network layer multipath communication library was examined up to twelve 100Mbps link layer connections. Measurements were taken with both iperf (over UDP) and wget (over TCP) using both 32-bit and 64-bit MPT libraries.
As for the 32-bit MPT library and iperf measurements, when the underlying protocol was IPv4, the throughput scaled up linearly up to 12 NICs (exceeding 120MB/s) regardless of the version of the encapsulated IP (IPv4 or IPv6). When the underlying protocol was IPv6, the throughput scaled up linearly up to 7 NICs (exceeding 70MB/s) regardless of the version of the encapsulated IP, but it could not increase further for higher numbers of NICs; rather, it showed a small degradation.
It was shown that the above performance limit depends on the computing power of the CPUs and is not a fixed built-in feature of the MPT library.
MPT was also tested with 12 Gigabit Ethernet connections to find the performance limit of our system when the underlying protocol was IPv4. It was reached at two NICs, with values of 158MB/s and 151MB/s when the tunnel protocol was IPv4 and IPv6, respectively.
As for the 32-bit MPT library and wget measurements, the results were similar to those of the iperf measurements, with the exception that we could reach the performance limit of the system even when the underlying protocol was IPv4, due to the higher CPU usage of the TCP protocol stack compared with that of the much simpler UDP.
As for the measurements with the 64-bit MPT library (using both iperf and wget), the results were close to the results of the measurements with the 32-bit MPT library, usually producing only a small performance benefit depending on the given test, but the 64-bit results were never worse than the 32-bit ones.
We conclude that the MPT network layer multipath communication library proved to be a good tool for the aggregation of the capacity of several high speed channels.

I. Introduction
Machine-Type Communication (MTC) represents a way to enable connectivity between several (from tens to hundreds) nodes (sensors or actuators) with no or minimal human interaction, e.g. the Internet of Things (IoT) or smart power grids [1]. Following the information given in [2], [3], the amount of mobile data traffic is predicted to increase around sixfold in the period 2014-2019. The data traffic is divided into two main categories: Human-to-Human (H2H) and Machine-to-Machine (M2M) communication. In comparison with the traditional conception of data traffic represented by H2H (services such as voice, web streaming etc.), M2M comes with different requirements on a communication system [4], where the M2M applications should have minimal impact on existing H2H services [5]. The key differences between both communication types are shown in Table I.
P. Masek, K. Zeman, D. Uhlir, and J. Hosek are with the Department of Telecommunications, Brno University of Technology, Brno, Czech Republic (e-mails: xmasek12@phd.feec.vutbr.cz, xzeman43@stud.feec.vutbr.cz, xuhlir15@stud.feec.vutbr.cz, hosek@feec.vutbr.cz).
The key idea of the M2M communication network is to connect a server with millions of devices deployed worldwide (interacting with other sensors, different environments and people). With the rapid development of cellular networks, M2M communication via the Long Term Evolution (LTE) network is expected to play a significant role in M2M scenarios. Today, cellular networks represent a common means of data access to the public network (Internet); as a consequence, they are under pressure trying to handle unprecedented data flows from mobile devices. The dramatic increase of data transmitted via cellular networks is a burning question for telecommunication operators with limited radio spectrum resources [6]. The complex scenario of the M2M architecture is shown in Fig. 1. The depicted architecture considers two different ways of managing the connection of M2M devices to the core part of the LTE network [8]:

• Cellular connectivity: connection through access network to core networks where each single device has its own Subscriber Identity Module (SIM) card for cellular connectivity.
• M2M networks: M2M devices may create M2M area networks using short-range technologies represented by the standards IEEE 802.15.6, IEEE 802.15.4(e), or IEEE 802.11. These M2M area networks can then be connected to the core networks via M2M gateways [9], [10], [11].
To deal with the overloading of the Radio Access Network (RAN) of the LTE network, offloading techniques can be used; offloading mechanisms refer to using an alternative network infrastructure for transmitting data originally targeted for the cellular network when this network becomes overloaded [7]¹. Depending on the delay (content delivery time), offloading techniques can be divided into two categories: non-delayed offloading and delayed offloading [7].
In this paper we address a specific type of delayed offloading where the Machine Type Communication Gateway (MTCG) acts as a hybrid node which interconnects two different networks (in this paper, WiFi and LTE networks are considered as the heterogeneous networks). Attention is also paid to the implementation of Quality of Service (QoS) for H2H and M2M communication, where the high-priority traffic is represented by the H2H communication (e.g. Voice over IP (VoIP)); QoS is implemented on the MTCG node. Furthermore, the QoS requirements of M2M services depend on the MTC service features: group-based communication, mobility, time-controlled / time-tolerant operation, amount of transmitted data, and power consumption [12].
¹ Current analyst forecasts claim that by 2019, 54 percent of total mobile data traffic will be offloaded over WiFi networks.

TABLE I Differences between H2H and M2M communication
Attribute | Machine-to-Machine (M2M) | Human-to-Human (H2H)
Traffic Direction | Uplink data; data received from sensors. For a specific type of application, symmetric uplink and downlink are needed to fulfill the requirements for dynamic interaction. | Downlink data; although the amount of uploaded data has been growing fast during the last few years, in the case of H2H, download still represents the main part of data traffic.
Message Size | The size of data from sensors is in general very small (e.g. the data size of a Wireless M-BUS data unit is usually max. 50 B). | With multimedia and real-time applications, the size of data units is several times higher in comparison with M2M.
Access Delay | For the dynamic interaction between sensors and actuators, delays should be very short. | In the case of H2H communication, longer access delays are usually tolerated.
Transmission Periodicity | The transmission period can range from units of seconds (e.g. alarm systems) up to tens of minutes (e.g. energy meters). | The nature of human-based traffic is mostly random and bursty. Therefore, frequent sending of control information is required (to ensure QoS).
Mobility | For the main group of sensors, mobility does not represent a big issue (sensors are mostly located at a fixed position). | For humans, mobility management represents a key requirement for ensuring seamless connectivity and roaming.
Data Importance | Some M2M sensors can transmit critical data (e.g. the status of an alarm system). Consequently, M2M data requires high priority. | There are no big differences between users. Differences can be found between the applications of individual users (with respect to QoS and QoE).
Amount of devices | Hundreds or thousands of devices connected via one access point to the network. | Typically tens of devices connected via an access point to the network.
Lifetime; Energy Efficiency | Using specific energy profiles, devices are able to operate for years or decades without human maintenance. | In the case of devices used by humans, it is common to recharge batteries on a daily basis (smartphones, laptops). [8]
We performed extensive simulations to evaluate the role of the MTCG node in the LTE architecture with M2M communication. For modeling the WiFi and LTE networks, the data traffic and the logic of the MTCG node, the simulation tool Network Simulator 3 (NS-3) [13] with the framework LTE / EPC Network Simulator (LENA) [14] was used.
The rest of the paper is organized as follows. Section II presents the description of MTCD-related communications in the LTE network. Section III deals with the selected simulation environment NS-3 together with the LENA framework. In Section IV the description of the created simulation scenario is given. Section V presents the obtained results and finally, in Section VI we draw a conclusion with our future plans in this research area.

II. LTE Network and M2M Communication
The current RAN for the LTE network consists of eNodeBs (eNBs) that provide the user plane and control plane protocol stacks for the User Equipment (UE). LTE represents a fully distributed radio access network architecture, where an eNB can be interconnected with other eNBs by the X2 interface. The eNBs are then connected to the core part of the LTE network through the S1 interface, see Fig. 1. Each eNB includes the layers below, which implement the functionality of the user plane, header compression and encryption [12]:
Since the current 3G cellular networks are designed only for H2H communications, the introduction of M2M communications introduces new requirements on LTE networks; the network architecture needs to be improved to support M2M services without sacrificing the current H2H applications.
In this section, we describe the types of M2M communication, in particular the connection of the MTCD and MTCG nodes to the LTE network.

A. Machine Type Communication
To enable M2M communication in cellular networks (3G / 4G), two new types of nodes were introduced: Machine Type Communication Devices (MTCD) and the Machine Type Communication Gateway (MTCG). The MTCD represents a UE which is supposed to work as a sensor that communicates through the cellular network with a remote MTC node (e.g. a database server) and/or with other MTCDs in range. As was shown in [15], a high number of MTCDs connected at the same time to one eNB may overload this network entity. Therefore, the cellular network requires an MTCG node to facilitate communications among a great number of MTCDs. The MTCG enables an intelligent way to manage the power consumption of MTCDs and provides an efficient path for communication between MTCDs without the need for a connection to the LTE network. Three different M2M communication methods were introduced during the last few years, see Fig. 2 [12]. These methods are described below.

1) Direct Transmission Between MTCD and eNB:
The first method is similar to the classic UE, where the MTCD is able to establish a direct connection to the eNB; therefore, similarities between the eNB-to-UE and eNB-to-MTCD connections exist. On the other hand, MTCDs are deployed as a large number of sensors / UEs, so intensive competition for radio resources may occur in certain time periods. Therefore, additional effort is required from the telecommunication operators to solve the problems that arise when a large number of MTCDs communicate with the eNB directly [12].
2) Multihop Transmission using MTCG: To mitigate or eliminate the negative effect of M2M communication on H2H communication in the cellular network, the MTCG node can be deployed as a hybrid node, where all MTCDs are connected to the eNB indirectly using the MTCG node as a gateway. The eNB-to-MTCG connection is based on the 3GPP (Third Generation Partnership Project) LTE specifications. The MTCG-to-MTCD and MTCD-to-MTCD communications can be established via the 3GPP LTE specifications or via non-3GPP communication technologies such as IEEE 802.11 and IEEE 802.15.x [12], [16].
3) P2P Transmission Between MTCDs: An MTCD may communicate in the local area with other MTCDs and with the eNB. Compared to non-3GPP local connectivity solutions (IEEE 802.11, IEEE 802.15.x), direct communication between MTCDs is carried out over the cellular network, which can broadcast data within a much wider coverage area. For service discovery, the MTCDs do not have to scan continuously for available access points (APs) as in the case of the IEEE 802.11 standard [12], [16], [17].

III. LENA Framework in NS-3
During the last few years, several network simulation platforms have been developed as tools available for networking research: OPNET Modeler [19], OMNET++ [18], NS-2 [13], NS-3 [13]. Since this paper deals with M2M communication in the LTE network, the simulation environment NS-3 together with the LENA framework [21] was used. In our work, we used NS-3 in version 3.21 together with the LENA framework in version 8. Using LENA inside NS-3 provides us with a way to design and evaluate the performance of Heterogeneous Networks (HetNets). Fig. 3 shows the implementation of the end-to-end LTE-EPC data plane protocol stack of the LENA framework. The biggest change in comparison with the standard implementation of the LTE data plane protocol stack is the merging of the Serving Gateway (SGW) and PDN Gateway (PGW) functionality within one single SGW/PGW node in NS-3. As a result of this change, there is no need for the S5 and S8 interfaces which are specified by 3GPP. The S1-U protocol stack and the LTE radio protocol stack specified by 3GPP are also shown in Fig. 3.

IV. Model of M2M Communication in LTE Network
As described in Section II, the created scenario includes the MTCD nodes together with the MTCG node, which enables the interconnection between the local network and the public network (represented by the remote host which is accessible through the LTE network). The local side of the implemented scenario is represented by sensors / UEs using IEEE 802.11g, IEEE 802.11ah [23], [24] and Wireless M-BUS [22], which represent the most preferred technologies for M2M. The data is sent through the hybrid node (MTCG) to a remote host which is accessible via the LTE network. The overall structure of the created scenario is depicted in Fig. 4.

A. Parameters of Simulation Scenario
The key parameters of the created simulation model are shown in Table II (a list of the parameters of the LTE network created using the LENA framework).
UEs were created as wireless nodes using IEEE 802.11g and IEEE 802.11ah for the connection to the MTCG node. The sensors implemented the Wireless M-BUS communication protocol (868 MHz), where the sensors were set to Mode T1 (one-directional communication) and the MTCG was set to Mode T2 (bi-directional communication).

B. IP Address Scheme
The address scheme for the two groups of nodes is depicted in Fig. 5. For the WiFi nodes (IEEE 802.11g/ah), the address space 10.3.0.0/24 was used. In the case of Wireless M-BUS nodes, a unique addressing scheme is implemented following [25]. The address of each WM-BUS node is represented by the serial number of the sensor. The transmission of data from sensors is performed as broadcast communication, where only the MTCG node (in T2 mode) can receive the information from the sensors. Data from the MTCG node goes through the core part of the LTE network (7.0.0.0/24) to the destination node (remote host; 1.0.0.0/24).
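The IP part of this address plan can be written down with Python's standard ipaddress module. The sketch below is only an illustration: the three prefixes come from the text, while the segment labels, the function name segment_of and the sample host address are assumptions made for the example.

```python
import ipaddress

# Address blocks named in the text; the segment labels are our own shorthand.
NETWORKS = {
    "wifi": ipaddress.ip_network("10.3.0.0/24"),
    "lte_core": ipaddress.ip_network("7.0.0.0/24"),
    "remote_host": ipaddress.ip_network("1.0.0.0/24"),
}


def segment_of(address):
    """Return the label of the segment containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in NETWORKS.items():
        if ip in network:
            return name
    return None


# 10.3.0.17 is a hypothetical WiFi node address used only for illustration.
print(segment_of("10.3.0.17"))
# A /24 prefix leaves 254 usable host addresses per segment.
print(NETWORKS["wifi"].num_addresses - 2)
```

The Wireless M-BUS sensors are deliberately absent from this sketch, since they are addressed by serial number rather than by IP address.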

C. Parameters of Data Traffic
Data traffic is generated independently by a group of UEs (H2H) and sensors (M2M). Data traffic from the UEs represents the voice service defined as follows [26]: UDP transport protocol; packet size 160 B; Maximum Transmission Unit (MTU) 1500 B. Traffic from the sensors was generated with these attributes: WM-BUS communication protocol (which does not follow the TCP/IP reference model), packet size 50 B, transmission interval 30 seconds. Both groups of devices (UEs and sensors) generate traffic during the whole simulation; the simulation time was set to 10 minutes.
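The M2M load implied by these parameters is very small, which the following sketch makes explicit. The packet size, interval and simulation time are taken from the text; the function names and the assumption that each sensor sends its first packet one full interval after the start of the simulation are ours.

```python
SIMULATION_TIME_S = 10 * 60   # simulation time from the text
SENSOR_INTERVAL_S = 30        # transmission interval from the text
SENSOR_PACKET_B = 50          # WM-BUS packet size from the text


def sensor_packets(sim_time_s=SIMULATION_TIME_S, interval_s=SENSOR_INTERVAL_S):
    """Packets sent by one sensor, assuming the first one is sent at t = interval."""
    return sim_time_s // interval_s


def sensor_bytes(packet_b=SENSOR_PACKET_B):
    """Total payload sent by one sensor during the simulation, in bytes."""
    return sensor_packets() * packet_b


def sensor_rate_bps():
    """Average bit rate of one sensor over the whole simulation."""
    return sensor_bytes() * 8 / SIMULATION_TIME_S


print(sensor_packets(), "packets,", sensor_bytes(), "B,",
      round(sensor_rate_bps(), 2), "bit/s per sensor")
```

Under these assumptions each sensor contributes 20 packets and 1000 B over the 10 minute run, so the M2M load on the MTCG is driven mainly by the number of sensors rather than by the traffic of any single one.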

V. In-depth Results Discussion
From the implementation point of view, the two active interfaces on one node in NS-3 represent a challenging task.This task is going to be more complex when one System for Mobile Communications (GSM), Universal Mobile Telecommunication System (UMTS) and LTE core networks.
The correct handling of the data traffic is depicted for UE(0) in Fig. 6 and Fig. 7.

B. Enabling QoS for H2H Traffic in Created Model
The support of QoS for VoIP data traffic (originating from the WiFi nodes (UEs)) was implemented on the MTCG node. The situation with and without the implemented QoS features is depicted in Fig. 8. The original delay values were 4123 ms for H2H traffic (VoIP) and 40 ms for M2M traffic. It is clearly visible that without QoS, the VoIP services cannot be used in a way that fulfills users' expectations. Therefore, QoS was implemented on the MTCG node and the VoIP delay decreased to 780 ms (an improvement of 81.08 % in comparison with the original delay for VoIP).
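The quoted improvement follows directly from the two delay values. A minimal sketch of the calculation is given below; the function name relative_improvement is an illustrative choice.

```python
def relative_improvement(before_ms, after_ms):
    """Relative reduction of the delay, in percent of the original value."""
    return (before_ms - after_ms) / before_ms * 100.0


# VoIP delay without and with QoS on the MTCG node, as reported in the text.
print(round(relative_improvement(4123, 780), 2))  # prints 81.08
```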

VI. Conclusion
M2M communications represent an emerging technology which illustrates the principles of the IoT. Therefore, they have gained increasing attention in LTE / LTE-A cellular network design. In this paper, we give an overview of the required network architectural improvements with a description of the various transmission schemes / types for MTCDs. From the described transmission schemes, we chose the multihop transmission, see Fig. 2. This type can be represented by the MTCG node, which acts as a hybrid node between several heterogeneous networks. In this paper we implemented three types of networks: WiFi, Wireless M-BUS and LTE. The MTCG node was deployed between these networks in the role of a bridge, where the incoming data traffic is routed towards the destination node (e.g. a remote server located in the Internet).
The implementation was done using the simulation environment NS-3 with the LENA framework, see Section III. The simulation results, see Section V, confirm the correct handling of data traffic with respect to meeting the QoS and QoE requirements for the H2H traffic when the M2M services are deployed in parallel with the H2H services. Although we have achieved a delay improvement of about 81 % (from 4123 ms to 780 ms) for VoIP services, it is evident that further investigation of the aggregation scheme on the MTCG node is still needed.

Multi Service Proxy: Mobile Web Traffic Entitlement Point in 4G Core Network
Dalibor Uhlir, Dominik Kovac, Jiri Hosek
Abstract-The core part of state-of-the-art mobile networks is composed of several standard elements like GGSN (Gateway General Packet Radio Service Support Node), SGSN (Serving GPRS Support Node), F5 or MSP (Multi Service Proxy). Each node handles network traffic from a slightly different perspective, and with various goals. In this article we focus only on the MSP, its key features and especially on related security issues. MSP handles all HTTP traffic in the mobile network and therefore it is a suitable point for the implementation of different optimization functions, e.g. to reduce the volume of data generated by YouTube or similar HTTP-based services. This article introduces the basic features and functions of MSP as well as the ways of remote access and the security mechanisms of this key element in state-of-the-art mobile networks.

I. INTRODUCTION
Mobile networks went through a huge development during the last 30 years: from NMT (Nordic Mobile Telephony, 1G) based on analog technology, over GSM (Groupe Special Mobile, 2G) with both data and voice signals sent over a circuit switched network, and 3G, where voice is sent via circuit switching and classic data is sent as an IP (Internet Protocol) based flow, to the LTE (Long Term Evolution, 4G) network, where both data and voice (VoLTE, Voice over LTE) use an IP channel. The current mobile network is a complex communication system composed of a large number of nodes. However, the emerging LTE deployment includes three key parts: 1) RAN (Radio Access Network), 2) core network and 3) IMS (IP Multimedia Subsystem) as an optional but very common component (see Fig. 1).
The current situation in the utilization of cellular networks and the growing demands of their users introduce several challenges which mobile operators will have to face sooner or later. Especially the following facts need to be taken into consideration:
• Traffic from wireless and mobile devices will exceed traffic from wired devices by 2016 [1].
• HTTP traffic is taking the pole position in residential broadband Internet traffic [2], [3].
• Around 34% of HTTP traffic was found to be multimedia [4] and one of the dominant multimedia servers is YouTube.
This paper addresses the MSP server located in the LTE core network and mainly describes secure access for its management via TLS (Transport Layer Security) and also the protection of the user's secure entitlement traffic. The article discusses the system of certificates and their renewal used by MSP, offers a way to increase security using the current solution, and shows alternative ways of certification. The article also critically analyzes the key weaknesses of core network components in today's environment and explains potential security holes.
D. Uhlir, D. Kovac and J. Hosek are with the Department of Telecommunications, Brno University of Technology, Czech Republic (e-mail: xuhlir15@stud.feec.vutbr.cz, xkovac23@phd.feec.vutbr.cz, hosek@feec.vutbr.cz).

II. MULTI SERVICE PROXY
MSP (Multi Service Proxy) is the element in the mobile core network which contains several types of nodes. The key part is the database system composed of several servers which handle network traffic. The other parts of MSP include the traffic servers (TS), administration nodes and the jump start server. Fig. 1 shows the logical position of MSP in the mobile network and its interconnections to other core nodes.
MSP network elements are deployed as several chassis (each chassis has several blades) within one rack. Each chassis is de facto a UNIX machine with a web server, a database server and the NetBackup solution installed. MSP is located between the SDG and the Internet, so when a user sends an HTTP message from a smartphone, laptop or other mobile device, the SDG forwards it automatically to MSP. Then, the MSP processes this traffic and sends the message back to the SDG. After that, the SDG sends the traffic to the Internet. HTTP traffic is recognized by source or destination port 80. Besides HTTP, MSP also processes secure entitlement traffic, which uses HTTPS. However, it represents a smaller percentage share of all web traffic. The detailed topology of the LTE core network is depicted in Fig. 2. The entitlement client is configured with an entitlement URL and communicates with the MSP secure entitlement server. The entitlement server (a daemon located on the traffic servers) contains workflow scripts to handle incoming HTTP requests from mobile devices requesting the entitlement status of a service. There are two kinds of entitlements: a simple one using HTTP and a secure one which is carried over HTTPS. The location of the entitlement server and client is shown in Fig. 2 as green boxes.
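The port-based recognition of web traffic described above can be illustrated with a small sketch. It is not the SDG or MSP implementation; it only shows the rule of matching either the source or the destination port, and the function name classify_by_port and the returned labels are assumptions made for the example.

```python
HTTP_PORT = 80     # plain HTTP, as stated in the text
HTTPS_PORT = 443   # HTTPS, used by secure entitlement traffic


def classify_by_port(src_port, dst_port):
    """Classify a TCP flow as 'http', 'https' or 'other' by its ports."""
    ports = (src_port, dst_port)
    if HTTP_PORT in ports:
        return "http"
    if HTTPS_PORT in ports:
        return "https"
    return "other"


# The ephemeral port numbers below are arbitrary illustration values.
print(classify_by_port(51512, 80))
print(classify_by_port(443, 40000))
```

Checking both ports means that requests from the device and the corresponding responses from the server are classified in the same way.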

A. Secure Entitlement
In case of secure entitlement traffic, the load balancer (SDG) forwards TCP:443 HTTPS POST entitlement requests (the request is carried in the body of the HTTPS POST) to MSP. In secure entitlement, two requests are supported: getEntitlement and getPhoneNumber. getEntitlement is used by the device to query the entitlement server for the entitlement status of services that the subscriber should or should not be allowed to use. getPhoneNumber, on the other hand, is used by the device to request the MSISDN (Mobile Subscriber/Station Integrated Services Digital Network) and the corresponding signature from the entitlement server. In order to generate such a signature, the certificate and the entitlement server's private key files need to be configured in the secure entitlement parameters subgroup in MSA. The hash algorithm used for the signature is SHA1 (Secure Hash Algorithm). A more detailed description is provided in Section V.
When a getPhoneNumber action is received within the secure entitlement request, the entitlement server generates a signature that is sent back to the client as part of the response. The whole secure entitlement workflow is shown in Fig. 6 [5].
Secure entitlement traffic can be created by different applications. One example is FaceTime, a videotelephony and voice over IP application from Apple [6]. At the beginning, FaceTime worked only via WiFi networks; starting with iOS 6, support for mobile networks is implemented as well. To be able to operate FaceTime over mobile
Then the UE sends an HTTPS POST entitlement request to one of the secure entitlement servers (selected by the load balancing process on the SDG). The MSP secure entitlement server parses and validates the entitlement request. For valid requests, the secure entitlement server uses the source IP address of the UE to check the MSP LDAP DDC (Lightweight Directory Access Protocol Distributed Data Cache) for the profile of the subscriber. MSP retrieves the subscriber profile and, based on it, the secure entitlement server compiles the entitlement response. After that, MSP sends an HTTP 200 back to the UE for valid requests. The HTTP message body contains the complete entitlement response.
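The request handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the MSP implementation: the in-memory dictionary standing in for the LDAP DDC, the JSON body format, the field names and the 400/404 codes for rejected requests are all hypothetical. Only the two action names, the profile lookup by source IP address and the HTTP 200 answer for valid requests come from the text.

```python
import json

# Hypothetical stand-in for the LDAP DDC: subscriber profiles keyed by UE source IP.
SUBSCRIBER_CACHE = {
    "10.0.0.7": {
        "msisdn": "420700000001",
        "services": {"facetime": True, "tethering": False},
    },
}

SUPPORTED_ACTIONS = {"getEntitlement", "getPhoneNumber"}


def handle_secure_entitlement(source_ip, body):
    """Return (status_code, response_body) for one HTTPS POST entitlement request."""
    # Step 1: parse and validate the request carried in the POST body.
    try:
        request = json.loads(body)
    except ValueError:
        return 400, ""  # assumed rejection code for malformed requests
    if not isinstance(request, dict):
        return 400, ""
    action = request.get("action")
    if action not in SUPPORTED_ACTIONS:
        return 400, ""

    # Step 2: look up the subscriber profile by the source IP address of the UE.
    profile = SUBSCRIBER_CACHE.get(source_ip)
    if profile is None:
        return 404, ""  # assumed code for an unknown subscriber

    # Step 3: compile the entitlement response from the profile.
    if action == "getEntitlement":
        response = {"entitlements": profile["services"]}
    else:
        # The signature that accompanies the MSISDN is omitted in this sketch.
        response = {"msisdn": profile["msisdn"]}

    # Step 4: valid requests are answered with HTTP 200 and the full response body.
    return 200, json.dumps(response)
```

Calling `handle_secure_entitlement("10.0.0.7", '{"action": "getEntitlement"}')` returns status 200 together with the service flags of the example subscriber.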

B. Simple Entitlement Architecture
In a simple entitlement architecture, the requests are sent in the form of an HTTP GET request with an empty body. The workflow scripts that support simple entitlement requests are configured as subscriber plan specific scripts. There are two possible response codes to a simple entitlement request:
• 200 OK - if the client is entitled to the requested service.
• Pre-configured 4xx response - in case the client is not entitled or an internal error occurs.
The entitlement server can then be configured to add, remove or modify entitlement services on a URL basis (entitlement URL) for both secure and simple entitlements. Each entitlement URL is associated with an entitlement protocol, an entitlement service, an LDAP attribute and an entitlement value. An example of simple entitlement is determining whether a subscriber is allowed to use tethering with their rate plan. Entitlement enforcement is the responsibility of each client.
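The association between an entitlement URL and its protocol, service, LDAP attribute and entitlement value can be sketched as a small lookup table. The URL path, the attribute name, the attribute value and the choice of 403 as the pre-configured 4xx code are assumptions made for illustration; the text only states that the answer is either 200 OK or a pre-configured 4xx response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitlementRule:
    protocol: str           # "simple" or "secure"
    service: str            # entitlement service name
    ldap_attribute: str     # attribute read from the subscriber profile
    entitlement_value: str  # value that grants the service


# Hypothetical configuration: one rule per entitlement URL.
RULES = {
    "/entitlement/tethering": EntitlementRule(
        protocol="simple",
        service="tethering",
        ldap_attribute="ratePlanTethering",
        entitlement_value="allowed",
    ),
}

NOT_ENTITLED_STATUS = 403  # assumed value of the pre-configured 4xx response


def simple_entitlement_status(url_path, subscriber_attributes):
    """Return the HTTP status code for a simple (HTTP GET) entitlement request."""
    rule = RULES.get(url_path)
    if rule is None:
        # Unknown entitlement URL: treated like an internal error in this sketch.
        return NOT_ENTITLED_STATUS
    if subscriber_attributes.get(rule.ldap_attribute) == rule.entitlement_value:
        return 200
    return NOT_ENTITLED_STATUS
```

Because enforcement is left to the client, the function only reports the entitlement status; it does not block any traffic itself.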
V. SSL CERTIFICATES
SSL (Secure Sockets Layer) is a cryptographic protocol that performs security-related functions and provides secure communication. SSL encrypts the data of network connections in the application layer of the OSI model and uses both symmetric keys (for the communication between client and server, e.g. AES or DES) and asymmetric keys (to authenticate and to exchange the symmetric key) [7]. There are three types of SSL certificates: extended, organization and domain validation. In a mobile network, network engineers need to be authorized on the servers in the network and the carrier's clients need to be authorized to access services [8]; therefore, SSL is also used for the connection to the maintenance interfaces of MSP.

A. MSA SSL Certificate Implementation
The MSA SSL certificate is installed on the database servers to allow a secure connection to the MSA GUI. The client (a network engineer) connects to MSA via HTTPS (to request a secure page). The database server sends its public key and certificate (generated with the keytool command and received with the signature from the Certification Authority) back to the client. The client checks whether the certificate was issued by a trusted Certificate Authority (CA), whether the certificate has a valid date and whether it is related to MSA. The client then uses this public key (which was sent from the database server together with the certificate) to encrypt a random symmetric encryption key and sends it to the server (together with the request for the web page). The server decrypts the symmetric encryption key using its private key, uses the symmetric key to decrypt the requested URL and sends the requested page (HTML) to the client.
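The three client-side checks named above (trusted CA, valid date, certificate related to the server) can be illustrated with Python's standard `ssl` module. This is a sketch of what a generic TLS client does, not of the browser or tool actually used to reach the MSA GUI; the function names and the default port are assumptions. The key exchange itself is carried out inside the library during the handshake and is not shown.

```python
import socket
import ssl


def make_client_context():
    """Build a client context that enforces the three certificate checks."""
    # create_default_context() loads the system's trusted root CAs, requires a
    # valid (non-expired, correctly signed) certificate and enables host name
    # matching against the certificate.
    return ssl.create_default_context()


def fetch_peer_certificate(host, port=443, timeout=5.0):
    """Connect to host and return the validated server certificate as a dict."""
    context = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The handshake raises ssl.SSLCertVerificationError if any check fails.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()
```

If the certificate is expired, issued by an untrusted CA or issued for a different host name, `wrap_socket` fails before any page request is sent.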
Based on information from [9], in order to receive the certificate for MSA and get it implemented, the VeriSign Managed PKI Server Certificate Registration Request is first sent via the Security Request Center (SRC). This is an internal process in any carrier's communication system.
When approved, the Certificate Signing Request (CSR file) is generated on the database server using the UNIX commands "keytool -genkey" (a new pair of private and public keys is generated) and "keytool -certreq" (generates the certificate request using the private key created in the previous step). In other words, once the public and private keys are generated, the certificate request is sent to the CA. The file request.cer is generated and uploaded to Authority Servers Managed PKI for SSL Subscriber Services. After that, the certificate (.cert file) together with the primary, secondary and root certificates is received back from the CA and implemented using the UNIX command "keytool -import".

B. SE SSL Certificate Implementation
Once the certificate request has been approved via the Security Request Center (following the same internal process as described above), an RSA (Rivest Shamir Adleman) private key is created using the "openssl genrsa -out file.key" command. The container file server.pem (which contains the public certificate) is used to sign the response to an entitlement request if the user is entitled to the services. The getPhoneNumber request signed with the public key from the secure entitlement server can be sent (after the certificate has been used to make sure that the key is valid), and the response sent from the secure entitlement server is decoded by mobile devices utilizing the already mentioned built-in function.
SHA1 is used as the signature algorithm, and the signature of the issued key (x509) in server.pem has to match the signature of the certificate (RSA) in the server.key file.

VI. ATTACKS AGAINST MSP
Mobile carriers need a good level of security to protect especially the nodes and systems used for billing and for the privacy of users' data. Without such security mechanisms, any technically skilled user would be able to manipulate the billing of their service profile or charge their data to another user's account. Another potential risk (when no security is implemented) is sniffing and modifying mobile network traffic in order to get access to users' private information such as their account credentials. Therefore, SSL and the other security algorithms utilized in mobile networks play a crucial role. However, even when security is implemented, there are several recognized types of attack against SSL:
• BEAST TLS attack [10],
• renegotiation attack [11],
• version rollback [12],
• POODLE attack [13],
• RC4 attack [14],
• Heartbleed [15].
Besides the SSL attacks listed above, there are other possible security issues related to HTTP traffic and its processing in the mobile network. One of them is the CA issue. Nowadays, there are too many CAs, and some of them could be compromised or corrupted so that they issue certificates even for addresses for which a certificate must not be requested (e.g. localhost), or a CA can even issue a fake certificate. Another well-known point of vulnerability is a carrier's own employee.
In order to avoid these security problems, there are several standard ways to improve security:
• Increase the length of the private key.
• Change the fundamental principles of the security system.
Currently, web browsers expect the server to send an SSL certificate, which the browser then validates against the set of root CAs integrated in the browser or operating system. Given this, different models can be implemented:
1) DNSSEC (Domain Name System Security Extensions) - mapping the public key onto DNS
2) Web of trust - everyone can generate their own PGP (Pretty Good Privacy) key
3) Perspective project - a new approach to help computers communicate securely on the Internet [?]

VII. CONCLUSION
In this article, the MSP element has been introduced as one of the key components of state-of-the-art mobile networks. The focus has been especially on its mechanisms for HTTP / HTTPS traffic processing and adaptation. The implemented security algorithms and the means to maintain the MSP have also been introduced. We have also discussed some potential security risks and offered solutions which decrease the possibility of a successful attack. Network traffic optimization is currently a highly discussed issue in 4G networks, and therefore the development of MSP and similar solutions is very active.
The key contribution of this paper lies in uncovering the internal procedures and mechanisms used in the core part of the cellular network in order to relieve the network load. Such information is usually protected by vendors and operators; however, to develop a high quality mobile service, it is important to understand the inner processes employed by the network nodes.

Figure 1 Tabular form for Parallel Array multiplication

Figure 5 Maximum clock frequency comparisons for RCA multiplier on different FPGA families

Figure 8 Dynamic Power dissipation comparisons for RCA multiplier on different FPGA families

Fig. 1. The architecture of the MPTCP protocol stack [6]
iperf -c 192.168.200.1 -t 100 -f M
This command performed a 100 seconds long test and printed the throughput in MB/s units. This is called the client side in iperf terminology. On the other side, the server was started with the following command line:
iperf -s
A file of 1 GiB size was downloaded using HTTP with the following command line:
wget -O /dev/null http://192.168.200.1/1GB

B. Investigation of the Reason of the IPv6 Performance Limit
1) Checking the CPU utilization: We measured the CPU utilization of the MPT software on both the client and the server during all 4 series of experiments; thus we got 2x4=8 graphs. The CPU usage of the MPT client and of the MPT server was practically the same.

Fig. 7. The throughput results of the iperf test of an IPv6 tunnel over IPv6 using 3GHz CPUs

Gábor Lencse received his MSc in electrical engineering and computer systems at the Technical University of Budapest in 1994, and his PhD in 2001. He has been working for the Department of Telecommunications, Széchenyi István University in Győr since 1997. He teaches Computer networks, Computer architectures, IP-based telecommunication systems and the Linux operating system. Now, he is an Associate Professor. He is responsible for the specialization of the information and communication technology of the BSc level electrical engineering education. He is a founding member of the Multidisciplinary Doctoral School of Engineering Sciences, Széchenyi István University. The area of his research includes discrete-event simulation methodology, performance analysis of computer networks and IPv6 transition technologies. Dr. Lencse has been working part time for the Department of Networked Systems and Services, Budapest University of Technology and Economics (the former Technical University of Budapest) since 2005. There he teaches Computer architectures and Computer networks.

Ákos Kovács received his MSc in electrical engineering with specialization in infocommunication systems and services at the Széchenyi István University in 2013. He started working as a laboratory engineer at the Department of Telecommunications in 2008. During this time he became familiar with high-end computer systems, virtualization and cloud computing. He is also highly skilled in the field of computer networks and networking security. He teaches computer networks and virtualization technology at BSc level and holds practical lessons at MSc level in the field of IP-based telecommunication.

Multi-Radio Mobile Device: Evaluation of Hybrid Node Between WiFi and LTE Networks
Pavel Masek, Krystof Zeman, Dalibor Uhlir, Jan Masek, Chris Bougiouklis, and Jiri Hosek
Abstract: With the ubiquitous wireless network coverage, Machine-Type Communications (MTC) is emerging to enable data transfers using devices / sensors without the need for human interaction. In this paper we introduce a comprehensive simulation scenario for the modeling and analysis of heterogeneous MTC. We demonstrate the most expected scenario of MTC communication using the IEEE 802.11 standard for direct communication between sensors and for transmitting data between an individual sensor and the Machine-Type Communication Gateway (MTCG). The MTCG represents the hybrid node serving as a bridge between two heterogeneous networks (WiFi and LTE). Following the idea of a hybrid node, two active interfaces must be implemented on this node together with a mechanism for handling the incoming traffic (from the WiFi network) to the LTE network. As a simulation tool, the Network Simulator 3 (NS-3) with the implemented LTE/EPC Network Simulator (LENA) framework was used. The major contribution of this paper therefore lies in the implementation of logic for the interconnection of two heterogeneous networks in the simulation environment NS-3.
Keywords: LTE, MTC Communication, MTCG, Network Simulator 3, WiFi.

TABLE 1 RESOURCE UTILIZATION FOR RCA MULTIPLIER ON SPARTAN-6

TABLE 2 RESOURCE UTILIZATION FOR CSA MULTIPLIER ON SPARTAN-6

TABLE 3
Figure 4 Resource utilization for BW multiplier on different FPGA families

TABLE 4 TIMING ANALYSIS FOR RCA MULTIPLIER ON SPARTAN-6

TABLE 5 TIMING ANALYSIS FOR CSA MULTIPLIER ON SPARTAN-6

TABLE 12
Figure 11 Energy analyses for RCA multiplier on Spartan-6 FPGA family
Figure 12 Energy analyses for CSA multiplier on Spartan-6 FPGA family
Figure 13 Energy analyses for BW multiplier on Spartan-6 FPGA family